digitalmars.D - [WORK] std.file.update function


Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

There are quite a few situations in rdmd and dmd generally when we 
compute a dependency structure over sets of files. Based on that, we 
write new files that overwrite old, obsoleted files. Those changes in 
turn trigger other dependencies to go stale so more building is done etc.

Simplest case is - source file is being changed, therefore a new object 
file is being produced, therefore a new executable is being produced. 
And it only gets more involved.

We've discussed before using a simple method to avoid unnecessary stale 
dependencies when it's possible that a certain file won't, in fact, 
change contents:

1. Do all work on the side in a separate file e.g. file.ext.tmp

2. Compare the new file with the old file file.ext

3. If they're identical, delete file.ext.tmp; otherwise, rename 
file.ext.tmp into file.ext
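
The three steps above can be sketched in D roughly as follows. This is only an illustration: `replaceIfChanged` is a hypothetical name, not an existing `std.file` function, and it assumes the new contents are already available as a byte slice.

```d
import std.file : exists, read, remove, rename, write;

/// Hypothetical helper (not part of std.file).
/// Returns true if `name` was replaced, false if it was left untouched.
bool replaceIfChanged(string name, const(ubyte)[] data)
{
    string tmp = name ~ ".tmp";
    write(tmp, data);                  // 1. do all work on the side

    // 2. compare the new file with the old one, byte for byte
    if (name.exists
        && cast(const(ubyte)[]) read(tmp) == cast(const(ubyte)[]) read(name))
    {
        remove(tmp);                   // 3a. identical: drop the temp file
        return false;
    }
    rename(tmp, name);                 // 3b. different: move it into place
    return true;
}
```

When the function returns false, file.ext keeps its old timestamp, so tools that track modification times should not treat it as stale.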

There is actually an even better way at the application level. Consider 
a function in std.file:

update(S, Range)(S name, Range data);

updateFile does something interesting: it opens the file "name" for 
reading AND writing, then reads data from the Range _and_ the file. For 
as long as the data and the contents in the file agree, it just moves 
reading along. At the first difference between the data and the file 
contents, it starts writing the data into the file, continuing through the 
end of the range.

So this makes zero writes (and leaves the "last modified time" intact) 
if the file has the same content as the data. Better yet, if it so 
happens that the file and the data have the same prefix, there's less 
writing going on, which IIRC is faster for most filesystems. Saving on 
writes happens to be particularly nice on new solid-state drives.
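
A minimal sketch of the idea described above, under some loudly stated assumptions: the name `updateFile` and its signature are hypothetical, the data is taken as an in-memory byte slice rather than a general range, and an existing file that is longer than the data is simply rewritten in full, because this sketch makes no attempt to truncate a file in place.

```d
import std.algorithm.comparison : min;
import std.file : exists, getSize, write;
import std.stdio : File;

/// Hypothetical sketch of the proposed function.
/// Returns the number of bytes actually written.
size_t updateFile(string name, const(ubyte)[] data)
{
    // Missing file, or file longer than the data: plain full write.
    if (!name.exists || getSize(name) > data.length)
    {
        write(name, data);
        return data.length;
    }

    auto f = File(name, "r+b");    // open for reading AND writing, no truncation
    ubyte[4096] buf;
    size_t pos = 0;                // length of the common prefix found so far
    while (pos < data.length)
    {
        auto got = f.rawRead(buf[0 .. min(buf.length, data.length - pos)]);
        if (got.length == 0) break;        // file ended: the rest gets appended
        size_t i = 0;
        while (i < got.length && got[i] == data[pos + i]) ++i;
        pos += i;
        if (i < got.length) break;         // first difference found
    }
    if (pos == data.length) return 0;      // identical: zero writes

    f.seek(pos);                   // reposition at the first differing byte
    f.rawWrite(data[pos .. $]);    // write from there through the end of the data
    return data.length - pos;
}
```

Calling it twice in a row with the same data should report zero bytes written the second time, which is the case that leaves the "last modified time" alone.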

Who wants to take this with testing, measurements etc? It's a cool mini 
project.


Andrei

Sep 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

Forgot to mention a situation here: if you change the source code of a 
module without influencing the object file (e.g. documentation, certain 
style changes, unittests in non-unittest builds etc) there'd be no 
linking upon rebuilding. -- Andrei

Sep 18 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

 Forgot to mention a situation here: if you change the source code of a
 module without influencing the object file (e.g. documentation, certain
 style changes, unittests in non-unittest builds etc) there'd be no
 linking upon rebuilding. -- Andrei

How does this compare against doing a checksum comparison on the file?

Sep 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

 Forgot to mention a situation here: if you change the source code of a
 module without influencing the object file (e.g. documentation, certain
 style changes, unittests in non-unittest builds etc) there'd be no
 linking upon rebuilding. -- Andrei

 How does this compare against doing a checksum comparison on the file?

Favorably :o). -- Andrei

Sep 18 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
 On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

 Forgot to mention a situation here: if you change the source code of a
 module without influencing the object file (e.g. documentation, certain
 style changes, unittests in non-unittest builds etc) there'd be no
 linking upon rebuilding. -- Andrei

 How does this compare against doing a checksum comparison on the file?

 Favorably :o). -- Andrei

Confirmed in doing the checksum myself.
However I have not compared against OS provided checksum.

Sep 18 2016

Chris Wright <dhasenan gmail.com> writes:

On Mon, 19 Sep 2016 04:24:41 +1200, rikki cattermole wrote:

 On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
 On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new
 object file is being produced, therefore a new executable is being
 produced.

 Forgot to mention a situation here: if you change the source code of
 a module without influencing the object file (e.g. documentation,
 certain style changes, unittests in non-unittest builds etc) there'd
 be no linking upon rebuilding. -- Andrei

 How does this compare against doing a checksum comparison on the file?

 Favorably :o). -- Andrei

 
 Confirmed in doing the checksum myself.
 However I have not compared against OS provided checksum.

You have an operating system that automatically checksums every file?

Sep 18 2016

R <rjmcguire gmail.com> writes:

On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright wrote:

 You have an operating system that automatically checksums every 
 file?

There are a few filesystems that keep checksums of blocks, but I 
don't see one that keeps file checksums.

Oct 18 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 18 October 2016 at 13:51:48 UTC, R wrote:
 On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright 
 wrote:

 You have an operating system that automatically checksums 
 every file?

 There are a few filesystems that keep checksums of blocks, but 
 I don't see one that keeps file checksums.

ZFS, btrfs. Whether the checksum is accessible is another story.

Oct 18 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

 Forgot to mention a situation here: if you change the source code of a module
 without influencing the object file (e.g. documentation, certain style changes,
 unittests in non-unittest builds etc) there'd be no linking upon rebuilding. --


The compiler currently creates the complete object file in a buffer, then writes 
the buffer to a file in one command. The reason is mostly because the object 
file format isn't incremental, the beginning is written last and the body gets 
backpatched as the compilation progresses.

I can't really see a compilation producing an object file where the first half 
of it matches the previous object file and the second half is different, because 
of the file format.

Interestingly, the win32 .lib format is designed for incredibly slow floppy 
disks, in that updating the library need not read/write every disk sector.

I'd love to design our own high speed formats, but then they'd be incompatible 
with everybody else's.

Sep 18 2016

Jacob Carlborg <doob me.com> writes:

On 2016-09-19 07:16, Walter Bright wrote:

 I'd love to design our own high speed formats, but then they'd be
 incompatible with everybody else's.

You already mentioned in another post [1] that the compiler could do 
the linking as well. In that case you would need to write some form of 
linker. Then I suggest to develop the linker as a library, supporting 
all formats DMD currently supports. The library can be used both 
directly from DMD and to build an external linker. When we have our own 
linker we could create our own format too without having to worry about 
compatibility.

I guess we need to create other tools for the new format as well, like 
object dumpers. But I assume that's a natural thing to do anyway.

Bundle that with something like musl libc and we will have our own 
complete tool chain. It would also be easier to add support for 
cross-compiling.

[1] http://forum.dlang.org/post/nrnsn7$1h3k$1 digitalmars.com

-- 
/Jacob Carlborg

Sep 18 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 19 September 2016 at 05:16:37 UTC, Walter Bright wrote:
 I'd love to design our own high speed formats, but then they'd 
 be incompatible with everybody else's.

I'd like that as well.

I recently had a look at the ELF and the COFF file formats; both 
are definitely in need of rework and dust-off :-)

There are some nice things we could do if we had certain features 
on every platform, wrt. linking and symbol-tables.

However, the maintenance burden is a bit heavy; we don't have 
enough manpower as it is.

Sep 18 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 9/18/2016 11:33 PM, Stefan Koch wrote:
 However the maintenance burden is a bit heavy we don't have enough menpower as
 it is.

A major part of the problem (that working with Optlink has made painfully clear) 
is that although linking is conceptually a rather trivial task, the people 
who've designed the file formats have an unending love of making trivial things 
exceedingly complicated. Furthermore, the weird things about the format are 98% 
undocumented lore.

DMD still has problems generating "correct" Dwarf debug info because its 
correctness is not defined by the spec, but by lore and the idiosyncratic way 
that gcc emits it.

Doing a linker inside DMD means that object files imported from other C/C++ 
compilers have to be correctly interpreted. I could do it, but I couldn't do 
that and continue to work on D.

Sep 18 2016

ketmar <ketmar ketmar.no-ip.org> writes:

On Monday, 19 September 2016 at 06:53:47 UTC, Walter Bright wrote:
 Doing a linker inside DMD means that object files imported from 
 other C/C++ compilers have to be correctly interpreted. I could 
 do it, but I couldn't do that and continue to work on D.

yeah. there is a reason for the absence of 100500 hobbyist FOSS 
linkers. ;-) contrary to what it may look like, correct linking 
is a really hard task. and mostly not fun to write too. people 
usually try, and then just silently return to binutils. ;-)

Sep 19 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/19/2016 01:16 AM, Walter Bright wrote:
 On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.

 Forgot to mention a situation here: if you change the source code of a
 module without influencing the object file (e.g. documentation, certain
 style changes, unittests in non-unittest builds etc) there'd be no
 linking upon rebuilding. --


 The compiler currently creates the complete object file in a buffer,
 then writes the buffer to a file in one command. The reason is mostly
 because the object file format isn't incremental, the beginning is
 written last and the body gets backpatched as the compilation progresses.

Great. In that case, if the target .o file already exists, it should be 
compared against the buffer. If identical, there should be no write and 
the timestamp of the .o file should stay the same.

I need to re-emphasize this kind of stuff is important for tooling. Many 
files get recompiled to identical object files - e.g. the many innocent 
bystanders in a dense dependency structure when one module changes. We 
also embed documentation in source files. Being disciplined about 
reflecting actual changes in the actual file operations is very helpful 
for tools that track file writes and/or timestamps.

 I can't really see a compilation producing an object file where the
 first half of it matches the previous object file and the second half is
 different, because of the file format.

Interesting. What happens e.g. if one makes a change to a function whose 
generated code is somewhere in the middle of the object file? If it 
doesn't alter the call graph, doesn't the new .o file share a common 
prefix with the old one?

 Interestingly, the win32 .lib format is designed for incredibly slow
 floppy disks, in that updating the library need not read/write every
 disk sector.

 I'd love to design our own high speed formats, but then they'd be
 incompatible with everybody else's.

This (and the subsequent considerations) is drifting off-topic. This is 
about getting a useful function off the ground, and sadly is 
degenerating into yet another off-topic debate leading to no progress.


Andrei

Sep 19 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 19 September 2016 at 14:04:03 UTC, Andrei Alexandrescu 
wrote:
 Interesting. What happens e.g. if one makes a change to a 
 function whose generated code is somewhere in the middle of the 
 object file? If it doesn't alter the call graph, doesn't the 
 new .o file share a common prefix with the old one?

Only if the TOC is unchanged.
There are a lot of common sections in the same order but with 
different offsets.
We would need some binary patching method.

But I am unaware of file systems supporting this.
Microsoft's incremental linking mechanism makes use of thunks so it 
can avoid changing the header, IIRC.

But all of this needs codegen to adapt.

Sep 19 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 9/19/2016 7:04 AM, Andrei Alexandrescu wrote:
 On 09/19/2016 01:16 AM, Walter Bright wrote:
 The compiler currently creates the complete object file in a buffer,
 then writes the buffer to a file in one command. The reason is mostly
 because the object file format isn't incremental, the beginning is
 written last and the body gets backpatched as the compilation progresses.

 Great. In that case, if the target .o file already exists, it should be compared
 against the buffer. If identical, there should be no write and the timestamp of
 the .o file should stay the same.

That's right. I was just referring to the idea of incrementally writing and 
comparing, which is a great idea for sequential file writing but likely won't 
work for the object file case. I think it is distinct enough to merit a separate 
library function. Note that we already have:



Adding another "writeIfDifferent()" function would be a good thing. The range 
based incremental one should go into std.stdio.
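
A minimal sketch of what such a `writeIfDifferent()` could look like for the compiler's case, where the whole object file already sits in one buffer. The signature and the bool return value are assumptions made for illustration; nothing here is an existing Phobos API beyond the `std.file` calls it uses.

```d
import std.file : exists, getSize, read, write;

/// Hypothetical whole-buffer variant.
/// Returns true if the file was (re)written, false if it was left alone.
bool writeIfDifferent(string name, const(ubyte)[] buffer)
{
    // Cheap size check first, so a mismatch usually costs no read at all.
    if (name.exists
        && getSize(name) == buffer.length
        && cast(const(ubyte)[]) read(name) == buffer)
        return false;              // identical: no write, timestamp unchanged

    write(name, buffer);           // missing or different: one full write
    return true;
}
```

Unlike the incremental approach, this reads the old file in full, which fits the object file case where a shared prefix is unlikely anyway.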

Any case where writing is much more costly than reading (such as SSD drives you 
mentioned, and the new Seagate "archival" drives), would make your technique a 
good one. It works even for memory; I've used it in code to reduce swapping, as in:

     if (*p != newvalue) *p = newvalue;

 I need to re-emphasize this kind of stuff is important for tooling. Many files
 get recompiled to identical object files - e.g. the many innocent bystanders in
 a dense dependency structure when one module changes. We also embed
 documentation in source files. Being disciplined about reflecting actual changes
 in the actual file operations is very helpful for tools that track file writes
 and/or timestamps.

That's right.


 I can't really see a compilation producing an object file where the
 first half of it matches the previous object file and the second half is
 different, because of the file format.

 Interesting. What happens e.g. if one makes a change to a function whose
 generated code is somewhere in the middle of the object file? If it doesn't
 alter the call graph, doesn't the new .o file share a common prefix with the
 old one?

Two things:

1. The object file starts out with a header that contains file offsets to the 
various tables and sections. Changing the size of any of the pieces in the file 
changes the header, and will likely require moving pieces around to make room.

2. Writing an object file can mean "backpatching" what was written earlier, as a 
declaration one assumed was external turns out to be internal.
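
Point 1 can be illustrated with a toy layout. To be clear about the assumptions: this is NOT ELF, COFF or any real object format, just a header of 32-bit file offsets (one per section) followed by the section bodies, and `toyObject` is a made-up helper.

```d
import std.algorithm.searching : commonPrefix;
import std.bitmanip : nativeToLittleEndian;

/// Toy "object file": a header with one 32-bit offset per section,
/// followed by the section bodies in order.
ubyte[] toyObject(ubyte[][] sections)
{
    ubyte[] header, payload;
    // The first body starts right after the header.
    uint offset = cast(uint)(sections.length * uint.sizeof);
    foreach (s; sections)
    {
        header ~= nativeToLittleEndian(offset)[];
        payload ~= s;
        offset += cast(uint) s.length;
    }
    return header ~ payload;
}

/// Number of leading bytes two files have in common.
size_t sharedPrefix(ubyte[] a, ubyte[] b)
{
    return commonPrefix(a, b).length;
}
```

With three sections, growing the middle one by a single byte shifts the third section's offset, so the two files already diverge inside the header, before any section body is reached. The unchanged first section does not help the prefix at all.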

Sep 19 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Sunday, 18 September 2016 at 15:17:31 UTC, Andrei Alexandrescu 
wrote:
 There are quite a few situations in rdmd and dmd generally when 
 we compute a dependency structure over sets of files. Based on 
 that, we write new files that overwrite old, obsoleted files. 
 Those changes in turn trigger other dependencies to go stale so 
 more building is done etc.

If so we need it in druntime.

Introducing Phobos into ddmd is still considered a no-no.

Personally I am pretty torn: without range-specific optimizations 
in dmd, they do incur more overhead than they should.

Sep 18 2016

Chris Wright <dhasenan gmail.com> writes:

This will produce different behavior with hard links. With hard links, 
the temporary file mechanism you mention will result in the old file 
being accessible via the other path. With your recommended strategy, the 
data accessible from both paths is updated.

That's probably acceptable, and hard links aren't used that much anyway.
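
The hard-link difference can be observed with a small POSIX-only experiment. This is a sketch under assumptions: `viaHardLink` is a hypothetical helper, and it relies on `link()` from `core.sys.posix.unistd`, so it will not build on Windows.

```d
import core.sys.posix.unistd : link;
import std.exception : enforce;
import std.file : exists, readText, remove, rename, write;
import std.stdio : File;
import std.string : toStringz;

/// Creates `path` plus a hard link `other`, updates `path` with one of the
/// two strategies, and returns what is then readable through `other`.
string viaHardLink(string path, string other, bool inPlace)
{
    foreach (p; [path, other])
        if (p.exists) remove(p);
    write(path, "old");
    // `other` becomes a second name for the same underlying file.
    enforce(link(path.toStringz, other.toStringz) == 0, "link() failed");

    if (inPlace)
    {
        auto f = File(path, "r+b");    // rewrite the existing file's data
        f.rawWrite("new");
    }
    else
    {
        write(path ~ ".tmp", "new");   // a brand-new file...
        rename(path ~ ".tmp", path);   // ...takes over the name `path` only
    }
    return readText(other);
}
```

With the temporary-file strategy the second path should still return the old contents, while the in-place strategy updates what both paths see.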

Obviously, if you have to overwrite large portions of the file, it's 
going to be faster to just write it. This is just for cases when you can 
get speedups down the line by not updating write timestamps, or when you 
know a large portion of the file is unchanged and the file is cached, or 
you're using a disk that sucks at writing data.

Sep 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/18/16 12:15 PM, Chris Wright wrote:
 This will produce different behavior with hard links. With hard links,
 the temporary file mechanism you mention will result in the old file
 being accessible via the other path. With your recommended strategy, the
 data accessible from both paths is updated.

 That's probably acceptable, and hard links aren't used that much anyway.

Awesome, this should be part of the docs.

 Obviously, if you have to overwrite large portions of the file, it's
 going to be faster to just write it. This is just for cases when you can
 get speedups down the line by not updating write timestamps, or when you
 know a large portion of the file is unchanged and the file is cached, or
 you're using a disk that sucks at writing data.

That's exactly right, and such considerations should also go in the 
function documentation. Wanna go for it?


Andrei

Sep 18 2016

Brad Roberts via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 9/18/2016 8:17 AM, Andrei Alexandrescu via Digitalmars-d wrote:
 There is actually an even better way at the application level. Consider
 a function in std.file:

 update(S, Range)(S name, Range data);

 updateFile does something interesting: it opens the file "name" for
 reading AND writing, then reads data from the Range _and_ the file. For
 as long as the data and the contents in the file agree, it just moves
 reading along. At the first difference between the data and the file
 contents, starts writing the data into the file through the end of the
 range.

 So this makes zero writes (and leaves the "last modified time" intact)
 if the file has the same content as the data. Better yet, if it so
 happens that the file and the data have the same prefix, there's less
 writing going on, which IIRC is faster for most filesystems. Saving on
 writes happens to be particularly nice on new solid-state drives.

 Who wants to take this with testing, measurements etc? It's a cool mini
 project.


 Andrei

This is nice in the case of no changes, but problematic in the case of 
some changes.  The standard write new, rename technique never has either 
file in a half-right state.  The file is atomically either old or new 
and nothing in between.  This can be critical.

Sep 18 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 9/18/2016 7:05 PM, Brad Roberts via Digitalmars-d wrote:
 This is nice in the case of no changes, but problematic in the case of some
 changes.  The standard write new, rename technique never has either file in a
 half-right state.  The file is atomically either old or new and nothing in
 between.  This can be critical.

As for compilation, I bet considerable speed increases could be had by never 
writing object files at all. (Not only does it save the read/write file time, 
but it saves the encoding into the object file format and decoding of that 
format.) Have the compiler do the linking directly.

dmd already does this for generating library files directly, and it's been very 
successful (although sometimes I suspect nobody has noticed(!) which is actually 
a good thing). It took surprisingly little code to make that work, though doing 
a link step would be far more work.

Sep 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/18/2016 10:05 PM, Brad Roberts via Digitalmars-d wrote:
 This is nice in the case of no changes, but problematic in the case of
 some changes.  The standard write new, rename technique never has either
 file in a half-right state.  The file is atomically either old or new
 and nothing in between.  This can be critical.

Good point, should be also part of the doco or a flag with update (e.g.
Yes.atomic). Alternative: the caller may wish to rename the file prior 
to the operation and then rename it back after the operation. -- Andrei
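
One way the flag idea could look, as a rough sketch only. The signature is hypothetical; `Flag!"atomic"` is the standard `std.typecons` idiom, but nothing like this exists in `std.file`. The atomic path here goes through a temporary file and `rename`, which is my reading of the "write new, rename" technique rather than the rename-and-rename-back alternative.

```d
import std.file : exists, getSize, read, rename, write;
import std.typecons : Flag, No, Yes;

/// Hypothetical signature; returns true if the file was modified.
bool update(string name, const(ubyte)[] data,
            Flag!"atomic" atomic = No.atomic)
{
    if (name.exists
        && getSize(name) == data.length
        && cast(const(ubyte)[]) read(name) == data)
        return false;              // unchanged: no write, timestamp intact

    if (atomic)
    {
        string tmp = name ~ ".tmp";
        write(tmp, data);          // readers still see the complete old file
        rename(tmp, name);         // then the complete new file replaces it
    }
    else
    {
        write(name, data);         // simpler, but readers may see a partial file
    }
    return true;
}
```

A caller that cannot tolerate a half-written file would pass `Yes.atomic`; one that only cares about timestamps and write volume could keep the default.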

Sep 19 2016

Walter Bright <newshound2 digitalmars.com> writes:

One way to implement it is to open the existing file as a memory-mapped file. 
Memory-mapped files only get paged into memory as the memory is referenced. So 
if you did a memcmp(oldfile, newfile, size), it will stop once the first 
difference is found, and the rest of the file is never read.

Also, only the changed pages of the memory-mapped file have to be written. On 
large files, this could be a big savings.
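
The comparison half of this can be sketched with Phobos' `std.mmfile`. Assumptions: `sameContents` is a hypothetical helper, the new contents are an in-memory slice, and D's array equality stands in for the `memcmp` call.

```d
import std.file : exists, getSize;
import std.mmfile : MmFile;

/// Hypothetical helper: true if the file at `name` holds exactly `data`.
bool sameContents(string name, const(ubyte)[] data)
{
    if (!name.exists || getSize(name) != data.length)
        return false;
    if (data.length == 0)
        return true;                   // both empty, nothing to map
    auto mm = new MmFile(name);        // read-only mapping of the existing file
    scope (exit) destroy(mm);          // unmap promptly instead of waiting for the GC
    // Pages are only faulted in as the comparison touches them.
    return cast(const(ubyte)[]) mm[] == data;
}
```

Since the comparison can stop at the first mismatch, pages past that point need not be read from disk.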

Sep 19 2016
