
digitalmars.D - std.file.read implementation contest

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Someone mentioned an old bug in std.file.read here:

http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

Two programmers sent in patches for the function. Which is to be 
committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Andrei


// Implementation 1
void[] read(string name)
{
     invariant fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     void[] buf;
     auto size = statbuf.st_size;
     if (size == 0)
     {	/* The size could be 0 if the file is a device or a procFS file,
	 * so we just have to try reading it.
	 */
	int readsize = 1024;
	while (1)
	{
	    buf = GC.realloc(buf.ptr, size + readsize,
		GC.BlkAttr.NO_SCAN)[0 .. cast(int)size + readsize];
	    enforce(buf, "Out of memory");
	    scope(failure) delete buf;

	    auto toread = readsize;
	    while (toread)
	    {
		auto numread = std.c.linux.linux.read(fd, buf.ptr + size, toread);
		cenforce(numread != -1, name);
		size += numread;
		if (numread == 0)
		{   if (size == 0)			// it really was 0 size
			delete buf;			// don't need the buffer
		    return buf[0 .. size];		// end of file
		}
		toread -= numread;
	    }
	}
     }
     else
     {
	buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
	enforce(buf, "Out of memory");
	scope(failure) delete buf;

	cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);

	return buf[0 .. size];
     }
}

// Implementation 2
void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != actual.max, name);
         size += actual;
         if (size < result.length) break;
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
}
Feb 16 2009
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be committed 
 and why? (Linux versions shown. Apologies for noisy line breaks.)

Implementation 2, save 1 bug:
 Andrei


 // Implementation 1

     else
     {
 buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
 enforce(buf, "Out of memory");
 scope(failure) delete buf;

 cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);

 return buf[0 .. size];
     }
 }

This doesn't work for /sys files, which consistently claim to be 4096 bytes even though they can be smaller or larger (I encountered this with Tango also). I think you have to kill the check that the size read actually equals the size from stat, AND you have to continue reading even after you reach that size.
 // Implementation 2
 void[] read(string name)
 {
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;

Excellent idea, this allocates the minimum amount of space to tell if a file is larger than the file size returned in fstat. This should handle the /sys case and /proc case properly.
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != actual.max, name);
         size += actual;
         if (size < result.length) break;

Only bug: you need to check whether actual > 0. A driver is allowed to return less data than requested, even if there is more to read; you must keep reading until read returns 0. In this case you need a loop to do the reading. Yes, the disk driver never returns less than requested unless EOF is reached, but /proc files can be backed by any driver, which might do something like that.
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
 }

I would rewrite:

void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     auto allocSize = statbuf.st_size ? statbuf.st_size + 1 : 1024;
     void[] result;
     scope(failure) delete result;
     size_t size = 0;

     outer: while (true)
     {
         result = GC.realloc(result.ptr, allocSize, GC.BlkAttr.NO_SCAN)
             [0 .. allocSize];
         while (size < allocSize)
         {
             immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                     allocSize - size);
             if (actual == 0) break outer;
             cenforce(actual != actual.max, name);
             size += actual;
         }
         allocSize = size + 1024 * 4;
     }

     return result[0 .. size];
}
Feb 16 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be committed 
 and why? (Linux versions shown. Apologies for noisy line breaks.)

Implementation 2, save 1 bug:

Aha, cool. Thanks for the info. I've adapted the code to still only use one loop:

void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size
         ? (statbuf.st_size + 16) & 15
         : 1024;
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != -1, name);
         if (actual == 0) break;
         size += actual;
         if (size < result.length) continue;
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
}

One more improvement suggested by Walter - I allocate a multiple of 16 because that's the allocator's granularity.

Andrei
Feb 16 2009
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Andrei Alexandrescu" wrote
 Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

Implementation 2, save 1 bug:

 Aha, cool. Thanks for the info. I've adapted the code to still only use 
 one loop:

 void[] read(string name)
 {
      immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
      cenforce(fd != -1, name);
      scope(exit) std.c.linux.linux.close(fd);

      struct_stat statbuf = void;
      cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

      immutable initialAlloc = statbuf.st_size
          ? (statbuf.st_size + 16) & 15
          : 1024;

Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think 
you meant:

(statbuf.st_size + 16) & ~15

Two more points:

If you allocate a size of 16, I think you'll actually get 32 bytes because 
of the sentinel byte. Somehow you should account for that, or automatically 
double the allocation size each time (or at least request a page more than 
you want).

And the increments are not in 16, they are in powers of 2 *starting* with 
16. For example, if you allocate a size of 80 bytes (16 * 5), you will 
actually get 128 bytes.

Other than that, it looks good.

-Steve
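The masking slip is easy to demonstrate in isolation. In a C sketch (the helper name is made up for illustration), `(x + 16) & ~15` rounds x + 1 up to the allocator's 16-byte granularity, while the posted `(x + 16) & 15` keeps only the low four bits - a value of at most 15:

```c
/* Round a stat size up to the allocator's 16-byte granularity,
 * guaranteeing room for at least st_size + 1 bytes.
 * ~15UL clears the low four bits, i.e. rounds down to a multiple
 * of 16 after the +16 has pushed us past st_size + 1. */
static unsigned long round_up16(unsigned long st_size)
{
    return (st_size + 16) & ~15UL;
}
```

With the intended mask, round_up16(1) is 16 and round_up16(16) is 32; with `& 15` the same inputs yield 1 and 0, which is why Steven flags the bug.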
Feb 16 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)


 Aha, cool. Thanks for the info. I've adapted the code to still only use 
 one loop:

 void[] read(string name)
 {
      immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
      cenforce(fd != -1, name);
      scope(exit) std.c.linux.linux.close(fd);

      struct_stat statbuf = void;
      cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

      immutable initialAlloc = statbuf.st_size
          ? (statbuf.st_size + 16) & 15
          : 1024;

Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think you meant: (statbuf.st_size + 16) & ~15

Stupid!!!
 Two more points:
 
 if you allocate a size of 16, I think you'll actually get 32 bytes because 
 of the sentinel byte.  Somehow you should account for that, or you 
 automatically double the allocation size each time (or at least a page more 
 than you want).
 
 And the increments are not in 16, they are in powers of 2 *starting* with 
 16.  For example, if you allocate a size of 80 bytes (16 * 5), you will 
 actually get 128 bytes.
 
 Other than that, it looks good.

Guess I'll revert to +1. The effect is likely to be negligible anyway. Thanks! Andrei
Feb 16 2009
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:
 
 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 
 
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read. Sean
Feb 16 2009
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity. -Steve
Feb 16 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity.

I totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it", not "if the size according to fstat is the same as the actual file size, /proc stuff and concurrent access against the file notwithstanding, reads exactly that many bytes in a buffer and returns it. Otherwise, leaves you scratching your head." And yes, using fstat is just a little irrelevant implementation trick, not something that conditions the workings of read. Andrei
Feb 16 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 16 2009
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

I think that's reasonable and will document it as such. Nice example with /dev/random though :o). Andrei
Feb 16 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

I thought more about this and a nice solution is to have std.file.read take an optional "up to" parameter: void[] read(in char[] filename, size_t upTo = size_t.max); Then you can get a random int with: int rnd = (cast(int[]) read("/dev/random", 4))[0]; Yum. Andrei
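The proposed behavior is straightforward to sketch in C (names are hypothetical, not the eventual Phobos API): the loop stops either at end-of-file or once upTo bytes have arrived, which is what makes unbounded files like /dev/random safe to touch.

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative sketch of the proposed "up to" semantics:
 * read at most up_to bytes from path into buf, stopping early at EOF.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_up_to(const char *path, void *buf, size_t up_to)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    size_t size = 0;
    while (size < up_to) {
        ssize_t n = read(fd, (char *)buf + size, up_to - size);
        if (n == -1) { close(fd); return -1; }
        if (n == 0) break;       /* EOF before the cap was reached */
        size += (size_t)n;       /* short read: keep going up to the cap */
    }
    close(fd);
    return (ssize_t)size;
}
```

Andrei's random-int example then becomes a four-byte bounded read that terminates even though the device never signals EOF.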
Feb 17 2009
parent bearophile <bearophileHUGS lycos.com> writes:
Regarding this really useful part of the std lib, I warmly suggest also
benchmarking the final code against a similar operation done in Perl or Python 2.x.
Such tests very often reveal performance bugs/problems.

Python code for the test is really simple:

file("filename.txt").read()

Bye,
bearophile
Feb 17 2009
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical.

I wasn't saying that at all. But try using this approach on /dev/random, for example. One of the great things about *nix is that _anything_ can be represented as a file and therefore operated on via the same scripting techniques, however, I think it's an open issue whether this behavior lies within or outside the design of std.read. I do think one could argue that std.read should make a best effort and read from such files up to some imposed maximum byte count, but this still seems like stretching its intended purpose to me. Sean
Feb 16 2009
prev sibling next sibling parent Steve Schveighoffer <schveiguy yahoo.com> writes:
On Mon, 16 Feb 2009 16:36:52 -0800, Sean Kelly wrote:

 Andrei Alexandrescu wrote:
 
 I totally agree. The useful spec of std.file.read should be "reads the
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

Then why doesn't cp have a warning about copying /dev/random or /dev/zero? I think it's enough to specify what it does (read the entire file) and trust that people understand what they are doing when they attempt to read an unbounded stream. The warning belongs with the stream, not with the simple utility. It probably only takes screwing it up once for the developer to learn ;) But I guess it doesn't hurt...

-Steve
Feb 16 2009
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 17 Feb 2009 21:08:15 +0300, bearophile <bearophileHUGS lycos.com> wrote:

 Regarding this really useful part of the std lib, I warmly suggest to  
 also benchmark the final code against a similar operation done in Perl  
 or Python2.x. Such tests very often show performance bugs/problems.

 Python code for the test is really simple:

 file("filename.txt").read()

 Bye,
 bearophile

How does it work with /dev/random etc?
Feb 17 2009