
digitalmars.D - std.file.read implementation contest

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Someone mentioned an old bug in std.file.read here:

http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

Two programmers sent in patches for the function. Which is to be 
committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Andrei


// Implementation 1
void[] read(string name)
{
     invariant fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     void[] buf;
     auto size = statbuf.st_size;
     if (size == 0)
     {	/* The size could be 0 if the file is a device or a procFS file,
	 * so we just have to try reading it.
	 */
	int readsize = 1024;
	while (1)
	{
	    buf = GC.realloc(buf.ptr, size + readsize,
		GC.BlkAttr.NO_SCAN)[0 .. cast(int)size + readsize];
	    enforce(buf, "Out of memory");
	    scope(failure) delete buf;

	    auto toread = readsize;
	    while (toread)
	    {
		auto numread = std.c.linux.linux.read(fd, buf.ptr + size, toread);
		cenforce(numread != -1, name);
		size += numread;
		if (numread == 0)
		{   if (size == 0)			// it really was 0 size
			delete buf;			// don't need the buffer
		    return buf[0 .. size];		// end of file
		}
		toread -= numread;
	    }
	}
     }
     else
     {
	buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
	enforce(buf, "Out of memory");
	scope(failure) delete buf;

	cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);

	return buf[0 .. size];
     }
}

// Implementation 2
void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != actual.max, name);
         size += actual;
         if (size < result.length) break;
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
}
Feb 16 2009
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be committed 
 and why? (Linux versions shown. Apologies for noisy line breaks.)

Implementation 2, save 1 bug:
 Andrei


 // Implementation 1

     else
     {
 buf = GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
 enforce(buf, "Out of memory");
 scope(failure) delete buf;

 cenforce(std.c.linux.linux.read(fd, buf.ptr, size) == size, name);

 return buf[0 .. size];
     }
 }

This doesn't work for /sys files, which consistently claim to be 4096 bytes even though they can be smaller or larger (I encountered this with Tango also). I think you have to kill the check that the size read actually equals the size from stat, AND you have to continue reading even after you reach that size.
 // Implementation 2
 void[] read(string name)
 {
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size ? statbuf.st_size + 1 : 1024;

Excellent idea, this allocates the minimum amount of space to tell if a file is larger than the file size returned in fstat. This should handle the /sys case and /proc case properly.
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != actual.max, name);
         size += actual;
         if (size < result.length) break;

Only bug: you need to check whether actual > 0. A driver is allowed to return less data than requested, even if there is more to read; you must keep reading until read returns 0. In this case you need a loop to do the reading. Yes, the disk driver never returns less than requested unless EOF is reached, but /proc files can be backed by any driver, which might do something like that.
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
 }

I would rewrite:

void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     auto allocSize = statbuf.st_size ? statbuf.st_size + 1 : 1024;
     void[] result;
     scope(failure) delete result;
     size_t size = 0;

     outer: while (true)
     {
         result = GC.realloc(result.ptr, allocSize, GC.BlkAttr.NO_SCAN)
             [0 .. allocSize];
         while (size < allocSize)
         {
             immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                     allocSize - size);
             if (actual == 0) break outer;
             cenforce(actual != actual.max, name);
             size += actual;
         }
         allocSize = size + 1024 * 4;
     }

     return result[0 .. size];
}
Feb 16 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be committed 
 and why? (Linux versions shown. Apologies for noisy line breaks.)

Implementation 2, save 1 bug:

Aha, cool. Thanks for the info. I've adapted the code to still only use one loop:

void[] read(string name)
{
     immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
     cenforce(fd != -1, name);
     scope(exit) std.c.linux.linux.close(fd);

     struct_stat statbuf = void;
     cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

     immutable initialAlloc = statbuf.st_size
         ? (statbuf.st_size + 16) & 15
         : 1024;
     void[] result = GC.malloc(initialAlloc, GC.BlkAttr.NO_SCAN)
         [0 .. initialAlloc];
     scope(failure) delete result;
     size_t size = 0;

     for (;;)
     {
         immutable actual = std.c.linux.linux.read(fd, result.ptr + size,
                 result.length - size);
         cenforce(actual != -1, name);
         if (actual == 0) break;
         size += actual;
         if (size < result.length) continue;
         auto newAlloc = size + 1024 * 4;
         result = GC.realloc(result.ptr, newAlloc, GC.BlkAttr.NO_SCAN)
             [0 .. newAlloc];
     }

     return result[0 .. size];
}

One more improvement suggested by Walter - I allocate a multiple of 16 because that's the allocator's granularity.

Andrei
Feb 16 2009
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Andrei Alexandrescu" wrote
 Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

Implementation 2, save 1 bug:

 Aha, cool. Thanks for the info. I've adapted the code to still only use 
 one loop:

 void[] read(string name)
 {
      immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
      cenforce(fd != -1, name);
      scope(exit) std.c.linux.linux.close(fd);

      struct_stat statbuf = void;
      cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

      immutable initialAlloc = statbuf.st_size
          ? (statbuf.st_size + 16) & 15
          : 1024;

Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think 
you meant:

(statbuf.st_size + 16) & ~15

Two more points:

If you allocate a size of 16, I think you'll actually get 32 bytes because 
of the sentinel byte. Somehow you should account for that, or automatically 
double the allocation size each time (or at least request a page more than 
you want).

And the increments are not in 16, they are in powers of 2 *starting* with 
16. For example, if you allocate a size of 80 bytes (16 * 5), you will 
actually get 128 bytes.

Other than that, it looks good.

-Steve
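The masking slip is easy to demonstrate in isolation. In a C sketch (the helper name is made up for illustration), `(x + 16) & ~15` rounds x + 1 up to the allocator's 16-byte granularity, while the posted `(x + 16) & 15` keeps only the low four bits - a value of at most 15:

```c
/* Round a stat size up to the allocator's 16-byte granularity,
 * guaranteeing room for at least st_size + 1 bytes.
 * ~15UL clears the low four bits, i.e. rounds down to a multiple
 * of 16 after the +16 has pushed us past st_size + 1. */
static unsigned long round_up16(unsigned long st_size)
{
    return (st_size + 16) & ~15UL;
}
```

With the intended mask, round_up16(1) is 16 and round_up16(16) is 32; with `& 15` the same inputs yield 1 and 0, which is why Steven flags the bug.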
Feb 16 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/

 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)


 Aha, cool. Thanks for the info. I've adapted the code to still only use 
 one loop:

 void[] read(string name)
 {
      immutable fd = std.c.linux.linux.open(toStringz(name), O_RDONLY);
      cenforce(fd != -1, name);
      scope(exit) std.c.linux.linux.close(fd);

      struct_stat statbuf = void;
      cenforce(std.c.linux.linux.fstat(fd, &statbuf) == 0, name);

      immutable initialAlloc = statbuf.st_size
          ? (statbuf.st_size + 16) & 15
          : 1024;

Hm... won't this allocate only 15 bytes max if st_size is nonzero? I think you meant: (statbuf.st_size + 16) & ~15

Stupid!!!
 Two more points:
 
 if you allocate a size of 16, I think you'll actually get 32 bytes because 
 of the sentinel byte.  Somehow you should account for that, or you 
 automatically double the allocation size each time (or at least a page more 
 than you want).
 
 And the increments are not in 16, they are in powers of 2 *starting* with 
 16.  For example, if you allocate a size of 80 bytes (16 * 5), you will 
 actually get 128 bytes.
 
 Other than that, it looks good.

Guess I'll revert to +1. The effect is likely to be negligible anyway. Thanks! Andrei
Feb 16 2009
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:
 
 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 
 
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line breaks.)

Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read. Sean
Feb 16 2009
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

Neither one. std.read is intended to read the contents of a file in bulk into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity. -Steve
Feb 16 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical. I view the usage of fstat to get the file size as being an optimization, not a validation of the OS' sanity.

I totally agree. The useful spec of std.file.read should be "reads the file to exhaustion in a buffer and returns it", not "if the size according to fstat is the same as the actual file size, /proc stuff and concurrent access against the file notwithstanding, reads exactly that many bytes in a buffer and returns it. Otherwise, leaves you scratching your head." And yes, using fstat is just a little irrelevant implementation trick, not something that conditions the workings of read. Andrei
Feb 16 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.
Feb 16 2009
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

I think that's reasonable and will document it as such. Nice example with /dev/random though :o). Andrei
Feb 16 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 I totally agree. The useful spec of std.file.read should be "reads the 
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

I thought more about this and a nice solution is to have std.file.read take an optional "up to" parameter: void[] read(in char[] filename, size_t upTo = size_t.max); Then you can get a random int with: int rnd = (cast(int[]) read("/dev/random", 4))[0]; Yum. Andrei
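The proposed behavior is straightforward to sketch in C (names are hypothetical, not the eventual Phobos API): the loop stops either at end-of-file or once upTo bytes have arrived, which is what makes unbounded files like /dev/random safe to touch.

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative sketch of the proposed "up to" semantics:
 * read at most up_to bytes from path into buf, stopping early at EOF.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_up_to(const char *path, void *buf, size_t up_to)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    size_t size = 0;
    while (size < up_to) {
        ssize_t n = read(fd, (char *)buf + size, up_to - size);
        if (n == -1) { close(fd); return -1; }
        if (n == 0) break;       /* EOF before the cap was reached */
        size += (size_t)n;       /* short read: keep going up to the cap */
    }
    close(fd);
    return (ssize_t)size;
}
```

Andrei's random-int example then becomes a four-byte bounded read that terminates even though the device never signals EOF.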
Feb 17 2009
parent bearophile <bearophileHUGS lycos.com> writes:
Regarding this really useful part of the std lib, I warmly suggest also
benchmarking the final code against a similar operation done in Perl or Python 2.x.
Such tests very often reveal performance bugs/problems.

Python code for the test is really simple:

file("filename.txt").read()

Bye,
bearophile
Feb 17 2009
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Andrei Alexandrescu wrote:
 Someone mentioned an old bug in std.file.read here:

 http://www.reddit.com/r/programming/comments/7xnty/walter_bright_on_porting_d_to_the_mac/
 Two programmers sent in patches for the function. Which is to be 
 committed and why? (Linux versions shown. Apologies for noisy line 
 breaks.)

into memory. If a file size is zero then the size of the data available is either zero (expected case) or unbounded (screwy *nix case). Streaming operations should be used on unbounded virtual files, not std.read.

Then you have just created for yourself a never ending bug report :) I think reading /proc and /sys files is an essential part of writing scripts and useful utilities. std.file.read is especially useful since most of the time these files are very tiny, meant to be read all at once. If this functionality *is* forbidden from std.file.read, then a std.file.readUnbounded should still be available. Complaining that *nix isn't pure enough for the likes of D isn't very practical.

I wasn't saying that at all. But try using this approach on /dev/random, for example. One of the great things about *nix is that _anything_ can be represented as a file and therefore operated on via the same scripting techniques, however, I think it's an open issue whether this behavior lies within or outside the design of std.read. I do think one could argue that std.read should make a best effort and read from such files up to some imposed maximum byte count, but this still seems like stretching its intended purpose to me. Sean
Feb 16 2009
prev sibling next sibling parent Steve Schveighoffer <schveiguy yahoo.com> writes:
On Mon, 16 Feb 2009 16:36:52 -0800, Sean Kelly wrote:

 Andrei Alexandrescu wrote:
 
 I totally agree. The useful spec of std.file.read should be "reads the
 file to exhaustion in a buffer and returns it"

Okay, that's a fair definition. So the correct behavior for reading an unbounded stream should be an out of memory error? This would be entirely reasonable, but any failure conditions should be described as well.

Then why doesn't cp have a warning about copying /dev/random or /dev/zero? I think it's enough to specify what it does (read the entire file) and trust that people understand what they are doing when they attempt to read an unbounded stream. The warning belongs with the stream, not with the simple utility. It probably only takes screwing it up once for the developer to learn ;) But I guess it doesn't hurt...

-Steve
Feb 16 2009
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 17 Feb 2009 21:08:15 +0300, bearophile <bearophileHUGS lycos.com> wrote:

 Regarding this really useful part of the std lib, I warmly suggest to  
 also benchmark the final code against a similar operation done in Perl  
 or Python2.x. Such tests very often show performance bugs/problems.

 Python code for the test is really simple:

 file("filename.txt").read()

 Bye,
 bearophile

How does it work with /dev/random etc?
Feb 17 2009