www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Big Oversight with readln?

reply Jonathan Marler <johnnymarler gmail.com> writes:
I can't figure out how to make use of the full capacity of 
buffers that are allocated by readln.  Take the example code from 
the documentation:

     // Read lines from $(D stdin) and count words

     void main()
     {
         char[] buf;
         size_t words = 0;

         while (!stdin.eof)
         {
             char[] line = buf;
             stdin.readln(line);
             if (line.length > buf.length)
                 buf = line;

             words += line.split.length;
         }

         writeln(words);
     }

When buf is not large enough to hold the line, readln will 
allocate a new buffer to accomodate and this example shows how 
you can save that new buffer to reuse the next time.  The problem 
is that the capacity of the new buffer is nowhere to be found. 
readln only returns the line that was read which is only a slice 
of the buffer that was allocated.  The next time that readln is 
called, it will not read past the slice even if the capacity of 
the buffer it allocated was larger.  This will cause a new 
allocation/copy every time you read a line that was larger than 
all the previous lines, even if a previous allocation was already 
large enough. This seems like a big oversight to me, I must be 
missing something right?
Feb 23 2017
parent reply "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 02/23/2017 09:43 PM, Jonathan Marler wrote:
 I can't figure out how to make use of the full capacity of buffers that
 are allocated by readln.  Take the example code from the documentation:

      // Read lines from $(D stdin) and count words

      void main()
      {
          char[] buf;
          size_t words = 0;

          while (!stdin.eof)
          {
              char[] line = buf;
              stdin.readln(line);
              if (line.length > buf.length)
                  buf = line;

              words += line.split.length;
          }

          writeln(words);
      }

 When buf is not large enough to hold the line, readln will allocate a
 new buffer to accomodate and this example shows how you can save that
 new buffer to reuse the next time.  The problem is that the capacity of
 the new buffer is nowhere to be found. readln only returns the line that
 was read which is only a slice of the buffer that was allocated.  The
 next time that readln is called, it will not read past the slice even if
 the capacity of the buffer it allocated was larger.  This will cause a
 new allocation/copy every time you read a line that was larger than all
 the previous lines, even if a previous allocation was already large
 enough. This seems like a big oversight to me, I must be missing
 something right?
I don't think that problem is actually occurring: Let's step through the code, and suppose you're reading the following four lines of text: 12345 123456789 123 1234567 Starting out, buf.length is 0. When reading the first line, the buffer isn't big enough for 5, so readln allocates returns new buffer of length 5. That is more than buf.length (0), so the new buffer becomes the new buf. Second line, again, buf (length 5) isn't big enough for 9, so readln allocates a new buffer length 9. That's more than the old one (5), so again your code sets buf to the larger new buffer (length 9). Third line: buf (length 9) can definitely hold length 3, so readln does not allocate. The new slice returned (length 3) is NOT longer than buf (still length 9), so buf is NOT set to the slice returned by readln. So buf REMAINS length 9. Fourth line: buf (still length 9) can definitely hold length 7, so readln does not allocate.
Feb 23 2017
parent reply Jonathan Marler <johnnymarler gmail.com> writes:
On Friday, 24 February 2017 at 03:45:35 UTC, Nick Sabalausky 
(Abscissa) wrote:
 On 02/23/2017 09:43 PM, Jonathan Marler wrote:
 I can't figure out how to make use of the full capacity of 
 buffers that
 are allocated by readln.  Take the example code from the 
 documentation:

      // Read lines from $(D stdin) and count words

      void main()
      {
          char[] buf;
          size_t words = 0;

          while (!stdin.eof)
          {
              char[] line = buf;
              stdin.readln(line);
              if (line.length > buf.length)
                  buf = line;

              words += line.split.length;
          }

          writeln(words);
      }

 When buf is not large enough to hold the line, readln will 
 allocate a
 new buffer to accomodate and this example shows how you can 
 save that
 new buffer to reuse the next time.  The problem is that the 
 capacity of
 the new buffer is nowhere to be found. readln only returns the 
 line that
 was read which is only a slice of the buffer that was 
 allocated.  The
 next time that readln is called, it will not read past the 
 slice even if
 the capacity of the buffer it allocated was larger.  This will 
 cause a
 new allocation/copy every time you read a line that was larger 
 than all
 the previous lines, even if a previous allocation was already 
 large
 enough. This seems like a big oversight to me, I must be 
 missing
 something right?
I don't think that problem is actually occurring: Let's step through the code, and suppose you're reading the following four lines of text: 12345 123456789 123 1234567 Starting out, buf.length is 0. When reading the first line, the buffer isn't big enough for 5, so readln allocates returns new buffer of length 5. That is more than buf.length (0), so the new buffer becomes the new buf. Second line, again, buf (length 5) isn't big enough for 9, so readln allocates a new buffer length 9. That's more than the old one (5), so again your code sets buf to the larger new buffer (length 9). Third line: buf (length 9) can definitely hold length 3, so readln does not allocate. The new slice returned (length 3) is NOT longer than buf (still length 9), so buf is NOT set to the slice returned by readln. So buf REMAINS length 9. Fourth line: buf (still length 9) can definitely hold length 7, so readln does not allocate.
You're looking at this from the apps perspective and forgetting about what readln is doing under the hood. It can't know how big the next line is going to be before it reads it so it's going guess how much to allocate. If you look at the implementation in (http://github.com/dlang/phobos/blob/master/std/stdio.d), you can see it doubles the size of the current buffer and adds some more for good measure (on line 4479 as of writing this). So in your example after it reads the first line, its going to allocate an initial buffer of some size, maybe 200 or so, then eventually returns a slice of the first 5 characters into that buffer. When it reads the second line it can't use the rest of that initial buffer because the size of the buffer is gone, so it has to allocate a new buffer. At least that's what I thought until I found what I was missing! I discovered the .capacity property of arrays. I don't know why I've never seen this but it looks like this is how readln is recovering this seemingly lost peice of data. This does have an odd consequence though, if you pass a slice into readln it will read past the end of it if the underlying buffer is larger. This might be something worth adding to the documentation. Also I'm not completely sure how .capacity works, I assume it has to look up this information in the memory management metadata. Any enlightenment on this subject is appreciated. It also says it's a O(log(n)) operation so I'm guessing it's looking it up in some sort of binary tree data structure.
Feb 23 2017
parent reply Mike Parker <aldacron gmail.com> writes:
On Friday, 24 February 2017 at 04:22:17 UTC, Jonathan Marler 
wrote:

 I discovered the .capacity property of arrays.  I don't know 
 why I've never seen this but it looks like this is how readln 
 is recovering this seemingly lost peice of data.  This does 
 have an odd consequence though, if you pass a slice into readln 
 it will read past the end of it if the underlying buffer is 
 larger.  This might be something worth adding to the 
 documentation.

 Also I'm not completely sure how .capacity works, I assume it 
 has to look up this information in the memory management 
 metadata.  Any enlightenment on this subject is appreciated. It 
 also says it's a O(log(n)) operation so I'm guessing it's 
 looking it up in some sort of binary tree data structure.
https://dlang.org/phobos/object.html#.capacity
Feb 23 2017
parent Mike Parker <aldacron gmail.com> writes:
On Friday, 24 February 2017 at 07:38:04 UTC, Mike Parker wrote:
 On Friday, 24 February 2017 at 04:22:17 UTC, Jonathan Marler 
 wrote:

 I discovered the .capacity property of arrays.  I don't know 
 why I've never seen this but it looks like this is how readln 
 is recovering this seemingly lost peice of data.  This does 
 have an odd consequence though, if you pass a slice into 
 readln it will read past the end of it if the underlying 
 buffer is larger.  This might be something worth adding to the 
 documentation.

 Also I'm not completely sure how .capacity works, I assume it 
 has to look up this information in the memory management 
 metadata.  Any enlightenment on this subject is appreciated. 
 It also says it's a O(log(n)) operation so I'm guessing it's 
 looking it up in some sort of binary tree data structure.
https://dlang.org/phobos/object.html#.capacity
And although it never mentions capacity directly (capacity may not have been available when it was written, I can't recall), the How It Works section of Steven's array article gives the gist of it. https://dlang.org/d-array-article.html
Feb 23 2017