digitalmars.D.learn - byChunk odd behavior?

reply Hanh <hanh425 gmail.com> writes:
Hi all,

I'm trying to process a rather large file as an InputRange and 
run into something strange with byChunk / take.

void test() {
	auto file = new File("test.txt");
	auto input = file.byChunk(2).joiner;
	input.take(3).array;
	foreach (char c; input) {
		writeln(c);
	}
}

Let's say test.txt contains "123456".

The output will be
3
4
5
6

The "take" consumed one chunk from the file, but if I increase 
the chunk size to 4, then it won't.

It looks like if "take" spans two chunks, it affects the input 
range otherwise it doesn't.

Actually, what is the easiest way to read a large file as a 
stream? My file contains a bunch of serialized messages of 
variable length.

Thanks,
--h
Mar 22 2016
next sibling parent Hanh <hanh425 gmail.com> writes:
I have the feeling that it's related to the forward only nature of an InputRange. All would be ok with a take(N)+popFrontN method. I'm going to keep looking.
Mar 22 2016
prev sibling next sibling parent Taylor Hillegeist <taylorh140 gmail.com> writes:
I don't know if this helps, but it looks like since take(3) doesn't consume the chunk, it is not removed from the range.

    import std.stdio;
    import std.algorithm;
    import std.range;

    void main() {
        auto file = stdin;
        auto input = file.byChunk(2).joiner;
        foreach (char c; input.take(3).array) {
            writeln(c);
        }
        foreach (char c; input) {
            writeln(c);
        }
    }

Produces:

    1
    2
    3  < Got data but didn't eat the chunk.
    3
    4
    5
    6
Mar 22 2016
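A small sketch of why take does not "eat" the range. It assumes a hypothetical input range, Counter, which is not part of Phobos and exists only for illustration. Because Counter is a struct, take receives a copy and the original stays where it was. With byChunk the picture is muddier, presumably because the copies still share the file handle and the chunk buffer, which would be consistent with the 3 4 5 6 output reported above.

```d
import std.array : array;
import std.range : take;

// Hypothetical minimal input range (no save), used only for illustration.
struct Counter
{
    int next;
    int limit;

    @property bool empty() const { return next >= limit; }
    @property int front() const { return next; }
    void popFront() { ++next; }
}

// Calls take twice on the same struct range and returns both results.
int[][] takeTwice()
{
    auto r = Counter(0, 6);
    auto first = r.take(3).array;   // take works on a copy of r
    auto second = r.take(3).array;  // r itself has not advanced
    return [first, second];
}
```

Both calls see the same three elements, since neither of them advances r.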
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 03/22/2016 12:17 AM, Hanh wrote:
 Hi all,

 I'm trying to process a rather large file as an InputRange and run into
 something strange with byChunk / take.

 void test() {
      auto file = new File("test.txt");
      auto input = file.byChunk(2).joiner;
      input.take(3).array;
      foreach (char c; input) {
          writeln(c);
      }
 }

 Let's say test.txt contains "123456".

 The output will be
 3
 4
 5
 6

 The "take" consumed one chunk from the file, but if I increase the chunk
 size to 4, then it won't.
I don't understand the issue fully but byChunk() will treat every character in the file. So, even the newline character(s) are considered.
 Actually, what is the easiest way to read a large file as a stream? My
 file contains a bunch of serialized messages of variable length.
If it's a text file I think I would start with File.byLine (or byLineCopy). Then it depends on how the messages are laid out. One per line? Do you know the size at the start? etc.

Alternatively, use (or examine) one of the great D serialization modules out there. :) (We already need something like this in the standard library, which I think some people are already working on.)

Ali
Mar 22 2016
prev sibling parent reply cy <dlang verge.info.tm> writes:
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
 	input.take(3).array;
 	foreach (char c; input) {
Never use an input range twice. So, here's how to use it twice:

If it's a "forward range" you can use save() to get a copy to use later (but none of the std.stdio.* ranges implement that). You can also use "std.range.tee" to send the results to an "output range" (something implementing put(K)(K)) while iterating over them.

tee can't produce two input ranges, because without caching all iterated items in memory, only one range can request items on demand; the other must take them passively. You could write a thing that takes an InputRange and produces a ForwardRange by caching those items in memory, but at that point you might as well use .array and get the whole thing.

ByChunk is an input range (not a forward range), so there's undefined behavior when you use it twice. No bugs there, since it wasn't meant to be reused anyway. What it does is cache the last seen chunk, first iterate over that, then read more chunks from the file. So every time you iterate, you'll get that same last chunk.

It's also tricky to use input ranges after mutating their underlying data structure. If you seek in the file, for instance, then a previously created ByChunk will produce the chunk it has cached, and only then start reading chunks from that exact position in the file. Take a range over some sort of list: if you delete the current item in the list, should the range produce the previous item? The next item? null?

So, as a general rule, never use input ranges twice, and never use them after mutating the underlying data structure. Just recreate them if you want to do something twice, or use tee as mentioned above.
Mar 22 2016
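A minimal sketch of the tee idea, using the overload of std.range.tee that takes a function instead of an output range. The name sumAndCollect is made up for this example. The range is traversed once; tee passes each element to the lambda on the side while sum consumes the range normally.

```d
import std.algorithm.iteration : sum;
import std.range : iota, tee;
import std.typecons : tuple;

// Traverses the range a single time: sum consumes it, while tee hands
// every element to the lambda, which records it in `seen`.
auto sumAndCollect()
{
    int[] seen;
    int total = iota(1, 6).tee!(x => seen ~= x).sum;
    return tuple(total, seen);
}
```

This avoids a second pass over the range, at the cost of storing whatever the side function decides to keep.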
parent reply Hanh <hanh425 gmail.com> writes:
Thanks for your help everyone.

I agree that the issue is due to the misuse of an InputRange, 
but what are the semantics of 'take' when applied to an 
InputRange? It seems that calling it invalidates the range; in 
which case, what is the recommended way to get a few bytes and 
keep on advancing?

For instance, to read a ushort, I use
range.read!(ushort)()
Unfortunately, it reads a single value.

For now, I use a loop

foreach (i; 0..N) {
   buffer[i] = range.front;
   range.popFront();
   }

Is there a more idiomatic way to do the same thing?

In Scala, 'take' consumes bytes from the iterator. So the same 
code would be
buffer = range.take(N).toArray
Mar 22 2016
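For the "read a few bytes and keep advancing" case, here is a sketch under an assumed wire format that is not taken from the thread: each message is a big-endian ushort length followed by that many payload bytes. The name readMessage is hypothetical. std.bitmanip.read takes its range by ref, so it advances the caller's range, and the loop does the same for the payload.

```d
import std.bitmanip : read;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

// Reads one length-prefixed message and advances `range` past it.
// Assumed format (illustrative only): ushort length, big-endian,
// followed by `length` payload bytes. No check is made for truncated input.
ubyte[] readMessage(R)(ref R range)
if (isInputRange!R && is(ElementType!R : const ubyte))
{
    immutable len = range.read!ushort();  // consumes two bytes from range
    auto payload = new ubyte[](len);
    foreach (ref b; payload)
    {
        b = range.front;
        range.popFront();
    }
    return payload;
}
```

Because the range is passed by ref everywhere, successive calls pick up where the previous one stopped.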
next sibling parent Chris Wright <dhasenan gmail.com> writes:
On Wed, 23 Mar 2016 03:17:05 +0000, Hanh wrote:
 In Scala, 'take' consumes bytes from the iterator. So the same code
 would be buffer = range.take(N).toArray
    import std.range, std.array;
    auto bytes = byteRange.takeExactly(N).array;

There's also take(N), but if the range contains fewer than N elements, it will only give you as many as the range contains. If you're trying to deserialize something, takeExactly is probably better.

http://dpldocs.info/experimental-docs/std.range.takeExactly.html
http://dpldocs.info/experimental-docs/std.array.array.1.html
Mar 23 2016
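A short sketch of the difference between the two, using iota as a stand-in for the byte range. The function names lenient and strict are made up for this example.

```d
import std.array : array;
import std.range : iota, take, takeExactly;

// take is lenient: asking for more elements than exist yields what is there.
int[] lenient(size_t n)
{
    return iota(0, 3).take(n).array;
}

// takeExactly expects the caller to guarantee that at least n elements exist.
int[] strict(size_t n)
{
    return iota(0, 10).takeExactly(n).array;
}
```

With take, a short read goes unnoticed unless the resulting length is checked, which is why takeExactly tends to fit deserialization better.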
prev sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Wednesday, 23 March 2016 at 03:17:05 UTC, Hanh wrote:
 Thanks for your help everyone.

 I agree that the issue is due to the misuse of an InputRange, 
 but what are the semantics of 'take' when applied to an 
 InputRange? It seems that calling it invalidates the range; in 
 which case, what is the recommended way to get a few bytes and 
 keep on advancing?
Doing *anything* to a range invalidates it (or at least you should expect it to); a range is read-once. Never reuse a range. Some ranges can be saved in order to use a copy, but never expect a range to be implicitly reusable.
 For instance, to read a ushort, I use
 range.read!(ushort)()
 Unfortunately, it reads a single value.

 For now, I use a loop

 foreach (i; 0..N) {
   buffer[i] = range.front;
   range.popFront();
   }

 Is there a more idiomatic way to do the same thing?
Two ways, the first one being for reference:

    import std.range: enumerate;
    // enumerate yields (index, value) pairs, in that order
    foreach (index, element; range.enumerate) {
        buffer[index] = element;
    }

And the other one:
 In Scala, 'take' consumes bytes from the iterator. So the same 
 code would be
 buffer = range.take(N).toArray
Then just do that!

    import std.range, std.array;
    auto buffer = range.take(N).array;

    auto example = iota(0, 200, 5).take(5).array;
    assert(example == [0, 5, 10, 15, 20]);
Mar 23 2016
parent reply Hanh <hanh425 gmail.com> writes:
On Wednesday, 23 March 2016 at 19:07:34 UTC, cym13 wrote:

 In Scala, 'take' consumes bytes from the iterator. So the same 
 code would be
 buffer = range.take(N).toArray
 Then just do that!

     import std.range, std.array;
     auto buffer = range.take(N).array;

     auto example = iota(0, 200, 5).take(5).array;
     assert(example == [0, 5, 10, 15, 20]);
Well, that's what I do in the first post but you can't call it twice with an InputRange.

    auto buffer1 = range.take(4).array; // ok
    range.popFrontN(4); // not ok
    auto buffer2 = range.take(4).array; // not ok
Mar 24 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 24 March 2016 at 07:52:27 UTC, Hanh wrote:
 Well, that's what I do in the first post but you can't call it twice with an InputRange.
Please, take some time to reread cy's answer above.

    void main(string[] args) {
        import std.range;
        import std.array;
        import std.algorithm;

        auto range = iota(0, 25, 5);

        // Will not consume (forward ranges only)
        //
        // Note however that range elements are not stored in any way by default
        // so reusing the range will also need you to recompute them each time!
        auto buffer1 = range.save.take(4).array;
        assert(buffer1 == [0, 5, 10, 15]);

        // The solution to the recomputation problem, and often the best way to
        // handle range reuse is to store them in an array
        //
        // This is reusable at will with no redundant computation
        auto arr = range.save.array;
        assert(arr == [0, 5, 10, 15, 20]);

        // And it has a range interface too
        auto buffer2 = arr.take(4).array;
        assert(buffer2 == [0, 5, 10, 15]);

        // This consumes
        auto buffer3 = range.take(4).array;
        assert(buffer3 == [0, 5, 10, 15]);
    }
Mar 25 2016
parent reply Hanh <hanh425 gmail.com> writes:
On Friday, 25 March 2016 at 08:01:04 UTC, cym13 wrote:
         // This consumes
         auto buffer3 = range.take(4).array;
         assert(buffer3 == [0, 5, 10, 15]);
     }
Thanks for your help. However the last statement is incorrect. I am in fact looking for a version of 'take' that consumes the InputRange. You can see it by doing a second take afterwards.

    auto buffer3 = range.take(4).array;
    assert(buffer3 == [0, 5, 10, 15]);
    auto buffer4 = range.take(4).array;
    assert(buffer4 == [0, 5, 10, 15]);

I haven't clearly explained my main goal. I have a large binary file that I need to deserialize. It's not my file and it's in a custom but simple format, so I would prefer not to depend on a third party serializer library but I will look into that.

I was thinking along the lines of:
1. Open file
2. Map a byChunk.joiner to read by chunks and present an iterator interface
3. Read data with std.bitmanip/read functions

Step 3 works fine as long as items are single scalar values. bitmanip doesn't have array readers. Obviously, I could loop, but then I thought that for the case of a ubyte[], there would be a shortcut that I don't know about.

Thanks,
--h
Mar 25 2016
parent reply cym13 <cpicard openmailbox.org> writes:
On Saturday, 26 March 2016 at 02:28:53 UTC, Hanh wrote:
 On Friday, 25 March 2016 at 08:01:04 UTC, cym13 wrote:
         // This consumes
         auto buffer3 = range.take(4).array;
         assert(buffer3 == [0, 5, 10, 15]);
     }
 Thanks for your help. However the last statement is incorrect. I am in fact looking for a version of 'take' that consumes the InputRange.
Sorry, it seems I completely misunderstood your goal. I thought that take() consumed its input (which mostly only shows that I really am careful about not reusing ranges). Writing a take that consumes shouldn't be difficult though:

    import std.range, std.traits;
    Take!R takeConsume(R)(auto ref R input, size_t n)
        if (isInputRange!(Unqual!R)
        && !isInfinite!(Unqual!R))
    {
        auto buffer = input.take(n);
        input = input.drop(buffer.walkLength);
        return buffer;
    }

but I think going with std.bitmanip/read may be the easiest in the end.
Mar 26 2016
parent Hanh <hanh425 gmail.com> writes:
On Saturday, 26 March 2016 at 08:34:04 UTC, cym13 wrote:

 Sorry, it seems I completely misunderstood your goal. I thought 
 that take() consumed its input (which mostly only shows that I 
 really am careful about not reusing ranges). Writing a take 
 that consumes shouldn't be difficult though:

     import std.range, std.traits;
     Take!R takeConsume(R)(auto ref R input, size_t n)
         if (isInputRange!(Unqual!R)
         && !isInfinite!(Unqual!R))
     {
         auto buffer = input.take(n);
         input = input.drop(buffer.walkLength);
         return buffer;
     }

 but I think going with std.bitmanip/read may be the easiest in 
 the end.
Turns out bitmanip is actually using a loop.

    foreach(ref e; bytes)
    {
        e = range.front;
        range.popFront();
    }

By the way, in your code above you are actually reusing the range: take is followed by drop and it won't work on an input range like 'byChunk'. That's the problem I ran into (see first post).
Mar 26 2016
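One possible way to get a take that really advances the original, sketched under the assumption that the range is a non-sliceable input range (such as the result of byChunk(...).joiner): wrap it with std.range.refRange so that take works through a pointer to the range instead of on a copy. The names takeConsuming and Counter are hypothetical. For ranges that support slicing, such as arrays, take may slice rather than advance, so this sketch does not try to cover them.

```d
import std.array : array;
import std.range : refRange, take;
import std.range.primitives : isInputRange;

// Hypothetical minimal input range (no save, no slicing), for illustration.
struct Counter
{
    int next;
    int limit;

    @property bool empty() const { return next >= limit; }
    @property int front() const { return next; }
    void popFront() { ++next; }
}

// Takes up to n elements and leaves `range` positioned after them.
// refRange forwards front/popFront/empty to the original through a pointer.
auto takeConsuming(R)(ref R range, size_t n)
if (isInputRange!R)
{
    return refRange(&range).take(n).array;
}
```

The result is materialized with array before returning, so nothing keeps pointing at the caller's range once the function is done.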