digitalmars.D.learn - byChunk odd behavior?

Hanh <hanh425 gmail.com> writes:

Hi all,

I'm trying to process a rather large file as an InputRange and 
run into something strange with byChunk / take.

void test() {
	auto file = new File("test.txt");
	auto input = file.byChunk(2).joiner;
	input.take(3).array;
	foreach (char c; input) {
		writeln(c);
	}
}

Let's say test.txt contains "123456".

The output will be
3
4
5
6

The "take" consumed one chunk from the file, but if I increase 
the chunk size to 4, then it won't.

It looks like if "take" spans two chunks, it affects the input 
range otherwise it doesn't.

Actually, what is the easiest way to read a large file as a 
stream? My file contains a bunch of serialized messages of 
variable length.

Thanks,
--h

Mar 22 2016

Hanh <hanh425 gmail.com> writes:

I have the feeling that it's related to the forward-only nature 
of an InputRange. Everything would be fine with a method that 
combines take(N) and popFrontN. I'm going to keep looking.

Mar 22 2016

Taylor Hillegeist <taylorh140 gmail.com> writes:

I don't know if this helps, but it looks like, since take(3) 
doesn't consume the whole chunk, the chunk is not removed from 
the range.

import std.stdio;
import std.algorithm;
import std.range;

void main() {
	auto file = stdin;
	auto input = file.byChunk(2).joiner;
	
	foreach (char c; input.take(3).array) {
		writeln(c);
	}
	
	foreach (char c; input) {
		writeln(c);
	}
}

Produces:
1
2
3 < Got data but didn't eat the chunk.
3
4
5
6

Mar 22 2016

Ali Çehreli <acehreli yahoo.com> writes:

On 03/22/2016 12:17 AM, Hanh wrote:
 Hi all,

 I'm trying to process a rather large file as an InputRange and run into
 something strange with byChunk / take.

 void test() {
      auto file = new File("test.txt");
      auto input = file.byChunk(2).joiner;
      input.take(3).array;
      foreach (char c; input) {
          writeln(c);
      }
 }

 Let's say test.txt contains "123456".

 The output will be
 3
 4
 5
 6

 The "take" consumed one chunk from the file, but if I increase the chunk
 size to 4, then it won't.

I don't understand the issue fully, but byChunk() processes every 
byte in the file, so even the newline character(s) are included.

 Actually, what is the easiest way to read a large file as a stream? My
 file contains a bunch of serialized messages of variable length.

If it's a text file I think I would start with File.byLine (or 
byLineCopy). Then it depends on how the messages are laid out. One per 
line? Do you know the size at the start? etc.

Alternatively, use (or examine) one of the great D serialization modules 
out there. :)

(We already need something like this in the standard library, which I 
think some people are already working on.)

Ali

Mar 22 2016

cy <dlang verge.info.tm> writes:

On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
 	input.take(3).array;
 	foreach (char c; input) {

Never use an input range twice. So, here's how to use it twice:

If it's a "forward range" you can use save() to get a copy to use 
later (but none of the std.stdio.* ranges implement that). You 
can also use "std.range.tee" to send the results to an "output 
range" (something implementing put(K)(K)) while iterating over 
them.

tee can't produce two input ranges, because without caching all 
iterated items in memory, only one range can request items 
on-demand; the other must take them passively.

You could write a thing that takes an InputRange and produces a 
ForwardRange, by caching those items in memory, but at that point 
you might as well use .array and get the whole thing.

ByChunk is an input range (not a forward range), so there's 
undefined behavior when you use it twice. No bugs there, since it 
wasn't meant to be reused anyway. What it does is cache the last 
seen chunk, first iterate over that, then read more chunks from 
the file. So every time you iterate, you'll get that same last 
chunk.

It's also tricky to use input ranges after mutating their 
underlying data structure. If you seek in the file, for instance, 
then a previously created ByChunk will produce the chunk it has 
cached, and only then start reading chunks from that exact 
position in the file. A range over some sort of list, if you 
delete the current item in the list, should the range produce the 
previous item? The next item? null?

So, as a general rule, never use input ranges twice, and never 
use them after mutating the underlying data structure. Just 
recreate them if you want to do something twice, or use tee as 
mentioned above.
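
For reference, here is a minimal sketch of the tee approach, using the function form of std.range.tee rather than an output range. The variable names are made up for illustration, and iota stands in for a real input range. The range is iterated exactly once: array pulls the elements on demand while the lambda passively receives each one.

```d
import std.array : array;
import std.range : iota, tee;

void main()
{
    int[] log;   // passive side channel: receives each element as it goes by

    // Single pass: `array` requests the elements, and the lambda handed to
    // tee is called once per element while that pass is running.
    auto collected = iota(0, 25, 5).tee!(x => log ~= x).array;

    assert(collected == [0, 5, 10, 15, 20]);
    assert(log == collected);
}
```

Nothing is iterated a second time here, which is the whole point: the second "consumer" only sees what the first one pulls.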

Mar 22 2016

Hanh <hanh425 gmail.com> writes:

Thanks for your help everyone.

I agree that the issue is due to misuse of an InputRange, 
but what are the semantics of 'take' when applied to an 
InputRange? It seems that calling it invalidates the range, in 
which case what is the recommended way to get a few bytes and 
keep on advancing?

For instance, to read a ushort, I use
range.read!(ushort)()
Unfortunately, it reads a single value.

For now, I use a loop

foreach (i; 0..N) {
    buffer[i] = range.front;
    range.popFront();
}

Is there a more idiomatic way to do the same thing?

In Scala, 'take' consumes bytes from the iterator. So the same 
code would be
buffer = range.take(N).toArray

Mar 22 2016

Chris Wright <dhasenan gmail.com> writes:

On Wed, 23 Mar 2016 03:17:05 +0000, Hanh wrote:
 In Scala, 'take' consumes bytes from the iterator. So the same code
 would be buffer = range.take(N).toArray

import std.range, std.array;
auto bytes = byteRange.takeExactly(N).array;

There's also take(N), but if the range contains fewer than N elements, it 
will only give you as many as the range contains. If you're trying to 
deserialize something, takeExactly is probably better.


http://dpldocs.info/experimental-docs/std.range.takeExactly.html
http://dpldocs.info/experimental-docs/std.array.array.1.html
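
A small sketch of the difference, using iota as a stand-in for the byte range (an assumption made only to keep the example self-contained). Note that with a value-type forward range like this one, neither call advances the original.

```d
import std.array : array;
import std.range : iota, take, takeExactly;

void main()
{
    auto shortRange = iota(3);   // 0, 1, 2

    // take(5) quietly stops early when only 3 elements are available
    assert(shortRange.take(5).array == [0, 1, 2]);

    // takeExactly requires that the requested count is really there
    assert(shortRange.takeExactly(2).array == [0, 1]);

    // shortRange itself was not advanced by either call
    assert(shortRange.array == [0, 1, 2]);
}
```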

Mar 23 2016

cym13 <cpicard openmailbox.org> writes:

On Wednesday, 23 March 2016 at 03:17:05 UTC, Hanh wrote:
 Thanks for your help everyone.

 I agree that the issue is due to misuse of an InputRange, 
 but what are the semantics of 'take' when applied to an 
 InputRange? It seems that calling it invalidates the range, in 
 which case what is the recommended way to get a few bytes and 
 keep on advancing?

Doing *anything* to a range invalidates it (or at least you 
should expect it to); a range is read-once. Never reuse a range. 
Some ranges can be saved in order to use a copy, but never 
expect a range to be implicitly reusable.

 For instance, to read a ushort, I use
 range.read!(ushort)()
 Unfortunately, it reads a single value.

 For now, I use a loop

 foreach (i; 0..N) {
    buffer[i] = range.front;
    range.popFront();
 }

 Is there a more idiomatic way to do the same thing?

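Two ways, the first one being for reference:

     import std.range : enumerate, take;
     // enumerate yields (index, element) pairs, in that order;
     // take(N) keeps the loop within the buffer's N slots
     foreach (index, element; range.take(N).enumerate) {
         buffer[index] = element;
     }

And the other one: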

 In Scala, 'take' consumes bytes from the iterator. So the same 
 code would be
 buffer = range.take(N).toArray

Then just do that!

     import std.range, std.array;
     auto buffer = range.take(N).array;

     auto example = iota(0, 200, 5).take(5).array;
     assert(example == [0, 5, 10, 15, 20]);

Mar 23 2016

Hanh <hanh425 gmail.com> writes:

On Wednesday, 23 March 2016 at 19:07:34 UTC, cym13 wrote:

 In Scala, 'take' consumes bytes from the iterator. So the same 
 code would be
 buffer = range.take(N).toArray

 Then just do that!

     import std.range, std.array;
     auto buffer = range.take(N).array;

     auto example = iota(0, 200, 5).take(5).array;
     assert(example == [0, 5, 10, 15, 20]);

Well, that's what I do in the first post but you can't call it 
twice with an InputRange.

auto buffer1 = range.take(4).array; // ok
range.popFrontN(4); // not ok
auto buffer2 = range.take(4).array; // not ok

Mar 24 2016

cym13 <cpicard openmailbox.org> writes:

Please, take some time to reread cy's answer above.

     void main(string[] args) {
         import std.range;
         import std.array;
         import std.algorithm;

         auto range = iota(0, 25, 5);

         // Will not consume (forward ranges only)
         //
         // Note however that range elements are not stored in any way by
         // default, so reusing the range will also need you to recompute
         // them each time!
         auto buffer1 = range.save.take(4).array;
         assert(buffer1 == [0, 5, 10, 15]);

         // The solution to the recomputation problem, and often the best
         // way to handle range reuse, is to store them in an array
         //
         // This is reusable at will with no redundant computation
         auto arr = range.save.array;
         assert(arr == [0, 5, 10, 15, 20]);

         // And it has a range interface too
         auto buffer2 = arr.take(4).array;
         assert(buffer2 == [0, 5, 10, 15]);

         // This consumes
         auto buffer3 = range.take(4).array;
         assert(buffer3 == [0, 5, 10, 15]);
     }

Mar 25 2016

Hanh <hanh425 gmail.com> writes:

On Friday, 25 March 2016 at 08:01:04 UTC, cym13 wrote:
         // This consumes
         auto buffer3 = range.take(4).array;
         assert(buffer3 == [0, 5, 10, 15]);
     }

Thanks for your help. However, the last statement is incorrect: 
'take' does not consume the range there. I am in fact looking for 
a version of 'take' that consumes the InputRange.

You can see it by doing a second take afterwards.

     auto buffer3 = range.take(4).array;
     assert(buffer3 == [0, 5, 10, 15]);
     auto buffer4 = range.take(4).array;
     assert(buffer4 == [0, 5, 10, 15]);

I haven't clearly explained my main goal. I have a large binary 
file that I need to deserialize. It's not my file and it's in a 
custom but simple format, so I would prefer not to depend on a 
third party serializer library but I will look into that.

I was thinking along the lines of:
1. Open file
2. Map a byChunk.joiner to read by chunks and present an iterator 
interface
3. Read data with std.bitmanip/read functions

Step 3 works fine as long as items are single scalar values. 
bitmanip doesn't have array readers. Obviously, I could loop, but 
then I thought that for the case of a ubyte[], there would be a 
shortcut that I don't know about.
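
One possible alternative to the range pipeline, sketched under assumptions: the framing below (a 2-byte big-endian length followed by the payload) and the names readMessage and messages_demo.bin are invented for illustration, since the real file format is not described here. It reads straight from the File with rawRead, which fills a ubyte[] in one call and advances the file position, so no element-by-element loop is needed.

```d
import std.bitmanip : bigEndianToNative;
import std.exception : enforce;
import std.stdio : File;

/// Reads one message framed as a 2-byte big-endian length plus payload
/// (a made-up layout for illustration). Returns false at a clean end of file.
bool readMessage(File f, out ubyte[] payload)
{
    ubyte[2] header;
    auto got = f.rawRead(header[]);   // returns the part of the buffer that was filled
    if (got.length == 0)
        return false;
    enforce(got.length == header.length, "truncated length prefix");

    immutable len = bigEndianToNative!ushort(header);
    payload = new ubyte[](len);
    // rawRead may reject an empty buffer, so skip the call for empty messages
    if (len > 0)
        enforce(f.rawRead(payload).length == len, "truncated payload");
    return true;
}

void main()
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // Two messages: payload [10, 20, 30] (length 3) and payload [40] (length 1)
    ubyte[] data = [0, 3, 10, 20, 30, 0, 1, 40];
    auto path = buildPath(tempDir, "messages_demo.bin");
    write(path, data);
    scope (exit) remove(path);

    auto f = File(path, "rb");
    ubyte[][] messages;
    ubyte[] msg;
    while (readMessage(f, msg))
        messages ~= msg;

    ubyte[] first = [10, 20, 30];
    ubyte[] second = [40];
    assert(messages.length == 2);
    assert(messages[0] == first);
    assert(messages[1] == second);
}
```

The file handle keeps track of the position, so each call continues where the previous one stopped, which is the "consuming take" behaviour wanted here.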

Thanks,
--h

Mar 25 2016

cym13 <cpicard openmailbox.org> writes:

Sorry, it seems I completely misunderstood your goal. I thought 
that take() consumed its input (which mostly only shows that I 
really am careful about not reusing ranges). Writing a take that 
consumes shouldn't be difficult though:

     import std.range, std.traits;
     Take!R takeConsume(R)(auto ref R input, size_t n)
         if (isInputRange!(Unqual!R)
         && !isInfinite!(Unqual!R))
     {
         auto buffer = input.take(n);
         input = input.drop(buffer.walkLength);
         return buffer;
     }

but I think going with std.bitmanip/read may be the easiest in 
the end.

Mar 26 2016

Hanh <hanh425 gmail.com> writes:

Turns out bitmanip is actually using a loop.

foreach(ref e; bytes)
{
   e = range.front;
   range.popFront();
}

By the way, in your code above you are actually reusing the 
range: take is followed by drop and it won't work on an input 
range like 'byChunk'. That's the problem I ran into (see first 
post).
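
A sketch of a consuming variant that avoids that problem, assuming an eager copy into an array is acceptable. The name consumeN is hypothetical and not part of Phobos. It takes the range by ref and touches it only through front and popFront, so no copy of the range is ever iterated.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

/// Hypothetical helper (not part of Phobos): removes up to `n` elements from
/// `r` itself and returns them. Because `r` is taken by ref, the caller's
/// range is the one that advances.
ElementType!R[] consumeN(R)(ref R r, size_t n)
if (isInputRange!R)
{
    ElementType!R[] result;
    result.reserve(n);
    for (; n > 0 && !r.empty; --n)
    {
        result ~= r.front;   // copy the element out before advancing
        r.popFront();
    }
    return result;
}

void main()
{
    import std.range : iota;

    auto range = iota(0, 25, 5);                  // 0, 5, 10, 15, 20
    assert(range.consumeN(4) == [0, 5, 10, 15]);
    assert(range.consumeN(4) == [20]);            // continues where the first call stopped
    assert(range.empty);
}
```

This is the same loop as above wrapped in a function, so it should behave the same way on a byChunk.joiner range, though that case is not exercised in the example.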

Mar 26 2016
