www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Input ranges

reply "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
It seems input ranges without any indirection in memory are not
working well with algorithms. This seems to be understood by the
D community. I did not know. Here is my story on the topic so
far:

Recently, I learned that I did not know input ranges much at all,
totally misjudging std.range.refRange in its usefulness to input
ranges:
https://github.com/D-Programming-Language/phobos/pull/3123

At this point some experiments might be in order. (using 2.067.0)

Input ranges from std.stdio are used for reading files. So
assuming we create a file

     auto f = File("test.txt", "w");
     f.writeln(iota(5).map!(a => repeat(to!string(a), 
4)).joiner.joiner("\n"));
     f.close();

We should be able groupBy (chunkBy) its lines:

     writeln(File("test.txt").byLine.groupBy!((a,b) => a == b));

The result is just one group, that is all lines are considered 
equal:

     [["0", "0", "0", "0", "1", "1", "1", "1", "2", "2", "2", "2", 
"3", "3", "3", "3", "4", "4", "4", "4"]]

Alas, byLine reuses the same buffer for each line and thus
groupBy keeps comparing each line with itself. There is a version
of byLine that makes copies:

     writeln(File("test.txt").byLineCopy.groupBy!((a,b) => a == 
b));

Indeed, the result is as expected:

     [["0", "0", "0", "0"], ["1", "1", "1", "1"], ["2", "2", "2", 
"2"], ["3", "3", "3", "3"], ["4", "4", "4", "4"]]

A final test with the undocumented byRecord method (the mapping
after groupBy is for beauty only and does not change the result):

     writeln(File("test.txt")
             .byRecord!string("%s")
             .groupBy!((a,b) => a == b)
             .map!(map!(a => a[0])));

Here, the result is most peculiar:

     [["0", "0", "0", "0"], ["1", "1", "1"], ["2", "2", "2"], 
["3", "3", "3"], ["4", "4", "4"]]

Is byRecord broken? (It is undocumented after all.) In a way,
because it does not contain any indirection. The current fields
tuple is a simple member of the ByRecord struct.

In contrast, the ByLineCopy struct is just a wrapper to a ref
counted ByLineCopyImpl struct with a simple note:

         /* Ref-counting stops the source range's ByLineCopyImpl
          * from getting out of sync after the range is copied, 
e.g.
          * when accessing range.front, then using std.range.take,
          * then accessing range.front again. */

I am uncomfortable at this point. Simple and efficient input
ranges fail in unexpected ways. Internal indirections make all
the difference. It feels like input ranges are hiding something
that should not be hidden.

What am I missing?
Apr 18 2015
parent reply "anonymous" <anonymous example.com> writes:
On Saturday, 18 April 2015 at 22:01:56 UTC, Ulrich Küttler wrote:
 Input ranges from std.stdio are used for reading files. So
 assuming we create a file

     auto f = File("test.txt", "w");
     f.writeln(iota(5).map!(a => repeat(to!string(a), 
 4)).joiner.joiner("\n"));
     f.close();

 We should be able groupBy (chunkBy) its lines:

     writeln(File("test.txt").byLine.groupBy!((a,b) => a == b));

 The result is just one group, that is all lines are considered 
 equal:

     [["0", "0", "0", "0", "1", "1", "1", "1", "2", "2", "2", 
 "2", "3", "3", "3", "3", "4", "4", "4", "4"]]

 Alas, byLine reuses the same buffer for each line and thus
 groupBy keeps comparing each line with itself. There is a 
 version
 of byLine that makes copies:

     writeln(File("test.txt").byLineCopy.groupBy!((a,b) => a == 
 b));

 Indeed, the result is as expected:

     [["0", "0", "0", "0"], ["1", "1", "1", "1"], ["2", "2", 
 "2", "2"], ["3", "3", "3", "3"], ["4", "4", "4", "4"]]
Yeah, byLine is dangerous. byLineCopy should probably have been the default. Maybe we should rename byLine to byLineNoCopy (doing the proper deprecation dance, of course).
 A final test with the undocumented byRecord method (the mapping
 after groupBy is for beauty only and does not change the 
 result):

     writeln(File("test.txt")
             .byRecord!string("%s")
             .groupBy!((a,b) => a == b)
             .map!(map!(a => a[0])));

 Here, the result is most peculiar:

     [["0", "0", "0", "0"], ["1", "1", "1"], ["2", "2", "2"], 
 ["3", "3", "3"], ["4", "4", "4"]]

 Is byRecord broken? (It is undocumented after all.) In a way,
 because it does not contain any indirection. The current fields
 tuple is a simple member of the ByRecord struct.

 In contrast, the ByLineCopy struct is just a wrapper to a ref
 counted ByLineCopyImpl struct with a simple note:

         /* Ref-counting stops the source range's ByLineCopyImpl
          * from getting out of sync after the range is copied, 
 e.g.
          * when accessing range.front, then using 
 std.range.take,
          * then accessing range.front again. */

 I am uncomfortable at this point. Simple and efficient input
 ranges fail in unexpected ways. Internal indirections make all
 the difference. It feels like input ranges are hiding something
 that should not be hidden.

 What am I missing?
I guess the problem is the mix of value and reference semantics. ByRecord's `current` is a value, but its `file` has reference semantics. So, a copy of a ByRecord affects one part of the original but not the other. Maybe copying should be ` disable`d for such ranges/structs. Then you couldn't pass it by value to groupBy. Instead you would have to use something like (the fixed version of) refRange, which works properly.
Apr 19 2015
parent reply "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Sunday, 19 April 2015 at 11:33:26 UTC, anonymous wrote:
 I guess the problem is the mix of value and reference 
 semantics. ByRecord's `current` is a value, but its `file` has 
 reference semantics. So, a copy of a ByRecord affects one part 
 of the original but not the other.
I agree. Yet I am convinced most (all?) proper input ranges read input from an external source. (Reference semantic right there.) Input ranges are one-pass ranges after all. Therefore, reference semantics are required in any case (unless the use of the range is known beforehand.) groupBy is a nice example as it laboriously adds reference semantics to forward ranges but assumes input ranges to posses reference semantics by themselves.
 Maybe copying should be ` disable`d for such ranges/structs. 
 Then you couldn't pass it by value to groupBy. Instead you 
 would have to use something like (the fixed version of) 
 refRange, which works properly.
Again, I agree. Disallow copying for such ranges would prevent the error, still it would be a bit harsh. It is not just groupBy that goes astray. You can also get strange results from take: auto r = File("test.txt").byRecord!string("%s"); foreach (i; 0 .. 5) writeln(r.take(4).map!(a => a[0])); I was hoping to find some communication how input ranges should be done. At this point the solution in byLineCopy feels ad hoc.
Apr 19 2015
parent reply "anonymous" <anonymous example.com> writes:
On Sunday, 19 April 2015 at 21:42:23 UTC, Ulrich Küttler wrote:
 groupBy is a nice example as it laboriously adds reference 
 semantics to forward ranges but assumes input ranges to posses 
 reference semantics by themselves.
All ranges are input ranges, though. Input ranges are the least specialised category. I think it's a mistake to assume/require anything only for input ranges that are not forward ranges. I guess we could require reference semantics for all input ranges (including forward ranges and higher-ups). That would be a rather clean way out of this mess. But it would be a major effort. And I guess it would hurt performance, maybe a lot. [...]
 Again, I agree. Disallow copying for such ranges would prevent 
 the error, still it would be a bit harsh. It is not just 
 groupBy that goes astray. You can also get strange results from 
 take:

     auto r = File("test.txt").byRecord!string("%s");
     foreach (i; 0 .. 5)
         writeln(r.take(4).map!(a => a[0]));
That would also not be possible if ByRecord had an ` disable this(this);`. But I'm not at all sure that this is the way to go. It's bound to be forgotten, and there are probably places where it's needlessly restrictive.
 I was hoping to find some communication how input ranges should 
 be done.
I'm right there with you. This is all a bit iffy. I suspect that this is an oversight in the design of ranges, as the documentation of std.range doesn't say anything on the topic. This may be too advanced for D.learn. I don't think Andrei and Walter come down here very often. Maybe take it to the general board.
 At this point the solution in byLineCopy feels ad hoc.
I think byLineCopy solves a different problem. ByLine already has this: https://github.com/D-Programming-Language/phobos/blob/v2.067.0/std/stdio.d#L1592-L1598
Apr 19 2015
parent "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Sunday, 19 April 2015 at 23:49:08 UTC, anonymous wrote:
 On Sunday, 19 April 2015 at 21:42:23 UTC, Ulrich Küttler wrote:
 groupBy is a nice example as it laboriously adds reference 
 semantics to forward ranges but assumes input ranges to posses 
 reference semantics by themselves.
All ranges are input ranges, though. Input ranges are the least specialised category. I think it's a mistake to assume/require anything only for input ranges that are not forward ranges.
Yes and no. It is reasonable to provide special implementations for ranges that are just input ranges. This is what groupBy does. The user gets a function that works with all kinds of ranges.
 I guess we could require reference semantics for all input 
 ranges (including forward ranges and higher-ups). That would be 
 a rather clean way out of this mess. But it would be a major 
 effort. And I guess it would hurt performance, maybe a lot.
Definitely, general reference semantics would solve a ton of difficulty. However, this would not be D anymore. This seems to be the catch: For optimal performance memory layout is important. You cannot hide it behind an interface.
 At this point the solution in byLineCopy feels ad hoc.
I think byLineCopy solves a different problem. ByLine already has this: https://github.com/D-Programming-Language/phobos/blob/v2.067.0/std/stdio.d#L1592-L1598
The same indirection in byLineCopy. This is what I was referring to: https://github.com/D-Programming-Language/phobos/blob/v2.067.0/std/stdio.d#L1783-L1789 Whoever did that knew very well what is going on.
Apr 20 2015