www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - std.algorithm.joiner unexpected behavior

reply Sergei Degtiarev <sdegtiarev yahoo.com> writes:
Hi,
I tried to create a simple range concatenating several files, 
something like this:

	File[] files;
	....
	foreach(ln; joiner(files.map!(a => a.byLine)))
		writeln(ln);

and I see every first line of each file is missing.
However, when I do same thing with separator:

	char[][] separator=["**** new file ****".dup];
	foreach(ln; joiner(files.map!(a => a.byLine), separator))
		writeln(ln);

everything works fine, both separator and first lines are printed.
It looks pretty much as a bug for me, but I'm not sure I use it 
correctly.
Aug 31
next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 31 August 2017 at 18:26:33 UTC, Sergei Degtiarev 
wrote:
 Hi,
 I tried to create a simple range concatenating several files, 
 something like this:

 	File[] files;
 	....
 	foreach(ln; joiner(files.map!(a => a.byLine)))
 		writeln(ln);

 and I see every first line of each file is missing.
 However, when I do same thing with separator:

 	char[][] separator=["**** new file ****".dup];
 	foreach(ln; joiner(files.map!(a => a.byLine), separator))
 		writeln(ln);

 everything works fine, both separator and first lines are 
 printed.
 It looks pretty much as a bug for me, but I'm not sure I use it 
 correctly.
Doesn't byLine() reuse a buffer, joiner probably caches the first line and calls .popFront()
Aug 31
next sibling parent Jonathan M Davis via Digitalmars-d-learn writes:
On Thursday, August 31, 2017 18:43:40 Jesse Phillips via Digitalmars-d-learn 
wrote:
 On Thursday, 31 August 2017 at 18:26:33 UTC, Sergei Degtiarev

 wrote:
 Hi,
 I tried to create a simple range concatenating several files,

 something like this:
     File[] files;
     ....
     foreach(ln; joiner(files.map!(a => a.byLine)))

         writeln(ln);

 and I see every first line of each file is missing.

 However, when I do same thing with separator:
     char[][] separator=["**** new file ****".dup];
     foreach(ln; joiner(files.map!(a => a.byLine), separator))

         writeln(ln);

 everything works fine, both separator and first lines are
 printed.
 It looks pretty much as a bug for me, but I'm not sure I use it
 correctly.
Doesn't byLine() reuse a buffer, joiner probably caches the first line and calls .popFront()
In general, byLine does not work with other range-based algorithms precisely because it reuses the buffer. I think that it does manage to work for some, but IMHO, it should have just supported foreach via opApply and not been a range at all. It's great for efficiency in a loop but horrible for range chaining. Folks get bit by this problem all the time. Fortunately, we now have byLineCopy which actually uses a separate dynamic array for each line, but even still, often, someone grabs byLine and then gets weird behavior that they don't understand. - Jonathan M Davis
Aug 31
prev sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Aug 31, 2017 at 01:34:39PM -0600, Jonathan M Davis via
Digitalmars-d-learn wrote:
[...]
 In general, byLine does not work with other range-based algorithms
 precisely because it reuses the buffer. I think that it does manage to
 work for some, but IMHO, it should have just supported foreach via
 opApply and not been a range at all. It's great for efficiency in a
 loop but horrible for range chaining.
[...] Transient ranges are tricky to work with, I agree, but I don't agree that they're "horrible for range chaining". I argue that many range algorithms that assume the persistence of .front are inherently wrong, and ought to be implemented in a way that does *not* make this assumption. Many std.algorithm algorithms *can* in fact be written in this way, and those that aren't, are arguably buggy. For example, some std.algorithm functions take a forward range but then tries to save .front to a local variable. Rather, they should use .save to save the previous position of the range so that they can call .front on that to access the previous element, instead of making the unfounded assumption that whatever the local variable refers to will still remain valid after calling .popFront. It's just sloppy coding IMNSHO. T -- MS Windows: 64-bit rehash of 32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit operating system originally coded for a 4-bit microprocessor, written by a 2-bit company that can't stand 1-bit of competition.
Aug 31
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Aug 31, 2017 at 06:26:33PM +0000, Sergei Degtiarev via
Digitalmars-d-learn wrote:
[...]
 	File[] files;
 	....
 	foreach(ln; joiner(files.map!(a => a.byLine)))
 		writeln(ln);
You probably want to use byLineCopy instead. The problem here is that .byLine returns a transient range (i.e., the value returned by .front changes after .popFront is called), which can cause unexpected behaviour when used with certain algorithms. T -- I don't trust computers, I've spent too long programming to think that they can get anything right. -- James Miller
Aug 31
prev sibling next sibling parent Jonathan M Davis via Digitalmars-d-learn writes:
On Thursday, August 31, 2017 14:09:55 H. S. Teoh via Digitalmars-d-learn 
wrote:
 On Thu, Aug 31, 2017 at 01:34:39PM -0600, Jonathan M Davis via
 Digitalmars-d-learn wrote: [...]

 In general, byLine does not work with other range-based algorithms
 precisely because it reuses the buffer. I think that it does manage to
 work for some, but IMHO, it should have just supported foreach via
 opApply and not been a range at all. It's great for efficiency in a
 loop but horrible for range chaining.
[...] Transient ranges are tricky to work with, I agree, but I don't agree that they're "horrible for range chaining". I argue that many range algorithms that assume the persistence of .front are inherently wrong, and ought to be implemented in a way that does *not* make this assumption. Many std.algorithm algorithms *can* in fact be written in this way, and those that aren't, are arguably buggy. For example, some std.algorithm functions take a forward range but then tries to save .front to a local variable. Rather, they should use .save to save the previous position of the range so that they can call .front on that to access the previous element, instead of making the unfounded assumption that whatever the local variable refers to will still remain valid after calling .popFront. It's just sloppy coding IMNSHO.
I know. We've had this argument before. Personally, I think that the range spec should require that front _not_ be transient and that any ranges that are be considered non-conformant. Yes, under some set of circumstances, they can be made to work, but IMHO, it's simply not worth it. As it is, simply calling save when it's supposed to be called gets totally botched all the time even if ranges with transient front aren't involved. And adding such ranges just makes it too complicated. What we have currently with the range API allows for a lot of stuff to work correctly as long as certain assumptions are bet, but as soon as folks start doing stuff that doesn't behave the same way that a dynamic array does, things start falling apart. Even copying ranges in generic code does not have defined behavior, because different types have different semantics even if they follow the range API and pass isInputRange, isForwardRange, etc. The status quo works well enough that we get by, but it's a mess when you get into the details - especially if you want code that's actually generic. And IMHO, trying to add ranges with a transient front into the mix is taking it way to far. It's simply not worth it. But I know that you don't agree with that. - Jonathan M Davis
Aug 31
prev sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Thu, Aug 31, 2017 at 05:37:20PM -0600, Jonathan M Davis via
Digitalmars-d-learn wrote:
 On Thursday, August 31, 2017 14:09:55 H. S. Teoh via Digitalmars-d-learn 
[...]
 I know. We've had this argument before.
I know. Let's not rehash that. :-) [...]
 Even copying ranges in generic code does not have defined behavior,
 because different types have different semantics even if they follow
 the range API and pass isInputRange, isForwardRange, etc.
This is actually one of the prime reasons for my stance that we ought to always use .save instead of assigning .front to a local variable. Consider the case where .front returns a subrange. As you state above, copying this subrange does not have defined behaviour. One reason is the difference in semantics between reference types and value types: the assignment operator `=` means different things for each kind of type. `x=y;` for a value type makes a new copy of the value in x, but `x=y;` for a reference type only copies the reference, not the value. So how do you ensure that after assigning .front to a local variable, it will continue to be valid after .popFront is called on the outer range? You can't. If you simply assign it, there's no guarantee it isn't just copying a reference to a buffer reused by .popFront. But if you try to copy it, the result is not defined, as you said. Worse yet, the user can overload opAssign() to do something with side-effects. So this code: auto e = r.front; r.popFront(); userCallback(e); may potentially have a different result from: userCallback(r.front); r.popFront(); The only safe approach is to make as few assumptions as possible, i.e., don't assume that `=` will produce the right result, so avoid saving anything in local variables completely and always use .save instead if you need to refer to a previous value of .front after calling .popFront. Yes this greatly complicates generic code, and I wouldn't impose it on user code. But one would expect that Phobos, at the very least, ought to be of a high enough standard to be able to handle such things correctly. Taking a step back from these nitpicky details, though: this seems to be symptomic of the underlying difficulty of nailing exact range semantics in an imperative language. In a pure functional language without mutation, you wouldn't have such issues, so there you could compose arbitrarily complex ranges of arbitrarily complex behaviours with impunity and not have to worry that something might break if one of the subranges was transient. We can't kill mutation in D, though, so unfortunately we have to live with these complications. T -- Life begins when you can spend your spare time programming instead of watching television. -- Cal Keegan
Aug 31
next sibling parent Sergei Degtiarev <sdegtiarev yahoo.com> writes:
Let me clarify, what I was going to create was a small utility, 
analogous to Perl <> operator, taking a list of file names and 
allowing forward iteration as it would be single stream of text 
lines. It would take no time to write the range from scratch, but 
what are all the phobos primitives for then? So it looks like 
this:
     auto catFiles(string[] args)
     {
	File[] files=args.map!(a => File(a)).array;
	if(files.empty)
		files~=stdin;
	return joiner(files.map!(a => a.byLine));
     }
but unfortunately it does not work.
It seems, everybody agreed the problem is likely in the line 
buffer reused, however, replacing .byLine() with .byLineCopy() 
yields same result. Also, neither .byLine() nor .byLineCopy() 
return ForwardRange, just InputRange, so no .save() is probably 
called on them. And last, the utility is supposed to be fast, so 
using .byLineCopy() is very undesirable. Transient ranges may be 
uneasy to handle, but they allow to retain status of D as "better 
C" suitable for system programming. Using unsolicited internal 
copies is more typical for high-level scripting languages, and is 
one of the reasons of slower execution and high memory 
consumption.
In short, I agree with your arguments, but still believe this is 
a bug, probably in joiner rather than in .byLine(). I'm going to 
take look at the code and will tell if find anything.
Thank you for the discussion.
Sep 02
prev sibling parent Sergei Degtiarev <sdegtiarev yahoo.com> writes:
On Friday, 1 September 2017 at 00:09:16 UTC, H. S. Teoh wrote:
 Consider the case where .front returns a subrange.  As you 
 state above, copying this subrange does not have defined 
 behaviour. One reason is the difference in semantics between 
 reference types and value types: the assignment operator `=` 
 means different things for each kind of type. `x=y;` for a 
 value type makes a new copy of the value in x, but `x=y;` for a 
 reference type only copies the reference, not the value.
Absolutely, in particular, InputRange shouldn't be assumed copyable. joiner saves whatever was passed to it, in this particular case, result of files.map!(a => a.byLine) which is range itself and is evaluated lazily. This makes every call of .front() to re-evaluate (a => a.byLine) and to call the constructor again, skipping first line every time. The partial solution is simple, force joiner to make immediate evaluation instead of lazy one: return joiner(array(files.map!(a => a.byLine))); here, the array is saved in joiner and no lazy constructor call is performed. However, in general I have a feeling that something is wrong here with design. Copying of InputRanges should be disabled possibly, or lazy range evaluation should be under control. Can't say what is wrong exactly, but something definitely is.
Sep 11