digitalmars.D.learn - std.algorithm.joiner unexpected behavior

Sergei Degtiarev (15/15) Aug 31 2017 Hi,

Jesse Phillips (4/20) Aug 31 2017 Doesn't byLine() reuse a buffer, joiner probably caches the first

Jonathan M Davis via Digitalmars-d-learn (11/37) Aug 31 2017 In general, byLine does not work with other range-based algorithms preci...
H. S. Teoh via Digitalmars-d-learn (18/23) Aug 31 2017 [...]

H. S. Teoh via Digitalmars-d-learn (9/13) Aug 31 2017 You probably want to use byLineCopy instead.
Jonathan M Davis via Digitalmars-d-learn (20/40) Aug 31 2017 I know. We've had this argument before. Personally, I think that the ran...
H. S. Teoh via Digitalmars-d-learn (44/49) Aug 31 2017 I know. Let's not rehash that. :-)

Sergei Degtiarev (29/29) Sep 02 2017 Let me clarify, what I was going to create was a small utility,
Sergei Degtiarev (18/25) Sep 11 2017 Absolutely, in particular, InputRange shouldn't be assumed

Sergei Degtiarev <sdegtiarev yahoo.com> writes:

Hi,
I tried to create a simple range concatenating several files, 
something like this:

	File[] files;
	....
	foreach(ln; joiner(files.map!(a => a.byLine)))
		writeln(ln);

and I see every first line of each file is missing.
However, when I do same thing with separator:

	char[][] separator=["**** new file ****".dup];
	foreach(ln; joiner(files.map!(a => a.byLine), separator))
		writeln(ln);

everything works fine, both separator and first lines are printed.
It looks pretty much as a bug for me, but I'm not sure I use it 
correctly.

Aug 31 2017

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Thursday, 31 August 2017 at 18:26:33 UTC, Sergei Degtiarev 
wrote:
 Hi,
 I tried to create a simple range concatenating several files, 
 something like this:

 	File[] files;
 	....
 	foreach(ln; joiner(files.map!(a => a.byLine)))
 		writeln(ln);

 and I see every first line of each file is missing.
 However, when I do same thing with separator:

 	char[][] separator=["**** new file ****".dup];
 	foreach(ln; joiner(files.map!(a => a.byLine), separator))
 		writeln(ln);

 everything works fine, both separator and first lines are 
 printed.
 It looks pretty much as a bug for me, but I'm not sure I use it 
 correctly.

Doesn't byLine() reuse a buffer, joiner probably caches the first 
line and calls .popFront()

Aug 31 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Thursday, August 31, 2017 18:43:40 Jesse Phillips via Digitalmars-d-learn 
wrote:
 On Thursday, 31 August 2017 at 18:26:33 UTC, Sergei Degtiarev

 wrote:
 Hi,
 I tried to create a simple range concatenating several files,

 something like this:
     File[] files;
     ....
     foreach(ln; joiner(files.map!(a => a.byLine)))

         writeln(ln);

 and I see every first line of each file is missing.

 However, when I do same thing with separator:
     char[][] separator=["**** new file ****".dup];
     foreach(ln; joiner(files.map!(a => a.byLine), separator))

         writeln(ln);

 everything works fine, both separator and first lines are
 printed.
 It looks pretty much as a bug for me, but I'm not sure I use it
 correctly.

 Doesn't byLine() reuse a buffer, joiner probably caches the first
 line and calls .popFront()

In general, byLine does not work with other range-based algorithms precisely
because it reuses the buffer. I think that it does manage to work for some,
but IMHO, it should have just supported foreach via opApply and not been a
range at all. It's great for efficiency in a loop but horrible for range
chaining. Folks get bit by this problem all the time. Fortunately, we now
have byLineCopy which actually uses a separate dynamic array for each line,
but even still, often, someone grabs byLine and then gets weird behavior
that they don't understand.

- Jonathan M Davis

Aug 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Thu, Aug 31, 2017 at 01:34:39PM -0600, Jonathan M Davis via
Digitalmars-d-learn wrote:
[...]
 In general, byLine does not work with other range-based algorithms
 precisely because it reuses the buffer. I think that it does manage to
 work for some, but IMHO, it should have just supported foreach via
 opApply and not been a range at all. It's great for efficiency in a
 loop but horrible for range chaining.

[...]

Transient ranges are tricky to work with, I agree, but I don't agree
that they're "horrible for range chaining".  I argue that many range
algorithms that assume the persistence of .front are inherently wrong,
and ought to be implemented in a way that does *not* make this
assumption.  Many std.algorithm algorithms *can* in fact be written in
this way, and those that aren't, are arguably buggy.

For example, some std.algorithm functions take a forward range but then
tries to save .front to a local variable. Rather, they should use .save
to save the previous position of the range so that they can call .front
on that to access the previous element, instead of making the unfounded
assumption that whatever the local variable refers to will still remain
valid after calling .popFront.  It's just sloppy coding IMNSHO.


T

-- 
MS Windows: 64-bit rehash of 32-bit extensions and a graphical shell for a
16-bit patch to an 8-bit operating system originally coded for a 4-bit
microprocessor, written by a 2-bit company that can't stand 1-bit of
competition.

Aug 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Thu, Aug 31, 2017 at 06:26:33PM +0000, Sergei Degtiarev via
Digitalmars-d-learn wrote:
[...]
 	File[] files;
 	....
 	foreach(ln; joiner(files.map!(a => a.byLine)))
 		writeln(ln);

You probably want to use byLineCopy instead.

The problem here is that .byLine returns a transient range (i.e., the
value returned by .front changes after .popFront is called), which can
cause unexpected behaviour when used with certain algorithms.


T

-- 
I don't trust computers, I've spent too long programming to think that they can
get anything right. -- James Miller

Aug 31 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Thursday, August 31, 2017 14:09:55 H. S. Teoh via Digitalmars-d-learn 
wrote:
 On Thu, Aug 31, 2017 at 01:34:39PM -0600, Jonathan M Davis via
 Digitalmars-d-learn wrote: [...]

 In general, byLine does not work with other range-based algorithms
 precisely because it reuses the buffer. I think that it does manage to
 work for some, but IMHO, it should have just supported foreach via
 opApply and not been a range at all. It's great for efficiency in a
 loop but horrible for range chaining.

 [...]

 Transient ranges are tricky to work with, I agree, but I don't agree
 that they're "horrible for range chaining".  I argue that many range
 algorithms that assume the persistence of .front are inherently wrong,
 and ought to be implemented in a way that does *not* make this
 assumption.  Many std.algorithm algorithms *can* in fact be written in
 this way, and those that aren't, are arguably buggy.

 For example, some std.algorithm functions take a forward range but then
 tries to save .front to a local variable. Rather, they should use .save
 to save the previous position of the range so that they can call .front
 on that to access the previous element, instead of making the unfounded
 assumption that whatever the local variable refers to will still remain
 valid after calling .popFront.  It's just sloppy coding IMNSHO.

I know. We've had this argument before. Personally, I think that the range
spec should require that front _not_ be transient and that any ranges that
are be considered non-conformant. Yes, under some set of circumstances, they
can be made to work, but IMHO, it's simply not worth it. As it is, simply
calling save when it's supposed to be called gets totally botched all the
time even if ranges with transient front aren't involved. And adding such
ranges just makes it too complicated. What we have currently with the range
API allows for a lot of stuff to work correctly as long as certain
assumptions are bet, but as soon as folks start doing stuff that doesn't
behave the same way that a dynamic array does, things start falling apart.
Even copying ranges in generic code does not have defined behavior, because
different types have different semantics even if they follow the range API
and pass isInputRange, isForwardRange, etc. The status quo works well enough
that we get by, but it's a mess when you get into the details - especially
if you want code that's actually generic. And IMHO, trying to add ranges
with a transient front into the mix is taking it way to far. It's simply not
worth it. But I know that you don't agree with that.

- Jonathan M Davis

Aug 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Thu, Aug 31, 2017 at 05:37:20PM -0600, Jonathan M Davis via
Digitalmars-d-learn wrote:
 On Thursday, August 31, 2017 14:09:55 H. S. Teoh via Digitalmars-d-learn 

[...]
 I know. We've had this argument before.

I know. Let's not rehash that. :-)


[...]
 Even copying ranges in generic code does not have defined behavior,
 because different types have different semantics even if they follow
 the range API and pass isInputRange, isForwardRange, etc.

This is actually one of the prime reasons for my stance that we ought to
always use .save instead of assigning .front to a local variable.

Consider the case where .front returns a subrange.  As you state above,
copying this subrange does not have defined behaviour. One reason is the
difference in semantics between reference types and value types: the
assignment operator `=` means different things for each kind of type.
`x=y;` for a value type makes a new copy of the value in x, but `x=y;`
for a reference type only copies the reference, not the value.  So how
do you ensure that after assigning .front to a local variable, it will
continue to be valid after .popFront is called on the outer range? You
can't.  If you simply assign it, there's no guarantee it isn't just
copying a reference to a buffer reused by .popFront.  But if you try to
copy it, the result is not defined, as you said.

Worse yet, the user can overload opAssign() to do something with
side-effects. So this code:

	auto e = r.front;
	r.popFront();
	userCallback(e);

may potentially have a different result from:

	userCallback(r.front);
	r.popFront();

The only safe approach is to make as few assumptions as possible, i.e.,
don't assume that `=` will produce the right result, so avoid saving
anything in local variables completely and always use .save instead if
you need to refer to a previous value of .front after calling .popFront.

Yes this greatly complicates generic code, and I wouldn't impose it on
user code.  But one would expect that Phobos, at the very least, ought
to be of a high enough standard to be able to handle such things
correctly.


Taking a step back from these nitpicky details, though: this seems to be
symptomic of the underlying difficulty of nailing exact range semantics
in an imperative language.  In a pure functional language without
mutation, you wouldn't have such issues, so there you could compose
arbitrarily complex ranges of arbitrarily complex behaviours with
impunity and not have to worry that something might break if one of the
subranges was transient.  We can't kill mutation in D, though, so
unfortunately we have to live with these complications.


T

-- 
Life begins when you can spend your spare time programming instead of watching
television. -- Cal Keegan

Aug 31 2017

Sergei Degtiarev <sdegtiarev yahoo.com> writes:

Let me clarify, what I was going to create was a small utility, 
analogous to Perl <> operator, taking a list of file names and 
allowing forward iteration as it would be single stream of text 
lines. It would take no time to write the range from scratch, but 
what are all the phobos primitives for then? So it looks like 
this:
     auto catFiles(string[] args)
     {
	File[] files=args.map!(a => File(a)).array;
	if(files.empty)
		files~=stdin;
	return joiner(files.map!(a => a.byLine));
     }
but unfortunately it does not work.
It seems, everybody agreed the problem is likely in the line 
buffer reused, however, replacing .byLine() with .byLineCopy() 
yields same result. Also, neither .byLine() nor .byLineCopy() 
return ForwardRange, just InputRange, so no .save() is probably 
called on them. And last, the utility is supposed to be fast, so 
using .byLineCopy() is very undesirable. Transient ranges may be 
uneasy to handle, but they allow to retain status of D as "better 
C" suitable for system programming. Using unsolicited internal 
copies is more typical for high-level scripting languages, and is 
one of the reasons of slower execution and high memory 
consumption.
In short, I agree with your arguments, but still believe this is 
a bug, probably in joiner rather than in .byLine(). I'm going to 
take look at the code and will tell if find anything.
Thank you for the discussion.

Sep 02 2017

Sergei Degtiarev <sdegtiarev yahoo.com> writes:

On Friday, 1 September 2017 at 00:09:16 UTC, H. S. Teoh wrote:
 Consider the case where .front returns a subrange.  As you 
 state above, copying this subrange does not have defined 
 behaviour. One reason is the difference in semantics between 
 reference types and value types: the assignment operator `=` 
 means different things for each kind of type. `x=y;` for a 
 value type makes a new copy of the value in x, but `x=y;` for a 
 reference type only copies the reference, not the value.

Absolutely, in particular, InputRange shouldn't be assumed 
copyable.
joiner saves whatever was passed to it, in this particular case, 
result of
files.map!(a => a.byLine) which is range itself and is evaluated 
lazily. This makes every call of .front() to re-evaluate (a => 
a.byLine) and to call the constructor again, skipping first line 
every time.
The partial solution is simple, force joiner to make immediate 
evaluation instead of lazy one:
     return joiner(array(files.map!(a => a.byLine)));
here, the array is saved in joiner and no lazy constructor call 
is performed. However, in general I have a feeling that something 
is wrong here with design. Copying of InputRanges should be 
disabled possibly, or lazy range evaluation should be under 
control. Can't say what is wrong exactly, but something 
definitely is.

Sep 11 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - std.algorithm.joiner unexpected behavior