digitalmars.D - Tricky semantics of ranges & potentially numerous Phobos bugs

H. S. Teoh (50/50) Oct 15 2012 In an effort to locate the suspected Phobos bug that I ran into in my

H. S. Teoh (55/98) Oct 16 2012 I think a better solution is to somehow differentiate between ranges

Andrei Alexandrescu (4/5) Oct 17 2012 isTransient!R is exactly the same thing as isInputRange!R &&

monarch_dodra (6/11) Oct 17 2012 Technically, (isTransient!R) is just a subset of (isInputRange!R

Andrei Alexandrescu (3/16) Oct 17 2012 Depends on implementation - popFront may actually wipe the character.

monarch_dodra (5/25) Oct 17 2012 "byDChar" (if it existed), *could* be a perfectly valid

Jonathan M Davis (37/42) Oct 17 2012 In what way? There is nothing anywhere that states that an input range's...

monarch_dodra (9/15) Oct 18 2012 That's already a pretty big usage :) The thing is you just can't

H. S. Teoh (30/48) Oct 17 2012 It's perfectly possible to implement joiner, chain, find, count, cmp,

Don Clugston (5/49) Oct 18 2012 Is it actually orthogonal? Is it possible for a forward range to be

monarch_dodra (63/68) Oct 18 2012 Save just means the range can save its position. If it is
H. S. Teoh (9/23) Oct 18 2012 [...]

Andrei Alexandrescu (6/10) Oct 17 2012 Such issues do happen, are relatively rare, and are virtually untested

David Nadlinger (7/12) Oct 17 2012 If an abstraction encourages use in a way which leads to

Andrei Alexandrescu (7/15) Oct 17 2012 When I designed input ranges vs. forward ranges there were long

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

In an effort to locate the suspected Phobos bug that I ran into in my
new Cartesian product implementation, I looked over the code for
std.algorithm joiner more carefully, and noticed the following:

Given a range of ranges ror, it assigns ror.front to a struct member and
then calls ror.popFront() immediately. Then as the user calls the joined
range's popFront, it successively pops its local copy of ror.front until
it's empty, whereupon it assigns the new ror.front to the local copy
again, and so on.

While this works for array of arrays, it *doesn't* work for ranges that
overwrite ror.front when ror.popFront() is called.  For example, it
wouldn't work for stdin.byLine(), because the underlying buffer it
reused. Here's the proof:

Code:
	// Filename: test2.d
	import std.algorithm;
	import std.stdio;
	void main() {
		auto lines = stdin.byLine();
		writeln(lines.joiner);
	}

Compile & test:


	dmd test2.d



	(echo abc; echo def; echo ghi) | ./test2


	defghighi

As you can see, the output is mangled, because byLine() has reused the
line buffer before joiner.Result got to it.

Long story short: saving a local copy of range.front and then calling
range.popFront() may _invalidate_ the copy. So basically, either you
need a forward range and use .save to keep track of old values of
range.front, or you have to refrain from calling popFront() until you're
well and truly done with the current value of .front. While not
respecting this will work with many common ranges, it will also fail in
subtle ways when given other ranges.

The scary thing is, I see similar code like this all over Phobos. Does
this mean that most of std.algorithm may need to be revised to address
this issue? At the very least, it would seem that a code audit is in
order to weed out this particular issue.

:-/


T

-- 
A linguistics professor was lecturing to his class one day. "In
English," he said, "A double negative forms a positive. In some
languages, though, such as Russian, a double negative is still a
negative. However, there is no language wherein a double positive can
form a negative."
A voice from the back of the room piped up, "Yeah, yeah."

Oct 15 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Oct 16, 2012 at 07:28:38PM +0200, Jonathan M Davis wrote:
 On Tuesday, October 16, 2012 10:13:57 H. S. Teoh wrote:

[...]
 ByLine is perfectly valid range insofar as you realize that it's
 likely to go completely south if you use it in any way that could
 involve keeping front around after popFront has been called, which
 means that anything which relies on keeping front around isn't going
 to work. So, it's a range, but it's essentially an unsafe one (though
 I'm not sure that it's an un- safe one).
 
 So, it's fine that ByLine is a range as long as we're willing to put
 up with it not working with a lot of range-based functions because of
 its abnormal behavior. But I don't think that it's at all reasonable
 for range-based functions in general to not be able to rely on front
 returning the same type every time or on its value disappearing or
 becoming invalid in some way after a call to popFront. That's
 completely untenable IMHO.

I think a better solution is to somehow differentiate between ranges
whose .front value is permanent (arrays), and ranges whose .front values
are transient (e.g. byLine, but we shouldn't single it out because there
are other occasions where you may *want* to have a range that doesn't
call .dup every time -- say a range over all permutations of an array
that modifies the array in-place).

Perhaps mark ranges with an .isTransient property (which can be an enum
so it doesn't waste any space), and range-based functions that require
non-transient ranges will refuse to work with a transient range. Or
alternatively, switch to a different implementation that takes
transience into account (which may be slower, etc., but the important
thing is to do it right -- after all, D's motto is safe by default,
unsafe if you ask for it).


 Ranges _can_ define semantics which violate that, but they have to
 make it clear that they do so that programmers using them realize that
 they may not work right with a lot of range-based functions (which
 potentially makes it so that it they really shouldn't have been ranges
 in the first place).

I think this approach of what amounts to guessing which ranges are
safe/unsafe with which functions is what's untenable. We all know that
people don't read documentation, or at least, not thoroughly. Something
like this is too easy to miss, and bugs will slip into code unnoticed
for a long time, and then explode in your face. It's unsafe by default,
which goes against D's philosophy.

I think the problem is better addressed by an .isTransient property that
can be selected by templates at compile-time.  I agree that requiring
*every* range function to expect .front to mutate upon calling
.popFront() is unreasonable, but *some* functions *can* be written in a
way that they will still work. Marking ranges with transient .front
values will allow functions that *can* take transient ranges do so,
while leaving other functions accepting only non-transient ranges. That
way, if somebody tries to pass a transient range to a template that only
works with a non-transient range, they will know at compile-time instead
of getting a nasty surprise later on.


 So what is (or should be) the semantic design of ranges? Let's work
 out a precise definition so that we have something to build on.

 
 As far as front (or back) goes, range-based functions should be able
 to rely on
 
 1. front returning the exact same value on every call until popFront
 has been called (though there's no guarantee that front won't have to
 be recalculated on each call, so assigning the result of front to a
 local variable is advisable for efficiency if front must be used
 multiple times before a call to popFront).
 
 2. the result of front continuing to be valid and unchanged after
 popFront has been called if it was assigned to a variable.
 
 Any range is free to violate this, but because range-based functions
 are free to rely on it, such ranges risk not working correctly with
 many range-based functions and must be used with caution.

[...]

I don't like the idea that some ranges _may_ or _may not_ work with some
functions. That's just too unreliable, and it's too easy to make
mistakes. Let's put in an isTransient property on the unsafe ranges and
let the templates' signature constraints enforce passing the right kind
of range.


On Tue, Oct 16, 2012 at 10:58:23AM -0700, David Gileadi wrote:
[...]
 As a D dabbler, it seems wrong to me that a built-in range like
 byLine, which is often used in example code, should have weird side
 effects with the built-in methods in std.algorithm.  IMO this needs to
 be fixed one way or another, and a mere documentation change is likely
 not enough for casual D users.

I agree. We should add an isTransient property to at least the built-in
ranges in Phobos, so that at least those ranges will always work
properly (or complain loudly at compile-time if passed to the wrong
function).

User-defined ranges need to be marked too, but that is up to the user. I
don't think this part can be statically checked (unless you want to
enforce it using the type system by making non-transient ranges return
immutable from .front -- but that may be impractical in some cases). So
this has to be documented, and users will have the option of marking
their own ranges and get the benefit of compile-time compatibility
checks. But at the very least, the Phobos ranges must always work
correctly, or refuse to work at compile-time. Leaving things up to
convention (via documentation) is not a good idea.


T

-- 
ASCII stupid question, getty stupid ANSI.

Oct 16 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property

isTransient!R is exactly the same thing as isInputRange!R && 
!isForwardRange!R.

Andrei

Oct 17 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property

 isTransient!R is exactly the same thing as isInputRange!R && 
 !isForwardRange!R.

 Andrei

Technically, (isTransient!R) is just a subset of (isInputRange!R
&& !isForwardRange!R)

"byDChar" (if it existed), would be a perfectly valid
non-transient input only range.

Oct 17 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/17/12 1:29 PM, monarch_dodra wrote:
 On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
 Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property

 isTransient!R is exactly the same thing as isInputRange!R &&
 !isForwardRange!R.

 Andrei

 Technically, (isTransient!R) is just a subset of (isInputRange!R
 && !isForwardRange!R)

 "byDChar" (if it existed), would be a perfectly valid
 non-transient input only range.

Depends on implementation - popFront may actually wipe the character.

Andrei

Oct 17 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 17 October 2012 at 18:13:42 UTC, Andrei 
Alexandrescu wrote:
 On 10/17/12 1:29 PM, monarch_dodra wrote:
 On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
 Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property

 isTransient!R is exactly the same thing as isInputRange!R &&
 !isForwardRange!R.

 Andrei

 Technically, (isTransient!R) is just a subset of 
 (isInputRange!R
 && !isForwardRange!R)

 "byDChar" (if it existed), would be a perfectly valid
 non-transient input only range.

 Depends on implementation - popFront may actually wipe the 
 character.

 Andrei

"byDChar" (if it existed), *could* be a perfectly valid 
non-transient input only range, if the implementation returns the 
character by value. :)

Oct 17 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 17, 2012 13:15:12 Andrei Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property

 
 isTransient!R is exactly the same thing as isInputRange!R &&
 !isForwardRange!R.

In what way? There is nothing anywhere that states that an input range's front 
could be invalidated upon a call to popFront. There's nothing inherent in the 
fact that range can't be saved that makes it so that front must be invalidated 
upon a call to popFront. You could easily define an input range of bytes over a 
file where front is perfectly valid after a call to popFront, because it's a 
value type. You could also define ByLine so that its front is perfectly valid 
after a call to popFront. It's just that it requires that a new buffer be 
allocated instead of reusing the buffer. That has nothing to do with the fact 
that its incapable of copying its state such that you could have multiple 
copies of it being iterated over separately. I don't see anything in the 
concept or definition of input ranges which implies that front would be 
invalidated by a call to popFront.

I'm increasingly convinced that input ranges which are not forward ranges are 
useless for pretty much anything other than foreach. Far too much requires 
that you be able to save the current state - and most stuff _inherently_ 
requires it such that it's not simply a question of implementing the function 
differently. And adding even further restrictions on input ranges just makes it 
worse. It actually wouldn't hurt my feelings one whit if we got rid of the 
idea of input ranges entirely. It's perfectly possible to implement ranges 
like ByLine as forward ranges. It just requires a bit more work. But they'd be 
_way_ more useful if they were. In my experience the only time that you don't 
need to dup what ByLine or ByChunk gives you is when all you need is foreach, 
and if that's really the case, then opApply can be used for foreach, and 
ByLine and ByChunk can be defined in ways that actually allow them to not only 
not invalidate the result of previous front calls but to actually be full-on 
forward ranges, making them actually useful for things beyond foreach.

Regardless, there's nothing in how input ranges are currently defined which 
indicates that front would ever be invalidated for _any_ type of range, and 
ByLine and ByChunk are pretty much the only ranges I've ever seen which 
invalidate previous calls to front. So, I don't see how you could think that 
they're anything but abnormal. And if you really want to argue that whether 
front can be invalidated or not is somehow part of the difference between an 
input range and a forward range, then the documentation on that needs to make 
that _very_ clear, and it's going to be that much worse to deal with input 
ranges which aren't forward ranges.

- Jonathan M Davis

Oct 17 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 17 October 2012 at 19:56:08 UTC, Jonathan M Davis 
wrote:
 [SNIP]
 I'm increasingly convinced that input ranges which are not 
 forward ranges are
 useless for pretty much anything other than foreach.
 [SNIP]
 - Jonathan M Davis

That's already a pretty big usage :) The thing is you just can't 
"do" anything with them, but that *is* the design.

Just read them once to place them into another container. The 
fact they interface with, say "array", or "appender", or copy, 
makes the interface convenient.

The fact that byLine will choke on a call to "array", IMO, has 
nothing to do with it being a forward range.

Oct 18 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote:
[...]
 I'm increasingly convinced that input ranges which are not forward
 ranges are useless for pretty much anything other than foreach. Far
 too much requires that you be able to save the current state - and
 most stuff _inherently_ requires it such that it's not simply a
 question of implementing the function differently.

It's perfectly possible to implement joiner, chain, find, count, cmp,
equal, until, filter, map, reduce, without assuming that the value
returned by .front is persistent. Just to name a few. In fact, it's even
possible to implement cartesianProduct in which one of the ranges is an
input range.  I'd hardly call that useless.


 And adding even further restrictions on input ranges just makes it
 worse. It actually wouldn't hurt my feelings one whit if we got rid of
 the idea of input ranges entirely.

The motivating example for input ranges, at least according to TDPL, is
find(). There's nothing about find() that precludes non-forward input
ranges. A lot would be missing from the usefulness of ranges if we were
forced to only use forward ranges.


[...]
 Regardless, there's nothing in how input ranges are currently defined
 which indicates that front would ever be invalidated for _any_ type of
 range, and ByLine and ByChunk are pretty much the only ranges I've
 ever seen which invalidate previous calls to front. So, I don't see
 how you could think that they're anything but abnormal.

I can think of quite a few situations in which it's useful to not assume
that the return value of .front is persistent, which I've already
mentioned before: in-place array permutation, reused buffers for complex
computations, etc..


 And if you really want to argue that whether front can be invalidated
 or not is somehow part of the difference between an input range and a
 forward range, then the documentation on that needs to make that
 _very_ clear, and it's going to be that much worse to deal with input
 ranges which aren't forward ranges.

[...]

I think I'm not so sure about Andrei's lumping input ranges with
persistent return values from .front together with forward ranges. Some
algorithms, like findAdjacent, do not need a forward range, but they do
need a persistent .front. I do not like the idea of artificially
limiting the scope of findAdjacent just because you can't assume input
ranges' .front returns a persistent value. Like somebody else mentioned,
whether .front is transient or not is orthogonal to whether the range is
an input range or a forward range. There can be ranges whose .front is
persistent, but they can't be forward ranges for practical reasons.


T

-- 
Having a smoking section in a restaurant is like having a peeing section
in a swimming pool. -- Edward Burr

Oct 17 2012

Don Clugston <dac nospam.com> writes:

On 17/10/12 23:41, H. S. Teoh wrote:
 On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote:
 [...]
 I'm increasingly convinced that input ranges which are not forward
 ranges are useless for pretty much anything other than foreach. Far
 too much requires that you be able to save the current state - and
 most stuff _inherently_ requires it such that it's not simply a
 question of implementing the function differently.

 It's perfectly possible to implement joiner, chain, find, count, cmp,
 equal, until, filter, map, reduce, without assuming that the value
 returned by .front is persistent. Just to name a few. In fact, it's even
 possible to implement cartesianProduct in which one of the ranges is an
 input range.  I'd hardly call that useless.


 And adding even further restrictions on input ranges just makes it
 worse. It actually wouldn't hurt my feelings one whit if we got rid of
 the idea of input ranges entirely.

 The motivating example for input ranges, at least according to TDPL, is
 find(). There's nothing about find() that precludes non-forward input
 ranges. A lot would be missing from the usefulness of ranges if we were
 forced to only use forward ranges.


 [...]
 Regardless, there's nothing in how input ranges are currently defined
 which indicates that front would ever be invalidated for _any_ type of
 range, and ByLine and ByChunk are pretty much the only ranges I've
 ever seen which invalidate previous calls to front. So, I don't see
 how you could think that they're anything but abnormal.

 I can think of quite a few situations in which it's useful to not assume
 that the return value of .front is persistent, which I've already
 mentioned before: in-place array permutation, reused buffers for complex
 computations, etc..


 And if you really want to argue that whether front can be invalidated
 or not is somehow part of the difference between an input range and a
 forward range, then the documentation on that needs to make that
 _very_ clear, and it's going to be that much worse to deal with input
 ranges which aren't forward ranges.

 [...]

 I think I'm not so sure about Andrei's lumping input ranges with
 persistent return values from .front together with forward ranges. Some
 algorithms, like findAdjacent, do not need a forward range, but they do
 need a persistent .front. I do not like the idea of artificially
 limiting the scope of findAdjacent just because you can't assume input
 ranges' .front returns a persistent value. Like somebody else mentioned,
 whether .front is transient or not is orthogonal to whether the range is
 an input range or a forward range. There can be ranges whose .front is
 persistent, but they can't be forward ranges for practical reasons.

Is it actually orthogonal? Is it possible for a forward range to be 
transient?

Or is it an intermediate concept?
TransientInputRange -> NonTransientInputRange -> ForwardRange

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 07:09:04 UTC, Don Clugston wrote:
 On 17/10/12 23:41, H. S. Teoh wrote:

 Is it actually orthogonal? Is it possible for a forward range 
 to be transient?

 Or is it an intermediate concept?
 TransientInputRange -> NonTransientInputRange -> ForwardRange

Save just means the range can save its position. If it is 
returning via a buffer, Forward of not, it is going to be 
transient.

Take this forward range, that returns the strings "A", "B" and 
"C" ad infinitum:

//----
enum _ABC = "ABC";

struct ABC
{
     char[1] buf = _ABC[0];
     size_t i;

     enum empty = false;
      property char[] front(){return buf;}
     void popFront()
     {
         ++i;
         buf[0] = _ABC[i%3];
     }
      property ABC save()
     {
         return this;
     }
}
//----

This is a perfectly valid range, which you can save, but the 
returned string is transient:

//----
void main()
{
   ABC abc;

   writeln("Printing 10 elements: ");
   abc.take(10).writeln('\n');

   writeln("Duplicating range");
   auto abc2 = abc.save;
   abc.popFront;
   foreach(v; zip(abc, abc2).take(5))
     write("[", v[0], ", ", v[1], "]");
   writeln('\n');

   writeln("Prnting two consecutive elements:");
   auto first = abc.front;
   abc.popFront();
   auto second = abc.front;
   writeln("[", first, ", ", second, "]");
}
//----

Produces:

//----
Printing 10 elements:
["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"]

Duplicating range
[B, A][C, B][A, C][B, A][C, B]

Prnting two consecutive elements:
[C, C]
//----

As you can see, you can perfectly iterate.
You can perfectly save the range. The saved range can be used to 
backtrack.
But if you attempt to store two consecutive fronts, things don't 
go well.

The same holds true for a Random Access range BTW.

Iteration and transient-ness of returned value are orthogonal 
concepts

Oct 18 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Oct 18, 2012 at 09:09:03AM +0200, Don Clugston wrote:
 On 17/10/12 23:41, H. S. Teoh wrote:

[...]
I think I'm not so sure about Andrei's lumping input ranges with
persistent return values from .front together with forward ranges.
Some algorithms, like findAdjacent, do not need a forward range, but
they do need a persistent .front. I do not like the idea of
artificially limiting the scope of findAdjacent just because you
can't assume input ranges' .front returns a persistent value. Like
somebody else mentioned, whether .front is transient or not is
orthogonal to whether the range is an input range or a forward range.
There can be ranges whose .front is persistent, but they can't be
forward ranges for practical reasons.

 
 Is it actually orthogonal? Is it possible for a forward range to be
 transient?

[...]

What about a range over all permutations of an array, that modifies the
array in-place? It can be a forward range by having .save copy the
current state of the array, but .front is transient nonetheless.


T

-- 
Life is too short to run proprietary software. -- Bdale Garbee

Oct 18 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/16/12 1:09 AM, H. S. Teoh wrote:
 The scary thing is, I see similar code like this all over Phobos. Does
 this mean that most of std.algorithm may need to be revised to address
 this issue? At the very least, it would seem that a code audit is in
 order to weed out this particular issue.

Such issues do happen, are relatively rare, and are virtually untested 
because there's been no unittests with a deliberately "bad" input range. 
Although of course we do need to add the appropriate fixes and 
unittests, I'm not worried about it at all systemically.

Andrei

Oct 17 2012

"David Nadlinger" <see klickverbot.at> writes:

On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei 
Alexandrescu wrote:
 Such issues do happen, are relatively rare, and are virtually 
 untested because there's been no unittests with a deliberately 
 "bad" input range. Although of course we do need to add the 
 appropriate fixes and unittests, I'm not worried about it at 
 all systemically.

If an abstraction encourages use in a way which leads to 
hard-to-detect logic bugs that do not occur with the most common 
test cases, this is _exactly_ the thing we should be worried 
about!

David

Oct 17 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/17/12 12:57 PM, David Nadlinger wrote:
 On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei Alexandrescu wrote:
 Such issues do happen, are relatively rare, and are virtually untested
 because there's been no unittests with a deliberately "bad" input
 range. Although of course we do need to add the appropriate fixes and
 unittests, I'm not worried about it at all systemically.

 If an abstraction encourages use in a way which leads to hard-to-detect
 logic bugs that do not occur with the most common test cases, this is
 _exactly_ the thing we should be worried about!

When I designed input ranges vs. forward ranges there were long 
discussion on how to distinguish them. If you have a better design it 
would probably not influence the state of the affair, but it would be a 
good discussion to have. FWIW I can already think of a couple but all 
rely on UFCS.

Andrei

Oct 17 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Tricky semantics of ranges & potentially numerous Phobos bugs