www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Tricky semantics of ranges & potentially numerous Phobos bugs

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
In an effort to locate the suspected Phobos bug that I ran into in my
new Cartesian product implementation, I looked over the code for
std.algorithm joiner more carefully, and noticed the following:

Given a range of ranges ror, it assigns ror.front to a struct member and
then calls ror.popFront() immediately. Then as the user calls the joined
range's popFront, it successively pops its local copy of ror.front until
it's empty, whereupon it assigns the new ror.front to the local copy
again, and so on.

While this works for array of arrays, it *doesn't* work for ranges that
overwrite ror.front when ror.popFront() is called.  For example, it
wouldn't work for stdin.byLine(), because the underlying buffer it
reused. Here's the proof:

Code:
	// Filename: test2.d
	import std.algorithm;
	import std.stdio;
	void main() {
		auto lines = stdin.byLine();
		writeln(lines.joiner);
	}

Compile & test:


	dmd test2.d



	(echo abc; echo def; echo ghi) | ./test2


	defghighi

As you can see, the output is mangled, because byLine() has reused the
line buffer before joiner.Result got to it.

Long story short: saving a local copy of range.front and then calling
range.popFront() may _invalidate_ the copy. So basically, either you
need a forward range and use .save to keep track of old values of
range.front, or you have to refrain from calling popFront() until you're
well and truly done with the current value of .front. While not
respecting this will work with many common ranges, it will also fail in
subtle ways when given other ranges.

The scary thing is, I see similar code like this all over Phobos. Does
this mean that most of std.algorithm may need to be revised to address
this issue? At the very least, it would seem that a code audit is in
order to weed out this particular issue.

:-/


T

-- 
A linguistics professor was lecturing to his class one day. "In
English," he said, "A double negative forms a positive. In some
languages, though, such as Russian, a double negative is still a
negative. However, there is no language wherein a double positive can
form a negative."
A voice from the back of the room piped up, "Yeah, yeah."
Oct 15 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Oct 16, 2012 at 07:28:38PM +0200, Jonathan M Davis wrote:
 On Tuesday, October 16, 2012 10:13:57 H. S. Teoh wrote:
[...]
 ByLine is perfectly valid range insofar as you realize that it's
 likely to go completely south if you use it in any way that could
 involve keeping front around after popFront has been called, which
 means that anything which relies on keeping front around isn't going
 to work. So, it's a range, but it's essentially an unsafe one (though
 I'm not sure that it's an un- safe one).
 
 So, it's fine that ByLine is a range as long as we're willing to put
 up with it not working with a lot of range-based functions because of
 its abnormal behavior. But I don't think that it's at all reasonable
 for range-based functions in general to not be able to rely on front
 returning the same type every time or on its value disappearing or
 becoming invalid in some way after a call to popFront. That's
 completely untenable IMHO.
I think a better solution is to somehow differentiate between ranges whose .front value is permanent (arrays), and ranges whose .front values are transient (e.g. byLine, but we shouldn't single it out because there are other occasions where you may *want* to have a range that doesn't call .dup every time -- say a range over all permutations of an array that modifies the array in-place). Perhaps mark ranges with an .isTransient property (which can be an enum so it doesn't waste any space), and range-based functions that require non-transient ranges will refuse to work with a transient range. Or alternatively, switch to a different implementation that takes transience into account (which may be slower, etc., but the important thing is to do it right -- after all, D's motto is safe by default, unsafe if you ask for it).
 Ranges _can_ define semantics which violate that, but they have to
 make it clear that they do so that programmers using them realize that
 they may not work right with a lot of range-based functions (which
 potentially makes it so that it they really shouldn't have been ranges
 in the first place).
I think this approach of what amounts to guessing which ranges are safe/unsafe with which functions is what's untenable. We all know that people don't read documentation, or at least, not thoroughly. Something like this is too easy to miss, and bugs will slip into code unnoticed for a long time, and then explode in your face. It's unsafe by default, which goes against D's philosophy. I think the problem is better addressed by an .isTransient property that can be selected by templates at compile-time. I agree that requiring *every* range function to expect .front to mutate upon calling .popFront() is unreasonable, but *some* functions *can* be written in a way that they will still work. Marking ranges with transient .front values will allow functions that *can* take transient ranges do so, while leaving other functions accepting only non-transient ranges. That way, if somebody tries to pass a transient range to a template that only works with a non-transient range, they will know at compile-time instead of getting a nasty surprise later on.
 So what is (or should be) the semantic design of ranges? Let's work
 out a precise definition so that we have something to build on.
As far as front (or back) goes, range-based functions should be able to rely on 1. front returning the exact same value on every call until popFront has been called (though there's no guarantee that front won't have to be recalculated on each call, so assigning the result of front to a local variable is advisable for efficiency if front must be used multiple times before a call to popFront). 2. the result of front continuing to be valid and unchanged after popFront has been called if it was assigned to a variable. Any range is free to violate this, but because range-based functions are free to rely on it, such ranges risk not working correctly with many range-based functions and must be used with caution.
[...] I don't like the idea that some ranges _may_ or _may not_ work with some functions. That's just too unreliable, and it's too easy to make mistakes. Let's put in an isTransient property on the unsafe ranges and let the templates' signature constraints enforce passing the right kind of range. On Tue, Oct 16, 2012 at 10:58:23AM -0700, David Gileadi wrote: [...]
 As a D dabbler, it seems wrong to me that a built-in range like
 byLine, which is often used in example code, should have weird side
 effects with the built-in methods in std.algorithm.  IMO this needs to
 be fixed one way or another, and a mere documentation change is likely
 not enough for casual D users.
I agree. We should add an isTransient property to at least the built-in ranges in Phobos, so that at least those ranges will always work properly (or complain loudly at compile-time if passed to the wrong function). User-defined ranges need to be marked too, but that is up to the user. I don't think this part can be statically checked (unless you want to enforce it using the type system by making non-transient ranges return immutable from .front -- but that may be impractical in some cases). So this has to be documented, and users will have the option of marking their own ranges and get the benefit of compile-time compatibility checks. But at the very least, the Phobos ranges must always work correctly, or refuse to work at compile-time. Leaving things up to convention (via documentation) is not a good idea. T -- ASCII stupid question, getty stupid ANSI.
Oct 16 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property
isTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Oct 17 2012
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property
isTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.
Oct 17 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/17/12 1:29 PM, monarch_dodra wrote:
 On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
 Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property
isTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.
Depends on implementation - popFront may actually wipe the character. Andrei
Oct 17 2012
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 17 October 2012 at 18:13:42 UTC, Andrei 
Alexandrescu wrote:
 On 10/17/12 1:29 PM, monarch_dodra wrote:
 On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei
 Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property
isTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.
Depends on implementation - popFront may actually wipe the character. Andrei
"byDChar" (if it existed), *could* be a perfectly valid non-transient input only range, if the implementation returns the character by value. :)
Oct 17 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, October 17, 2012 13:15:12 Andrei Alexandrescu wrote:
 On 10/16/12 3:07 PM, H. S. Teoh wrote:
 Perhaps mark ranges with an .isTransient property
isTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R.
In what way? There is nothing anywhere that states that an input range's front could be invalidated upon a call to popFront. There's nothing inherent in the fact that range can't be saved that makes it so that front must be invalidated upon a call to popFront. You could easily define an input range of bytes over a file where front is perfectly valid after a call to popFront, because it's a value type. You could also define ByLine so that its front is perfectly valid after a call to popFront. It's just that it requires that a new buffer be allocated instead of reusing the buffer. That has nothing to do with the fact that its incapable of copying its state such that you could have multiple copies of it being iterated over separately. I don't see anything in the concept or definition of input ranges which implies that front would be invalidated by a call to popFront. I'm increasingly convinced that input ranges which are not forward ranges are useless for pretty much anything other than foreach. Far too much requires that you be able to save the current state - and most stuff _inherently_ requires it such that it's not simply a question of implementing the function differently. And adding even further restrictions on input ranges just makes it worse. It actually wouldn't hurt my feelings one whit if we got rid of the idea of input ranges entirely. It's perfectly possible to implement ranges like ByLine as forward ranges. It just requires a bit more work. But they'd be _way_ more useful if they were. In my experience the only time that you don't need to dup what ByLine or ByChunk gives you is when all you need is foreach, and if that's really the case, then opApply can be used for foreach, and ByLine and ByChunk can be defined in ways that actually allow them to not only not invalidate the result of previous front calls but to actually be full-on forward ranges, making them actually useful for things beyond foreach. Regardless, there's nothing in how input ranges are currently defined which indicates that front would ever be invalidated for _any_ type of range, and ByLine and ByChunk are pretty much the only ranges I've ever seen which invalidate previous calls to front. So, I don't see how you could think that they're anything but abnormal. And if you really want to argue that whether front can be invalidated or not is somehow part of the difference between an input range and a forward range, then the documentation on that needs to make that _very_ clear, and it's going to be that much worse to deal with input ranges which aren't forward ranges. - Jonathan M Davis
Oct 17 2012
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 17 October 2012 at 19:56:08 UTC, Jonathan M Davis 
wrote:
 [SNIP]
 I'm increasingly convinced that input ranges which are not 
 forward ranges are
 useless for pretty much anything other than foreach.
 [SNIP]
 - Jonathan M Davis
That's already a pretty big usage :) The thing is you just can't "do" anything with them, but that *is* the design. Just read them once to place them into another container. The fact they interface with, say "array", or "appender", or copy, makes the interface convenient. The fact that byLine will choke on a call to "array", IMO, has nothing to do with it being a forward range.
Oct 18 2012
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote:
[...]
 I'm increasingly convinced that input ranges which are not forward
 ranges are useless for pretty much anything other than foreach. Far
 too much requires that you be able to save the current state - and
 most stuff _inherently_ requires it such that it's not simply a
 question of implementing the function differently.
It's perfectly possible to implement joiner, chain, find, count, cmp, equal, until, filter, map, reduce, without assuming that the value returned by .front is persistent. Just to name a few. In fact, it's even possible to implement cartesianProduct in which one of the ranges is an input range. I'd hardly call that useless.
 And adding even further restrictions on input ranges just makes it
 worse. It actually wouldn't hurt my feelings one whit if we got rid of
 the idea of input ranges entirely.
The motivating example for input ranges, at least according to TDPL, is find(). There's nothing about find() that precludes non-forward input ranges. A lot would be missing from the usefulness of ranges if we were forced to only use forward ranges. [...]
 Regardless, there's nothing in how input ranges are currently defined
 which indicates that front would ever be invalidated for _any_ type of
 range, and ByLine and ByChunk are pretty much the only ranges I've
 ever seen which invalidate previous calls to front. So, I don't see
 how you could think that they're anything but abnormal.
I can think of quite a few situations in which it's useful to not assume that the return value of .front is persistent, which I've already mentioned before: in-place array permutation, reused buffers for complex computations, etc..
 And if you really want to argue that whether front can be invalidated
 or not is somehow part of the difference between an input range and a
 forward range, then the documentation on that needs to make that
 _very_ clear, and it's going to be that much worse to deal with input
 ranges which aren't forward ranges.
[...] I think I'm not so sure about Andrei's lumping input ranges with persistent return values from .front together with forward ranges. Some algorithms, like findAdjacent, do not need a forward range, but they do need a persistent .front. I do not like the idea of artificially limiting the scope of findAdjacent just because you can't assume input ranges' .front returns a persistent value. Like somebody else mentioned, whether .front is transient or not is orthogonal to whether the range is an input range or a forward range. There can be ranges whose .front is persistent, but they can't be forward ranges for practical reasons. T -- Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
Oct 17 2012
parent reply Don Clugston <dac nospam.com> writes:
On 17/10/12 23:41, H. S. Teoh wrote:
 On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote:
 [...]
 I'm increasingly convinced that input ranges which are not forward
 ranges are useless for pretty much anything other than foreach. Far
 too much requires that you be able to save the current state - and
 most stuff _inherently_ requires it such that it's not simply a
 question of implementing the function differently.
It's perfectly possible to implement joiner, chain, find, count, cmp, equal, until, filter, map, reduce, without assuming that the value returned by .front is persistent. Just to name a few. In fact, it's even possible to implement cartesianProduct in which one of the ranges is an input range. I'd hardly call that useless.
 And adding even further restrictions on input ranges just makes it
 worse. It actually wouldn't hurt my feelings one whit if we got rid of
 the idea of input ranges entirely.
The motivating example for input ranges, at least according to TDPL, is find(). There's nothing about find() that precludes non-forward input ranges. A lot would be missing from the usefulness of ranges if we were forced to only use forward ranges. [...]
 Regardless, there's nothing in how input ranges are currently defined
 which indicates that front would ever be invalidated for _any_ type of
 range, and ByLine and ByChunk are pretty much the only ranges I've
 ever seen which invalidate previous calls to front. So, I don't see
 how you could think that they're anything but abnormal.
I can think of quite a few situations in which it's useful to not assume that the return value of .front is persistent, which I've already mentioned before: in-place array permutation, reused buffers for complex computations, etc..
 And if you really want to argue that whether front can be invalidated
 or not is somehow part of the difference between an input range and a
 forward range, then the documentation on that needs to make that
 _very_ clear, and it's going to be that much worse to deal with input
 ranges which aren't forward ranges.
[...] I think I'm not so sure about Andrei's lumping input ranges with persistent return values from .front together with forward ranges. Some algorithms, like findAdjacent, do not need a forward range, but they do need a persistent .front. I do not like the idea of artificially limiting the scope of findAdjacent just because you can't assume input ranges' .front returns a persistent value. Like somebody else mentioned, whether .front is transient or not is orthogonal to whether the range is an input range or a forward range. There can be ranges whose .front is persistent, but they can't be forward ranges for practical reasons.
Is it actually orthogonal? Is it possible for a forward range to be transient? Or is it an intermediate concept? TransientInputRange -> NonTransientInputRange -> ForwardRange
Oct 18 2012
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 07:09:04 UTC, Don Clugston wrote:
 On 17/10/12 23:41, H. S. Teoh wrote:

 Is it actually orthogonal? Is it possible for a forward range 
 to be transient?

 Or is it an intermediate concept?
 TransientInputRange -> NonTransientInputRange -> ForwardRange
Save just means the range can save its position. If it is returning via a buffer, Forward of not, it is going to be transient. Take this forward range, that returns the strings "A", "B" and "C" ad infinitum: //---- enum _ABC = "ABC"; struct ABC { char[1] buf = _ABC[0]; size_t i; enum empty = false; property char[] front(){return buf;} void popFront() { ++i; buf[0] = _ABC[i%3]; } property ABC save() { return this; } } //---- This is a perfectly valid range, which you can save, but the returned string is transient: //---- void main() { ABC abc; writeln("Printing 10 elements: "); abc.take(10).writeln('\n'); writeln("Duplicating range"); auto abc2 = abc.save; abc.popFront; foreach(v; zip(abc, abc2).take(5)) write("[", v[0], ", ", v[1], "]"); writeln('\n'); writeln("Prnting two consecutive elements:"); auto first = abc.front; abc.popFront(); auto second = abc.front; writeln("[", first, ", ", second, "]"); } //---- Produces: //---- Printing 10 elements: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"] Duplicating range [B, A][C, B][A, C][B, A][C, B] Prnting two consecutive elements: [C, C] //---- As you can see, you can perfectly iterate. You can perfectly save the range. The saved range can be used to backtrack. But if you attempt to store two consecutive fronts, things don't go well. The same holds true for a Random Access range BTW. Iteration and transient-ness of returned value are orthogonal concepts
Oct 18 2012
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Oct 18, 2012 at 09:09:03AM +0200, Don Clugston wrote:
 On 17/10/12 23:41, H. S. Teoh wrote:
[...]
I think I'm not so sure about Andrei's lumping input ranges with
persistent return values from .front together with forward ranges.
Some algorithms, like findAdjacent, do not need a forward range, but
they do need a persistent .front. I do not like the idea of
artificially limiting the scope of findAdjacent just because you
can't assume input ranges' .front returns a persistent value. Like
somebody else mentioned, whether .front is transient or not is
orthogonal to whether the range is an input range or a forward range.
There can be ranges whose .front is persistent, but they can't be
forward ranges for practical reasons.
Is it actually orthogonal? Is it possible for a forward range to be transient?
[...] What about a range over all permutations of an array, that modifies the array in-place? It can be a forward range by having .save copy the current state of the array, but .front is transient nonetheless. T -- Life is too short to run proprietary software. -- Bdale Garbee
Oct 18 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/16/12 1:09 AM, H. S. Teoh wrote:
 The scary thing is, I see similar code like this all over Phobos. Does
 this mean that most of std.algorithm may need to be revised to address
 this issue? At the very least, it would seem that a code audit is in
 order to weed out this particular issue.
Such issues do happen, are relatively rare, and are virtually untested because there's been no unittests with a deliberately "bad" input range. Although of course we do need to add the appropriate fixes and unittests, I'm not worried about it at all systemically. Andrei
Oct 17 2012
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei 
Alexandrescu wrote:
 Such issues do happen, are relatively rare, and are virtually 
 untested because there's been no unittests with a deliberately 
 "bad" input range. Although of course we do need to add the 
 appropriate fixes and unittests, I'm not worried about it at 
 all systemically.
If an abstraction encourages use in a way which leads to hard-to-detect logic bugs that do not occur with the most common test cases, this is _exactly_ the thing we should be worried about! David
Oct 17 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/17/12 12:57 PM, David Nadlinger wrote:
 On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei Alexandrescu wrote:
 Such issues do happen, are relatively rare, and are virtually untested
 because there's been no unittests with a deliberately "bad" input
 range. Although of course we do need to add the appropriate fixes and
 unittests, I'm not worried about it at all systemically.
If an abstraction encourages use in a way which leads to hard-to-detect logic bugs that do not occur with the most common test cases, this is _exactly_ the thing we should be worried about!
When I designed input ranges vs. forward ranges there were long discussion on how to distinguish them. If you have a better design it would probably not influence the state of the affair, but it would be a good discussion to have. FWIW I can already think of a couple but all rely on UFCS. Andrei
Oct 17 2012