www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - std.algorithm.splitter on a string not always bidirectional

reply Steven Schveighoffer <schveiguy gmail.com> writes:
auto sp1 = "a|b|c".splitter('|');

writeln(sp1.back); // ok

auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

writeln(sp2.back); // error, not bidirectional

Why? is it an oversight, or is there a good reason for it?

-Steve
Jan 21
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 21, 2021 at 05:43:37PM -0500, Steven Schveighoffer via
Digitalmars-d-learn wrote:
 auto sp1 = "a|b|c".splitter('|');
 
 writeln(sp1.back); // ok
 
 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));
 
 writeln(sp2.back); // error, not bidirectional
 
 Why? is it an oversight, or is there a good reason for it?
[...] Likely an oversight. But I wouldn't be surprised if there was some surprising corner case for which this doesn't work / would have onerous characteristics. T -- MSDOS = MicroSoft's Denial Of Service
Jan 21
parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 1/21/21 4:51 PM, H. S. Teoh wrote:

 But I wouldn't be surprised if there was some
 surprising corner case for which this doesn't work / would have onerous
 characteristics.
Likely. :) Here is one for uniq.back, which includes a link at the bottom for zip.back: https://issues.dlang.org/show_bug.cgi?id=16588 Ali
Jan 21
prev sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 -Steve
I believe the reason is two-fold. First, splitter is lazy. Second, the range splitting is defined in the forward direction, not the reverse direction. A bidirectional range is only supported if it is guaranteed that the splits will occur at the same points in the range when run in either direction. That's why the single element delimiter is supported. Its clearly the case for the predicate function in your example. If that's known to be always true then perhaps it would make sense to enhance splitter to generate bidirectional results in this case. --Jon
Jan 21
parent reply Jon Degenhardt <jond noreply.com> writes:
On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
 Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 -Steve
I believe the reason is two-fold. First, splitter is lazy. Second, the range splitting is defined in the forward direction, not the reverse direction. A bidirectional range is only supported if it is guaranteed that the splits will occur at the same points in the range when run in either direction. That's why the single element delimiter is supported. Its clearly the case for the predicate function in your example. If that's known to be always true then perhaps it would make sense to enhance splitter to generate bidirectional results in this case. --Jon
Note that the predicate might use a random number generator to pick the split points. Even for same sequence of random numbers, the split points would be different if run from the front than if run from the back.
Jan 21
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/22/21 12:55 AM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?
I believe the reason is two-fold. First, splitter is lazy. Second, the range splitting is defined in the forward direction, not the reverse direction. A bidirectional range is only supported if it is guaranteed that the splits will occur at the same points in the range when run in either direction. That's why the single element delimiter is supported. Its clearly the case for the predicate function in your example. If that's known to be always true then perhaps it would make sense to enhance splitter to generate bidirectional results in this case.
Note that the predicate might use a random number generator to pick the split points. Even for same sequence of random numbers, the split points would be different if run from the front than if run from the back.
I think this isn't a good explanation. All forms of splitter accept a predicate (including the one which supports a bi-directional result). Many other phobos algorithms that accept a predicate provide bidirectional support. The splitter result is also a forward range (which makes no sense in the context of random splits). Finally, I'd suggest that even if you split based on a subrange that is also bidirectional, it doesn't make sense that you couldn't split backwards based on that. Common sense says a range split on substrings is the same whether you split it forwards or backwards. I can do this too (and in fact I will, because it works, even though it's horrifically ugly): auto sp3 = "a.b|c".splitter!((c, unused) => !isAlphaNum(c))('?'); writeln(sp3.back); // ok Looking at the code, it looks like the first form of spltter uses a different result struct than the other two (which have a common implementation). It just needs cleanup. -Steve
Jan 22
parent reply Jon Degenhardt <jond noreply.com> writes:
On Friday, 22 January 2021 at 14:14:50 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 12:55 AM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt 
 wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
 Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?
I believe the reason is two-fold. First, splitter is lazy. Second, the range splitting is defined in the forward direction, not the reverse direction. A bidirectional range is only supported if it is guaranteed that the splits will occur at the same points in the range when run in either direction. That's why the single element delimiter is supported. Its clearly the case for the predicate function in your example. If that's known to be always true then perhaps it would make sense to enhance splitter to generate bidirectional results in this case.
Note that the predicate might use a random number generator to pick the split points. Even for same sequence of random numbers, the split points would be different if run from the front than if run from the back.
I think this isn't a good explanation. All forms of splitter accept a predicate (including the one which supports a bi-directional result). Many other phobos algorithms that accept a predicate provide bidirectional support. The splitter result is also a forward range (which makes no sense in the context of random splits). Finally, I'd suggest that even if you split based on a subrange that is also bidirectional, it doesn't make sense that you couldn't split backwards based on that. Common sense says a range split on substrings is the same whether you split it forwards or backwards. I can do this too (and in fact I will, because it works, even though it's horrifically ugly): auto sp3 = "a.b|c".splitter!((c, unused) => !isAlphaNum(c))('?'); writeln(sp3.back); // ok Looking at the code, it looks like the first form of spltter uses a different result struct than the other two (which have a common implementation). It just needs cleanup. -Steve
I think the idea is that if a construct like 'xyz.splitter(args)' produces a range with the sequence of elements {"a", "bc", "def"}, then 'xyz.splitter(args).back' should produce "def". But, if finding the split points starting from the back results in something like {"f", "de", "abc"} then that relationship hasn't held, and the results are unexpected. Note that in the above example, 'xyz.retro.splitter(args)' might produce {"f", "ed", "cba"}, so again not the same. Another way to look at it: If split (eager) took a predicate, that 'xyz.splitter(args).back' and 'xyz.split(args).back' should produce the same result. But they will not with the example given. I believe these consistency issues are the reason why the bidirectional support is limited. Note: I didn't design any of this, but I did redo the examples in the documentation at one point, which is why I looked at this. --Jon
Jan 22
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 
 I think the idea is that if a construct like 'xyz.splitter(args)' 
 produces a range with the sequence of elements {"a", "bc", "def"}, then 
 'xyz.splitter(args).back' should produce "def". But, if finding the 
 split points starting from the back results in something like {"f", 
 "de", "abc"} then that relationship hasn't held, and the results are 
 unexpected.
But that is possible with all 3 splitter variants. Why is one allowed to be bidirectional and the others are not?
 
 Another way to look at it: If split (eager) took a predicate, that 
 'xyz.splitter(args).back' and 'xyz.split(args).back' should produce the 
 same result. But they will not with the example given.
With what example given? The example you gave is incomplete (what are args?)
 I believe these consistency issues are the reason why the bidirectional 
 support is limited.
 
 Note: I didn't design any of this, but I did redo the examples in the 
 documentation at one point, which is why I looked at this.
We sometimes spend time justifying why the existing implementation is the way it is, when we should be questioning why it was designed that way in the first place. If splitter should be restricted based on possible edge cases, then it should be consistently restricted. My opinion is it should not be restricted in any case. All three cases provide equal possibility of bidirectional correctness. The one case that should be restricted is the splitter version that accepts a non-bi-directional delimiting range. -Steve
Jan 22
next sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 [...]
But that is possible with all 3 splitter variants. Why is one allowed to be bidirectional and the others are not? [...]
+1
Jan 22
prev sibling next sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 
 I think the idea is that if a construct like 
 'xyz.splitter(args)' produces a range with the sequence of 
 elements {"a", "bc", "def"}, then 'xyz.splitter(args).back' 
 should produce "def". But, if finding the split points 
 starting from the back results in something like {"f", "de", 
 "abc"} then that relationship hasn't held, and the results are 
 unexpected.
But that is possible with all 3 splitter variants. Why is one allowed to be bidirectional and the others are not?
I'm not defending it, just explaining what I believe the thinking was based on the examination I did. It wasn't just looking at the code, there was a discussion somewhere. A forum discussion, PR discussion, bug or code comments. Something somewhere, but I don't remember exactly. However, to answer your question - The relationship described is guaranteed if the basis for the split is a single element. If the range is a string, that's a single 'char'. If the range is composed of integers, then a single integer. Note that if the basis for the split is itself a range, then the relationship described is not guaranteed. Personally, I can see a good argument that bidirectionality should not be supported in any of these cases, and instead force the user to choose between eager splitting or reversing the range via retro. For the common case of strings, the further argument could be made that the distinction between char and dchar is another point of inconsistency. Regardless whether the choices made were the best choices, there was some thinking that went into it, and it is worth understanding the thinking when considering changes. --Jon
Jan 22
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/22/21 2:13 PM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 I think the idea is that if a construct like 'xyz.splitter(args)' 
 produces a range with the sequence of elements {"a", "bc", "def"}, 
 then 'xyz.splitter(args).back' should produce "def". But, if finding 
 the split points starting from the back results in something like 
 {"f", "de", "abc"} then that relationship hasn't held, and the 
 results are unexpected.
But that is possible with all 3 splitter variants. Why is one allowed to be bidirectional and the others are not?
I'm not defending it, just explaining what I believe the thinking was based on the examination I did. It wasn't just looking at the code, there was a discussion somewhere. A forum discussion, PR discussion, bug or code comments. Something somewhere, but I don't remember exactly. However, to answer your question - The relationship described is guaranteed if the basis for the split is a single element. If the range is a string, that's a single 'char'. If the range is composed of integers, then a single integer. Note that if the basis for the split is itself a range, then the relationship described is not guaranteed. Personally, I can see a good argument that bidirectionality should not be supported in any of these cases, and instead force the user to choose between eager splitting or reversing the range via retro. For the common case of strings, the further argument could be made that the distinction between char and dchar is another point of inconsistency.
I would not want that. My use case is splitting a string on punctuation, and using the lazy result for testing equality of something. But I have some special suffix items that I want to handle first (and pop off). dchar/char inconsistency isn't a problem, because they are both dchar ranges (and both are bidirectional).
 Regardless whether the choices made were the best choices, there was 
 some thinking that went into it, and it is worth understanding the 
 thinking when considering changes.
I believe there was that thinking. It's why I posted, because before I filed a bug, I wanted to make sure there wasn't a good reason. It looks like there is NOT a good reason for the single-item based splitting as you say to prevent bidirectional access. But there IS a good reason (thanks for the example H.S. Teoh) to prevent it for multi-element delimiters. -Steve
Jan 23
prev sibling parent reply H. S. Teoh <hsteoh quickfur.ath.cx> writes:
On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
[...]
 Another way to look at it: If split (eager) took a predicate, 
 that 'xyz.splitter(args).back' and 'xyz.split(args).back' 
 should produce the same result. But they will not with the 
 example given.
With what example given? The example you gave is incomplete (what are args?)
[...] Here is a case for which iterating forwards yields a different sequence from iterating backwards (if we were to allow the latter): "bbcbcba".splitter("bcb") Iterating forwards gives us the subranges: "b", "cba". Iterating backwards gives us: "a", "bbc". So it cannot be a bidirectional range, at least not in the expected sense that iterating from the back ought to give us the same subranges as iterating from the front, only in a reverse order. Here iterating backwards yields a completely different decomposition. --T
Jan 22
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/22/21 2:56 PM, H. S. Teoh wrote:
 On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
[...]
 Another way to look at it: If split (eager) took a predicate, that 
 'xyz.splitter(args).back' and 'xyz.split(args).back' should produce 
 the same result. But they will not with the example given.
With what example given? The example you gave is incomplete (what are args?)
[...] Here is a case for which iterating forwards yields a different sequence from iterating backwards (if we were to allow the latter):     "bbcbcba".splitter("bcb") Iterating forwards gives us the subranges: "b", "cba". Iterating backwards gives us: "a", "bbc". So it cannot be a bidirectional range, at least not in the expected sense that iterating from the back ought to give us the same subranges as iterating from the front, only in a reverse order.  Here iterating backwards yields a completely different decomposition.
Yes thank you! That makes sense, and I wasn't thinking of that. I still believe that any splitter based on individual elements should be bidirectional. -Steve
Jan 23