digitalmars.D.learn - std.algorithm.splitter on a string not always bidirectional

Steven Schveighoffer (6/6) Jan 21 2021 auto sp1 = "a|b|c".splitter('|');

H. S. Teoh (8/17) Jan 21 2021 [...]

=?UTF-8?Q?Ali_=c3=87ehreli?= (5/8) Jan 21 2021 Likely. :) Here is one for uniq.back, which includes a link at the

Jon Degenhardt (12/18) Jan 21 2021 I believe the reason is two-fold. First, splitter is lazy.

Jon Degenhardt (5/29) Jan 21 2021 Note that the predicate might use a random number generator to

Steven Schveighoffer (18/44) Jan 22 2021 I think this isn't a good explanation.

Jon Degenhardt (18/70) Jan 22 2021 I think the idea is that if a construct like 'xyz.splitter(args)'

Steven Schveighoffer (13/29) Jan 22 2021 But that is possible with all 3 splitter variants. Why is one allowed to...

Imperatorn (3/8) Jan 22 2021 +1
Jon Degenhardt (23/34) Jan 22 2021 I'm not defending it, just explaining what I believe the thinking

Steven Schveighoffer (13/45) Jan 23 2021 I would not want that. My use case is splitting a string on punctuation,...

H. S. Teoh (17/24) Jan 22 2021 [...]

Steven Schveighoffer (5/29) Jan 23 2021 Yes thank you! That makes sense, and I wasn't thinking of that.

Steven Schveighoffer <schveiguy gmail.com> writes:

auto sp1 = "a|b|c".splitter('|');

writeln(sp1.back); // ok

auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

writeln(sp2.back); // error, not bidirectional

Why? is it an oversight, or is there a good reason for it?

-Steve

Jan 21 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Jan 21, 2021 at 05:43:37PM -0500, Steven Schveighoffer via
Digitalmars-d-learn wrote:
 auto sp1 = "a|b|c".splitter('|');
 
 writeln(sp1.back); // ok
 
 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));
 
 writeln(sp2.back); // error, not bidirectional
 
 Why? is it an oversight, or is there a good reason for it?

[...]

Likely an oversight.  But I wouldn't be surprised if there was some
surprising corner case for which this doesn't work / would have onerous
characteristics.


T

-- 
MSDOS = MicroSoft's Denial Of Service

Jan 21 2021

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 1/21/21 4:51 PM, H. S. Teoh wrote:

 But I wouldn't be surprised if there was some
 surprising corner case for which this doesn't work / would have onerous
 characteristics.

Likely. :) Here is one for uniq.back, which includes a link at the 
bottom for zip.back:

   https://issues.dlang.org/show_bug.cgi?id=16588

Ali

Jan 21 2021

Jon Degenhardt <jond noreply.com> writes:

On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 -Steve

I believe the reason is two-fold. First, splitter is lazy. 
Second, the range splitting is defined in the forward direction, 
not the reverse direction. A bidirectional range is only 
supported if it is guaranteed that the splits will occur at the 
same points in the range when run in either direction. That's why 
the single element delimiter is supported. Its clearly the case 
for the predicate function in your example. If that's known to be 
always true then perhaps it would make sense to enhance splitter 
to generate bidirectional results in this case.

--Jon

Jan 21 2021

Jon Degenhardt <jond noreply.com> writes:

On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
 Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 -Steve

 I believe the reason is two-fold. First, splitter is lazy. 
 Second, the range splitting is defined in the forward 
 direction, not the reverse direction. A bidirectional range is 
 only supported if it is guaranteed that the splits will occur 
 at the same points in the range when run in either direction. 
 That's why the single element delimiter is supported. Its 
 clearly the case for the predicate function in your example. If 
 that's known to be always true then perhaps it would make sense 
 to enhance splitter to generate bidirectional results in this 
 case.

 --Jon

Note that the predicate might use a random number generator to 
pick the split points. Even for same sequence of random numbers, 
the split points would be different if run from the front than if 
run from the back.

Jan 21 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/22/21 12:55 AM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 I believe the reason is two-fold. First, splitter is lazy. Second, the 
 range splitting is defined in the forward direction, not the reverse 
 direction. A bidirectional range is only supported if it is guaranteed 
 that the splits will occur at the same points in the range when run in 
 either direction. That's why the single element delimiter is 
 supported. Its clearly the case for the predicate function in your 
 example. If that's known to be always true then perhaps it would make 
 sense to enhance splitter to generate bidirectional results in this case.

 
 Note that the predicate might use a random number generator to pick the 
 split points. Even for same sequence of random numbers, the split points 
 would be different if run from the front than if run from the back.

I think this isn't a good explanation.

All forms of splitter accept a predicate (including the one which 
supports a bi-directional result). Many other phobos algorithms that 
accept a predicate provide bidirectional support. The splitter result is 
also a forward range (which makes no sense in the context of random splits).

Finally, I'd suggest that even if you split based on a subrange that is 
also bidirectional, it doesn't make sense that you couldn't split 
backwards based on that. Common sense says a range split on substrings 
is the same whether you split it forwards or backwards.

I can do this too (and in fact I will, because it works, even though 
it's horrifically ugly):

auto sp3 = "a.b|c".splitter!((c, unused) => !isAlphaNum(c))('?');

writeln(sp3.back); // ok

Looking at the code, it looks like the first form of spltter uses a 
different result struct than the other two (which have a common 
implementation). It just needs cleanup.

-Steve

Jan 22 2021

Jon Degenhardt <jond noreply.com> writes:

On Friday, 22 January 2021 at 14:14:50 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 12:55 AM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt 
 wrote:
 On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
 Schveighoffer wrote:
 auto sp1 = "a|b|c".splitter('|');

 writeln(sp1.back); // ok

 auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

 writeln(sp2.back); // error, not bidirectional

 Why? is it an oversight, or is there a good reason for it?

 I believe the reason is two-fold. First, splitter is lazy. 
 Second, the range splitting is defined in the forward 
 direction, not the reverse direction. A bidirectional range 
 is only supported if it is guaranteed that the splits will 
 occur at the same points in the range when run in either 
 direction. That's why the single element delimiter is 
 supported. Its clearly the case for the predicate function in 
 your example. If that's known to be always true then perhaps 
 it would make sense to enhance splitter to generate 
 bidirectional results in this case.

 
 Note that the predicate might use a random number generator to 
 pick the split points. Even for same sequence of random 
 numbers, the split points would be different if run from the 
 front than if run from the back.

 I think this isn't a good explanation.

 All forms of splitter accept a predicate (including the one 
 which supports a bi-directional result). Many other phobos 
 algorithms that accept a predicate provide bidirectional 
 support. The splitter result is also a forward range (which 
 makes no sense in the context of random splits).

 Finally, I'd suggest that even if you split based on a subrange 
 that is also bidirectional, it doesn't make sense that you 
 couldn't split backwards based on that. Common sense says a 
 range split on substrings is the same whether you split it 
 forwards or backwards.

 I can do this too (and in fact I will, because it works, even 
 though it's horrifically ugly):

 auto sp3 = "a.b|c".splitter!((c, unused) => 
 !isAlphaNum(c))('?');

 writeln(sp3.back); // ok

 Looking at the code, it looks like the first form of spltter 
 uses a different result struct than the other two (which have a 
 common implementation). It just needs cleanup.

 -Steve

I think the idea is that if a construct like 'xyz.splitter(args)' 
produces a range with the sequence of elements {"a", "bc", 
"def"}, then 'xyz.splitter(args).back' should produce "def". But, 
if finding the split points starting from the back results in 
something like {"f", "de", "abc"} then that relationship hasn't 
held, and the results are unexpected.

Note that in the above example, 'xyz.retro.splitter(args)' might 
produce {"f", "ed", "cba"}, so again not the same.

Another way to look at it: If split (eager) took a predicate, 
that 'xyz.splitter(args).back' and 'xyz.split(args).back' should 
produce the same result. But they will not with the example given.

I believe these consistency issues are the reason why the 
bidirectional support is limited.

Note: I didn't design any of this, but I did redo the examples in 
the documentation at one point, which is why I looked at this.

--Jon

Jan 22 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 
 I think the idea is that if a construct like 'xyz.splitter(args)' 
 produces a range with the sequence of elements {"a", "bc", "def"}, then 
 'xyz.splitter(args).back' should produce "def". But, if finding the 
 split points starting from the back results in something like {"f", 
 "de", "abc"} then that relationship hasn't held, and the results are 
 unexpected.

But that is possible with all 3 splitter variants. Why is one allowed to 
be bidirectional and the others are not?

 
 Another way to look at it: If split (eager) took a predicate, that 
 'xyz.splitter(args).back' and 'xyz.split(args).back' should produce the 
 same result. But they will not with the example given.

With what example given? The example you gave is incomplete (what are args?)

 I believe these consistency issues are the reason why the bidirectional 
 support is limited.
 
 Note: I didn't design any of this, but I did redo the examples in the 
 documentation at one point, which is why I looked at this.

We sometimes spend time justifying why the existing implementation is 
the way it is, when we should be questioning why it was designed that 
way in the first place.

If splitter should be restricted based on possible edge cases, then it 
should be consistently restricted. My opinion is it should not be 
restricted in any case. All three cases provide equal possibility of 
bidirectional correctness. The one case that should be restricted is the 
splitter version that accepts a non-bi-directional delimiting range.

-Steve

Jan 22 2021

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 [...]

 But that is possible with all 3 splitter variants. Why is one 
 allowed to be bidirectional and the others are not?

 [...]

+1

Jan 22 2021

Jon Degenhardt <jond noreply.com> writes:

On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 
 I think the idea is that if a construct like 
 'xyz.splitter(args)' produces a range with the sequence of 
 elements {"a", "bc", "def"}, then 'xyz.splitter(args).back' 
 should produce "def". But, if finding the split points 
 starting from the back results in something like {"f", "de", 
 "abc"} then that relationship hasn't held, and the results are 
 unexpected.

 But that is possible with all 3 splitter variants. Why is one 
 allowed to be bidirectional and the others are not?

I'm not defending it, just explaining what I believe the thinking 
was based on the examination I did. It wasn't just looking at the 
code, there was a discussion somewhere. A forum discussion, PR 
discussion, bug or code comments. Something somewhere, but I 
don't remember exactly.

However, to answer your question - The relationship described is 
guaranteed if the basis for the split is a single element. If the 
range is a string, that's a single 'char'. If the range is 
composed of integers, then a single integer. Note that if the 
basis for the split is itself a range, then the relationship 
described is not guaranteed.

Personally, I can see a good argument that bidirectionality 
should not be supported in any of these cases, and instead force 
the user to choose between eager splitting or reversing the range 
via retro. For the common case of strings, the further argument 
could be made that the distinction between char and dchar is 
another point of inconsistency.

Regardless whether the choices made were the best choices, there 
was some thinking that went into it, and it is worth 
understanding the thinking when considering changes.

--Jon

Jan 22 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/22/21 2:13 PM, Jon Degenhardt wrote:
 On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:
 I think the idea is that if a construct like 'xyz.splitter(args)' 
 produces a range with the sequence of elements {"a", "bc", "def"}, 
 then 'xyz.splitter(args).back' should produce "def". But, if finding 
 the split points starting from the back results in something like 
 {"f", "de", "abc"} then that relationship hasn't held, and the 
 results are unexpected.

 But that is possible with all 3 splitter variants. Why is one allowed 
 to be bidirectional and the others are not?

 
 I'm not defending it, just explaining what I believe the thinking was 
 based on the examination I did. It wasn't just looking at the code, 
 there was a discussion somewhere. A forum discussion, PR discussion, bug 
 or code comments. Something somewhere, but I don't remember exactly.
 
 However, to answer your question - The relationship described is 
 guaranteed if the basis for the split is a single element. If the range 
 is a string, that's a single 'char'. If the range is composed of 
 integers, then a single integer. Note that if the basis for the split is 
 itself a range, then the relationship described is not guaranteed.
 
 Personally, I can see a good argument that bidirectionality should not 
 be supported in any of these cases, and instead force the user to choose 
 between eager splitting or reversing the range via retro. For the common 
 case of strings, the further argument could be made that the distinction 
 between char and dchar is another point of inconsistency.

I would not want that. My use case is splitting a string on punctuation, 
and using the lazy result for testing equality of something. But I have 
some special suffix items that I want to handle first (and pop off).

dchar/char inconsistency isn't a problem, because they are both dchar 
ranges (and both are bidirectional).

 Regardless whether the choices made were the best choices, there was 
 some thinking that went into it, and it is worth understanding the 
 thinking when considering changes.

I believe there was that thinking. It's why I posted, because before I 
filed a bug, I wanted to make sure there wasn't a good reason.

It looks like there is NOT a good reason for the single-item based 
splitting as you say to prevent bidirectional access. But there IS a 
good reason (thanks for the example H.S. Teoh) to prevent it for 
multi-element delimiters.

-Steve

Jan 23 2021

H. S. Teoh <hsteoh quickfur.ath.cx> writes:

On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:

[...]
 Another way to look at it: If split (eager) took a predicate, 
 that 'xyz.splitter(args).back' and 'xyz.split(args).back' 
 should produce the same result. But they will not with the 
 example given.

 With what example given? The example you gave is incomplete 
 (what are args?)

[...]

Here is a case for which iterating forwards yields a different 
sequence
from iterating backwards (if we were to allow the latter):

     "bbcbcba".splitter("bcb")

Iterating forwards gives us the subranges: "b", "cba".

Iterating backwards gives us: "a", "bbc".

So it cannot be a bidirectional range, at least not in the 
expected sense
that iterating from the back ought to give us the same subranges 
as
iterating from the front, only in a reverse order.  Here iterating
backwards yields a completely different decomposition.


--T

Jan 22 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/22/21 2:56 PM, H. S. Teoh wrote:
 On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer wrote:
 On 1/22/21 11:57 AM, Jon Degenhardt wrote:

 [...]
 Another way to look at it: If split (eager) took a predicate, that 
 'xyz.splitter(args).back' and 'xyz.split(args).back' should produce 
 the same result. But they will not with the example given.

 With what example given? The example you gave is incomplete (what are 
 args?)

 [...]
 
 Here is a case for which iterating forwards yields a different sequence
 from iterating backwards (if we were to allow the latter):
 
      "bbcbcba".splitter("bcb")
 
 Iterating forwards gives us the subranges: "b", "cba".
 
 Iterating backwards gives us: "a", "bbc".
 
 So it cannot be a bidirectional range, at least not in the expected sense
 that iterating from the back ought to give us the same subranges as
 iterating from the front, only in a reverse order.  Here iterating
 backwards yields a completely different decomposition.

Yes thank you! That makes sense, and I wasn't thinking of that.

I still believe that any splitter based on individual elements should be 
bidirectional.

-Steve

Jan 23 2021

D Programming

C/C++ Programming

Other

digitalmars.D.learn - std.algorithm.splitter on a string not always bidirectional