digitalmars.D - std.string.reverse() for mutable array of chars

bearophile (32/32) Dec 09 2011 Reversing an array of chars/wchars is a common enough operation (mutable...

Jonathan M Davis (16/76) Dec 09 2011 If you want to reverse a char[], then cast it to ubyte[] and reverse tha...

bearophile (6/14) Dec 09 2011 I am not interested in reversing code units. Sorry if my post has led to...

Jonathan M Davis (9/31) Dec 09 2011 I don't expect that std.string will _ever_ be grapheme-aware or be proce...

bearophile (4/6) Dec 09 2011 OK, let's forget about graphemes in this discussion. Now I am not asking...

Jonathan M Davis (5/11) Dec 09 2011 It sounded like you were, because you were complaining about how dchar d...

bearophile (4/5) Dec 09 2011 Right, I was :-) But you have changed my mind when you have explained me...

Jonathan M Davis (27/33) Dec 09 2011 So, now you're asking that char and wchar arrays be reversible with reve...
Jonathan M Davis (5/40) Dec 09 2011 Well, it looks like Andrei had some free time this morning and figured i...

bearophile (4/6) Dec 09 2011 Thanks you Andrei Alexandrescu and Jonathan :-)

bearophile <bearophileHUGS lycos.com> writes:

Reversing an array of chars/wchars is a common enough operation (mutable arrays
often come from precedent operations that have built it). Currently
std.algorithm.reverse() can't be used:


import std.algorithm;
void main() {
    dchar[] s1 = "hello"d.dup;
    s1.reverse(); // OK
    wchar[] s2 = "hello"w.dup;
    s2.reverse(); // error
    char[] s3 = "hello".dup;
    s3.reverse(); // error
}


I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse() (or
to add a std.string.reverse()), to make it work on those types too. Generally
std.algorithms don't work on UTF8/UTF16 because of the variable length of its
items, but for this specific algorithm I think this is not a problem because:

1) Reversing an array is an O(n) operation, and decoding UTF adds a constant
overhead, so the computational complexity of reverse doesn't change.
2) If you reverse an char[] or wchar[] the result will fit in the input array
(is this always true? Please tell me if this isn't true). It "just" needs to
correctly swap the bytes of multi-byte chars, and swap if there are combined
codepoints too.

- - - - - - - - - - - - - - - - - -

And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):


import std.algorithm: reverse;
void main() {
    dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
    txt.reverse();
    assert(txt == "\U00000042\U00000308\U00000041"d);
}


txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL LETTER
B (see bug 7084 for more details).

A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
LETTER A, COMBINING DIAERESIS):

"\U00000042\U00000041\U00000308"d


See for some code:
http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in-place

See also:
http://d.puremagic.com/issues/show_bug.cgi?id=7085

Regarding the printing of unicode strings see also:
http://d.puremagic.com/issues/show_bug.cgi?id=7084

Bye,
bearophile

Dec 09 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, December 09, 2011 04:46:35 bearophile wrote:
 Reversing an array of chars/wchars is a common enough operation (mutable
 arrays often come from precedent operations that have built it). Currently
 std.algorithm.reverse() can't be used:
 
 
 import std.algorithm;
 void main() {
     dchar[] s1 = "hello"d.dup;
     s1.reverse(); // OK
     wchar[] s2 = "hello"w.dup;
     s2.reverse(); // error
     char[] s3 = "hello".dup;
     s3.reverse(); // error
 }
 
 
 I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse()
 (or to add a std.string.reverse()), to make it work on those types too.
 Generally std.algorithms don't work on UTF8/UTF16 because of the variable
 length of its items, but for this specific algorithm I think this is not a
 problem because:
 
 1) Reversing an array is an O(n) operation, and decoding UTF adds a constant
 overhead, so the computational complexity of reverse doesn't change. 2) If
 you reverse an char[] or wchar[] the result will fit in the input array (is
 this always true? Please tell me if this isn't true). It "just" needs to
 correctly swap the bytes of multi-byte chars, and swap if there are
 combined codepoints too.
 
 - - - - - - - - - - - - - - - - - -
 
 And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):
 
 
 import std.algorithm: reverse;
 void main() {
     dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
     txt.reverse();
     assert(txt == "\U00000042\U00000308\U00000041"d);
 }
 
 
 txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL
 LETTER B (see bug 7084 for more details).
 
 A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
 LETTER A, COMBINING DIAERESIS):
 
 "\U00000042\U00000041\U00000308"d
 
 
 See for some code:
 http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in
 -place
 
 See also:
 http://d.puremagic.com/issues/show_bug.cgi?id=7085
 
 Regarding the printing of unicode strings see also:
 http://d.puremagic.com/issues/show_bug.cgi?id=7084

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If 
you want to reverse a wchar[], then cast it to ushort[] and reverse that. In 
Phobos, strings are ranges of dchar, so reverse is going to reverse code 
points. If you want it to reverse code units instead, then you just use the 
appropriate cast. There's no reason to have it reverse the code units and 
completely mess up unicode strings.


completely correct. It's reversing the code points, _not_ the graphemes. If 
you want to operate on graphemes, you need a range of graphemes, which Phobos 
does not yet support. Once it does (or if you implement it yourself), you can 
reverse a string based on graphemes if that's what you want to do. But as it 
stands, ranges of code points are the most advanced unicode construct that 
Phobos currently supports, so that's what its functions are going to operate 
on.

- Jonathan M Davis

Dec 09 2011

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:


 completely correct. It's reversing the code points, _not_ the graphemes.

OK. Maybe I will open a differently worded enhancement request, for a
grapheme-aware std.string.


 If you want to reverse a char[], then cast it to ubyte[] and reverse that. If 
 you want to reverse a wchar[], then cast it to ushort[] and reverse that. In 
 Phobos, strings are ranges of dchar, so reverse is going to reverse code 
 points. If you want it to reverse code units instead, then you just use the 
 appropriate cast. There's no reason to have it reverse the code units and 
 completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post has led to this
wrong idea. For this specific problem I am not going to cast to ubyte[] or
ushort[] because it gives very wrong results.

It's possible to write a "correct" (that doesn't take into account graphemes)
reverse even if you do not use casts, keeping the array as char[] or wchar[],
reversing the bytes, and then reversing the bytes of each variable-length
codepoint. This is what I was asking to an in-place reverse().

Bye,
bearophile

Dec 09 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, December 09, 2011 05:58:40 bearophile wrote:
 Jonathan M Davis:

 dchar[] is completely correct. It's reversing the code points, _not_
 the graphemes.

 OK. Maybe I will open a differently worded enhancement request, for a
 grapheme-aware std.string.
 If you want to reverse a char[], then cast it to ubyte[] and reverse
 that. If you want to reverse a wchar[], then cast it to ushort[] and
 reverse that. In Phobos, strings are ranges of dchar, so reverse is
 going to reverse code points. If you want it to reverse code units
 instead, then you just use the appropriate cast. There's no reason to
 have it reverse the code units and completely mess up unicode strings.

 
 I am not interested in reversing code units. Sorry if my post has led to
 this wrong idea. For this specific problem I am not going to cast to
 ubyte[] or ushort[] because it gives very wrong results.
 
 It's possible to write a "correct" (that doesn't take into account
 graphemes) reverse even if you do not use casts, keeping the array as
 char[] or wchar[], reversing the bytes, and then reversing the bytes of
 each variable-length codepoint. This is what I was asking to an in-place
 reverse().

I don't expect that std.string will _ever_ be grapheme-aware or be processed 
by default as a range of graphemes. That's far too expensive as far as 
performance goes. Rather, we're likely to have a wrapper and/or separate 
range-type which handles graphemes. Then if you want the extra correctness and 
are willing to pay the cost, you use that. As I understand it, std.regex does 
have the beginnings of such, but we do still need to have a range type of some 
variety (probably in std.utf) which fully supports graphemes.

- Jonathan M Davis

Dec 09 2011

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 I don't expect that std.string will _ever_ be grapheme-aware or be processed 
 by default as a range of graphemes.

OK, let's forget about graphemes in this discussion. Now I am not asking for a
grapheme-aware reverse.

Bye,
bearophile

Dec 09 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, December 09, 2011 06:15:04 bearophile wrote:
 Jonathan M Davis:
 I don't expect that std.string will _ever_ be grapheme-aware or be
 processed by default as a range of graphemes.

 
 OK, let's forget about graphemes in this discussion. Now I am not asking for
 a grapheme-aware reverse.

It sounded like you were, because you were complaining about how dchar did an 
exact reversal of the code points rather than taking combining code points 
into account.

- Jonathan M Davis

Dec 09 2011

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 It sounded like you were,

Right, I was :-) But you have changed my mind when you have explained me that
nothing in std.algorithm is grapheme-aware. So I have reduced the amount of
what I am asking for.

Bye,
bearophile

Dec 09 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, December 09, 2011 12:44:37 bearophile wrote:
 Jonathan M Davis:
 It sounded like you were,

 
 Right, I was :-) But you have changed my mind when you have explained me
 that nothing in std.algorithm is grapheme-aware. So I have reduced the
 amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse 
such that their code points are reversed (i.e. the result is the same as if 
you reversed an array of dchar). Well, I'm not sure that you can actually do 
that with the same efficiency. I'd have to think about it more. Regardless, the 
implementation would be _really_ complicated in comparison to how reverse 
works right now. char[] and wchar[] don't work with reverse, because their 
elements aren't swappable. So, you can't just swap elements as you iterate in 
from both ends. You'd have to be moving stuff off into temporaries as you 
swapped them, because the code point on one side wouldn't necessarily fit where 
the code point on the other side was, and in the worst case (i.e. all of the 
code points on one half of the string are multiple code units and all of those 
on the other side are single code units), you'd pretty much end up having to 
copy half the array while you waited for enough space to open up on one side 
to fit the characters from the other side. So, regardless of whether it has the 
same computational complexity as the current reverse, its memory requirements 
would be far more.

I don't think that the request is completely unreasonable, but also I'm not 
sure it's acceptable for reverse to change its performance characteristics as 
much as would be required for it to work with arrays of char or wchar - 
particularly with regards to how much memory would be required. In general, 
the performance characteristics of the algorithms in Phobos don't vary much 
with regards to the type that that's used. I'm pretty sure that in terms of 
big-o notation, the memory complexity wouldn't match (though I don't recall 
exactly how big-o notation works with memory rather than computational 
complexity).

- Jonathan M Davis

Dec 09 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, December 09, 2011 13:27:01 Jonathan M Davis wrote:
 On Friday, December 09, 2011 12:44:37 bearophile wrote:
 Jonathan M Davis:
 It sounded like you were,

 
 Right, I was :-) But you have changed my mind when you have explained me
 that nothing in std.algorithm is grapheme-aware. So I have reduced the
 amount of what I am asking for.

 
 So, now you're asking that char and wchar arrays be reversible with reverse
 such that their code points are reversed (i.e. the result is the same as if
 you reversed an array of dchar). Well, I'm not sure that you can actually do
 that with the same efficiency. I'd have to think about it more. Regardless,
 the implementation would be _really_ complicated in comparison to how
 reverse works right now. char[] and wchar[] don't work with reverse,
 because their elements aren't swappable. So, you can't just swap elements
 as you iterate in from both ends. You'd have to be moving stuff off into
 temporaries as you swapped them, because the code point on one side
 wouldn't necessarily fit where the code point on the other side was, and in
 the worst case (i.e. all of the code points on one half of the string are
 multiple code units and all of those on the other side are single code
 units), you'd pretty much end up having to copy half the array while you
 waited for enough space to open up on one side to fit the characters from
 the other side. So, regardless of whether it has the same computational
 complexity as the current reverse, its memory requirements would be far
 more.
 
 I don't think that the request is completely unreasonable, but also I'm not
 sure it's acceptable for reverse to change its performance characteristics
 as much as would be required for it to work with arrays of char or wchar -
 particularly with regards to how much memory would be required. In general,
 the performance characteristics of the algorithms in Phobos don't vary much
 with regards to the type that that's used. I'm pretty sure that in terms of
 big-o notation, the memory complexity wouldn't match (though I don't recall
 exactly how big-o notation works with memory rather than computational
 complexity).

Well, it looks like Andrei had some free time this morning and figured it out. 
He has a pull request for it:

https://github.com/D-Programming-Language/phobos/pull/359

- Jonathan M Davis

Dec 09 2011

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 Well, it looks like Andrei had some free time this morning and figured it out. 
 He has a pull request for it:

Thanks you Andrei Alexandrescu and Jonathan :-)

Bye,
bearophile

Dec 09 2011

D Programming

C/C++ Programming

Other

digitalmars.D - std.string.reverse() for mutable array of chars