www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - std.string.reverse() for mutable array of chars

reply bearophile <bearophileHUGS lycos.com> writes:
Reversing an array of chars/wchars is a common enough operation (mutable arrays
often come from precedent operations that have built it). Currently
std.algorithm.reverse() can't be used:


import std.algorithm;
void main() {
    dchar[] s1 = "hello"d.dup;
    s1.reverse(); // OK
    wchar[] s2 = "hello"w.dup;
    s2.reverse(); // error
    char[] s3 = "hello".dup;
    s3.reverse(); // error
}


I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse() (or
to add a std.string.reverse()), to make it work on those types too. Generally
std.algorithms don't work on UTF8/UTF16 because of the variable length of its
items, but for this specific algorithm I think this is not a problem because:

1) Reversing an array is an O(n) operation, and decoding UTF adds a constant
overhead, so the computational complexity of reverse doesn't change.
2) If you reverse an char[] or wchar[] the result will fit in the input array
(is this always true? Please tell me if this isn't true). It "just" needs to
correctly swap the bytes of multi-byte chars, and swap if there are combined
codepoints too.

- - - - - - - - - - - - - - - - - -

And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):


import std.algorithm: reverse;
void main() {
    dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
    txt.reverse();
    assert(txt == "\U00000042\U00000308\U00000041"d);
}


txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL LETTER
B (see bug 7084 for more details).

A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
LETTER A, COMBINING DIAERESIS):

"\U00000042\U00000041\U00000308"d


See for some code:
http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in-place

See also:
http://d.puremagic.com/issues/show_bug.cgi?id=7085

Regarding the printing of unicode strings see also:
http://d.puremagic.com/issues/show_bug.cgi?id=7084

Bye,
bearophile
Dec 09 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 09, 2011 04:46:35 bearophile wrote:
 Reversing an array of chars/wchars is a common enough operation (mutable
 arrays often come from precedent operations that have built it). Currently
 std.algorithm.reverse() can't be used:
 
 
 import std.algorithm;
 void main() {
     dchar[] s1 = "hello"d.dup;
     s1.reverse(); // OK
     wchar[] s2 = "hello"w.dup;
     s2.reverse(); // error
     char[] s3 = "hello".dup;
     s3.reverse(); // error
 }
 
 
 I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse()
 (or to add a std.string.reverse()), to make it work on those types too.
 Generally std.algorithms don't work on UTF8/UTF16 because of the variable
 length of its items, but for this specific algorithm I think this is not a
 problem because:
 
 1) Reversing an array is an O(n) operation, and decoding UTF adds a constant
 overhead, so the computational complexity of reverse doesn't change. 2) If
 you reverse an char[] or wchar[] the result will fit in the input array (is
 this always true? Please tell me if this isn't true). It "just" needs to
 correctly swap the bytes of multi-byte chars, and swap if there are
 combined codepoints too.
 
 - - - - - - - - - - - - - - - - - -
 
 And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):
 
 
 import std.algorithm: reverse;
 void main() {
     dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
     txt.reverse();
     assert(txt == "\U00000042\U00000308\U00000041"d);
 }
 
 
 txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL
 LETTER B (see bug 7084 for more details).
 
 A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
 LETTER A, COMBINING DIAERESIS):
 
 "\U00000042\U00000041\U00000308"d
 
 
 See for some code:
 http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in
 -place
 
 See also:
 http://d.puremagic.com/issues/show_bug.cgi?id=7085
 
 Regarding the printing of unicode strings see also:
 http://d.puremagic.com/issues/show_bug.cgi?id=7084

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If you want to reverse a wchar[], then cast it to ushort[] and reverse that. In Phobos, strings are ranges of dchar, so reverse is going to reverse code points. If you want it to reverse code units instead, then you just use the appropriate cast. There's no reason to have it reverse the code units and completely mess up unicode strings. And as I explained in bug# 7085, reverse's behavior with regards to dchar[] is completely correct. It's reversing the code points, _not_ the graphemes. If you want to operate on graphemes, you need a range of graphemes, which Phobos does not yet support. Once it does (or if you implement it yourself), you can reverse a string based on graphemes if that's what you want to do. But as it stands, ranges of code points are the most advanced unicode construct that Phobos currently supports, so that's what its functions are going to operate on. - Jonathan M Davis
Dec 09 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 And as I explained in bug# 7085, reverse's behavior with regards to dchar[] is 
 completely correct. It's reversing the code points, _not_ the graphemes.

OK. Maybe I will open a differently worded enhancement request, for a grapheme-aware std.string.
 If you want to reverse a char[], then cast it to ubyte[] and reverse that. If 
 you want to reverse a wchar[], then cast it to ushort[] and reverse that. In 
 Phobos, strings are ranges of dchar, so reverse is going to reverse code 
 points. If you want it to reverse code units instead, then you just use the 
 appropriate cast. There's no reason to have it reverse the code units and 
 completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post has led to this wrong idea. For this specific problem I am not going to cast to ubyte[] or ushort[] because it gives very wrong results. It's possible to write a "correct" (that doesn't take into account graphemes) reverse even if you do not use casts, keeping the array as char[] or wchar[], reversing the bytes, and then reversing the bytes of each variable-length codepoint. This is what I was asking to an in-place reverse(). Bye, bearophile
Dec 09 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 09, 2011 05:58:40 bearophile wrote:
 Jonathan M Davis:
 And as I explained in bug# 7085, reverse's behavior with regards to
 dchar[] is completely correct. It's reversing the code points, _not_
 the graphemes.

grapheme-aware std.string.
 If you want to reverse a char[], then cast it to ubyte[] and reverse
 that. If you want to reverse a wchar[], then cast it to ushort[] and
 reverse that. In Phobos, strings are ranges of dchar, so reverse is
 going to reverse code points. If you want it to reverse code units
 instead, then you just use the appropriate cast. There's no reason to
 have it reverse the code units and completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post has led to this wrong idea. For this specific problem I am not going to cast to ubyte[] or ushort[] because it gives very wrong results. It's possible to write a "correct" (that doesn't take into account graphemes) reverse even if you do not use casts, keeping the array as char[] or wchar[], reversing the bytes, and then reversing the bytes of each variable-length codepoint. This is what I was asking to an in-place reverse().

I don't expect that std.string will _ever_ be grapheme-aware or be processed by default as a range of graphemes. That's far too expensive as far as performance goes. Rather, we're likely to have a wrapper and/or separate range-type which handles graphemes. Then if you want the extra correctness and are willing to pay the cost, you use that. As I understand it, std.regex does have the beginnings of such, but we do still need to have a range type of some variety (probably in std.utf) which fully supports graphemes. - Jonathan M Davis
Dec 09 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 I don't expect that std.string will _ever_ be grapheme-aware or be processed 
 by default as a range of graphemes.

OK, let's forget about graphemes in this discussion. Now I am not asking for a grapheme-aware reverse. Bye, bearophile
Dec 09 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 09, 2011 06:15:04 bearophile wrote:
 Jonathan M Davis:
 I don't expect that std.string will _ever_ be grapheme-aware or be
 processed by default as a range of graphemes.

OK, let's forget about graphemes in this discussion. Now I am not asking for a grapheme-aware reverse.

It sounded like you were, because you were complaining about how dchar did an exact reversal of the code points rather than taking combining code points into account. - Jonathan M Davis
Dec 09 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 It sounded like you were,

Right, I was :-) But you have changed my mind when you have explained me that nothing in std.algorithm is grapheme-aware. So I have reduced the amount of what I am asking for. Bye, bearophile
Dec 09 2011
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, December 09, 2011 13:27:01 Jonathan M Davis wrote:
 On Friday, December 09, 2011 12:44:37 bearophile wrote:
 Jonathan M Davis:
 It sounded like you were,

Right, I was :-) But you have changed my mind when you have explained me that nothing in std.algorithm is grapheme-aware. So I have reduced the amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse such that their code points are reversed (i.e. the result is the same as if you reversed an array of dchar). Well, I'm not sure that you can actually do that with the same efficiency. I'd have to think about it more. Regardless, the implementation would be _really_ complicated in comparison to how reverse works right now. char[] and wchar[] don't work with reverse, because their elements aren't swappable. So, you can't just swap elements as you iterate in from both ends. You'd have to be moving stuff off into temporaries as you swapped them, because the code point on one side wouldn't necessarily fit where the code point on the other side was, and in the worst case (i.e. all of the code points on one half of the string are multiple code units and all of those on the other side are single code units), you'd pretty much end up having to copy half the array while you waited for enough space to open up on one side to fit the characters from the other side. So, regardless of whether it has the same computational complexity as the current reverse, its memory requirements would be far more. I don't think that the request is completely unreasonable, but also I'm not sure it's acceptable for reverse to change its performance characteristics as much as would be required for it to work with arrays of char or wchar - particularly with regards to how much memory would be required. In general, the performance characteristics of the algorithms in Phobos don't vary much with regards to the type that that's used. I'm pretty sure that in terms of big-o notation, the memory complexity wouldn't match (though I don't recall exactly how big-o notation works with memory rather than computational complexity).

Well, it looks like Andrei had some free time this morning and figured it out. He has a pull request for it: https://github.com/D-Programming-Language/phobos/pull/359 - Jonathan M Davis
Dec 09 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 Well, it looks like Andrei had some free time this morning and figured it out. 
 He has a pull request for it:

Thanks you Andrei Alexandrescu and Jonathan :-) Bye, bearophile
Dec 09 2011
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, December 09, 2011 12:44:37 bearophile wrote:
 Jonathan M Davis:
 It sounded like you were,

Right, I was :-) But you have changed my mind when you have explained me that nothing in std.algorithm is grapheme-aware. So I have reduced the amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse such that their code points are reversed (i.e. the result is the same as if you reversed an array of dchar). Well, I'm not sure that you can actually do that with the same efficiency. I'd have to think about it more. Regardless, the implementation would be _really_ complicated in comparison to how reverse works right now. char[] and wchar[] don't work with reverse, because their elements aren't swappable. So, you can't just swap elements as you iterate in from both ends. You'd have to be moving stuff off into temporaries as you swapped them, because the code point on one side wouldn't necessarily fit where the code point on the other side was, and in the worst case (i.e. all of the code points on one half of the string are multiple code units and all of those on the other side are single code units), you'd pretty much end up having to copy half the array while you waited for enough space to open up on one side to fit the characters from the other side. So, regardless of whether it has the same computational complexity as the current reverse, its memory requirements would be far more. I don't think that the request is completely unreasonable, but also I'm not sure it's acceptable for reverse to change its performance characteristics as much as would be required for it to work with arrays of char or wchar - particularly with regards to how much memory would be required. In general, the performance characteristics of the algorithms in Phobos don't vary much with regards to the type that that's used. I'm pretty sure that in terms of big-o notation, the memory complexity wouldn't match (though I don't recall exactly how big-o notation works with memory rather than computational complexity). - Jonathan M Davis
Dec 09 2011