
digitalmars.D.bugs - [Issue 391] New: .sort and .reverse break utf8 encoding

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391

           Summary: .sort and .reverse break utf8 encoding
           Product: D
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: ddparnell bigpond.com


import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");    validate(a);
    writefln("sorted");   validate(a.sort);  // fails
    writefln("reversed"); validate(a.reverse); // fails
}


-- 
Oct 02 2006
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
d-bugmail puremagic.com wrote:
<snip>
 import std.utf;
 import std.stdio;
 void main()
 {
     char[] a;
     a = "\u3026\u2021\u3061\n";
     writefln("plain");    validate(a);
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails
 }

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- C++ a->--- UB P+ L E W++ N+++ o K- w++ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++++ h-- r-- !y
------END GEEK CODE BLOCK------
My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Oct 03 2006
parent reply Derek Parnell <derek psyc.ward> writes:
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:

 d-bugmail puremagic.com wrote:
 <snip>
 import std.utf;
 import std.stdio;
 void main()
 {
     char[] a;
     a = "\u3026\u2021\u3061\n";
     writefln("plain");    validate(a);
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails
 }

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Oct 03 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
 
 d-bugmail puremagic.com wrote:
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.

.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first. Both behaviors will be fixed in the next update.
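
The difference between the two behaviors can be sketched in present-day D2, where the built-in .sort and .reverse properties discussed here no longer exist and the Phobos functions are used instead. This is only an illustration under that assumption, not the DMD fix itself; the helper names reverseCodeUnits and reverseCodePoints are made up for the example.

```d
import std.algorithm.mutation : reverse;
import std.array : array;
import std.exception : assertThrown;
import std.utf : toUTF8, UTFException, validate;

/// Hypothetical helper: reverse the raw UTF-8 code units (bytes).
char[] reverseCodeUnits(string s)
{
    ubyte[] bytes = cast(ubyte[]) s.dup; // mutable copy, viewed as bytes
    reverse(bytes);
    return cast(char[]) bytes;
}

/// Hypothetical helper: reverse whole code points.
string reverseCodePoints(string s)
{
    dchar[] points = s.array; // array() decodes a narrow string into dchar[]
    reverse(points);
    return toUTF8(points);    // re-encode as UTF-8
}

void main()
{
    string s = "\u3026\u2021\u3061";

    // Byte-wise reversal puts continuation bytes first, so validation fails.
    assertThrown!UTFException(validate(reverseCodeUnits(s)));

    // Code-point reversal stays valid UTF-8.
    string r = reverseCodePoints(s);
    validate(r);
    assert(r == "\u3061\u2021\u3026");
}
```

Note that this works per code point, so combining sequences would still be split from their base characters.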
Oct 03 2006
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Derek Parnell wrote:
 On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:

 d-bugmail puremagic.com wrote:
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.

.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.

Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning. And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII).

I'm starting to feel like people are harping on Unicode issues just for the sake of doing so rather than because these are actual problems. Can someone please explain what I'm missing?

Sean
Oct 04 2006
next sibling parent Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
 Changing the behavior of .reverse kind of makes sense, but I don't 
 understand the reason for changing .sort aside from consistency. 
 Personally, I've never had a reason to sort a char array in the first 
 place unless the chars were intended to represent something other than 
 their lexical meaning.  And that aside, sorting chars in a string 
 without a comparison predicate will do so using the char's binary value, 
 which has no lexical significance beyond the 26 letters of the English 
 alphabet (as represented in ASCII).  I'm starting to feel like people 
 are harping on Unicode issues just for the sake of doing so rather than 
 because these are actual problems.  Can someone please explain what I'm 
 missing?

Collecting character usage frequency statistics is one such use. Read a text file into a buffer, sort the buffer, and dump the result!

I don't mind the harping on it. Getting the details right is important, even if the details themselves aren't. Besides, it's an easy fix.
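
The frequency-statistics idea might look roughly like this in current D2 with Phobos. It is a sketch, not code from the thread: the function name charFrequencies and the sample text are assumptions, and a real tool would read the text from a file rather than a literal.

```d
import std.algorithm.iteration : group;
import std.algorithm.sorting : sort;
import std.array : array;
import std.stdio : writefln;

/// Hypothetical helper: (code point, count) pairs in ascending code point order.
auto charFrequencies(string text)
{
    dchar[] points = text.array; // decode UTF-8 into code points
    sort(points);                // equal characters become adjacent
    return points.group.array;   // collapse each run into a (value, count) tuple
}

void main()
{
    foreach (entry; charFrequencies("banana \u3061\u3061"))
        writefln("U+%04X: %s", cast(uint) entry[0], entry[1]);
}
```

Sorting the decoded code points rather than the raw bytes is what keeps multi-byte characters countable as single entries.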
Oct 04 2006
prev sibling parent Lionello Lunesu <lio lunesu.remove.com> writes:
Sean Kelly wrote:
 Walter Bright wrote:
 Derek Parnell wrote:
 On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:

 d-bugmail puremagic.com wrote:
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.

.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.

Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning. And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII).

What if you want to use a quick binary search look-up to see if a text contains a given character? ;) Not that I've ever needed it, but it makes sense to just fix it. How often do you .reverse a string, for that matter?

L.
Oct 04 2006
prev sibling next sibling parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

d-bugmail puremagic.com schrieb am 2006-10-02:
 http://d.puremagic.com/issues/show_bug.cgi?id=391

 import std.utf;
 import std.stdio;
 void main()
 {
     char[] a;
     a = "\u3026\u2021\u3061\n";
     writefln("plain");    validate(a);
     writefln("sorted");   validate(a.sort);  // fails
     writefln("reversed"); validate(a.reverse); // fails
 }

Added to DStress as
http://dstress.kuehne.cn/run/r/reverse_08_A.d
http://dstress.kuehne.cn/run/r/reverse_08_B.d
http://dstress.kuehne.cn/run/r/reverse_08_C.d
http://dstress.kuehne.cn/run/s/sort_16_A.d
http://dstress.kuehne.cn/run/s/sort_16_B.d
http://dstress.kuehne.cn/run/s/sort_16_C.d

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFI033LK5blCcjpWoRAgxQAJ4soetJ+LZHkmwiFl5YqkGdrjmOjACeI2GG
wkC8F4+qfNmVEbLeUT0t06g=
=HqWF
-----END PGP SIGNATURE-----
Oct 03 2006
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391


bugzilla digitalmars.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Comment #2 from bugzilla digitalmars.com  2006-10-10 03:29 -------
Fixed DMD 0.169


-- 
Oct 10 2006
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391


thomas-dloop kuehne.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |




------- Comment #3 from thomas-dloop kuehne.cn  2006-12-23 07:10 -------
Process terminating with default action of signal 11 (SIGSEGV)
Bad permissions for mapped region at address 0x805A0EC

at 0x80544A3: _D3std8typeinfo8ti_dchar10TypeInfo_w4swapMFPvPvZv (in run/s/sort_16_A.d.exe)
by 0x8050ACD: _adSort (in run/s/sort_16_A.d.exe)
by 0x804A0F4: _Dmain (in run/s/sort_16_A.d:17)
by 0x804BBE6: main (in run/s/sort_16_A.d.exe)


-- 
Dec 23 2006
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391


thomas-dloop kuehne.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED




------- Comment #4 from thomas-dloop kuehne.cn  2007-01-24 07:46 -------
Fixed indeed in DMD 0.169

The test cases failed due to missing .dup calls, and thus tried to sort a constant
string in place.
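
For readers hitting the same crash: the pattern is to take a mutable copy before sorting in place. A minimal sketch in current D2 follows, where the type system already rejects mutating a literal; the helper name sortedCopy is invented for the example.

```d
import std.algorithm.sorting : sort;

/// Hypothetical helper: sort a copy and leave the input untouched.
dchar[] sortedCopy(const(dchar)[] text)
{
    dchar[] copy = text.dup; // mutable copy; the literal itself may be read-only
    sort(copy);              // in-place sort now only touches the copy
    return copy;
}

void main()
{
    dstring original = "\u3061\u2021\u3026"d;
    assert(sortedCopy(original) == "\u2021\u3026\u3061"d);
    assert(original == "\u3061\u2021\u3026"d); // source string is unchanged
}
```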


-- 
Jan 24 2007
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391


clugdbug yahoo.com.au changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |




------- Comment #5 from clugdbug yahoo.com.au  2009-04-21 03:54 -------
This case (cut down from reverse_08_C) is still failing.

int main(){
   wchar[] a = "a\U00000081b\U00002000c\U00010000";
   wchar[] b = a.dup;

   b.reverse; // OK
   b.reverse; // fails
   return 0;
}
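
For comparison, a surrogate-safe reversal of UTF-16 can be sketched in current D2 by decoding to code points first. This is not the runtime's .reverse implementation, only an assumption-laden illustration of the expected round-trip behavior; reverseUtf16 is a made-up name.

```d
import std.algorithm.mutation : reverse;
import std.array : array;
import std.utf : toUTF16, validate;

/// Hypothetical helper: reverse a UTF-16 string by code point.
wstring reverseUtf16(wstring s)
{
    dchar[] points = s.array; // decoding joins each surrogate pair into one dchar
    reverse(points);
    return toUTF16(points);   // re-encode, recreating surrogate pairs as needed
}

void main()
{
    wstring a = "a\U00000081b\U00002000c\U00010000"w;
    wstring once = reverseUtf16(a);
    validate(once);                  // still well-formed UTF-16
    assert(reverseUtf16(once) == a); // reversing twice restores the input
}
```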


-- 
Apr 21 2009
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=391


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |ASSIGNED
                 CC|                            |andrei metalanguage.com
            Version|1.00                        |D1 & D2


--- Comment #6 from Andrei Alexandrescu <andrei metalanguage.com> 2010-11-26
11:30:22 PST ---
Don's latest fails both on 1.065 and 2.050. Marking as a D1 & D2 issue.

-- 
Nov 26 2010