digitalmars.D.bugs - [Issue 391] New: .sort and .reverse break utf8 encoding
- d-bugmail puremagic.com Oct 02 2006
- Stewart Gordon <smjg_1998 yahoo.com> Oct 03 2006
- Derek Parnell <derek psyc.ward> Oct 03 2006
- Walter Bright <newshound digitalmars.com> Oct 03 2006
- Sean Kelly <sean f4.ca> Oct 04 2006
- Walter Bright <newshound digitalmars.com> Oct 04 2006
- Lionello Lunesu <lio lunesu.remove.com> Oct 04 2006
- Thomas Kuehne <thomas-dloop kuehne.cn> Oct 03 2006
- d-bugmail puremagic.com Oct 10 2006
- d-bugmail puremagic.com Dec 23 2006
- d-bugmail puremagic.com Jan 24 2007
- d-bugmail puremagic.com Apr 21 2009
- d-bugmail puremagic.com Nov 26 2010
http://d.puremagic.com/issues/show_bug.cgi?id=391

Summary: .sort and .reverse break utf8 encoding
Product: D
Version: unspecified
Platform: PC
OS/Version: All
Status: NEW
Severity: major
Priority: P2
Component: DMD
AssignedTo: bugzilla digitalmars.com
ReportedBy: ddparnell bigpond.com

import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");
    validate(a);
    writefln("sorted");
    validate(a.sort); // fails
    writefln("reversed");
    validate(a.reverse); // fails
}

--
Oct 02 2006
d-bugmail puremagic.com wrote:
<snip>
import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");
    validate(a);
    writefln("sorted");
    validate(a.sort); // fails
    writefln("reversed");
    validate(a.reverse); // fails
}
AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....

Stewart.
Oct 03 2006
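The distinction drawn above, between reversing individual array elements and reversing the Unicode characters they encode, can be seen in a minimal sketch. It assumes a present-day D2 compiler with Phobos rather than the 2006 built-in `.reverse` property under discussion; the helper names `isValidUtf8`, `reverseBytes`, and `reverseCodePoints` are hypothetical. As far as I understand the Phobos documentation, `std.algorithm.mutation.reverse` keeps multi-unit UTF sequences intact when given a `char[]`, while the same call on a `ubyte[]` view swaps raw bytes.

```d
import std.algorithm.mutation : reverse;
import std.utf : UTFException, validate;

/// Returns true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Reverses the raw UTF-8 code units (bytes), ignoring character boundaries.
char[] reverseBytes(string s)
{
    char[] copy = s.dup;
    reverse(cast(ubyte[]) copy); // the ubyte[] view shares memory with copy
    return copy;
}

/// Reverses by code point, leaving each multi-byte sequence in order.
char[] reverseCodePoints(string s)
{
    char[] copy = s.dup;
    reverse(copy);
    return copy;
}

void main()
{
    // Same text as the bug report: three 3-byte code points plus '\n'.
    string text = "\u3026\u2021\u3061\n";
    assert(isValidUtf8(text));
    assert(!isValidUtf8(reverseBytes(text)));
    assert(reverseCodePoints(text) == "\n\u3061\u2021\u3026");
}
```

The byte-level result starts with '\n' followed by a continuation byte, which is why validation rejects it.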
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
d-bugmail puremagic.com wrote:
<snip>
import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");
    validate(a);
    writefln("sorted");
    validate(a.sort); // fails
    writefln("reversed");
    validate(a.reverse); // fails
}
AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....
Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker. -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Oct 03 2006
Derek Parnell wrote:
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
d-bugmail puremagic.com wrote:
writefln("sorted");
validate(a.sort); // fails
writefln("reversed");
validate(a.reverse); // fails
AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....
Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.
.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first. Both behaviors will be fixed in the next update.
Oct 03 2006
Walter Bright wrote:
Derek Parnell wrote:
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
d-bugmail puremagic.com wrote:
writefln("sorted");
validate(a.sort); // fails
writefln("reversed");
validate(a.reverse); // fails
AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....
Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.
.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.
Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning. And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII). I'm starting to feel like people are harping on Unicode issues just for the sake of doing so rather than because these are actual problems. Can someone please explain what I'm missing? Sean
Oct 04 2006
Sean Kelly wrote:
Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning. And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII). I'm starting to feel like people are harping on Unicode issues just for the sake of doing so rather than because these are actual problems. Can someone please explain what I'm missing?
Collecting character usage frequency statistics is one such use. Read a text file into a buffer, sort the buffer, and dump the result! I don't mind the harping on it. Getting the details right is important, even if the details themselves aren't. Besides, it's an easy fix.
Oct 04 2006
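The frequency-statistics use case mentioned above (sort the buffer, then dump the result) could look roughly like this in present-day D2 with Phobos. This is only a sketch under that assumption: `sortedCodePoints` is a hypothetical helper, the sample string stands in for a file's contents, and decoding to `dchar[]` first is one way to make sure the sort moves whole code points rather than UTF-8 bytes.

```d
import std.algorithm.iteration : group;
import std.algorithm.sorting : sort;
import std.stdio : writefln;
import std.utf : toUTF32;

/// Decodes `text` to UTF-32 and returns its code points in sorted order.
dchar[] sortedCodePoints(string text)
{
    dchar[] points = text.toUTF32.dup; // one array element per code point
    sort(points);                      // sorts in place by code point value
    return points;
}

void main()
{
    // Hypothetical sample standing in for the contents of a text file.
    auto points = sortedCodePoints("abracadabra \u3061\u3061");

    // group collapses each run of equal code points into a (value, count) pair.
    foreach (entry; points.group)
        writefln("U+%04X occurs %s time(s)", cast(uint) entry[0], entry[1]);
}
```

Because equal code points end up adjacent after sorting, a single pass over the runs is enough to count them.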
Sean Kelly wrote:
Walter Bright wrote:
Derek Parnell wrote:
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
d-bugmail puremagic.com wrote:
writefln("sorted");
validate(a.sort); // fails
writefln("reversed");
validate(a.reverse); // fails
AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string. But hmm....
Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.
.sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.
Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning. And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII).
What if you want to use a quick binary search look-up to see if a text contains a given character? ;) Not that I've ever needed it, but it makes sense to just fix it. How often do you .reverse a string, for that matter? L.
Oct 04 2006
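The binary-search look-up suggested above can be sketched as follows, again assuming present-day D2 with Phobos rather than the 2006 built-in `.sort`. `sortedIndex` is a hypothetical helper; `sort` returns a `SortedRange`, whose `contains` method performs a binary search.

```d
import std.algorithm.sorting : sort;
import std.utf : toUTF32;

/// Builds a sorted array of the code points in `text`, ready for binary search.
auto sortedIndex(string text)
{
    // sort returns a SortedRange wrapping the array it sorted in place.
    return sort(text.toUTF32.dup);
}

void main()
{
    auto index = sortedIndex("h\u00E9llo \u3061");
    dchar present = '\u3061';
    dchar absent = 'z';

    // SortedRange.contains performs a binary search over the sorted code points.
    assert(index.contains(present));
    assert(!index.contains(absent));
}
```

The sort only pays off when the same text is queried many times; for a single look-up a linear scan is cheaper.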
d-bugmail puremagic.com wrote on 2006-10-02:
http://d.puremagic.com/issues/show_bug.cgi?id=391
import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");
    validate(a);
    writefln("sorted");
    validate(a.sort); // fails
    writefln("reversed");
    validate(a.reverse); // fails
}

Added to DStress as
http://dstress.kuehne.cn/run/r/reverse_08_A.d
http://dstress.kuehne.cn/run/r/reverse_08_B.d
http://dstress.kuehne.cn/run/r/reverse_08_C.d
http://dstress.kuehne.cn/run/s/sort_16_A.d
http://dstress.kuehne.cn/run/s/sort_16_B.d
http://dstress.kuehne.cn/run/s/sort_16_C.d

Thomas
Oct 03 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391

bugzilla digitalmars.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from bugzilla digitalmars.com 2006-10-10 03:29 -------
Fixed DMD 0.169

--
Oct 10 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391

thomas-dloop kuehne.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Comment #3 from thomas-dloop kuehne.cn 2006-12-23 07:10 -------
Process terminating with default action of signal 11 (SIGSEGV)
 Bad permissions for mapped region at address 0x805A0EC
   at 0x80544A3: _D3std8typeinfo8ti_dchar10TypeInfo_w4swapMFPvPvZv (in run/s/sort_16_A.d.exe)
   by 0x8050ACD: _adSort (in run/s/sort_16_A.d.exe)
   by 0x804A0F4: _Dmain (in run/s/sort_16_A.d:17)
   by 0x804BBE6: main (in run/s/sort_16_A.d.exe)

--
Dec 23 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391

thomas-dloop kuehne.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from thomas-dloop kuehne.cn 2007-01-24 07:46 -------
Fixed indeed in DMD 0.169.
The test cases failed due to missing dups, and thus trying to sort a constant string in place.

--
Jan 24 2007
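The "missing dups" explanation above amounts to sorting a string literal in place, which writes to memory the program may not own as writable (hence the SIGSEGV in Comment #3). A minimal sketch of the fix follows, assuming present-day D2 with Phobos; `sortedCopy` is a hypothetical name. In D2 string literals are immutable, so the compiler rejects an in-place sort of the literal itself and an explicit `.dup` is needed to get a mutable copy.

```d
import std.algorithm.sorting : sort;

/// Returns a sorted, mutable copy and leaves the original untouched.
dchar[] sortedCopy(dstring s)
{
    dchar[] copy = s.dup; // private, writable copy of the immutable data
    sort(copy);           // in-place sort touches only the copy
    return copy;
}

void main()
{
    dstring literal = "dcba"d;
    assert(sortedCopy(literal) == "abcd"d);
    assert(literal == "dcba"d); // the literal itself was never written to
}
```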
http://d.puremagic.com/issues/show_bug.cgi?id=391

clugdbug yahoo.com.au changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Comment #5 from clugdbug yahoo.com.au 2009-04-21 03:54 -------
This case (cut down from reverse_08_C) is still failing.

int main(){
    wchar[] a = "a\U00000081b\U00002000c\U00010000";
    wchar[] b = a.dup;
    b.reverse; // OK
    b.reverse; // fails
    return 0;
}

--
Apr 21 2009
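For the UTF-16 case above, the last character (U+10000) is stored as a surrogate pair, so a reversal that swaps 16-bit units would split the pair. One way to sidestep that, sketched here under the assumption of present-day D2 with Phobos, is to decode to UTF-32, reverse, and re-encode. `reverseUtf16` is a hypothetical helper and is not the runtime routine that the bug report is about.

```d
import std.algorithm.mutation : reverse;
import std.utf : toUTF16, toUTF32, validate;

/// Reverses UTF-16 text by code point rather than by 16-bit code unit.
wchar[] reverseUtf16(const(wchar)[] s)
{
    dchar[] points = s.toUTF32.dup; // surrogate pairs become single elements
    reverse(points);
    return points.toUTF16.dup;      // re-encode; U+10000 becomes a pair again
}

void main()
{
    // Same text as the reduced test case; U+10000 needs a surrogate pair.
    wstring a = "a\U00000081b\U00002000c\U00010000"w;

    wchar[] once = reverseUtf16(a);
    validate(once);
    assert(once == "\U00010000c\U00002000b\U00000081a"w);

    // Reversing a second time should restore the original text.
    wchar[] twice = reverseUtf16(once);
    validate(twice);
    assert(twice == a);
}
```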
http://d.puremagic.com/issues/show_bug.cgi?id=391

Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |ASSIGNED
                 CC|                            |andrei metalanguage.com
            Version|1.00                        |D1 & D2

--- Comment #6 from Andrei Alexandrescu <andrei metalanguage.com> 2010-11-26 11:30:22 PST ---
Don's latest fails both on 1.065 and 2.050. Marking as a D1 & D2 issue.

--
Nov 26 2010








