digitalmars.D - Unicode handling comparison

reply "bearophile" <bearophileHUGS lycos.com> writes:
Through Reddit I have seen this small comparison of Unicode 
handling between different programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

D+Phobos seem to fail most things (it produces BAFFLE):
http://dpaste.dzfl.pl/a5268c435

Bye,
bearophile
Nov 27 2013
next sibling parent Simen Kjærås <simen.kjaras gmail.com> writes:
On 2013-11-27 13:46, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435
Indeed it does. Have you tried with std.uni? -- Simen
Nov 27 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 D+Phobos seem to fail most things (it produces BAFFLE):
I still think we're doing pretty good. At least, we *handle* unicode at all (looking at you C++). And we handle *true* unicode, not BMP style UCS (looking at you Java/C#), with the options of storing said strings in any encoding: UTF8 through UTF32, and the possibility to also have ASCII. We don't yet totally handle things like diacritics or ligatures, but we are getting there. As a whole, I find that D is incredibly "unicode correct enough" out of the box, and with no extra effort involved.
Nov 27 2013
prev sibling next sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435
If you need to perform these kinds of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.

As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.

David
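A minimal sketch of the normalization step described above, assuming a Phobos release in which std.uni.normalize compiles. The helper name sameAfterNFC is made up for this illustration and is not a Phobos function.

```d
import std.range : walkLength;
import std.uni; // normalize, NFC

/// Hypothetical helper: bring both sides to NFC, then compare.
bool sameAfterNFC(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed  = "noe\u0308l"; // 'e' followed by U+0308 COMBINING DIAERESIS
    string precomposed = "no\u00EBl";  // single code point U+00EB

    // The two spellings are different code point sequences...
    assert(decomposed != precomposed);
    assert(decomposed.walkLength == 5);

    // ...but they agree once both are in NFC.
    assert(normalize!NFC(decomposed).walkLength == 4);
    assert(sameAfterNFC(decomposed, precomposed));
}
```

Note that normalize allocates a new string when the input is not already in the requested form, which is the cost being discussed here.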
Nov 27 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-11-27 15:45, David Nadlinger wrote:

 If you need to perform this kind of operations on Unicode strings in D,
 you can call normalize (std.uni) on the string first to make sure it is
 in one of the Normalization Forms. For example, just appending
 .normalize to your strings (which defaults to NFC) would make the code
 produce the "expected" results.
That didn't work out very well: std/uni.d(6301): Error: undefined identifier tuple -- /Jacob Carlborg
Nov 27 2013
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 27 November 2013 at 15:03:37 UTC, Jacob Carlborg 
wrote:
 std/uni.d(6301): Error: undefined identifier tuple
Yeah, I saw it too. The fix is simple:

https://github.com/D-Programming-Language/phobos/pull/1728

tbh this makes me think version(unittest) might just be considered harmful. I'm sure that code passed the tests, but only because a vital import was in a version(unittest) section!
Nov 27 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-11-27 16:07, Adam D. Ruppe wrote:

 Yeah, I saw it too. The fix is simple:

 https://github.com/D-Programming-Language/phobos/pull/1728

 tbh this makes me think version(unittest) might just be considered
 harmful. I'm sure that code passed the tests, but only because a vital
 import was in a version(unittest) section!
You were faster. But I created an issue as well. -- /Jacob Carlborg
Nov 27 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
David Nadlinger:

 If you need to perform this kind of operations on Unicode 
 strings in D, you can call normalize (std.uni) on the string 
 first to make sure it is in one of the Normalization Forms. For 
 example, just appending .normalize to your strings (which 
 defaults to NFC) would make the code produce the "expected" 
 results.

 As far as I'm aware, this behavior is the result of a 
 deliberate decision, as normalizing strings on the fly isn't 
 really cheap.
Thank you :-) Bye, bearophile
Nov 27 2013
prev sibling next sibling parent reply "Wyatt" <wyatt.epp gmail.com> writes:
On Wednesday, 27 November 2013 at 14:45:32 UTC, David Nadlinger 
wrote:
 If you need to perform this kind of operations on Unicode 
 strings in D, you can call normalize (std.uni) on the string 
 first to make sure it is in one of the Normalization Forms. For 
 example, just appending .normalize to your strings (which 
 defaults to NFC) would make the code produce the "expected" 
 results.
Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive. I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors.
 As far as I'm aware, this behavior is the result of a 
 deliberate decision, as normalizing strings on the fly isn't 
 really cheap.
I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?) -Wyatt
Nov 27 2013
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
 Seems like a pretty big "gotcha" from a usability standpoint; 
 it's not exactly intuitive.  I understand WHY this decision was 
 made, but it feels like a source of code smell and weird string 
 comparison errors.
It probably is, but it is a Unicode gotcha, not a D one.
Nov 27 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-11-27 17:15, Wyatt wrote:

 I don't remember if it was brought up before, but this makes me wonder
 if something like an i18nString should exist for cases where it IS
 important.  Making i18n stuff as simple as it looks like it "should" be
 has merit, IMO.  (Maybe there's even room for a std.string.i18n submodule?)
I think we should have that. -- /Jacob Carlborg
Nov 27 2013
prev sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
 I don't remember if it was brought up before, but this makes me 
 wonder if something like an i18nString should exist for cases 
 where it IS important.  Making i18n stuff as simple as it looks 
 like it "should" be has merit, IMO.  (Maybe there's even room 
 for a std.string.i18n submodule?)

 -Wyatt
What would it do that std.uni doesn't already? i18nString sounds like a range of graphemes to me.

I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.

In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.
Nov 27 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?
A class/struct that handles all these normalizations and other stuff automatically. -- /Jacob Carlborg
Nov 27 2013
parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg 
wrote:
 On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?
A class/struct that handles all these normalizations and other stuff automatically.
Sounds terrible :)
Nov 27 2013
parent reply "Dicebot" <public dicebot.lv> writes:
On Wednesday, 27 November 2013 at 17:37:48 UTC, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg 
 wrote:
 On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?
A class/struct that handles all these normalizations and other stuff automatically.
Sounds terrible :)
+1

Working with graphemes is a rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides a continuous transition through abstraction levels here. It is important to make the costs obvious.
Nov 27 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-11-27 18:56, Dicebot wrote:

 +1

 Working with graphemes is rather expensive thing to do performance-wise.
 I like how D makes this fact obvious and provides continuous transition
 through abstraction levels here. It is important to make the costs obvious.
I think it's missing a final high level abstraction. As with the rest of the abstractions you're not forced to use them. -- /Jacob Carlborg
Nov 27 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Nov-2013 22:54, Jacob Carlborg пишет:
 On 2013-11-27 18:56, Dicebot wrote:

 +1

 Working with graphemes is rather expensive thing to do performance-wise.
 I like how D makes this fact obvious and provides continuous transition
 through abstraction levels here. It is important to make the costs
 obvious.
I think it's missing a final high level abstraction. As with the rest of the abstractions you're not forced to use them.
This could give an idea of what Perl folks do to make a grapheme feel like a unit of a string:

http://www.parrot.org/content/ucs-4-nfg-and-how-grapheme-tables-makes-it-awesome

You seriously don't want this kind of behind-the-scenes work taking place in a systems language.

P.S. The text linked presents some incorrect "facts" about Unicode that I'm not to be held responsible for :) I do believe, however, that the general idea described is interesting and is worth trying out in addition to what we have in std.uni.

-- 
Dmitry Olshansky
Nov 27 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Nov 27, 2013 at 06:22:41PM +0100, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
I don't remember if it was brought up before, but this makes me
wonder if something like an i18nString should exist for cases
where it IS important.  Making i18n stuff as simple as it looks
like it "should" be has merit, IMO.  (Maybe there's even room for
a std.string.i18n submodule?)

-Wyatt
What would it do that std.uni doesn't already? i18nString sounds like a range of graphemes to me.
Maybe it should be called graphemeString? I'm not sure what this has to do with i18n, though. Properly done i18n should use Unicode line-breaking algorithms and other such standardized functions, rather than manipulating graphemes directly (which fails to take into account double-width characters, language-specific decomposition rules, and many other gotchas, not to mention poorly-performing). AFAIK std.uni already provides a way to extract graphemes when you need it (e.g., for rendering fonts), so there's really no reason to default to graphemeString everywhere in your program. *That* is a sign of poorly written code, IMNSHO.
 I would like a convenient function in std.uni to get such a range of
 graphemes from a range of points, but I wouldn't want to elevate it to
 any particular status; that would be a knee-jerk reaction. D's
 granularity when it comes to Unicode is because there is an
 appropriate level of representation for each domain. Shoe-horning
 everything into a range of graphemes is something we should avoid.
 
 In D, we can write code that is both Unicode-correct and highly
 performant, while still being simple and pleasant to read. To write
 such code, one must have a modicum of understanding of how Unicode
 works (in order to choose the right tools from the toolbox), but I
 think it's a novel compromise.
Agreed. T -- MASM = Mana Ada Sistem, Man!
Nov 27 2013
prev sibling next sibling parent "Wyatt" <wyatt.epp gmail.com> writes:
On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:
 i18nString sounds like a range of graphemes to me.
Maybe. If I had called it...say, "normalisedString"? Would you still think that? That was an off-the-cuff name because my morning brain imagined that this sort of thing would be useful for user input where you can't make assumptions about its form.
 I would like a convenient function in std.uni to get such a 
 range of graphemes from a range of points, but I wouldn't want 
 to elevate it to any particular status; that would be a 
 knee-jerk reaction. D's granularity when it comes to Unicode is 
 because there is an appropriate level of representation for 
 each domain. Shoe-horning everything into a range of graphemes 
 is something we should avoid.
Okay, hold up. It's a bit late to prevent everyone from diving down this rabbit hole, but let me be clear: This really isn't about graphemes. Not really. They may be involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think he's being unfair in expecting "noël" to have a length of four no matter how it was composed. I don't think it's unfair to expect that "noël".take(3) returns "noë", and I don't think it's unfair that reversing it should be "lëon". All the places where his expectations were defied (and more!) are implementation details. While I stated before that I don't necessarily have anything against people learning more about Unicode, neither do I fundamentally believe that's something a lot of people _need_ to worry about.

I'm not saying the default string in D should change or anything crazy like that. All I'm suggesting is maybe, rather than telling people they should read a small book about the most arcane stuff imaginable and then explaining which tool does what when that doesn't take, we could just tell them "Here, use this library type where you need it" with the admonishment that it may be too slow if abused. I think THAT could be useful.
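For reference, a sketch of how the take and reverse expectations above can be met with grapheme ranges. It assumes a Phobos release that provides std.uni.byGrapheme and std.uni.byCodePoint, which were not available when this thread was written. The helper names firstGraphemes and reverseGraphemes are invented for this example.

```d
import std.array : array;
import std.range : retro, take;
import std.uni : byCodePoint, byGrapheme;
import std.utf : toUTF8;

/// Hypothetical helper: the first n user-perceived characters of s.
string firstGraphemes(string s, size_t n)
{
    return s.byGrapheme.take(n).byCodePoint.array.toUTF8;
}

/// Hypothetical helper: reverse s one grapheme cluster at a time,
/// so combining marks stay attached to their base character.
string reverseGraphemes(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.array.toUTF8;
}

void main()
{
    string s = "noe\u0308l"; // decomposed spelling of "noël"

    assert(firstGraphemes(s, 3) == "noe\u0308"); // "noë", diaeresis intact
    assert(reverseGraphemes(s) == "le\u0308on"); // "lëon"
}
```

Both helpers allocate intermediate arrays, which is in line with the warning that such a convenience layer may be too slow if abused.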
 In D, we can write code that is both Unicode-correct and highly 
 performant, while still being simple and pleasant to read. To 
 write such code, one must have a modicum of understanding of 
 how Unicode works (in order to choose the right tools from the 
 toolbox), but I think it's a novel compromise.
See, this sways me only a little bit. The reason for that is, often, convenience greatly trumps elegance or performance. Sure I COULD write something in C to look for obvious bad stuff in my syslog, but would I bother when I have a shell with pipes, grep, cut, and sed? This all isn't to say I don't LIKE performance and elegance; but I live, work, and play on both sides of this spectrum, and I'd like to think they can peacefully coexist without too much fuss. -Wyatt
Nov 27 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/27/2013 9:22 AM, Jakob Ovrum wrote:
 In D, we can write code that is both Unicode-correct and highly performant,
 while still being simple and pleasant to read. To write such code, one must
have
 a modicum of understanding of how Unicode works (in order to choose the right
 tools from the toolbox), but I think it's a novel compromise.
Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes. http://dlang.org/glossary.html#narrow strings Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.
Nov 28 2013
next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright 
wrote:
 Sadly, std.array is determined to decode (i.e. convert to 
 dchar[]) all your strings when they are used as ranges. This 
 means that all algorithms on strings will be crippled as far as 
 performance goes.

 http://dlang.org/glossary.html#narrow strings

 Very, very few operations on strings need decoding. The 
 decoding should have gone into a separate layer.
Decoding by default means that algorithms can work reasonably with strings without being designed specifically for strings. The algorithms can then later be specialized for narrow strings, which I believe is happening for a few algorithms in std.algorithm like substring search. Decoding is still available as a separate layer through std.utf, when more control over decoding is required.
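As an illustration of the separate std.utf layer mentioned above, here is a small sketch that decodes explicitly with an index instead of relying on the range interface. std.utf.decode and std.utf.stride are existing functions; countCodePoints is a made-up helper for this example.

```d
import std.utf : decode, stride;

/// Hypothetical helper: count code points by stepping over code units,
/// without producing any dchar values.
size_t countCodePoints(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        i += stride(s, i); // width in code units of the sequence at i
        ++n;
    }
    return n;
}

void main()
{
    string s = "no\u00EBl"; // U+00EB occupies two UTF-8 code units

    size_t i = 2;
    assert(stride(s, i) == 2);

    // decode returns the code point and advances the index past it.
    dchar c = decode(s, i);
    assert(c == '\u00EB' && i == 4);

    assert(countCodePoints(s) == 4);
}
```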
Nov 28 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 This means that all algorithms on strings will be crippled
 as far as performance goes.
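If you want to sort an array of chars you need to use a dchar[], or code like this:

    import std.algorithm : sort;
    import std.string : representation;

    char[] word = "just a test".dup;
    // Sorts the raw code units in place; fine here because the text is pure ASCII.
    auto sword = cast(char[])word.representation.sort().release;

See: http://d.puremagic.com/issues/show_bug.cgi?id=10162

Bye,
bearophile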
Nov 28 2013
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright
wrote:
 Sadly,
I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.
 std.array is determined to decode (i.e. convert to dchar[]) all 
 your strings when they are used as ranges.
 This means that all algorithms on
 strings will be crippled as far as performance goes.
Quite a few algorithms in array/algorithm/string *don't* decode the string when they don't need to actually.
 Very, very few operations on strings need decoding. The 
 decoding should have gone into a separate layer.
Which operations are you thinking of in std.array that decode when they shouldn't?
Nov 28 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/28/2013 5:24 AM, monarch_dodra wrote:
 Which operations are you thinking of in std.array that decode
 when they shouldn't?
front() in std.array looks like:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
        size_t i = 0;
        return decode(a, i);
    }

So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
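A short sketch of what this means in practice, assuming a Phobos release where strings are still auto-decoded by the range primitives. The helper names are invented for this example; std.string.representation is the existing way to view a string as raw code units.

```d
import std.range : ElementType, front;
import std.string : representation;

/// Hypothetical helper: what generic range code sees first (decoded).
dchar firstCodePoint(string s)
{
    return s.front;
}

/// Hypothetical helper: what is actually stored first (no decoding).
ubyte firstCodeUnit(string s)
{
    return s.representation.front;
}

void main()
{
    string s = "\u00EB!"; // U+00EB is stored as the bytes C3 AB in UTF-8

    // As a range, a string has dchar elements even though it stores chars.
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));

    assert(firstCodePoint(s) == '\u00EB');
    assert(firstCodeUnit(s) == 0xC3);
}
```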
Nov 28 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote:
 On 11/28/2013 5:24 AM, monarch_dodra wrote:
Which operations are you thinking of in std.array that decode
when they shouldn't?
front() in std.array looks like:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
        size_t i = 0;
        return decode(a, i);
    }

So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
OTOH, it is actually correct by default. If it *didn't* decode, things like std.algorithm.sort and std.range.retro would mangle all your multibyte UTF-8 characters.

Having said that, though, it would be nice if there were a standard ASCII string type that didn't decode by default. Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters. Maybe we want something like this:

    struct AsciiString {
        immutable(ubyte)[] impl;
        alias impl this;

        // This is so that .front returns char instead of ubyte
        @property char front() { return cast(char) impl[0]; }
        char opIndex(size_t idx) { ... /* ditto */ }
        ... // other range methods here
    }

    AsciiString assumeAscii(string s) {
        return AsciiString(cast(immutable(ubyte)[]) s);
    }

T

-- 
"640K ought to be enough" -- Bill G., 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
Nov 28 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
http://dlang.org/phobos/std_encoding.html#.AsciiString ?
Nov 28 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote:
 http://dlang.org/phobos/std_encoding.html#.AsciiString ?
Yeah, that or just ubyte[]. The problem with both of these though, is printing :/ (which prints ugly as sin)

Something like:

    struct AsciiChar
    {
        private char c;
        alias c this;
    }

Could be a very easy and efficient alternative.
Nov 28 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/28/2013 10:19 AM, H. S. Teoh wrote:
 Always decoding strings
 *is* slow, esp. when you already know that it only contains ASCII
 characters.
It doesn't have to be merely ASCII. You can do string substring searches without any need for decoding, for example. You don't even need decoding to do regex. Decoding is rarely needed.
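To make the substring-search claim concrete, a sketch that searches the raw code units and never decodes. It relies on the fact that UTF-8 is self-synchronizing, so a valid needle can only match at a code point boundary of a valid haystack. indexOfBytes is a made-up helper, not a Phobos function.

```d
import std.algorithm : find;
import std.string : representation;

/// Hypothetical helper: byte offset of needle in haystack, or -1.
/// Works on code units only; both inputs are assumed to be valid UTF-8.
ptrdiff_t indexOfBytes(string haystack, string needle)
{
    auto hit = haystack.representation.find(needle.representation);
    if (hit.length == 0 && needle.length != 0)
        return -1; // not found
    return haystack.length - hit.length;
}

void main()
{
    string haystack = "Unicode: no\u00EBl!";

    auto at = indexOfBytes(haystack, "\u00EBl");
    assert(at == 11);                         // 9 bytes of "Unicode: " plus "no"
    assert(haystack[at .. $] == "\u00EBl!");  // the offset is a valid slice point

    assert(indexOfBytes(haystack, "xyz") == -1);
}
```

The search is exact on bytes, so it still depends on both strings being in the same normalization form.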
Nov 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Nov-2013 17:24, monarch_dodra пишет:
 On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright
 wrote:
 Sadly,
I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.
The greatest problem, surprisingly, is that you can't apply range functions to the implicit code unit range even if you REALLY wanted to. To not go far away, the only reason std.regex can't take e.g. retro of a string:

    match(retro("hleb"), ".el.");

is the automatic dumbing down at the moment you apply a range adapter. What I'd need in std.regex is a code unit range that by convention also "happens to be" a range of code points.

The second problem is that string code is carefully special-cased, but the effort is completely wasted the moment you have a slice of chars that comes from anywhere other than built-in strings (a circular buffer, for instance).

I had a (somewhat cloudy) vision of settling the encoded ranges problem once and for all. That includes defining the notion of an encoded range that is two in one: some stronger (as in capabilities) range of code elements, and the default decoded view imposed on top of it (which can be weaker).

-- 
Dmitry Olshansky
Nov 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/28/2013 11:32 AM, Dmitry Olshansky wrote:
 I had a (a bit cloudy) vision of settling encoded ranges problem once and for
 good. That includes defining notion of an encoded range that is 2 in one: some
 stronger (as in capabilities) range of code elements and the default decoded
 view imposed on top of it (that can be weaker).
I suspect the correct approach would be to have the range over string to produce bytes. If you want decoded values, then run it through an adapter algorithm.
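A sketch of that layering, assuming a Phobos release that provides std.utf.byCodeUnit and std.utf.byDchar (both were added after this thread). The string is first exposed as a range of code units, and decoding is an adapter applied only where it is wanted. The two helper names are invented for this example.

```d
import std.algorithm : equal;
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: length of the undecoded view.
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Hypothetical helper: length after running the decoding adapter.
size_t codePointCount(string s)
{
    return s.byCodeUnit.byDchar.walkLength;
}

void main()
{
    string s = "no\u00EBl";

    assert(codeUnitCount(s) == 5);  // n, o, two bytes for U+00EB, l
    assert(codePointCount(s) == 4); // decoding requested explicitly

    // The adapter yields the same code points as the dstring literal.
    assert(s.byCodeUnit.byDchar.equal("no\u00EBl"d));
}
```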
Nov 28 2013
prev sibling next sibling parent Charles Hixson <charleshixsn earthlink.net> writes:
On 11/27/2013 06:45 AM, David Nadlinger wrote:
 On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling 
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435
If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap. David
I don't like the overhead, and I don't know how important this is, but perhaps the best way to solve it would be to have string include a "normalization" byte, saying whether it was normalized, and if so in what way. That there can be multiple ways of normalizing is painful, but it *is* the standard. And this would allow normalization to be skipped whenever the comparison of two strings showed the same normalization (or lack thereof). What to do if they're normalized differently is a bit of a puzzle, but most reasonable solutions would work for most cases, so you just need a way to override the defaults. -- Charles Hixson
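One way the tagging idea might look, purely as a sketch: TaggedString, tagNFC and sameText are hypothetical names, nothing like them exists in Phobos, and the fallback for differently tagged strings is just one of the "reasonable solutions" alluded to above.

```d
import std.uni; // normalize, NormalizationForm, NFC

/// Hypothetical type: text plus a record of its normalization state.
struct TaggedString
{
    string data;
    bool tagged;            // false: normalization state unknown
    NormalizationForm form; // meaningful only when tagged is true
}

/// Hypothetical helper: normalize once and remember that it was done.
TaggedString tagNFC(string s)
{
    return TaggedString(normalize!NFC(s), true, NormalizationForm.NFC);
}

/// Hypothetical comparison: skip normalization when both tags agree.
bool sameText(TaggedString a, TaggedString b)
{
    if (a.tagged && b.tagged && a.form == b.form)
        return a.data == b.data;
    // Fallback chosen for this sketch only: compare both under NFC.
    return normalize!NFC(a.data) == normalize!NFC(b.data);
}

void main()
{
    auto a = tagNFC("noe\u0308l");
    auto b = tagNFC("no\u00EBl");
    assert(sameText(a, b)); // fast path, no normalization at compare time

    auto raw = TaggedString("noe\u0308l"); // untagged input
    assert(sameText(raw, b)); // slow path normalizes on demand
}
```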
Nov 27 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Nov-2013 18:45, David Nadlinger пишет:
 On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435
If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.
It's anything but cheap. At the minimum imagine crawling the string and issuing a table lookup per codepoint.
 David
-- Dmitry Olshansky
Nov 27 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/27/2013 12:06 PM, Dmitry Olshansky wrote:
 27-Nov-2013 18:45, David Nadlinger пишет:
 As far as I'm aware, this behavior is the result of a deliberate
 decision, as normalizing strings on the fly isn't really cheap.
It's anything but cheap. At the minimum imagine crawling the string and issuing a table lookup per codepoint.
Decoding isn't cheap, either, which is why I rant about it being the default behavior.
Nov 28 2013
prev sibling next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/
Most of the points are good, but the author seems to confuse UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong. The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post.
 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435
D strings are arrays of code units and ranges of code points. The failure here is yours, in that you didn't use std.uni to handle graphemes.

On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).
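For comparison, a sketch using only the two grapheme functions named above, decodeGrapheme and graphemeStride. countGraphemes is a made-up helper; the manual loop is the kind of bookkeeping a lazy grapheme range would hide.

```d
import std.uni : decodeGrapheme, graphemeStride;

/// Hypothetical helper: count grapheme clusters by consuming a copy of s.
size_t countGraphemes(string s)
{
    size_t n;
    while (s.length)
    {
        decodeGrapheme(s); // pops one grapheme cluster off the front of s
        ++n;
    }
    return n;
}

void main()
{
    string s = "noe\u0308l";

    // The cluster starting at index 2 spans 'e' (1 code unit)
    // plus U+0308 (2 code units).
    assert(graphemeStride(s, 2) == 3);

    assert(countGraphemes(s) == 4);
}
```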
Nov 27 2013
next sibling parent reply "Wyatt" <wyatt.epp gmail.com> writes:
On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:
 The author also doesn't seem to understand the Unicode 
 definitions of character and grapheme, which is a shame, 
 because the difference is more or less the whole point of the 
 post.
I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for). Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?). On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.
 On that note, I tried to use std.uni to write a simple example 
 of how to correctly handle this in D, but it became apparent 
 that std.uni should expose something like `byGrapheme` which 
 lazily transforms a range of code points to a range of 
 graphemes (probably needs a `byCodePoint` to do the converse 
 too). The two extant grapheme functions, `decodeGrapheme` and 
 `graphemeStride`, are *awful* for string manipulation (granted, 
 they are probably perfect for text rendering).
Yes, please. While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings. It honestly surprised me how many things in std.uni don't seem to work on ranges. -Wyatt
Nov 27 2013
next sibling parent reply "Wyatt" <wyatt.epp gmail.com> writes:
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme
Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it." Is that about right? -Wyatt
Nov 27 2013
next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:
 Whoops, overzealous pasting.  That is, "e\u0308", which 
 composes to "ë".  A grapheme cluster seems to represent one 
 printed character: "...a horizontally segmentable unit of text, 
 consisting of some grapheme base (which may consist of a Korean 
 syllable) together with any number of nonspacing marks applied 
 to it."

 Is that about right?

 -Wyatt
Yes. A grapheme is also sometimes explained as being the unit that lay people intuitively think of as being a "character". The difference between a grapheme and a grapheme cluster is just a matter of perspective, like the difference between a character and a code point; the former simply refers to the decoded result, while the latter refers to the sum of encoding parts (where the parts are code points for grapheme cluster, and code units for a code point). Yet another example is that of the UTF-32 code unit: one UTF-32 code unit is (currently) equal to one Unicode code point, but both terms are meaningful in the right context.
Nov 27 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Nov-2013 20:22, Wyatt пишет:
 On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme
Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it." Is that about right?
As much as the standard defines it (actually they talk about boundaries, and a grapheme is what happens to be in between). More specifically, D's std.uni follows the notion of the extended grapheme cluster. There is no need to stick with ugly legacy crap.

See also http://www.unicode.org/reports/tr29/
 -Wyatt
-- Dmitry Olshansky
Nov 27 2013
prev sibling next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 I agree with the assertion that people SHOULD know how unicode 
 works if they want to work with it, but the way our docs are 
 now is off-putting enough that most probably won't learn 
 anything.  If they know, they know; if they don't, the wall of 
 jargon is intimidating and hard to grasp (more examples up 
 front of more things that you'd actually use std.uni for).  
 Even though I'm decently familiar with Unicode, I was having 
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme 
 cluster according to std.uni?).  On the flip side, std.utf has 
 a serious dearth of examples and the relationship between the 
 two isn't clear.
I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top. Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.
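A rough sketch of those three levels on one string, assuming a Phobos release that includes `std.uni.byGrapheme` (it did not exist at the time of this post). Nothing here is taken from the std.uni documentation itself; it only illustrates the layering described above.

```d
import std.stdio : writef, writeln;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l";

    // Level 1: the array of UTF-8 code units.
    foreach (ubyte unit; s.representation)
        writef("%02X ", unit);
    writeln();

    // Level 2: the range of code points (foreach with dchar decodes).
    foreach (dchar cp; s)
        writef("U+%04X ", cast(uint) cp);
    writeln();

    // Level 3: graphemes via std.uni; print the code point count of each.
    foreach (g; s.byGrapheme)
        writef("%s ", g.length);
    writeln();
}
```

The first loop sees six code units, the second five code points, and the third four graphemes, one of which is made of two code points.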
 Yes, please.  While operations on single codepoints and 
 characters seem pretty robust (i.e. you can do lots of things 
 with and to them), it feels like it just falls apart when you 
 try to work with strings.  It honestly surprised me how many 
 things in std.uni don't seem to work on ranges.

 -Wyatt
Most string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization). The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
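A small sketch of both halves of that claim, using only `std.uni.normalize` and ordinary range algorithms. The strings are illustrative; the point is that searching works once both sides share a normalization form, while reordering code points does not.

```d
import std.algorithm : canFind, equal;
import std.range : retro;
import std.uni : normalize, NFC;

void main()
{
    string composed   = "no\u00EBl";  // precomposed U+00EB
    string decomposed = "noe\u0308l"; // 'e' + U+0308 COMBINING DIAERESIS

    // Searching fails across normalization forms...
    assert(!decomposed.canFind("\u00EB"));
    // ...and works once both sides are in the same form.
    assert(normalize!NFC(decomposed).canFind("\u00EB"));
    assert(normalize!NFC(decomposed) == composed);

    // Reordering code points is where it breaks: reversed code point by
    // code point, the combining mark ends up attached to the 'l'.
    assert(decomposed.retro.equal("l\u0308eon"));
}
```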
Nov 27 2013
parent Charles Hixson <charleshixsn earthlink.net> writes:
On 11/27/2013 08:53 AM, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 I agree with the assertion that people SHOULD know how unicode works 
 if they want to work with it, but the way our docs are now is 
 off-putting enough that most probably won't learn anything.  If they 
 know, they know; if they don't, the wall of jargon is intimidating 
 and hard to grasp (more examples up front of more things that you'd 
 actually use std.uni for).  Even though I'm decently familiar with 
 Unicode, I was having trouble following all that (e.g. Isn't 
 "noe\u0308l" a grapheme cluster according to std.uni?).  On the flip 
 side, std.utf has a serious dearth of examples and the relationship 
 between the two isn't clear.
 I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top. Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.
 Yes, please.  While operations on single codepoints and characters 
 seem pretty robust (i.e. you can do lots of things with and to them), 
 it feels like it just falls apart when you try to work with strings.  
 It honestly surprised me how many things in std.uni don't seem to 
 work on ranges.

 -Wyatt
 Most string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization). The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
I would put things a bit more emphatically. The codepoint is analogous to assembler, while the character is analogous to a high-level language (and the binary representation is analogous to a binary representation). The desire is to make characters easy to use in a way that is cheap. To me this means that the high-level language (i.e., D) should make it easy to deal with characters, possible to deal with codepoints, and should let you deal with binary representations if you really want to. (Also note that the isomorphism between assembler code and binary is matched by an isomorphism between codepoints and their binary representation.) To do this cheaply, D needs to know which normalization form each string is in. This is likely to cost one byte per string, unless there's some slack in the current representation. But is this worthwhile? This is the direction that things will eventually go, but that doesn't mean we need to push them in that direction today. If, however, D had a default normalization that occurred during I/O operations, the cost of the normalization would probably be lost in the impedance matching between RAM and storage. (Again, any default must be overridable.) Also, of course, none of this is of any significance for ASCII.
-- Charles Hixson
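A sketch of what normalizing at the I/O boundary could look like with today's library, without any language support. `normalizeLines` is a hypothetical helper, not a Phobos function, and it does not track the normalization form per string as proposed above; it only shows the "normalize once on input" idea.

```d
import std.stdio : stdin, writeln;
import std.uni : normalize, NFC;

// Hypothetical helper: copy each input line, normalized to NFC.
string[] normalizeLines(Lines)(Lines lines)
{
    string[] result;
    foreach (line; lines)
        result ~= normalize!NFC(line).idup; // idup: byLine reuses its buffer
    return result;
}

void main()
{
    // Normalize once at the input boundary; later code can assume NFC.
    foreach (line; normalizeLines(stdin.byLine))
        writeln(line);
}
```

Pure ASCII input passes through unchanged, which matches the remark that none of this matters for ASCII.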
Nov 27 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/27/2013 8:18 AM, Wyatt wrote:
 It honestly surprised me how
 many things in std.uni don't seem to work on ranges.
Many things in Phobos either predate ranges, or are written by people who aren't used to ranges and don't think in terms of ranges. It's an ongoing issue, and one we need to improve upon. And, of course, you're welcome to pitch in and help with pull requests on the documentation and implementation!
Nov 27 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Nov-2013 20:18, Wyatt wrote:
 On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:
 It
 honestly surprised me how many things in std.uni don't seem to work on
 ranges.
Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)? -- Dmitry Olshansky
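For the `isAlpha(rangeOfCodepoints)` reading of the question, the per-code-point functions in std.uni already compose with generic range algorithms. A minimal sketch, assuming that is the kind of usage meant:

```d
import std.algorithm : all, equal, map;
import std.uni : isAlpha, toUpper;

void main()
{
    string word = "no\u00EBl";

    // A per-code-point predicate applied to a whole range of code points.
    assert(word.all!isAlpha);
    assert(!("no\u00EBl 2013".all!isAlpha)); // space and digits are not alphabetic

    // A per-code-point mapping, evaluated lazily.
    assert(word.map!(c => toUpper(c)).equal("NO\u00CBL"));
}
```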
Nov 27 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how to
 correctly handle this in D, but it became apparent that std.uni should
 expose something like `byGrapheme` which lazily transforms a range of
 code points to a range of graphemes (probably needs a `byCodePoint` to
 do the converse too). The two extant grapheme functions,
 `decodeGrapheme` and `graphemeStride`, are *awful* for string
 manipulation (granted, they are probably perfect for text rendering).
Yah, byGrapheme would be a great addition. Andrei
Nov 27 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how
 to correctly handle this in D, but it became apparent that std.uni
 should expose something like `byGrapheme` which lazily transforms a
 range of code points to a range of graphemes (probably needs a
 `byCodePoint` to do the converse too). The two extant grapheme
 functions, `decodeGrapheme` and `graphemeStride`, are *awful* for
 string manipulation (granted, they are probably perfect for text
 rendering).
 Yah, byGrapheme would be a great addition.
[...] +1. This is better than the GraphemeString / i18nString proposal elsewhere in this thread, because it discourages people from using graphemes (poor performance) except where actually necessary.
T
-- He who laughs last thinks slowest.
Nov 27 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Nov-2013 22:12, H. S. Teoh wrote:
 On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how
 to correctly handle this in D, but it became apparent that std.uni
 should expose something like `byGrapheme` which lazily transforms a
 range of code points to a range of graphemes (probably needs a
 `byCodePoint` to do the converse too). The two extant grapheme
 functions, `decodeGrapheme` and `graphemeStride`, are *awful* for
 string manipulation (granted, they are probably perfect for text
 rendering).
 Yah, byGrapheme would be a great addition.
 [...] +1. This is better than the GraphemeString / i18nString proposal elsewhere in this thread, because it discourages people from using graphemes (poor performance) except where actually necessary.
I could have sworn we had byGrapheme somewhere; well, apparently not :( BTW, I believe that GraphemeString could still be a valuable addition. I know of at least one good implementation that gives you O(1) grapheme access with a nice memory footprint. It has many benefits, but two chief problems:
a) It doesn't solve interchange at all: you'd have to encode on write and re-code on read.
b) It relies on having global shared state across the whole program, and that's the real show-stopper.
In any case, it's a direction well worth exploring.
 T
-- Dmitry Olshansky
Nov 27 2013
parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky 
wrote:
 I could have sworn we had byGrapheme somewhere, well apparently 
 not :(
Simple attempt: https://github.com/D-Programming-Language/phobos/pull/1736
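For readers on a later Phobos: std.uni now exposes `byGrapheme` and `byCodePoint`. Whether they match this pull request in every detail is not stated in the thread, so treat the following as a sketch against the current std.uni API. `reverseGraphemes` is a hypothetical helper name.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse a string grapheme by grapheme.
string reverseGraphemes(string s)
{
    return s.byGrapheme   // code points -> graphemes (lazy)
            .array        // retro needs a bidirectional range
            .retro
            .byCodePoint  // graphemes -> code points (lazy)
            .text;
}

void main()
{
    // The diaeresis stays on the 'e' after reversal.
    assert(reverseGraphemes("noe\u0308l") == "le\u0308on");
}
```

Only the reversal step pays for grapheme decoding; the rest of a program can keep working on code points.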
Nov 29 2013
prev sibling parent Simen Kjærås <simen.kjaras gmail.com> writes:
On 27.11.2013 19:07, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how to
 correctly handle this in D, but it became apparent that std.uni should
 expose something like `byGrapheme` which lazily transforms a range of
 code points to a range of graphemes (probably needs a `byCodePoint` to
 do the converse too). The two extant grapheme functions,
 `decodeGrapheme` and `graphemeStride`, are *awful* for string
 manipulation (granted, they are probably perfect for text rendering).
 Yah, byGrapheme would be a great addition.
It shouldn't be hard to make, either:

import std.uni : Grapheme, decodeGrapheme;
import std.traits : isSomeString;
import std.array : empty;

// Lazily decodes one grapheme at a time from a string.
struct ByGrapheme(T) if (isSomeString!T)
{
    Grapheme _front;
    bool _empty;
    T _range;

    this(T value)
    {
        _range = value;
        popFront(); // prime _front with the first grapheme
    }

    @property Grapheme front()
    {
        assert(!empty);
        return _front;
    }

    void popFront()
    {
        assert(!empty);
        _empty = _range.empty;
        if (!_empty)
        {
            _front = decodeGrapheme(_range); // advances _range
        }
    }

    @property bool empty()
    {
        return _empty;
    }
}

auto byGrapheme(T)(T value) if (isSomeString!T)
{
    return ByGrapheme!T(value);
}

void main()
{
    import std.stdio;
    string s = "তঃঅ৩৵பஂஅபூ௩ᐁᑦᕵᙧᚠᚳᛦᛰ¥¼Ññ";
    writeln(s.byGrapheme);
}

-- Simen
Nov 27 2013
prev sibling parent "Gary Willoughby" <dev nomad.so> writes:
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

 Bye,
 bearophile
Ha, I was just discussing that here: http://forum.dlang.org/thread/xmusisihhbmefeigvxvd forum.dlang.org
Nov 27 2013