digitalmars.D.learn - How to correctly deal with unicode strings?
- Gary Willoughby (22/22) Nov 27 2013 I've just been reading this article:
- David Nadlinger (6/8) Nov 27 2013 In this specific example you could e.g. use std.uni.normalize,
- monarch_dodra (5/13) Nov 27 2013 I'll stress your "in this specific example", as that will not
- Dicebot (21/21) Nov 27 2013 D strings have dual nature. They behave as arrays of code units
- monarch_dodra (36/59) Nov 27 2013 Beaophile also linked this article:
- Adam D. Ruppe (48/48) Nov 27 2013 The normalize function in std.uni helps a lot here (well, it
I've just been reading this article: http://mortoray.com/2013/11/27/the-string-type-is-broken/ and wanted to test whether D behaves the way he describes, i.e. Unicode strings being 'broken' because they are just arrays. Although I understand the difference between code units and code points, it's not entirely clear what I need to do in D to avoid the situations he describes. For example:

    import std.algorithm;
    import std.stdio;

    void main(string[] args)
    {
        char[] x = "noël".dup;

        assert(x.length == 6);             // Actual.
        // assert(x.length == 4);          // Expected.

        assert(x[0 .. 3] == "noe".dup);    // Actual.
        // assert(x[0 .. 3] == "noë".dup); // Expected.

        x.reverse;

        assert(x == "l̈eon".dup);           // Actual.
        // assert(x == "lëon".dup);        // Expected.
    }

Here I understand what is happening, but how could I improve this example to make the expected asserts true?
Nov 27 2013
On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby wrote:
> Here i understand what is happening but how could i improve this example to make the expected asserts true?

In this specific example you could e.g. use std.uni.normalize, which by default puts the string into NFC, which has all the canonical compositions applied (e.g. ë as a single character).

David
Nov 27 2013
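David's suggestion can be sketched as a runnable example (a minimal sketch, assuming a D compiler whose std.uni provides `normalize`; NFC composes 'e' + U+0308 into the single precomposed code point U+00EB):

```d
import std.range : walkLength;
import std.uni : normalize;

void main()
{
    // "noël" written with a combining mark: 'e' followed by U+0308.
    string x = "noe\u0308l";
    assert(x.walkLength == 5);   // five code points before normalization

    string nfc = x.normalize;    // normalize defaults to NFC
    assert(nfc.walkLength == 4); // 'ë' is now the single code point U+00EB
}
```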
On Wednesday, 27 November 2013 at 14:48:07 UTC, David Nadlinger wrote:
> In this specific example you could e.g. use std.uni.normalize, which by default puts the string into NFC, which has all the canonical compositions applied (e.g. ë as a single character).

I'll stress your "in this specific example", as that will not work if there is no way to represent the "character" as a single code point.
Nov 27 2013
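That caveat can be demonstrated directly: the "l̈" cluster from the reversal example ('l' followed by U+0308) has no precomposed code point in Unicode, so NFC cannot merge it. A minimal sketch, assuming std.uni's `normalize`:

```d
import std.range : walkLength;
import std.uni : normalize;

void main()
{
    // 'l' + combining diaeresis has no precomposed form in Unicode,
    // so normalization leaves it as two separate code points.
    string s = "l\u0308";
    assert(s.normalize.walkLength == 2);
}
```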
D strings have a dual nature. They behave as arrays of code units when slicing or accessing .length directly (because of the O(1) guarantees for those operations), but all algorithms in the standard library work with them as ranges of dchar:

    import std.algorithm;
    import std.range : walkLength, take;
    import std.array : array;

    void main(string[] args)
    {
        char[] x = "noël".dup;

        assert(x.length == 6);
        assert(x.walkLength == 5); // ë is two code points on my machine

        assert(x[0 .. 3] == "noe".dup); // Actual.
        assert(array(take(x, 4)) == "noë"d);

        x.reverse;
        assert(x == "l̈eon".dup); // Actual, and correct!
    }

The problem you have here is that ë can be represented as two separate Unicode code points despite being a single drawn symbol. It has nothing to do with strings being arrays of code units; using an array of `dchar` will result in the same behavior.
Nov 27 2013
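The three levels involved here (code units, code points, graphemes) can be counted side by side. A sketch, assuming a std.uni new enough to provide `byGrapheme`:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" with 'ë' written as 'e' + U+0308.
    string x = "noe\u0308l";
    assert(x.length == 6);                // code units (UTF-8 bytes)
    assert(x.walkLength == 5);            // code points (auto-decoded)
    assert(x.byGrapheme.walkLength == 4); // user-perceived characters
}
```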
Bearophile also linked this article: http://forum.dlang.org/thread/nieoqqmidngwoqwnktih forum.dlang.org

On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby wrote:
> I've just been reading this article: http://mortoray.com/2013/11/27/the-string-type-is-broken/ and wanted to test if D performed in the same way as he describes, i.e. unicode strings being 'broken' because they are just arrays.

No. While Unicode strings are "just" arrays, that's not why they're "broken". Unicode strings are *stored* in arrays, as a sequence of "code units", but they are still decoded an entire "code point" at a time, so that's not the issue. The main issue is that in Unicode, a "character" (if that means anything), or "grapheme", can be composed of two code points that mustn't be separated. Currently, D does not know how to deal with this.

> assert(x.length == 6); // Actual
> // assert(x.length == 4); // Expected.

This is a source of confusion: a string is *not* a random-access range. This means that "length" is not actually part of the "string interface"; it is only an underlying implementation detail. Try this:

    alias String = string;
    static if (hasLength!String)
        assert(x.length == 4);
    else
        assert(x.walkLength == 4);

This will work regardless of the string's "width" (char/wchar/dchar).

> assert(x[0 .. 3] == "noe".dup); // Actual.
> // assert(x[0 .. 3] == "noë".dup); // Expected.

Again, don't slice your strings like that; a string is neither random-access nor sliceable. You have no guarantee your third character starts at index 3. You want:

    assert(equal(x.take(3), "noe"));

Note that "x.take(3)" will not actually give you a slice, but a lazy range. If you want a slice, you need to walk the string and extract the index:

    auto index = x.length - x.drop(3).length;
    assert(x[0 .. index] == "noe");

Note that this is *only* "UTF-correct"; it is still wrong from a Unicode point of view. Again, that's because ë is actually a single grapheme composed of *two* code points.

> x.reverse;
> assert(x == "l̈eon".dup); // Actual
> // assert(x == "lëon".dup); // Expected.
>
> Here i understand what is happening but how could i improve this example to make the expected asserts true?

AFAIK, we don't have any way of dealing with this (yet).
Nov 27 2013
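The take/drop approach above, assembled into a runnable sketch (using a precomposed ë, U+00EB, so that "character" and code point coincide; std.range's `drop` pops whole code points on narrow strings):

```d
import std.algorithm : equal;
import std.range : drop, take;

void main()
{
    // "noël" with a precomposed ë: 5 code units, 4 code points.
    string x = "no\u00EBl";

    // Lazy, UTF-correct "first three characters":
    assert(equal(x.take(3), "no\u00EB"));

    // For an actual slice, walk the string to find the code-unit index:
    auto index = x.length - x.drop(3).length;
    assert(index == 4); // 'n' + 'o' + the 2-byte encoding of 'ë'
    assert(x[0 .. index] == "no\u00EB");
}
```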
The normalize function in std.uni helps a lot here (well, it would if it actually compiled, but that's an easy fix: just import std.typecons in your copy of phobos/src/uni.d. It does so already, but only in version(unittest)! LOL)

dstrings are a bit easier to use too:

    import std.algorithm;
    import std.stdio;
    import std.uni;

    void main(string[] args)
    {
        dstring x = "noël"d.normalize;
        assert(x.length == 4); // Expected.
        assert(x[0 .. 3] == "noë"d.normalize); // Expected.

        import std.range;
        dstring y = x.retro.array;
        assert(y == "lëon"d.normalize); // Expected.
    }

All of that works. The normalize function combines the character pairs, like 'ë', into single code points. Take a gander at this:

    foreach(dchar c; "noël"d)
        writeln(cast(int) c);

    110 // 'n'
    111 // 'o'
    101 // 'e'
    776 // combining character that adds the diaeresis to the preceding character
    108 // 'l'

This, btw, is a great example of why .length should *not* be expected to give the number of characters, not even with dstrings, since a code point is not necessarily the same as a character! And, of course, with string, a code point is often not the same as a code unit.

What the normalize function does is go through and combine those combining characters into one thing:

    import std.uni;
    foreach(dchar c; "noël"d.normalize)
        writeln(cast(int) c);

    110 // 'n'
    111 // 'o'
    235 // 'ë'
    108 // 'l'

BTW, since I'm copy/pasting here, I'm honestly not sure whether the string is actually different in the D source or not, since they display the same way... But still, this is what's going on, and the normalize function is the key to making these comparisons and reversals easier to do. A normalized dstring is about as close as you can get to the simplified ideal of "one index into the array is one character".
Nov 27 2013
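For the reversal specifically, a grapheme-aware approach also handles clusters that NFC cannot compose (such as "l̈"). A sketch, assuming a std.uni that provides `byGrapheme` and Grapheme's `g[]` slice of its code points:

```d
import std.algorithm : map;
import std.array : array, join;
import std.conv : text;
import std.range : retro;
import std.uni : byGrapheme;

void main()
{
    // "noël" with 'ë' as 'e' + U+0308: reversing code points would detach
    // the combining mark; reversing graphemes keeps it with its 'e'.
    string x = "noe\u0308l";
    string reversed = x.byGrapheme           // range of grapheme clusters
                       .array                // materialize so we can reverse
                       .retro                // reverse the clusters
                       .map!(g => g[].text)  // each cluster back to a string
                       .join;
    assert(reversed == "le\u0308on"); // "lëon", mark still on the 'e'
}
```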