
digitalmars.D.learn - How to correctly deal with unicode strings?

reply "Gary Willoughby" <dev nomad.so> writes:
I've just been reading this article: 
http://mortoray.com/2013/11/27/the-string-type-is-broken/ and 
wanted to test if D performed in the same way as he describes, 
i.e. unicode strings being 'broken' because they are just arrays.

Although i understand the difference between code units and code 
points it's not entirely clear in D what i need to do to avoid 
the situations he describes. For example:

import std.algorithm;
import std.stdio;

void main(string[] args)
{
	char[] x = "noël".dup;

	assert(x.length == 6); // Actual
	// assert(x.length == 4); // Expected.

	assert(x[0 .. 3] == "noe".dup); // Actual.
	// assert(x[0 .. 3] == "noë".dup); // Expected.

	x.reverse;

	assert(x == "l̈eon".dup); // Actual
	// assert(x == "lëon".dup); // Expected.
}

Here i understand what is happening but how could i improve this 
example to make the expected asserts true?
Nov 27 2013
next sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby 
wrote:
 Here i understand what is happening but how could i improve 
 this example to make the expected asserts true?
In this specific example you could e.g. use std.uni.normalize,
which by default puts the string into NFC, which has all the
canonical compositions applied (e.g. ë as a single code point).

David
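For instance, a minimal sketch of what normalize does here (the
\u escapes make the two encodings explicit):

```d
import std.uni : normalize;

void main()
{
    // "noe\u0308l" is "noël" written as e + combining diaeresis
    // (U+0308). NFC, the default for normalize, composes the pair
    // into the single code point U+00EB.
    string s = "noe\u0308l".normalize;
    assert(s == "no\u00EBl");
    assert(s.length == 5); // 5 UTF-8 code units: ë encodes as 2 bytes
}
```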
Nov 27 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 27 November 2013 at 14:48:07 UTC, David Nadlinger 
wrote:
 On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby 
 wrote:
 Here i understand what is happening but how could i improve 
 this example to make the expected asserts true?
 In this specific example you could e.g. use std.uni.normalize,
 which by default puts the string into NFC, which has all the
 canonical compositions applied (e.g. ë as a single character).

 David
I'll stress your "in this specific example", as that will not
work if there is no way to represent the "character" as a single
codepoint.
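A quick sketch of that caveat, using x + combining macron, which
(unlike ë) has no precomposed code point in Unicode:

```d
import std.range : walkLength;
import std.uni : normalize;

void main()
{
    // "x\u0304" is "x̄": x followed by combining macron. There is
    // no precomposed form, so NFC leaves it as two code points.
    auto s = "x\u0304".normalize;
    assert(s.walkLength == 2); // still two code points after NFC
}
```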
Nov 27 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
D strings have dual nature. They behave as arrays of code units 
when slicing or accessing .length directly (because of O(1) 
guarantees for those operations) but all algorithms in standard 
library work with them as with arrays of dchar:

import std.algorithm;
import std.range : walkLength, take;
import std.array : array;

void main(string[] args)
{
	char[] x = "noël".dup;

	assert(x.length == 6);
	assert(x.walkLength == 5); // ë is two code points here (decomposed in my source)

	assert(x[0 .. 3] == "noe".dup); // Actual.
	assert(array(take(x, 4)) == "noë"d);

	x.reverse;

	assert(x == "l̈eon".dup); // Actual and correct!
}

The problem you have here is that ë can be represented as two
separate Unicode code points despite being drawn as a single
symbol. It has nothing to do with strings being arrays of code
units; using an array of `dchar` will result in the same
behavior.
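A small sketch of that last point, showing that `dchar` arrays
hit exactly the same issue:

```d
import std.range : walkLength;

void main()
{
    // The decomposed ë is two elements even in UTF-32:
    dstring x = "noe\u0308l"d;
    assert(x.length == 5);     // 5 code points, not 4 "characters"
    assert(x.walkLength == 5); // same count: one dchar == one code point
}
```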
Nov 27 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnktih forum.dlang.org

On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby 
wrote:
 I've just been reading this article: 
 http://mortoray.com/2013/11/27/the-string-type-is-broken/ and 
 wanted to test if D performed in the same way as he describes, 
 i.e. unicode strings being 'broken' because they are just 
 arrays.
No. While unicode strings are "just" arrays, that's not why
they're "broken". Unicode strings are *stored* in arrays, as a
sequence of "code units", but they are still decoded entire
"code points" at a time, so that's not the issue. The main issue
is that in unicode, a "character" (if that means anything), or a
"grapheme", can be composed of two code points, which mustn't be
separated. Currently, D does not know how to deal with this.
 Although i understand the difference between code units and 
 code points it's not entirely clear in D what i need to do to 
 avoid the situations he describes. For example:

 import std.algorithm;
 import std.stdio;

 void main(string[] args)
 {
 	char[] x = "noël".dup;

 	assert(x.length == 6); // Actual
 	// assert(x.length == 4); // Expected.
This is a source of confusion: a string is *not* a random access
range. This means that "length" is not actually part of the
"string interface": it is only an underlying implementation
detail. Try this:

	import std.range; // hasLength, walkLength
	alias String = string;
	static if (hasLength!String)
		assert(x.length == 4);
	else
		assert(x.walkLength == 4);

This will work regardless of the string's "width"
(char/wchar/dchar). (Though note that walkLength is actually 5
here, not 4, because ë is stored as two code points; you'd have
to normalize the string first to get 4.)
 	assert(x[0 .. 3] == "noe".dup); // Actual.
 	// assert(x[0 .. 3] == "noë".dup); // Expected.
Again, don't slice your strings like that; a string isn't random
access nor sliceable. You have no guarantee your third character
will start at index 3. You want:

	assert(equal(x.take(3), "noe"));

Note that "x.take(3)" will not actually give you a slice, but a
lazy range. If you want a slice, you need to walk the string and
extract the index:

	auto index = x.length - x.drop(3).length;
	assert(x[0 .. index] == "noe");

Note that this is *only* "UTF-correct"; it is still wrong from a
unicode point of view. Again, that's because ë is actually a
single grapheme composed of *two* code points.
 	x.reverse;

 	assert(x == "l̈eon".dup); // Actual
 	// assert(x == "lëon".dup); // Expected.
 }

 Here i understand what is happening but how could i improve 
 this example to make the expected asserts true?
AFAIK, we don't have any way of dealing with this (yet).
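(For what it's worth, later Phobos releases did add grapheme-level
iteration to std.uni. A sketch, assuming a compiler recent enough
to have std.uni.byGrapheme:)

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // byGrapheme groups a base character with its combining marks,
    // so the decomposed ë counts as a single element:
    auto s = "noe\u0308l";
    assert(s.walkLength == 5);             // code points
    assert(s.byGrapheme.walkLength == 4);  // graphemes
}
```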
Nov 27 2013
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
The normalize function in std.uni helps a lot here (well, it
would if it actually compiled, but that's an easy fix: just
import std.typecons in your copy of phobos/src/uni.d. It does so
already, but only in version(unittest)! LOL)

dstrings are a bit easier to use too:

import std.algorithm;
import std.stdio;
import std.uni;

void main(string[] args)
{
         dstring x = "noël"d.normalize;

         assert(x.length == 4); // Expected.

         assert(x[0 .. 3] == "noë"d.normalize); // Expected.

         import std.range;
         dstring y = x.retro.array;

         assert(y == "lëon"d.normalize); // Expected.
}


All of that works. The normalize function combines character
pairs, like 'ë', into single code points. Take a gander at this:


         foreach(dchar c; "noël"d)
                 writeln(cast(int) c);

110 // 'n'
111 // 'o'
101 // 'e'
776 // U+0308, a combining character that adds the diaeresis to
the preceding character
108 // 'l'

This btw is a great example of why .length should *not* be 
expected to give the number of characters, not even with 
dstrings, since a code point is not necessarily the same as a 
character! And, of course, with string, a code point is often not 
the same as a code unit.


What the normalize function does is go through and combine those
combining characters into one thing:

         import std.uni;
         foreach(dchar c; "noël"d.normalize)
                 writeln(cast(int) c);

110 // 'n'
111 // 'o'
235 // 'ë'
108 // 'l'


BTW, since I'm copy/pasting here, I'm honestly not sure if the 
string is actually different in the D source or not, since they 
display the same way...

But still, this is what's going on, and the normalize function is 
the key to get these comparisons and reversals easier to do. A 
normalized dstring is about as close as you can get to the 
simplified ideal of one index into the array is one character.
Nov 27 2013