digitalmars.D - UTF-8 issues

reply Eldar Insafutdinov <e.insafutdinov mail.ru> writes:
I have run into some issues with UTF-8 support in D.
As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
support slicing and length calculation. Since strings are char arrays, this is
correct only for Latin strings. When a string contains, for example, Cyrillic
chars, the length is wrong, indexing doesn't work, and neither does slicing.
But foreach works correctly. So UTF-8 support is partial. Maybe there are
functions in the standard library that do this work? I checked the new D2
features and saw no improvement to UTF-8 support - am I wrong?
Sep 15 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Eldar Insafutdinov wrote:
 I have run into some issues with UTF-8 support in D. As stated in
 http://www.digitalmars.com/d/2.0/cppstrings.html, strings support
 slicing and length calculation. Since strings are char arrays, this is
 correct only for Latin strings. When a string contains, for example,
 Cyrillic chars, the length is wrong, indexing doesn't work, and
 neither does slicing. But foreach works correctly. So UTF-8 support is
 partial. Maybe there are functions in the standard library that do
 this work? I checked the new D2 features and saw no improvement to
 UTF-8 support - am I wrong?

This should help: http://www.digitalmars.com/d/2.0/phobos/std_utf.html
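
To give a rough idea of what std.utf offers, here is a minimal sketch. It assumes a recent D2 compiler and Phobos (some of these functions, such as count, may not exist in older releases), and the sample string is only an illustration.

```d
import std.stdio : writeln;
import std.utf : count, decode, stride, toUTFindex;

void main()
{
    string s = "Привет, мир"; // sample text: 9 Cyrillic letters, a comma and a space

    // .length is the number of UTF-8 code units (bytes), not characters.
    writeln(s.length);        // 20
    // count walks the whole string and returns the number of code points.
    writeln(s.count);         // 11

    // stride gives the byte width of the code point starting at a byte index.
    writeln(s.stride(0));     // 2

    // decode reads one code point and advances the byte index past it.
    size_t i = 0;
    dchar first = decode(s, i);
    writeln(first, " ", i);   // П 2

    // toUTFindex converts a code point index into a byte index for slicing.
    size_t start = s.toUTFindex(8);
    writeln(s[start .. $]);   // мир
}
```

Note that count and toUTFindex both scan from the start of the string, so they are linear in its length.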
Sep 15 2008
prev sibling next sibling parent reply "Chris R. Miller" <lordsauronthegreat gmail.com> writes:
Eldar Insafutdinov wrote:
 I have run into some issues with UTF-8 support in D.
 As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
support slicing and length calculation. Since strings are char arrays, this is
correct only for Latin strings. When a string contains, for example, Cyrillic
chars, the length is wrong, indexing doesn't work, and neither does slicing.
 But foreach works correctly. So UTF-8 support is partial. Maybe there are
functions in the standard library that do this work? I checked the new D2
features and saw no improvement to UTF-8 support - am I wrong?

IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable. Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid. If you used a wchar or dchar things would be different.
Sep 15 2008
next sibling parent reply "Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
On Mon, Sep 15, 2008 at 2:38 PM, Chris R. Miller
<lordsauronthegreat gmail.com> wrote:
 Eldar Insafutdinov wrote:
 I have run into some issues with UTF-8 support in D.
 As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
support slicing and length calculation. Since strings are char arrays, this is
correct only for Latin strings. When a string contains, for example, Cyrillic
chars, the length is wrong, indexing doesn't work, and neither does slicing.
 But foreach works correctly. So UTF-8 support is partial. Maybe there are
functions in the standard library that do this work? I checked the new D2
features and saw no improvement to UTF-8 support - am I wrong?

IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable. Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid.

It's called UTF-8, and it's supposed to work like that. That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's. (Though it could be argued that multibyte encodings are stupid as hell, and I would agree with that.)
 If you used a wchar or dchar things would be different.

If he used dchar it'd be different. wchar still has multi-element encodings (surrogate pairs) for codepoints outside the BMP. Which, admittedly, are not that common, but it can still happen.
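
A small sketch of the surrogate pair point, assuming a current D2 compiler. The G clef character U+1D11E is just an arbitrary example of a code point outside the BMP.

```d
import std.stdio : writeln;

void main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    writeln(s8.length);   // 4 UTF-8 code units
    writeln(s16.length);  // 2 UTF-16 code units (a surrogate pair)
    writeln(s32.length);  // 1 UTF-32 code unit

    // The two wchar elements are the high and low surrogates.
    assert(s16[0] == 0xD834 && s16[1] == 0xDD1E);
}
```

So only dchar guarantees one array element per code point.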
Sep 15 2008
parent Lutger <lutger.blijdestijn gmail.com> writes:
Jarrett Billingsley wrote:
...
 
 It's called UTF-8, and it's supposed to work like that.  That D does
 not provide some kind of interface for dealing with multibyte
 encodings (other than foreach and the encode/decode functions) is a
 failing on its part, not Unicode's.

There's also std.string of course. What do you find so lacking? (just curious)
Sep 16 2008
prev sibling parent "Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
On Tue, Sep 16, 2008 at 4:57 PM, Lutger <lutger.blijdestijn gmail.com> wrote:
 Jarrett Billingsley wrote:
 ...
 It's called UTF-8, and it's supposed to work like that.  That D does
 not provide some kind of interface for dealing with multibyte
 encodings (other than foreach and the encode/decode functions) is a
 failing on its part, not Unicode's.

There's also std.string of course. What do you find so lacking? (just curious)

The lack of any way to index or slice a string according to codepoint indices (instead of byte/short indices), get the length of a string in codepoints, or to find the nearest beginning character given an arbitrary character index. (std.string is also embarrassingly missing any functionality for wchar[] or dchar[] but that's a slightly different issue.)
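
For illustration, something along these lines can be built on top of std.utf. This is only a sketch assuming a recent Phobos; sliceByCodePoint and alignToCodePoint are made-up names, not library functions.

```d
import std.stdio : writeln;
import std.utf : count, toUTFindex;

/// Hypothetical helper: slice by code point indices [from, to).
string sliceByCodePoint(string s, size_t from, size_t to)
{
    // Each toUTFindex call walks the string from the start, so this is O(n).
    immutable lo = s.toUTFindex(from);
    immutable hi = s.toUTFindex(to);
    return s[lo .. hi];
}

/// Hypothetical helper: step back from byte index i to the first byte
/// of the code point that contains it.
size_t alignToCodePoint(string s, size_t i)
{
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while (i > 0 && i < s.length && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "Привет, мир";
    writeln(s.count);                      // length in code points: 11
    writeln(sliceByCodePoint(s, 8, 11));   // мир
    writeln(alignToCodePoint(s, 15));      // 14 (byte 15 is the second byte of 'м')
}
```

None of this makes the operations cheap; every code point index still costs a scan from the beginning.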
Sep 16 2008
prev sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
Eldar Insafutdinov wrote:
 I faced some issues with utf-8 support in D.

The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such. As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions.

D1/Tango:
http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html

D1/Phobos:
http://digitalmars.com/d/1.0/phobos/std_utf.html

D2/Phobos:
http://digitalmars.com/d/2.0/phobos/std_utf.html

Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.
 As it stated in http://www.digitalmars.com/d/2.0/cppstrings.html strings
support slicing and length-calculation. Since strings are char arrays this is
correct only for latin strings. So when the strings for example cyrillic chars
- length is wrong, indexing also doesn't work, and slicing too.

Indexing, slicing, and length calculation of D strings are based on byte position, not character position. Character-position indexing and slicing are only possible by iterating from the beginning of the string, decoding the characters on the fly, and keeping track of the number of bytes used by each character. That's basically what the standard library functions do.

Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end).

The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation.

--benji
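
To make the cost concrete, here is a rough sketch of the "iterate and decode from the beginning" approach. It assumes a recent D2 Phobos; codePointAt and codePointLength are hypothetical helpers written for this example, not library functions.

```d
import std.stdio : writeln;
import std.utf : decode;

/// Hypothetical helper: return the n-th code point (0-based) by decoding
/// every code point before it. Cost grows with n.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;            // byte offset of the next code point
    dchar c;
    foreach (_; 0 .. n + 1)
        c = decode(s, i);    // decodes one code point and advances i past it
    return c;
}

/// Hypothetical helper: count code points by walking the whole string,
/// much like strlen has to walk a C string.
size_t codePointLength(string s)
{
    size_t n = 0;
    foreach (dchar c; s)     // foreach with a dchar variable decodes on the fly
        ++n;
    return n;
}

void main()
{
    string s = "Привет, мир";
    writeln(s.length);              // 20 bytes
    writeln(codePointLength(s));    // 11 code points
    writeln(codePointAt(s, 8));     // м
}
```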
Sep 15 2008
parent reply Eldar Insafutdinov <e.insafutdinov mail.ru> writes:
Benji Smith Wrote:

 Eldar Insafutdinov wrote:
 I faced some issues with utf-8 support in D.

The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such. As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions. D1/Tango: http://dsource.org/projects/tango/docs/current/tango.text.Util.html http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html D1/Phobos: http://digitalmars.com/d/1.0/phobos/std_utf.html D2/Phobos: http://digitalmars.com/d/2.0/phobos/std_utf.html Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.
 As it stated in http://www.digitalmars.com/d/2.0/cppstrings.html strings
support slicing and length-calculation. Since strings are char arrays this is
correct only for latin strings. So when the strings for example cyrillic chars
- length is wrong, indexing also doesn't work, and slicing too.

Indexing, slicing, and length calculation of D strings are based on byte position, not character position. Character-position indexing and slicing are only possible by iterating from the beginning of the string, decoding the characters on the fly, and keeping track of the number of bytes used by each character. That's basically what the standard library functions do. Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end). The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation. --benji

Yeah - I know that these operations work on bytes rather than characters. But http://www.digitalmars.com/d/2.0/cppstrings.html states explicitly that strings support slicing:
D has the array slice syntax, not possible with C++:

char[] s1 = "hello world";
char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for Latin chars; in general it is wrong for UTF-8 strings.
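
Here is a small sketch of what I mean, assuming a recent D2 compiler (where literals are immutable, so string is used instead of char[]). The helper isValidUtf8 is a made-up name for this example; it only wraps std.utf.validate.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string latin = "hello world";
    string cyr   = "привет мир";    // same text length, but 19 bytes

    // The byte indices from the documentation work for ASCII text.
    writeln(latin[6 .. 11]);                 // world
    // The same indices cut a Cyrillic letter in half.
    writeln(isValidUtf8(cyr[6 .. 11]));      // false
    // The second word actually starts at byte 13.
    writeln(cyr[13 .. 19]);                  // мир
}
```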
Sep 15 2008
next sibling parent Benji Smith <dlanguage benjismith.net> writes:
Eldar Insafutdinov wrote:
 Yeah - I know that these operations work on bytes rather than characters. But
http://www.digitalmars.com/d/2.0/cppstrings.html states explicitly
that strings support slicing:
 
 D has the array slice syntax, not possible with C++:

 char[] s1 = "hello world";
 char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for Latin chars; in general it is wrong for UTF-8 strings.

That's my understanding. --benji
Sep 15 2008
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Eldar Insafutdinov wrote:
 Benji Smith Wrote:
 D has the array slice syntax, not possible with C++:

 char[] s1 = "hello world";
 char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for Latin chars; in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary indices. But I don't think you will ever use arbitrary indices. All indices will be the result of other string functions (such as find) which behave correctly for UTF-8 strings. Incrementing/decrementing can be done using std.utf or similar. UTF-8 also makes it very easy to determine if an arbitrary position in a UTF-8 sequence lies at the start or in the middle of a multi-byte encoded character.

Indexing a UTF-8 string by character rather than byte index is horribly inefficient. As others have said, if you really need to do that, use dchar[] (1). Although, I've never personally come across a place where I needed that.

1) Be aware that you will need to make sure your data is in a composed Unicode normal form, otherwise it could still use several code points (2) to represent a single grapheme.

2) A code point is a point in the Unicode codespace, which is what a dchar encodes.

-- Oskar
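
A short sketch of both points, assuming a recent D2 Phobos: std.string.indexOf stands in for "find", and std.uni.normalize is used for the composed form. The sample strings are arbitrary.

```d
import std.stdio : writeln;
import std.string : indexOf;
import std.uni : normalize, NFC;

void main()
{
    string s = "привет мир";

    // indexOf returns a byte (code unit) index, so the result is always
    // a valid slice boundary, even for multi-byte text.
    auto pos = s.indexOf("мир");
    writeln(pos);                             // 13
    writeln(s[pos .. pos + "мир".length]);    // мир

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme, but two code points even in a dstring.
    dstring decomposed = "e\u0301"d;
    writeln(decomposed.length);               // 2

    // NFC composes the pair into the single code point U+00E9.
    dstring composed = normalize!NFC(decomposed);
    writeln(composed.length);                 // 1
}
```

Even after normalization, some graphemes have no single composed code point, so dchar[] indexing is per code point and not per visible character.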
Sep 16 2008