digitalmars.D - UTF-8 issues


Eldar Insafutdinov <e.insafutdinov mail.ru> writes:

I faced some issues with UTF-8 support in D.
As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
support slicing and length calculation. Since strings are char arrays, this is
correct only for Latin strings. When a string contains, for example, Cyrillic
characters, the length is wrong, and indexing and slicing don't work either.
But foreach works correctly, so UTF-8 support is partial. Maybe there are
functions in the standard library that do this work? I checked the new D2
features and found no improvements to UTF-8 support. Am I wrong?

Sep 15 2008
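
To make the reported mismatch concrete, here is a minimal sketch, assuming a current D2 compiler and Phobos. The helper name countCodePoints is made up for illustration. It contrasts .length, which counts UTF-8 code units, with a foreach that decodes:

```d
import std.stdio;

/// Hypothetical helper: count code points by letting foreach decode the string.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // asking for dchar makes foreach decode UTF-8
        ++n;
    return n;
}

void main()
{
    string s = "Привет"; // six Cyrillic letters, two UTF-8 bytes each

    writeln(s.length);           // 12: .length counts bytes (code units)
    writeln(countCodePoints(s)); // 6: one iteration per code point

    // s[0] is only the first byte of 'П' (U+041F, encoded as D0 9F).
    writefln("0x%02X", cast(ubyte) s[0]); // 0xD0
}
```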

Walter Bright <newshound1 digitalmars.com> writes:

Eldar Insafutdinov wrote:
 I faced some issues with UTF-8 support in D. As stated in
 http://www.digitalmars.com/d/2.0/cppstrings.html, strings support
 slicing and length calculation. Since strings are char arrays, this is
 correct only for Latin strings. When a string contains, for example,
 Cyrillic characters, the length is wrong, and indexing and slicing
 don't work either. But foreach works correctly, so UTF-8 support is
 partial. Maybe there are functions in the standard library that do
 this work? I checked the new D2 features and found no improvements to
 UTF-8 support. Am I wrong?


This should help:

http://www.digitalmars.com/d/2.0/phobos/std_utf.html

Sep 15 2008
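
As a rough illustration of what std.utf offers, the sketch below uses two functions from a current Phobos, decode and toUTF32. The helper decodeAll is hypothetical, and the 2008 version of the module may have differed in detail:

```d
import std.stdio;
import std.utf : decode, toUTF32;

/// Hypothetical helper: collect the code points of `s` using std.utf.decode.
dchar[] decodeAll(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the whole sequence
    return result;
}

void main()
{
    string s = "Привет";

    // toUTF32 re-encodes the string so that each element is one code point.
    dstring wide = toUTF32(s);
    writeln(wide.length);         // 6
    writeln(wide[0]);             // П

    writeln(decodeAll(s).length); // 6
}
```

Once the text is a dstring, plain indexing and slicing operate on code points, at the cost of four bytes per element.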

"Chris R. Miller" <lordsauronthegreat gmail.com> writes:

Eldar Insafutdinov wrote:
 I faced some issues with UTF-8 support in D.
 As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
 support slicing and length calculation. Since strings are char arrays, this is
 correct only for Latin strings. When a string contains, for example, Cyrillic
 characters, the length is wrong, and indexing and slicing don't work either.
 But foreach works correctly, so UTF-8 support is partial. Maybe there are
 functions in the standard library that do this work? I checked the new D2
 features and found no improvements to UTF-8 support. Am I wrong?

IIRC a char array in D will compress itself for ASCII-encodable
characters, which destroys the integrity of the length variable.  Well,
it's still valid in terms of how long in words the array is, but in
terms of real characters it's no longer valid.

If you used a wchar or dchar things would be different.

Sep 15 2008

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Mon, Sep 15, 2008 at 2:38 PM, Chris R. Miller
<lordsauronthegreat gmail.com> wrote:
 Eldar Insafutdinov wrote:
 I faced some issues with UTF-8 support in D.
 As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
 support slicing and length calculation. Since strings are char arrays, this is
 correct only for Latin strings. When a string contains, for example, Cyrillic
 characters, the length is wrong, and indexing and slicing don't work either.
 But foreach works correctly, so UTF-8 support is partial. Maybe there are
 functions in the standard library that do this work? I checked the new D2
 features and found no improvements to UTF-8 support. Am I wrong?

 IIRC a char array in D will compress itself for ASCII-encodable
 characters, which destroys the integrity of the length variable.  Well,
 it's still valid in terms of how long in words the array is, but in
 terms of real characters it's no longer valid.

It's called UTF-8, and it's supposed to work like that.  That D does
not provide some kind of interface for dealing with multibyte
encodings (other than foreach and the encode/decode functions) is a
failing on its part, not Unicode's.

(Though it could be argued that multibyte encodings are stupid as
hell, and I would agree with that.)

 If you used a wchar or dchar things would be different.

If he used dchar it'd be different.  wchar still has multi-element
encodings (surrogate pairs) for codepoints outside the BMP.  Which,
admittedly, are not that common, but it can still happen.

Sep 15 2008
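
The point about surrogate pairs can be checked with a short sketch, assuming a current D2 compiler. The same code point, chosen here from outside the BMP, occupies a different number of elements in each string type:

```d
import std.stdio;

void main()
{
    // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane.
    string  utf8  = "\U0001D11E";
    wstring utf16 = "\U0001D11E"w;
    dstring utf32 = "\U0001D11E"d;

    writeln(utf8.length);  // 4 (bytes)
    writeln(utf16.length); // 2 (a surrogate pair)
    writeln(utf32.length); // 1

    // The first wchar is a high surrogate, not a character in its own right.
    writefln("0x%04X", cast(ushort) utf16[0]); // 0xD834
}
```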

Lutger <lutger.blijdestijn gmail.com> writes:

Jarrett Billingsley wrote:
...
 
 It's called UTF-8, and it's supposed to work like that.  That D does
 not provide some kind of interface for dealing with multibyte
 encodings (other than foreach and the encode/decode functions) is a
 failing on its part, not Unicode's.

There's also std.string of course. What do you find so lacking? (just
curious)

Sep 16 2008

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Tue, Sep 16, 2008 at 4:57 PM, Lutger <lutger.blijdestijn gmail.com> wrote:
 Jarrett Billingsley wrote:
 ...
 It's called UTF-8, and it's supposed to work like that.  That D does
 not provide some kind of interface for dealing with multibyte
 encodings (other than foreach and the encode/decode functions) is a
 failing on its part, not Unicode's.

 There's also std.string of course. What do you find so lacking? (just
 curious)

The lack of any way to index or slice a string according to codepoint
indices (instead of byte/short indices), get the length of a string in
codepoints, or to find the nearest beginning character given an
arbitrary character index.  (std.string is also embarrassingly missing
any functionality for wchar[] or dchar[] but that's a slightly
different issue.)

Sep 16 2008
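
For readers wondering what such an interface might look like, here is a minimal sketch built on count and toUTFindex from std.utf in a current Phobos. Whether both were available in the 2008 library is not something this thread confirms, and sliceByCodePoint is a hypothetical helper, not a Phobos function:

```d
import std.stdio;
import std.utf : count, toUTFindex;

/// Hypothetical helper (not part of Phobos): slice by code point positions
/// [from, to) instead of byte offsets. Each call walks the string from the
/// start, so the cost is linear in `to`.
string sliceByCodePoint(string s, size_t from, size_t to)
{
    immutable start = toUTFindex(s, from); // byte offset of code point `from`
    immutable end = toUTFindex(s, to);     // byte offset of code point `to`
    return s[start .. end];
}

void main()
{
    string s = "Привет мир";

    writeln(s.length);                   // 19 bytes
    writeln(count(s));                   // 10 code points
    writeln(sliceByCodePoint(s, 7, 10)); // мир
}
```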

Benji Smith <dlanguage benjismith.net> writes:

Eldar Insafutdinov wrote:
 I faced some issues with utf-8 support in D.

The important thing to remember is that a string is absolutely NOT an 
array of characters, and you can't treat it as such.

As you've noticed, a char[] string is actually an array of UTF-8 encoded 
bytes. Iterating directly through that array is extremely touchy and 
error-prone. Instead, always use the standard library functions.

D1/Tango:

http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html

D1/Phobos:

http://digitalmars.com/d/1.0/phobos/std_utf.html

D2/Phobos:

http://digitalmars.com/d/2.0/phobos/std_utf.html

Although the libraries do a decent job of hiding the ugly details, my 
opinion (which is not very popular around here) is that D's string 
processing is a major design flaw.

 As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings
 support slicing and length calculation. Since strings are char arrays, this is
 correct only for Latin strings. When a string contains, for example, Cyrillic
 characters, the length is wrong, and indexing and slicing don't work either.

Indexing, slicing, and length calculation of D strings are based on
byte position, not character position.

Character-position indexing and slicing is only possible by iterating 
from the beginning of the string, decoding the characters on-the-fly, 
and keeping track of the number of bytes used by each character.

That's what the standard library functions basically do.

Calculating the actual character length of a string is fundamentally
the same as computing the length of a null-terminated C string: you
can't determine it until you've iterated from the beginning to the end.

The Phobos and Tango libraries handle all of those details for you, but I
think it's important to know what's going on behind the scenes, so that
you have a rough idea of the true cost of each operation.

--benji

Sep 15 2008
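
The iteration described above can be sketched by hand, assuming stride and decode from std.utf in a current Phobos. The helper codePointAt is hypothetical and is only meant to show why character-position indexing costs time proportional to the position:

```d
import std.stdio;
import std.utf : decode, stride;

/// Hypothetical helper: return the n-th code point (0-based) of `s`.
/// There is no shortcut; every earlier character has to be stepped over.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;          // current byte offset
    foreach (_; 0 .. n)
        i += stride(s, i); // number of bytes in the sequence starting at i
    return decode(s, i);   // decode the sequence that begins at byte i
}

void main()
{
    string s = "Привет мир";
    writeln(codePointAt(s, 0)); // П
    writeln(codePointAt(s, 7)); // м
}
```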

Eldar Insafutdinov <e.insafutdinov mail.ru> writes:

Benji Smith Wrote:

 Indexing, slicing, and length calculation of D strings are based on
 byte position, not character position.

Yeah, I know that these operations work on bytes rather than characters. But
http://www.digitalmars.com/d/2.0/cppstrings.html states explicitly
that strings support slicing:

D has the array slice syntax, not possible with C++:

char[] s1 = "hello world";
char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for Latin characters; in general it is
wrong for UTF-8 strings.

Sep 15 2008
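
A small sketch of the concern, assuming validate and UTFException from std.utf in a current Phobos (isValidUtf8 is a hypothetical helper): byte-offset slicing always compiles and runs, but only offsets on character boundaries yield well-formed UTF-8.

```d
import std.stdio;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string ascii = "hello world";
    writeln(ascii[6 .. 11]);            // world

    string cyr = "Привет мир";
    writeln(isValidUtf8(cyr[0 .. 12])); // true: ends on a character boundary
    writeln(isValidUtf8(cyr[0 .. 3]));  // false: cuts the second letter in half
}
```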

Benji Smith <dlanguage benjismith.net> writes:

Eldar Insafutdinov wrote:
 Yeah, I know that these operations work on bytes rather than characters. But
 http://www.digitalmars.com/d/2.0/cppstrings.html states explicitly
 that strings support slicing:
 
 D has the array slice syntax, not possible with C++:

 
 char[] s1 = "hello world";
 char[] s2 = s1[6 .. 11];	// s2 is "world"

 
 So this example is correct only for Latin characters; in general it is
 wrong for UTF-8 strings.

That's my understanding.

--benji

Sep 15 2008

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Eldar Insafutdinov wrote:
 Benji Smith Wrote:
 D has the array slice syntax, not possible with C++:

 
 char[] s1 = "hello world";
 char[] s2 = s1[6 .. 11];	// s2 is "world"

 
 So this example is correct only for Latin characters; in general it is
 wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary 
indices. But I don't think you will ever use arbitrary indices. All 
indices will be the result of other string functions (such as find) 
which behave correctly for UTF-8 strings. Incrementing/decrementing can 
be done using std.utf or similar. UTF-8 also makes it very easy to 
determine if an arbitrary position in a UTF-8 sequence lies at the start 
or in the middle of a multi-byte encoded character.
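
Both claims can be illustrated with a short sketch, assuming indexOf from std.string in a current Phobos. The helper isSequenceStart is hypothetical and relies only on the UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx:

```d
import std.stdio;
import std.string : indexOf;

/// Hypothetical helper: true if byte `i` of `s` starts a UTF-8 sequence.
/// Continuation bytes always have the bit pattern 10xxxxxx.
bool isSequenceStart(string s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

void main()
{
    string s = "Привет мир";

    // indexOf returns a byte offset that falls on a character boundary,
    // so it can be fed straight into a slice.
    auto pos = s.indexOf("мир");
    writeln(pos);                   // 13
    writeln(s[pos .. $]);           // мир

    writeln(isSequenceStart(s, 0)); // true: first byte of 'П'
    writeln(isSequenceStart(s, 1)); // false: second byte of 'П'
}
```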

Indexing a UTF-8 string by character rather than by byte index is horribly
inefficient. As others have said, if you really need to do that, use
dchar[] (1). That said, I've never personally come across a place where I
needed that.

1) Be aware that you will need to make sure your data is in a composed
Unicode normal form; otherwise it could still use several code points (2)
to represent a single grapheme.

2) A code point is a point in the Unicode codespace, which is what a 
dchar encodes.

-- 
Oskar

Sep 16 2008
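
The footnote about normal forms can be made concrete with a sketch that assumes std.uni from a current Phobos (normalize, NFC, byGrapheme), none of which is mentioned in the thread itself. It shows one grapheme stored as two code points, and the same text after composition:

```d
import std.range : walkLength;
import std.stdio;
import std.uni : NFC, byGrapheme, normalize;
import std.utf : count;

void main()
{
    // 'e' followed by U+0301 (combining acute accent):
    // one grapheme, two code points.
    string decomposed = "e\u0301";

    writeln(count(decomposed));                // 2 code points
    writeln(decomposed.byGrapheme.walkLength); // 1 grapheme

    // NFC composes the pair into the single code point U+00E9.
    string composed = normalize!NFC(decomposed);
    writeln(count(composed));                  // 1 code point
    writeln(composed == "\u00E9");             // true
}
```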
