
digitalmars.D - Inconsistency

reply "nickles" <ben world-of-ben.de> writes:
Why does <string>.length return the number of bytes and not the
number of UTF-8 characters, whereas <wstring>.length and
<dstring>.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have <string>.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count(<string>))?
Oct 13 2013
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?

 Wouldn't it be more consistent to have <string>.length return 
 the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
Because `length` must be an O(1) operation for built-in arrays, and for UTF-8 strings that would require storing an additional length field, making them binary-incompatible with other array types.
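
A small compile-time sketch of that layout point (an illustration only; the sizes simply reflect a D slice being a pointer/length pair, identical for every element type):

    // Any D slice is just (pointer, length); char[] has the same layout as
    // every other array type, so .length can stay a plain O(1) field read.
    static assert(string.sizeof == (void*).sizeof + size_t.sizeof);
    static assert(string.sizeof == int[].sizeof);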
Oct 13 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Oct-2013 16:36, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?
??? This is simply wrong. All strings return the number of code units. It's only in UTF-32 that a code point (~ character) happens to fit into one code unit.
 Wouldn't it be more consistent to have <string>.length return the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
It's consistent as is. -- Dmitry Olshansky
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
 This is simply wrong. All strings return number of codeunits. 
 And it's only UTF-32 where codepoint (~ character) happens to 
 fit into one codeunit.
I do not agree:

   writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
   writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
Oct 13 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
 I do not agree:

    writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 
 2 [D094], UTF-8)
    writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
    writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
    writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

 This is not consistent - from my point of view.
Because you have a wrong understanding of what "length" means.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
Ok, if my understanding is wrong, how do YOU measure the length of
a string?
Do you always use count(), or is there an alternative?
Oct 13 2013
next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length
 of a string?
Depends on how you define the "length" of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if you ignore the question of encoding (UTF-8, UTF-16, …). David
Oct 13 2013
prev sibling next sibling parent reply Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 15:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a string?
 Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend reading up on the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>).

arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]), and its value is consistent for all string and array types for this purpose.
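
For illustration, a minimal sketch of the three different "lengths" involved, assuming a Phobos version that ships std.uni.byGrapheme (output comments assume a UTF-8 source file):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : count;

    void main()
    {
        // 'ä' written as 'a' + combining diaeresis (U+0308):
        // one display character, two code points, three UTF-8 code units.
        string s = "a\u0308";
        writeln(s.length);                // 3  (code units / bytes)
        writeln(s.count);                 // 2  (code points)
        writeln(s.byGrapheme.walkLength); // 1  (graphemes ~ display characters)
    }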
Oct 13 2013
next sibling parent reply "nickles" <ben world-of-ben.de> writes:
Ok, I understand that "length" is - obviously - used in analogy
to any array's length value.

Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8 which means that a "char" in D can 
be of any length between 1 and 4 bytes for an arbitrary Unicode 
code point. Shouldn't then this (i.e. the character's length) be 
the "unit of measurement" for "char"s - like e.g. the size of the 
underlying struct in an array of "struct"s? The story continues 
with indexing "string"s: In a consistent implementation, shouldn't

    writeln("säд"[2])

return "д" instead of the trailing surrogate of this cyrillic 
letter?
Btw. how do YOU implement this for "string" (for "dstring" it 
works - logically, for "wstring" the same problem arises for code 
points above D800)?

Also, I understand that there is the std.utf.count() function
which returns the length that I was searching for. However, why -
if D is so UTF-8-centric - isn't this function implemented in the
core like ".length"?
Oct 13 2013
next sibling parent "Michael" <pr m1xa.com> writes:
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
First index is zero, no?
Oct 13 2013
prev sibling next sibling parent reply Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 16:14, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in analogy to any
 array's length value.

 Still, this seems to be inconsistent. D elaborates on implementing
 "char"s as UTF-8 which means that a "char" in D can be of any length
 between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
 then this (i.e. the character's length) be the "unit of measurement" for
 "char"s - like e.g. the size of the underlying struct in an array of
 "struct"s? The story continues with indexing "string"s: In a consistent
 implementation, shouldn't

     writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic letter?
This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.

If the string were in UTF-32, [2] could yield either the Cyrillic character or the umlaut diacritic. The .length of the UTF-32 string could be either 3 or 4.

There are multiple reasons why .length and index access are based on code units rather than code points or any higher level representation, but one is that the complexity would suddenly be O(n) instead of O(1). In-place modifications of char[] arrays also wouldn't be possible anymore, as the size of the underlying array might have to change.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
 This will _not_ return a trailing surrogate of a Cyrillic 
 letter. It will return the second code unit of the "ä" 
 character (U+00E4).
True. It's UTF-8, not UTF-16.
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
 If the string were in UTF-32, [2] could yield either the 
 Cyrillic character, or the umlaut diacritic.
 The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. There is no interpretation needed for the code point (except for the "endianism", which could be taken care of in a library/the core).
 There are multiple reasons why .length and index access is 
 based on code units rather than code points or any higher level 
 representation, but one is that the complexity would suddenly 
 be O(n) instead of O(1).
see my last statement below
 In-place modifications of char[] arrays also wouldn't be 
 possible anymore
They would be, but
 as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.

Also, implementing such semantics would not per se abandon byte-wise access, would it?

So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-converting them to "string"s? Btw, what would this mean in terms of speed?

There is no irony in my questions. I'm really looking for solutions...
Oct 13 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 Well that's a point; on the other hand, D is constantly 
 creating and throwing away new strings, so this isn't quite an 
 argument. The current solution puts the programmer in charge of 
 dealing with UTF-x, where a more consistent implementation 
 would put the burden on the implementors of the libraries/core, 
 i.e. the ones who usually have a better understanding of 
 Unicode than the average programmer.
Ironically, the reason is consistency. `string` is just `immutable(char)[]` and it conforms to the usual array behavior rules. Saying that an array element assignment may allocate is hardly a good option.
 So, how do you guys handle UTF-8 strings in D? What are your 
 solutions to the problems described? Does it all come down to 
 converting "string"s and "wstring"s to "dstring"s, manipulating 
 them, and re-convert them to "string"s? Btw, what would this 
 mean in terms of speed?
If single element access is needed, str.front yields a decoded `dchar`. Or simply `foreach (dchar d; str)` - at least it won't hide the fact that it is an O(n) operation. As `str.front` yields dchar, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings.

Slicing / .length are probably the only operations that do not respect UTF-8 encoding (because they are exactly the same for all arrays).
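
A minimal sketch of those access patterns (walkLength is the std.range helper that counts decoded elements; output comments assume a UTF-8 source file):

    import std.range;
    import std.stdio;

    void main()
    {
        string s = "säд";

        writeln(s.front);       // 's' - the first decoded code point, as a dchar

        foreach (dchar c; s)    // decoding foreach; visibly an O(n) walk over code units
            writeln(c);         // 's', 'ä', 'д'

        writeln(s.walkLength);  // 3 - code points (what range algorithms see)
        writeln(s.length);      // 5 - code units (bytes)
    }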
Oct 13 2013
parent "Kagamin" <spam here.lot> writes:
On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:
 If single element access is needed, str.front yields decoded 
 `dchar`. Or simple `foreach (dchar d; str)` - it won't hide the 
 fact it is O(n) operation at least. As `str.front` yields 
 dchar, most `std.algorithm` and `std.range` utilities will also 
 work correctly on default UTF-8 strings.
No, he needs graphemes, so `std.algorithm` won't work correctly for him, as Peter has shown: a grapheme doesn't fit in a dchar.
Oct 15 2013
prev sibling next sibling parent "anonymous" <anonymous example.com> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
This is not about endianness. It's "\u00E4" vs "a\u0308". The first is the single code point 'ä', the second is two code points, 'a' plus umlaut dots. [...]
 Well that's a point; on the other hand, D is constantly 
 creating and throwing away new strings, so this isn't quite an 
 argument. The current solution puts the programmer in charge of 
 dealing with UTF-x, where a more consistent implementation 
 would put the burden on the implementors of the libraries/core, 
 i.e. the ones who usually have a better understanding of 
 Unicode than the average programmer.

 Also, implementing such a semantics would not per se abandon a 
 byte-wise access, would it?

 So, how do you guys handle UTF-8 strings in D? What are your 
 solutions to the problems described? Does it all come down to 
 converting "string"s and "wstring"s to "dstring"s, manipulating 
 them, and re-convert them to "string"s? Btw, what would this 
 mean in terms of speed?

 These is no irony in my questions. I'm really looking for 
 solutions...
I think std.uni and std.utf are supposed to supply everything Unicode-related.
Oct 13 2013
prev sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
You are correct in that UTF-8 is endian agnostic, but I don't believe that was Sönke's point. The point is that ä can be produced in Unicode in more than one way. This program illustrates:

import std.stdio;

void main()
{
    string a = "ä";
    string b = "a\u0308";
    writeln(a);
    writeln(b);
    writeln(cast(ubyte[])a);
    writeln(cast(ubyte[])b);
}

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that they are both the same "character" but have different representations. The first is just the 'ä' code point, which consists of two code units; the second is the 'a' code point followed by a Combining Diaeresis code point.

In short, the string "ä" could be either 2 or 3 code units, and either 1 or 2 code points.
Oct 13 2013
parent reply "Temtaime" <temtaime gmail.com> writes:
I've found another inconsistency problem.

void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);

void main() {
	foo(`123`);
	foo(`123`w);
	foo(`123`d);
}

Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(wchar)[])
Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(dchar)[])

And typeof(`123`).stringof == `string`. Why can `123` be stored
as a null-terminated UTF-8 string in the rdata segment while
`123`w and `123`d cannot? For example, wide strings (UTF-16) are
usable with the Windows *W functions.
Oct 13 2013
next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:
 I've found another inconsistency problem.

 void foo(const char *);
 void foo(const wchar *);
 void foo(const dchar *);

 void main() {
 	foo(`123`);
 	foo(`123`w);
 	foo(`123`d);
 }

 Error: function hello.foo (const(char*)) is not callable using 
 argument types (immutable(wchar)[])
 Error: function hello.foo (const(char*)) is not callable using 
 argument types (immutable(dchar)[])

 And typeof(`123`).stringof == `string`. Why `123` can be stored 
 as null terminated utf8 string in rdata segment and `123`w nor 
 `123`d are not? For example wide strings(utf16) are usable with 
 windows *W functions.
The first one is made to interface with C. It is a special case.
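
A sketch of that special case, assuming the C runtime's puts (from core.stdc.stdio) and Phobos' std.string.toStringz:

    import core.stdc.stdio : puts;
    import std.string : toStringz;

    void main()
    {
        puts("123");          // fine: a string *literal* is stored zero-terminated
                              // and implicitly converts to const(char)*
        string s = "123";
        // puts(s);           // error: a runtime string does not implicitly convert
        puts(s.toStringz);    // explicit conversion that guarantees the trailing '\0'
    }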
Oct 13 2013
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 10/14/13, Temtaime <temtaime gmail.com> wrote:
 And typeof(`123`).stringof == `string`. Why `123` can be stored
 as null terminated utf8 string in rdata segment and `123`w nor
 `123`d are not? For example wide strings(utf16) are usable with
 windows *W functions.
http://d.puremagic.com/issues/show_bug.cgi?id=6032
Oct 13 2013
prev sibling next sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.

 Still, this seems to be inconsistent. D elaborates on 
 implementing "char"s as UTF-8 which means that a "char" in D 
 can be of any length between 1 and 4 bytes for an arbitrary 
 Unicode code point. Shouldn't then this (i.e. the character's 
 length) be the "unit of measurement" for "char"s - like e.g. 
 the size of the underlying struct in an array of "struct"s? The 
 story continues with indexing "string"s: In a consistent 
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
This is impossible given the current design. At runtime "säд"[2] is viewed as struct { void *ptr; size_t length; }; ptr points to memory holding at least five bytes and length has the value 5. Druntime hasn't taken a UTF course.

One option would be to add support in druntime so it can correctly handle such strings, or to implement a separate string type which does not default to char[]. But of course the easiest way is to convince everybody that everything is OK and advise using some library function which does the job correctly, essentially implying that the language does the job wrong (pardon me, some D skepticism - the deeper I am in it, the more critically I view it).
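
A tiny sketch of what that (ptr, length) view means for indexing (the byte values follow the UTF-8 encoding quoted earlier in the thread):

    import std.stdio : writeln;

    void main()
    {
        string s = "säд";            // code units: 73, C3 A4, D0 94
        writeln(s.length);           // 5 - the raw length field
        writeln(cast(ubyte) s[2]);   // 164 (0xA4) - plain byte indexing; the second
                                     // code unit of 'ä', no UTF decoding involved
    }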
Oct 13 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.

 Still, this seems to be inconsistent. D elaborates on 
 implementing "char"s as UTF-8 which means that a "char" in D 
 can be of any length between 1 and 4 bytes for an arbitrary 
 Unicode code point. Shouldn't then this (i.e. the character's 
 length) be the "unit of measurement" for "char"s - like e.g. 
 the size of the underlying struct in an array of "struct"s? The 
 story continues with indexing "string"s: In a consistent 
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
I think the root misunderstanding is that you think that a string is random access. A string *isn't* random access. Strings are implemented *inside* an array, but unless you know *exactly* what you are doing, you shouldn't index, slice or take the length of a string.

A string should be handled like a bidirectional range. Once you've understood that, it becomes much simpler.

You want the first character? front.
You want to skip the first character? popFront.
You want an arbitrary character in O(N) time? myString.dropFrontExactly(N).front;
You want an arbitrary character in O(1) time? You can't.
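
A sketch of that O(N) access using std.range.dropExactly (assuming that is the primitive meant by dropFrontExactly above):

    import std.range;
    import std.stdio;

    void main()
    {
        string s = "säд";
        // O(N): pop the first two code points, then decode the next one.
        writeln(s.dropExactly(2).front);   // 'д'
    }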
Oct 13 2013
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.
That isn't an analogy. It is usually a good idea to try to understand a thing before judging whether it is consistent.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
It's easy to state this, but - please - don't get sarcastic!

I'm obviously (as I've learned) speaking about UTF-8 "char"s as
they are NOT implemented right now in D; so I'm criticizing that
D, as a language which emphasizes "UTF-8 characters", isn't
taking "the last step", like e.g. Python does (and no, I'm not a
Python fan, nor do I consider D a bad language).
Oct 14 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/13 1:09 AM, nickles wrote:
 It's easy to state this, but - please - don't get sarcastical!
Thanks for making this point. String handling in D follows two simple principles:

1. The support is a slice of code units (which often are immutable, seeing as string is an alias for immutable(char)[]). Slice primitives are readily accessible.

2. The standard library (and the foreach language construct) recognize that arrays of code units are special and define bidirectional range primitives on top of them. These are empty, save, front, back, popFront, and popBack.

So for a string you may use the range primitives and related algorithms to manipulate code points, or the slice primitives to manipulate code units.

This duality has been discussed in the past, and alternatives have been proposed (mainly gravitating around making one of the aspects explicit rather than implicit). It is my opinion that a better solution exists (in the form of making the representation accessible only through a property .rep). But the current design has "won" not only because it's the existing one, but also because it has good simplicity and flexibility advantages. At this point there is no question about changing the semantics of existing constructs.

Andrei
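
A short sketch of the duality described above - code-unit slicing vs. code-point-level range algorithms on the same string (std.conv.to is only used here to materialize the lazy result):

    import std.conv : to;
    import std.range : retro;
    import std.stdio : writeln;

    void main()
    {
        string s = "säд";

        // Slice primitives: code units.
        writeln(s[0 .. 1]);           // "s" - one byte sliced off

        // Range primitives: code points. retro walks dchar by dchar.
        writeln(s.retro.to!string);   // "дäs" - reversed per code point, not per byte
    }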
Oct 14 2013
prev sibling parent reply "Kagamin" <spam here.lot> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() function 
 which returns the length that I was searching for. However, why 
 - if D is so UTF-8-centric - isn't this function implemented in 
 the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 15 2013
parent reply "qznc" <qznc web.de> writes:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() function 
 which returns the length that I was searching for. However, 
 why - if D is so UTF-8-centric - isn't this function 
 implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Oct 16 2013
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this function 
 implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
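
For reference, a sketch of doing that composition with Phobos alone, assuming a std.uni new enough to ship normalize and the NFC/NFD form constants:

    import std.stdio : writeln;
    import std.uni : normalize, NFC, NFD;

    void main()
    {
        string decomposed = "a\u0308";                  // OS X-style: 'a' + combining diaeresis
        string composed   = normalize!NFC(decomposed);  // precomposed form

        writeln(composed == "\u00E4");                  // true
        writeln(normalize!NFD("\u00E4") == decomposed); // true - round-trips back
    }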
Oct 16 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
Oct 16 2013
next sibling parent "Chris" <wendlec tcd.ie> writes:
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
My point was it would have been nice to have a native D function that can convert between the two types, especially because this is a well known issue. NSString (Cocoa / Objective-C) for example has things like precomposedStringWithCompatibilityMapping etc.
Oct 16 2013
prev sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
As I argued previously, it is an implementation issue which treats "bär" as a sequence of objects which are not capable of representing the values (like int[] x = [3.14]). By the way, it is a rare case of a type system hole. Usually in D you need a cast or a union to reinterpret some value; with "bär"[X] you need not.
Oct 16 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-10-16 10:03, qznc wrote:

 Most code might be buggy then.

 An issue the often comes up is file names. A file called "bär" will be
 normalized differently depending on the operating system. In both cases
 it is one grapheme. However, on Linux it is one code point, but on OS X
 it is two code points.
Why would it require two code points? -- /Jacob Carlborg
Oct 16 2013
parent reply "qznc" <qznc web.de> writes:
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg 
wrote:
 On 2013-10-16 10:03, qznc wrote:

 Most code might be buggy then.

 An issue the often comes up is file names. A file called "bär" 
 will be
 normalized differently depending on the operating system. In 
 both cases
 it is one grapheme. However, on Linux it is one code point, 
 but on OS X
 it is two code points.
Why would it require two code points?
It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character
Oct 16 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two code
 points. The second is "combining diaeresis" [0]. Not required, but
 possible. Those combining characters [1] provide a nearly infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see. -- /Jacob Carlborg
Oct 16 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two 
 code
 points. The second is "combining diaeresis" [0]. Not required, 
 but
 possible. Those combining characters [1] provide a nearly 
 infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points is that with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get:

"boär" vs "boör"

Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from its "letter" (e.g. sorting "oäa" should *not* generate "aaö"; what it *should* generate is up for debate), you can't entirely consider that a letter+grapheme is a single entity.

Long story short: Unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* Unicode is. Virtually everyone here has solid knowledge of Unicode (I feel). They understand it, can explain it, and can work with it. On the other hand, I don't know many C++ coders that understand Unicode.
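
A sketch of that replace on the two encodings, using std.array.replace:

    import std.array : replace;
    import std.stdio : writeln;

    void main()
    {
        string pre = "ba\u00E4r";    // precomposed ä
        string dec = "baa\u0308r";   // 'a' + combining diaeresis

        writeln(pre.replace("a", "o"));  // "boär" - the precomposed ä is untouched
        writeln(dec.replace("a", "o"));  // "boör" - the base 'a' was replaced, so the
                                         //          diaeresis now sits on the 'o'
    }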
Oct 16 2013
parent reply "qznc" <qznc web.de> writes:
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
 wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two 
 code
 points. The second is "combining diaeresis" [0]. Not 
 required, but
 possible. Those combining characters [1] provide a nearly 
 infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] 
 http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.
I agree with your point. Nevertheless your understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00E4 is the same grapheme as "a\u0308".

http://en.wikipedia.org/wiki/Grapheme
Oct 16 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Oct-2013 23:42, qznc wrote:
 On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
 On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two code
 points. The second is "combining diaeresis" [0]. Not required, but
 possible. Those combining characters [1] provide a nearly infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.
I agree with your point. Nevertheless you understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308".
s/the same/canonically equivalent/ :)
 http://en.wikipedia.org/wiki/Grapheme
-- Dmitry Olshansky
Oct 16 2013
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:
 I agree with your point. Nevertheless you understanding of 
 grapheme is off. U+0308 is not a grapheme.  "a\u0308" is one 
 grapheme. U+00e4 is the same grapheme as "a\u0308".

 http://en.wikipedia.org/wiki/Grapheme
Ah. Learn something new every day. :)
Oct 16 2013
prev sibling parent "Kagamin" <spam here.lot> writes:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 Most code might be buggy then.
All code is buggy.
 An issue the often comes up is file names. A file called "bär" 
 will be normalized differently depending on the operating 
 system. In both cases it is one grapheme. However, on Linux it 
 is one code point, but on OS X it is two code points.
And on Windows it's case-insensitive - 2^^N variants of each string. So what?
Oct 20 2013
prev sibling parent "Chris" <wendlec tcd.ie> writes:
On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:
 On 13.10.2013 15:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length
 of a string?
 Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend to read up the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>). arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]) and its value is consistent for all string and array types for this purpose.
I recently discovered a bug in my program. If you take the letter "é" for example (Linux, Ubuntu 12.04), std.utf.count() returns 1 and .length returns 2. I needed the length to slice the string at a given point. Using .length instead of std.utf.count() fixed the bug.
Oct 14 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Oct-2013 17:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a string?
 Do you always use count(), or is there an alternative?
It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters, but I don't mind it. Measuring the number of visible characters isn't trivial, but can be done by counting the number of graphemes. For simple alphabets counting code points will do the trick as well (which is what count does).

-- 
Dmitry Olshansky
Oct 13 2013
parent Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 15:50, Dmitry Olshansky wrote:
 13-Oct-2013 17:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a
 string?
 Do you always use count(), or is there an alternative?
It's all there: http://www.unicode.org/glossary/ http://www.unicode.org/versions/Unicode6.3.0/ I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters but I don't mind it. Measuring number of visible characters isn't trivial but can be done by counting number of graphemes. For simple alphabets counting code points will do the trick as well (what count does).
But you have to take care to normalize the string WRT diacritics if the estimate is supposed to work. OS X for example (if I remember right) always uses explicit combining characters, while Windows uses precomposed characters if possible.
Oct 13 2013
prev sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
 This is simply wrong. All strings return number of codeunits. 
 And it's only UTF-32 where codepoint (~ character) happens to 
 fit into one codeunit.
I do not agree:

   writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
   writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
There is not a single inconsistency here; there are several. First of all, typeof("säд") yields the string type (immutable(char)[]) while typeof(['s', 'ä', 'д']) yields neither char[], nor wchar[], nor even dchar[], but int[]. In this case D is close to C, which also treats character literals as an integer type. Secondly, character arrays are the only ones which have two kinds of array literals - the usual [item, item, item] and the special "blah" - and as you see there is no correspondence between them. If you try char[] x = cast(char[])['s', 'ä', 'д'] then the length would indeed be 3 (but don't use it - it is broken).

In D a dynamic array is at the binary level represented as struct { void *ptr; size_t length; }. When you perform some operations on dynamic arrays they are implemented by the compiler as calls to runtime functions. However, at runtime it is impossible to do something useful with arrays for which there is only information about the address of the beginning and the total number of elements (this is a source of other problems in D). To handle this, the compiler generates and passes a "TypeInfo" as a separate argument to runtime functions. TypeInfo contains some data; most relevant here is the size of the element.

What happens is as follows. The compiler recognizes that "säд" should be a string literal encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type should be char, which requires the array to have 5 elements. So, at runtime the object "säд" is treated as an array of 5 elements, each 1 byte in size.

Basically string (and char[]) plays a dual role in the language - on the one hand, it is an array of elements having strictly 1 byte size by definition; on the other hand, D tries to use it as a 'generic' UTF type for which the size is not fixed. So there is a contradiction - in source code such strings are viewed by the programmer as some abstract UTF string, but druntime views them as a 5 byte array.

In my view, trouble begins when "säд" is internally cast to char (which is no better than int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is refused by the language, so there is a great inconsistency here. By the way, the UTF definition is irrelevant here; this is purely an implementation issue (I think it is a design fault).
Oct 13 2013
prev sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?

 Wouldn't it be more consistent to have <string>.length return 
 the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
Technically, UTF-16 can use 2 ushorts for 1 character, so <wstring>.length returns the number of ushorts (code units), not the number of UTF-16 characters.
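
A quick sketch with a code point outside the Basic Multilingual Plane, which needs a surrogate pair in UTF-16:

    import std.stdio : writeln;

    void main()
    {
        wstring w = "\U0001F600"w;     // one character (an emoji), one code point
        writeln(w.length);             // 2 - encoded as a UTF-16 surrogate pair
        writeln("\U0001F600"d.length); // 1 - one UTF-32 code unit
        writeln("\U0001F600".length);  // 4 - four UTF-8 code units
    }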
Oct 13 2013