
digitalmars.D - Inconsistency

reply "nickles" <ben world-of-ben.de> writes:
Why does <string>.length return the number of bytes and not the
number of UTF-8 characters, whereas <wstring>.length and
<dstring>.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have <string>.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count(<string>))?
Oct 13 2013
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?

 Wouldn't it be more consistent to have <string>.length return 
 the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
Because `length` must be an O(1) operation for built-in arrays, and for UTF-8 strings that would require storing an additional length field, making them binary-incompatible with other array types.
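
A small compile-time sketch of that layout point (an illustration only; the sizes simply reflect a D slice being a pointer/length pair, identical for every element type):

    // Any D slice is just (pointer, length); char[] has the same layout as
    // every other array type, so .length can stay a plain O(1) field read.
    static assert(string.sizeof == (void*).sizeof + size_t.sizeof);
    static assert(string.sizeof == int[].sizeof);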
Oct 13 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Oct-2013 16:36, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?
??? This is simply wrong. All strings return the number of code units. It's only in UTF-32 that a code point (~ character) happens to fit into one code unit.
 Wouldn't it be more consistent to have <string>.length return the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
It's consistent as is. -- Dmitry Olshansky
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
 This is simply wrong. All strings return number of codeunits. 
 And it's only UTF-32 where codepoint (~ character) happens to 
 fit into one codeunit.
I do not agree:

   writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
   writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
Oct 13 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
 I do not agree:

    writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 
 2 [D094], UTF-8)
    writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
    writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
    writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

 This is not consistent - from my point of view.
Because you have a wrong understanding of what "length" means.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
Ok, if my understanding is wrong, how do YOU measure the length of
a string?
Do you always use count(), or is there an alternative?
Oct 13 2013
next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length
 of a string?
Depends on how you define the "length" of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if you ignore the question of encoding (UTF-8, UTF-16, …). David
Oct 13 2013
prev sibling next sibling parent reply Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 15:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a string?
 Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend reading up on the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>).

arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]), and its value is consistent for all string and array types for this purpose.
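
For illustration, a minimal sketch of the three different "lengths" involved, assuming a Phobos version that ships std.uni.byGrapheme (output comments assume a UTF-8 source file):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : count;

    void main()
    {
        // 'ä' written as 'a' + combining diaeresis (U+0308):
        // one display character, two code points, three UTF-8 code units.
        string s = "a\u0308";
        writeln(s.length);                // 3  (code units / bytes)
        writeln(s.count);                 // 2  (code points)
        writeln(s.byGrapheme.walkLength); // 1  (graphemes ~ display characters)
    }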
Oct 13 2013
next sibling parent reply "nickles" <ben world-of-ben.de> writes:
Ok, I understand that "length" is - obviously - used in analogy
to any array's length value.

Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8 which means that a "char" in D can 
be of any length between 1 and 4 bytes for an arbitrary Unicode 
code point. Shouldn't then this (i.e. the character's length) be 
the "unit of measurement" for "char"s - like e.g. the size of the 
underlying struct in an array of "struct"s? The story continues 
with indexing "string"s: In a consistent implementation, shouldn't

    writeln("säд"[2])

return "д" instead of the trailing surrogate of this cyrillic 
letter?
Btw. how do YOU implement this for "string" (for "dstring" it 
works - logically, for "wstring" the same problem arises for code 
points above D800)?

Also, I understand that there is the std.utf.count() function
which returns the length that I was searching for. However, why -
if D is so UTF-8-centric - isn't this function implemented in the
core like ".length"?
Oct 13 2013
next sibling parent "Michael" <pr m1xa.com> writes:
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
First index is zero, no?
Oct 13 2013
prev sibling next sibling parent reply Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 16:14, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in analogy to any
 array's length value.

 Still, this seems to be inconsistent. D elaborates on implementing
 "char"s as UTF-8 which means that a "char" in D can be of any length
 between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
 then this (i.e. the character's length) be the "unit of measurement" for
 "char"s - like e.g. the size of the underlying struct in an array of
 "struct"s? The story continues with indexing "string"s: In a consistent
 implementation, shouldn't

     writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic letter?
This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.

If the string were in UTF-32, [2] could yield either the Cyrillic character or the umlaut diacritic. The .length of the UTF-32 string could be either 3 or 4.

There are multiple reasons why .length and index access are based on code units rather than code points or any higher level representation, but one is that the complexity would suddenly be O(n) instead of O(1). In-place modifications of char[] arrays also wouldn't be possible anymore, as the size of the underlying array might have to change.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
 This will _not_ return a trailing surrogate of a Cyrillic 
 letter. It will return the second code unit of the "ä" 
 character (U+00E4).
True. It's UTF-8, not UTF-16.
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
 If the string were in UTF-32, [2] could yield either the 
 Cyrillic character, or the umlaut diacritic.
 The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. There is no interpretation needed for the code point (except for the "endianism", which could be taken care of in a library/the core).
 There are multiple reasons why .length and index access is 
 based on code units rather than code points or any higher level 
 representation, but one is that the complexity would suddenly 
 be O(n) instead of O(1).
see my last statement below
 In-place modifications of char[] arrays also wouldn't be 
 possible anymore
They would be, but
 as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.

Also, implementing such semantics would not per se abandon byte-wise access, would it?

So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-converting them to "string"s? Btw, what would this mean in terms of speed?

There is no irony in my questions. I'm really looking for solutions...
Oct 13 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 Well that's a point; on the other hand, D is constantly 
 creating and throwing away new strings, so this isn't quite an 
 argument. The current solution puts the programmer in charge of 
 dealing with UTF-x, where a more consistent implementation 
 would put the burden on the implementors of the libraries/core, 
 i.e. the ones who usually have a better understanding of 
 Unicode than the average programmer.
Ironically, the reason is consistency. `string` is just `immutable(char)[]` and it conforms to the usual array behavior rules. Saying that an array element assignment may allocate is hardly a good option.
 So, how do you guys handle UTF-8 strings in D? What are your 
 solutions to the problems described? Does it all come down to 
 converting "string"s and "wstring"s to "dstring"s, manipulating 
 them, and re-convert them to "string"s? Btw, what would this 
 mean in terms of speed?
If single element access is needed, str.front yields a decoded `dchar`. Or simply `foreach (dchar d; str)` - at least it won't hide the fact that it is an O(n) operation. As `str.front` yields dchar, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings.

Slicing / .length are probably the only operations that do not respect UTF-8 encoding (because they are exactly the same for all arrays).
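
A minimal sketch of those access patterns (walkLength is the std.range helper that counts decoded elements; output comments assume a UTF-8 source file):

    import std.range;
    import std.stdio;

    void main()
    {
        string s = "säд";

        writeln(s.front);       // 's' - the first decoded code point, as a dchar

        foreach (dchar c; s)    // decoding foreach; visibly an O(n) walk over code units
            writeln(c);         // 's', 'ä', 'д'

        writeln(s.walkLength);  // 3 - code points (what range algorithms see)
        writeln(s.length);      // 5 - code units (bytes)
    }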
Oct 13 2013
parent "Kagamin" <spam here.lot> writes:
On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:
 If single element access is needed, str.front yields decoded 
 `dchar`. Or simple `foreach (dchar d; str)` - it won't hide the 
 fact it is O(n) operation at least. As `str.front` yields 
 dchar, most `std.algorithm` and `std.range` utilities will also 
 work correctly on default UTF-8 strings.
No, he needs graphemes, so `std.algorithm` won't work correctly for him, as Peter has shown: a grapheme doesn't fit in a dchar.
Oct 15 2013
prev sibling next sibling parent "anonymous" <anonymous example.com> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
This is not about endianness. It's "\u00E4" vs "a\u0308". The first is the single code point 'ä', the second is two code points, 'a' plus umlaut dots. [...]
 Well that's a point; on the other hand, D is constantly 
 creating and throwing away new strings, so this isn't quite an 
 argument. The current solution puts the programmer in charge of 
 dealing with UTF-x, where a more consistent implementation 
 would put the burden on the implementors of the libraries/core, 
 i.e. the ones who usually have a better understanding of 
 Unicode than the average programmer.

 Also, implementing such a semantics would not per se abandon a 
 byte-wise access, would it?

 So, how do you guys handle UTF-8 strings in D? What are your 
 solutions to the problems described? Does it all come down to 
 converting "string"s and "wstring"s to "dstring"s, manipulating 
 them, and re-convert them to "string"s? Btw, what would this 
 mean in terms of speed?

 These is no irony in my questions. I'm really looking for 
 solutions...
I think std.uni and std.utf are supposed to supply everything Unicode-related.
Oct 13 2013
prev sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
 However, it could also yield the first code unit of the umlaut 
 diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to "endianism".
You are correct in that UTF-8 is endian agnostic, but I don't believe that was Sönke's point. The point is that ä can be produced in Unicode in more than one way. This program illustrates:

import std.stdio;

void main()
{
    string a = "ä";
    string b = "a\u0308";
    writeln(a);
    writeln(b);
    writeln(cast(ubyte[])a);
    writeln(cast(ubyte[])b);
}

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that they are both the same "character" but have different representations. The first is just the 'ä' code point, which consists of two code units; the second is the 'a' code point followed by a Combining Diaeresis code point.

In short, the string "ä" could be either 2 or 3 code units, and either 1 or 2 code points.
Oct 13 2013
parent reply "Temtaime" <temtaime gmail.com> writes:
I've found another inconsistency problem.

void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);

void main() {
	foo(`123`);
	foo(`123`w);
	foo(`123`d);
}

Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(wchar)[])
Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(dchar)[])

And typeof(`123`).stringof == `string`. Why can `123` be stored
as a null-terminated UTF-8 string in the rdata segment while
`123`w and `123`d cannot? For example, wide strings (UTF-16) are
usable with the Windows *W functions.
Oct 13 2013
next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:
 I've found another inconsistency problem.

 void foo(const char *);
 void foo(const wchar *);
 void foo(const dchar *);

 void main() {
 	foo(`123`);
 	foo(`123`w);
 	foo(`123`d);
 }

 Error: function hello.foo (const(char*)) is not callable using 
 argument types (immutable(wchar)[])
 Error: function hello.foo (const(char*)) is not callable using 
 argument types (immutable(dchar)[])

 And typeof(`123`).stringof == `string`. Why `123` can be stored 
 as null terminated utf8 string in rdata segment and `123`w nor 
 `123`d are not? For example wide strings(utf16) are usable with 
 windows *W functions.
The first one is made to interface with C. It is a special case.
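
A sketch of that special case, assuming the C runtime's puts (from core.stdc.stdio) and Phobos' std.string.toStringz:

    import core.stdc.stdio : puts;
    import std.string : toStringz;

    void main()
    {
        puts("123");          // fine: a string *literal* is stored zero-terminated
                              // and implicitly converts to const(char)*
        string s = "123";
        // puts(s);           // error: a runtime string does not implicitly convert
        puts(s.toStringz);    // explicit conversion that guarantees the trailing '\0'
    }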
Oct 13 2013
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 10/14/13, Temtaime <temtaime gmail.com> wrote:
 And typeof(`123`).stringof == `string`. Why `123` can be stored
 as null terminated utf8 string in rdata segment and `123`w nor
 `123`d are not? For example wide strings(utf16) are usable with
 windows *W functions.
http://d.puremagic.com/issues/show_bug.cgi?id=6032
Oct 13 2013
prev sibling next sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.

 Still, this seems to be inconsistent. D elaborates on 
 implementing "char"s as UTF-8 which means that a "char" in D 
 can be of any length between 1 and 4 bytes for an arbitrary 
 Unicode code point. Shouldn't then this (i.e. the character's 
 length) be the "unit of measurement" for "char"s - like e.g. 
 the size of the underlying struct in an array of "struct"s? The 
 story continues with indexing "string"s: In a consistent 
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
This is impossible given the current design. At runtime "säд"[2] is viewed as struct { void *ptr; size_t length; }; ptr points to memory holding at least five bytes and length has the value 5. Druntime hasn't taken a UTF course.

One option would be to add support in druntime so it can correctly handle such strings, or to implement a separate string type which does not default to char[]. But of course the easiest way is to convince everybody that everything is OK and advise using some library function which does the job correctly, essentially implying that the language does the job wrong (pardon me, some D skepticism - the deeper I am in it, the more critically I view it).
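
A tiny sketch of what that (ptr, length) view means for indexing (the byte values follow the UTF-8 encoding quoted earlier in the thread):

    import std.stdio : writeln;

    void main()
    {
        string s = "säд";            // code units: 73, C3 A4, D0 94
        writeln(s.length);           // 5 - the raw length field
        writeln(cast(ubyte) s[2]);   // 164 (0xA4) - plain byte indexing; the second
                                     // code unit of 'ä', no UTF decoding involved
    }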
Oct 13 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.

 Still, this seems to be inconsistent. D elaborates on 
 implementing "char"s as UTF-8 which means that a "char" in D 
 can be of any length between 1 and 4 bytes for an arbitrary 
 Unicode code point. Shouldn't then this (i.e. the character's 
 length) be the "unit of measurement" for "char"s - like e.g. 
 the size of the underlying struct in an array of "struct"s? The 
 story continues with indexing "string"s: In a consistent 
 implementation, shouldn't

    writeln("säд"[2])

 return "д" instead of the trailing surrogate of this cyrillic 
 letter?
I think the root misunderstanding is that you think that a string is random access. A string *isn't* random access. Strings are implemented *inside* an array, but unless you know *exactly* what you are doing, you shouldn't index, slice or take the length of a string.

A string should be handled like a bidirectional range. Once you've understood that, it becomes much simpler.

You want the first character? front.
You want to skip the first character? popFront.
You want an arbitrary character in O(N) time? myString.dropFrontExactly(N).front;
You want an arbitrary character in O(1) time? You can't.
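
A sketch of that O(N) access using std.range.dropExactly (assuming that is the primitive meant by dropFrontExactly above):

    import std.range;
    import std.stdio;

    void main()
    {
        string s = "säд";
        // O(N): pop the first two code points, then decode the next one.
        writeln(s.dropExactly(2).front);   // 'д'
    }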
Oct 13 2013
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Ok, I understand, that "length" is - obviously - used in 
 analogy to any array's length value.
That isn't an analogy. It is usually a good idea to try to understand a thing before judging whether it is consistent.
Oct 13 2013
parent reply "nickles" <ben world-of-ben.de> writes:
It's easy to state this, but - please - don't get sarcastic!

I'm obviously (as I've learned) speaking about UTF-8 "char"s as
they are NOT implemented right now in D; so I'm criticizing that
D, as a language which emphasizes "UTF-8 characters", isn't
taking "the last step", like e.g. Python does (and no, I'm not a
Python fan, nor do I consider D a bad language).
Oct 14 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/13 1:09 AM, nickles wrote:
 It's easy to state this, but - please - don't get sarcastical!
Thanks for making this point. String handling in D follows two simple principles:

1. The support is a slice of code units (which often are immutable, seeing as string is an alias for immutable(char)[]). Slice primitives are readily accessible.

2. The standard library (and the foreach language construct) recognize that arrays of code units are special and define bidirectional range primitives on top of them. These are empty, save, front, back, popFront, and popBack.

So for a string you may use the range primitives and related algorithms to manipulate code points, or the slice primitives to manipulate code units.

This duality has been discussed in the past, and alternatives have been proposed (mainly gravitating around making one of the aspects explicit rather than implicit). It is my opinion that a better solution exists (in the form of making the representation accessible only through a property .rep). But the current design has "won" not only because it's the existing one, but also because it has good simplicity and flexibility advantages. At this point there is no question about changing the semantics of existing constructs.

Andrei
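
A short sketch of the duality described above - code-unit slicing vs. code-point-level range algorithms on the same string (std.conv.to is only used here to materialize the lazy result):

    import std.conv : to;
    import std.range : retro;
    import std.stdio : writeln;

    void main()
    {
        string s = "säд";

        // Slice primitives: code units.
        writeln(s[0 .. 1]);           // "s" - one byte sliced off

        // Range primitives: code points. retro walks dchar by dchar.
        writeln(s.retro.to!string);   // "дäs" - reversed per code point, not per byte
    }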
Oct 14 2013
prev sibling parent reply "Kagamin" <spam here.lot> writes:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() function 
 which returns the length that I was searching for. However, why 
 - if D is so UTF-8-centric - isn't this function implemented in 
 the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 15 2013
parent reply "qznc" <qznc web.de> writes:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() function 
 which returns the length that I was searching for. However, 
 why - if D is so UTF-8-centric - isn't this function 
 implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Oct 16 2013
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this function 
 implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
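
For reference, a sketch of doing that composition with Phobos alone, assuming a std.uni new enough to ship normalize and the NFC/NFD form constants:

    import std.stdio : writeln;
    import std.uni : normalize, NFC, NFD;

    void main()
    {
        string decomposed = "a\u0308";                  // OS X-style: 'a' + combining diaeresis
        string composed   = normalize!NFC(decomposed);  // precomposed form

        writeln(composed == "\u00E4");                  // true
        writeln(normalize!NFD("\u00E4") == decomposed); // true - round-trips back
    }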
Oct 16 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
Oct 16 2013
next sibling parent "Chris" <wendlec tcd.ie> writes:
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
My point was it would have been nice to have a native D function that can convert between the two types, especially because this is a well known issue. NSString (Cocoa / Objective-C) for example has things like precomposedStringWithCompatibilityMapping etc.
Oct 16 2013
prev sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
 On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
 On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
 Also, I understand, that there is the std.utf.count() 
 function which returns the length that I was searching for. 
 However, why - if D is so UTF-8-centric - isn't this 
 function implemented in the core like ".length"?
Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.
As I argued previously, it is an implementation issue which treats "bär" as a sequence of objects which are not capable of representing the values (like int[] x = [3.14]). By the way, it is a rare case of a type system hole. Usually in D you need a cast or a union to reinterpret some value; with "bär"[X] you need not.
Oct 16 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-10-16 10:03, qznc wrote:

 Most code might be buggy then.

 An issue the often comes up is file names. A file called "bär" will be
 normalized differently depending on the operating system. In both cases
 it is one grapheme. However, on Linux it is one code point, but on OS X
 it is two code points.
Why would it require two code points? -- /Jacob Carlborg
Oct 16 2013
parent reply "qznc" <qznc web.de> writes:
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg 
wrote:
 On 2013-10-16 10:03, qznc wrote:

 Most code might be buggy then.

 An issue the often comes up is file names. A file called "bär" 
 will be
 normalized differently depending on the operating system. In 
 both cases
 it is one grapheme. However, on Linux it is one code point, 
 but on OS X
 it is two code points.
Why would it require two code points?
It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character
Oct 16 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two code
 points. The second is "combining diaeresis" [0]. Not required, but
 possible. Those combining characters [1] provide a nearly infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see. -- /Jacob Carlborg
Oct 16 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two 
 code
 points. The second is "combining diaeresis" [0]. Not required, 
 but
 possible. Those combining characters [1] provide a nearly 
 infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points is that with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get:

"boär" vs "boör"

Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from its "letter" (e.g. sorting "oäa" should *not* generate "aaö"; what it *should* generate is up for debate), you can't entirely consider that a letter+grapheme is a single entity.

Long story short: Unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* Unicode is. Virtually everyone here has solid knowledge of Unicode (I feel). They understand it, can explain it, and can work with it. On the other hand, I don't know many C++ coders that understand Unicode.
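
A sketch of that replace on the two encodings, using std.array.replace:

    import std.array : replace;
    import std.stdio : writeln;

    void main()
    {
        string pre = "ba\u00E4r";    // precomposed ä
        string dec = "baa\u0308r";   // 'a' + combining diaeresis

        writeln(pre.replace("a", "o"));  // "boär" - the precomposed ä is untouched
        writeln(dec.replace("a", "o"));  // "boör" - the base 'a' was replaced, so the
                                         //          diaeresis now sits on the 'o'
    }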
Oct 16 2013
parent reply "qznc" <qznc web.de> writes:
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra 
wrote:
 On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
 wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two 
 code
 points. The second is "combining diaeresis" [0]. Not 
 required, but
 possible. Those combining characters [1] provide a nearly 
 infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] 
 http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.
I agree with your point. Nevertheless your understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00E4 is the same grapheme as "a\u0308".

http://en.wikipedia.org/wiki/Grapheme
Oct 16 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Oct-2013 23:42, qznc wrote:
 On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
 On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
 On 2013-10-16 14:33, qznc wrote:

 It is either [U+00E4] as one code point or [a,U+0308] for two code
 points. The second is "combining diaeresis" [0]. Not required, but
 possible. Those combining characters [1] provide a nearly infinite
 number of combinations. You can go crazy with it:
 http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

 [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
 [1] http://en.wikipedia.org/wiki/Combining_character
Aha, now I see.
One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.
I agree with your point. Nevertheless you understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308".
s/the same/canonically equivalent/ :)
 http://en.wikipedia.org/wiki/Grapheme
-- Dmitry Olshansky
Oct 16 2013
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:
 I agree with your point. Nevertheless you understanding of 
 grapheme is off. U+0308 is not a grapheme.  "a\u0308" is one 
 grapheme. U+00e4 is the same grapheme as "a\u0308".

 http://en.wikipedia.org/wiki/Grapheme
Ah. Learn something new every day. :)
Oct 16 2013
prev sibling parent "Kagamin" <spam here.lot> writes:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
 Most code might be buggy then.
All code is buggy.
 An issue the often comes up is file names. A file called "bär" 
 will be normalized differently depending on the operating 
 system. In both cases it is one grapheme. However, on Linux it 
 is one code point, but on OS X it is two code points.
And on Windows it's case-insensitive - 2^^N variants of each string. So what?
Oct 20 2013
prev sibling parent "Chris" <wendlec tcd.ie> writes:
On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:
 On 13.10.2013 15:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length
 of a string?
 Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend to read up the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>). arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]) and its value is consistent for all string and array types for this purpose.
I recently discovered a bug in my program. If you take the letter "é" for example (Linux, Ubuntu 12.04), std.utf.count() returns 1 and .length returns 2. I needed the length to slice the string at a given point. Using .length instead of std.utf.count() fixed the bug.
Oct 14 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Oct-2013 17:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a string?
 Do you always use count(), or is there an alternative?
It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters, but I don't mind it. Measuring the number of visible characters isn't trivial, but can be done by counting the number of graphemes. For simple alphabets counting code points will do the trick as well (which is what count does).

-- 
Dmitry Olshansky
Oct 13 2013
parent Sönke Ludwig <sludwig outerproduct.org> writes:
On 13.10.2013 15:50, Dmitry Olshansky wrote:
 13-Oct-2013 17:25, nickles wrote:
 Ok, if my understanding is wrong, how do YOU measure the length of a
 string?
 Do you always use count(), or is there an alternative?
It's all there: http://www.unicode.org/glossary/ http://www.unicode.org/versions/Unicode6.3.0/ I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters but I don't mind it. Measuring number of visible characters isn't trivial but can be done by counting number of graphemes. For simple alphabets counting code points will do the trick as well (what count does).
But you have to take care to normalize the string WRT diacritics if the estimate is supposed to work. OS X for example (if I remember right) always uses explicit combining characters, while Windows uses precomposed characters if possible.
Oct 13 2013
prev sibling parent "Maxim Fomin" <maxim maxim-fomin.ru> writes:
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
 This is simply wrong. All strings return number of codeunits. 
 And it's only UTF-32 where codepoint (~ character) happens to 
 fit into one codeunit.
I do not agree:

   writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
   writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
There is not a single inconsistency here; there are several. First of all, typeof("säд") yields the string type (immutable(char)[]) while typeof(['s', 'ä', 'д']) yields neither char[], nor wchar[], nor even dchar[], but int[]. In this case D is close to C, which also treats character literals as an integer type. Secondly, character arrays are the only ones which have two kinds of array literals - the usual [item, item, item] and the special "blah" - and as you see there is no correspondence between them. If you try char[] x = cast(char[])['s', 'ä', 'д'] then the length would indeed be 3 (but don't use it - it is broken).

In D a dynamic array is at the binary level represented as struct { void *ptr; size_t length; }. When you perform some operations on dynamic arrays they are implemented by the compiler as calls to runtime functions. However, at runtime it is impossible to do something useful with arrays for which there is only information about the address of the beginning and the total number of elements (this is a source of other problems in D). To handle this, the compiler generates and passes a "TypeInfo" as a separate argument to runtime functions. TypeInfo contains some data; most relevant here is the size of the element.

What happens is as follows. The compiler recognizes that "säд" should be a string literal encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type should be char, which requires the array to have 5 elements. So, at runtime the object "säд" is treated as an array of 5 elements, each 1 byte in size.

Basically string (and char[]) plays a dual role in the language - on the one hand, it is an array of elements having strictly 1 byte size by definition; on the other hand, D tries to use it as a 'generic' UTF type for which the size is not fixed. So there is a contradiction - in source code such strings are viewed by the programmer as some abstract UTF string, but druntime views them as a 5 byte array.

In my view, trouble begins when "säд" is internally cast to char (which is no better than int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is refused by the language, so there is a great inconsistency here. By the way, the UTF definition is irrelevant here; this is purely an implementation issue (I think it is a design fault).
Oct 13 2013
prev sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
 Why does <string>.length return the number of bytes and not the
 number of UTF-8 characters, whereas <wstring>.length and
 <dstring>.length return the number of UTF-16 and UTF-32
 characters?

 Wouldn't it be more consistent to have <string>.length return 
 the
 number of UTF-8 characters as well (instead of having to use
 std.utf.count(<string>)?
Technically, UTF-16 can use 2 ushorts for 1 character, so <wstring>.length returns the number of ushorts (code units), not the number of UTF-16 characters.
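
A quick sketch with a code point outside the Basic Multilingual Plane, which needs a surrogate pair in UTF-16:

    import std.stdio : writeln;

    void main()
    {
        wstring w = "\U0001F600"w;     // one character (an emoji), one code point
        writeln(w.length);             // 2 - encoded as a UTF-16 surrogate pair
        writeln("\U0001F600"d.length); // 1 - one UTF-32 code unit
        writeln("\U0001F600".length);  // 4 - four UTF-8 code units
    }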
Oct 13 2013