digitalmars.D - Beginner not getting "string"

reply Nick <nick example.com> writes:
Reading Andrei's book and something seems amiss:

1. A char in D is a code *unit* not a code point. Considering that code 
units are generally used to encode in an encoding, I would have expected 
that the type for a code unit to be byte or something similar, as far 
from code points as possible. In my mind, Unicode characters, aka chars 
are code points.

2. Thus a string in D is an array of code *units*, although in Unicode a 
string is really an array of code points.

3. Iterating a string in D is wrong by default, iterating over code 
units instead of characters (code points). Even worse, the error does 
not appear until you put some non-ascii text in there.

4. All string-processing calls (like sort, toupper, split and such) are 
by default wrong on non-ascii strings. Wrong without any error, warning 
or anything.

So I guess my question is why, in a language with the power and 
expressiveness of D, in our day and age, would one choose such an 
exposed, fragile implementation of string that ensures that the default 
code one writes for text manipulation is most likely wrong?

I18N is one of the first things I judge a new language by and so far D 
is... puzzling.

I don't know much about D so I am probably just not getting it but can 
you please point me to some rationale behind these string design decisions?

Thanks!
Nick
Aug 29 2010
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Nick (nick example.com)'s article
 Reading Andrei's book and something seems amiss:
 1. A char in D is a code *unit* not a code point. Considering that code
 units are generally used to encode in an encoding, I would have expected
 that the type for a code unit to be byte or something similar, as far
 from code points as possible. In my mind, Unicode characters, aka chars
 are code points.

Basically, you can't have a regular array of code points without maintaining some additional data structures. Those can easily be built on top of an array of code units, but you can't build an array of code units on top of an array of code points, at least not efficiently.
 2. Thus a string in D is an array of code *units*, although in Unicode a
 string is really an array of code points.

This is admittedly a wart. However, when you use std.range, ElementType!(string) == dchar, so iterating with range primitives does what you'd think it should.
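A minimal sketch of what that means in practice, assuming a Phobos version that auto-decodes narrow strings. The helper name codePointsOf is made up for illustration; ElementType, empty, front, popFront and walkLength are the std.range names.

```d
import std.range : ElementType, empty, front, popFront, walkLength;

// Viewed as a range, a string yields dchar (code points), not char.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: collect the code points of s using only range primitives.
dchar[] codePointsOf(string s)
{
    dchar[] result;
    for (; !s.empty; s.popFront())
        result ~= s.front; // front decodes one whole code point
    return result;
}

unittest
{
    string s = "h\u00E9llo";       // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);         // array view: counts code units
    assert(s.walkLength == 5);     // range view: counts code points
    assert(codePointsOf(s) == "h\u00E9llo"d);
}
```

So the array property .length and the range walk disagree as soon as non-ASCII text shows up, which is exactly the wart.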
 3. Iterating a string in D is wrong by default, iterating over code
 units instead of characters (code points). Even worse, the error does
 not appear until you put some non-ascii text in there.

See answer to (2).
 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

If these don't work right for non-ASCII strings then it's a bug, not a design issue. Please file bug reports (1 per function).
 So I guess my question is why, in a language with the power and
 expressiveness of D, in our day and age, would one choose such an
 exposed, fragile implementation of string that ensures that the default
 code one writes for text manipulation is most likely wrong?
 I18N is one of the first things I judge a new language by and so far D
 is... puzzling.

Part of it was a silly design error that became too hard to change. Most of it, though, is to avoid abstraction inversion (http://en.wikipedia.org/wiki/Abstraction_inversion) by providing access to the lower-level aspects of Unicode strings.
Aug 29 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/29/2010 12:44 PM, Nick wrote:
 Reading Andrei's book and something seems amiss:

 1. A char in D is a code *unit* not a code point. Considering that code
 units are generally used to encode in an encoding, I would have expected
 that the type for a code unit to be byte or something similar, as far
 from code points as possible. In my mind, Unicode characters, aka chars
 are code points.

(Background for others: code point == actual conceptual character, code unit == the smallest unit of encoding (one byte for UTF8, two bytes for UTF16, four bytes for UTF32). In UTF32 code units are chosen to be equal to code points.) Indeed, D's char is a UTF-8 code unit, and wchar is a UTF-16 code point. (dchar is at the same time a UTF-32 code unit and a Unicode code point.) Making the type of a code unit byte would considerably weaken the expressive power because an array of byte[] could be considered either untyped data or UTF-encoded data without a static means to differentiate between the two. This would be largely obviated by making string an elaborate type, but there are considerable advantages to having string a regular array type.
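To make the code unit sizes concrete, here is a small sketch that only uses compile-time checks. The sample character U+1D11E is picked for illustration because it lies outside the Basic Multilingual Plane.

```d
// Size of one code unit for each character type.
static assert(char.sizeof == 1);   // UTF-8 code unit
static assert(wchar.sizeof == 2);  // UTF-16 code unit
static assert(dchar.sizeof == 4);  // UTF-32 code unit, also a whole code point

// One code point, three encodings, three different array lengths.
static assert("\U0001D11E".length == 4);   // four UTF-8 code units
static assert("\U0001D11E"w.length == 2);  // a UTF-16 surrogate pair
static assert("\U0001D11E"d.length == 1);  // one UTF-32 code unit
```

Only in the dstring case does the array length equal the number of code points.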
 2. Thus a string in D is an array of code *units*, although in Unicode a
 string is really an array of code points.

In Unicode a string is generally a _sequence_ of code points. Due to the variable-length encoding enacted by UTF-8 and UTF-16, it would be difficult to emulate array semantics on such representations.
 3. Iterating a string in D is wrong by default, iterating over code
 units instead of characters (code points). Even worse, the error does
 not appear until you put some non-ascii text in there.

It's been discussed before that foreach (c; str) should set by default the type of c to dchar. I agree. That being said, iterating a string with the formal iteration mechanism defined by std.range is always correct and moves one code point at a time. So what I can advise is to use foreach (dchar c; str). Other than that, everything should work properly.
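A short sketch of the difference between the two foreach forms. The function names are made up for illustration.

```d
// Hypothetical helper: plain foreach infers char, so it visits code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    assert(countCodeUnits("abc") == countCodePoints("abc")); // ASCII hides the difference
    assert(countCodeUnits("na\u00EFve") == 6);               // U+00EF is two code units
    assert(countCodePoints("na\u00EFve") == 5);
}
```

The first test line is the trap from the original question: with pure ASCII input the two loops agree, so the difference goes unnoticed until non-ASCII text arrives.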
 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

You'll be glad to hear that this assumption is false.

1. sort does not compile for char[] or wchar[]. The reason is that char[] and wchar[] do not obey the random-access requirements.

2. All overloads of split work correctly with non-ASCII strings. If you find anything that doesn't, that's a bug in the implementation, not in the design. I also recommend you look up splitter in std.algorithm.
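A sketch of the random-access point, assuming the sort in question is std.algorithm's sort (not the built-in array .sort property) and a Phobos version that treats narrow strings as non-random-access ranges. The helper sortedCodePoints is hypothetical; it shows one way to sort by code point by decoding to dchar[] first.

```d
import std.algorithm : sort;
import std.range : isRandomAccessRange;
import std.utf : toUTF32;

// Narrow strings are not random-access ranges, so std.algorithm's sort rejects them.
static assert(!isRandomAccessRange!(char[]));
static assert(!isRandomAccessRange!(wchar[]));
// An array of dchar has one code point per element and is random access.
static assert(isRandomAccessRange!(dchar[]));

// Hypothetical helper: sort the code points of a UTF-8 string.
dstring sortedCodePoints(string s)
{
    dchar[] points = toUTF32(s).dup; // decode, then take a mutable copy
    sort(points);                    // orders by code point value
    return points.idup;
}

unittest
{
    // 'a' (U+0061) < 'z' (U+007A) < U+00E9
    assert(sortedCodePoints("z\u00E9a") == "az\u00E9"d);
}
```

Note that this orders by raw code point value, which is not the same thing as locale-aware collation.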
 So I guess my question is why, in a language with the power and
 expressiveness of D, in our day and age, would one choose such an
 exposed, fragile implementation of string that ensures that the default
 code one writes for text manipulation is most likely wrong?

 I18N is one of the first things I judge a new language by and so far D
 is... puzzling.

 I don't know much about D so I am probably just not getting it but can
 you please point me to some rationale behind these string design decisions?

Support of UTF in D could be better but it definitely compares favorably to that in many other languages (including all languages that I know). The choice of array clarifies the representation and offers random access to individual code units, which is sometimes necessary for efficient manipulation. However, the formal range interface offers bidirectional access to code points. As I mentioned elsewhere, I could not find an edit distance implementation for any other language than D that works directly on UTF-encoded inputs. And it's not special-cased - the same implementation works e.g. for lists of integers.

Andrei
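As an illustration of the edit distance remark, here is a sketch assuming the function meant is levenshteinDistance from std.algorithm. The sample inputs are made up.

```d
import std.algorithm : levenshteinDistance;

unittest
{
    // UTF-8 input: U+00E9 is two code units but counts as one substitution.
    assert(levenshteinDistance("caf\u00E9", "cafe") == 1);

    // The same template accepts arrays of integers.
    assert(levenshteinDistance([1, 2, 3], [1, 3]) == 1);
}
```

A byte-wise implementation would typically report a larger distance for the first pair, since the two-byte encoding of U+00E9 differs from the single byte 'e' in more than one position.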
Aug 29 2010
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i5e88k$2p0t$1 digitalmars.com...
 wchar is a UTF-16 code point.

Please tell me that's a typo. This isn't the era of UCS-2.
Aug 29 2010
next sibling parent reply BCS <none anon.com> writes:
Hello Nick,

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...
 
 wchar is a UTF-16 code point.
 


Even if choosing to use UTF-16 is a bad idea (and I'm in no position to say it is), being able to read/write it to interact with things that already use it is a good idea. -- ... <IXOYE><
Aug 29 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"BCS" <none anon.com> wrote in message 
news:a6268ff1afa68cd1591044967ce news.digitalmars.com...
 Hello Nick,

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...

 wchar is a UTF-16 code point.


 Even if choosing to use UTF-16 is a bad idea (and I'm in no position to say it is), being able to read/write it to interact with things that already use it is a good idea.

What I meant is that I'm fairly sure a wchar should be a code *unit* (just like char is a code unit), not a code *point*.
Aug 29 2010
parent BCS <none anon.com> writes:
Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff1afa68cd1591044967ce news.digitalmars.com...
 
 Hello Nick,
 
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in
 message news:i5e88k$2p0t$1 digitalmars.com...
 
 wchar is a UTF-16 code point.
 


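 Even if choosing to use UTF-16 is a bad idea (and I'm in no position to say it is), being able to read/write it to interact with things that already use it is a good idea.

 What I meant is that I'm fairly sure a wchar should be a code *unit* (just like char is a code unit), not a code *point*.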

Given that I had to read three different pages twice each before I located a clear statement of which was which (talk about boring reading!), I'd be willing to guess that that was a slip-up. -- ... <IXOYE><
Aug 29 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/29/2010 02:01 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...
 wchar is a UTF-16 code point.

Please tell me that's a typo. This isn't the era of UCS-2.

Yah, typo. Andrei
Aug 29 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 29 Aug 2010 14:17:44 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 08/29/2010 12:44 PM, Nick wrote:

 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

You'll be glad to hear that this assumption is false. 1. sort does not compile for char[] or wchar[]. The reason is that char[] and wchar[] do not obey the random-access requirements.

char[] x;
x.sort; // compiles 2.048

Is a deprecation planned?

-Steve
Aug 30 2010