digitalmars.D - Beginner not getting "string"


Nick <nick example.com> writes:

Reading Andrei's book and something seems amiss:

1. A char in D is a code *unit*, not a code point. Considering that code
units are generally used to encode in an encoding, I would have expected
the type for a code unit to be byte or something similar, as far
from code points as possible. In my mind, Unicode characters, aka chars,
are code points.

2. Thus a string in D is an array of code *units*, although in Unicode a 
string is really an array of code points.

3. Iterating a string in D is wrong by default, iterating over code 
units instead of characters (code points). Even worse, the error does 
not appear until you put some non-ascii text in there.

4. All string-processing calls (like sort, toupper, split and such) are 
by default wrong on non-ascii strings. Wrong without any error, warning 
or anything.

So I guess my question is why, in a language with the power and 
expressiveness of D, in our day and age, would one choose such an 
exposed, fragile implementation of string that ensures that the default 
code one writes for text manipulation is most likely wrong?

I18N is one of the first things I judge a new language by and so far D 
is... puzzling.

I don't know much about D so I am probably just not getting it but can 
you please point me to some rationale behind these string design decisions?

Thanks!
Nick

Aug 29 2010

dsimcha <dsimcha yahoo.com> writes:

== Quote from Nick (nick example.com)'s article
 Reading Andrei's book and something seems amiss:
 1. A char in D is a code *unit* not a code point. Considering that code
 units are generally used to encode in an encoding, I would have expected
 that the type for a code unit to be byte or something similar, as far
 from code points as possible. In my mind, Unicode characters, aka chars
 are code points.

Basically, the reason is that you can't have a regular array of code points;
you'd need to maintain some additional data structures.  These can easily be
built on top of an array of code units.  You can't build an array of code units
on top of an array of code points, at least not efficiently.

 2. Thus a string in D is an array of code *units*, although in Unicode a
 string is really an array of code points.

This is admittedly a wart.  However, when you use std.range,
ElementType!(string) == dchar, so iterating with range primitives does what
you'd think it should.
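
As a minimal sketch of that distinction (the helper name codePoints is made up for illustration, and a Phobos version in which strings decode through the range primitives is assumed), the array view and the range view of the same string can be compared like this:

```d
import std.range : ElementType, empty, front, popFront;

// By definition, string is an array of immutable UTF-8 code units.
static assert(is(string == immutable(char)[]));
// Seen as a range, however, its element type is a whole code point.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: collect what the range primitives yield.
dchar[] codePoints(string s)
{
    dchar[] result;
    for (; !s.empty; s.popFront())
        result ~= s.front;   // front decodes one code point
    return result;
}

void main()
{
    string s = "h\u00E9llo";             // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);               // array view: code units
    assert(codePoints(s).length == 5);   // range view: code points
    assert(codePoints(s)[1] == '\u00E9');
}
```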

 3. Iterating a string in D is wrong by default, iterating over code
 units instead of characters (code points). Even worse, the error does
 not appear until you put some non-ascii text in there.

See answer to (2).

 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

If these don't work right for non-ASCII strings then it's a bug, not a design
issue.  Please file bug reports (1 per function).

 So I guess my question is why, in a language with the power and
 expressiveness of D, in our day and age, would one choose such an
 exposed, fragile implementation of string that ensures that the default
 code one writes for text manipulation is most likely wrong?
 I18N is one of the first things I judge a new language by and so far D
 is... puzzling.

Part of it was a silly design error that became too hard to change.  Most of it,
though, is to avoid abstraction inversion
(http://en.wikipedia.org/wiki/Abstraction_inversion) by providing access to the
lower-level aspects of Unicode strings.

Aug 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 08/29/2010 12:44 PM, Nick wrote:
 Reading Andrei's book and something seems amiss:

 1. A char in D is a code *unit* not a code point. Considering that code
 units are generally used to encode in an encoding, I would have expected
 that the type for a code unit to be byte or something similar, as far
 from code points as possible. In my mind, Unicode characters, aka chars
 are code points.

(Background for others: code point == actual conceptual character, code 
unit == the smallest unit of encoding (one byte for UTF8, two bytes for 
UTF16, four bytes for UTF32). In UTF32 code units are chosen to be equal 
to code points.)
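
A small sketch of those sizes, using U+1D11E (a character outside the Basic Multilingual Plane, chosen here purely as an illustration): the same single code point occupies a different number of code units in each encoding.

```d
void main()
{
    // Code unit widths in bytes for UTF-8, UTF-16 and UTF-32.
    static assert(char.sizeof == 1);
    static assert(wchar.sizeof == 2);
    static assert(dchar.sizeof == 4);

    // One code point, encoded three ways.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    assert(s8.length  == 4);   // four UTF-8 code units
    assert(s16.length == 2);   // a UTF-16 surrogate pair
    assert(s32.length == 1);   // one UTF-32 code unit == one code point
}
```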

Indeed, D's char is a UTF-8 code unit, and wchar is a UTF-16 code point. 
(dchar is at the same time a UTF-32 code unit and a Unicode code point.)

Making the type of a code unit byte would considerably weaken the 
expressive power because an array of byte[] could be considered either 
untyped data or UTF-encoded data without a static means to differentiate 
between the two. This would be largely obviated by making string an 
elaborate type, but there are considerable advantages to having string a 
regular array type.

 2. Thus a string in D is an array of code *units*, although in Unicode a
 string is really an array of code points.

In Unicode a string is generally a _sequence_ of code points. Due to the 
variable-length encoding enacted by UTF-8 and UTF-16, it would be 
difficult to emulate array semantics on such representations.

 3. Iterating a string in D is wrong by default, iterating over code
 units instead of characters (code points). Even worse, the error does
 not appear until you put some non-ascii text in there.

It's been discussed before that foreach (c; str) should set by default 
the type of c to dchar. I agree. That being said, iterating a string 
with the formal iteration mechanism defined by std.range is always 
correct and moves one code point at a time.

So what I can advise is to use foreach (dchar c; str). Other than that, 
everything should work properly.
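
A minimal sketch of the difference (the two counting helpers are hypothetical names used only for this example): leaving the loop variable's type to be inferred walks code units, while asking for dchar makes foreach decode.

```d
// Hypothetical helper: the loop variable is inferred as a char,
// so this counts UTF-8 code units.
size_t countDefault(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: requesting dchar makes foreach decode,
// so this counts code points.
size_t countDecoded(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";        // U+00E9 is two code units in UTF-8
    assert(countDefault(s) == 6);
    assert(countDecoded(s) == 5);
}
```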

 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

You'll be glad to hear that this assumption is false.

1. sort does not compile for char[] or wchar[]. The reason is that 
char[] and wchar[] do not obey the random-access requirements.

2. All overloads of split work correctly with non-ASCII strings. If you 
find anything that doesn't, that's a bug in the implementation, not in 
the design. I also recommend you look up splitter in std.algorithm.
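
Both points can be sketched as follows. This assumes a Phobos version in which std.algorithm's sort rejects char[] at compile time; the helper names splitWords and sortedCodePoints are invented for the example.

```d
import std.algorithm : sort, splitter;

// Assumption: narrow strings are not random-access ranges,
// so std.algorithm.sort does not accept them.
static assert(!__traits(compiles, sort((char[]).init)));

// Hypothetical helper: split on spaces; each piece stays valid UTF-8.
string[] splitWords(string s)
{
    string[] words;
    foreach (w; s.splitter(' '))
        words ~= w;
    return words;
}

// Hypothetical helper: decode into dchar[], which is random-access,
// then sort by code point value.
dchar[] sortedCodePoints(string s)
{
    dchar[] d;
    foreach (dchar c; s)
        d ~= c;
    sort(d);
    return d;
}

void main()
{
    assert(splitWords("na\u00EFve caf\u00E9") == ["na\u00EFve", "caf\u00E9"]);
    assert(sortedCodePoints("cb\u00E9a") == "abc\u00E9"d);
}
```

Note that sorting by code point value is not the same as locale-aware collation; the sketch only shows that the operation works on whole code points.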

 So I guess my question is why, in a language with the power and
 expressiveness of D, in our day and age, would one choose such an
 exposed, fragile implementation of string that ensures that the default
 code one writes for text manipulation is most likely wrong?

 I18N is one of the first things I judge a new language by and so far D
 is... puzzling.

 I don't know much about D so I am probably just not getting it but can
 you please point me to some rationale behind these string design decisions?

Support of UTF in D could be better but it definitely compares favorably 
to that in many other languages (including all languages that I know). 
The choice of array clarifies the representation and offers random access 
to individual code units, which is sometimes necessary for efficient 
manipulation. However, the formal range interface offers bidirectional 
access to code points.

As I mentioned elsewhere, I could not find an edit distance 
implementation for any other language than D that works directly on 
UTF-encoded inputs. And it's not special-cased - the same implementation 
works e.g. for lists of integers.
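
A short sketch of that claim, assuming the levenshteinDistance function in std.algorithm is the edit distance implementation meant here: the same call accepts UTF-8 strings, which it compares code point by code point, and arrays of integers.

```d
import std.algorithm : levenshteinDistance;

void main()
{
    // One substitution (U+00E9 -> 'e'), although U+00E9 is two UTF-8 code units.
    assert(levenshteinDistance("caf\u00E9", "cafe") == 1);

    // The same generic implementation on lists of integers: one deletion.
    assert(levenshteinDistance([1, 2, 3], [1, 3]) == 1);
}
```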


Andrei

Aug 29 2010

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i5e88k$2p0t$1 digitalmars.com...
 wchar is a UTF-16 code point.

Please tell me that's a typo. This isn't the era of UCS-2.

Aug 29 2010

BCS <none anon.com> writes:

Hello Nick,

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...
 
 wchar is a UTF-16 code point.
 

 Please tell me that's a typo. This isn't the era of UCS-2.
 

Even if choosing to use UTF-16 is a bad idea (and I'm in no position to say 
it is), being able to read/write it to interact with things that already use 
it is a good idea.

-- 
... <IXOYE><

Aug 29 2010

"Nick Sabalausky" <a a.a> writes:

"BCS" <none anon.com> wrote in message 
news:a6268ff1afa68cd1591044967ce news.digitalmars.com...
 Hello Nick,

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...

 wchar is a UTF-16 code point.

 Please tell me that's a typo. This isn't the era of UCS-2.

 Even if choosing to use UTF-16 is a bad idea (and I'm in no position to 
 say it is), being able to read/write it to interact with things that already 
 use it is a good idea.

What I meant is that I'm fairly sure a wchar should be a code *unit* (just 
like char is a code unit), not a code *point*.

Aug 29 2010

BCS <none anon.com> writes:

Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff1afa68cd1591044967ce news.digitalmars.com...
 
 Hello Nick,
 
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in
 message news:i5e88k$2p0t$1 digitalmars.com...
 
 wchar is a UTF-16 code point.
 

 Please tell me that's a typo. This isn't the era of UCS-2.
 

 Even if choosing to use UTF-16 is a bad idea (and I'm in no position
 to say it is), being able to read/write it to interact with things that
 already use it is a good idea.
 

 What I meant is that I'm fairly sure a wchar should be a code *unit*
 (just like char is a code unit), not a code *point*.
 

Given that I had to read three different pages twice each before I located 
a clear statement of which was which (talk about boring reading!), I'd be 
willing to guess that that was a slip-up.

-- 
... <IXOYE><

Aug 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 08/29/2010 02:01 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i5e88k$2p0t$1 digitalmars.com...
 wchar is a UTF-16 code point.

 Please tell me that's a typo. This isn't the era of UCS-2.

Yah, typo.

Andrei

Aug 29 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sun, 29 Aug 2010 14:17:44 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 08/29/2010 12:44 PM, Nick wrote:

 4. All string-processing calls (like sort, toupper, split and such) are
 by default wrong on non-ascii strings. Wrong without any error, warning
 or anything.

 You'll be glad to hear that this assumption is false.

 1. sort does not compile for char[] or wchar[]. The reason is that  
 char[] and wchar[] do not obey the random-access requirements.

char[] x;
x.sort; // compiles with 2.048

Is a deprecation planned?

-Steve

Aug 30 2010
