D - Unicode

Scott Egan (2/2) Apr 14 2004 Would it have been better just to stick to Unicode internally, and left ...

Ilya Minkov (7/9) Apr 14 2004 By Walter's convention, all char[] are UTF-8, and where the standard

Scott Egan (5/14) Apr 14 2004 Fine, but UTF-8 sucks as about as much as ASN.1 - why not just stick wit...

Sigbjørn Lund Olsen (20/22) Apr 14 2004 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Hauke Duden (3/8) Apr 14 2004 This is not correct. dchar is UTF-32 and wchar is UTF-16.

Ben Hinkle (3/5) Apr 14 2004 heh. I can never remember which one is which either.

Walter (3/4) Apr 14 2004 LOL! Wish I'd thought of that!

Scott Egan (10/32) Apr 15 2004 Given the intent of D to maintain some of the low level 'system' capabil...

Ben Hinkle (6/10) Apr 15 2004 Efficiency would depend on the application - one that copies

Scott Egan (19/29) Apr 15 2004 I've done some more homework and have a few other points:

Serge K (2/4) Apr 16 2004 Actually, it uses UTF-16.

Sigbjørn Lund Olsen (18/40) Apr 18 2004 No, as far as I know UTF-8/UTF-16, it expects the "storage class" to be

Ben Hinkle (6/8) Apr 14 2004 UTF-8 is a compromise between Unicode support and C's character model.

J C Calvarese (6/17) Apr 14 2004 Since it has come up before, I've made a list of some of these threads:

"Scott Egan" <scotte tpg.com.aux> writes:

Would it have been better just to stick to Unicode internally, and leave any
conversion to the IO classes?

Apr 14 2004

Ilya Minkov <minkov cs.tum.edu> writes:

Scott Egan wrote:
 Would it have been better just to stick to Unicode internally, and leave any
 conversion to the IO classes?

By Walter's convention, all char[] are UTF-8, and any place where the
standard library doesn't obey this and interprets them as ANSI/ASCII/whatever
is to be considered a bug. He has stated this at least 10 times already.
char[] is meant to be the standard way of exchanging Unicode strings within
D programs. There are also dchar and wchar for the other Unicode encodings.

-eye

Apr 14 2004

"Scott Egan" <scotte tpg.com.aux> writes:

Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
UCS-2 (UTF-16?), i.e. straight 16-bit chars?

"Ilya Minkov" <minkov cs.tum.edu> wrote in message
news:c5jfuf$1ibt$1 digitaldaemon.com...
 Scott Egan wrote:
 Would it have been better just to stick to Unicode internally, and leave any
 conversion to the IO classes?

 By Walter's convention, all char[] are UTF-8, and any place where the
 standard library doesn't obey this and interprets them as ANSI/ASCII/whatever
 is to be considered a bug. He has stated this at least 10 times already.
 char[] is meant to be the standard way of exchanging Unicode strings within
 D programs. There are also dchar and wchar for the other Unicode encodings.

 -eye

Apr 14 2004

Sigbjørn Lund Olsen <sigbjorn lundolsen.net> writes:

Scott Egan wrote:

 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Of the three Unicode transformation formats, personally, I find UTF-8 and
UTF-32 to be the most appealing: UTF-8 for its backwards compatibility
with the widespread (if lacking) ASCII format, and UTF-32 because it's
by far the most straightforward of the formats. UTF-16 is *not* straight
16-bit chars: Unicode has roughly 1.1 million code points, and every one
outside the Basic Multilingual Plane needs two UTF-16 shorts (a surrogate
pair). Only UTF-32 can offer a straightforward 'same sized block' character
encoding within the Unicode standard, and personally, I can't see what
UTF-16 has to offer compared to the two alternatives in the Unicode
standard, aside from size savings in some languages. But, of course, UTF-8
(and probably UTF-32) 'suck' out of the box, don't they?
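
To make the surrogate-pair point concrete, here is a minimal sketch. It is
written in Python rather than D, only because Python's built-in codecs make
the code units easy to inspect; the helper name utf16_units is made up for
this example and is not part of any library.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # a single 16-bit unit
print([hex(u) for u in utf16_units(astral)])  # two units: a surrogate pair
print(len(astral.encode("utf-32-le")) // 4)   # UTF-32: one 32-bit unit per code point
```

The same code point that fits in one UTF-32 unit takes two 16-bit units in
UTF-16, so indexing a UTF-16 array by position does not always land on a
whole character.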

At any rate, the three formats seem to all be in use in different 
spheres of computing, and thus all have their proper place in a generic 
programming language.

Furthermore, the standard library has (afaik - I haven't used them) 
functions for converting between the three formats.

Cheers,
Sigbjørn Lund Olsen

Apr 14 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Sigbjørn Lund Olsen wrote:
 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.

Hauke

Apr 14 2004

"Ben Hinkle" <bhinkle4 juno.com> writes:

 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

 This is not correct. dchar is UTF-32 and wchar is UTF-16.

heh. I can never remember which one is which either.
How about changing dchar to wwchar for "weally wide char", which
is scalable to any number of bytes - weally weally wide char, etc ;-)

Apr 14 2004

"Walter" <walter digitalmars.com> writes:

"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:c5jun3$28gi$1 digitaldaemon.com...
 How about changing dchar to wwchar for "weally wide char",

LOL! Wish I'd thought of that!

Apr 14 2004

"Scott Egan" <scotte tpg.com.aux> writes:

Given the intent of D to maintain some of the low-level 'system' capability,
I'd rather just use UTF-32 if it came down to it.
The fixed-size representation it offers has surely got to make dealing with
strings more efficient and faster (and easier to muck around with).

The various stream libraries could be left to take care of any necessary
conversions.

However, that said, I'll drop it.


"Sigbjørn Lund Olsen" <sigbjorn lundolsen.net> wrote in message
news:c5jlp2$1r51$1 digitaldaemon.com...
 Scott Egan wrote:

 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

 Of the three Unicode transformation formats, personally, I find UTF-8 and
 UTF-32 to be the most appealing: UTF-8 for its backwards compatibility
 with the widespread (if lacking) ASCII format, and UTF-32 because it's
 by far the most straightforward of the formats. UTF-16 is *not* straight
 16-bit chars: Unicode has roughly 1.1 million code points, and every one
 outside the Basic Multilingual Plane needs two UTF-16 shorts (a surrogate
 pair). Only UTF-32 can offer a straightforward 'same sized block' character
 encoding within the Unicode standard, and personally, I can't see what
 UTF-16 has to offer compared to the two alternatives in the Unicode
 standard, aside from size savings in some languages. But, of course, UTF-8
 (and probably UTF-32) 'suck' out of the box, don't they?

 At any rate, the three formats seem to all be in use in different
 spheres of computing, and thus all have their proper place in a generic
 programming language.

 Furthermore, the standard library has (afaik - I haven't used them)
 functions for converting between the three formats.

 Cheers,
 Sigbjørn Lund Olsen

Apr 15 2004

Ben Hinkle <bhinkle4 juno.com> writes:

On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte tpg.com.aux>
wrote:

 Given the intent of D to maintain some of the low-level 'system' capability,
 I'd rather just use UTF-32 if it came down to it.
 The fixed-size representation it offers has surely got to make dealing with
 strings more efficient and faster (and easier to muck around with).

Efficiency would depend on the application - one that copies
strings a lot would slow down significantly, assuming most strings
would fit "nicely" in UTF-16 or UTF-8. Walter's experience has
been that more programs copy strings than index into them by character.
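
The trade-off can be sketched as follows. This is illustrative Python, not
D, and both function names are invented for the example. It assumes
well-formed input: finding the n-th character in UTF-8 means scanning from
the start, while in UTF-32 it is a fixed offset, at the price of a larger
buffer to copy.

```python
def nth_code_point_utf8(data, n):
    """Return the n-th code point (0-based) of UTF-8 bytes by scanning from the start."""
    seen = 0
    for i, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a code point starts here
            if seen == n:
                j = i + 1
                while j < len(data) and (data[j] & 0xC0) == 0x80:
                    j += 1
                return data[i:j].decode("utf-8")
            seen += 1
    raise IndexError(n)

def nth_code_point_utf32(data, n):
    """Fixed width: the n-th code point sits at byte offset 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")

text = "na\u00efve caf\u00e9"  # 10 code points, two of them non-ASCII
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")

print(nth_code_point_utf8(u8, 3), nth_code_point_utf32(u32, 3))
print(len(u8), len(u32))  # bytes to copy in each encoding
```

Which cost matters more depends on whether the program spends its time
copying buffers or indexing into them.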

Apr 15 2004

"Scott Egan" <scotte tpg.com.aux> writes:

I've done some more homework and have a few other points:

Walter's experience may be that programmers copy strings, but have you
looked at the library lately?

It's full of index work.

BTW none of the string library is Unicode compatible; it just treats the
char[] as arrays of single bytes (as my 'split' offering does ;).
If char is supposed to be UTF-8 then the system needs to be aware of
supplementary chars etc. (doesn't it???) for correct word boundary and
capitalisation efforts.  I would also expect that it would be very easy to
produce invalid Unicode streams with some of the functions.
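
A minimal sketch of how a byte-wise operation can produce an invalid stream.
It is in Python rather than D, purely for illustration, and is_valid_utf8 is
a hypothetical helper, not a library function.

```python
def is_valid_utf8(data):
    """Report whether data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "caf\u00e9"           # the final character is two bytes in UTF-8
data = text.encode("utf-8")  # b'caf\xc3\xa9'

# A slice that counts bytes as if they were characters cuts the last one in half.
truncated = data[:4]

print(is_valid_utf8(data))       # the full buffer is well formed
print(is_valid_utf8(truncated))  # the byte-wise slice is not
```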

And...

Why not use just the Basic Multilingual Plane (Plane 0) and encode it as
UCS-2?

Or, given that the Unicode standard is 21 bits, just use the fixed-width
UTF-32?

Now I will shut up!

"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:2tus70dgshsjb5seh2hcrfvl3raj2mui20 4ax.com...
 On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte tpg.com.aux>
 wrote:

 Given the intent of D to maintain some of the low-level 'system' capability,
 I'd rather just use UTF-32 if it came down to it.
 The fixed-size representation it offers has surely got to make dealing with
 strings more efficient and faster (and easier to muck around with).

 Efficiency would depend on the application - one that copies
 strings a lot would slow down significantly, assuming most strings
 would fit "nicely" in UTF-16 or UTF-8. Walter's experience has
 been that more programs copy strings than index into them by character.

Apr 15 2004

"Serge K" <skarebo programmer.net> writes:

 Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2


Actually, it uses UTF-16.
Just like Windows does nowadays.

Apr 16 2004

Sigbjørn Lund Olsen <sigbjorn lundolsen.net> writes:

Scott Egan wrote:

 I've done some more homework and have a few other points:

 Walter's experience may be that programmers copy strings, but have you
 looked at the library lately?

 It's full of index work.

 BTW none of the string library is Unicode compatible; it just treats the
 char[] as arrays of single bytes (as my 'split' offering does ;).
 If char is supposed to be UTF-8 then the system needs to be aware of
 supplementary chars etc. (doesn't it???) for correct word boundary and
 capitalisation efforts.  I would also expect that it would be very easy to
 produce invalid Unicode streams with some of the functions.

No. As far as I understand UTF-8 and UTF-16, the encoding only expects its
"storage class" to be a container of a certain bit width. That is, a 'char'
is not expected to represent a whole character - it may be just one part of
a character. A more semantically correct name for 'char' would be 'utf8byte',
but some would think that too wordy. Personally, it's one of the first things
I alias.
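
A small sketch of the "container, not character" idea. It is in Python
rather than D for illustration, classify_utf8_units is a name invented for
this example, and the input is assumed to be well-formed UTF-8. Each 8-bit
unit is either a complete ASCII character, the lead byte of a longer
sequence, or a continuation byte.

```python
def classify_utf8_units(data):
    """Label each UTF-8 code unit of well-formed input (hypothetical helper)."""
    labels = []
    for byte in data:
        if byte < 0x80:
            labels.append("ascii")         # a whole character in one unit
        elif (byte & 0xC0) == 0x80:
            labels.append("continuation")  # 10xxxxxx: a later part of a character
        else:
            labels.append("lead")          # first unit of a multi-unit character
    return labels

data = "a\u20ac".encode("utf-8")  # 'a' followed by U+20AC EURO SIGN

print(len(data))                  # code units, not characters
print(classify_utf8_units(data))
```

Two characters occupy four 8-bit units here, which is why a name like
'utf8byte' describes the element type more accurately than 'char'.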

 And...

 Why not use just the Basic Multilingual Plane (Plane 0) and encode it as
 UCS-2?

 Or, given that the Unicode standard is 21 bits, just use the fixed-width
 UTF-32?

 Now I will shut up!

Sometimes space is a consideration. If I had a database of English text,
let's say a couple of billion characters, well, I *know* I pretty much
only need ASCII codes except in rare cases, and since I want to have as
much of the database cached in memory at any given time to serve said
English text faster, I would rather have the text encoded as UTF-8 than
UTF-32.
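
A rough sketch of the size difference for ASCII-only text. It is in Python
rather than D for illustration, and the sample sentence and repeat count are
arbitrary stand-ins for such a database.

```python
# ASCII-only sample text; every character is one code point below U+0080.
english = "The quick brown fox jumps over the lazy dog. " * 1000

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    sizes[codec] = len(english.encode(codec))  # encoded size in bytes
    print(codec, sizes[codec])
```

For pure ASCII input the UTF-8 buffer is one byte per character, UTF-16 two
and UTF-32 four, so the same cache holds four times as much text in UTF-8 as
in UTF-32.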

In many cases you'll find that a particular encoding may be more 
appropriate than another, even if several encodings are appealing in 
their design. D gives you choice, and that's good. I like to think that 
the programmer knows better than a language designer what she wishes to do.

Cheers,
Sigbjørn Lund Olsen

Apr 18 2004

"Ben Hinkle" <bhinkle4 juno.com> writes:

"Scott Egan" <scotte tpg.com.aux> wrote in message
news:c5jhsa$1l7v$1 digitaldaemon.com...
 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

UTF-8 is a compromise between Unicode support and C's character model.
Unicode hasn't flared up on the newsgroup in a while so you might have to
look back a while to find Walter's arguments for and against the various
ideas.

Apr 14 2004

J C Calvarese <jcc7 cox.net> writes:

Ben Hinkle wrote:
 "Scott Egan" <scotte tpg.com.aux> wrote in message
 news:c5jhsa$1l7v$1 digitaldaemon.com...
 
 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

 UTF-8 is a compromise between Unicode support and C's character model.
 Unicode hasn't flared up on the newsgroup in a while so you might have to
 look back a while to find Walter's arguments for and against the various
 ideas.

Since it has come up before, I've made a list of some of these threads:
http://www.wikiservice.at/d/wiki.cgi?UnicodeIssues

-- 
Justin
http://jcc_7.tripod.com/d/

Apr 14 2004
