
D - Unicode

"Scott Egan" <scotte tpg.com.aux> writes:
Would it have been better just to stick to Unicode internally, and leave any
conversion to the IO classes?
Apr 14 2004
Ilya Minkov <minkov cs.tum.edu> writes:
Scott Egan wrote:
 Would it have been better just to stick to Unicode internally, and leave any
 conversion to the IO classes?

By Walter's convention, all char[] are UTF-8, and any place where the standard library doesn't obey this and interprets them as ANSI/ASCII/whatever is to be considered a bug. He has stated this at least 10 times already. And char[] is to be the standard way of exchanging Unicode strings within D programs. There are also dchar and wchar for other Unicode encodings.

-eye
Apr 14 2004
"Scott Egan" <scotte tpg.com.aux> writes:
Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
UCS-2 (UTF-16?), i.e. straight 16-bit chars?
Apr 14 2004
Sigbjørn Lund Olsen <sigbjorn lundolsen.net> writes:
Scott Egan wrote:

 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Of the three Unicode transformation formats, personally, I find UTF-8 and UTF-32 the most appealing: UTF-8 for its backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats.

UTF-16 is *not* straight 16-bit chars: there are roughly 1 million Unicode code points, and any of them beyond the first 65,536 needs two UTF-16 shorts to map to a character. Only UTF-32 can offer a straightforward 'same sized block' character encoding within the Unicode standard, and personally, I can't see what UTF-16 has to offer compared to the two alternatives in the Unicode standard, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?

At any rate, the three formats all seem to be in use in different spheres of computing, and thus all have their proper place in a generic programming language. Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.

Cheers,
Sigbjørn Lund Olsen
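To put numbers on the "same sized block" point, here is a small sketch. It is written in Python rather than D, purely to illustrate the encodings themselves, and the helper name `code_units` is made up for this example. It counts how many code units one character needs in each of the three formats.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed to encode a single character."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# ASCII letter, Latin-1 letter, a BMP symbol, and a character outside the BMP.
for ch in ["A", "\u00f8", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the UTF-32 column stays at one unit for every input. The last character (U+1D11E, outside the Basic Multilingual Plane) needs two 16-bit units in UTF-16 and four bytes in UTF-8.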
Apr 14 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Sigbjørn Lund Olsen wrote:
 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.

Hauke
Apr 14 2004
"Ben Hinkle" <bhinkle4 juno.com> writes:
 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.

heh. I can never remember which one is which either. How about changing dchar to wwchar for "weally wide char", which is scalable to any number of bytes - weally weally wide char, etc ;-)
Apr 14 2004
"Walter" <walter digitalmars.com> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:c5jun3$28gi$1 digitaldaemon.com...
 How about changing dchar to wwchar for "weally wide char",

LOL! Wish I'd thought of that!
Apr 14 2004
"Scott Egan" <scotte tpg.com.aux> writes:
Given the intent of D to maintain some of the low-level 'system' capability,
I'd rather just use UTF-32 if it came down to it.
The fixed-size representation it offers has surely got to make dealing with
strings more efficient and faster (and easier to muck around with).

The various stream libraries could be left to take care of any necessary
conversions.

However, that said, I'll drop it.
Apr 15 2004
Ben Hinkle <bhinkle4 juno.com> writes:
On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte tpg.com.aux>
wrote:

Given the intent of D to maintain some of the low-level 'system' capability,
I'd rather just use UTF-32 if it came down to it.
The fixed-size representation it offers has surely got to make dealing with
strings more efficient and faster (and easier to muck around with).

Efficiency would depend on the application - one that copies strings a lot would slow down significantly, assuming most strings would fit "nicely" in UTF-16 or UTF-8. Walter's experience has been that more programs copy strings than index them as characters.
Apr 15 2004
"Scott Egan" <scotte tpg.com.aux> writes:
I've done some more homework and have a few other points:

Walter's experience may be that programmers copy strings, but have you
looked at the library lately?

It's full of index work.

BTW none of the string library is Unicode compatible; it just treats the
char[] as arrays of single bytes (as my 'split' offering does ;).
If char is supposed to be UTF-8 then the system needs to be aware of
supplemental chars etc (doesn't it???) for correct word boundary and
capitalisation efforts.  I would also expect that it would be very easy to
produce invalid Unicode streams with some of the functions.

And...

Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2
(fixed 2-byte representation) like C#?

Or, given that the Unicode standard is 21 bits, just use the fixed-width
UTF-32?

Now I will shut up!
Apr 15 2004
"Serge K" <skarebo programmer.net> writes:
 Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2
 (fixed 2-byte representation) like C#?

Actually, C# uses UTF-16. Just like Windows does nowadays.
Apr 16 2004
Sigbjørn Lund Olsen <sigbjorn lundolsen.net> writes:
Scott Egan wrote:

 I've done some more homework and have a few other points:
 
 Walter's experience may be that programmers copy strings, but have you
 looked at the library lately?
 
 It's full of index work.
 
 BTW none of the string library is Unicode compatible; it just treats the
 char[] as arrays of single bytes (as my 'split' offering does ;).
 If char is supposed to be UTF-8 then the system needs to be aware of
 supplemental chars etc (doesn't it???) for correct word boundary and
 capitalisation efforts.  I would also expect that it would be very easy to
 produce invalid Unicode streams with some of the functions.

No, as far as I know, UTF-8/UTF-16 expects the "storage class" to be a container of a certain bit width. That is, 'char' is not expected to represent a whole character - it may be just one part of a character. A more semantically correct name for 'char' would be 'utf8byte', but some would think that too wordy. Personally, it's one of the first things I alias.
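The difference between a UTF-8 code unit and a character, and the earlier worry about producing invalid streams, can be sketched as follows. This is Python rather than D, used only to illustrate the byte-level behaviour; `is_valid_utf8` is a hypothetical helper, not a library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

name = "Sigbjørn"
encoded = name.encode("utf-8")  # 'ø' occupies two bytes, so 8 characters become 9 bytes

print(len(name), len(encoded))     # character count vs. code unit count
print(is_valid_utf8(encoded[:6]))  # cut falls between the two bytes of 'ø'
print(is_valid_utf8(encoded[:7]))  # cut falls on a character boundary
```

A slice taken at an arbitrary byte index can end in the middle of a multi-byte sequence, which is how byte-oriented string functions can emit invalid UTF-8 unless they respect character boundaries.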
 And...
 
 Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2
 (fixed 2 byte representation) like C#?
 
 Or, given that the Unicode standard is 21 bits, just use the fixed-width
 UTF-32?
 
 Now I will shut up!

Sometimes space is a consideration. If I had a database of English text, let's say a couple of billion characters, well, I *know* I pretty much only need ASCII codes except in rare cases. Since I want to have as much of the database as possible cached in memory at any given time to serve said English text faster, I would rather have the text encoded as UTF-8 than UTF-32.

In many cases you'll find that a particular encoding may be more appropriate than another, even if several encodings are appealing in their design. D gives you choice, and that's good. I like to think that the programmer knows better than a language designer what she wishes to do.

Cheers,
Sigbjørn Lund Olsen
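A rough sketch of the space argument, in Python rather than D and with a made-up helper name (`encoded_sizes`), measuring the same English sentence in each encoding:

```python
def encoded_sizes(text: str) -> dict:
    """Return the size in bytes of the text in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # explicit byte order, so no BOM is added
        "utf-32": len(text.encode("utf-32-le")),
    }

sample = "The quick brown fox"  # 19 ASCII characters
sizes = encoded_sizes(sample)
print(sizes)
print(sizes["utf-32"] / sizes["utf-8"])  # ratio of UTF-32 size to UTF-8 size
```

For pure ASCII text, UTF-8 uses one byte per character and UTF-32 uses four, so the cached database in the example above would take four times the memory in UTF-32.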
Apr 18 2004
"Ben Hinkle" <bhinkle4 juno.com> writes:
"Scott Egan" <scotte tpg.com.aux> wrote in message
news:c5jhsa$1l7v$1 digitaldaemon.com...
 Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
 UCS-2 (UTF-16?), i.e. straight 16-bit chars?

UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while so you might have to look back a while to find Walter's arguments for and against the various ideas.
Apr 14 2004
J C Calvarese <jcc7 cox.net> writes:
Ben Hinkle wrote:
 "Scott Egan" <scotte tpg.com.aux> wrote in message
 news:c5jhsa$1l7v$1 digitaldaemon.com...
 
Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
UCS-2 (UTF-16?), i.e. straight 16-bit chars?

UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while so you might have to look back a while to find Walter's arguments for and against the various ideas.

Since it has come up before, I've made a list of some of these threads:
http://www.wikiservice.at/d/wiki.cgi?UnicodeIssues

--
Justin
http://jcc_7.tripod.com/d/
Apr 14 2004