
D - Unicode discussion

reply Elias Martenson <elias-m algonet.se> writes:
DISCLAIMER: I am not a "D programmer". I certainly haven't written any
real-world applications in the language yet, but I am very knowledgeable
about localisation issues.

After the recent discussion regarding Unicode in D, which seems to
have faded away now, I have decided to write some initial comments on
what needs to be done to the language and APIs to make it support all
languages, not only English and Latin (which to my knowledge are the
only languages that can be written using 7-bit ASCII).

char types
----------

Today, according to the specification, there are three char types:
char, wchar and dchar. Arrays of these are then used to create three
different internal string representations: UTF-8, UTF-16 and UTF-32.

There are several problems with this. First and foremost, a
declaration such as "char[] foo" gives the impression that this is an
array of characters. It is not. The UTF-8 specification dictates that
a UTF-8 string is a sequence of bytes, not characters. This is an
important distinction to make, since you cannot take the n'th
character from a UTF-8 stream with string[n]: you may get part of a
multibyte character sequence.

The wchar data type has the exact same problem, since it uses UTF-16
which also uses variable lengths for its characters.
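A small example makes the pitfall concrete. This is only a sketch (it uses the
std.utf conversion package mentioned later in this thread); the byte counts
follow from the UTF-8 encoding itself:

    import std.utf;

    void main()
    {
        char[] s = "naïve";              // 'ï' encodes as two UTF-8 bytes
        assert(s.length == 6);           // six code units...
        assert(toUTF32(s).length == 5);  // ...but only five characters
        // s[2] is the trailing byte of the 'ï' sequence, not a character
    }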

What is needed is a "char" datatype that is in fact able to hold a
character. You need 21 bits to describe a Unicode character (Unicode
allocates 17*2^16 code points, not all of which are yet defined), so
it seems reasonable to use a 32-bit data type for this.

In my opinion this data type should be named "char". For UTF-8 and
UTF-16 strings, one can use the "byte" and "short" data types, which
would be in keeping with the Unicode standards, which (to my
knowledge; I'd have to look up the exact wording) declare UTF-8 and
UTF-16 strings to be sequences of bytes and 16-bit words respectively,
not "characters".

String classes and functions
----------------------------

There is a set of const char[] arrays containing various character
sequences, including hexdigits, digits, uppercase, letters,
whitespace, etc. There are also character classification functions
that accept 8-bit characters. These should really be replaced by a
similar set of functions that work with 32-bit char types:

     isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()

These cannot be inlined functions since newer versions of the Unicode
standard can declare new code points and we need to be forward
compatible.

Another function is also needed: getCharacterCategory(), which returns
the Unicode category. Further functions are needed to determine other
properties of the characters, such as directionality. Take a look at
the Java classes java.text.BreakIterator and java.text.Bidi to get
some ideas.
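A sketch of what such an interface could look like (all names here are
illustrative suggestions, not an existing D API):

    // Classification works on whole characters (dchar), never on
    // UTF-8/UTF-16 code units.
    enum CharCategory { Lu, Ll, Lt, Nd, Zs /* ...the other Unicode categories */ }

    bool isAlpha(dchar c);
    bool isNumber(dchar c);
    bool isWhiteSpace(dchar c);
    CharCategory getCharacterCategory(dchar c);

    // Implemented as runtime table lookups, so that shipping updated
    // Unicode data does not require recompiling callers.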

Streams
-------

The current std.stream is not adequate for Unicode. It doesn't seem to
take encodings into consideration at all but is simply a binary
interface.

Strings in the Phobos stream library seem to deal primarily with
char[] and wchar[]. The most important stream type, dchar[], is not
even considered. Another problem with the library is that the point at
which native encoding<->Unicode conversion is performed is not
defined.

Personally, I have not given this much consideration yet, although I
rather like the way Java did it by introducing two different kinds of
streams: byte streams and character streams. More discussion is
clearly needed.

Interoperability
----------------

In particular, C often uses 8-bit char arrays to represent
strings. This causes a problem when all strings are 32-bit
internally. The most straightforward solution is to convert UTF-32
char[] to UTF-8 byte[] before a call to a legacy function. This would
also very elegantly deal with the problem of zero-terminated C
strings vs. non-zero-terminated D strings (one of the char[]->UTF-8
conversion functions should create a zero-terminated byte array).
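The call site might look roughly like this (a sketch: std.utf.toUTF8 is the
real conversion and in D as it stands returns a char[]; appending the
terminator by hand stands in for the zero-terminating variant proposed above):

    import std.utf;

    extern (C) int puts(char* s);    // a legacy C function

    void logLegacy(dchar[] msg)
    {
        char[] utf8 = toUTF8(msg);   // UTF-32 -> UTF-8
        utf8 ~= '\0';                // C expects zero termination
        puts(utf8.ptr);
    }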
Dec 15 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brjvsf$28lb$1 digitaldaemon.com...
 char types
 ----------

 Today, according to the specification, there are three char
 types. char, wchar and dchar. These are then used in an array to
 create three different kinds of internal string representaions: UTF-8,
 UTF-16 and UTF-32.

 There are several problems with this. First and foremost, when an
 expression such as this: "char[] foo" you get the impression that this
 is an array of characters. This is wrong. The UTF-8 specification
 dictates that a UTF-8 string is an array of bytes, not
 characters. This is an important distiction to make since you cannot
 take the n'th character from a UTF-8 stream like this: string[n],
 since you may get a part of a multibyte character sequence.

 The wchar data type has the exact same problem, since it uses UTF-16
 which also uses variable lengths for its characters.

 What is needed is a "char" datatype that is infact able to hold a
 character. You need 21 bits to describe a unicode character (Unicode
 allocates 17*2^16 code points, all of which are not yet defined) and
 therefore it seems reasonable to use a 32-bit data type for this.

 In my opinion this data type should be named "char". For UTF-8 and
 UTF-16 strings, one can use the "byte" and "short" data types, which
 would be in keeping with the Unicode standards which (to my knowledge,
 I'd have to look up the exact wording) declare UTF-8 strings as being
 sequences of bytes and 16-bit words respectively, and not
 "characters".

The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)
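For instance, the distinct types let both of these overloads coexist, which
wouldn't work if char were just an alias for a byte type (a minimal sketch):

    void put(ubyte[] data) { /* treat as raw binary */ }
    void put(char[] text)  { /* treat as UTF-8 text  */ }

    void demo()
    {
        ubyte[] raw = new ubyte[16];
        put(raw);      // picks the binary overload
        put("hello");  // picks the text overload: string literals are char[]
    }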
 String classes and functions
 ----------------------------

 There are a set of const char[] arrays containing various character
 sequences including: hexdigits, digits, uppercase, letters,
 whitespace, etc... There are also character classification functions
 that accept 8-bit characters. These should really be replaced by a new
 but similar set of functions that work with 32-bit char types.

      isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()

 These cannot be inlined functions since newer versions of the Unicode
 standard can declare new code points and we need to be forward
 compatible.

 Another funtion is also needed: getCharacterCategory() which returns
 the Unicode category. Some other functions are needed to determine
 other properites of the characters such as the directionality. Take a
 look at the Java classes java.text.BreakIterator and java.text.Bidi to
 get some ideas.

I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?
 Streams
 -------

 The current std.stream is not adequate for Unicode. It doesn't seem to
 take encodings into consideration at all but is simply a binary
 interface.

That's correct.
 Strings in the Phobos stream library seems to deal primarily with
 char[] and wchar[]. The most important stream type, dchar[] is not
 even considered. Another problem with the library is that the point as
 which native encodiding<->unicode conversion is performed is not
 defined.

That's correct as well. The library's support for unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)
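In outline, the std.utf conversions look like this (function names as
documented; the example string is mine):

    import std.utf;

    void demo()
    {
        char[]  u8  = "grün";         // UTF-8: 5 bytes for 4 characters
        wchar[] u16 = toUTF16(u8);    // what the win32 W API's want
        dchar[] u32 = toUTF32(u8);    // one element per character
        assert(u32.length == 4);
    }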
 Personally, I have not given this much considering yet, although I
 kind of like the way Java did it by introducing two different kinds of
 streams, byte streams and character stream. More discussion is clearly
 needed.

 Interoperability
 ----------------

 In particular, C often uses 8-bit char arrays to represent
 strings. This causes a problem when all strings are 32-bit
 internally. The most straightforward olution is to convert UTF-32
 char[] to UTF-8 byte[] before a call to a legacy function. This would
 also very elegantly deal with the problem is zero-terminated C
 strings, vs. non-zero terminated D strings (one of the char[]->UTF-8
 conversions functions should create a zero-terminated byte array).

D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.
Dec 15 2003
parent reply Elias Martenson <no spam.spam> writes:
Den Mon, 15 Dec 2003 02:28:01 -0800 skrev Walter:

 In my opinion this data type should be named "char". For UTF-8 and
 UTF-16 strings, one can use the "byte" and "short" data types, which
 would be in keeping with the Unicode standards which (to my knowledge,
 I'd have to look up the exact wording) declare UTF-8 strings as being
 sequences of bytes and 16-bit words respectively, and not
 "characters".

The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)

Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.

The overloading issue is interesting, but may I suggest that char and wchar are at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.

And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.

I was almost going to provide a summary of the issues we're having in C with regards to this, but I don't know if it's necessary, and it's also getting late here (work early tomorrow).
 [ my own comments regarding strings snipped ]

I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?

I'd love to help out and do these things. But two things are needed first:

    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself.

    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a "string"
      really is: is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.
 Streams
 -------

 The current std.stream is not adequate for Unicode. It doesn't seem to
 take encodings into consideration at all but is simply a binary
 interface.

That's correct.

Agreed. And as a binary interface it's very good.
 Strings in the Phobos stream library seems to deal primarily with
 char[] and wchar[]. The most important stream type, dchar[] is not
 even considered. Another problem with the library is that the point as
 which native encodiding<->unicode conversion is performed is not
 defined.

That's correct as well. The library's support for unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)

Yes. But this would then assume that char[] is always in native encoding, which doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as saying that the stream actually performs native decoding to UTF-8 when reading into a char[] array.

Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards, using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases where this is not needed.
 In particular, C often uses 8-bit char arrays to represent
 strings. This causes a problem when all strings are 32-bit
 internally. The most straightforward olution is to convert UTF-32
 char[] to UTF-8 byte[] before a call to a legacy function. This would
 also very elegantly deal with the problem is zero-terminated C
 strings, vs. non-zero terminated D strings (one of the char[]->UTF-8
 conversions functions should create a zero-terminated byte array).

D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.

This can be done in a much better, platform independent way, by using the native<->unicode conversion routines. In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here.

In general, you should be able to open a file by specifying the file name as a dchar[], and the libraries should handle the rest. This goes for all the other methods and functions that accept string parameters.

This of course still depends on what a "string" really is. This really needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first?

Regards

Elias Mårtenson
Dec 15 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <no spam.spam> wrote in message
news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
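Such a wrapper might be sketched like this (illustrative only; std.utf.decode
is the primitive that steps over one UTF-8 sequence, advancing its index
argument, though its exact signature has varied between library versions):

    import std.utf;

    class String
    {
        char[] data;                 // the low level UTF-8 representation

        this(char[] s) { data = s; }

        // Decode from the start on every access: always correct,
        // but O(n) per index.
        dchar opIndex(size_t i)
        {
            size_t pos = 0;
            dchar c = 0;
            for (size_t n = 0; n <= i; n++)
                c = decode(data, pos);   // steps over one UTF-8 sequence
            return c;
        }
    }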
 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' and the funky 'wchar_t' for UTF16 (on win32; UTF32 on linux) in C, so I don't see much of an issue here.
 And here is also the core of the problem: having an array of "char"
 implies to the unwary programmer that the elements in the sequence
 are in fact "characters", and that you should be allowed to do stuff
 like isspace() on them. The fact that the libraries provide such
 function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
 I'd love to help out and do these things. But two things are needed first:
     - At least one other person needs to volunteer.
       I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
     - The core concepts needs to be decided upon. Things seems to be
       somewhat in flux right now, with three different string types
       and all. At the very least it needs to be deicded what a "string"
       really is, is it a UTF-8 byte sequence or a UTF-32 character
       sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
 That's correct as well. The library's support for unicode is inadequate. But
 there also is a nice package (std.utf) which will convert between char[],
 wchar[], and dchar[]. This can be used to convert the text strings into
 whatever unicode stream type the underlying operating system API supports.
 (For win32 this would be UTF-16, I am unsure what linux supports.)

 Yes. But this would then assume that char[] is always in native encoding
 and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte
 sequence. Or, the specification could be read as the stream actually
 performs native decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
 Unless fundamental encoding/decoding is embedded in the streams library,
 it would be best to simply read text data into a byte array and then
 perform native decoding manually afterwards using functions similar
 to the C mbstowcs() and wcstombs(). The drawback to this is that you
 cannot read text data in platform encoding without copying through
 a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
 D is headed that way. The current version of the library I'm working on
 converts the char[] strings in the file name API's to UTF-16 via
 std.utf.toUTF16z(), for use calling the win32 API's.

 This can be done in a much better, platform independent way, by using
 the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue.

I feel that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
 In C, as already mentioned,
 these are called mbstowcs() and wcstombs(). For Windows, these would
 convert to and from UTF-16. For Unix, these would convert to and from
 whatever encoding the application is running under (dictated by the
 LC_CTYPE environment variable). There really is no need to make the
 API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF.

This should wind up making D a far more portable language for internationalization than C/C++ are. (Ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)

UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.
 In general, you should be able to open a file, by specifying the file
 name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].
 This
 goes for all the other methods and functions that accept string
 parameters. This of course still depends on what a "string" really is,
 this really needs to be decided, and I think you are the only one who
 can make that call. Although more discussion on the subject might be
 needed first?

It's been debated here before <g>.
Dec 15 2003
next sibling parent reply Lewis <dethbomb hotmail.com> writes:
Walter wrote:
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 
Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
The overloading issue is interesting, but may I suggest that char and

whcar
are at least renamed to something more appropriate? Maybe utf8byte and
utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
I'd love to help out and do these things. But two things are needed first:
    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
    - The core concepts needs to be decided upon. Things seems to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be deicded what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
That's correct as well. The library's support for unicode is inadequate.


But
there also is a nice package (std.utf) which will convert between


char[],
wchar[], and dchar[]. This can be used to convert the text strings into
whatever unicode stream type the underlying operating system API


supports.
(For win32 this would be UTF-16, I am unsure what linux supports.)

Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs native decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
Unless fundamental encoding/decoding is embedded in the streams library,
it would be best to simply read text data into a byte array and then
perform native decoding manually afterwards using functions similar
to the C mbstowcs() and wcstombs(). The drawback to this is that you
cannot read text data in platform encoding without copying through
a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
D is headed that way. The current version of the library I'm working on
converts the char[] strings in the file name API's to UTF-16 via
std.utf.toUTF16z(), for use calling the win32 API's.

This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
In C, as already mentioned,
these are called mbstowcs() and wcstombs(). For Windows, these would
convert to and from UTF-16. For Unix, these would convert to and from
whatever encoding the application is running under (dictated by the
LC_CTYPE environment variable). There really is no need to make the
API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!) UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].
This
goes for all the other methods and functions that accept string
parameters. This of course still depends on what a "string" really is,
this really needs to be decided, and I think you are the only one who
can make that call. Although more discussion on the subject might be
needed first?

It's been debated here before <g>.

Here's a page I found with some C++ code that may help in creating decoders etc.:

http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html

For windows coding it's easy enough to use COM API's to manipulate and create unicode strings (for UTF-16)?
Dec 15 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Lewis wrote:

 heres a page i found with some c++ code that may help in creating decoders
 etc...
 
 http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html
 
 for windows coding its easy enough to use com api's to manipulate and
 create unicode strings? (for utf16)

IBM has a set of Unicode tools. Last time I googled for them I found them right away, but now I can't. I'll keep looking and post again when I find the link.

Regards

Elias Mårtenson
Dec 16 2003
parent reply uwem <uwem_member pathlink.com> writes:
You mean icu?!

http://oss.software.ibm.com/icu/

Bye
uwe

In article <brmlf3$83b$1 digitaldaemon.com>, Elias Martenson says...
 [ quote snipped ]

 IBM has a set of Unicode tools. Last time I googled for them I found them
 right away, but now I can't. I'll keep looking and post again when I find
 the link.

Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
uwem wrote:

 You mean icu?!
 
 http://oss.software.ibm.com/icu/

Yes, that's it! No wonder I didn't find it; I was searching for "classes for unicode".

Regards

Elias Mårtenson
Dec 16 2003
prev sibling next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brll85$1oko$1 digitaldaemon.com...
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

 In a higher level language, yes. But in doing systems work, one always seems
 to be looking at the lower level elements anyway. I wrestled with this for a
 while, and eventually decided that char[], wchar[], and dchar[] would be low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

Sean
Dec 16 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brmeos$2v9c$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

You're right.
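For completeness, a sketch of the foreach-style alternative, done by hand with
std.utf.decode (names as in the runtime library, to the best of my knowledge;
D could later grow direct support for foreach over decoded characters):

    import std.utf;

    void eachChar(char[] s)
    {
        size_t i = 0;
        while (i < s.length)
        {
            dchar c = decode(s, i);  // decodes one character, advances i
            // ... operate on the whole character c ...
        }
    }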
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brmeos$2v9c$1 digitaldaemon.com...
 
"Walter" <walter digitalmars.com> wrote in message

One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

You're right.

Agreed. Some kind of iterator for strings is desperately needed. May I ask that they be designed in such a way that they are compatible/consistent with other iterators, such as the collections and things like the break iterator (also for strings)?

Regards

Elias Mårtenson
Dec 17 2003
prev sibling next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 
Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax. We have to remember here that most English-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual characters.
 I see your point, but I just can't see making utf8byte into a keyword <g>.
 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much
 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possible to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed first:
    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!

Indeed, but no one else has volunteered yet. :-)
    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.

OK, if that is your decision then you will not see me argue against it. :-)

However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options:

     void log_to_file(char[] str);
     void log_to_file(wchar[] str);
     void log_to_file(dchar[] str);

Which one of these should I use? Should I use all of them? Today, people seem to use the first option, but UTF-8 is horribly inefficient performance-wise.

Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).

Obviously the three different string types need to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings.

Would it be possible to use something like this?

     dchar get_first_char(string str)
     {
         return str[0];
     }

     string str1 = (dchar[])"A UTF-32 string";
     string str2 = (char[])"A UTF-8 string";

     // call the function to demonstrate that the "string"
     // type can be used in declarations
     dchar x = get_first_char(str1);
     dchar y = get_first_char(str2);

I.e. the "string" data type would be a wrapper or supertype for the three different string types.
 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform-specific encoding is determined by the environment variable LC_CTYPE, although the trend is to move towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page dependent.
 They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The D
 runtime library includes routines to convert back and forth between them.
 They could probably be optimized better, but that's another issue. I feel
 that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale
 dependent character sets are pushed off to the side as merely an input or
 output translation nuisance. The core routines all expect UTF strings, and
 so are platform and language independent. I personally think the future is
 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a Unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.

However, Windows defined wchar_t to be a 16-bit Unicode character back in the days when Unicode fit inside 16 bits. This is the same mistake Java made, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually).

My suggestion for the "string" data type will hide all the nitty gritty details with various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

Regards

Elias Mårtenson
Dec 16 2003
next sibling parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
I think Walter once said char had been called 'ascii'. That doesn't sound
all that bad to me. Perhaps we should have the primitive types
'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
I know, but at least then you never will mistake an ascii[] for a utf32[]
(or a utf8[], for that matter).

-Ben

"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
 Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...

Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always


 to be looking at the lower level elements anyway. I wrestled with this


 while, and eventually decided that char[], wchar[], and dchar[] would be


 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual

 I see your point, but I just can't see making utf8byte into a keyword


 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see


 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode


 But I'm not much of an expert on how to do it right, so it is the way it


 for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed



    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!

Indeed, but no one else volunteered yet. :-)
    - The core concepts needs to be decided upon. Things seems to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be deicded what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to


 UTF-16, or UTF-32 representations.

OK, if that is your descision then you will not see me argue against it.

 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seems to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

 char[] strings are UTF-8, and as such I don't know what you mean by


 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in


 sand and assert that D char[] strings are NOT locale or code page


 They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The


 runtime library includes routines to convert back and forth between


 They could probably be optimized better, but that's another issue. I


 that by designing D around UTF-8, UTF-16 and UTF-32 the problems with


 dependent character sets are pushed off to the side as merely an input


 output translation nuisance. The core routines all expect UTF strings,


 so are platform and language independent. I personally think the future


 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language


 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from


 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How


 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about


 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allows you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty gritty details with various encodins and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias Mårtenson

Dec 16 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

 I think Walter once said char had been called 'ascii'. That doesn't sound
 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
 I know, but at least then you never will mistake an ascii[] for a utf32[]
 (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large number of English-only programmers will use "ascii" exclusively, and we'll end up with yet another English/Latin-only language. ASCII really has no place in modern computing environments anymore. All operating systems and languages have migrated, or are in the process of migrating, to Unicode.

Regards

Elias Mårtenson
Dec 16 2003
parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brn3tp$t93$1 digitaldaemon.com...
 Ben Hinkle wrote:

 I think Walter once said char had been called 'ascii'. That doesn't


 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar.


 I know, but at least then you never will mistake an ascii[] for a


 (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large amount of english-only programmers will use "ascii" exclusively, and we'll end up with yet another english/latin-only language. ASCII really has no place in modern computing enironments anymore. All oeprating systems and languages has migrated, or ar in the process of migrating, to Unicode.

But ASCII has a place in a practical programming language designed to work with legacy systems and code. If you pass a UTF-8 or UTF-32 format string that isn't ASCII to printf it probably won't print out what you want. That's life.

In terms of encouraging a healthy, happy future... the only thing the D language definition can do is choose what type to use for string literals. I.e., given the declarations

     void foo(ascii[]);
     void foo(utf8[]);
     void foo(utf32[]);

what function does

     foo("bar")

call? Right now it would call foo(utf8[]). You are arguing it should call foo(utf32[]). I am on the fence about what it should call.

Phobos should have routines to handle any encoding - ascii (or just rely on std.c for these), utf8, utf16 and utf32.

-Ben
Dec 16 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

 But ASCII has a place in a practical programming language designed to work
 with legacy system and code. If you pass a utf-8 or utf-32 format string
 that isn't ASCII to printf it probably won't print out what you want. That's
 life.

For legacy code, you should have to take an extra step to make it work. However, it should certainly be possible. Allow me to compare to how Java does it:

     String str = "this is a unicode string";
     byte[] asciiString = str.getBytes("ASCII");

You can also convert it to UTF-8 if you like:

     byte[] utf8String = str.getBytes("UTF-8");

If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak English will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code? I am suffering from these kinds of bugs every day (I speak Swedish natively, but also need to work with Cyrillic) and let me tell you: 99% of all problems I have are caused by bugs similar to this.

Also, I don't think it's a good idea to design a language around a legacy character set (ASCII) which will hopefully be gone in a few years (for newly written programs, that is).
 In terms of encouraging a healthy, happy future... the only thing the D
 language definition can do is choose what type to use for string literals.
 ie, given the declarations
 
  void foo(ascii[]);
  void foo(utf8[]);
  void foo(utf32[]);
 
 what function does
 
  foo("bar")
 
 calll? Right now it would call foo(utf8[]). You are arguing it should call
 utf32[]. I am on the fence about what it should call.

Yes, with Walter's previous posting in mind, I argue that if foo() is overloaded with all three string types, it would call the dchar[] version. If one is not available, it would fall back to the wchar[], and lastly the char[] version. Then again, I also argue that there should be a way of using the supertype, "string", to avoid having to mess with the overloading and transparent string conversions.
 Phobos should have routines to handle any encoding - ascii (or just rely on
 the std.c for these), utf8, utf16 and utf32.

The C standard library has largely migrated away from pure ASCII. It's there for backwards compatibility reasons, but people still tend to use the old functions; that's not the language's fault though, but rather the developers'.

Regards

Elias Mårtenson
Dec 16 2003
parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
 If the "default" string type in D is a simple ASCII string, do you
 honestly think that programmers who only speak english will even bother
 to do the right thing? Do you think they will even know that they are
 writing effectively broken code?

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII. For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc). I said I didn't know what the default type should be, though I'm leaning towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything. -Ben
Dec 16 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

If the "default" string type in D is a simple ASCII string, do you
honestly think that programmers who only speak english will even bother
to do the right thing? Do you think they will even know that they are
writing effectively broken code?

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII.

But for all intents and purposes, ASCII does not exist anymore. It's a legacy character set, and it should certainly not be the "natural" way of dealing with strings. Believe it or not, there are a lot of programmers out there who still believe that ASCII ought to be enough for anybody.
 For example, I think printf should be declared as
 accepting an ascii* format string, not a char* as it is currently declared
 (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.
 I said I didn't know what the default type should
 be, though I'm leaning towards UTF-8 so that casting to ascii[]
 doesn't have
 to reallocate anything.

True. But then again, isn't the intent to try to avoid legacy calls as much as possible? Is it a good idea to set the default in order to accommodate a legacy character set?

You have to remember that UTF-8 is very inefficient. Suppose you have a 10000 character long string, and you want to retrieve the 9000'th character from that string. If the string is UTF-32, that means a single memory lookup. With UTF-8 it could mean anywhere between 9000 and 54000 memory lookups. Now imagine if the string is ten or one hundred times as long...

Now, with the current design, what many people are going to do is:

     char c = str[9000];
     // now play happily(?) with the char "c" that probably isn't the
     // 9000'th character and maybe was a part of a UTF-8 multi byte
     // character

Again, this is a huge problem. The bug will not be evident until some other person (me, for example) tries to use non-ASCII characters. The above broken code may have run through every single test that the developer wrote, simply because he didn't think of putting a non-ASCII character in the string.

This is a real problem, and it desperately needs to be solved. Several solutions have already been presented; the question is just which one of them Walter will support. He already explained in the previous post, but it seems that there are still some things to be said.

Regards

Elias Mårtenson
Dec 16 2003
next sibling parent Elias Martenson <elias-m algonet.se> writes:
Elias Martenson wrote:

     char c = str[9000];

Regards Elias Mårtenson
Dec 16 2003
prev sibling next sibling parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
      char c = str[8999];
      // now play happily(?) with the char "c" that probably isn't the
      // 9000'th character and maybe was a part of a UTF-8 multi byte
      // character

which was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be

     ascii c = str[8999];

which is completely safe and reasonable. If it was declared as utf8[] then when the user writes

     ubyte c = str[8999];

and they don't have any a-priori knowledge about str, they should feel very nervous, since I completely agree indexing into an arbitrary UTF-8 encoded array is pretty meaningless. Plus in my experience using individual characters isn't that common - I'd say easily 90% of the time a variable is declared as char* or char[] rather than just char.

By the way, I also think any utf8, utf16 and utf32 types should be aliased to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I dunno.

About Java and D: when I program in Java I never worry about the size of a char because Java is very different than C and you have to jump through hoops to call C. But when I program in D I feel like it is an extension of C, like C++. Imagine if C++ decided that char should be 32 bits. That would have been very painful.

-Ben
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

     char c = str[8999];
     // now play happily(?) with the char "c" that probably isn't the
     // 9000'th character and maybe was a part of a UTF-8 multi byte
     // character

which was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be

     ascii c = str[8999];

which is completely safe and reasonable.

No, it would certainly NOT be safe. You must remember that ASCII doesn't exist anymore. It's a legacy character set. It's dead. Gone. Bye bye. And yes, sometimes it's needed for backwards compatibility, but in those cases it should be made explicit that you are throwing away information when converting.
 If it was declared as utf8[] then
 when the user writes
  ubyte c = str[8999]
 and they don't have any a-priori knowledge about str they should feel very
 nervous since I completely agree indexing into an arbitrary utf-8 encoded
 array is pretty meaningless. Plus in my experience using individual
 characters isn't that common - I'd say easily 90% of the time a variable is
 declared as char* or char[] rather than just char.

You are right. Actually it's probably more than 90%. Especially when dealing with Unicode. Very often it's not allowed to split a Unicode string because of composite characters. However, you still need to be able to do individual character classification, such as isspace().
 By the way, I also think any utf8, utf16 and utf32 types should be aliased
 to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I
 dunno.

ASCII has no business in a modern programming language.
 About Java and D: when I program in Java I never worry about the size of a
 char because Java is very different than C and you have to jump through
 hoops to call C. But when I program in D I feel like it is an extension of C
 like C++. Imagine if C++ decided that char should be 32 bits. That would
 have been very painful.

All I was suggesting was a renaming of the types so that it's made explicit what type you have to use in order to be able to hold a single character. In D, this type is called "dchar"; char doesn't cut it. In C on Unix, it's called wchar_t. In C on Windows the type to use is called "int" or "long". And finally in Java, you have to use "int". In all of these languages, "char" is insufficient to hold a character. Don't you think it's logical that the data type that can hold a character is called "char"?

Regards

Elias Mårtenson
Dec 17 2003
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 accepting an ascii* format string, not a char* as it is currently 
 declared
 (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.

None of these alternatives is correct. printf will only work correctly with UTF-8 if the string data is either ASCII or UTF-8 happens to be the current system code page. And ASCII will only work for English systems, which is even worse.

As I said before, the C functions should be passed strings encoded in the current system code page. That way all strings that are written in the system language will be printed perfectly. Also, characters that are not in the code page can be replaced with ? during the conversion, which is better than having printf output garbage.

Hauke
Dec 17 2003
prev sibling parent "Charles" <sanders-consulting comcast.net> writes:
It does sound insane, I like it.  I vote for this.

C

"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:brn1eq$ppk$1 digitaldaemon.com...
 I think Walter once said char had been called 'ascii'. That doesn't sound
 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
 I know, but at least then you never will mistake an ascii[] for a utf32[]
 (or a utf8[], for that matter).

 -Ben

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...

Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always


 to be looking at the lower level elements anyway. I wrestled with this


 while, and eventually decided that char[], wchar[], and dchar[] would



 low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual

 I see your point, but I just can't see making utf8byte into a keyword


 The world has already gotten used to multibyte 'char' in C and the



 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't



 much
 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

 I think the library functions should be improved to handle unicode.
 But I'm not much of an expert on how to do it right, so it is the way it
 is for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed:

    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself.
You're not by yourself. There's a whole D community here!

Indeed, but no one else volunteered yet. :-)
    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a string
      really is: is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to the UTF-8,
UTF-16, or UTF-32 representations.

OK, if that is your descision then you will not see me argue against it.

 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seem to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page
 dependent. They are UTF-8 strings. If you are reading code page or locale
 dependent strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The D
 runtime library includes routines to convert back and forth between them.
 They could probably be optimized better, but that's another issue. I believe
 that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale
 dependent character sets are pushed off to the side as merely an input/output
 translation nuisance. The core routines all expect UTF strings, and
 so are platform and language independent. I personally think the future is
 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from
 UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a Unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.

However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually).

My suggestion for the "string" data type will hide all the nitty gritty details with various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.


 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias Mårtenson


Dec 16 2003
prev sibling next sibling parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?
|

Sorry to ask, but what do those do? What do they stand for?

-------------------------
Carlos Santander
Dec 16 2003
next sibling parent reply Elias Martenson <no spam.spam> writes:
Den Tue, 16 Dec 2003 13:38:59 -0500 skrev Carlos Santander B.:

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |
 
 Sorry to ask, but what do those do? What do they stand for?

mbstowcs() = multi byte string to wide character string
wcstombs() = wide character string to multi byte string

A multi byte string is a (char *), i.e. the platform encoding. This means that if you are running Unix in a UTF-8 locale (standard these days) then it contains a UTF-8 string. If you are running Unix or Windows with an ISO-8859-1 locale, then it contains ISO-8859-1 data. A wide character string is a (wchar_t *), which is a UTF-32 string on Unix and a UTF-16 string on Windows.

As you can see, the Windows way of using UTF-16 causes the exact same problems as you would suffer when using UTF-8, so working with wchar_t on Windows would be of doubtful use if not for the fact that all Unicode functions in Windows deal with wchar_t. On Unix it's easier, since you know that the full Unicode range fits in a wchar_t.

This is the reason why I have been advocating against the UTF-16 representation in D. It makes little sense compared to UTF-8 and UTF-32.

Regards

Elias Mårtenson
Dec 16 2003
parent "Carlos Santander B." <carlos8294 msn.com> writes:
Thank you both.

"Elias Martenson" <no spam.spam> wrote in message
news:pan.2003.12.16.22.27.42.233945 spam.spam...
| Den Tue, 16 Dec 2003 13:38:59 -0500 skrev Carlos Santander B.:
|
| > "Elias Martenson" <elias-m algonet.se> wrote in message
| > news:brml3p$7hp$1 digitaldaemon.com...
| > | for example). Why not use the same names as are used in C? mbstowcs()
| > | and wcstombs()?
| > |
| >
| > Sorry to ask, but what do those do? What do they stand for?
|
| mbstowcs() = multi byte string to wide character string
| wcstombs() = wide character string to multi byte string
|
| A multi byte string is a (char *), i.e. the platform encoding. This means
| that if you are running Unix in a UTF-8 locale (standard these days) then
| it contains a UTF-8 string. If you are running Unix or Windows with an
| ISO-8859-1 locale, then it contains ISO-8859-1 data.
|
| A wide character string is a (wchar_t *) which is a UTF-32 string on
| Unix, and a UTF-16 string on Windows.
|
| As you can see, the windows way of using UTF-16 causes the exact same
| problems as you would suffer when using UTF-8, so working with wchar_t on
| Windows would be of doubtful use if not for the fact that all Unicode
| functions in Windows deal with wchar_t. On Unix it's easier, since you
| know that the full Unicode range fits in a wchar_t.
|
| This is the reason why I have been advocating against the UTF-16
| representation in D. It makes little sense compared to UTF-8 and UTF-32.
|
| Regards
|
| Elias Mårtenson
|


Dec 16 2003
prev sibling next sibling parent "Julio César Carrascal Urquijo" <adnoctum phreaker.net> writes:
mbstowcs - Multi Byte to Wide Character String
wcstombs - Wide Character String to Multi Byte


Carlos Santander B. <carlos8294 msn.com> escribió en el mensaje de noticias
brnpe0$206i$3 digitaldaemon.com...
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |

 Sorry to ask, but what do those do? What do they stand for?

Dec 16 2003
prev sibling parent reply Andy Friesen <andy ikagames.com> writes:
Carlos Santander B. wrote:
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |
 
 Sorry to ask, but what do those do? What do they stand for?

Ironically enough, your question answers Elias's question quite succinctly. ;) -- andy
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Andy Friesen wrote:

 Carlos Santander B. wrote:
 
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |

 Sorry to ask, but what do those do? What do they stand for?

Ironically enough, you question answers Elias's question quite succinctly. ;)

Dang! How do you americans say? Three strikes, I'm out. :-) Regards Elias Mårtenson
Dec 17 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
 As for the functions that handle individual characters, the first thing
 that absolutely has to be done is to change them to accept dchar instead
 of char.

Yes.
 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seem to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

Do it as char[]. Have the internal implementation convert it to whatever format the underlying operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).
 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.
 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

The best thing is to stick with one scheme for a program.
 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.

For char types, yes. But not for UTF-16, and win32 internally is all UTF-16. There are no locale-specific encodings in UTF-16.
 In Unix the platform specific encoding is determined by the environment
 variable LC_CTYPE, although the trend is to be moving towards UTF-8 for
 all locales. We're not quite there yet though. Check out
 http://www.utf-8.org/ for some information about this.

Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page
 dependent. They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)

No, I think D will provide an optional filter for I/O which will translate to/from locale dependent encodings. Wherever possible, the UTF-16 API's will be used to avoid any need for locale dependent encodings.
 Internally, yes. But there needs to be a clear layer where the platform
 encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding.
 Obviously this layer seems to be located in the streams. But we need a
 separate function to do this for byte arrays as well (since there are
 other ways of communicating with the outside world, memory mapped files
 for example). Why not use the same names as are used in C? mbstowcs()
 and wcstombs()?

'cuz I can never remember how they're spelled <g>.
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from
 UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.
 The Unix standard goes a step
 further and defines wchar_t to be a unicode character. Obviously D goes
 the Unix route here (for dchar), and that is very good.

 However, Windows defined wchar_t to be a 16-bit Unicode character back
 in the days Unicode fit inside 16 bits. This is the same mistake Java
 did, and we have now ended up with having UTF-16 strings internally.

Windows made the right decision given what was known at the time, it was the unicode folks who goofed by not defining unicode right in the first place.
 Indeed. And using UTF-8 internally is not a bad idea. The problem is
 that we're also allowed to use UTF-16 and UTF-32 as internal encoding,
 and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.


Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

It already does that for string literals. I've thought about implicit conversions for runtime strings, but sometimes trouble results from too many implicit conversions, so I'm hanging back a bit on this to see how things evolve.
Dec 16 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brnurb$2bc5$1 digitaldaemon.com...
 Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
 stems from the fact that according to the C standard wchar_t is not
 Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.

It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too), doing the operation, then biasing it again (un-biasing it). If all else fails, you can promote it. How often is this important anyway? If it's crucial, it's worth the time to emulate the sign if you have to. It is no good to run fast if the wrong results are generated. It's just a portability landmine, waiting for the unwary programmer, and shame on whoever let it get into a so-called "standard".
 The Unix standard goes a step
 further and defines wchar_t to be a unicode character. Obviously D goes
 the Unix route here (for dchar), and that is very good.

 However, Windows defined wchar_t to be a 16-bit Unicode character back
 in the days Unicode fit inside 16 bits. This is the same mistake Java
 did, and we have now ended up with having UTF-16 strings internally.

 Windows made the right decision given what was known at the time, it was the
 unicode folks who goofed by not defining unicode right in the first place.

I still don't understand why they couldn't have packed all the languages that actually get used into the lowest 16 bits, and put all the crud like box-drawing characters and visible control codes and byzantine musical notes and runes and Aleutian indian that won't fit into the next 16 pages. There's lots of gaps in the first 65536 anyway. And probably plenty of overlap, duplicated symbols (lots of languages have the same characters, especially latin-based ones). Hell they should probably have done away with accented characters being distinct characters and enforced a combining rule from the start. But the Unicode standards body wanted to please the typesetters, as opposed to giving the world a computer encoding that would actually be usable as a common text-storage and processing medium. This thread shows just how convoluted Unicode really is. I think someone can (and probably will) do better. Unfortunately I also believe that such an effort is doomed to failure. Sean
Dec 17 2003
next sibling parent Elias Martenson <elias-m algonet.se> writes:
Sean L. Palmer wrote:

 It's stupid to not agree on a standard size for char, since it's easy to
 "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too),
 doing the operation, then biasing it again (un-biasing it).  If all else
 fails, you can promote it.  How often is this important anyway?  If it's
 crucial, it's worth the time to emulate the sign if you have to.  It is no
 good to run fast if the wrong results are generated.  It's just a
 portability landmine, waiting for the unwary programmer, and shame on
 whoever let it get into a so-called "standard".

C doesn't define any standard sizes at all (well, you do have stdint.h these days). This is both a curse and a blessing. More often than not, it's a curse though.
 I still don't understand why they couldn't have packed all the languages
 that actually get used into the lowest 16 bits, and put all the crud like
 box-drawing characters and visible control codes and byzantine musical notes
 and runes and Aleutian indian that won't fit into the next 16 pages.
 There's lots of gaps in the first 65536 anyway.  And probably plenty of
 overlap, duplicated symbols (lots of languages have the same characters,
 especially latin-based ones).  Hell they should probably have done away with
 accented characters being distinct characters and enforced a combining rule
 from the start.  But the Unicode standards body wanted to please the
 typesetters, as opposed to giving the world a computer encoding that would
 actually be usable as a common text-storage and processing medium.  This
 thread shows just how convoluted Unicode really is.
 
 I think someone can (and probably will) do better.  Unfortunately I also
 believe that such an effort is doomed to failure.

Agreed. Unicode has a lot of cruft. One of my favourite pet peeves is the pair of characters:

    00C5 Å: LATIN CAPITAL LETTER A WITH RING ABOVE
    212B Å: ANGSTROM SIGN

The comment even says that the preferred representation is the latin Å. But, like you say, trying to do it once again will not succeed. It has taken us 10 or so years to get where we are. I'd say we accept Unicode for what it is. It's a hell of a lot better than the previous mess.

Regards

Elias Mårtenson
Dec 17 2003
prev sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
Sorry, "sign" of char.

"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brp52o$1gc6$1 digitaldaemon.com...
 It's stupid to not agree on a standard size for char, since it's easy to
 "fix" the sign of a char register by biasing it by 128 (xor 0x80 works

Dec 17 2003
prev sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 
As for the functions that handle individual characters, the first thing
that absolutely has to be done is to change them to accept dchar instead
of char.

Yes.

Good, I like it. :-)
Which one of these should I use? Should I use all of them? Today, people
seems to use the first option, but UTF-8 is horribly inefficient
performance-wise.

Do it as char[]. Have the internal implementation convert it to whatever format the underling operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).

Memory-wise perhaps. But for everything else UTF-8 is always slower. Consider what happens when the program is used with Russian? Every single character will need special decoding, except punctuation of course. Now think about Chinese and Japanese. These are even worse.
Also, in the case of char and wchar strings, how do I access an
individual character? Unless I missed something, the only way today is
to use decode(). This is a fairly common operation which needs a better
syntax, or people will keep accessing individual elements using the
array notation (str[n]).

It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.

Indeed. But they will be slow. Now, personally I can accept the slowness. Again, it's your call. What we do need to make sure is that the string/character handling package that we build is comprehensive in terms of Unicode support, and also that every single string handling function handles UTF-32 as well as UTF-8. This way a developer who is having performance problems with the default UTF-8 strings can easily change his hotspots to work with UTF-32 instead.
I.e. the "string" data type would be a wrapper or supertype for the
three different string types.

The best thing is to stick with one scheme for a program.

Unless the developer is bitten by the poor performance of UTF-8 that is. A package with perl-like functionality would be horribly slow if using UTF-8 rather than UTF-32. If we are to stick with UTF-8 as default internal string format, UTF-32 must be available as an option, and it must be easy to use.
 For char types, yes. But not for UTF-16, and win32 internally is all UTF-16.
 There are no locale-specific encodings in UTF-16.

True. But I can't see any use for UTF-16 outside communicating with external windows libraries. UTF-16 really is the worst of both worlds compared to UTF-8 and UTF-32. UTF-16 should really be considered the "native encoding" and left at that. Just like [the content of LC_CTYPE] is the native encoding when run in Unix. The developer should be shielded from the native encoding in that he should be able to say: "convert my string to the encoding my operating system wants (i.e. the native encoding)". As it happens, this is what wcstombs() does.
In Unix the platform specific encoding is determined by the environment
variable LC_CTYPE, although the trend is to be moving towards UTF-8 for
all locales. We're not quite there yet though. Check out
http://www.utf-8.org/ for some information about this.

Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.

Agreed. I am heavily lobbying for proper Unicode support everywhere. I've been bitten by too many broken applications. However, Windows has decided on UTF-16. Unix has decided on UTF-8. We need a way of transparently inputting and outputting strings so that they are converted to whatever encoding the host operating system uses. If we don't do this we are going to end up with a lot of conditional code that checks which OS (and encoding) is being used.
 No, I think D will provide an optional filter for I/O which will translate
 to/from locale dependent encodings. Wherever possible, the UTF-16 API's will
 be used to avoid any need for locale dependent encodings.

Why UTF-16? There is no need to involve platform specifics at this level. Remember that UTF-16 can be considered platform specific for Windows.
 'cuz I can never remember how they're spelled <g>.

All right... So how about adding to the utf8 package some functions called... Hmm... nativeToUTF8(), nativeToUTF32() and then an overloaded function utfToNative() (which accepts char[], wchar[] and dchar[])? "native" in this case would be a byte[] or ubyte[] to point out that this form is not supposed to be used in the program.
Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
stems from the fact that according to the C standard wchar_t is not
Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.

Indeed. That's why the Unix standard went a bit further and specified a wchar_t to be a Unicode character. The problem is with Windows, where wchar_t is 16-bit and thus cannot hold a Unicode character. And thus we end up with the current situation where using wchar_t in Windows really doesn't buy you anything, because you have the same problems as you would with UTF-8. You still cannot assume that a wchar_t can hold a single character. You still need all the funky iterators and decoding stuff to be able to extract individual characters. This is why I'm saying that the UTF-16 in Windows is horrible, and that UTF-16 is the worst of both worlds.
 Windows made the right decision given what was known at the time, it was the
 unicode folks who goofed by not defining unicode right in the first place.

I agree 100%. Java is in the same boat. How many people know that from JDK 1.5 and onwards it's a bad idea to use String.charAt()? (in JDK 1.5 the internal representation for String will change from UCS-2 to UTF-16). In other words, the exact same problem Windows faced. The Unicode people argue that they never guaranteed that it was a 16-bit character set, and while this is technically true, they are really trying to cover up their mess.
 It already does that for string literals. I've thought about implicit
 conversions for runtime strings, but sometimes trouble results from too many
 implicit conversions, so I'm hanging back a bit on this to see how things
 evolve.

True. We suffer from this in C++ (costly implicit conversions) and it would be nice to be able to avoid this. Regards Elias Mårtenson
Dec 17 2003
parent "Walter" <walter digitalmars.com> writes:
I think we're mostly in agreement!
Dec 17 2003
prev sibling next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.
I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of English-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.

Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.

So how about the following proposal:

- char is a 32 bit Unicode character
- wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions
- charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions

UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?

With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.

Hauke
Dec 16 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Hauke Duden wrote:

 I see your point, but I just can't see making utf8byte into a keyword 
 <g>.
 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't 
 see much
 of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree. You are better at explaining these things than I am. :-)
 I agree with Elias that the "char" type should be 32 bit, so that people 
 who simply use a char array as a string, as they have done for years in 
 other languages, will actually get the behaviour they expect, without 
 losing the Unicode support.

Indeed. In many cases existing code would actually continue working, since char[] would still declare a string. It wouldn't work when calling legacy libraries, though, but such calls don't work as-is anyway because of the zero-termination issue.
 Btw: this could also be used to solve the "oops, I forgot to make the 
 string null-terminated" problem when interacting with C functions. If 
 the D char is a different type than the old C char (which could be 
 called char_c or charz instead) then people will automatically be 
 reminded that they need to convert them.

Exactly.
 So how about the following proposal:
 
 - char is a 32 bit Unicode character
 - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 
 or 32 bits (depending on the system), provided for interoperability with 
 C functions
 - charz (or char_c? c_char?) is a normal 8 bit C character, also 
 provided for interoperability with C functions
 
 UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This 
 would at the same time remind users that the elements are NOT characters 
 but simply a bunch of binary data. I don't see the need to define a new 
 type for these - there are a lot of encodings out there, so why treat 
 UTF-8 and UTF-16 specially?
 
 With this system it would be instantly obvious that D strings are 
 Unicode. Interacting with legacy C code is still possible, and 
 accidentally passing a wrong (e.g. UTF-8) string to a C function that 
 expects ASCII or Latin-1 is impossible. Also, pure D code will 
 automatically be UTF-32, which is exactly what you need if you want to 
 make the lives of newbies easier. Otherwise people WILL end up using 
 ASCII strings when they start out.

We have to keep in mind that in most cases, when you call a legacy C function accepting (char *), the correct thing is to pass in a UTF-8 encoded string. The number of functions which actually fail when doing so is quite small.

What I'm saying here is that there are actually few "C function[s] that expect ASCII or Latin-1". Most of them accept a (char *) and work on it as if it were a byte array. Compare this to my (and your) suggestion of using byte[] (or ubyte[]) for UTF-8 strings.

Regards

Elias Mårtenson
Dec 16 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 We have to keep in mind that in most cases, when you call a legacy C 
 function accepting (char *), the correct thing is to pass in a UTF-8 
 encoded string. The number of functions which actually fail when doing 
 so is quite small.

They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only. printf will print garbage if you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...

Pretty much the only thing I can think of that will work correctly under all circumstances are simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

IMHO, the safest way to call C functions is to pass them strings encoded using the current system code page, because that's what the CRT expects a char array to be. Since the code page is different from system to system this makes a runtime conversion pretty much inevitable, but there's no way around that if you want Unicode support.

Hauke
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Hauke Duden wrote:

 They are not quite as few as one may think. For example, if you pass a 
 UTF-8 string to fopen then it will only work correctly if the filename 
 is made up of ASCII characters only.

Depends on the OS. Unix handles it perfectly.
 printf will print garbage if
 you pass it a UTF-8 character. If you use scanf to read a string from 
 stdin then the returned string will not be UTF-8, so you have to deal 
 with that. The is-functions (isalpha, etc.) will not work correctly for 
 all characters. toupper, tolower, etc. are not able to work with 
 non-ASCII characters. The list goes on...

Exactly. But the number of functions that do these things is still pretty small, compared to the total number of functions accepting strings. Take a look at your own code and try to classify the functions as UTF-8 safe or not. I think you'll be surprised.
 Pretty much the only thing I can think of that will work correctly under 
 all circumstances are simple C functions that pass strings through 
 unmodified (if they modify them they might slice them in the middle of a 
 UTF-8 sequence).

And, believe it or not, this is the major part of all such functions. But the discussion is really irrelevant, since we both agree that it is inherently unsafe.

Regards

Elias Mårtenson
Dec 16 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:brnas5$1940$1 digitaldaemon.com...
 Walter wrote:
 I see your point, but I just can't see making utf8byte into a keyword <g>. 
 The world has already gotten used to multibyte 'char' in C and the funky 
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much 
 of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all.

Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.
 A lot of english-speaking programmers
 simply treat chars as ASCII characters, even if there's some comment
 somewhere stating that the data should be UTF-8.

True, but code doesn't have to be changed much to allow for UTF-8. For example, D source text is UTF-8, and supporting that required little change in the D front end, and none in the back end. Trying to use UTF-32 internally to support this would have been a disaster.
 I agree with Elias that the "char" type should be 32 bit, so that people
 who simply use a char array as a string, as they have done for years in
 other languages, will actually get the behaviour they expect, without
 losing the Unicode support.

Other problems are introduced with that for the naive programmer who expects it to work just like ascii. For example, many people don't bother multiplying by sizeof(char) when allocating storage for char arrays. chars and 'bytes' in C are used willy-nilly interchangeably. Direct manipulation of chars (without going through ctype.h) is common for converting lower case to upper case. Etc. The nice thing about UTF-8 is it does work just like ascii when you're dealing with ascii data.
 Btw: this could also be used to solve the "oops, I forgot to make the
 string null-terminated" problem when interacting with C functions. If
 the D char is a different type than the old C char (which could be
 called char_c or charz instead) then people will automatically be
 reminded that they need to convert them.

 So how about the following proposal:

 - char is a 32 bit Unicode character

Already have that, it's 'dchar' <g>. There is nothing in D that prevents a programmer from using dchar's for his character handling chores.
 - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16
 or 32 bits (depending on the system), provided for interoperability with
 C functions

I've dealt with porting large projects between win32 and linux and the change in wchar_t size from 16 to 32. I've come to believe that method is a mistake, hence wchar and dchar in D. (One of the wretched problems is one cannot intermingle printf and wprintf to stdout in C.)
 - charz (or char_c? c_char?) is a normal 8 bit C character, also
 provided for interoperability with C functions

I agree that the 0 termination is an issue when calling C functions. I think this issue will fade, however, as the D libraries get more comprehensive. Another problem with 'normal' C chars is the confusion about whether they are signed or unsigned. The D char type is unsigned, period <g>.
 UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This
 would at the same time remind users that the elements are NOT characters
 but simply a bunch of binary data. I don't see the need to define a new
 type for these - there are a lot of encodings out there, so why treat
 UTF-8 and UTF-16 specially?

Treating UTF-8 and UTF-16 specially in D has great advantages in making the internal workings of the compiler and runtime library consistent. (No more problems mixing printf and wprintf!) I'm convinced that UTF is becoming the lingua franca of computing, and the other encodings will be relegated to sideshow status.
 With this system it would be instantly obvious that D strings are
 Unicode. Interacting with legacy C code is still possible, and
 accidentally passing a wrong (e.g. UTF-8) string to a C function that
 expects ASCII or Latin-1 is impossible.

Windows NT, 2000, XP, and onwards are internally all UTF-16. Any win32 API functions that accept 8 bit chars will immediately convert them to UTF-16. wchar_t's under win32 are UTF-16 encodings (including the 2 word encodings of UTF-16). Linux is internally UTF-8, if I'm not mistaken. This means D code will feel right at home with linux. Under win32, I plan on fixing all the runtime library functions to convert UTF-8 to UTF-16 internally and use the win32 API UTF-16 functions.

Hence, UTF is where the operating systems are going, and D is looking forward to mapping cleanly onto that. I believe that following the C approach of code pages, signed/unsigned char confusion, varying wchar_t sizes, etc., is rapidly becoming obsolete.
 Also, pure D code will
 automatically be UTF-32, which is exactly what you need if you want to
 make the lives of newbies easier. Otherwise people WILL end up using
 ASCII strings when they start out.

Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux, which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially on linux, with wchar_t being UTF-32, it really hogged the memory.
Dec 16 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
This is simply not true, Walter. The world has not gotten used to
multibyte chars in C at all.

Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.

Right, it has been around for decades. And people still don't use it properly. Don't make that same mistake again! I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.
Also, pure D code will
automatically be UTF-32, which is exactly what you need if you want to
make the lives of newbies easier. Otherwise people WILL end up using
ASCII strings when they start out.

Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially on linux, with wchar_t being UTF-32, it really hogged the memory.

Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-4 bytes per character. So you may end up using even more memory with UTF-8.

And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory, then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character.

Also, how much text did your "bad experience" application use? It seems to me that even if you assume the best case for UTF-8 (e.g. one byte per character), the memory overhead should not be much of an issue. It's only a factor of 4, after all. So assuming that your application uses 100,000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU's cache!

I think it's more important to have proper localization ability and programming ease than trying to conserve a few bytes for a limited group of people (i.e. English speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few years ago!

Please also keep in mind that a factor of 4 will be compensated for by memory enhancements in only 1-2 years' time. Most people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?

Hauke
Dec 17 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:brpvmn$2o0t$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

UTF-8 has some nice advantages over other multibyte encodings: it is possible to find the start of a sequence without backing up to the beginning, none of the bytes in a multibyte sequence have bit 7 clear (so they never conflict with ascii), and no additional information like code pages is necessary to decode them.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

That's correct. And D supports UTF-32 programming if that works better for the particular application.
 And about computing complexity: if you ignore the overhead introduced by
 having to move more (or sometimes less) memory then manipulating UTF-32
 strings is a LOT faster than UTF-8. Simply because random access is
 possible and you do not have to perform an expensive decode operation on
 each character.

Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.
 Also, how much text did your "bad experience" application use?

Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much later.
 It seems
 to me that even if you assume best-case for UTF-8 (e.g. one byte per
 character) then the memory overhead should not be much of an issue. It's
 only factor 4, after all.

It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)
 I think it's more important to have proper localization ability and
 programming ease than trying to conserve a few bytes for a limited group
 of people (i.e. english speakers). Being greedy with memory consumption
 when making long-term design decisions has always caused problems. For
 instance, it caused that major Y2K panic in the industry a few years ago!

You have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants to build his app around char, wchar, or dchar's. (None of my programs dating back to the 70's had any Y2K bugs in them <g>)
 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seem pretty quaint now <g>.
 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.
Dec 17 2003
next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brqr8e$vmh$1 digitaldaemon.com...
 "Hauke Duden" <H.NS.Duden gmx.net> wrote in message
 news:brpvmn$2o0t$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

 UTF-8 has some nice advantages over other multibyte encodings in that it is 
 possible to find the start of a sequence without backing up to the 
 beginning, none of the multibyte encodings have bit 7 clear (so they never 
 conflict with ascii), and no additional information like code pages are 
 necessary to decode them.

But with UTF-32, this is not an issue at all.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

That's correct. And D supports UTF-32 programming if that works better for the particular application.

Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.
 And about computing complexity: if you ignore the overhead introduced by
 having to move more (or sometimes less) memory then manipulating UTF-32
 strings is a LOT faster than UTF-8. Simply because random access is
 possible and you do not have to perform an expensive decode operation on
 each character.

 Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and 
 away most operations on strings were copying them, storing them, hashing 
 them, etc.

If that is correct, it might be just as correct, and even faster, to treat it as binary data in most cases. No need to have that data represented as String at all times.
 Also, how much text did your "bad experience" application use?

Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much later.

I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional chinese texts, mixed with a lot of different languages.
 It seems
 to me that even if you assume best-case for UTF-8 (e.g. one byte per
 character) then the memory overhead should not be much of an issue. It's
 only factor 4, after all.

It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)

I agree with you, speed is important. But if what you are serving is 8-bit .html files (latin language), why not treat the data as unsigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case. The definition of the language is what people are interested in at this point. What "dirty" tricks you use in the implementation to make it faster (right now, in some special cases, with a limited set of language data) is less interesting.
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)

I think this is a brilliant observation. I had not thought much about this. But I think my thought from above is still correct: why should the data for this special case be String at all? A good server software writer could obtain the ultimate speed by using unsigned bytes. That would give ultimate speed when necessary, and generally applicable String handling for all spoken languages would be enforced for String at the same time.
 I think it's more important to have proper localization ability and
 programming ease than trying to conserve a few bytes for a limited group
 of people (i.e. english speakers). Being greedy with memory consumption
 when making long-term design decisions has always caused problems. For
 instance, it caused that major Y2K panic in the industry a few years ago!

 You have a valid point, but things are always a tradeoff. D offers the
 flexibility of allowing the programmer to choose whether he wants to build
 his app around char, wchar, or dchar's.

With all due respect, I believe you are trading off in the wrong direction. Because you have a personal interest in good performance (which is good) you seem not to want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were Chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about.

In the performance trail of thought: do we all agree that the general String _manipulation_ handling in all programs will perform much better if choosing UTF-32 over UTF-8, when considering that the natural language data of the program would be traditional Chinese?

Another one: if UTF-32 were the base type of String, would it be applicable to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation? This should take care of most of the "data size"/thrashing related arguments...
 (None of my programs dating back to the 70's had any Y2K bugs in them <g>)

 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seem pretty quaint now <g>.

 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.

The option to do so is the problem. Programmers from a Latin-letter-using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

Thanks to all who took the time to read my take on these issues.

Regards, Roald
Dec 18 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:brsfkq$dpl$1 digitaldaemon.com...
 D programmers can use dchars if they want to.

The option to do so is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D. Thanks to all who took the time to read my take on these issues.

You raise some good points. This issue should not be treated too lightly.

It should be possible to work with text as bytes (for performance when interfacing with legacy non-Unicode strings), but that should definitely not be the preferred way. I think that there should be no char or wchar, and that dchar should be renamed char. That way, if you see byte[] in the code you won't be tempted to think of it as a string but more like raw data. UTF-8 can be well represented by byte[], and if you want to work directly with UTF-8, you can use a wrapper class from the D standard library.

Sean
Dec 18 2003
parent reply Lewis <dethbomb hotmail.com> writes:
 I think that there should be no char or wchar, and that dchar should be
 renamed char.  

Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?

Regards
Dec 18 2003
parent Elias Martenson <no spam.spam> writes:
On Thu, 18 Dec 2003 15:27:02 -0500, Lewis wrote:

 
 I think that there should be no char or wchar, and that dchar should be
 renamed char.  

 Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?

Most likely ushort[].

Regards

Elias Mårtenson
Dec 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:brsfkq$dpl$1 digitaldaemon.com...
 Yes, but that statement does not stop clueless/lazy programmers from
 using chars in libraries/programs where UTF-32 should have been used.

I can't really stop clueless/lazy programmers from writing bad code <g>.
 I think the profiling might have shown very different numbers if the
 native language of the profiling crew/test files were traditional
 chinese texts, mixed with a lot of different languages.

If an app is going to process primarily Chinese, it will probably be more efficient using dchar[]. If an app is going to process primarily English, then char[] is the right choice. The server app I wrote was for use primarily by American and European companies. It had to handle Chinese, but far and away the bulk of the data it needed to process was plain old ASCII. D doesn't force such a choice on the app programmer - he can pick char[], wchar[] or dchar[] to match the probability of the bulk of the text it will be dealing with.
 I agree with you, speed is important. But if what you are serving
 is 8-bit .html files (latin language), why not treat the data as
 usigned bytes? You are describing the "special case" as the
 explanation of why UTF-32 should not be the general case.

For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.
 You have a valid point, but things are always a tradeoff. D offers the 
 flexibility of allowing the programmer to choose whether he wants to build 
 his app around char, wchar, or dchar's.


 Because you have a personal interest in good performance (which is good)
 you seem to not want to consider the more general cases as being the
 general ones. I propose (as an experiment) that you try to think "what
 would I do if I were a chinese?" each time you want to make a tradeoff
 on string handling. This is what good design is all about.

I assume that a Chinese programmer writing Chinese apps would prefer to use dchar[]. And that is fully supported by D, so perhaps I am misunderstanding what our disagreement is about.
 In the performance trail of thought: do we all agree that the general
 String _manipulation_ handling in all programs will perform much better
 if choosing UTF-32 over UTF-8, when considering that the natural
 language data of the program would be traditional chinese?

Sure. But if the data the program will see is not chinese, then performance will suffer. As a language designer, I cannot determine what data the programmer will see, so D provides char[], wchar[] and dchar[] and the programmer can make the choice based on the data for his app.
 Another one: If UTF-32 were the base type of String, would it be
 applicable to have a "Compressed" attribute on each String? That way
 it could have as small as possible i/o, storage and memcpy size most
 of the time, and could be uncompressed for manipulation? This should
 take care of most of the "data size"/trashing related arguments...

An intriguing idea, but I am not convinced it would be superior to UTF-8. Data compression is relatively slow.
 D programmers can use dchars if they want to.

The option to do so is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

D is not going to force one to write internationalized apps, just make it easy to write them if the programmer cares about it. As opposed to C where it is rather difficult to write internationalized apps, so few bother.
 Thanks to all who took the time to read my take on these issues.

It's a fun discussion!
Dec 18 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Roald Ribe" <rr.no spam.teikom.no> wrote in message
 news:brsfkq$dpl$1 digitaldaemon.com...
 
Yes, but that statement does not stop clueless/lazy programmers from
using chars in libraries/programs where UTF-32 should have been used.

I can't really stop clueless/lazy programmers from writing bad code <g>.

But it is possible to make it harder to do so. I believe that is what this discussion is all about.
I think the profiling might have shown very different numbers if the
native language of the profiling crew/test files were traditional
chinese texts, mixed with a lot of different languages.

If an app is going to process primarily Chinese, it will probably be more efficient using dchar[]. If an app is going to process primarily English, then char[] is the right choice. The server app I wrote was for use primarily by American and European companies. It had to handle Chinese, but far and away the bulk of the data it needed to process was plain old ASCII.

I don't think most programmers (at the time of writing the code) are aware that their application is going to be used outside the local region. An example is the current project I'm working on: the old application that our new one is designed to replace is already exported throughout the world. Even though that is the case, when I came into the project there was absolutely zero understanding that we needed to support anything other than ISO-8859-1. As a result, we have lost a lot of time rewriting parts of the system. Now, I agree that the current D way would have made it a lot easier, but it could be even easier.
 D doesn't force such a choice on the app programmer - he can pick char[],
 wchar[] or dchar[] to match the probability of the bulk of the text it will
 be dealing with.

In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar to char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively; it just feels wrong to have an array of "char" which isn't really an array of characters.
 For overloading reasons. I never liked the C way of conflating chars with
 bytes. Having a utf type separate from a byte type enables more reasonable
 ways of handling things like string literals.

Right, I can see your reasoning, but does the type _really_ have to be named "char"?
 I assume that a chinese programmer writing chinese apps would prefer to use
 dchar[]. And that is fully supported by D, so I am misunderstanding what our
 disagreement is about.

Possibly, but in today's world it's not unusual that an application is developed in Europe but used in China, or developed in India but used in New Zealand.

Regards

Elias Mårtenson
Dec 19 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:bruen1$f05$1 digitaldaemon.com...
 I don't think most programmers (at the time of writing the code) is
 aware of the fact that his application is going to be used outside the
 local region.

Probably true.
 D doesn't force such a choice on the app programmer - he can pick char[],
 wchar[] or dchar[] to match the probability of the bulk of the text it will
 be dealing with.

In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar into char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively, it just feels wrong with an array of "char" which isn't really an array of characters.

Even despite the fact that in C and C++, char is byte-sized, it would probably be preferable to just rename "char" to "bchar" and "dchar" to "char". This corresponds to byte and int, but then wchar seems out of place, since in D there is short and ushort but not word. "schar" sounds like "signed char" and I believe we should stay away from that. What to do, what to do?
 For overloading reasons. I never liked the C way of conflating chars with
 bytes. Having a utf type separate from a byte type enables more reasonable
 ways of handling things like string literals.

Right, I can see your reasoning, but does the type _really_ have to be named "char"?

Good point. But there is the backward compatibility thing, which kind of sucks. It would subtly break any C app ported to D that allocated memory using malloc(N) and then stored an N-character string into it.
 I assume that a chinese programmer writing chinese apps would prefer to use
 dchar[]. And that is fully supported by D, so I am misunderstanding what our
 disagreement is about.

Possibly, but in todays world it's not unusual that an application is developed in europe but used in china, or developed in india but used in new zealand.

It will still work, but won't be as efficient as it could be. Sean
Dec 19 2003
parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
There has been a lot of talk about doing things, but very little has
actually happened. Consequently, I have made a string interface and two
rough and ready string classes for UTF-8 and UTF-32, which are attached to
this message.

Currently they only do a few things, one of which is to provide a consistent
interface for character manipulation. The UTF-8 class also provides direct
access to the bytes for when the user can do things more efficiently with
these. They can also be appended to each other. In addition, each provides a
constructor taking the other one as a parameter.

Please bear in mind that I am only an amateur programmer, who knows very
little about Unicode and has no experience of programming in the real world.
Nevertheless, I can appreciate some of the issues here and I hope that these
classes can be the foundation of something more useful.

From,

Rupert
Dec 19 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
Cool beans!  Thanks, Rupert!

This brings up a point.  The main reason that I do not like opAssign/opAdd
syntax for operator overloading is that it is not self-documenting that
opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
opCatAssign corresponds to a ~= b.  This information either has to be
present in a comment or you have to go look it up.  Yeah, D gurus will have
it memorized, but I'd rather there be just one "name" for the function, and
it should be the same both in the definition and at the point of call.

Sean

"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert

Dec 19 2003
parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
I agree with you, but we just have to grin and bear it, unless / until
Walter changes his mind. I suppose I could have commented my code better
though. Hopefully as I become more experienced, I will be a better judge of
these things.

"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert


Dec 19 2003
parent reply "Walter" <walter digitalmars.com> writes:
The problem with the operator* or operator~ syntax is that it is ambiguous. It's
also not greppable.

"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:brvr60$2il5$1 digitaldaemon.com...
 I agree with you, but we just have to grin and bear it, unless / until
 Walter changes his mind. I suppose I could have commented my code better
 though. Hopefully as I become more experienced, I will be a better judge of
 these things.

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert


Dec 19 2003
next sibling parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
If you say it's ambiguous, I'll take your word for it and if you think being
greppable is important, I'm also happy to accept that. My personal opinions
are not all that strong - it's only a minor inconvenience to have to check
the overload function names.

More importantly, what do you think of my request for more opSlice
overloads?

From,

Rupert

"Walter" <walter digitalmars.com> wrote in message
news:bs08b8$527$2 digitaldaemon.com...
 The problem with the operator* or operator~ syntax is that it is ambiguous. It's
 also not greppable.

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvr60$2il5$1 digitaldaemon.com...
 I agree with you, but we just have to grin and bear it, unless / until
 Walter changes his mind. I suppose I could have commented my code better
 though. Hopefully as I become more experienced, I will be a better judge of
 these things.

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean



Dec 20 2003
parent "Walter" <walter digitalmars.com> writes:
"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:bs1d9b$2033$1 digitaldaemon.com...
 More importantly, what do you think of my request for more opSlice
 overloads?

I haven't got that far yet!
Dec 20 2003
prev sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
It would be greppable if it were required that there be no space between the
operator and the symbol.  (if you use regexp you can get around this)

There should be some other way to embed the symbol into the identifier, if
it's causing too many lexer problems.

Sean

"Walter" <walter digitalmars.com> wrote in message
news:bs08b8$527$2 digitaldaemon.com...
 The problem with the operator* or operator~ syntax is that it is ambiguous. It's
 also not greppable.

Dec 20 2003
prev sibling next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brqr8e$vmh$1 digitaldaemon.com...
 Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and
 away most operations on strings were copying them, storing them, hashing
 them, etc.

That is my experience as well. Either that or it's parsing them more or less linearly.
 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

 I don't agree that memory is improving that fast. Even if it is, people
 will load them up with more data to fill the memory up. I will agree that
 code size is no longer that relevant, but data size is still pretty
 relevant. Stuff we were forced to do back in the bad old DOS 640k days
 seems pretty quaint now <g>.

Code size is actually still important on embedded apps (console video games) where the machine has a small code cache size (8K or less). On PS2, optimizing for size produces faster code in most cases than optimizing for speed.
 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.

So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32? Unfortunately then a char won't hold a single Unicode character; you have to mix char and dchar.

It would be nice to have a library function to pull the first character out of a UTF-8 string and increment the iterator pointer past it:

dchar extractFirstChar(inout char* utf8string);

That seems like an insanely useful text processing function. Maybe the reverse as well:

void appendChar(char[] utf8string, dchar c);

Sean
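[Editorial sketch: what such a decoder might look like, shown in C++ for illustration. The name extractFirstChar follows the signature proposed above and is not from any actual library; error handling is minimal.]

```cpp
#include <cstdint>

// Decode the first code point from a UTF-8 string and advance the pointer
// past it. A real implementation would also validate continuation bytes.
uint32_t extractFirstChar(const char*& s) {
    unsigned char b = static_cast<unsigned char>(*s++);
    if (b < 0x80) return b;                      // 1-byte sequence (ASCII)
    int extra; uint32_t cp;
    if      ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; } // 2 bytes
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; } // 3 bytes
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; } // 4 bytes
    else return 0xFFFD;                          // invalid lead byte
    while (extra--)                              // fold in 6 bits per byte
        cp = (cp << 6) | (static_cast<unsigned char>(*s++) & 0x3F);
    return cp;
}
```

Note that the caller's pointer ends up on the next sequence, which is exactly what a linear parsing loop wants.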
Dec 18 2003
next sibling parent Elias Martenson <no spam.spam> writes:
Den Thu, 18 Dec 2003 10:49:31 -0800 skrev Sean L. Palmer:

 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?
 
 Unfortunately then a char won't hold a single Unicode character, you have to
 mix char and dchar.

This is why I have advocated a rename of dchar to char, and the current char to something else (my first suggestion was utf8byte, but I can see why it was rejected off hand. :-) ).
 It would be nice to have a library function to pull the first character out
 of a UTF-8 string and increment the iterator pointer past it.
 
 dchar extractFirstChar(inout char* utf8string);
 
 That seems like an insanely useful text processing function.  Maybe the
 reverse as well:
 
 void appendChar(char[] utf8string, dchar c);

At least my intention when starting this second round of discussion was to iron out what the "D way" of handling strings is, so we can get to work on these library functions that you request.

Regards

Elias Mårtenson
Dec 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brssrg$135p$1 digitaldaemon.com...
 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?

Yes. Exactly.
 Unfortunately then a char won't hold a single Unicode character,

Correct. But a dchar will.
 you have to mix char and dchar.

 It would be nice to have a library function to pull the first character out
 of a UTF-8 string and increment the iterator pointer past it.
 dchar extractFirstChar(inout char* utf8string);

Check out the functions in std.utf.
 That seems like an insanely useful text processing function.  Maybe the
 reverse as well:
 void appendChar(char[] utf8string, dchar c);

Actually, a wrapper class around the string, overloading opApply, [], etc., will do the job nicely.
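[Editorial sketch: the reverse direction, along the lines of the appendChar proposed above. C++ is used for illustration; the name is from the post, not from an actual library.]

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8 and append the bytes to the string.
void appendChar(std::string& utf8, uint32_t c) {
    if (c < 0x80) {                                  // 1 byte
        utf8 += static_cast<char>(c);
    } else if (c < 0x800) {                          // 2 bytes
        utf8 += static_cast<char>(0xC0 | (c >> 6));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {                        // 3 bytes
        utf8 += static_cast<char>(0xE0 | (c >> 12));
        utf8 += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    } else {                                         // 4 bytes
        utf8 += static_cast<char>(0xF0 | (c >> 18));
        utf8 += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        utf8 += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    }
}
```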
Dec 18 2003
parent reply Karl Bochert <kbochert copper.net> writes:
On Thu, 18 Dec 2003 16:05:47 -0800, "Walter" <walter digitalmars.com> wrote:
 
 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brssrg$135p$1 digitaldaemon.com...
 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?

Yes. Exactly.
 Unfortunately then a char won't hold a single Unicode character,

Correct. But a dchar will.

A char is defined as a UTF-8 character but does not have enough storage to hold one!? ubyte[4] declares storage for 4 ubytes, but char[4]? The D manual describes a char as being a UTF-8 char AND being 8 bits? Can't a single UTF-8 character require multiple bytes for representation?

A datatype is some storage and a set of operations that can be done on that storage. In what way are char and ubyte different datatypes?

An array of a datatype is an indexable set of elements of that type. (Isn't it?) Given

    char foo[4];

does foo[2] not represent the third char in foo !!??

I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).

D's datatypes seem to be of two different varieties: names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others (ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant). Which is char?

Karl Bochert
Dec 20 2003
next sibling parent Elias Martenson <no spam.spam> writes:
Den Sat, 20 Dec 2003 19:33:59 +0000 skrev Karl Bochert:

 D's datatypes seem to be of two different varieties; names for units of memory
 and names for abstract types. Some (ubyte) describe a fixed amount af physical
 storage, while others ( ifloat?)  describe an abstract datatype whose physical
 structure
 is hidden (or at least irrelevant)
 Which is char?

It's a fixed memory type. Look at it as a ubyte, but with some special guarantees (upheld by convention). By your own question you have pointed out that the name "char" is not very good. But I really should stop pointing this out, or I'll be banned before I even get started with providing any actual value to the project. :-)

Regards

Elias Mårtenson
Dec 20 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Karl Bochert" <kbochert copper.net> wrote in message
news:1103_1071948839 bose...
 A char is defined as a UTF-8 character but does not have enough storage to
 hold one!?
Right.
 The D manual describes a char as being a UTF-8 char AND being 8-bits?

Yes.
 Can't a single UTF-8 character require multiple bytes for representation?

No.
 A datatype is some storage and a set of operations that can be done on
 that storage.
 In what way are char and ubyte different datatypes?

Only how they are overloaded, and how string literals are handled.
 An array of a datatype is an indexable set of elements of that type.
 (Isn't it?)
 Given
     char foo[4];

 does foo[2] not represent the third char in foo !!??

If it makes more sense, it is the third byte in foo.
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
 D's datatypes seem to be of two different varieties; names for units of
 memory and names for abstract types. Some (ubyte) describe a fixed amount
 of physical storage, while others (ifloat?) describe an abstract datatype
 whose physical structure is hidden (or at least irrelevant).
 Which is char?

char is a fixed 8 bits of storage.
Dec 20 2003
next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:bs3pmm$2m0v$2 digitaldaemon.com...
 "Karl Bochert" <kbochert copper.net> wrote in message
 news:1103_1071948839 bose...
 A char is defined as a UTF-8 character but does not have enough storage to
 hold one!?

 Right.

 The D manual  derscribes a char as being a UTF-8 char AND being 8-bits ?

Yes.
 Can't a single UTF-8 character require multiple bytes for
 representation?

 No.

??? A Unicode character can result in up to 6 bytes when encoded with UTF-8. Which is what the poster meant to ask, I think.

Roald
Dec 21 2003
next sibling parent "Walter" <walter digitalmars.com> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:bs4ddt$ig4$1 digitaldaemon.com...
 Can't a single UTF-8 character require multiple bytes for
 representation?

 No.

??? A unicode character can result in up to 6 bytes used, when encoded with UTF-8. Which is what the poster meant to ask, I think.

Sure, perhaps I misunderstood him.
Dec 22 2003
prev sibling parent "Serge K" <skarebo programmer.net> writes:
 ???
 A unicode character can result in up to 6 bytes used, when encoded
 with UTF-8.

UTF-8 can represent all Unicode characters with no more than 4 bytes. ISO/IEC 10646 (UCS-4) may require up to 6 bytes in UTF-8, but it is a superset of Unicode.
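[Editorial sketch: the sequence length can be read straight off the lead byte, which makes the 4-byte limit for Unicode proper easy to see. C++ for illustration; the helper name is invented.]

```cpp
// Byte length of a UTF-8 sequence, determined by its lead byte. The 5- and
// 6-byte forms exist only in the old UCS-4 scheme; Unicode proper (up to
// U+10FFFF) never needs more than 4 bytes.
int sequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;   // 0xxxxxxx: ASCII
    if (lead < 0xC0) return 0;   // 10xxxxxx: continuation byte, not a lead
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF8) return 4;   // 11110xxx: longest form Unicode uses
    if (lead < 0xFC) return 5;   // 111110xx: UCS-4 only
    return 6;                    // 1111110x: UCS-4 only
}
```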
Dec 30 2003
prev sibling next sibling parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread. You can see the message here: D/20619

As for the attached file - it does not appear to be accessible to users of the webservice, so I have placed it on the wiki at: http://www.wikiservice.at/wiki4d/wiki.cgi?StringClasses

Rupert
Dec 21 2003
parent reply Ant <Ant_member pathlink.com> writes:
In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that? D/19525

It's bigger so it must be better ;)

Ant
Dec 21 2003
parent Rupert Millard <rupertamillard hotmail.DELETE.THIS.com> writes:
Ant <Ant_member pathlink.com> wrote in
news:bs4gc8$n2c$1 digitaldaemon.com: 

 In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that? D/19525 It's bigger so it must be better ;) Ant

You had me worried here because I missed that post! However, they do slightly different things, I think. Mine indexes characters rather than bytes in UTF-8 strings. Vathix's does many other string handling things (e.g. changing case).

My code needs to be integrated into his, if it can be - I'm not sure what implications his use of templates has. You're quite correct - as they currently are, his is vastly more useful. I can't think of many situations where you need to index whole characters rather than bytes. My main reason for writing it was that I enjoy writing code.

Rupert
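[Editorial sketch: indexing the n'th character (not byte) of a UTF-8 string means walking the sequences from the start - O(n) rather than O(1), which is the trade-off discussed above. C++ for illustration; the helper name is invented and no validation is done.]

```cpp
#include <cstddef>
#include <string>

// Return the byte offset at which character number n starts. Each step
// skips one whole sequence, using the lead byte to find its length.
std::size_t byteOffsetOfChar(const std::string& utf8, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if      (b < 0x80) i += 1;   // 0xxxxxxx
        else if (b < 0xE0) i += 2;   // 110xxxxx
        else if (b < 0xF0) i += 3;   // 1110xxxx
        else               i += 4;   // 11110xxx
        --n;
    }
    return i;
}
```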
Dec 21 2003
prev sibling parent Ilya Minkov <minkov cs.tum.edu> writes:
I think this discussion of "language being wrong" is wrong. It is 
obviously clear that the char[], char, and other associated types don't 
have sensible higher-level semantics. The examples are many.

Obviously, i find it quite right from the language not to constrain the 
programmers to high-level types. It is a job for the library.

Now, everyone. Walter has quite enough to do of what he does better than 
all of us. Improving on a standard library is a job which he delegates 
to us.

A library class or struct String should be indexed by a real character 
scanning, and not by the address, even if it means more overhead. And the 
result of this indexing, as well as any single-character access, would be 
a dchar. The internal representation should still be accessible, for the 
case someone finds high-level semantics a bottleneck within his application.

Besides, myself and Mark have proposed a number of solutions a while 
ago, which would give strings non-standard storage, but would allow the 
high level representation to be significantly faster, at the cost of 
ease of operating on a lower-level representation.

-eye
Dec 21 2003
prev sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
I don't see how the design of the UTF-8 encoding adds any advantage over
other multibyte encodings that might cause people to use it properly.

UTF-8 has some nice advantages over other multibyte encodings in that it is possible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages are necessary to decode them.
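[Editorial sketch: the self-synchronizing property described here - every continuation byte matches the bit pattern 10xxxxxx - means the start of the enclosing sequence is always at most a few bytes back. C++ for illustration; the helper name is invented.]

```cpp
#include <cstddef>

// From an arbitrary byte position, step back over continuation bytes
// (10xxxxxx) to the lead byte of the enclosing sequence. Within Unicode's
// 4-byte limit this is at most 3 steps - no rescan from the beginning.
std::size_t sequenceStart(const unsigned char* s, std::size_t pos) {
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        --pos;
    return pos;
}
```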

The only situation I can think of where this might be useful is if you want to jump directly into the middle of a string. And that isn't really useful for UTF-8 because you do not know how many characters were before that - so you have no idea where you've "landed".
And about computing complexity: if you ignore the overhead introduced by
having to move more (or sometimes less) memory then manipulating UTF-32
strings is a LOT faster than UTF-8. Simply because random access is
possible and you do not have to perform an expensive decode operation on
each character.

Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.

Hmmm. That IS interesting. Now that you mention it, I think this would also apply to most of my own code. Though it might depend on the kind of application.
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's
 cache!

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)

I hadn't thought of applications that do nothing but serve data/text to others. That's a good counter-example against some of my arguments. Having the server run at 1/2 capacity because of string encoding seems to be too much. So I think you're right in having multiple "native" encodings.

That still leaves the problem of providing easy ways to work with strings, though, to ensure that newbies will "automatically" write Unicode-capable applications. That's the only way I see to avoid the situation we see in C/C++ code right now.

What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times. I can think of only three ways around that:

1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead.

2) using some sort of template and letting the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack...

3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII.

I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completely transparent.

Hauke
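[Editorial sketch: option 3 might look roughly like this. C++ is used for illustration; all names are invented, and a real design would also need iterators, slicing, etc.]

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// An abstract interface that every encoding-specific string class
// implements, so functions written against the interface accept any
// encoding without conversion.
struct IString {
    virtual ~IString() {}
    virtual std::size_t length() const = 0;            // in characters
    virtual uint32_t charAt(std::size_t i) const = 0;  // decoded code point
};

// One concrete encoding; a Utf8String, Latin1String, etc. would implement
// the same interface with their own internal representation.
class Utf32String : public IString {
    std::vector<uint32_t> data;
public:
    Utf32String(const uint32_t* p, std::size_t n) : data(p, p + n) {}
    std::size_t length() const { return data.size(); }
    uint32_t charAt(std::size_t i) const { return data[i]; }
};

// Library code sees only the interface, never the encoding.
bool containsChar(const IString& s, uint32_t c) {
    for (std::size_t i = 0; i < s.length(); ++i)
        if (s.charAt(i) == c) return true;
    return false;
}
```

The cost, as noted below, is that such calls are virtual and hard to inline.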
Dec 19 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bruief$kav$1 digitaldaemon.com...
 What's bad about multiple encodings is that all libraries would have to
 support 3 kinds of strings for everything. That's not really feasible in
 the real world - I certainly don't want to write every function 3 times.

I had the same thoughts!
 I can think of only two ways around that:

 1) some sort of automatic conversion when the function is called. This
 might cause quite a bit of overhead.

 2) using some sort of template and let the compiler generate the 3
 special cases. I don't think normal templates will work here, because we
 also need to support string functions in interfaces. Maybe we need some
 kind of universal string argument type? So that the compiler can
 automatically generate 3 functions if that type is used in the parameter
 list? Seems a bit of a hack....

My first thought was to template all functions taking a string. It just got too complicated.
 3) making the string type abstract so that string objects are
 compatible, no matter what their encoding is. This has the added benefit
 (as I have mentioned a few times before ;)) that users could have
 strings in their own encoding, which comes in handy when you're dealing
 with legacy code that does not use US-ASCII.

 I think 3 would be the most feasible. You decide about the encoding when
 you create the string object and everything else is completely

I think 3 is the same as 1!
Dec 19 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
I can think of only two ways around that:

1) some sort of automatic conversion when the function is called. This
might cause quite a bit of overhead.


3) making the string type abstract so that string objects are
compatible, no matter what their encoding is. This has the added benefit
(as I have mentioned a few times before ;)) that users could have
strings in their own encoding, which comes in handy when you're dealing
with legacy code that does not use US-ASCII.

I think 3 would be the most feasible. You decide about the encoding when
you create the string object and everything else is completely

 transparent.

I think 3 is the same as 1!

Not really ;). With 1 I meant having unrelated string classes (maybe source code compatible, but not derived from a common base class). That would mean that a temporary object would have to be created if a function takes, say, a UTF-8 string as an argument but you pass it a UTF-32 string.

Pros: the compiler can do more inlining, since it knows the object type.

Cons: the performance gain of the inlining is probably lost with all the conversions that will be going on if you use different libs. It is also not possible to easily add new string types without having to add the corresponding copy constructor and =operators to the existing ones.

With 3 there would not be such a problem. All functions would have to use the common string interface for their arguments, so any kind of string object that implements this interface could be passed without a conversion.

Pros: adding new string encodings is no problem, passing string objects never causes new objects to be created or data to be converted.

Cons: most calls can probably not be inlined, since the functions will never know the actual class of the strings they work with. Also, if you want to pass a string constant to a function you'll have to explicitly wrap it in an object, since the compiler doesn't know what kind of object to create to convert a char[] to a string interface reference.

The last point would go away if string constants were also string objects. I think that would be a good idea anyway, since that'd make the string interface the default way to deal with strings.

Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about.

Hauke
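Hauke's option 3 can be sketched in a few lines. The following is an illustration in Python, not D; the names (UString, Utf8String, Utf32String, count_points) are hypothetical and only show the shape of the idea: library functions are written once against an abstract interface, and any encoding-specific string object passes through without conversion or temporaries.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of option 3: an abstract string interface that
# concrete, encoding-specific classes implement.
class UString(ABC):
    @abstractmethod
    def code_points(self):
        """Iterate over the string as Unicode code points."""

class Utf8String(UString):
    def __init__(self, data: bytes):
        self._data = data            # stored as UTF-8 bytes
    def code_points(self):
        return (ord(c) for c in self._data.decode("utf-8"))

class Utf32String(UString):
    def __init__(self, points):
        self._points = list(points)  # stored as 32-bit code points
    def code_points(self):
        return iter(self._points)

# A library function written once against the interface: no conversion
# or temporary object is created, whichever encoding the caller uses.
def count_points(s: UString) -> int:
    return sum(1 for _ in s.code_points())

print(count_points(Utf8String("héllo".encode("utf-8"))))   # 5
print(count_points(Utf32String(map(ord, "héllo"))))        # 5
```

Note how both calls hit the same function body; the cost, as Hauke says, is that such calls go through a virtual dispatch and are hard to inline.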
Dec 19 2003
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Hauke Duden wrote:
 Another solution would be if there was some way to write global 
 conversion functions that are called to do implicit conversions between 
 different types. Such functions could also be useful in many other 
 circumstances, so that might be an idea to think about.

Just to clarify: I meant this in the context of creating a string interface instance from a string constant, not to convert between different string objects (which wouldn't make much sense). E.g.

interface string { ... }
class MyString implements string { ... }

void print(string msg) { ... }

Without an implicit conversion we'd have to write:

print(new MyString("Hello World"));

With an implicit conversion that'd look like this:

string opConvert(char[] s) { return new MyString(s); }

print("Hello World");

[The last line would translate to print(opConvert("Hello World")) ]

Hauke
Dec 19 2003
prev sibling parent reply "Serge K" <skarebo programmer.net> writes:
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

UTF-8 means multibyte encoding for most of the languages (except English and maybe some others). Most of the European and Asian languages need just one UTF-16 unit per character. For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated as <1%. Other scripts encoded in "higher planes" cover very rare or dead languages and some special symbols.

In most of the cases a UTF-16 string can be treated as a simple array of UCS-2 characters. You just need to know if it has surrogates // if (number_of_characters < number_of_16bit_units)
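Serge's byte counts, and his surrogate test at the end, can both be checked directly. A small Python illustration (not D; the sample characters and the helper name has_surrogates are my own choices):

```python
# UTF-8 never needs more than 4 bytes per character, and each script
# class falls where Serge says it does.
samples = {
    "A":  1,  # ASCII
    "Ж":  2,  # Cyrillic
    "愛": 3,  # CJK, Basic Multilingual Plane
    "𐍈": 4,  # U+10348 Gothic hwair, a "higher plane" character
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# Serge's surrogate test for UTF-16: a string contains surrogate pairs
# exactly when it has fewer characters than 16-bit units.
def has_surrogates(s: str) -> bool:
    units = len(s.encode("utf-16-le")) // 2  # number of 16-bit code units
    return len(s) < units                    # number_of_characters < units

print(has_surrogates("héllo"))  # False: all BMP characters
print(has_surrogates("𐍈"))     # True: one character, two UTF-16 units
```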
Dec 30 2003
parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:bst8q3$218i$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

This is a good point. But I stand my ground: it may result in up to 6 bytes used for each character (worst case).
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.
 UTF-8 means multibyte encoding for most of the languages (except English and
 maybe some others)

Right.
 Most of the European and Asian languages need just one UTF-16 unit per
 character.

Yes most, but not all.
 For CJK languages occurrence of the UTF-16 surrogates in the real texts is
 estimated as <1%.

The code to handle it still has to be present...
 Other scripts encoded in "higher planes" cover very rare or dead languages
 and some special symbols.

 In most of the cases UTF-16 string can be treated as simple array of UCS-2
 characters.

Yes, but "most cases" is not a good argument when the original discussion was initiated to handle ALL languages, in a way that the developer would find to be "natural", easy and integrated in the D language.
 You just need to know if it has surrogates // if (number_of_characters <
 number_of_16bit_units)

There is no such thing as "just" with these issues (IMHO) ;-) Roald
Dec 30 2003
parent "Serge K" <skarebo programmer.net> writes:
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters



 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.

RTFM.

[The Unicode Standard, Version 4.0]

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

UTF-8

D36. UTF-8 encoding form: The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-5.

[Table 3-5. UTF-8 Bit Distribution]

Scalar Value                  1st Byte   2nd Byte   3rd Byte   4th Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy   10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz   10yyyyyy   10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu   10uuzzzz   10yyyyyy   10xxxxxx

[Appendix C : Relationship to ISO/IEC 10646]

C.3 UCS Transformation Formats

UTF-8

The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alternative coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described under definition D36 in Section 3.9, Unicode Encoding Forms.

...

The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters.
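Table 3-5 above transcribes directly into code. A Python sketch of the encoder it describes (illustrative only; the function name utf8_encode is mine, and this toy version skips the surrogate-range validation a real encoder must do), showing that four bytes suffice for any scalar value up to U+10FFFF:

```python
# A direct transcription of Table 3-5's bit distribution.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                    # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                   # 110yyyyy 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                 # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -- the 4-byte ceiling
    return bytes([0xF0 | cp >> 18,
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with Python's built-in encoder across all four byte lengths,
# including the very last code point U+10FFFF.
for cp in [0x41, 0x416, 0x611B, 0x10348, 0x10FFFF]:
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```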
Jan 03 2004
prev sibling next sibling parent reply Matthias Becker <Matthias_member pathlink.com> writes:
In a higher level language, yes. But in doing systems work, one always seems
to be looking at the lower level elements anyway. I wrestled with this for a
while, and eventually decided that char[], wchar[], and dchar[] would be low
level representations. One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

Shouldn't this wrapper be part of Phobos?
Dec 17 2003
parent "Walter" <walter digitalmars.com> writes:
"Matthias Becker" <Matthias_member pathlink.com> wrote in message
news:brpr00$2grc$1 digitaldaemon.com...
In a higher level language, yes. But in doing systems work, one always


to be looking at the lower level elements anyway. I wrestled with this


while, and eventually decided that char[], wchar[], and dchar[] would be


level representations. One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

Shouldn't this wrapper be part of Phobos?

Eventually, yes. First things first, though, and the first step was making the innards of the D language and compiler fully unicode enabled.
Dec 17 2003
prev sibling parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brll85$1oko$1 digitaldaemon.com...
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

 In a higher level language, yes. But in doing systems work, one always seems
 to be looking at the lower level elements anyway. I wrestled with this for a
 while, and eventually decided that char[], wchar[], and dchar[] would be low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.


 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>.
The world has already gotten used to multibyte 'char' in C and the funky
'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much
of an issue here.


 And here is also the core of the problem: having an array of "char"
 implies to the unwary programmer that the elements in the sequence
 are in fact "characters", and that you should be allowed to do stuff
 like isspace() on them. The fact that the libraries provide such
 function doesn't help either.

I think the library functions should be improved to handle unicode chars.
But I'm not much of an expert on how to do it right, so it is the way it is
for the moment.

 I'd love to help out and do these things. But two things are needed


     - At least one other person needs to volunteer.
       I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
     - The core concepts needs to be decided upon. Things seems to be
       somewhat in flux right now, with three different string types
       and all. At the very least it needs to be deicded what a "string"
       really is, is it a UTF-8 byte sequence or a UTF-32 character
       sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
 That's correct as well. The library's support for unicode is



 But there also is a nice package (std.utf) which will convert between
 char[], wchar[], and dchar[]. This can be used to convert the text strings
 into whatever unicode stream type the underlying operating system API
 expects. (For win32 this would be UTF-16, I am unsure what linux supports.)

and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native
decoding'. There is only one possible conversion of UTF-8 to UTF-16.

 Unless fundamental encoding/decoding is embedded in the streams library,
 it would be best to simply read text data into a byte array and then
 perform native decoding manually afterwards using functions similar
 to the C mbstowcs() and wcstombs(). The drawback to this is that you
 cannot read text data in platform encoding without copying through
 a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the
sand and assert that D char[] strings are NOT locale or code page dependent.
They are UTF-8 strings. If you are reading code page or locale dependent
strings, to put them into a char[] will require running it through a
conversion.

 D is headed that way. The current version of the library I'm working on
 converts the char[] strings in the file name API's to UTF-16 via
 std.utf.toUTF16z(), for use calling the win32 API's.

the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue.

I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.

 In C, as already mentioned,
 these are called mbstowcs() and wcstombs(). For Windows, these would
 convert to and from UTF-16. For Unix, these would convert to and from
 whatever encoding the application is running under (dictated by the
 LC_CTYPE environment variable). There really is no need to make the
 API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF.

This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)

 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Following this discussion, I have read some more on the subject. In addition to the speed issues that were mentioned, I have had some insights on the issues of endianness, serialization, BOM (Byte Order Mark) ++

Most of it can be found in a reasonably short pdf document:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

There is even more to this than I first believed... Based on the new knowledge I become more and more convinced that the choice of UTF-8 encoding as the basic "correct thing to do" for general use in a programming language is well founded. But when text _processing_ comes into play, other rules apply.

But: I still find it objectionable to call one byte in a UTF-8/Unicode based language a char! ;-) The naming will of course make it easier to do a straight port from C to D, but such a port will in most cases be of no use on the "International scene". Oh well, this can be argued for/against well both ways I guess...

IMHO there should be no char type at all. Only byte. Or maybe to take more sizes into consideration: bin8, bin16, bin32, bin64... I think porting from C to D should involve renaming char's to bin8's.

Hmmm... It is sad when learning more makes you want to change less ;-) Anyway, there is more to be learned...

Roald
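The BOM issue Roald mentions (chapter 2 of the pdf above) is mechanical enough to sketch. An illustration in Python, assuming the BOM constants from its standard codecs module; the function name detect_bom is hypothetical:

```python
import codecs

# Sketch of BOM-based encoding detection. Order matters: the UTF-32
# BOMs begin with the same bytes as the UTF-16 BOMs, so UTF-32 must be
# tested first (and a UTF-16-LE text starting with U+0000 would still
# be misdetected -- the BOM is a hint, not a proof).
def detect_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF32_LE): return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE): return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):     return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE): return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE): return "utf-16-be"
    return None  # no BOM: the encoding must be known out of band

print(detect_bom("hi".encode("utf-8-sig")))        # utf-8
print(detect_bom(codecs.BOM_UTF16_LE + b"h\x00"))  # utf-16-le
```

The last case (no BOM, return None) is exactly the serialization problem: a bare byte stream does not announce its own encoding.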
Dec 31 2003
prev sibling parent "Elohe" <GODA-XEN terra.es> writes:
First: I'm new to D and my English is bad.

I really like UTF-8, but in truth it is not always efficient (local character access...). In a small number of C/C++ programs I needed to use internal UTF-32 instead of UTF-8, but later I introduced a hack: I indexed the UTF-8 character number/position and kept a standard UTF-8 vector. The memory needed is lower than with UTF-32 in my most frequent cases, and the memory efficiency is better than UTF-32. In my experience this works very well for Latin and CJK languages (the two encodings I normally use), but for Cyrillic, Arabic... the memory use can be bigger than UTF-32, though with an efficient indexing system we can equal the memory needed by UTF-32. In performance the penalty is less than 8 times slower than a UTF-32 implementation; compared to the penalty of standard UTF-8 it is very fast.

I recommend adding:

stringi -> indexed string for utf8

and the possibility to mark an internal representation of the utf like:

string utf8-32 -> this marks an utf8 string, but it works internally as utf32
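The "stringi" idea above can be sketched as a UTF-8 byte array plus a side index from character number to byte offset, so that s[n] no longer requires scanning from the start. A Python illustration (the class name IndexedUtf8 is hypothetical; a real implementation would build the index lazily or sparsely to save memory):

```python
# Sketch of an indexed UTF-8 string: the text stays in UTF-8, and a
# side table records the byte offset where each character starts.
class IndexedUtf8:
    def __init__(self, text: str):
        self._data = text.encode("utf-8")
        self._index = []          # byte offset of each character
        off = 0
        for ch in text:
            self._index.append(off)
            off += len(ch.encode("utf-8"))

    def __len__(self):
        return len(self._index)   # characters, not bytes

    def __getitem__(self, n: int) -> str:
        start = self._index[n]
        end = (self._index[n + 1] if n + 1 < len(self._index)
               else len(self._data))
        return self._data[start:end].decode("utf-8")

s = IndexedUtf8("ħéllo 愛")
print(len(s))   # 7 characters (11 UTF-8 bytes)
print(s[6])     # 愛 -- direct access, no scan from the start
```

The trade-off matches the post: the index costs extra memory (here one offset per character, which is what makes Cyrillic or Arabic text potentially heavier than plain UTF-8), but indexed access is far faster than decoding from the beginning each time.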
Jan 07 2004