
D - Unicode discussion

reply Elias Martenson <elias-m algonet.se> writes:
DISCLAIMER: I am not a "D programmer". I certainly haven't written any
real-world applications in the language yet, but I am very knowledgeable
about localisation issues.

After the recent discussion regarding Unicode in D, which seems to
have faded away now, I have decided to write some initial comments on
what needs to be done to the language and APIs to make it support all
languages, not only English and Latin (which to my knowledge are the
only languages that can be written using 7-bit ASCII).

char types
----------

Today, according to the specification, there are three char types:
char, wchar and dchar. Arrays of these are then used to create three
different internal string representations: UTF-8, UTF-16 and UTF-32.

There are several problems with this. First and foremost, a
declaration such as "char[] foo" gives the impression that this is an
array of characters. It is not. The UTF-8 specification dictates that
a UTF-8 string is a sequence of bytes, not characters. This is an
important distinction to make, since you cannot take the n'th
character from a UTF-8 stream with string[n]: you may get part of a
multibyte character sequence.

The wchar data type has the exact same problem, since it uses UTF-16
which also uses variable lengths for its characters.
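A small example makes the pitfall concrete. This is only a sketch (it uses the
std.utf conversion package mentioned later in this thread); the byte counts
follow from the UTF-8 encoding itself:

    import std.utf;

    void main()
    {
        char[] s = "naïve";              // 'ï' encodes as two UTF-8 bytes
        assert(s.length == 6);           // six code units...
        assert(toUTF32(s).length == 5);  // ...but only five characters
        // s[2] is the trailing byte of the 'ï' sequence, not a character
    }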

What is needed is a "char" datatype that is in fact able to hold a
character. You need 21 bits to describe a Unicode character (Unicode
allocates 17*2^16 code points, not all of which are yet defined), so
it seems reasonable to use a 32-bit data type for this.

In my opinion this data type should be named "char". For UTF-8 and
UTF-16 strings, one can use the "byte" and "short" data types, which
would be in keeping with the Unicode standards, which (to my
knowledge; I'd have to look up the exact wording) declare UTF-8 and
UTF-16 strings to be sequences of bytes and 16-bit words respectively,
not "characters".

String classes and functions
----------------------------

There is a set of const char[] arrays containing various character
sequences, including hexdigits, digits, uppercase, letters,
whitespace, etc. There are also character classification functions
that accept 8-bit characters. These should really be replaced by a
similar set of functions that work with 32-bit char types:

     isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()

These cannot be inlined functions since newer versions of the Unicode
standard can declare new code points and we need to be forward
compatible.

Another function is also needed: getCharacterCategory(), which returns
the Unicode category. Further functions are needed to determine other
properties of the characters, such as directionality. Take a look at
the Java classes java.text.BreakIterator and java.text.Bidi to get
some ideas.
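A sketch of what such an interface could look like (all names here are
illustrative suggestions, not an existing D API):

    // Classification works on whole characters (dchar), never on
    // UTF-8/UTF-16 code units.
    enum CharCategory { Lu, Ll, Lt, Nd, Zs /* ...the other Unicode categories */ }

    bool isAlpha(dchar c);
    bool isNumber(dchar c);
    bool isWhiteSpace(dchar c);
    CharCategory getCharacterCategory(dchar c);

    // Implemented as runtime table lookups, so that shipping updated
    // Unicode data does not require recompiling callers.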

Streams
-------

The current std.stream is not adequate for Unicode. It doesn't seem to
take encodings into consideration at all but is simply a binary
interface.

Strings in the Phobos stream library seem to deal primarily with
char[] and wchar[]. The most important stream type, dchar[], is not
even considered. Another problem with the library is that the point at
which native encoding<->Unicode conversion is performed is not
defined.

Personally, I have not given this much consideration yet, although I
rather like the way Java did it by introducing two different kinds of
streams: byte streams and character streams. More discussion is
clearly needed.

Interoperability
----------------

In particular, C often uses 8-bit char arrays to represent
strings. This causes a problem when all strings are 32-bit
internally. The most straightforward solution is to convert UTF-32
char[] to UTF-8 byte[] before a call to a legacy function. This would
also very elegantly deal with the problem of zero-terminated C
strings vs. non-zero-terminated D strings (one of the char[]->UTF-8
conversion functions should create a zero-terminated byte array).
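The call site might look roughly like this (a sketch: std.utf.toUTF8 is the
real conversion and in D as it stands returns a char[]; appending the
terminator by hand stands in for the zero-terminating variant proposed above):

    import std.utf;

    extern (C) int puts(char* s);    // a legacy C function

    void logLegacy(dchar[] msg)
    {
        char[] utf8 = toUTF8(msg);   // UTF-32 -> UTF-8
        utf8 ~= '\0';                // C expects zero termination
        puts(utf8.ptr);
    }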
Dec 15 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brjvsf$28lb$1 digitaldaemon.com...
 char types
 ----------

 Today, according to the specification, there are three char
 types. char, wchar and dchar. These are then used in an array to
 create three different kinds of internal string representaions: UTF-8,
 UTF-16 and UTF-32.

 There are several problems with this. First and foremost, when an
 expression such as this: "char[] foo" you get the impression that this
 is an array of characters. This is wrong. The UTF-8 specification
 dictates that a UTF-8 string is an array of bytes, not
 characters. This is an important distiction to make since you cannot
 take the n'th character from a UTF-8 stream like this: string[n],
 since you may get a part of a multibyte character sequence.

 The wchar data type has the exact same problem, since it uses UTF-16
 which also uses variable lengths for its characters.

 What is needed is a "char" datatype that is infact able to hold a
 character. You need 21 bits to describe a unicode character (Unicode
 allocates 17*2^16 code points, all of which are not yet defined) and
 therefore it seems reasonable to use a 32-bit data type for this.

 In my opinion this data type should be named "char". For UTF-8 and
 UTF-16 strings, one can use the "byte" and "short" data types, which
 would be in keeping with the Unicode standards which (to my knowledge,
 I'd have to look up the exact wording) declare UTF-8 strings as being
 sequences of bytes and 16-bit words respectively, and not
 "characters".

The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)
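For instance, the distinct types let both of these overloads coexist, which
wouldn't work if char were just an alias for a byte type (a minimal sketch):

    void put(ubyte[] data) { /* treat as raw binary */ }
    void put(char[] text)  { /* treat as UTF-8 text  */ }

    void demo()
    {
        ubyte[] raw = new ubyte[16];
        put(raw);      // picks the binary overload
        put("hello");  // picks the text overload: string literals are char[]
    }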
 String classes and functions
 ----------------------------

 There are a set of const char[] arrays containing various character
 sequences including: hexdigits, digits, uppercase, letters,
 whitespace, etc... There are also character classification functions
 that accept 8-bit characters. These should really be replaced by a new
 but similar set of functions that work with 32-bit char types.

      isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()

 These cannot be inlined functions since newer versions of the Unicode
 standard can declare new code points and we need to be forward
 compatible.

 Another funtion is also needed: getCharacterCategory() which returns
 the Unicode category. Some other functions are needed to determine
 other properites of the characters such as the directionality. Take a
 look at the Java classes java.text.BreakIterator and java.text.Bidi to
 get some ideas.

I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?
 Streams
 -------

 The current std.stream is not adequate for Unicode. It doesn't seem to
 take encodings into consideration at all but is simply a binary
 interface.

That's correct.
 Strings in the Phobos stream library seems to deal primarily with
 char[] and wchar[]. The most important stream type, dchar[] is not
 even considered. Another problem with the library is that the point as
 which native encodiding<->unicode conversion is performed is not
 defined.

That's correct as well. The library's support for unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)
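In outline, the std.utf conversions look like this (function names as
documented; the example string is mine):

    import std.utf;

    void demo()
    {
        char[]  u8  = "grün";         // UTF-8: 5 bytes for 4 characters
        wchar[] u16 = toUTF16(u8);    // what the win32 W API's want
        dchar[] u32 = toUTF32(u8);    // one element per character
        assert(u32.length == 4);
    }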
 Personally, I have not given this much considering yet, although I
 kind of like the way Java did it by introducing two different kinds of
 streams, byte streams and character stream. More discussion is clearly
 needed.

 Interoperability
 ----------------

 In particular, C often uses 8-bit char arrays to represent
 strings. This causes a problem when all strings are 32-bit
 internally. The most straightforward olution is to convert UTF-32
 char[] to UTF-8 byte[] before a call to a legacy function. This would
 also very elegantly deal with the problem is zero-terminated C
 strings, vs. non-zero terminated D strings (one of the char[]->UTF-8
 conversions functions should create a zero-terminated byte array).

D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.
Dec 15 2003
parent reply Elias Martenson <no spam.spam> writes:
Den Mon, 15 Dec 2003 02:28:01 -0800 skrev Walter:

 In my opinion this data type should be named "char". For UTF-8 and
 UTF-16 strings, one can use the "byte" and "short" data types, which
 would be in keeping with the Unicode standards which (to my knowledge,
 I'd have to look up the exact wording) declare UTF-8 strings as being
 sequences of bytes and 16-bit words respectively, and not
 "characters".

The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)

Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.

The overloading issue is interesting, but may I suggest that char and wchar are at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.

And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.

I was almost going to provide a summary of the issues we're having in C with regards to this, but I don't know if it's necessary, and it's also getting late here (work early tomorrow).
 [ my own comments regarding strings snipped ]

I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?

I'd love to help out and do these things. But two things are needed first:

    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself.

    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a "string"
      really is: is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.
 Streams
 -------

 The current std.stream is not adequate for Unicode. It doesn't seem to
 take encodings into consideration at all but is simply a binary
 interface.

That's correct.

Agreed. And as a binary interface it's very good.
 Strings in the Phobos stream library seems to deal primarily with
 char[] and wchar[]. The most important stream type, dchar[] is not
 even considered. Another problem with the library is that the point as
 which native encodiding<->unicode conversion is performed is not
 defined.

That's correct as well. The library's support for unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)

Yes. But this would then assume that char[] is always in native encoding, which doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as saying that the stream actually performs native decoding to UTF-8 when reading into a char[] array.

Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards, using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases where this is not needed.
 In particular, C often uses 8-bit char arrays to represent
 strings. This causes a problem when all strings are 32-bit
 internally. The most straightforward olution is to convert UTF-32
 char[] to UTF-8 byte[] before a call to a legacy function. This would
 also very elegantly deal with the problem is zero-terminated C
 strings, vs. non-zero terminated D strings (one of the char[]->UTF-8
 conversions functions should create a zero-terminated byte array).

D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.

This can be done in a much better, platform independent way, by using the native<->unicode conversion routines. In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here.

In general, you should be able to open a file by specifying the file name as a dchar[], and the libraries should handle the rest. This goes for all the other methods and functions that accept string parameters.

This of course still depends on what a "string" really is. This really needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first?

Regards

Elias Mårtenson
Dec 15 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <no spam.spam> wrote in message
news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
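Such a wrapper might be sketched like this (illustrative only; std.utf.decode
is the primitive that steps over one UTF-8 sequence, advancing its index
argument, though its exact signature has varied between library versions):

    import std.utf;

    class String
    {
        char[] data;                 // the low level UTF-8 representation

        this(char[] s) { data = s; }

        // Decode from the start on every access: always correct,
        // but O(n) per index.
        dchar opIndex(size_t i)
        {
            size_t pos = 0;
            dchar c = 0;
            for (size_t n = 0; n <= i; n++)
                c = decode(data, pos);   // steps over one UTF-8 sequence
            return c;
        }
    }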
 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' and the funky 'wchar_t' for UTF16 (on win32; UTF32 on linux) in C, so I don't see much of an issue here.
 And here is also the core of the problem: having an array of "char"
 implies to the unwary programmer that the elements in the sequence
 are in fact "characters", and that you should be allowed to do stuff
 like isspace() on them. The fact that the libraries provide such
 function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
 I'd love to help out and do these things. But two things are needed first:
     - At least one other person needs to volunteer.
       I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
     - The core concepts needs to be decided upon. Things seems to be
       somewhat in flux right now, with three different string types
       and all. At the very least it needs to be deicded what a "string"
       really is, is it a UTF-8 byte sequence or a UTF-32 character
       sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
 That's correct as well. The library's support for unicode is inadequate. But
 there also is a nice package (std.utf) which will convert between char[],
 wchar[], and dchar[]. This can be used to convert the text strings into
 whatever unicode stream type the underlying operating system API supports.
 (For win32 this would be UTF-16, I am unsure what linux supports.)

 Yes. But this would then assume that char[] is always in native encoding
 and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte
 sequence. Or, the specification could be read as the stream actually
 performs native decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
 Unless fundamental encoding/decoding is embedded in the streams library,
 it would be best to simply read text data into a byte array and then
 perform native decoding manually afterwards using functions similar
 to the C mbstowcs() and wcstombs(). The drawback to this is that you
 cannot read text data in platform encoding without copying through
 a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
 D is headed that way. The current version of the library I'm working on
 converts the char[] strings in the file name API's to UTF-16 via
 std.utf.toUTF16z(), for use calling the win32 API's.

 This can be done in a much better, platform independent way, by using
 the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue.

I feel that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
 In C, as already mentioned,
 these are called mbstowcs() and wcstombs(). For Windows, these would
 convert to and from UTF-16. For Unix, these would convert to and from
 whatever encoding the application is running under (dictated by the
 LC_CTYPE environment variable). There really is no need to make the
 API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF.

This should wind up making D a far more portable language for internationalization than C/C++ are. (Ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)

UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.
 In general, you should be able to open a file, by specifying the file
 name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].
 This
 goes for all the other methods and functions that accept string
 parameters. This of course still depends on what a "string" really is,
 this really needs to be decided, and I think you are the only one who
 can make that call. Although more discussion on the subject might be
 needed first?

It's been debated here before <g>.
Dec 15 2003
next sibling parent reply Lewis <dethbomb hotmail.com> writes:
Walter wrote:
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 
Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
The overloading issue is interesting, but may I suggest that char and

whcar
are at least renamed to something more appropriate? Maybe utf8byte and
utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
I'd love to help out and do these things. But two things are needed first:
    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
    - The core concepts needs to be decided upon. Things seems to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be deicded what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
That's correct as well. The library's support for unicode is inadequate.


But
there also is a nice package (std.utf) which will convert between


char[],
wchar[], and dchar[]. This can be used to convert the text strings into
whatever unicode stream type the underlying operating system API


supports.
(For win32 this would be UTF-16, I am unsure what linux supports.)

Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs native decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
Unless fundamental encoding/decoding is embedded in the streams library,
it would be best to simply read text data into a byte array and then
perform native decoding manually afterwards using functions similar
to the C mbstowcs() and wcstombs(). The drawback to this is that you
cannot read text data in platform encoding without copying through
a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
D is headed that way. The current version of the library I'm working on
converts the char[] strings in the file name API's to UTF-16 via
std.utf.toUTF16z(), for use calling the win32 API's.

This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
In C, as already mentioned,
these are called mbstowcs() and wcstombs(). For Windows, these would
convert to and from UTF-16. For Unix, these would convert to and from
whatever encoding the application is running under (dictated by the
LC_CTYPE environment variable). There really is no need to make the
API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!) UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].
This
goes for all the other methods and functions that accept string
parameters. This of course still depends on what a "string" really is,
this really needs to be decided, and I think you are the only one who
can make that call. Although more discussion on the subject might be
needed first?

It's been debated here before <g>.

Here's a page I found with some C++ code that may help in creating decoders etc.:

http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html

For windows coding it's easy enough to use COM API's to manipulate and create unicode strings (for UTF-16)?
Dec 15 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Lewis wrote:

 heres a page i found with some c++ code that may help in creating decoders
 etc...
 
 http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html
 
 for windows coding its easy enough to use com api's to manipulate and
 create unicode strings? (for utf16)

IBM has a set of Unicode tools. Last time I googled for them I found them right away, but now I can't. I'll keep looking and post again when I find the link.

Regards

Elias Mårtenson
Dec 16 2003
parent reply uwem <uwem_member pathlink.com> writes:
You mean icu?!

http://oss.software.ibm.com/icu/

Bye
uwe

In article <brmlf3$83b$1 digitaldaemon.com>, Elias Martenson says...
 [ quote snipped ]

 IBM has a set of Unicode tools. Last time I googled for them I found them
 right away, but now I can't. I'll keep looking and post again when I find
 the link.

Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
uwem wrote:

 You mean icu?!
 
 http://oss.software.ibm.com/icu/

Yes, that's it! No wonder I didn't find it; I was searching for "classes for unicode".

Regards

Elias Mårtenson
Dec 16 2003
prev sibling next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brll85$1oko$1 digitaldaemon.com...
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

 In a higher level language, yes. But in doing systems work, one always seems
 to be looking at the lower level elements anyway. I wrestled with this for a
 while, and eventually decided that char[], wchar[], and dchar[] would be low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

Sean
Dec 16 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brmeos$2v9c$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

You're right.
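For completeness, a sketch of the foreach-style alternative, done by hand with
std.utf.decode (names as in the runtime library, to the best of my knowledge;
D could later grow direct support for foreach over decoded characters):

    import std.utf;

    void eachChar(char[] s)
    {
        size_t i = 0;
        while (i < s.length)
        {
            dchar c = decode(s, i);  // decodes one character, advances i
            // ... operate on the whole character c ...
        }
    }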
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brmeos$2v9c$1 digitaldaemon.com...
 
"Walter" <walter digitalmars.com> wrote in message

One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

You're right.

Agreed. Some kind of iterator for strings is desperately needed. May I ask that they be designed in such a way that they are compatible/consistent with other iterators, such as the collections and things like the break iterator (also for strings)?

Regards

Elias Mårtenson
Dec 17 2003
prev sibling next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 
Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax. We have to remember here that most English-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual characters.
 I see your point, but I just can't see making utf8byte into a keyword <g>.
 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much
 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possible to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed first:
    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!

Indeed, but no one else has volunteered yet. :-)
    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.

OK, if that is your decision then you will not see me argue against it. :-)

However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options:

     void log_to_file(char[] str);
     void log_to_file(wchar[] str);
     void log_to_file(dchar[] str);

Which one of these should I use? Should I use all of them? Today, people seem to use the first option, but UTF-8 is horribly inefficient performance-wise.

Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).

Obviously the three different string types need to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings.

Would it be possible to use something like this?

     dchar get_first_char(string str)
     {
         return str[0];
     }

     string str1 = (dchar[])"A UTF-32 string";
     string str2 = (char[])"A UTF-8 string";

     // call the function to demonstrate that the "string"
     // type can be used in declarations
     dchar x = get_first_char(str1);
     dchar y = get_first_char(str2);

I.e. the "string" data type would be a wrapper or supertype for the three different string types.
 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform-specific encoding is determined by the environment variable LC_CTYPE, although the trend is to move towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page dependent.
 They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The D
 runtime library includes routines to convert back and forth between them.
 They could probably be optimized better, but that's another issue. I feel
 that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale
 dependent character sets are pushed off to the side as merely an input or
 output translation nuisance. The core routines all expect UTF strings, and
 so are platform and language independent. I personally think the future is
 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a Unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.

However, Windows defined wchar_t to be a 16-bit Unicode character back in the days when Unicode fit inside 16 bits. This is the same mistake Java made, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually).

My suggestion for the "string" data type will hide all the nitty gritty details with various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

Regards

Elias Mårtenson
Dec 16 2003
next sibling parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
I think Walter once said char had been called 'ascii'. That doesn't sound
all that bad to me. Perhaps we should have the primitive types
'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
I know, but at least then you never will mistake an ascii[] for a utf32[]
(or a utf8[], for that matter).

-Ben

"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
 Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...

Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always


 to be looking at the lower level elements anyway. I wrestled with this


 while, and eventually decided that char[], wchar[], and dchar[] would be


 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual

 I see your point, but I just can't see making utf8byte into a keyword


 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see


 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

I think the library functions should be improved to handle unicode


 But I'm not much of an expert on how to do it right, so it is the way it


 for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed



    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!

Indeed, but no one else volunteered yet. :-)
    - The core concepts needs to be decided upon. Things seems to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be deicded what a "string"
      really is, is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to


 UTF-16, or UTF-32 representations.

OK, if that is your descision then you will not see me argue against it.

 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seems to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

 char[] strings are UTF-8, and as such I don't know what you mean by


 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in


 sand and assert that D char[] strings are NOT locale or code page


 They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The


 runtime library includes routines to convert back and forth between


 They could probably be optimized better, but that's another issue. I


 that by designing D around UTF-8, UTF-16 and UTF-32 the problems with


 dependent character sets are pushed off to the side as merely an input


 output translation nuisance. The core routines all expect UTF strings,


 so are platform and language independent. I personally think the future


 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language


 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from


 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How


 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about


 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allows you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty gritty details with various encodins and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias Mårtenson

Dec 16 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

 I think Walter once said char had been called 'ascii'. That doesn't sound
 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
 I know, but at least then you never will mistake an ascii[] for a utf32[]
 (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large number of English-only programmers will use "ascii" exclusively, and we'll end up with yet another English/Latin-only language. ASCII really has no place in modern computing environments anymore. All operating systems and languages have migrated, or are in the process of migrating, to Unicode.

Regards

Elias Mårtenson
Dec 16 2003
parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brn3tp$t93$1 digitaldaemon.com...
 Ben Hinkle wrote:

 I think Walter once said char had been called 'ascii'. That doesn't


 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar.


 I know, but at least then you never will mistake an ascii[] for a


 (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large amount of english-only programmers will use "ascii" exclusively, and we'll end up with yet another english/latin-only language. ASCII really has no place in modern computing enironments anymore. All oeprating systems and languages has migrated, or ar in the process of migrating, to Unicode.

But ASCII has a place in a practical programming language designed to work with legacy systems and code. If you pass a UTF-8 or UTF-32 format string that isn't ASCII to printf it probably won't print out what you want. That's life.

In terms of encouraging a healthy, happy future... the only thing the D language definition can do is choose what type to use for string literals. I.e., given the declarations

     void foo(ascii[]);
     void foo(utf8[]);
     void foo(utf32[]);

what function does

     foo("bar")

call? Right now it would call foo(utf8[]). You are arguing it should call foo(utf32[]). I am on the fence about what it should call.

Phobos should have routines to handle any encoding - ascii (or just rely on std.c for these), utf8, utf16 and utf32.

-Ben
Dec 16 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

 But ASCII has a place in a practical programming language designed to work
 with legacy system and code. If you pass a utf-8 or utf-32 format string
 that isn't ASCII to printf it probably won't print out what you want. That's
 life.

For legacy code, you should have to take an extra step to make it work. However, it should certainly be possible. Allow me to compare to how Java does it:

     String str = "this is a unicode string";
     byte[] asciiString = str.getBytes("ASCII");

You can also convert it to UTF-8 if you like:

     byte[] utf8String = str.getBytes("UTF-8");

If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak English will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code? I am suffering from these kinds of bugs every day (I speak Swedish natively, but also need to work with Cyrillic) and let me tell you: 99% of all problems I have are caused by bugs similar to this.

Also, I don't think it's a good idea to design a language around a legacy character set (ASCII) which will hopefully be gone in a few years (for newly written programs, that is).
 In terms of encouraging a healthy, happy future... the only thing the D
 language definition can do is choose what type to use for string literals.
 ie, given the declarations
 
  void foo(ascii[]);
  void foo(utf8[]);
  void foo(utf32[]);
 
 what function does
 
  foo("bar")
 
 calll? Right now it would call foo(utf8[]). You are arguing it should call
 utf32[]. I am on the fence about what it should call.

Yes, with Walter's previous posting in mind, I argue that if foo() is overloaded with all three string types, it would call the dchar[] version. If one is not available, it would fall back to the wchar[], and lastly the char[] version. Then again, I also argue that there should be a way of using the supertype, "string", to avoid having to mess with the overloading and transparent string conversions.
 Phobos should have routines to handle any encoding - ascii (or just rely on
 the std.c for these), utf8, utf16 and utf32.

The C standard library has largely migrated away from pure ASCII. It's there for backwards compatibility reasons, but people still tend to use the old functions; that's not the language's fault though, but rather the developers'.

Regards

Elias Mårtenson
Dec 16 2003
parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
 If the "default" string type in D is a simple ASCII string, do you
 honestly think that programmers who only speak english will even bother
 to do the right thing? Do you think they will even know that they are
 writing effectively broken code?

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII. For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc). I said I didn't know what the default type should be, though I'm leaning towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything. -Ben
Dec 16 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

If the "default" string type in D is a simple ASCII string, do you
honestly think that programmers who only speak english will even bother
to do the right thing? Do you think they will even know that they are
writing effectively broken code?

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII.

But for all intents and purposes, ASCII does not exist anymore. It's a legacy character set, and it should certainly not be the "natural" way of dealing with strings. Believe it or not, there are a lot of programmers out there who still believe that ASCII ought to be enough for anybody.
 For example, I think printf should be declared as
 accepting an ascii* format string, not a char* as it is currently declared
 (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.
 I said I didn't know what the default type should
 be, though I'm leaning towards UTF-8 so that casting to ascii[]
 doesn't have
 to reallocate anything.

True. But then again, isn't the intent to try to avoid legacy calls as much as possible? Is it a good idea to set the default in order to accommodate a legacy character set?

You have to remember that UTF-8 is very inefficient. Suppose you have a 10000 character long string, and you want to retrieve the 9000'th character from that string. If the string is UTF-32, that means a single memory lookup. With UTF-8 it could mean anywhere between 9000 and 54000 memory lookups. Now imagine if the string is ten or one hundred times as long...

Now, with the current design, what many people are going to do is:

     char c = str[9000];
     // now play happily(?) with the char "c" that probably isn't the
     // 9000'th character and maybe was a part of a UTF-8 multi byte
     // character

Again, this is a huge problem. The bug will not be evident until some other person (me, for example) tries to use non-ASCII characters. The above broken code may have run through every single test that the developer wrote, simply because he didn't think of putting a non-ASCII character in the string.

This is a real problem, and it desperately needs to be solved. Several solutions have already been presented; the question is just which one of them Walter will support. He already explained in the previous post, but it seems that there are still some things to be said.

Regards

Elias Mårtenson
Dec 16 2003
next sibling parent Elias Martenson <elias-m algonet.se> writes:
Elias Martenson wrote:

     char c = str[9000];

Regards Elias Mårtenson
Dec 16 2003
prev sibling next sibling parent reply "Ben Hinkle" <bhinkle4 juno.com> writes:
      char c = str[8999];
      // now play happily(?) with the char "c" that probably isn't the
      // 9000'th character and maybe was a part of a UTF-8 multi byte
      // character

which was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be

     ascii c = str[8999];

which is completely safe and reasonable. If it was declared as utf8[] then when the user writes

     ubyte c = str[8999];

and they don't have any a-priori knowledge about str, they should feel very nervous, since I completely agree indexing into an arbitrary UTF-8 encoded array is pretty meaningless. Plus in my experience using individual characters isn't that common - I'd say easily 90% of the time a variable is declared as char* or char[] rather than just char.

By the way, I also think any utf8, utf16 and utf32 types should be aliased to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I dunno.

About Java and D: when I program in Java I never worry about the size of a char because Java is very different than C and you have to jump through hoops to call C. But when I program in D I feel like it is an extension of C, like C++. Imagine if C++ decided that char should be 32 bits. That would have been very painful.

-Ben
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Ben Hinkle wrote:

     char c = str[8999];
     // now play happily(?) with the char "c" that probably isn't the
     // 9000'th character and maybe was a part of a UTF-8 multi byte
     // character

which was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be

     ascii c = str[8999];

which is completely safe and reasonable.

No, it would certainly NOT be safe. You must remember that ASCII doesn't exist anymore. It's a legacy character set. It's dead. Gone. Bye bye. And yes, sometimes it's needed for backwards compatibility, but in those cases it should be made explicit that you are throwing away information when converting.
 If it was declared as utf8[] then
 when the user writes
  ubyte c = str[8999]
 and they don't have any a-priori knowledge about str they should feel very
 nervous since I completely agree indexing into an arbitrary utf-8 encoded
 array is pretty meaningless. Plus in my experience using individual
 characters isn't that common - I'd say easily 90% of the time a variable is
 declared as char* or char[] rather than just char.

You are right. Actually it's probably more than 90%. Especially when dealing with Unicode. Very often it's not allowed to split a Unicode string because of composite characters. However, you still need to be able to do individual character classification, such as isspace().
 By the way, I also think any utf8, utf16 and utf32 types should be aliased
 to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I
 dunno.

ASCII has no business in a modern programming language.
 About Java and D: when I program in Java I never worry about the size of a
 char because Java is very different than C and you have to jump through
 hoops to call C. But when I program in D I feel like it is an extension of C
 like C++. Imagine if C++ decided that char should be 32 bits. That would
 have been very painful.

All I was suggesting was a renaming of the types so that it's made explicit what type you have to use in order to be able to hold a single character. In D, this type is called "dchar"; char doesn't cut it. In C on Unix, it's called wchar_t. In C on Windows the type to use is called "int" or "long". And finally in Java, you have to use "int". In all of these languages, "char" is insufficient to hold a character. Don't you think it's logical that the data type that can hold a character is called "char"?

Regards

Elias Mårtenson
Dec 17 2003
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 accepting an ascii* format string, not a char* as it is currently 
 declared
 (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.

None of these alternatives is correct. printf will only work correctly with UTF-8 if the string data is either ASCII or UTF-8 happens to be the current system code page. And ASCII will only work for English systems, which is even worse.

As I said before, the C functions should be passed strings encoded in the current system code page. That way all strings that are written in the system language will be printed perfectly. Also, characters that are not in the code page can be replaced with ? during the conversion, which is better than having printf output garbage.

Hauke
Dec 17 2003
prev sibling parent "Charles" <sanders-consulting comcast.net> writes:
It does sound insane, I like it.  I vote for this.

C

"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:brn1eq$ppk$1 digitaldaemon.com...
 I think Walter once said char had been called 'ascii'. That doesn't sound
 all that bad to me. Perhaps we should have the primitive types
 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
 I know, but at least then you never will mistake an ascii[] for a utf32[]
 (or a utf8[], for that matter).

 -Ben

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 Walter wrote:

 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...

Actually, byte or ubyte doesn't really matter. One is not supposed to
look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always


 to be looking at the lower level elements anyway. I wrestled with this


 while, and eventually decided that char[], wchar[], and dchar[] would



 low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual

 I see your point, but I just can't see making utf8byte into a keyword


 The world has already gotten used to multibyte 'char' in C and the



 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't



 much
 of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
And here is also the core of the problem: having an array of "char"
implies to the unwary programmer that the elements in the sequence
are in fact "characters", and that you should be allowed to do stuff
like isspace() on them. The fact that the libraries provide such
function doesn't help either.

 I think the library functions should be improved to handle unicode.
 But I'm not much of an expert on how to do it right, so it is the way it
 is for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
I'd love to help out and do these things. But two things are needed:

    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself.
You're not by yourself. There's a whole D community here!

Indeed, but no one else volunteered yet. :-)
    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a string
      really is: is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to the UTF-8,
UTF-16, or UTF-32 representations.

OK, if that is your descision then you will not see me argue against it.

 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seem to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page
 dependent. They are UTF-8 strings. If you are reading code page or locale
 dependent strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
 The UTF-8 to UTF-16 conversion is defined and platform independent. The D
 runtime library includes routines to convert back and forth between them.
 They could probably be optimized better, but that's another issue. I believe
 that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale
 dependent character sets are pushed off to the side as merely an input/output
 translation nuisance. The core routines all expect UTF strings, and
 so are platform and language independent. I personally think the future is
 UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from
 UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a Unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.

However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually).

My suggestion for the "string" data type will hide all the nitty gritty details with various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.


 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.

It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias Mårtenson


Dec 16 2003
prev sibling next sibling parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?
|

Sorry to ask, but what do those do? What do they stand for?

-------------------------
Carlos Santander
Dec 16 2003
next sibling parent reply Elias Martenson <no spam.spam> writes:
Den Tue, 16 Dec 2003 13:38:59 -0500 skrev Carlos Santander B.:

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |
 
 Sorry to ask, but what do those do? What do they stand for?

mbstowcs() = multi byte string to wide character string
wcstombs() = wide character string to multi byte string

A multi byte string is a (char *), i.e. the platform encoding. This means that if you are running Unix in a UTF-8 locale (standard these days) then it contains a UTF-8 string. If you are running Unix or Windows with an ISO-8859-1 locale, then it contains ISO-8859-1 data. A wide character string is a (wchar_t *), which is a UTF-32 string on Unix and a UTF-16 string on Windows.

As you can see, the Windows way of using UTF-16 causes the exact same problems as you would suffer when using UTF-8, so working with wchar_t on Windows would be of doubtful use if not for the fact that all Unicode functions in Windows deal with wchar_t. On Unix it's easier, since you know that the full Unicode range fits in a wchar_t.

This is the reason why I have been advocating against the UTF-16 representation in D. It makes little sense compared to UTF-8 and UTF-32.

Regards

Elias Mårtenson
Dec 16 2003
parent "Carlos Santander B." <carlos8294 msn.com> writes:
Thank you both.

"Elias Martenson" <no spam.spam> wrote in message
news:pan.2003.12.16.22.27.42.233945 spam.spam...
| Den Tue, 16 Dec 2003 13:38:59 -0500 skrev Carlos Santander B.:
|
| > "Elias Martenson" <elias-m algonet.se> wrote in message
| > news:brml3p$7hp$1 digitaldaemon.com...
| > | for example). Why not use the same names as are used in C? mbstowcs()
| > | and wcstombs()?
| > |
| >
| > Sorry to ask, but what do those do? What do they stand for?
|
| mbstowcs() = multi byte string to wide character string
| wcstombs() = wide character string to multi byte string
|
| A multi byte string is a (char *), i.e. the platform encoding. This means
| that if you are running Unix in a UTF-8 locale (standard these days) then
| it contains a UTF-8 string. If you are running Unix or Windows with an
| ISO-8859-1 locale, then it contains ISO-8859-1 data.
|
| A wide character string is a (wchar_t *) which is a UTF-32 string on
| Unix, and a UTF-16 string on Windows.
|
| As you can see, the windows way of using UTF-16 causes the exact same
| problems as you would suffer when using UTF-8, so working with wchar_t on
| Windows would be of doubtful use if not for the fact that all Unicode
| functions in Windows deal with wchar_t. On Unix it's easier, since you
| know that the full Unicode range fits in a wchar_t.
|
| This is the reason why I have been advocating against the UTF-16
| representation in D. It makes little sense compared to UTF-8 and UTF-32.
|
| Regards
|
| Elias Mårtenson
|


Dec 16 2003
prev sibling next sibling parent "Julio César Carrascal Urquijo" <adnoctum phreaker.net> writes:
mbstowcs - Multi Byte to Wide Character String
wcstombs - Wide Character String to Multi Byte


Carlos Santander B. <carlos8294 msn.com> escribió en el mensaje de noticias
brnpe0$206i$3 digitaldaemon.com...
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |

 Sorry to ask, but what do those do? What do they stand for?

Dec 16 2003
prev sibling parent reply Andy Friesen <andy ikagames.com> writes:
Carlos Santander B. wrote:
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |
 
 Sorry to ask, but what do those do? What do they stand for?

Ironically enough, your question answers Elias's question quite succinctly. ;) -- andy
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Andy Friesen wrote:

 Carlos Santander B. wrote:
 
 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 | for example). Why not use the same names as are used in C? mbstowcs()
 | and wcstombs()?
 |

 Sorry to ask, but what do those do? What do they stand for?

Ironically enough, you question answers Elias's question quite succinctly. ;)

Dang! How do you americans say? Three strikes, I'm out. :-) Regards Elias Mårtenson
Dec 17 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:brml3p$7hp$1 digitaldaemon.com...
 As for the functions that handle individual characters, the first thing
 that absolutely has to be done is to change them to accept dchar instead
 of char.

Yes.
 However, suppose you are going to write a function that accepts a
 string. Let's call it log_to_file(). How do you declare it? Today, you
 have three different options:

      void log_to_file(char[] str);
      void log_to_file(wchar[] str);
      void log_to_file(dchar[] str);

 Which one of these should I use? Should I use all of them? Today, people
 seem to use the first option, but UTF-8 is horribly inefficient
 performance-wise.

Do it as char[]. Have the internal implementation convert it to whatever format the underlying operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).
 Also, in the case of char and wchar strings, how do I access an
 individual character? Unless I missed something, the only way today is
 to use decode(). This is a fairly common operation which needs a better
 syntax, or people will keep accessing individual elements using the
 array notation (str[n]).

It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.
 Obviously the three different string types needs to be wrapped somehow.
 Either through a class (named "String" perhaps?) or through a keyword
 ("string"?) that is able to encapsulate the different behaviour of the
 three different kinds of strings.

 Would it be possible to use something like this?

      dchar get_first_char(string str)
      {
          return str[0];
      }

      string str1 = (dchar[])"A UTF-32 string";
      string str2 = (char[])"A UTF-8 string";

      // call the function to demonstrate that the "string"
      // type can be used in declarations
      dchar x = get_first_char(str1);
      dchar y = get_first_char(str2);

 I.e. the "string" data type would be a wrapper or supertype for the
 three different string types.

The best thing is to stick with one scheme for a program.
 char[] strings are UTF-8, and as such I don't know what you mean by 'native
 decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.

For char types, yes. But not for UTF-16, and win32 internally is all UTF-16. There are no locale-specific encodings in UTF-16.
 In Unix the platform specific encoding is determined by the environment
 variable LC_CTYPE, although the trend is to be moving towards UTF-8 for
 all locales. We're not quite there yet though. Check out
 http://www.utf-8.org/ for some information about this.

Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.
 If you're talking about win32 code pages, I'm going to draw a line in the
 sand and assert that D char[] strings are NOT locale or code page
 dependent. They are UTF-8 strings. If you are reading code page or locale dependent
 strings, to put them into a char[] will require running it through a
 conversion.

Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)

No, I think D will provide an optional filter for I/O which will translate to/from locale dependent encodings. Wherever possible, the UTF-16 API's will be used to avoid any need for locale dependent encodings.
 Internally, yes. But there needs to be a clear layer where the platform
 encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding.
 Obviously this layer seems to be located in the streams. But we need a
 separate function to do this for byte arrays as well (since there are
 other ways of communicating with the outside world, memory mapped files
 for example). Why not use the same names as are used in C? mbstowcs()
 and wcstombs()?

'cuz I can never remember how they're spelled <g>.
 After wrestling with this issue for some time, I finally realized that
 supporting locale dependent character sets in the core of the language and
 runtime library is a bad idea. The core will support UTF, and locale
 dependent representations will only be supported by translating to/from
 UTF.
 This should wind up making D a far more portable language for
 internationalization than C/C++ are (ever wrestle with tchar.h? How about
 wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about
 #ifdef _UNICODE all over the place? I've done that too much already. No
 thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.
 The Unix standard goes a step
 further and defines wchar_t to be a unicode character. Obviously D goes
 the Unix route here (for dchar), and that is very good.

 However, Windows defined wchar_t to be a 16-bit Unicode character back
 in the days Unicode fit inside 16 bits. This is the same mistake Java
 did, and we have now ended up with having UTF-16 strings internally.

Windows made the right decision given what was known at the time, it was the unicode folks who goofed by not defining unicode right in the first place.
 Indeed. And using UTF-8 internally is not a bad idea. The problem is
 that we're also allowed to use UTF-16 and UTF-32 as internal encoding,
 and if this is to remain, it needs to be abstracted away somehow.
In general, you should be able to open a file, by specifying the file
name as a dchar[], and then the libraries should handle the rest.


Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

It already does that for string literals. I've thought about implicit conversions for runtime strings, but sometimes trouble results from too many implicit conversions, so I'm hanging back a bit on this to see how things evolve.
Dec 16 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brnurb$2bc5$1 digitaldaemon.com...
 Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
 stems from the fact that according to the C standard wchar_t is not
 Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.

It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too), doing the operation, then biasing it again (un-biasing it). If all else fails, you can promote it. How often is this important anyway? If it's crucial, it's worth the time to emulate the sign if you have to. It is no good to run fast if the wrong results are generated. It's just a portability landmine, waiting for the unwary programmer, and shame on whoever let it get into a so-called "standard".
 The Unix standard goes a step
 further and defines wchar_t to be a unicode character. Obviously D goes
 the Unix route here (for dchar), and that is very good.

 However, Windows defined wchar_t to be a 16-bit Unicode character back
 in the days Unicode fit inside 16 bits. This is the same mistake Java
 did, and we have now ended up with having UTF-16 strings internally.

 Windows made the right decision given what was known at the time, it was the
 unicode folks who goofed by not defining unicode right in the first place.

I still don't understand why they couldn't have packed all the languages that actually get used into the lowest 16 bits, and put all the crud like box-drawing characters and visible control codes and byzantine musical notes and runes and Aleutian indian that won't fit into the next 16 pages. There's lots of gaps in the first 65536 anyway. And probably plenty of overlap, duplicated symbols (lots of languages have the same characters, especially latin-based ones). Hell they should probably have done away with accented characters being distinct characters and enforced a combining rule from the start. But the Unicode standards body wanted to please the typesetters, as opposed to giving the world a computer encoding that would actually be usable as a common text-storage and processing medium. This thread shows just how convoluted Unicode really is. I think someone can (and probably will) do better. Unfortunately I also believe that such an effort is doomed to failure. Sean
Dec 17 2003
next sibling parent Elias Martenson <elias-m algonet.se> writes:
Sean L. Palmer wrote:

 It's stupid to not agree on a standard size for char, since it's easy to
 "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too),
 doing the operation, then biasing it again (un-biasing it).  If all else
 fails, you can promote it.  How often is this important anyway?  If it's
 crucial, it's worth the time to emulate the sign if you have to.  It is no
 good to run fast if the wrong results are generated.  It's just a
 portability landmine, waiting for the unwary programmer, and shame on
 whoever let it get into a so-called "standard".

C doesn't define any standard sizes at all (well, you do have stdint.h these days). This is both a curse and a blessing. More often than not, it's a curse though.
 I still don't understand why they couldn't have packed all the languages
 that actually get used into the lowest 16 bits, and put all the crud like
 box-drawing characters and visible control codes and byzantine musical notes
 and runes and Aleutian indian that won't fit into the next 16 pages.
 There's lots of gaps in the first 65536 anyway.  And probably plenty of
 overlap, duplicated symbols (lots of languages have the same characters,
 especially latin-based ones).  Hell they should probably have done away with
 accented characters being distinct characters and enforced a combining rule
 from the start.  But the Unicode standards body wanted to please the
 typesetters, as opposed to giving the world a computer encoding that would
 actually be usable as a common text-storage and processing medium.  This
 thread shows just how convoluted Unicode really is.
 
 I think someone can (and probably will) do better.  Unfortunately I also
 believe that such an effort is doomed to failure.

Agreed. Unicode has a lot of cruft. One of my favourite pet peeves is the pair of characters:

    00C5 Å: LATIN CAPITAL LETTER A WITH RING ABOVE
    212B Å: ANGSTROM SIGN

The comment even says that the preferred representation is the latin Å. But, like you say, trying to do it once again will not succeed. It has taken us 10 or so years to get where we are. I'd say we accept Unicode for what it is. It's a hell of a lot better than the previous mess.

Regards

Elias Mårtenson
Dec 17 2003
prev sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
Sorry, "sign" of char.

"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brp52o$1gc6$1 digitaldaemon.com...
 It's stupid to not agree on a standard size for char, since it's easy to
 "fix" the sign of a char register by biasing it by 128 (xor 0x80 works

Dec 17 2003
prev sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Elias Martenson" <elias-m algonet.se> wrote in message
 news:brml3p$7hp$1 digitaldaemon.com...
 
As for the functions that handle individual characters, the first thing
that absolutely has to be done is to change them to accept dchar instead
of char.

Yes.

Good, I like it. :-)
Which one of these should I use? Should I use all of them? Today, people
seems to use the first option, but UTF-8 is horribly inefficient
performance-wise.

Do it as char[]. Have the internal implementation convert it to whatever format the underling operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).

Memory-wise perhaps. But for everything else UTF-8 is always slower. Consider what happens when the program is used with Russian? Every single character will need special decoding, except punctuation of course. Now think about Chinese and Japanese. These are even worse.
Also, in the case of char and wchar strings, how do I access an
individual character? Unless I missed something, the only way today is
to use decode(). This is a fairly common operation which needs a better
syntax, or people will keep accessing individual elements using the
array notation (str[n]).

It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.

Indeed. But they will be slow. Now, personally I can accept the slowness. Again, it's your call. What we do need to make sure is that the string/character handling package that we build is comprehensive in terms of Unicode support, and also that every single string handling function handles UTF-32 as well as UTF-8. This way a developer who is having performance problems with the default UTF-8 strings can easily change his hotspots to work with UTF-32 instead.
I.e. the "string" data type would be a wrapper or supertype for the
three different string types.

The best thing is to stick with one scheme for a program.

Unless the developer is bitten by the poor performance of UTF-8 that is. A package with perl-like functionality would be horribly slow if using UTF-8 rather than UTF-32. If we are to stick with UTF-8 as default internal string format, UTF-32 must be available as an option, and it must be easy to use.
 For char types, yes. But not for UTF-16, and win32 internally is all UTF-16.
 There are no locale-specific encodings in UTF-16.

True. But I can't see any use for UTF-16 outside communicating with external windows libraries. UTF-16 really is the worst of both worlds compared to UTF-8 and UTF-32. UTF-16 should really be considered the "native encoding" and left at that. Just like [the content of LC_CTYPE] is the native encoding when run in Unix. The developer should be shielded from the native encoding in that he should be able to say: "convert my string to the encoding my operating system wants (i.e. the native encoding)". As it happens, this is what wcstombs() does.
In Unix the platform specific encoding is determined by the environment
variable LC_CTYPE, although the trend is to be moving towards UTF-8 for
all locales. We're not quite there yet though. Check out
http://www.utf-8.org/ for some information about this.

Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.

Agreed. I am heavily lobbying for proper Unicode support everywhere. I've been bitten by too many broken applications. However, Windows has decided on UTF-16. Unix has decided on UTF-8. We need a way of transparently inputting and outputting strings so that they are converted to whatever encoding the host operating system uses. If we don't do this we are going to end up with a lot of conditional code that checks which OS (and encoding) is being used.
 No, I think D will provide an optional filter for I/O which will translate
 to/from locale dependent encodings. Wherever possible, the UTF-16 API's will
 be used to avoid any need for locale dependent encodings.

Why UTF-16? There is no need to involve platform specifics at this level. Remember that UTF-16 can be considered platform specific for Windows.
 'cuz I can never remember how they're spelled <g>.

All right... So how about adding to the utf8 package some functions called... Hmm... nativeToUTF8(), nativeToUTF32() and then an overloaded function utfToNative() (which accepts char[], wchar[] and dchar[])? "native" in this case would be a byte[] or ubyte[] to point out that this form is not supposed to be used in the program.
Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
stems from the fact that according to the C standard wchar_t is not
Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.

Indeed. That's why the Unix standard went a bit further and specified a wchar_t to be a Unicode character. The problem is with Windows, where wchar_t is 16-bit and thus cannot hold a Unicode character. And thus we end up with the current situation where using wchar_t in Windows really doesn't buy you anything, because you have the same problems as you would with UTF-8. You still cannot assume that a wchar_t can hold a single character. You still need all the funky iterators and decoding stuff to be able to extract individual characters. This is why I'm saying that the UTF-16 in Windows is horrible, and that UTF-16 is the worst of both worlds.
 Windows made the right decision given what was known at the time, it was the
 unicode folks who goofed by not defining unicode right in the first place.

I agree 100%. Java is in the same boat. How many people know that from JDK 1.5 and onwards it's a bad idea to use String.charAt()? (in JDK 1.5 the internal representation for String will change from UCS-2 to UTF-16). In other words, the exact same problem Windows faced. The Unicode people argue that they never guaranteed that it was a 16-bit character set, and while this is technically true, they are really trying to cover up their mess.
 It already does that for string literals. I've thought about implicit
 conversions for runtime strings, but sometimes trouble results from too many
 implicit conversions, so I'm hanging back a bit on this to see how things
 evolve.

True. We suffer from this in C++ (costly implicit conversions) and it would be nice to be able to avoid this. Regards Elias Mårtenson
Dec 17 2003
parent "Walter" <walter digitalmars.com> writes:
I think we're mostly in agreement!
Dec 17 2003
prev sibling next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.
I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of English-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.

Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.

So how about the following proposal:

- char is a 32 bit Unicode character
- wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions
- charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions

UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?

With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.

Hauke
Dec 16 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Hauke Duden wrote:

 I see your point, but I just can't see making utf8byte into a keyword 
 <g>.
 The world has already gotten used to multibyte 'char' in C and the funky
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't 
 see much
 of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree. You are better at explaining these things than I am. :-)
 I agree with Elias that the "char" type should be 32 bit, so that people 
 who simply use a char array as a string, as they have done for years in 
 other languages, will actually get the behaviour they expect, without 
 losing the Unicode support.

Indeed. In many cases existing code would actually continue working, since char[] would still declare a string. It wouldn't work when calling legacy libraries, though, but such calls don't work as-is anyway because of the zero-termination issue.
 Btw: this could also be used to solve the "oops, I forgot to make the 
 string null-terminated" problem when interacting with C functions. If 
 the D char is a different type than the old C char (which could be 
 called char_c or charz instead) then people will automatically be 
 reminded that they need to convert them.

Exactly.
 So how about the following proposal:
 
 - char is a 32 bit Unicode character
 - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 
 or 32 bits (depending on the system), provided for interoperability with 
 C functions
 - charz (or char_c? c_char?) is a normal 8 bit C character, also 
 provided for interoperability with C functions
 
 UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This 
 would at the same time remind users that the elements are NOT characters 
 but simply a bunch of binary data. I don't see the need to define a new 
 type for these - there are a lot of encodings out there, so why treat 
 UTF-8 and UTF-16 specially?
 
 With this system it would be instantly obvious that D strings are 
 Unicode. Interacting with legacy C code is still possible, and 
 accidentally passing a wrong (e.g. UTF-8) string to a C function that 
 expects ASCII or Latin-1 is impossible. Also, pure D code will 
 automatically be UTF-32, which is exactly what you need if you want to 
 make the lives of newbies easier. Otherwise people WILL end up using 
 ASCII strings when they start out.

We have to keep in mind that in most cases, when you call a legacy C function accepting (char *), the correct thing is to pass in a UTF-8 encoded string. The number of functions which actually fail when doing so is quite small.

What I'm saying here is that there are actually few "C function[s] that expect ASCII or Latin-1". Most of them accept a (char *) and work on it as if it were a byte array. Compare this to my (and your) suggestion of using byte[] (or ubyte[]) for UTF-8 strings.

Regards

Elias Mårtenson
Dec 16 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 We have to keep in mind that in most cases, when you call a legacy C 
 function accepting (char *), the correct thing is to pass in a UTF-8 
 encoded string. The number of functions which actually fail when doing 
 so is quite small.

They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only. printf will print garbage if you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...

Pretty much the only thing I can think of that will work correctly under all circumstances are simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

IMHO, the safest way to call C functions is to pass them strings encoded using the current system code page, because that's what the CRT expects a char array to be. Since the code page is different from system to system this makes a runtime conversion pretty much inevitable, but there's no way around that if you want Unicode support.

Hauke
Dec 16 2003
parent Elias Martenson <elias-m algonet.se> writes:
Hauke Duden wrote:

 They are not quite as few as one may think. For example, if you pass a 
 UTF-8 string to fopen then it will only work correctly if the filename 
 is made up of ASCII characters only.

Depends on the OS. Unix handles it perfectly.
 printf will print garbage if
 you pass it a UTF-8 character. If you use scanf to read a string from 
 stdin then the returned string will not be UTF-8, so you have to deal 
 with that. The is-functions (isalpha, etc.) will not work correctly for 
 all characters. toupper, tolower, etc. are not able to work with 
 non-ASCII characters. The list goes on...

Exactly. But the number of functions that do these things is still pretty small, compared to the total number of functions accepting strings. Take a look at your own code and try to classify the functions as UTF-8 safe or not. I think you'll be surprised.
 Pretty much the only thing I can think of that will work correctly under 
 all circumstances are simple C functions that pass strings through 
 unmodified (if they modify them they might slice them in the middle of a 
 UTF-8 sequence).

And, believe it or not, this is the major part of all such functions. But the discussion is really irrelevant, since we both agree that it is inherently unsafe.

Regards

Elias Mårtenson
Dec 16 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:brnas5$1940$1 digitaldaemon.com...
 Walter wrote:
 I see your point, but I just can't see making utf8byte into a keyword <g>. 
 The world has already gotten used to multibyte 'char' in C and the funky 
 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much 
 of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all.

Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.
 A lot of english-speaking programmers
 simply treat chars as ASCII characters, even if there's some comment
 somewhere stating that the data should be UTF-8.

True, but code doesn't have to be changed much to allow for UTF-8. For example, D source text is UTF-8, and supporting that required little change in the D front end, and none in the back end. Trying to use UTF-32 internally to support this would have been a disaster.
 I agree with Elias that the "char" type should be 32 bit, so that people
 who simply use a char array as a string, as they have done for years in
 other languages, will actually get the behaviour they expect, without
 losing the Unicode support.

Other problems are introduced with that for the naive programmer who expects it to work just like ascii. For example, many people don't bother multiplying by sizeof(char) when allocating storage for char arrays. chars and 'bytes' in C are used willy-nilly interchangeably. Direct manipulation of chars (without going through ctype.h) is common for converting lower case to upper case. Etc. The nice thing about UTF-8 is it does work just like ascii when you're dealing with ascii data.
 Btw: this could also be used to solve the "oops, I forgot to make the
 string null-terminated" problem when interacting with C functions. If
 the D char is a different type than the old C char (which could be
 called char_c or charz instead) then people will automatically be
 reminded that they need to convert them.

 So how about the following proposal:

 - char is a 32 bit Unicode character

Already have that, it's 'dchar' <g>. There is nothing in D that prevents a programmer from using dchar's for his character handling chores.
 - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16
 or 32 bits (depending on the system), provided for interoperability with
 C functions

I've dealt with porting large projects between win32 and linux and the change in wchar_t size from 16 to 32. I've come to believe that method is a mistake, hence wchar and dchar in D. (One of the wretched problems is one cannot intermingle printf and wprintf to stdout in C.)
 - charz (or char_c? c_char?) is a normal 8 bit C character, also
 provided for interoperability with C functions

I agree that the 0 termination is an issue when calling C functions. I think this issue will fade, however, as the D libraries get more comprehensive. Another problem with 'normal' C chars is the confusion about whether they are signed or unsigned. The D char type is unsigned, period <g>.
 UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This
 would at the same time remind users that the elements are NOT characters
 but simply a bunch of binary data. I don't see the need to define a new
 type for these - there are a lot of encodings out there, so why treat
 UTF-8 and UTF-16 specially?

Treating UTF-8 and UTF-16 specially in D has great advantages in making the internal workings of the compiler and runtime library consistent. (No more problems mixing printf and wprintf!) I'm convinced that UTF is becoming the lingua franca of computing, and the other encodings will be relegated to sideshow status.
 With this system it would be instantly obvious that D strings are
 Unicode. Interacting with legacy C code is still possible, and
 accidentally passing a wrong (e.g. UTF-8) string to a C function that
 expects ASCII or Latin-1 is impossible.

Windows NT, 2000, XP, and onwards are internally all UTF-16. Any win32 API functions that accept 8 bit chars will immediately convert them to UTF-16. wchar_t's under win32 are UTF-16 encodings (including the 2 word encodings of UTF-16). Linux is internally UTF-8, if I'm not mistaken. This means D code will feel right at home with linux. Under win32, I plan on fixing all the runtime library functions to convert UTF-8 to UTF-16 internally and use the win32 API UTF-16 functions.

Hence, UTF is where the operating systems are going, and D is looking forward to mapping cleanly onto that. I believe that following the C approach of code pages, signed/unsigned char confusion, varying wchar_t sizes, etc., is rapidly becoming obsolete.
 Also, pure D code will
 automatically be UTF-32, which is exactly what you need if you want to
 make the lives of newbies easier. Otherwise people WILL end up using
 ASCII strings when they start out.

Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux, which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially on linux, with wchar_t being UTF-32, it really hogged the memory.
Dec 16 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
This is simply not true, Walter. The world has not gotten used to
multibyte chars in C at all.

Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.

Right, it has been around for decades. And people still don't use it properly. Don't make that same mistake again! I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.
Also, pure D code will
automatically be UTF-32, which is exactly what you need if you want to
make the lives of newbies easier. Otherwise people WILL end up using
ASCII strings when they start out.

Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially on linux, with wchar_t being UTF-32, it really hogged the memory.

Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-4 bytes per character. So you may end up using even more memory with UTF-8.

And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory, then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character.

Also, how much text did your "bad experience" application use? It seems to me that even if you assume the best case for UTF-8 (e.g. one byte per character), the memory overhead should not be much of an issue. It's only a factor of 4, after all. So assuming that your application uses 100,000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU's cache!

I think it's more important to have proper localization ability and programming ease than trying to conserve a few bytes for a limited group of people (i.e. English speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few years ago!

Please also keep in mind that a factor of 4 will be compensated for by memory enhancements in only 1-2 years' time. Most people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?

Hauke
Dec 17 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:brpvmn$2o0t$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

UTF-8 has some nice advantages over other multibyte encodings: it is possible to find the start of a sequence without backing up to the beginning, none of the bytes in a multibyte sequence have bit 7 clear (so they never conflict with ascii), and no additional information like code pages is necessary to decode them.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

That's correct. And D supports UTF-32 programming if that works better for the particular application.
 And about computing complexity: if you ignore the overhead introduced by
 having to move more (or sometimes less) memory then manipulating UTF-32
 strings is a LOT faster than UTF-8. Simply because random access is
 possible and you do not have to perform an expensive decode operation on
 each character.

Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.
 Also, how much text did your "bad experience" application use?

Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much later.
 It seems
 to me that even if you assume best-case for UTF-8 (e.g. one byte per
 character) then the memory overhead should not be much of an issue. It's
 only factor 4, after all.

It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)
 I think it's more important to have proper localization ability and
 programming ease than trying to conserve a few bytes for a limited group
 of people (i.e. english speakers). Being greedy with memory consumption
 when making long-term design decisions has always caused problems. For
 instance, it caused that major Y2K panic in the industry a few years ago!

You have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants to build his app around char, wchar, or dchar's. (None of my programs dating back to the 70's had any Y2K bugs in them <g>)
 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seem pretty quaint now <g>.
 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.
Dec 17 2003
next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brqr8e$vmh$1 digitaldaemon.com...
 "Hauke Duden" <H.NS.Duden gmx.net> wrote in message
 news:brpvmn$2o0t$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

 UTF-8 has some nice advantages over other multibyte encodings in that it is 
 possible to find the start of a sequence without backing up to the 
 beginning, none of the multibyte encodings have bit 7 clear (so they never 
 conflict with ascii), and no additional information like code pages are 
 necessary to decode them.

But with UTF-32, this is not an issue at all.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

That's correct. And D supports UTF-32 programming if that works better for the particular application.

Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.
 And about computing complexity: if you ignore the overhead introduced by
 having to move more (or sometimes less) memory then manipulating UTF-32
 strings is a LOT faster than UTF-8. Simply because random access is
 possible and you do not have to perform an expensive decode operation on
 each character.

 Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and 
 away most operations on strings were copying them, storing them, hashing 
 them, etc.

If that is correct, it might be just as correct, and even faster, to treat it as binary data in most cases. No need to have that data represented as String at all times.
 Also, how much text did your "bad experience" application use?

Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much later.

I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional chinese texts, mixed with a lot of different languages.
 It seems
 to me that even if you assume best-case for UTF-8 (e.g. one byte per
 character) then the memory overhead should not be much of an issue. It's
 only factor 4, after all.

It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)

I agree with you, speed is important. But if what you are serving is 8-bit .html files (latin language), why not treat the data as unsigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case. The definition of the language is what people are interested in at this point. What "dirty" tricks you use in the implementation to make it faster (right now, in some special cases, with a limited set of language data) is less interesting.
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)

I think this is a brilliant observation. I had not thought much about this. But I think my thought from above is still correct: why should the data for this special case be String at all? A good server software writer could obtain the ultimate speed by using unsigned bytes. That would give ultimate speed when necessary, and generally applicable String handling for all spoken languages would be enforced for String at the same time.
 I think it's more important to have proper localization ability and
 programming ease than trying to conserve a few bytes for a limited group
 of people (i.e. english speakers). Being greedy with memory consumption
 when making long-term design decisions has always caused problems. For
 instance, it caused that major Y2K panic in the industry a few years ago!

 You have a valid point, but things are always a tradeoff. D offers the
 flexibility of allowing the programmer to choose whether he wants to build
 his app around char, wchar, or dchar's.

With all due respect, I believe you are trading off in the wrong direction. Because you have a personal interest in good performance (which is good) you seem not to want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were Chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about.

In the performance trail of thought: do we all agree that the general String _manipulation_ handling in all programs will perform much better if choosing UTF-32 over UTF-8, when considering that the natural language data of the program would be traditional Chinese?

Another one: if UTF-32 were the base type of String, would it be applicable to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation? This should take care of most of the "data size"/thrashing related arguments...
 (None of my programs dating back to the 70's had any Y2K bugs in them <g>)

 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seem pretty quaint now <g>.

 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.

The option to do so is the problem. Programmers from a Latin-letter-using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

Thanks to all who took the time to read my take on these issues.

Regards, Roald
Dec 18 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:brsfkq$dpl$1 digitaldaemon.com...
 D programmers can use dchars if they want to.

The option to do so is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D. Thanks to all who took the time to read my take on these issues.

You raise some good points. This issue should not be treated too lightly.

It should be possible to work with text as bytes (for performance when interfacing with legacy non-Unicode strings), but that should definitely not be the preferred way. I think that there should be no char or wchar, and that dchar should be renamed char. That way, if you see byte[] in the code you won't be tempted to think of it as a string but more like raw data. UTF-8 can be well represented by byte[], and if you want to work directly with UTF-8, you can use a wrapper class from the D standard library.

Sean
Dec 18 2003
parent reply Lewis <dethbomb hotmail.com> writes:
 I think that there should be no char or wchar, and that dchar should be
 renamed char.  

Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?

Regards
Dec 18 2003
parent Elias Martenson <no spam.spam> writes:
On Thu, 18 Dec 2003 15:27:02 -0500, Lewis wrote:

 
 I think that there should be no char or wchar, and that dchar should be
 renamed char.  

 Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?

Most likely ushort[].

Regards

Elias Mårtenson
Dec 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:brsfkq$dpl$1 digitaldaemon.com...
 Yes, but that statement does not stop clueless/lazy programmers from
 using chars in libraries/programs where UTF-32 should have been used.

I can't really stop clueless/lazy programmers from writing bad code <g>.
 I think the profiling might have shown very different numbers if the
 native language of the profiling crew/test files were traditional
 chinese texts, mixed with a lot of different languages.

If an app is going to process primarily Chinese, it will probably be more efficient using dchar[]. If an app is going to process primarily English, then char[] is the right choice. The server app I wrote was for use primarily by American and European companies. It had to handle Chinese, but far and away the bulk of the data it needed to process was plain old ASCII. D doesn't force such a choice on the app programmer - he can pick char[], wchar[] or dchar[] to match the probability of the bulk of the text it will be dealing with.
 I agree with you, speed is important. But if what you are serving
 is 8-bit .html files (latin language), why not treat the data as
 usigned bytes? You are describing the "special case" as the
 explanation of why UTF-32 should not be the general case.

For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.
 You have a valid point, but things are always a tradeoff. D offers the 
 flexibility of allowing the programmer to choose whether he wants to build 
 his app around char, wchar, or dchar's.


 Because you have a personal interest in good performance (which is good)
 you seem to not want to consider the more general cases as being the
 general ones. I propose (as an experiment) that you try to think "what
 would I do if I were a chinese?" each time you want to make a tradeoff
 on string handling. This is what good design is all about.

I assume that a Chinese programmer writing Chinese apps would prefer to use dchar[]. And that is fully supported by D, so perhaps I am misunderstanding what our disagreement is about.
 In the performance trail of thought: do we all agree that the general
 String _manipulation_ handling in all programs will perform much better
 if choosing UTF-32 over UTF-8, when considering that the natural
 language data of the program would be traditional chinese?

Sure. But if the data the program will see is not chinese, then performance will suffer. As a language designer, I cannot determine what data the programmer will see, so D provides char[], wchar[] and dchar[] and the programmer can make the choice based on the data for his app.
 Another one: If UTF-32 were the base type of String, would it be
 applicable to have a "Compressed" attribute on each String? That way
 it could have as small as possible i/o, storage and memcpy size most
 of the time, and could be uncompressed for manipulation? This should
 take care of most of the "data size"/trashing related arguments...

An intriguing idea, but I am not convinced it would be superior to UTF-8. Data compression is relatively slow.
 D programmers can use dchars if they want to.

The option to do so is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

D is not going to force one to write internationalized apps, just make it easy to write them if the programmer cares about it. As opposed to C where it is rather difficult to write internationalized apps, so few bother.
 Thanks to all who took the time to read my take on these issues.

It's a fun discussion!
Dec 18 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Walter wrote:

 "Roald Ribe" <rr.no spam.teikom.no> wrote in message
 news:brsfkq$dpl$1 digitaldaemon.com...
 
Yes, but that statement does not stop clueless/lazy programmers from
using chars in libraries/programs where UTF-32 should have been used.

I can't really stop clueless/lazy programmers from writing bad code <g>.

But it is possible to make it harder to do so. I believe that is what this discussion is all about.
I think the profiling might have shown very different numbers if the
native language of the profiling crew/test files were traditional
chinese texts, mixed with a lot of different languages.

If an app is going to process primarily Chinese, it will probably be more efficient using dchar[]. If an app is going to process primarily English, then char[] is the right choice. The server app I wrote was for use primarily by American and European companies. It had to handle Chinese, but far and away the bulk of the data it needed to process was plain old ASCII.

I don't think most programmers (at the time of writing the code) are aware that their application is going to be used outside the local region. An example is the current project I'm working on: the old application that our new one is designed to replace is already exported throughout the world. Even though that is the case, when I came into the project there was absolutely zero understanding that we needed to support anything other than ISO-8859-1. As a result, we have lost a lot of time rewriting parts of the system. Now, I agree that the current D way would have made it a lot easier, but it could be even easier.
 D doesn't force such a choice on the app programmer - he can pick char[],
 wchar[] or dchar[] to match the probability of the bulk of the text it will
 be dealing with.

In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar to char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively; it just feels wrong to have an array of "char" which isn't really an array of characters.
 For overloading reasons. I never liked the C way of conflating chars with
 bytes. Having a utf type separate from a byte type enables more reasonable
 ways of handling things like string literals.

Right, I can see your reasoning, but does the type _really_ have to be named "char"?
 I assume that a chinese programmer writing chinese apps would prefer to use
 dchar[]. And that is fully supported by D, so I am misunderstanding what our
 disagreement is about.

Possibly, but in today's world it's not unusual that an application is developed in Europe but used in China, or developed in India but used in New Zealand.

Regards

Elias Mårtenson
Dec 19 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:bruen1$f05$1 digitaldaemon.com...
 I don't think most programmers (at the time of writing the code) is
 aware of the fact that his application is going to be used outside the
 local region.

Probably true.
 D doesn't force such a choice on the app programmer - he can pick char[],
 wchar[] or dchar[] to match the probability of the bulk of the text it will
 be dealing with.

In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar into char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively, it just feels wrong with an array of "char" which isn't really an array of characters.

Even despite the fact that in C and C++, char is byte-sized, it would probably be preferable to just rename "char" to "bchar" and "dchar" to "char". This corresponds to byte and int, but then wchar seems out of place, since in D there is short and ushort but not word. "schar" sounds like "signed char" and I believe we should stay away from that. What to do, what to do?
 For overloading reasons. I never liked the C way of conflating chars with
 bytes. Having a utf type separate from a byte type enables more reasonable
 ways of handling things like string literals.

Right, I can see your reasoning, but does the type _really_ have to be named "char"?

Good point. But there is the backward compatibility thing, which kind of sucks. It would subtly break any C app ported to D that allocated memory using malloc(N) and then stored an N-character string into it.
 I assume that a chinese programmer writing chinese apps would prefer to use
 dchar[]. And that is fully supported by D, so I am misunderstanding what our
 disagreement is about.

Possibly, but in todays world it's not unusual that an application is developed in europe but used in china, or developed in india but used in new zealand.

It will still work, but won't be as efficient as it could be. Sean
Dec 19 2003
parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
There has been a lot of talk about doing things, but very little has
actually happened. Consequently, I have made a string interface and two
rough and ready string classes for UTF-8 and UTF-32, which are attached to
this message.

Currently they only do a few things, one of which is to provide a consistent
interface for character manipulation. The UTF-8 class also provides direct
access to the bytes for when the user can do things more efficiently with
these. They can also be appended to each other. In addition, each provides a
constructor taking the other one as a parameter.

Please bear in mind that I am only an amateur programmer, who knows very
little about Unicode and has no experience of programming in the real world.
Nevertheless, I can appreciate some of the issues here and I hope that these
classes can be the foundation of something more useful.

From,

Rupert
Dec 19 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
Cool beans!  Thanks, Rupert!

This brings up a point.  The main reason that I do not like opAssign/opAdd
syntax for operator overloading is that it is not self-documenting that
opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
opCatAssign corresponds to a ~= b.  This information either has to be
present in a comment or you have to go look it up.  Yeah, D gurus will have
it memorized, but I'd rather there be just one "name" for the function, and
it should be the same both in the definition and at the point of call.

Sean

"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert

Dec 19 2003
parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
I agree with you, but we just have to grin and bear it, unless / until
Walter changes his mind. I suppose I could have commented my code better
though. Hopefully as I become more experienced, I will be a better judge of
these things.

"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert


Dec 19 2003
parent reply "Walter" <walter digitalmars.com> writes:
The problem with the operator* or operator~ syntax is that it is ambiguous. It's
also not greppable.

"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:brvr60$2il5$1 digitaldaemon.com...
 I agree with you, but we just have to grin and bear it, unless / until
 Walter changes his mind. I suppose I could have commented my code better
 though. Hopefully as I become more experienced, I will be a better judge of
 these things.

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvghd$21n8$2 digitaldaemon.com...
 There has been a lot of talk about doing things, but very little has
 actually happened. Consequently, I have made a string interface and two
 rough and ready string classes for UTF-8 and UTF-32, which are attached to
 this message.

 Currently they only do a few things, one of which is to provide a consistent
 interface for character manipulation. The UTF-8 class also provides direct
 access to the bytes for when the user can do things more efficiently with
 these. They can also be appended to each other. In addition, each provides a
 constructor taking the other one as a parameter.

 Please bear in mind that I am only an amateur programmer, who knows very
 little about Unicode and has no experience of programming in the real world.
 Nevertheless, I can appreciate some of the issues here and I hope that these
 classes can be the foundation of something more useful.

 From,

 Rupert


Dec 19 2003
next sibling parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
If you say it's ambiguous, I'll take your word for it and if you think being
greppable is important, I'm also happy to accept that. My personal opinions
are not all that strong - it's only a minor inconvenience to have to check
the overload function names.

More importantly, what do you think of my request for more opSlice
overloads?

From,

Rupert

"Walter" <walter digitalmars.com> wrote in message
news:bs08b8$527$2 digitaldaemon.com...
 The problem with the operator* or operator~ syntax is that it is ambiguous. It's
 also not greppable.

 "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
 news:brvr60$2il5$1 digitaldaemon.com...
 I agree with you, but we just have to grin and bear it, unless / until
 Walter changes his mind. I suppose I could have commented my code better
 though. Hopefully as I become more experienced, I will be a better judge of
 these things.

 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brvlj9$29qh$1 digitaldaemon.com...
 Cool beans!  Thanks, Rupert!

 This brings up a point.  The main reason that I do not like opAssign/opAdd
 syntax for operator overloading is that it is not self-documenting that
 opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that
 opCatAssign corresponds to a ~= b.  This information either has to be
 present in a comment or you have to go look it up.  Yeah, D gurus will have
 it memorized, but I'd rather there be just one "name" for the function, and
 it should be the same both in the definition and at the point of call.

 Sean



Dec 20 2003
parent "Walter" <walter digitalmars.com> writes:
"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message
news:bs1d9b$2033$1 digitaldaemon.com...
 More importantly, what do you think of my request for more opSlice
 overloads?

I haven't got that far yet!
Dec 20 2003
prev sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
It would be greppable if it were required that there be no space between the
operator and the symbol.  (if you use regexp you can get around this)

There should be some other way to embed the symbol into the identifier, if
it's causing too many lexer problems.

Sean

"Walter" <walter digitalmars.com> wrote in message
news:bs08b8$527$2 digitaldaemon.com...
 The problem with the operator* or operator~ syntax is that it is ambiguous. It's
 also not greppable.

Dec 20 2003
prev sibling next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brqr8e$vmh$1 digitaldaemon.com...
 Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and
 away most operations on strings were copying them, storing them, hashing
 them, etc.

That is my experience as well. Either that or it's parsing them more or less linearly.
 Please also keep in mind that a factor 4 will be compensated by memory
 enhancements in only 1-2 years time.

 I don't agree that memory is improving that fast. Even if it is, people
 will load them up with more data to fill the memory up. I will agree that
 code size is no longer that relevant, but data size is still pretty
 relevant. Stuff we were forced to do back in the bad old DOS 640k days
 seems pretty quaint now <g>.

Code size is actually still important on embedded apps (console video games) where the machine has a small code cache size (8K or less). On PS2, optimizing for size produces faster code in most cases than optimizing for speed.
 Most people already have several
 hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit
 shortsighted to make the lives of D programmers harder forever, just to
 save a few megabytes of memory that people will laugh about in 5 years
 (or already laugh about right now)?

D programmers can use dchars if they want to.

So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32? Unfortunately then a char won't hold a single Unicode character; you have to mix char and dchar.

It would be nice to have a library function to pull the first character out of a UTF-8 string and increment the iterator pointer past it:

dchar extractFirstChar(inout char* utf8string);

That seems like an insanely useful text processing function. Maybe the reverse as well:

void appendChar(char[] utf8string, dchar c);

Sean
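[Editorial sketch: what such a decoder might look like, shown in C++ for illustration. The name extractFirstChar follows the signature proposed above and is not from any actual library; error handling is minimal.]

```cpp
#include <cstdint>

// Decode the first code point from a UTF-8 string and advance the pointer
// past it. A real implementation would also validate continuation bytes.
uint32_t extractFirstChar(const char*& s) {
    unsigned char b = static_cast<unsigned char>(*s++);
    if (b < 0x80) return b;                      // 1-byte sequence (ASCII)
    int extra; uint32_t cp;
    if      ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; } // 2 bytes
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; } // 3 bytes
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; } // 4 bytes
    else return 0xFFFD;                          // invalid lead byte
    while (extra--)                              // fold in 6 bits per byte
        cp = (cp << 6) | (static_cast<unsigned char>(*s++) & 0x3F);
    return cp;
}
```

Note that the caller's pointer ends up on the next sequence, which is exactly what a linear parsing loop wants.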
Dec 18 2003
next sibling parent Elias Martenson <no spam.spam> writes:
Den Thu, 18 Dec 2003 10:49:31 -0800 skrev Sean L. Palmer:

 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?
 
 Unfortunately then a char won't hold a single Unicode character, you have to
 mix char and dchar.

This is why I have advocated a rename of dchar to char, and the current char to something else (my first suggestion was utf8byte, but I can see why it was rejected off hand. :-) ).
 It would be nice to have a library function to pull the first character out
 of a UTF-8 string and increment the iterator pointer past it.
 
 dchar extractFirstChar(inout char* utf8string);
 
 That seems like an insanely useful text processing function.  Maybe the
 reverse as well:
 
 void appendChar(char[] utf8string, dchar c);

At least my intention when starting this second round of discussion was to iron out what the "D way" of handling strings is, so we can get to work on these library functions that you request.

Regards

Elias Mårtenson
Dec 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:brssrg$135p$1 digitaldaemon.com...
 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?

Yes. Exactly.
 Unfortunately then a char won't hold a single Unicode character,

Correct. But a dchar will.
 you have to mix char and dchar.

 It would be nice to have a library function to pull the first character out
 of a UTF-8 string and increment the iterator pointer past it.
 dchar extractFirstChar(inout char* utf8string);

Check out the functions in std.utf.
 That seems like an insanely useful text processing function.  Maybe the
 reverse as well:
 void appendChar(char[] utf8string, dchar c);

Actually, a wrapper class around the string, overloading opApply, [], etc., will do the job nicely.
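[Editorial sketch: the reverse direction, along the lines of the appendChar proposed above. C++ is used for illustration; the name is from the post, not from an actual library.]

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8 and append the bytes to the string.
void appendChar(std::string& utf8, uint32_t c) {
    if (c < 0x80) {                                  // 1 byte
        utf8 += static_cast<char>(c);
    } else if (c < 0x800) {                          // 2 bytes
        utf8 += static_cast<char>(0xC0 | (c >> 6));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {                        // 3 bytes
        utf8 += static_cast<char>(0xE0 | (c >> 12));
        utf8 += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    } else {                                         // 4 bytes
        utf8 += static_cast<char>(0xF0 | (c >> 18));
        utf8 += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        utf8 += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        utf8 += static_cast<char>(0x80 | (c & 0x3F));
    }
}
```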
Dec 18 2003
parent reply Karl Bochert <kbochert copper.net> writes:
On Thu, 18 Dec 2003 16:05:47 -0800, "Walter" <walter digitalmars.com> wrote:
 
 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:brssrg$135p$1 digitaldaemon.com...
 So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and
 dchar[] means UTF-32?

Yes. Exactly.
 Unfortunately then a char won't hold a single Unicode character,

Correct. But a dchar will.

A char is defined as a UTF-8 character but does not have enough storage to hold one!? ubyte[4] declares storage for 4 ubytes, but char[4]? The D manual describes a char as being a UTF-8 char AND being 8 bits? Can't a single UTF-8 character require multiple bytes for representation?

A datatype is some storage and a set of operations that can be done on that storage. In what way are char and ubyte different datatypes?

An array of a datatype is an indexable set of elements of that type. (Isn't it?) Given

    char foo[4];

does foo[2] not represent the third char in foo !!??

I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).

D's datatypes seem to be of two different varieties: names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others (ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant). Which is char?

Karl Bochert
Dec 20 2003
next sibling parent Elias Martenson <no spam.spam> writes:
Den Sat, 20 Dec 2003 19:33:59 +0000 skrev Karl Bochert:

 D's datatypes seem to be of two different varieties; names for units of memory
 and names for abstract types. Some (ubyte) describe a fixed amount af physical
 storage, while others ( ifloat?)  describe an abstract datatype whose physical
 structure
 is hidden (or at least irrelevant)
 Which is char?

It's a fixed memory type. Look at it as a ubyte, but with some special guarantees (upheld by convention). By your own question you have pointed out that the name "char" is not very good. But I really should stop pointing this out, or I'll be banned before I even get started with providing any actual value to the project. :-)

Regards

Elias Mårtenson
Dec 20 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Karl Bochert" <kbochert copper.net> wrote in message
news:1103_1071948839 bose...
 A char is defined as a UTF-8 character but does not have enough storage to
 hold one!?
Right.
 The D manual describes a char as being a UTF-8 char AND being 8-bits?

Yes.
 Can't a single UTF-8 character require multiple bytes for representation?

No.
 A datatype is some storage and a set of operations that can be done on
 that storage.
 In what way are char and ubyte different datatypes?

Only how they are overloaded, and how string literals are handled.
 An array of a datatype is an indexable set of elements of that type.
 (Isn't it?)
 Given
     char foo[4];

 does foo[2] not represent the third char in foo !!??

If it makes more sense, it is the third byte in foo.
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
 D's datatypes seem to be of two different varieties; names for units of
 memory and names for abstract types. Some (ubyte) describe a fixed amount
 of physical storage, while others (ifloat?) describe an abstract datatype
 whose physical structure is hidden (or at least irrelevant).
 Which is char?

char is a fixed 8 bits of storage.
Dec 20 2003
next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:bs3pmm$2m0v$2 digitaldaemon.com...
 "Karl Bochert" <kbochert copper.net> wrote in message
 news:1103_1071948839 bose...
 A char is defined as a UTF-8 character but does not have enough storage to
 hold one!?

 Right.

 The D manual  derscribes a char as being a UTF-8 char AND being 8-bits ?

Yes.
 Can't a single UTF-8 character require multiple bytes for
 representation?

 No.

??? A Unicode character can result in up to 6 bytes when encoded with UTF-8. Which is what the poster meant to ask, I think.

Roald
Dec 21 2003
next sibling parent "Walter" <walter digitalmars.com> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:bs4ddt$ig4$1 digitaldaemon.com...
 Can't a single UTF-8 character require multiple bytes for
 representation?

 No.

??? A unicode character can result in up to 6 bytes used, when encoded with UTF-8. Which is what the poster meant to ask, I think.

Sure, perhaps I misunderstood him.
Dec 22 2003
prev sibling parent "Serge K" <skarebo programmer.net> writes:
 ???
 A unicode character can result in up to 6 bytes used, when encoded
 with UTF-8.

UTF-8 can represent all Unicode characters with no more than 4 bytes. ISO/IEC 10646 (UCS-4) may require up to 6 bytes in UTF-8, but it is a superset of Unicode.
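[Editorial sketch: the sequence length can be read straight off the lead byte, which makes the 4-byte limit for Unicode proper easy to see. C++ for illustration; the helper name is invented.]

```cpp
// Byte length of a UTF-8 sequence, determined by its lead byte. The 5- and
// 6-byte forms exist only in the old UCS-4 scheme; Unicode proper (up to
// U+10FFFF) never needs more than 4 bytes.
int sequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;   // 0xxxxxxx: ASCII
    if (lead < 0xC0) return 0;   // 10xxxxxx: continuation byte, not a lead
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF8) return 4;   // 11110xxx: longest form Unicode uses
    if (lead < 0xFC) return 5;   // 111110xx: UCS-4 only
    return 6;                    // 1111110x: UCS-4 only
}
```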
Dec 30 2003
prev sibling next sibling parent reply "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> writes:
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread. You can see the message here: D/20619

As for the attached file - it does not appear to be accessible to users of the webservice, so I have placed it on the wiki at: http://www.wikiservice.at/wiki4d/wiki.cgi?StringClasses

Rupert
Dec 21 2003
parent reply Ant <Ant_member pathlink.com> writes:
In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that? D/19525

It's bigger so it must be better ;)

Ant
Dec 21 2003
parent Rupert Millard <rupertamillard hotmail.DELETE.THIS.com> writes:
Ant <Ant_member pathlink.com> wrote in
news:bs4gc8$n2c$1 digitaldaemon.com: 

 In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...
 I would think that the datatype char would be a UTF-8 character, with no
 indication of the amount of storage it used. The compiler would be free to
 represent it internally however it chose. Indexing should work (perhaps
 inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that? D/19525 It's bigger so it must be better ;) Ant

You had me worried here because I missed that post! However, they do slightly different things, I think. Mine indexes characters rather than bytes in UTF-8 strings. Vathix's does many other string handling things (e.g. changing case).

My code needs to be integrated into his, if it can be - I'm not sure what implications his use of templates has. You're quite correct - as they currently are, his is vastly more useful. I can't think of many situations where you need to index whole characters rather than bytes. My main reason for writing it was that I enjoy writing code.

Rupert
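[Editorial sketch: indexing the n'th character (not byte) of a UTF-8 string means walking the sequences from the start - O(n) rather than O(1), which is the trade-off discussed above. C++ for illustration; the helper name is invented and no validation is done.]

```cpp
#include <cstddef>
#include <string>

// Return the byte offset at which character number n starts. Each step
// skips one whole sequence, using the lead byte to find its length.
std::size_t byteOffsetOfChar(const std::string& utf8, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if      (b < 0x80) i += 1;   // 0xxxxxxx
        else if (b < 0xE0) i += 2;   // 110xxxxx
        else if (b < 0xF0) i += 3;   // 1110xxxx
        else               i += 4;   // 11110xxx
        --n;
    }
    return i;
}
```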
Dec 21 2003
prev sibling parent Ilya Minkov <minkov cs.tum.edu> writes:
I think this discussion of "language being wrong" is wrong. It is 
obviously clear that the char[], char, and other associated types don't 
have sensible higher-level semantics. The examples are many.

Obviously, i find it quite right from the language not to constrain the 
programmers to high-level types. It is a job for the library.

Now, everyone. Walter has quite enough to do of what he does better than 
all of us. Improving on a standard library is a job which he delegates 
to us.

A library class or struct String should be indexed by a real character 
scanning, and not by the address, even if it means more overhead. And the 
result of this indexing, as well as any single-character access, would be 
a dchar. The internal representation should still be accessible, for the 
case someone finds high-level semantics a bottleneck within his application.

Besides, myself and Mark have proposed a number of solutions a while 
ago, which would give strings non-standard storage, but would allow the 
high level representation to be significantly faster, at the cost of 
ease of operating on a lower-level representation.

-eye
Dec 21 2003
prev sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
I don't see how the design of the UTF-8 encoding adds any advantage over
other multibyte encodings that might cause people to use it properly.

UTF-8 has some nice advantages over other multibyte encodings in that it is possible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages are necessary to decode them.
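[Editorial sketch: the self-synchronizing property described here - every continuation byte matches the bit pattern 10xxxxxx - means the start of the enclosing sequence is always at most a few bytes back. C++ for illustration; the helper name is invented.]

```cpp
#include <cstddef>

// From an arbitrary byte position, step back over continuation bytes
// (10xxxxxx) to the lead byte of the enclosing sequence. Within Unicode's
// 4-byte limit this is at most 3 steps - no rescan from the beginning.
std::size_t sequenceStart(const unsigned char* s, std::size_t pos) {
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        --pos;
    return pos;
}
```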

The only situation I can think of where this might be useful is if you want to jump directly into the middle of a string. And that isn't really useful for UTF-8 because you do not know how many characters were before that - so you have no idea where you've "landed".
And about computing complexity: if you ignore the overhead introduced by
having to move more (or sometimes less) memory then manipulating UTF-32
strings is a LOT faster than UTF-8. Simply because random access is
possible and you do not have to perform an expensive decode operation on
each character.

Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.

Hmmm. That IS interesting. Now that you mention it, I think this would also apply to most of my own code. Though it might depend on the kind of application.
 So assuming that your application uses 100.000
 lines of text (which is a lot more than anything I've ever seen in a
 program), each 100 characters long and everything held in memory at
 once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
 These are hardly numbers that will bring a modern OS to its knees
 anymore. In a few years this might even fit completely into the CPU's
 cache!

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)

I hadn't thought of applications that do nothing but serve data/text to others. That's a good counter-example against some of my arguments. Having the server run at 1/2 capacity because of string encoding seems to be too much. So I think you're right in having multiple "native" encodings.

That still leaves the problem of providing easy ways to work with strings, though, to ensure that newbies will "automatically" write Unicode-capable applications. That's the only way I see to avoid the situation we see in C/C++ code right now.

What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times. I can think of only three ways around that:

1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead.

2) using some sort of template and letting the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack...

3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII.

I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completely transparent.

Hauke
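[Editorial sketch: option 3 might look roughly like this. C++ is used for illustration; all names are invented, and a real design would also need iterators, slicing, etc.]

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// An abstract interface that every encoding-specific string class
// implements, so functions written against the interface accept any
// encoding without conversion.
struct IString {
    virtual ~IString() {}
    virtual std::size_t length() const = 0;            // in characters
    virtual uint32_t charAt(std::size_t i) const = 0;  // decoded code point
};

// One concrete encoding; a Utf8String, Latin1String, etc. would implement
// the same interface with their own internal representation.
class Utf32String : public IString {
    std::vector<uint32_t> data;
public:
    Utf32String(const uint32_t* p, std::size_t n) : data(p, p + n) {}
    std::size_t length() const { return data.size(); }
    uint32_t charAt(std::size_t i) const { return data[i]; }
};

// Library code sees only the interface, never the encoding.
bool containsChar(const IString& s, uint32_t c) {
    for (std::size_t i = 0; i < s.length(); ++i)
        if (s.charAt(i) == c) return true;
    return false;
}
```

The cost, as noted below, is that such calls are virtual and hard to inline.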
Dec 19 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bruief$kav$1 digitaldaemon.com...
 What's bad about multiple encodings is that all libraries would have to
 support 3 kinds of strings for everything. That's not really feasible in
 the real world - I certainly don't want to write every function 3 times.

I had the same thoughts!
 I can think of only two ways around that:

 1) some sort of automatic conversion when the function is called. This
 might cause quite a bit of overhead.

 2) using some sort of template and let the compiler generate the 3
 special cases. I don't think normal templates will work here, because we
 also need to support string functions in interfaces. Maybe we need some
 kind of universal string argument type? So that the compiler can
 automatically generate 3 functions if that type is used in the parameter
 list? Seems a bit of a hack....

My first thought was to template all functions taking a string. It just got too complicated.
 3) making the string type abstract so that string objects are
 compatible, no matter what their encoding is. This has the added benefit
 (as I have mentioned a few times before ;)) that users could have
 strings in their own encoding, which comes in handy when you're dealing
 with legacy code that does not use US-ASCII.

 I think 3 would be the most feasible. You decide about the encoding when
 you create the string object and everything else is completely

I think 3 is the same as 1!
Dec 19 2003
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
I can think of only two ways around that:

1) some sort of automatic conversion when the function is called. This
might cause quite a bit of overhead.


3) making the string type abstract so that string objects are
compatible, no matter what their encoding is. This has the added benefit
(as I have mentioned a few times before ;)) that users could have
strings in their own encoding, which comes in handy when you're dealing
with legacy code that does not use US-ASCII.

I think 3 would be the most feasible. You decide about the encoding when
you create the string object and everything else is completely

 transparent.

I think 3 is the same as 1!

Not really ;). With 1 I meant having unrelated string classes (maybe source code compatible, but not derived from a common base class). That would mean that a temporary object would have to be created if a function takes, say, a UTF-8 string as an argument but you pass it a UTF-32 string.

Pros: the compiler can do more inlining, since it knows the object type.

Cons: the performance gain of the inlining is probably lost with all the conversions that will be going on if you use different libs. It is also not possible to easily add new string types without having to add the corresponding copy constructor and =operators to the existing ones.

With 3 there would not be such a problem. All functions would have to use the common string interface for their arguments, so any kind of string object that implements this interface could be passed without a conversion.

Pros: adding new string encodings is no problem, passing string objects never causes new objects to be created or data to be converted.

Cons: most calls can probably not be inlined, since the functions will never know the actual class of the strings they work with. Also, if you want to pass a string constant to a function you'll have to explicitly wrap it in an object, since the compiler doesn't know what kind of object to create to convert a char[] to a string interface reference.

The last point would go away if string constants were also string objects. I think that would be a good idea anyway, since that'd make the string interface the default way to deal with strings.

Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about.

Hauke
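Hauke's option 3 can be sketched in a few lines. The following is an illustration in Python, not D; the names (UString, Utf8String, Utf32String, count_points) are hypothetical and only show the shape of the idea: library functions are written once against an abstract interface, and any encoding-specific string object passes through without conversion or temporaries.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of option 3: an abstract string interface that
# concrete, encoding-specific classes implement.
class UString(ABC):
    @abstractmethod
    def code_points(self):
        """Iterate over the string as Unicode code points."""

class Utf8String(UString):
    def __init__(self, data: bytes):
        self._data = data            # stored as UTF-8 bytes
    def code_points(self):
        return (ord(c) for c in self._data.decode("utf-8"))

class Utf32String(UString):
    def __init__(self, points):
        self._points = list(points)  # stored as 32-bit code points
    def code_points(self):
        return iter(self._points)

# A library function written once against the interface: no conversion
# or temporary object is created, whichever encoding the caller uses.
def count_points(s: UString) -> int:
    return sum(1 for _ in s.code_points())

print(count_points(Utf8String("héllo".encode("utf-8"))))   # 5
print(count_points(Utf32String(map(ord, "héllo"))))        # 5
```

Note how both calls hit the same function body; the cost, as Hauke says, is that such calls go through a virtual dispatch and are hard to inline.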
Dec 19 2003
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Hauke Duden wrote:
 Another solution would be if there was some way to write global 
 conversion functions that are called to do implicit conversions between 
 different types. Such functions could also be useful in many other 
 circumstances, so that might be an idea to think about.

Just to clarify: I meant this in the context of creating a string interface instance from a string constant, not to convert between different string objects (which wouldn't make much sense). E.g.

interface string { ... }
class MyString implements string { ... }

void print(string msg) { ... }

Without an implicit conversion we'd have to write:

print(new MyString("Hello World"));

With an implicit conversion that'd look like this:

string opConvert(char[] s) { return new MyString(s); }

print("Hello World");

[The last line would translate to print(opConvert("Hello World")) ]

Hauke
Dec 19 2003
prev sibling parent reply "Serge K" <skarebo programmer.net> writes:
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

UTF-8 means multibyte encoding for most of the languages (except English and maybe some others). Most of the European and Asian languages need just one UTF-16 unit per character. For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated as <1%. Other scripts encoded in "higher planes" cover very rare or dead languages and some special symbols.

In most of the cases a UTF-16 string can be treated as a simple array of UCS-2 characters. You just need to know if it has surrogates // if (number_of_characters < number_of_16bit_units)
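Serge's byte counts, and his surrogate test at the end, can both be checked directly. A small Python illustration (not D; the sample characters and the helper name has_surrogates are my own choices):

```python
# UTF-8 never needs more than 4 bytes per character, and each script
# class falls where Serge says it does.
samples = {
    "A":  1,  # ASCII
    "Ж":  2,  # Cyrillic
    "愛": 3,  # CJK, Basic Multilingual Plane
    "𐍈": 4,  # U+10348 Gothic hwair, a "higher plane" character
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# Serge's surrogate test for UTF-16: a string contains surrogate pairs
# exactly when it has fewer characters than 16-bit units.
def has_surrogates(s: str) -> bool:
    units = len(s.encode("utf-16-le")) // 2  # number of 16-bit code units
    return len(s) < units                    # number_of_characters < units

print(has_surrogates("héllo"))  # False: all BMP characters
print(has_surrogates("𐍈"))     # True: one character, two UTF-16 units
```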
Dec 30 2003
parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:bst8q3$218i$1 digitaldaemon.com...
 I don't see how the design of the UTF-8 encoding adds any advantage over
 other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

This is a good point. But I stand my ground: it may result in up to 6 bytes used for each character (worst case).
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters then
 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.
 UTF-8 means multibyte encoding for most of the languages (except English and
 maybe some others)

Right.
 Most of the European and Asian languages need just one UTF-16 unit per
 character.

Yes most, but not all.
 For CJK languages occurrence of the UTF-16 surrogates in the real texts is
 estimated as <1%.

The code to handle it still has to be present...
 Other scripts encoded in "higher planes" cover very rare or dead languages
 and some special symbols.

 In most of the cases UTF-16 string can be treated as simple array of UCS-2
 characters.

Yes, but "most cases" is not a good argument when the original discussion was initiated to handle ALL languages, in a way that the developer would find to be "natural", easy and integrated in the D language.
 You just need to know if it has surrogates // if (number_of_characters <
 number_of_16bit_units)

There is no such thing as "just" with these issues (IMHO) ;-) Roald
Dec 30 2003
parent "Serge K" <skarebo programmer.net> writes:
 Actually, depending on your language, UTF-32 can also be better than
 UTF-8. If you use a language that uses the upper Unicode characters



 UTF-8 will use 3-5 bytes per character. So you may end up using even
 more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.

RTFM.

[The Unicode Standard, Version 4.0]

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

UTF-8

D36. UTF-8 encoding form: The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-5.

[Table 3-5. UTF-8 Bit Distribution]

Scalar Value                  1st Byte   2nd Byte   3rd Byte   4th Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy   10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz   10yyyyyy   10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu   10uuzzzz   10yyyyyy   10xxxxxx

[Appendix C : Relationship to ISO/IEC 10646]

C.3 UCS Transformation Formats

UTF-8

The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alternative coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described under definition D36 in Section 3.9, Unicode Encoding Forms.

...

The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters.
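Table 3-5 above transcribes directly into code. A Python sketch of the encoder it describes (illustrative only; the function name utf8_encode is mine, and this toy version skips the surrogate-range validation a real encoder must do), showing that four bytes suffice for any scalar value up to U+10FFFF:

```python
# A direct transcription of Table 3-5's bit distribution.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                    # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                   # 110yyyyy 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                 # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -- the 4-byte ceiling
    return bytes([0xF0 | cp >> 18,
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with Python's built-in encoder across all four byte lengths,
# including the very last code point U+10FFFF.
for cp in [0x41, 0x416, 0x611B, 0x10348, 0x10FFFF]:
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```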
Jan 03 2004
prev sibling next sibling parent reply Matthias Becker <Matthias_member pathlink.com> writes:
In a higher level language, yes. But in doing systems work, one always seems
to be looking at the lower level elements anyway. I wrestled with this for a
while, and eventually decided that char[], wchar[], and dchar[] would be low
level representations. One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

Shouldn't this wrapper be part of Phobos?
Dec 17 2003
parent "Walter" <walter digitalmars.com> writes:
"Matthias Becker" <Matthias_member pathlink.com> wrote in message
news:brpr00$2grc$1 digitaldaemon.com...
In a higher level language, yes. But in doing systems work, one always


to be looking at the lower level elements anyway. I wrestled with this


while, and eventually decided that char[], wchar[], and dchar[] would be


level representations. One could design a wrapper class for them that
overloads [] to provide automatic decoding if desired.

Shouldn't this wrapper be part of Phobos?

Eventually, yes. First things first, though, and the first step was making the innards of the D language and compiler fully unicode enabled.
Dec 17 2003
prev sibling parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <walter digitalmars.com> wrote in message
news:brll85$1oko$1 digitaldaemon.com...
 "Elias Martenson" <no spam.spam> wrote in message
 news:pan.2003.12.15.23.07.24.569047 spam.spam...
 Actually, byte or ubyte doesn't really matter. One is not supposed to
 look at the individual elements in a UTF-8 or a UTF-16 string anyway.

 In a higher level language, yes. But in doing systems work, one always seems
 to be looking at the lower level elements anyway. I wrestled with this for a
 while, and eventually decided that char[], wchar[], and dchar[] would be low
 level representations. One could design a wrapper class for them that
 overloads [] to provide automatic decoding if desired.


 The overloading issue is interesting, but may I suggest that char and wchar
 are at least renamed to something more appropriate? Maybe utf8byte and
 utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>.
The world has already gotten used to multibyte 'char' in C and the funky
'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much
of an issue here.


 And here is also the core of the problem: having an array of "char"
 implies to the unwary programmer that the elements in the sequence
 are in fact "characters", and that you should be allowed to do stuff
 like isspace() on them. The fact that the libraries provide such
 function doesn't help either.

I think the library functions should be improved to handle unicode chars.
But I'm not much of an expert on how to do it right, so it is the way it is
for the moment.

 I'd love to help out and do these things. But two things are needed


     - At least one other person needs to volunteer.
       I've had bad experiences when one person does this by himself,

You're not by yourself. There's a whole D community here!
     - The core concepts needs to be decided upon. Things seems to be
       somewhat in flux right now, with three different string types
       and all. At the very least it needs to be deicded what a "string"
       really is, is it a UTF-8 byte sequence or a UTF-32 character
       sequence? I haven't hid the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
 That's correct as well. The library's support for unicode is



 But there also is a nice package (std.utf) which will convert between
 char[], wchar[], and dchar[]. This can be used to convert the text strings
 into whatever unicode stream type the underlying operating system API
 expects. (For win32 this would be UTF-16, I am unsure what linux supports.)

and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native
decoding'. There is only one possible conversion of UTF-8 to UTF-16.

 Unless fundamental encoding/decoding is embedded in the streams library,
 it would be best to simply read text data into a byte array and then
 perform native decoding manually afterwards using functions similar
 to the C mbstowcs() and wcstombs(). The drawback to this is that you
 cannot read text data in platform encoding without copying through
 a separate buffer, even in cases when this is not needed.

If you're talking about win32 code pages, I'm going to draw a line in the
sand and assert that D char[] strings are NOT locale or code page dependent.
They are UTF-8 strings. If you are reading code page or locale dependent
strings, to put them into a char[] will require running it through a
conversion.

 D is headed that way. The current version of the library I'm working on
 converts the char[] strings in the file name API's to UTF-16 via
 std.utf.toUTF16z(), for use calling the win32 API's.

the native<->unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue.

I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.

 In C, as already mentioned,
 these are called mbstowcs() and wcstombs(). For Windows, these would
 convert to and from UTF-16. For Unix, these would convert to and from
 whatever encoding the application is running under (dictated by the
 LC_CTYPE environment variable). There really is no need to make the
 API's platform dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF.

This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)

 UTF-8 is really quite brilliant. With just some minor extra care over
 writing ordinary ascii code, you can write portable code that is fully
 capable of handling the complete unicode character set.

Following this discussion, I have read some more on the subject. In addition to the speed issues that were mentioned, I have had some insights on the issues of endianness, serialization, BOM (Byte Order Mark) ++

Most of it can be found in a reasonably short pdf document:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

There is even more to this than I first believed... Based on the new knowledge I become more and more convinced that the choice of UTF-8 encoding as the basic "correct thing to do" for general use in a programming language is well founded. But when text _processing_ comes into play, other rules apply.

But: I still find it objectionable to call one byte in a UTF-8/Unicode based language a char! ;-) The naming will of course make it easier to do a straight port from C to D, but such a port will in most cases be of no use on the "International scene". Oh well, this can be argued for/against well both ways I guess...

IMHO there should be no char type at all. Only byte. Or maybe to take more sizes into consideration: bin8, bin16, bin32, bin64... I think porting from C to D should involve renaming char's to bin8's.

Hmmm... It is sad when learning more makes you want to change less ;-) Anyway, there is more to be learned...

Roald
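The BOM issue Roald mentions (chapter 2 of the pdf above) is mechanical enough to sketch. An illustration in Python, assuming the BOM constants from its standard codecs module; the function name detect_bom is hypothetical:

```python
import codecs

# Sketch of BOM-based encoding detection. Order matters: the UTF-32
# BOMs begin with the same bytes as the UTF-16 BOMs, so UTF-32 must be
# tested first (and a UTF-16-LE text starting with U+0000 would still
# be misdetected -- the BOM is a hint, not a proof).
def detect_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF32_LE): return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE): return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):     return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE): return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE): return "utf-16-be"
    return None  # no BOM: the encoding must be known out of band

print(detect_bom("hi".encode("utf-8-sig")))        # utf-8
print(detect_bom(codecs.BOM_UTF16_LE + b"h\x00"))  # utf-16-le
```

The last case (no BOM, return None) is exactly the serialization problem: a bare byte stream does not announce its own encoding.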
Dec 31 2003
prev sibling parent "Elohe" <GODA-XEN terra.es> writes:
First: I'm new to D and my English is bad.

I really like UTF-8, but in truth it is not always efficient (local character access...). In a small number of C/C++ programs I needed to use internal UTF-32 instead of UTF-8, but later I introduced a hack: I indexed the UTF-8 character number/position and kept a standard UTF-8 vector. The memory needed is lower than with UTF-32 in my most frequent cases, and the memory efficiency is better than UTF-32. In my experience this works very well for Latin and CJK languages (the two encodings I normally use), but for Cyrillic, Arabic... the memory use can be bigger than UTF-32, though with an efficient indexing system we can equal the memory needed by UTF-32. In performance the penalty is less than 8 times slower than a UTF-32 implementation; compared to the penalty of standard UTF-8 it is very fast.

I recommend adding:

stringi -> indexed string for utf8

and the possibility to mark an internal representation of the utf like:

string utf8-32 -> this marks an utf8 string, but it works internally as utf32
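The "stringi" idea above can be sketched as a UTF-8 byte array plus a side index from character number to byte offset, so that s[n] no longer requires scanning from the start. A Python illustration (the class name IndexedUtf8 is hypothetical; a real implementation would build the index lazily or sparsely to save memory):

```python
# Sketch of an indexed UTF-8 string: the text stays in UTF-8, and a
# side table records the byte offset where each character starts.
class IndexedUtf8:
    def __init__(self, text: str):
        self._data = text.encode("utf-8")
        self._index = []          # byte offset of each character
        off = 0
        for ch in text:
            self._index.append(off)
            off += len(ch.encode("utf-8"))

    def __len__(self):
        return len(self._index)   # characters, not bytes

    def __getitem__(self, n: int) -> str:
        start = self._index[n]
        end = (self._index[n + 1] if n + 1 < len(self._index)
               else len(self._data))
        return self._data[start:end].decode("utf-8")

s = IndexedUtf8("ħéllo 愛")
print(len(s))   # 7 characters (11 UTF-8 bytes)
print(s[6])     # 愛 -- direct access, no scan from the start
```

The trade-off matches the post: the index costs extra memory (here one offset per character, which is what makes Cyrillic or Arabic text potentially heavier than plain UTF-8), but indexed access is far faster than decoding from the beginning each time.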
Jan 07 2004