digitalmars.D - UTF-8 char[] consistency

reply "Jaap Geurts" <jaapsen hotmail.com> writes:
Hi all,

I'm testing and programming in D using UTF-8 under linux to encode the
Vietnamese character set.
I have some trouble with the way D handles the char[].length property.

If I make a string as follows
char[] s = "câu này có những chữ cái tiếng việt";

Then the length property (s.length) reports the number of bytes, not the number
of characters as I would expect. Returning the number of bytes is what I would
expect from a byte[].
Therefore I still need to use a strlen-style function to determine the correct
string length.
One of the implications is that most *string* handling functions in the phobos
library depend on the length property and thus fail.

There are some solutions to this without modifying the language:

1. use special functions to do the work.
2. make a string class.
3. convert everything internally to UTF-16, convert it back to UTF-8 before
output.

1. The special functions would work, but this is troublesome because the phobos
functions cannot be used (i.e. they have to be rewritten).

2. The string class doesn't work well because the opAssign function cannot be
overridden and thus the following cannot be done:

# String s = new String(); s = "hello";

I know that it can be done slightly differently, but I'd like it to be as seamless
as possible. (String s = new String("hello");) However, the phobos functions
still don't work and have to be included in the class. Wasn't Walter against a
String class?? ;)

3. Converting everything is not very efficient, and requires non-transparent
extra work.

I'd suggest the following:

1. The char[] needs to be treated by the D compiler as a string array not as a
byte array, or
2. Implement a special String datatype (has been discussed earlier and Walter
is against it.)

Also, a lot of phobos functions are missing for wide and double character
operations. E.g. wchar[] ljustify(wchar[], int width); is not available and
many more are not available for larger char sets.

Regards, Jaap


---
D programming from Vietnam
Sep 25 2004
next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
Jaap Geurts wrote:

 Hi all,
 
 I'm testing and programming in D using UTF-8 under linux to encode the
 Vietnamese character set. I have some trouble with the way D handles the
 char[].length property.

If this isn't in some FAQ it should be.
 If I make a string as follows
 char[] s = "câu này có những chữ cái tiếng việt";
 
 Then the length property (s.length) reports the number of bytes not the
 number of characters as I would expect to happen. The length property
 would return the number of bytes for the byte[]. Therefore I still need to
 use a strlen function to determine the correct string length. One of the
 implications is that most *string* handling functions in the phobos
 library depend on the length property and thus fail.

which string functions specifically? What do you mean by "fail"?
 There are some solutions to this: without modifying the language:
 
 1. use a special functions to do the work.
 2. make a string class.
 3. convert everything internally to UTF-16, convert it back to UTF-8
 before output.

4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in your string will fit in a wchar).
 1. The special functions would work but is troublesome because the phobos
 functions cannot be used.(i.e. they have to be rewritten).
 
 2. the string class doesn't work well because the opAssign function cannot
 be overridden and this the following cannot be done:
 
 # String s = new String(); s = "hello".
 
 I know that it can be done slightly different but I'd like it to be as
 seemless as possible. (String s = new String("hello");) However the the
 phobos functions stil don't work and have to be included in the class.
 Wasn't Walter against a String class?? ;)
 
 3. Converting everything is not very efficient. And requires
 non-transparent extra work.
 
 I'd suggest the following:
 
 1. The char[] needs to be treated by the D compiler as a string array not
 as a byte array, or 2. Implement a special String datatype (has been
 discussed earlier and Walter is against it.)

Have you tried using dchar[] or wchar[] in your app? Someone has made wstring.d, which is the wchar equivalent of std.string (maybe it works for dchar, too, I don't remember exactly). And AJ and some others are working on expanding the unicode support - see www.dsource.org.
 Also, a lot of phobos functions are missing for wide and double character
 operations. E.g. wchar[] ljustify(wchar[], int width); is not available
 and many more are not available for larger char sets.

I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
 Regards, Jaap
 
 
 ---
 D programming from Vietnam

Sep 25 2004
parent reply "Jaap Geurts" <jaapsen hotmail.com> writes:
On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

 If I make a string as follows
 char[] s = "câu này có những chữ cái tiếng việt";

 Then the length property (s.length) reports the number of bytes not the
 number of characters as I would expect to happen. The length property
 would return the number of bytes for the byte[]. Therefore I still need to
 use a strlen function to determine the correct string length. One of the
 implications is that most *string* handling functions in the phobos
 library depend on the length property and thus fail.

which string functions specifically? What do you mean by "fail"?

They report the incorrect length. s.length reports the byte count, not the actual character count, which is what I would expect because it's an array of char. If I'm right, then for a char[] s; array, requesting its length with s.length; should report a wcslen(s) of some sort. But the current implementation doesn't.
   4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in
 your string will fit in a wchar).

I tried the wchar[] and dchar[] and that works just fine. But because I program under linux it would be nice if I could keep all my internal data in a consistent format, which is utf-8 for unix-based systems. It seems a little odd to have to convert it to utf-16 each time I need to know the length of a string. Of course the occasional conversion is unavoidable, because sometimes if one wants to insert a utf-8 encoded character into a string, one has to fit a wchar into a char[]. I realize that.
 I don't have that wstring.d handy but hopefully it covers these. If not
 please let the author know so they can add them (and/or contribute them
 yourself). Your help in improving the library support for wchar and dchar
 would most likely be very much appreciated.

If someone is reading this and knows where wstring.d is, can you please point me to it?

Thanks, Jaap

---
D programming from Vietnam
Sep 26 2004
next sibling parent reply David L. Davis <SpottedTiger yahoo.com> writes:
In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

 I don't have that wstring.d handy but hopefully it covers these. If not
 please let the author know so they can add them (and/or contribute them
 yourself). Your help in improving the library support for wchar and dchar
 would most likely be very much appreciated.

If someone is reading this and knows where the wstring.d is. Can you please point me to it? Thanks, Jaap --- D programming from Vietnam

Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find it here:
http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.

David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 26 2004
next sibling parent "Jaap Geurts" <jaapsen hotmail.com> writes:
"David L. Davis" <SpottedTiger yahoo.com> wrote in message
news:cj7aih$mq5$1 digitaldaemon.com...

 Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you
 can find here:
 http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

 Please, let me know if there's any missing std.string.d function(s) that
 you need, and I'll work on getting them in as soon as possible.

Thanks, David.
Sep 27 2004
prev sibling parent reply "Jaap Geurts" <jaapsen hotmail.com> writes:
David,

I've examined your wstring library, and noticed that the
case(islower,isupper) family functions cannot do other languages than plain
latin ascii. Am I right in this?
What is needed I guess is for the user to supply a conversion table (are the
functions in phobos suitable?). I don't know enough about locale support in
OS's but if it is not available there we'd have to code it into the lib.

I'll do some probing about how to code it first and if you wish I can
provide you the one for Vietnamese.

Regards, Jaap

"David L. Davis" <SpottedTiger yahoo.com> wrote in message
news:cj7aih$mq5$1 digitaldaemon.com...
 In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

 I don't have that wstring.d handy but hopefully it covers these. If not
 please let the author know so they can add them (and/or contribute them
 yourself). Your help in improving the library support for wchar and dchar
 would most likely be very much appreciated.

If someone is reading this and knows where the wstring.d is. Can you please
point me to it?

Thanks, Jaap

---
D programming from Vietnam

 Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you
 can find here:
 http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

 Please, let me know if there's any missing std.string.d function(s) that
 you need, and I'll work on getting them in as soon as possible.

 David L.

 -------------------------------------------------------------------
 "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"

Sep 27 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
What is needed I guess is for the user to supply a conversion table (are the
functions in phobos suitable?).

Sorry to leap into the middle of your conversation with David, but that is not so. What you need to do is go to www.dsource.org and look for a project called Deimos. Therein, you will find a library called etc.unicode, in source code form. Development of this library has been halted, in favor of ICU, but etc.unicode /does/ do simple casing.

(And don't be fooled by the word "simple" - which only means that the function works on characters, not strings (so it can't uppercase "ß" to "SS"), and that it doesn't know that Turkish, Azeri and Lithuanian have non-standard casing rules. It is "simple casing" as opposed to "full casing", that's all.)

The relevant prototypes are:

# dchar getSimpleUppercaseMapping(dchar c);
# dchar getSimpleLowercaseMapping(dchar c);
# dchar getSimpleTitlecaseMapping(dchar c);

You do not need to specify a locale, because, if the locale is anything other than Turkish, Azeri or Lithuanian, the casing will be done correctly.
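
[Editor's sketch] For readers on present-day D, per-character ("simple") casing of this kind can be tried without etc.unicode. The sketch below assumes a D2 compiler and the Phobos module std.uni, neither of which is what the post describes; simpleUpper and simpleLower are hypothetical wrapper names, not the etc.unicode prototypes.

```d
import std.uni : isLower, toLower, toUpper;

// Simple casing: one code point in, one code point out, no locale argument.
// Assumption: D2 Phobos std.uni stands in for the etc.unicode functions.
dchar simpleUpper(dchar c) { return toUpper(c); }
dchar simpleLower(dchar c) { return toLower(c); }

unittest
{
    dchar e = '\u1EC7'; // Vietnamese small e with circumflex and dot below
    assert(isLower(e));
    assert(simpleUpper(e) == '\u1EC6');       // its capital counterpart
    assert(simpleLower(simpleUpper(e)) == e); // round trip
}
```

Because the functions map a single dchar to a single dchar, they cannot express a mapping whose result is longer than its input, which is the limitation the post describes for "simple" casing.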
I don't know enough about locale support in
OS's but if it is not available there we'd have to code it into the lib.

It is a common misconception that casing is locale sensitive. In Unicode, in general, it is not. Okay, so (as mentioned above) Turkish, Azeri and Lithuanian are different, but that is a small enough number that I prefer to think of it as being "locale-independent with three exceptions". I think the misconception arises because the C functions toupper(), tolower() etc. are dependent on something /called/ locale, but which is in fact more closely related to encoding scheme. These ctype functions need to do this because C's chars are only eight bits wide. This logic does not apply to Unicode, and certainly not to the functions in etc.unicode and the forthcoming ICU port.
I'll do some probing about how to code it first and if you wish I can
provide you the one for Vietnamese.

The Unicode standard does not regard Vietnamese as an exception to the standard lookups, so etc.unicode is all you need.

Arcane Jill
Sep 28 2004
prev sibling next sibling parent David L. Davis <SpottedTiger yahoo.com> writes:
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
David,

I've examined your wstring library, and noticed that the
case(islower,isupper) family functions cannot do other languages than plain
latin ascii. Am I right in this?
What is needed I guess is for the user to supply a conversion table (are the
functions in phobos suitable?). I don't know enough about locale support in
OS's but if it is not available there we'd have to code it into the lib.

I'll do some probing about how to code it first and if you wish I can
provide you the one for Vietnamese.

Regards, Jaap

Jaap: Currently for anything unicode based, I've been waiting on work that Arcane Jill is doing. StringW.d was mainly created to make it easier to work with 16-bit characters (string.d made it a real pain...you nearly have to cast everything), and hopefully in turn it will work with Windows' 16-bit wide character API functions. But at this point I haven't tested it, plus I don't understand enough to know the real difference between the 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?).

Anywayz, needless to say I've mirrored string.d functions like tolower(), toupper() and my very own asciiProperCase() functions to still work on ascii characters only. In my last reply I mainly meant to point you to where stringw.d could be found, and to let you know that if you found it useful and needed anything from string.d that's missing in it, I would add it. I hope I didn't give the impression that it did unicode?

Also, I'm afraid I don't know much about "locale support" either. But if you do something in that area I wouldn't mind taking a look at it. :))

Good Luck in your project,
David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 28 2004
next sibling parent David L. Davis <SpottedTiger yahoo.com> writes:
Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in my area
last night and my connection to the internet wasn't working right, so I didn't
think my message had gotten posted. Again sorry.

David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cje51f$q8t$1 digitaldaemon.com>, David L. Davis says...

plus I don't
understand enough to know the real different between the 16-bit characters and
unicode characters (some real example data and code would be helpful in this
area...Jill?, Ben?, and/or anyone?).

Unlike UTF-8, UTF-16 is very cunning - and this is basically because Unicode and UTF-16 were designed together, to work with each other. Here's how it works - there are two different perspectives: the 16-bit perspective, and the 21-bit perspective.

In the 21-bit perspective, characters run from U+0000 to U+10FFFF - /but/, the range U+D800 to U+DFFF is illegal and invalid. There are /no/ Unicode characters in this range. Any application built to view the Unicode world from this point of view should be prepared to correctly handle and display all valid characters (which excludes U+D800 to U+DFFF).

In the 16-bit perspective, characters run from U+0000 to U+FFFF - and, in this world, the characters in the range U+D800 to U+DFFF are just hunky dory. In this perspective, they are called "surrogate characters". They always occur in pairs, with a high surrogate (a character in the range U+D800 to U+DBFF) always immediately followed by a low surrogate (a character in the range U+DC00 to U+DFFF). There are plenty of applications built to view the Unicode world from this point of view (in particular, legacy applications written before Unicode 3.0, when all Unicode characters actually /were/ 16 bits wide).

Let's take an example. The Unicode character U+1D11E (musical symbol G clef). When viewed by an application which sees 21-bit wide characters, what you see is U+1D11E, which you interpret as a single character, and display as ... well ... as musical symbol G clef.

A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E). Such an application may safely interpret these wchars as "unknown character" followed by "unknown character", and nothing will break. A slightly more sophisticated application might even interpret them as "high surrogate" followed by "low surrogate", and still, nothing would break.

These pseudo-characters would likely both display as "unknown character" glyphs, but some fonts may give high surrogates a different glyph from low surrogates. (And, indeed, the Mac's "last chance" fallback font will actually display each pseudo-character as a tiny little hex representation of its codepoint!)

Of course, all of this will fail completely if UTF-8 is used instead of UTF-16. In UTF-8, the representation of U+1D11E is: F0 9D 84 9E. Every UTF-8-aware application will decode this as 0x1D11E, and an application which is unaware of characters beyond U+FFFF would fall over badly here. (It might even truncate it to U+D11E: Hangul syllable TYAELM). But of course, you can still transcode into UTF-16 and deal with it that way - which is another reason why UTF-16 is very good for the internal workings of an application.

Arcane Jill

PS. It is worth noting that the vast majority of fonts available today which are either free or come bundled with an OS do not render characters beyond U+FFFF at all. In fact, I have yet to find /even one/ free font which contains U+1D11E (musical symbol G clef). [I would be very happy to be shown to be wrong on this point - anyone know of one?]. This means that if you stick such characters in a web page, nobody will be able to see them - so you'll have to use a gif after all. :( Unicode may be the future, but sadly it is not the present.
Sep 29 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cje7o0$rj6$1 digitaldaemon.com>, Arcane Jill says...

A legacy 16-bit-Unicode application looking at the same text file (assuming it
to have been saved in UTF-16) will see two "characters": U+D874 followed by
U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E).

Erratum. Whoops! UTF-16 for 1D11E is actually D834 followed by DD1E. (That'll teach me not to try UTF-16 transcoding by hand in future!) The logic of the post still holds, however.

Jill
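
[Editor's sketch] The corrected pair can be checked mechanically rather than by hand. The arithmetic is the standard UTF-16 rule: subtract 0x10000, then put the top ten bits of the result on 0xD800 and the bottom ten bits on 0xDC00. The sketch assumes a present-day D2 compiler; toSurrogates is a hypothetical helper name, not a library function.

```d
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair. Hypothetical helper, for illustration only.
wchar[2] toSurrogates(dchar cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint v = cast(uint) cp - 0x10000;               // 20-bit offset
    wchar high = cast(wchar)(0xD800 + (v >> 10));   // top 10 bits
    wchar low  = cast(wchar)(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return [high, low];
}

unittest
{
    auto pair = toSurrogates('\U0001D11E'); // musical symbol G clef
    assert(pair[0] == 0xD834);
    assert(pair[1] == 0xDD1E);
}
```

For U+1D11E the offset is 0xD11E, whose top ten bits are 0x34 and bottom ten bits are 0x11E, which gives D834 DD1E as stated in the erratum.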
Sep 29 2004
prev sibling parent David L. Davis <SpottedTiger yahoo.com> writes:
Arcane Jill: Thxs as always for the clear insight! I now have a better
understanding of how 16-bit characters (aka UTF-16 / wchar[]) and Unicode (v3.0
/ v4.0) match against one another. :)) 

I hope your ICU conversion work is coming along fine.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Jaap Geurts" <jaapsen hotmail.com> wrote in message
news:opsexriepv2saxk9 krd8833t...
 On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

 If I make a string as follows
 char[] s = "câu này có những chữ cái tiếng việt";

 Then the length property (s.length) reports the number of bytes not the
 number of characters as I would expect to happen. The length property
 would return the number of bytes for the byte[]. Therefore I still need
 to use a strlen function to determine the correct string length. One of
 the implications is that most *string* handling functions in the phobos
 library depend on the length property and thus fail.

which string functions specifically? What do you mean by "fail"?

They report the incorrect length. It reports the byte count, not the
actual character count, as I would expect because it's an array of char. If
I'm right, for a char[] s; array, requesting its length s.length;
should report a wcslen(s) of some sort. But the current implementation
doesn't.

That is by design. Out of curiosity, what are you doing with your strings that requires the number of characters? Usually one just deals with string fragments and it doesn't matter how long a fragment is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice to have. But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and given D's design goals.
Sep 26 2004
parent "Jaap Geurts" <jaapsen hotmail.com> writes:
"Ben Hinkle" <bhinkle mathworks.com> wrote in message
news:cj7eb6$ole$1 digitaldaemon.com...
 That is by design. Out of curiosity, what are you doing with your strings
 that require the number of characters? Usually one just deals with string
 fragments and it doesn't matter how long it is (either in characters or in
 bytes). In a perfect world your expectation of having a one-to-one mapping
 between array indexing and character indexing would clearly be nice to
 have. But the current design is (in Walter's opinion - and I agree with
 him) the best we can do given the imperfect world we find ourselves in and
 given D's design goals.

I need to insert characters into existing strings.

I see. Moreover, if char[] behaves the way it currently does it will be fast, but it probably wouldn't be if it had to interpret the array as UTF-8. But then I see little difference between byte[] and char[]. They are basically the same and can be interpreted ambiguously. Something that Walter wanted to prevent, if I remember correctly.

Jaap
Sep 27 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsevonsdl2saxk9 krd8833t>, Jaap Geurts says...
Hi all,

Hi.
I'm testing and programming in D using UTF-8 under linux to encode the
Vietnamese character set.

Cool.
I have some trouble with the way D handles the char[].length property.

length does what it does. What you need is a character count, which is something different.
Therefore I still need to use a strlen function to determine the correct string
length.

Okay, here's one:

# uint strlen(char[] s)
# {
#     uint n = 0;
#     foreach (char c; s)
#     {
#         if (c<0x80 || c>=0xC0) ++n;
#     }
#     return n;
# }

And some overloads to complete the set:

# uint strlen(wchar[] s)
# {
#     uint n = 0;
#     foreach (wchar c; s)
#     {
#         if (c<0xD800 || c>=0xDC00) ++n;
#     }
#     return n;
# }
#
# uint strlen(dchar[] s)
# {
#     return s.length;
# }
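
[Editor's sketch] The same counting rule, written for a present-day D2 compiler so it can be compiled and checked as-is. It assumes D2's string type and the Phobos function std.utf.count, neither of which is part of the 2004 toolchain discussed here; codePointCount is a hypothetical name chosen to avoid clashing with C's strlen.

```d
import std.utf : count;

// Count code points in UTF-8 by skipping continuation bytes (10xxxxxx),
// the same rule as the char[] overload above. Hypothetical helper name.
size_t codePointCount(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)             // iterate code units, no decoding
        if (c < 0x80 || c >= 0xC0)
            ++n;
    return n;
}

unittest
{
    string s = "vi\u1EC7t";         // "việt" with a precomposed U+1EC7
    assert(s.length == 6);          // bytes: 1 + 1 + 3 + 1
    assert(codePointCount(s) == 4); // code points
    assert(count(s) == 4);          // the Phobos counter agrees
}
```

Note that this counts code points, not user-perceived characters; text that uses combining marks would count higher.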
One of the implications is that most *string* handling functions in the phobos
library depend on the length property and thus fail.

Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII. What you need is Unicode string handling. D doesn't have that yet. There is a third party Unicode library called ICU (Internationalization Components for Unicode) which I'm trying to port to D, but it's slow work, partly because I've got too much else on at the moment.
There are some solutions to this: without modifying the language:

1. use a special functions to do the work.
2. make a string class.
3. convert everything internally to UTF-16, convert it back to UTF-8 before
output.

Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.
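
[Editor's sketch] A small illustration of the "convert internally to UTF-32" route, applied to inserting a character at a character position. It assumes a present-day D2 compiler and the Phobos functions std.utf.toUTF32 and std.utf.toUTF8; insertAt is a hypothetical helper, and the index is a code point index, not a byte index.

```d
import std.utf : toUTF32, toUTF8;

// Insert code point c before the code point at position cpIndex.
// Transcodes to UTF-32, edits there, and transcodes back to UTF-8.
// Hypothetical helper; the round trip costs time and memory.
string insertAt(string s, size_t cpIndex, dchar c)
{
    dstring d = toUTF32(s); // one array element per code point
    return toUTF8(d[0 .. cpIndex] ~ c ~ d[cpIndex .. $]);
}

unittest
{
    // Insert U+1EC7 as the third character of "vit".
    assert(insertAt("vit", 2, '\u1EC7') == "vi\u1EC7t");
    assert(toUTF32("vi\u1EC7t").length == 4);
}
```

The round trip is the non-transparent extra work the original post objects to; the sketch only shows that indexing becomes straightforward once every element is a whole code point.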
1. The special functions would work but is troublesome because the phobos
functions cannot be used.(i.e. they have to be rewritten).

True.
2. the string class doesn't work well because the opAssign function cannot be
overridden and this the following cannot be done:

# String s = new String(); s = "hello".

I know that it can be done slightly different but I'd like it to be as seemless
as possible. (String s = new String("hello");) However the the phobos functions
stil don't work and have to be included in the class. Wasn't Walter against a
String class?? ;)

I've had exactly the same problem with a completely different class. I would very much like to see implicit constructors in D, so we could do:

# String s = "hello"; // what you want
# Int n = 42; // what I want

But this sort of thing is down to Walter, and he doesn't consider it a priority.
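
[Editor's sketch] For comparison, present-day D2 accepts this initialization syntax for structs (not classes) that declare a matching constructor, and structs may also overload opAssign. This is an assumption about a later language version, not something available to the posters; the String type below is a hypothetical wrapper written only to show the syntax.

```d
// Hypothetical wrapper type, only to demonstrate the syntax.
struct String
{
    string data;

    this(string s) { data = s; }

    // Assignment from a plain string after construction.
    void opAssign(string s) { data = s; }
}

unittest
{
    String s = "hello"; // calls this(string)
    assert(s.data == "hello");
    s = "world";        // calls opAssign(string)
    assert(s.data == "world");
}
```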
3. Converting everything is not very efficient. And requires non-transparent
extra work.

I'd suggest the following:

1. The char[] needs to be treated by the D compiler as a string array not as a
byte array,

That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.
or
2. Implement a special String datatype (has been discussed earlier and Walter
is against it.)

This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
Also, a lot of phobos functions are missing for wide and double character
operations. E.g. wchar[] ljustify(wchar[], int width); is not available and
many more are not available for larger char sets.

Again, ICU will fill in these gaps. I wish I could bring you better news, but at least these things are on their way and will get here eventually.

Arcane Jill
Sep 26 2004
next sibling parent reply Benjamin Herr <ben 0x539.de> writes:
Arcane Jill wrote:
 This will happen anyway in time - by accident! ICU has a class called
 UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I mean, strings via easy-to-use arrays were one of those nifty ideas that attract me to D. No freaky libraries to remember, just intuitive things that work the same for all kinds of arrays. But having strings implemented as character arrays is cool only as long as I can actually use that char[]-string like an array and get characters out of it by using the [] operator. Beyond that, it just is an annoying inconsistent analogy.

Also it appears confusing to me that some string operations are supposed to be done with array operations, while others are defined in std.string. Now it seems far easier-to-use to have a string class that wraps all this.

I apologise if my uneducated ranting is far below the average level of insight that is to be available here, and I apologise for the slight offtopicness, and I apologise for bringing this up long after the case to ditch char.

-ben
Sep 26 2004
parent reply "Thomas Kuehne" <eisvogel users.sourceforge.net> writes:
Benjamin Herr <ben 0x539.de> schrieb:
 Arcane Jill wrote:
 This will happen anyway in time - by accident! ICU has a class called
 UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

1) it can consist of one codepoint like 0x41 "A"
2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"
3) especially in Hangul/Korean a "character" might be a sequence of 1 up to 4 codepoints.
4) upper/lowercase conversion is dependent on the language used: Up1 -> Down1, Down2

The above points out only some basics you'd have to implement in your string class.

Thomas
Sep 26 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj8kb0$22n5$1 digitaldaemon.com>, Thomas Kuehne says...

A "character" is something quite complicated.

True enough. The best definition of "character" I have ever encountered is this: A "character" is anything the Unicode Consortium say is a character! More official definitions such as "the smallest unit of information having semantic meaning" just don't hold up under close examination, as it's too easy to find counterexamples. The problem arises because Unicode started its life as the union of many existing legacy "character sets", each of which had their own different idea of what a "character" was.
1) it can consist of one codepoint like 0x41 "A"
2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA
"Á"
3) especially in Hangul/Korean a "character" might be a sequence of 1 up to
4 codepoints.

Actually, you're talking about graphemes and/or glyphs, not characters. There is, in fact, a precise one-to-one correspondence between codepoints and characters. A grapheme, on the other hand, may consist of one or more characters combined together (for example 'A' + combining-acute-accent = 'Á', as per your example); a glyph may consist of one or more graphemes ligated together (for example 'a' + zero-width-joiner + 'e' = 'æ').

And just to be even more pedantic, your statement "two different codepoint sequences can be /equal/" should really read "two different codepoint sequences can be /canonically equivalent/". Equal means equal.
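
[Editor's sketch] The distinction between "equal" and "canonically equivalent" can be made concrete. The sketch assumes a present-day D2 compiler and the Phobos module std.uni (normalize, byGrapheme), which postdates this thread; both helper names are hypothetical. It uses U+0301, the combining acute accent; the 0x2CA cited earlier in the thread is the spacing modifier letter and does not combine.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// True if the two strings are canonically equivalent, tested by
// comparing their NFC normal forms. Hypothetical helper name.
bool canonicallyEquivalent(const(char)[] a, const(char)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Number of grapheme clusters (user-perceived characters).
size_t graphemeCount(const(char)[] s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string precomposed = "\u00C1";  // one code point
    string decomposed  = "A\u0301"; // 'A' + combining acute accent
    assert(precomposed != decomposed); // not equal as code unit sequences
    assert(canonicallyEquivalent(precomposed, decomposed));
    assert(graphemeCount(decomposed) == 1); // two code points, one grapheme
}
```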
4) upper/lowercase conversion is dependend on the language used: Up1 ->
Down1, Down2

..though currently only Turkish, Lithuanian and Azeri are non-standard. As far as casing is concerned, locale is /almost/ ignorable. The functions getSimpleUppercaseMapping() and getSimpleLowercaseMapping() in etc.unicode will work fine for all languages apart from these few non-standard exceptions listed above.

A bigger problem with casing is that (for example) uppercase "ß" is "SS" - that is, strings can get longer when you case-convert them. Even etc.unicode doesn't deal with that (because it got aborted in favor of ICU before full casing was implemented).

You're probably thinking of collation (sort order), which varies /greatly/ from language to language.
Above points out only some basics you'd have to implement in your string
class.

I think the original poster was only talking about character counting, and the related problem of locating character boundaries in a UTF array. That's relatively easy, and can be hand-coded without too much trouble. The more complex stuff like casing, collation, equivalence, grapheme boundary identification, etc., is probably best left to an external library. Arcane Jill
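A hand-coded version of the "relatively easy" part might look like the sketch below. It assumes well-formed UTF-8 input and a present-day D2 compiler; the function names isBoundary, countCodePoints and indexOfCodePoint are made up for this illustration and are not Phobos functions. The idea is simply that every byte of the form 10xxxxxx is a continuation byte, so any other byte starts a new code point.

```d
import std.stdio;

/// True if `c` starts a code point (it is not a 10xxxxxx continuation byte).
bool isBoundary(char c)
{
    return (c & 0xC0) != 0x80;
}

/// Number of code points in a well-formed UTF-8 array.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)          // iterates code units, no decoding
        if (isBoundary(c))
            ++n;
    return n;
}

/// Byte index at which the n-th code point (0-based) starts,
/// or s.length if the array holds n or fewer code points.
size_t indexOfCodePoint(const(char)[] s, size_t n)
{
    size_t seen = 0;
    foreach (i, char c; s)
    {
        if (isBoundary(c))
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return s.length;
}

void main()
{
    string s = "c\u00E2u n\u00E0y";    // "câu này" in precomposed form
    writeln(s.length);                 // 9 code units (bytes)
    writeln(countCodePoints(s));       // 7 code points
    writeln(indexOfCodePoint(s, 2));   // 3: 'u' starts after the 2-byte 'â'
}
```

Note that this counts code points, not graphemes: text using combining accents would still report more "characters" than a reader sees.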
Sep 27 2004
prev sibling parent reply Benjamin Herr <ben 0x539.de> writes:
Thomas Kuehne wrote:
 Benjamin Herr <ben 0x539.de> schrieb:
 
Arcane Jill wrote:

This will happen anyway in time - by accident! ICU has a class called
UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

so far off, though.
 1) it can consist of one codepoint like 0x41 "A"

 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x301
 "Á"

`large' representation?
 [...]

 Above points out only some basics you'd have to implement in your string
 class.

hand' every time I need any of this functionality. Which is why I suggested a class (or even just a struct, to keep the semantics closer to the standard arrays), after all. -ben
Sep 27 2004
parent "Thomas Kuehne" <eisvogel users.sourceforge.net> writes:
Benjamin Herr schrieb:
 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41
 0x301 "Á"

`large' representation?

UTF-8/16/32 only deal with one codepoint at a time (except for some checking). The codepoint sequences above would be U+0000C1 "Á" and U+000041 U+000301 "Á". These are different Normalization Forms (http://www.unicode.org/reports/tr15/).
 Above points out only some basics you'd have to implement in your string
 class.

hand' every time I need any of this functionality. Which is why I suggested a class (or even just a struct, to keep the semantics closer to the standard arrays), after all.

If you ensure that the input only contains Latin (French/German...) / Greek / Cyrillic in fully NFC/NFKC form, you can assume for most cases that 1 dchar == 1 character. If you really need full string handling, I suggest you assist Arcane Jill with porting the ICU. Thomas
Sep 27 2004
prev sibling parent reply "Jaap Geurts" <jaapsen hotmail.com> writes:
I'm testing and programming in D using UTF-8 under linux to encode the


 Cool.


I have some trouble with the way D handles the char[].length property.

length does what it does. What you need is a character count, which is something different.

I see. If that is the way it is, then I'll use functions operating on strings.
Therefore I still need to use a strlen function to determine the correct


 Okay, here's one:

Thanks for the code examples.
 Phobos is not really geared up for Unicode yet. The string handling

 are defined to work only for ASCII.

I noticed. I'll use David's (Spotted Tiger) stringw.d and complement if necessary.
3. convert everything internally to UTF-16, convert it back to UTF-8


 Option 3 won't work in general. In general, you'll need to convert

 internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese,

 will be fine.

Strangely enough the Windows32 API uses the UTF16 as their encoding.
 That's just not possible. A char is a UTF-8 fragment, not a Unicode

 They're just not the same.

I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (For C uses the zero char as the terminator. Had the world programmed in Pascal then we probably wouldn't have UTF-8/
or
2. Implement a special String datatype (has been discussed earlier and


 This will happen anyway in time - by accident! ICU has a class called
 UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same thing but having slightly different interfaces. Shouldn't this be part of the language or the Phobos library? UTF-8 will always require a class of some sort... I'm not trying to add fuel to the fire, but isn't this an important aspect for version 1.0? Jaap -- D Programming from Vietnam.
Sep 27 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj90u4$b46$1 digitaldaemon.com>, Jaap Geurts says...

Strangely enough the Windows32 API uses the UTF16 as their encoding.

Most Unicode platforms use UTF-16, including the ICU library. It follows logically, therefore, that on these platforms - /including/ the Windows API - you cannot use array indexing to find the nth character. But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D834 followed by U+DD1E? It won't affect casing, sorting or anything like that because the character isn't part of any living language script. From the point of view of most general purpose algorithms, it's just another shape to draw, like a WingDings symbol. So UTF-16 is simply the best space/speed compromise for the majority of real-life languages.
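For readers who want to check how a character above U+FFFF turns into two UTF-16 code units, here is a small sketch of the surrogate arithmetic. It assumes a present-day D2 compiler; toSurrogates is a hypothetical helper written for this note, not a library function.

```d
import std.stdio;

/// Split a supplementary code point (U+10000..U+10FFFF) into its
/// UTF-16 high and low surrogate code units.
wchar[2] toSurrogates(dchar c)
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    uint v = c - 0x10000;                          // 20-bit offset
    wchar hi = cast(wchar)(0xD800 + (v >> 10));    // top 10 bits
    wchar lo = cast(wchar)(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return [hi, lo];
}

void main()
{
    auto pair = toSurrogates('\U0001D11E');   // MUSICAL SYMBOL G CLEF
    writefln("%04X %04X", cast(uint) pair[0], cast(uint) pair[1]);   // D834 DD1E

    wstring w = "\U0001D11E"w;
    writeln(w.length);   // 2: one code point, two UTF-16 code units
}
```

An application that treats each wchar as a character would therefore see two units here, which is exactly the pretence being discussed.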
I understand the issues, and UTF-8 in particular was actually designed with
backwards compatibility in mind. (For C uses the zero char as the
terminator. Had the world programmed in Pascal then we probably wouldn't
have UTF-8/

The compatibility is with ASCII, not with C. There is no Unicode meaning of U+0000, apart from "some sort of application-dependent control character".
 This will happen anyway in time - by accident! ICU has a class called
 UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: A multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the Language or the Phobos library don't you think.

Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't. I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.
UTF-8 will always require a class of some
sort...

Well, I'm more inclined to the view that truly internationalized software just won't use UTF-8 at all. UTF-16 is much more manageable for this sort of thing. UTF-8 can do the job, but it's mainly intended for text which is "mostly ASCII". Arcane Jill
Sep 27 2004
next sibling parent reply "Thomas Kuehne" <eisvogel users.sourceforge.net> writes:
Arcane Jill schrieb:
 But there is method in this madness. The Unicode characters from U+10000
 upwards are not characters from living languages. By and large, it is
 generally considered "harmless" to regard such characters as if they were
 two characters.
 For example, consider the character U+1D11E (musical symbol G clef) -
 does it really /matter/ if your application perceives instead U+D834
 followed by U+DD1E?
 It won't affect casing, sorting or anything like that because the
 character isn't part of any living language script.

Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms. Thomas
Sep 27 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj97fh$t7f$1 digitaldaemon.com>, Thomas Kuehne says...
Arcane Jill schrieb:
 But there is method in this madness. The Unicode characters from U+10000
 upwards are not characters from living languages. By and large, it is
 generally considered "harmless" to regard such characters as if they were
 two characters.
 For example, consider the character U+1D11E (musical symbol G clef) -
 does it really /matter/ if your application perceives instead U+D834
 followed by U+DD1E?
 It won't affect casing, sorting or anything like that because the
 character isn't part of any living language script.

Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms. Thomas

I will freely admit that I don't speak Chinese and don't know the intricacies of CJK. But that isn't really what I was trying to get at. Yes, obviously, if an app wants to be general, it must use proper character access via library functions. All I really meant was that if you pretend UTF-16 fragments are characters then your application will /usually/ behave sensibly. That's all. Me, I'm all in favor of proper character iteration. It's just that a lot of apps are going to want a quick-and-dirty shortcut that works more often than not, and UTF-16 is exactly that.

So, there are characters in the >U+FFFF range which are used in proper names? I didn't know that. But how badly does that change things? Does it affect casing? I suppose the answer to that depends on whether or not CJK characters /have/ case. Do they? Does it affect sorting? Not in general, since collation is a function of the /user's preferences/, not the script (that is, if an English user sorts Czechoslovakian text, they will expect to see it in "English order", not "Czechoslovakian order"), so only applications which are (a) fully internationalized, or (b) written for CJK users specifically, will need to care. For the rest of the world, two "unknown character" glyphs is not that much worse than one.

So I'd summarize as:
*) If you want to write a fully internationalized app, you need to be using a proper Unicode library, but
*) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16-bits wide.

In other words, yes, you're right. But we can usually cheat. Anyway, this sort of conversation goes on all the time on the Unicode public forum. If you want to talk about this in depth, I suggest we move the discussion there. Arcane Jill
Sep 28 2004
parent reply Benjamin Herr <ben 0x539.de> writes:
Arcane Jill wrote:
 *) If you just want basic Unicode support which works in all but exceptional
 circumstances, you can make do with UTF-16, and the pretence that characters
 are 16-bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless* -ben
Sep 28 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
Arcane Jill wrote:
 *) If you just want basic Unicode support which works in all but exceptional
 circumstances, you can make do with UTF-16, and the pretence that characters
 are 16-bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits). So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters. It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern. I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme. Sean
Sep 28 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjc7pb$2n3d$1 digitaldaemon.com>, Sean Kelly says...

I think what Jill was saying is that in most cases, UTF-16 will represent any
character you care about with a single wchar (ie. in 16 bits).  So if you code
an application to use wchars you can generally pretend as if there is a 1 to 1
correspondence between wchars and characters.  It's *possible* that some users
(Chinese perhaps) could break your application, but if this isn't your target
market then it may not be a concern.  I think the point is that if you're
worried that dchars will use up too much memory, you can usually get away with
pretending UTF-16 is not a multi-char encoding scheme.

Sean

Yes, exactly. And to some extent, the same is also true of UTF-8 if your application only cares about ASCII. /Many/ algorithms will work just fine if you pretend that /UTF-8/ is a character set, and that a char[] is an actual string of 8-bit-wide "characters". For example, concatenation (strcat, ~); finding a character or a substring (strchr, strstr, find); splitting on boundaries determined by strchr/strstr/find; tokenizing using ASCII separators such as space or tab; identification of C/C++/D comments; parsing XML; ... the list is endless. So long as you don't try to interpret or manipulate the characters you don't "understand", these encodings are robust enough to withstand most other manipulations. The major reason for preferring UTF-16 over UTF-8, however, is that a single UTF-16 code unit is likely to cover over 99% of all characters in which you are likely to be interested. The same cannot be said of UTF-8, where a single code unit covers only the ASCII characters. The major reason for preferring UTF-16 over UTF-32 is that you get a lot of wasted space with UTF-32. As noted above, >99% of your characters will only need two bytes, so that's two bytes of zeroes for each such character. Even the >U+FFFF characters are still guaranteed to have /over one third/ of their bits unused. UTF-32 text files (and strings), therefore, /will/ have between a third and a half (and maybe even more if the text is mostly ASCII) of all of their bits wasted.

So it's just a space/speed compromise, that's all. But a pretty good one in most cases. Jill
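The space side of this compromise is easy to measure for any given text. The sketch below assumes a present-day D2 compiler and Phobos' std.utf (toUTF16 and toUTF32); the sample is the start of the original poster's Vietnamese sentence, written with escapes so that the source file's own encoding does not matter.

```d
import std.stdio;
import std.utf : toUTF16, toUTF32;

void main()
{
    // "câu này có những chữ", precomposed: 20 code points
    string  s8  = "c\u00E2u n\u00E0y c\u00F3 nh\u1EEFng ch\u1EEF";
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    writeln(s8.length  * char.sizeof);    // 27 bytes as UTF-8
    writeln(s16.length * wchar.sizeof);   // 40 bytes as UTF-16
    writeln(s32.length * dchar.sizeof);   // 80 bytes as UTF-32
}
```

The numbers depend entirely on the text: for this particular sample UTF-8 is the smallest of the three, while text made up mostly of characters above U+07FF would tip the balance toward UTF-16.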
Sep 29 2004
prev sibling next sibling parent reply Thomas Kuehne <eisvogel users.sourceforge.net> writes:
Benjamin Herr Schrieb:
 *) If you just want basic Unicode support which works in all but
 exceptional circumstances, you can make do with UTF-16, and the pretence
 that characters are 16-bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Potentially codepoints are 64 bit. The highest currently assigned codepoint fits in 32 bits. For the majority of living languages, all codepoints fit in 16 bits. The bit-size of a codepoint has nothing to do with multi-codepoint "chars". Again, if you ensure that neither Korean/Hebrew/Arabic, (Zero-Width-)Joiners nor combining accents are used, you might treat a 16-bit char as a "character" in most cases. Exceptions: sorting, display and advanced text analysis. Thomas
Sep 27 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjc8ag$2nb2$1 digitaldaemon.com>, Thomas Kuehne says...

Potentially codepoints are 64 bit.

First I've heard of it. Do you have a source for this information? So far as I am aware, the UC are /adamant/ that they will never go beyond 21 bits. Programming languages tend to use 32 bits because (a) 32 bits is a more natural length for computers, and (b) they're not taking chances - once upon a time the UC thought that 16 bits would be sufficient. But I have never heard /anyone/ claim that codepoints are potentially 64 bits before. Whence does this originate? Arcane Jill
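The 21-bit figure follows directly from the top of the Unicode range, U+10FFFF. A quick check, assuming a present-day D2 compiler and the bsr (bit scan reverse) intrinsic from core.bitop:

```d
import std.stdio;
import core.bitop : bsr;

void main()
{
    uint maxCodePoint = 0x10FFFF;             // highest valid Unicode code point
    int bitsNeeded = bsr(maxCodePoint) + 1;   // index of the top set bit, plus one

    writeln(bitsNeeded);                      // 21
    writeln(dchar.sizeof * 8);                // 32 bits in a dchar
    writeln(32 - bitsNeeded);                 // 11 bits of every dchar are always zero
}
```

Eleven unused bits out of 32 is just over a third, which matches the "over one third" estimate for UTF-32 given earlier in the thread.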
Sep 29 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
Arcane Jill wrote:
 *) If you just want basic Unicode support which works in all but exceptional
 circumstances, you can make do with UTF-16, and the pretence that characters
 are 16-bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:

"code unit" = the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.

"code point" = the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a codepoint can only be stored in a dchar.

"character" = officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. Conversely, every codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character. Unicode characters are often written in the form U+#### (for example, U+20AC, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way). If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.

"grapheme" = a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).

"glyph" = one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.
For more detailed information, as I suggested above, please feel free to go to the Unicode website, and get all the details from the people who organize the whole thing. Arcane Jill
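To make the terms in this summary concrete, the following sketch counts the same short text four different ways. It assumes a present-day D2 compiler with Phobos (std.utf for the conversions, std.uni.byGrapheme for grapheme segmentation), none of which is part of the 2004 library this thread discusses.

```d
import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : toUTF16, toUTF32;

void main()
{
    // 'a' + COMBINING ACUTE ACCENT, followed by MUSICAL SYMBOL G CLEF
    string s = "a\u0301\U0001D11E";

    writeln(s.length);                  // 7 UTF-8 code units  (1 + 2 + 4)
    writeln(toUTF16(s).length);         // 4 UTF-16 code units (1 + 1 + 2)
    writeln(toUTF32(s).length);         // 3 code points
    writeln(s.byGrapheme.walkLength);   // 2 graphemes
}
```

Which of the four numbers is "the length" depends on what the caller is doing: allocating storage, indexing, or showing text to a person.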
Sep 28 2004
parent J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
 In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
 
Arcane Jill wrote:

*) If you just want basic Unicode support which works in all but exceptional
circumstances, you can make do with UTF-16, and the pretence that characters are
16-bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:

Cool. I added this to a wiki page: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
 
 "code unit" = the technical name for a single primitive fragment of either
 UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or
 dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32
 fragment to express this concept.
 
 "code point" = the technical name for the numerical value associated with a
 character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D,
 a codepoint can only be stored in a dchar.
 
 "character" = officially, the smallest unit of textual information with
 semantic meaning. Practically speaking, this means either (a) a control code; (b)
 something printable; or (c) a combiner, such as an accent you can place over
 another character. Every character has a unique codepoint. Conversely, every
 codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character.
 Unicode characters are often written in the form U+#### (for example, U+20AC,
 which is the character corresponding to codepoint 0x20AC).
 
 As an observation, over 99% of all the characters you are likely to use, and
 which are involved in text processing, will occur in the range U+0000 to
 U+FFFF.
 Therefore an array of sixteen-bit values interpreted as characters will likely
 be sufficient for most purposes. (A UTF-16 string may be interpreted in this
 way). If you want that extra 1%, as some apps will, you'll need to go the whole
 hog and recognise characters all the way up to U+10FFFF.
 
 "grapheme" = a printable base character which may have been modified by zero or
 more combining characters (for example 'a' followed by combining-acute-accent).
 
 "glyph" = one or more graphemes glued together to form a single printable
 symbol. The Unicode character zero-width-joiner usually acts as the glue.
 
 For more detailed information, as I suggested above, please feel free to go to
 the Unicode website, and get all the details from the people who organize the
 whole thing.
 
 Arcane Jill
 
 

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Sep 29 2004
prev sibling parent "Ben Hinkle" <bhinkle mathworks.com> writes:
[snip]
 This will happen anyway in time - by accident! ICU has a class called
 UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: A multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the Language or the Phobos library don't you think.

Well, ICU is not really anything to do with D. It was originally a Java

 then got ported to C and C++. We'll have it in D, too, eventually. It's

 fault if ICU defines a string class. But I don't think Walter will be
 complaining - the ICU class isn't a simple "replacement" or "alternative"

 char[] - it provides full Unicode functionality, in a way that char[]

 I don't think we'll be seeing "a multitude of String classes" either. To

 honest, I don't think even ICU's UnicodeString class will ever become any

 of D "standard", because you won't be able to do implicit casting to/from

Is there a link to the String class API? I'm curious to see what the differences are from a function-based API. Is the basic difference that the String's encoding is determined at runtime? Maybe a struct would be better than a class:

  struct ICUString {
    enum Encoding {UTF8, UTF16, UTF32,...};
    uint len;
    void* data;
    Encoding encoding;
    ... member functions like opIndex, etc...
  }
  ... functions like std.string with ICUString instead of char[] or wchar[] or dchar[]...

[snip]
Sep 27 2004