digitalmars.D - The Unicode Casing Algorithms

reply Arcane Jill <Arcane_member pathlink.com> writes:
Sean makes some good points in his posts, but the D character set is Unicode by
definition. Let me go through this:

Some languages don't have upper and lowercase letters.

This is true, but it's not relevant. It was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.
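
A minimal sketch of that lookup, in Python rather than D, only because its standard library already ships the Unicode tables. It is an illustration, not the proposed D module, and the variable names are made up for the example.

```python
import unicodedata

# Cyrillic capital ZHE (U+0416) and small ZHE (U+0436).
upper_zhe = "\u0416"
lower_zhe = "\u0436"

# str.lower() applies the Unicode case mappings; no locale is consulted.
lowered = upper_zhe.lower()

print(unicodedata.name(upper_zhe))  # CYRILLIC CAPITAL LETTER ZHE
print(unicodedata.name(lowered))    # CYRILLIC SMALL LETTER ZHE
```

The mapping is a property of the character itself, which is why no "Russian mode" has to be switched on first.
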
And many others don't
convert properly using the default routines,

Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.
even if the ASCII character set
contains all the appropriate symbols.

ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.
So tolower(x)==tolower(y) may yield the
incorrect result if the string contains characters beyond the usual 52 ASCII
English values.

Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated. Having said that, one can't deprecate a function until one has something with which to replace it. Hmmm....
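
To make the failure concrete, here is a rough sketch (Python, standard library only). `ascii_tolower`, `ascii_iequal` and `unicode_iequal` are hypothetical names invented for this example; the first mimics what C's tolower() does in the "C" locale.

```python
def ascii_tolower(ch):
    """Mimic C's tolower() in the "C" locale: only A-Z are changed."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch

def ascii_iequal(x, y):
    """Case-insensitive equality built on the ASCII-only mapping."""
    return len(x) == len(y) and all(
        ascii_tolower(a) == ascii_tolower(b) for a, b in zip(x, y)
    )

def unicode_iequal(x, y):
    """Same idea, but using the Unicode-aware lowercase mapping."""
    return x.lower() == y.lower()

print(ascii_iequal("CAF\u00c9", "caf\u00e9"))    # False
print(unicode_iequal("CAF\u00c9", "caf\u00e9"))  # True
```

Comparing lowercased strings is still only an approximation of a proper caseless match, but it is enough to show where the ASCII routine gives up.
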
I'd like to assume that a D string is a sequence of characters,
unicode or otherwise, and I think it would be a mistake to provide methods that
don't work properly outside of ASCII English. While I'm not much of an expert
on localization, I do think that the library should be designed with
localization in mind.

Would you like to know what the localization issues ARE?

In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become part of the default, locale-independent mappings. Lithuanian retains the dot in a lowercase i when followed by accents.

I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed.

For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. As with French, the decision not to display the dot is a mere rendering issue.
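
As a rough sketch of "default algorithm plus special functions" (Python, standard library only): `turkish_upper` and `turkish_lower` are hypothetical helpers invented here, and they handle only the dotted/dotless i rule, nothing else about Turkish or Azeri.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

def turkish_upper(text):
    """Hypothetical helper: uppercase with the Turkish/Azeri rule i -> dotted I."""
    return text.replace("i", DOTTED_CAPITAL_I).upper()

def turkish_lower(text):
    """Hypothetical helper: lowercase with the Turkish/Azeri rule I -> dotless i."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()

# The default, locale-independent mapping keeps the ASCII behaviour.
print("istanbul".upper())         # ISTANBUL
# The special-case functions apply the Turkish/Azeri pairing instead.
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ISPARTA"))   # ısparta
```

The default functions stay simple and predictable, and only callers who know they are handling Turkish or Azeri text opt into the special pair.
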
For a more thorough explanation, Scott Meyers discusses the problem in one of
his "Effective C++" books, the second one IIRC.

Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. There is neither any need nor any sense in our reinventing the wheel here.

To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. These are purposefully in machine-readable form, and are easily parsed.

Case COMPARISONS are defined in Unicode Technical Report #30, Character Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable.

Collation, as we know, IS locale dependent. This is even more tricky, but everything you need to know is defined in Unicode Technical Standard #10, Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/).

If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future.

Arcane Jill
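
One reason case comparison is trickier than case conversion: folding can change the length of a string. A minimal sketch (Python, standard library only; `caseless_equal` is a hypothetical name for the example, and it does not attempt normalization or collation):

```python
def caseless_equal(a, b):
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()

# German sharp s folds to "ss", so a simple lowercase compare misses the match.
print("Stra\u00dfe".lower() == "STRASSE".lower())  # False
print(caseless_equal("Stra\u00dfe", "STRASSE"))    # True
```

Sorting the same words correctly for a given language is a separate problem, which is what the collation algorithm addresses.
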
Jun 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9p8dn$2j2i$1 digitaldaemon.com>, Arcane Jill says...

Typo correction:

in tradition Irish, the letter i is written without a font

should read:
in traditional Irish, the letter i is written without a DOT

Sorry about that, Jill
Jun 04 2004
parent "Kris" <someidiot earthlink.dot.dot.dot.net> writes:
If it turns out that Jill is Irish, this spells "imminent joviality" to me:

The next time Matthew, Jill, and I disagree on the same thread, some canny
wit is bound to make a fricking wisecrack about "There was this Englishman,
Irishman, and Scotsman ...".

I'll stake ten bucks, and a slightly worn pocket-protector, that it will be
Brad Anderson ... any takers?

<g>

Jun 04 2004
prev sibling next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Just wanted to note that I have a "real" Unicode casing module in the 
works. In fact, it is complete but not yet well tested.

I'll try to finish it up and post it here tonight.



Jun 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...
Just wanted to note that I have a "real" Unicode casing module in the 
works. In fact, it is complete but not yet well tested.

I'll try to finish it up and post it here tonight.

Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:
      assert(String("\u0065\u0301") == String("\u00E9"));

would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one. And not forgetting the conversions:
       // String s;
       dchar[] a = s.nfc();
       dchar[] b = s.nfd();
       dchar[] c = s.nfkc();
       dchar[] d = s.nfkd();
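
For illustration only, here is roughly how such a class could behave, sketched in Python with the standard unicodedata module. The `String` class below is hypothetical (it is neither Hauke's module nor an existing D type); the method names simply mirror the ones above.

```python
import unicodedata

class String:
    """Hypothetical wrapper whose equality ignores canonically equivalent spellings."""

    def __init__(self, text):
        self.text = text

    def nfc(self):
        return unicodedata.normalize("NFC", self.text)

    def nfd(self):
        return unicodedata.normalize("NFD", self.text)

    def nfkc(self):
        return unicodedata.normalize("NFKC", self.text)

    def nfkd(self):
        return unicodedata.normalize("NFKD", self.text)

    def __eq__(self, other):
        if not isinstance(other, String):
            return NotImplemented
        # Compare canonical forms, so e + combining acute equals precomposed e-acute.
        return self.nfc() == other.nfc()

    def __hash__(self):
        # Keep hashing consistent with the normalized equality above.
        return hash(self.nfc())

assert String("\u0065\u0301") == String("\u00E9")
print(len(String("\u00E9").nfd()))  # 2
```

The raw strings still differ in length, which is exactly why the comparison has to go through a normalized form.
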

If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out).

Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here.

Arcane Jill
Jun 04 2004
next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...
Just wanted to note that I have a "real" Unicode casing module in the
works. In fact, it is complete but not yet well tested.

I'll try to finish it up and post it here tonight.

Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:
      assert(String("\u0065\u0301") == String("\u00E9"));

would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.

Instead of making a String class, another approach would be to write
 char[] normalize(char[])
that uses COW (copy-on-write) like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired, an equivalent to cmp can be devised that takes normalization into account, much like std.string.icmp takes case into account.

A class for String came up a while ago and the basic argument against it was that it wasn't needed - functions work fine. Maybe we'll get to the point where a class is needed, but the mental model of <length, ptr> and COW functions is so simple it would be a big change to give it up.

-Ben
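
A sketch of that function-based model (Python, standard library only). `normalize` and `ncmp` are hypothetical names chosen to echo the suggestion; the choice of NFC is an assumption, since the post does not name a normalization form.

```python
import unicodedata

def normalize(s):
    """Return s in NFC; hand back the original object when nothing changes (COW style)."""
    normalized = unicodedata.normalize("NFC", s)
    return s if normalized == s else normalized

def ncmp(a, b):
    """Three-way compare after normalization: negative, zero, or positive."""
    na, nb = normalize(a), normalize(b)
    return (na > nb) - (na < nb)

print(ncmp("\u0065\u0301", "\u00E9"))  # 0
print(ncmp("a", "b"))                  # -1
```

Plain strings stay plain strings, and callers who have already normalized their input can keep using the ordinary comparison operators.
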
Jun 04 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9ppdu$c90$1 digitaldaemon.com>, Ben Hinkle says...

Instead of making a String class another approach would be to write
 char[] normalize(char[])
that uses COW like std.string and use the regular comparison. That is the
model used by tolower and friends. If it is desired an equivalent to cmp
can be devised that takes normalization into account much like
std.string.icmp takes case into account.

Yup, there are all sorts of possible approaches. I could think of a few more too (e.g. optimized comparisons which only need to test the start of the string instead of pre-normalizing all of it). But anyway - I'm keen to see which one Hauke Duden has come up with. I certainly look forward to it. Jill
Jun 04 2004
prev sibling next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...
 
Just wanted to note that I have a "real" Unicode casing module in the 
works. In fact, it is complete but not yet well tested.

I'll try to finish it up and post it here tonight.

Wow! I'm so impressed. How's it done? Have you defined a String class?

I'm afraid I don't deserve your praise ;). While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.
 I ask because, as I'm sure you know, the Unicode character sequence
 '\u0065\u0301' (lowercase e followed by combining acute accent) should compare
 equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they
 won't compare as equal in a straightforward dchar[] == test. (Even the lengths
 are different). I imagined crafting a String class which knew all about Unicode
 normalization, so that:
 
 
     assert(String("\u0065\u0301") == String("\u00E9"));

would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.

I think that Unicode is so complicated that doing the case foldings and normalizations on-the-fly for every comparison is a bit of an overkill and could also introduce unnecessary performance bottlenecks. For my own programs I have long settled on only comparing strings the simple way (i.e. character for character). That's good enough if you don't have to work on strings that come from outside your program. For all other situations you can use a normalize function that is called once when the string enters the program.
 If your module is already complete, I guess it's too late for me to point you in
 the direction of UPR, a binary format for Unicode character properties (much
 easier to parse than the code-charts). Info is at:
 http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might
 want to bear it in mind for the future, unless you've already got your own code
 for parsing the code-charts (for when the next version of Unicode comes out).

Thanks for that info - I will check it out. But as a matter of fact I do already have my own tool for parsing the Unicode data ;). It is more convenient for me, since the module works with static arrays that contain the data in compressed form (a relatively simple RLE algorithm, but effective enough to reduce 2 MB worth of tables to 12 KB).
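
The general idea of run-length encoding a property table can be sketched as follows (Python, standard library only). This is not Hauke's tool or format: the table contents (general category per code point), the 0x10000 limit and all function names are assumptions made for the example.

```python
import unicodedata

def build_table(limit=0x10000):
    """Flat table: general category of every code point below limit."""
    return [unicodedata.category(chr(cp)) for cp in range(limit)]

def rle_encode(table):
    """Collapse consecutive equal entries into [value, count] runs."""
    runs = []
    for value in table:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] runs back into the flat table."""
    table = []
    for value, count in runs:
        table.extend([value] * count)
    return table

table = build_table()
runs = rle_encode(table)
print(len(table), "entries compress to", len(runs), "runs")
```

Property tables compress well this way because neighbouring code points usually share the same property value, for instance long stretches of CJK ideographs or unassigned code points.
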
 Anyway, good luck. I'm really pleased to see someone taking all this seriously.
 There are just too many people of the "ASCII's good enough for me" ilk, and it
 makes a refreshing change to see D and its supporters taking the initiative
 here.

Thanks ;). I agree that far too many people ignore Unicode (right until their application needs to be translated to Japanese, for example). And D is in the position to make it easier for people to do the right thing from the start. We "only" have to make sure that Phobos implements proper Unicode support. Hauke
Jun 04 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9q5sl$vcj$1 digitaldaemon.com...
 While I'm also working on a string class, the module I'm talking about
 is a set of simple global functions like charToLower, charToUpper,
 charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support
 for the full unicode character range.

How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:
    import std.ctype;
with:
    import std.utype;
and they'll get the unicode-capable versions of the same functions.
Jun 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says...

replace:
    import std.ctype;
with:
    import std.utype;

Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man. I must be working in the wrong field.

Jill :(
Jun 04 2004
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
replace:
   import std.ctype;
with:
   import std.utype;

Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man.

Thanks for cheering me on AJ ;). But let's wait and see what Walter thinks about it when he has it in his hands - especially about the function names :). Hauke
Jun 04 2004
prev sibling next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
 "Hauke Duden" <H.NS.Duden gmx.net> wrote in message
 news:c9q5sl$vcj$1 digitaldaemon.com...
 
While I'm also working on a string class, the module I'm talking about
is a set of simple global functions like charToLower, charToUpper,
charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support
for the full unicode character range.

How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:

I had three reasons for choosing these function names:

1) isdigit etc. do not conform to the convention that new words should be capitalized.

2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.

3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace.

Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.

Hauke
Jun 04 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9qjqr$1jfv$1 digitaldaemon.com...
 I had three reasons for choosing these function names:

 1) isdigit etc. do not conform to the convention that new words should
 be capitalized.

I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.
 2) because of D's overloading rules (with definitions in one module
 being able to completely hide those in others) I'm reluctant to choose
 global names that could also be used in another context.

I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.
 3) I wanted to improve on ctype in a few places and also keep a bit
 closer to the Unicode terms. For example, isspace tests for things that
 separate words (whitespace in ASCII). In Unicode that's more than just
 whitespace, thus the name doesn't fit. I also think charIsSpace should
 check for actual space characters instead of all whitespace.

If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.
 Of course we could create a module std.utype which simply defines
 std.c.ctype compatible aliases. Or even better, simply call the unicode
 functions directly from std.c.ctype so that there is no wrong choice
 anymore.

I'd do that if the utype functions didn't add significant bloat, but they do (I presume).
Jun 04 2004
next sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
I had three reasons for choosing these function names:

1) isdigit etc. do not conform to the convention that new words should
be capitalized.

I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.

Well, if you're not going to make the cut now, when then? D is a new language and I think the standard library should at least be consistent.
2) because of D's overloading rules (with definitions in one module
being able to completely hide those in others) I'm reluctant to choose
global names that could also be used in another context.

I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.

I can think of a few conflicts. In fact, in one of my own applications I had a function called "isSeparator" that had nothing at all to do with strings. Regarding the prefixes: I know that you can always access the functions in a fully qualified way, but I think having to do that can be a pain. Especially when you can sometimes get away without it and at other times you have to use the module name.
3) I wanted to improve on ctype in a few places and also keep a bit
closer to the Unicode terms. For example, isspace tests for things that
separate words (whitespace in ASCII). In Unicode that's more than just
whitespace, thus the name doesn't fit. I also think charIsSpace should
check for actual space characters instead of all whitespace.

If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.

That's precisely why it is not called isspace in my module :). I wanted to make it obvious that it has different behaviour. The function that does what ctype.isspace does is called charIsSeparator (Unicode calls such characters "separators"). charIsSpace on the other hand tests for characters with the Unicode separator subtype "space", which does NOT include linebreaks. That is as it should be, I think. However, I'd appreciate any ideas for a better name for charIsSpace that makes it obvious that it tests for spaces without actually using the word "space". I couldn't think of any.
Of course we could create a module std.utype which simply defines
std.c.ctype compatible aliases. Or even better, simply call the unicode
functions directly from std.c.ctype so that there is no wrong choice
anymore.

I'd do that if the utype functions didn't add significant bloat, but they do (I presume).

Well, there's not THAT much overhead. But I guess every little bit could be too much for some specialized applications. For example, it would probably not be a good choice for embedded systems. Right now the module will increase executable size by 12 KB and uses about 2 MB of RAM. The RAM usage could be reduced quite a bit, but then the character lookup would be about 3 times slower (right now only a comparison and a simple array indexing operation are needed).

Hauke
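
The size-versus-speed trade-off can be sketched like this (Python, standard library only). It is not the module's actual layout: the property stored, the small `LIMIT` and the function names are assumptions, and no timing figures are implied.

```python
import bisect
import unicodedata

LIMIT = 0x3000  # small range to keep the sketch quick; an arbitrary choice

# Expanded form: one entry per code point, looked up by direct indexing.
flat = [unicodedata.category(chr(cp)) for cp in range(LIMIT)]

# Compact form: only the first code point of each run, plus the run's value.
starts, values = [], []
for cp, cat in enumerate(flat):
    if not values or values[-1] != cat:
        starts.append(cp)
        values.append(cat)

def lookup_flat(cp):
    """One bounds check and one array index."""
    return flat[cp]

def lookup_runs(cp):
    """Binary search for the run containing cp: less memory, more work per lookup."""
    return values[bisect.bisect_right(starts, cp) - 1]

print(len(flat), "flat entries versus", len(starts), "runs")
```

Both lookups return the same answers; the choice is only about whether memory or per-character time matters more for the target system.
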
Jun 04 2004
prev sibling next sibling parent reply "Kris" <someidiot earthlink.dot.dot.dot.net> writes:
"Walter"  wrote:
 Of course we could create a module std.utype which simply defines
 std.c.ctype compatible aliases. Or even better, simply call the unicode
 functions directly from std.c.ctype so that there is no wrong choice
 anymore.

I'd do that if the utype functions didn't add significant bloat, but they do
(I presume).

Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff. It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ... Sorry to sound a bit snotty, but this is surely a blatant double-standard <g> - Kris
Jun 04 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message
news:c9qub0$22er$1 digitaldaemon.com...
 "Walter"  wrote:
 Of course we could create a module std.utype which simply defines
 std.c.ctype compatible aliases. Or even better, simply call the unicode
 functions directly from std.c.ctype so that there is no wrong choice
 anymore.

I'd do that if the utype functions didn't add significant bloat, but they do
(I presume).

 Well then, Walter. If that's the case, perhaps you'd apply the same rule to
 printf usage within the root object? As we all know, printf drags along all
 the floating point formatting and boatloads of other, uhhh, errrrr ...
 stuff.

 It absolutely does not belong in the root object, and there's only a dozen
 or so references to it within debug code inside Phobos ...

 Sorry to sound a bit snotty, but this is surely a blatant double-standard
 <g>

But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
Jun 04 2004
next sibling parent "Kris" <someidiot earthlink.dot.dot.dot.net> writes:
Printf is certainly useful, but one shouldn't have to pay the bloat price
when they don't even use it. Placing a printf call within Object.d (the
print() method) adds zero value, and has negative impact.

It's great not having to explicitly import printf ... but having it
automatically loaded where it's never actually used is so totally bogus.

BTW, there's actually only around 20 calls to Object.print(); All within
Phobos (as Ben Hinkle pointed out). If you remove those, along with
Object.print(), the problem just goes away ...

"Walter" wrote:
 But everyone needs printf! And printf doesn't add 2Mb, either, last I
 checked <g>.

Jun 04 2004
prev sibling parent reply "Kris" <someidiot earthlink.dot.dot.dot.net> writes:
"Walter"  wrote:
 But everyone needs printf! And printf doesn't add 2Mb, either, last I
 checked <g>.

Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print() - Kris
Jun 04 2004
parent "Walter" <newshound digitalmars.com> writes:
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message
news:c9r8sq$2hnt$1 digitaldaemon.com...
 "Walter"  wrote:
 But everyone needs printf! And printf doesn't add 2Mb, either, last I
 checked <g>.

Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print()

Yeah, it probably should go from that.
Jun 04 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...

If you're changing what, say, isspace does for ASCII characters, then I
think that's a mistake.

Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space.

Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.

Arcane Jill

(By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).
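
The two properties can be put side by side in a short sketch (Python, standard library only). `ascii_isspace` imitates C's isspace() in the "C" locale, and `char_is_space` is a hypothetical stand-in that tests the Unicode general category Zs; neither is taken from Hauke's module.

```python
import unicodedata

def ascii_isspace(ch):
    """C isspace() in the "C" locale: space, tab, newline, vertical tab, form feed, carriage return."""
    return ch in " \t\n\v\f\r"

def char_is_space(ch):
    """True for characters in Unicode general category Zs (space separator)."""
    return unicodedata.category(ch) == "Zs"

for ch in (" ", "\n", "\u00A0"):
    print(repr(ch), ascii_isspace(ch), char_is_space(ch))
```

Only the ordinary space satisfies both tests; the newline passes the first alone and the non-breaking space passes the second alone, so neither function can stand in for the other.
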
Jun 04 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <c9rqvu$bah$1 digitaldaemon.com>, Arcane Jill says...
In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...

If you're changing what, say, isspace does for ASCII characters, then I
think that's a mistake.

Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space. Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.

But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters. Personally, I'd prefer that the new functions work for both Unicode and for ASCII, much like the locale-based functions do in C++. Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D. Sean
Jun 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9sob6$1qpn$1 digitaldaemon.com>, Sean Kelly says...

But that doesn't break the ASCII functions for the ASCII character set, it only
means that new ones must be provided for Unicode characters.  Personally, I'd
prefer that the new functions work for both Unicode and for ASCII,

Obviously you are aware of this, but your choice of words gives a strange impression here. Clearly, ASCII characters *are* Unicode characters. ASCII is but a small subset of Unicode. The new functions are defined for all Unicode characters, therefore they are defined for all ASCII characters.
much like the
locale-based functions do in C++.  Localization in C++ is probably the most
complex part of the language, however, and I'd like to see if we can't find a
way to simplify it a bit in D.

Agreed, but I'm not clear what you're asking. I've been involved with a text-to-speech project which we had to internationalize and localize for a whole bunch of languages. That was in C++, so I know the issues. Using Unicode made things a whole lot easier, but localization is about a lot more than selecting a character set. Stuff like what character you use for a decimal point, how you punctuate sentences, what kind of quotation marks you use, and so on, are all relevant to localization, and it would be nice to address these. But these issues are independent of the assigned properties of Unicode characters. But I never did like the way C handled locales. Java's tactic made more sense.

With regard to those character properties, I couldn't quite figure out if you were agreeing or disagreeing. I suspect that we are all in agreement really. Certainly I would hope so, because actually there is no decision to be taken. And for obvious reasons:

(1) The behavior of the ctype functions for the ASCII range is well and truly defined by years of precedent, and cannot be changed.

(2) Similarly, the Unicode standard, and its various classifications, is an established international standard, and one which we are also not at liberty to change.

So, either we implement Unicode properties or we don't, but if we want to be standards compliant, we /cannot/ change one single Unicode property - not even to make it compatible with isspace(), whether we agree with it or not. To do so would place us at odds with - well, basically, the rest of the world.

It follows, therefore, that we need BOTH functions - for instance, we need the old fashioned ctype isspace() AND we need the new Unicode function charIsSpace(). We need the old fashioned ctype isalpha() AND we need the new Unicode function charIsLetter(). Supplying new functions cannot possibly break the old ones! But as Hauke and I have pointed out, in general they do not agree with each other, even in the ASCII range, and certainly not in the range 0x00 to 0xFF (the range for which the ctype functions are usually implemented).

Java has a nice solution, which we might like to copy. Java implements the Unicode Standard (at least for Unicode 2.0), but they ALSO implement ADDITIONAL functions, such as isWhitespace(), isJavaIdentifierStart(), and so on.

<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this....

In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C?

My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper().

My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.

Arcane Jill
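
The locale dependence of toupper(0xD3) can be imitated without touching the system locale. This is a rough sketch in Python, not C or D: the helper toupper_in is hypothetical, and it models a locale only by its 8-bit encoding (ISO-8859-1 for a French setup, KOI8-R for a Russian one, both assumptions made for illustration).

```python
def toupper_in(byte, encoding):
    """Uppercase one byte the way a locale-bound toupper() might:
    interpret it in the given 8-bit encoding, uppercase it, re-encode it.
    The byte is returned unchanged when the result does not fit in one byte."""
    ch = bytes([byte]).decode(encoding)
    try:
        out = ch.upper().encode(encoding)
    except UnicodeEncodeError:
        return byte
    return out[0] if len(out) == 1 else byte

# Same byte, different answers, because the byte means different characters.
print(hex(toupper_in(0xD3, "latin-1")))   # 0xd3: already an uppercase O-acute
print(hex(toupper_in(0xD3, "koi8_r")))    # 0xf3: a lowercase Cyrillic letter here

# With Unicode code points there is nothing to select: U+0436 always maps to U+0416.
print(hex(ord("\u0436".upper())))         # 0x416
```

The first two calls differ only in the assumed encoding, which is the whole problem; the last call has no such parameter.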
Jun 05 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:c9t05d$26ft$1 digitaldaemon.com...
 <ping!> I've just realized what you're referring to. How dumb of me not to have
 seen it earlier! Ok, let me go through this.... In C, the ctype functions such
 as toupper(c) will return a different value for a given codepoint c, depending
 on the current system default locale. toupper(0xD3) might give a different
 answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE
 WITH UNICODE. However, D implements toupper(), so the question is, should
 toupper() be locale dependent in D as it is in C. My immediate thought would be
 no. No way. The C system locale selects a character encoding upon which
 toupper() et al operate, but there is only one D character encoding standard. It
 is Unicode - the superset of all the others. And in Unicode, you *don't* call
 toupper(), you call Hauke's new function - charToUpper(). My inclination is that
 the old ctype functions should be defined only for the ASCII range (though
 having them take a dchar is harmless), and within that range, they be compatible
 with what C did.


I've pretty much come to the same conclusions:

1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions.

2) The ctype functions will just return 0 for non-ASCII characters.

3) There will be a separate set of functions for unicode, with different names.

Thanks to you and Hauke for clarifying the issues with this.
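
Points 2 and 3 can be sketched as two separately named predicates. This is an illustration in Python, not the actual std.ctype or Hauke's module: ascii_isalpha and char_is_letter are hypothetical names, and "letter" is assumed here to mean any Unicode general category beginning with L.

```python
import unicodedata

def ascii_isalpha(code):
    """ctype-style isalpha(): answers only for ASCII, False for everything else."""
    return 0 <= code < 0x80 and chr(code).isalpha()

def char_is_letter(code):
    """Unicode letter test: general category Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(chr(code)).startswith("L")

for code in (ord("A"), 0x00E9, 0x0436, ord("1")):
    print(f"U+{code:04X}  ascii_isalpha={ascii_isalpha(code)}  "
          f"char_is_letter={char_is_letter(code)}")
```

Keeping the names distinct means code written against the ASCII function never changes behavior when the Unicode one is added.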
Jun 05 2004
parent Sean Kelly <sean f4.ca> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:c9t05d$26ft$1 digitaldaemon.com...
 
<ping!> I've just realized what you're referring to. How dumb of me not to have
seen it earlier! Ok, let me go through this.... In C, the ctype functions such
as toupper(c) will return a different value for a given codepoint c, depending
on the current system default locale. toupper(0xD3) might give a different
answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE
WITH UNICODE. However, D implements toupper(), so the question is, should
toupper() be locale dependent in D as it is in C. My immediate thought would be
no. No way. The C system locale selects a character encoding upon which
toupper() et al operate, but there is only one D character encoding standard. It
is Unicode - the superset of all the others. And in Unicode, you *don't* call
toupper(), you call Hauke's new function - charToUpper(). My inclination is that
the old ctype functions should be defined only for the ASCII range (though
having them take a dchar is harmless), and within that range, they be compatible
with what C did.


Thanks for putting it so clearly. I'm a bit rusty with C locale stuff and had forgotten about the default locale business. I agree. I would prefer to have a set of basic functions that are not locale dependent for the ASCII character set and have D provide its own set of unicode functions.
 I've pretty much come to the same conclusions:
 
 1) D's character types are unicode. They aren't indices into
 locale-dependent code pages. The library functions are unicode. If you have
 data that's in a locale-dependent code page, convert it to unicode before
 using library string functions.
 
 2) The ctype functions will just return 0 for non-ASCII characters.
 
 3) There will be a separate set of functions for unicode, with different
 names.

Sounds fantastic. Sean
Jun 05 2004
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 (By the way, I couldn't download the zip file. Mozilla Firebird freaked out when
 I tried to click on the link).

It is now also available here: http://www.hazardarea.com/unichar.zip Hauke
Jun 05 2004
prev sibling parent reply David L. Davis <SpottedTiger yahoo.com> writes:
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says...
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9q5sl$vcj$1 digitaldaemon.com...
 While I'm also working on a string class, the module I'm talking about
 is a set of simple global functions like charToLower, charToUpper,
 charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support
 for the full unicode character range.

How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:
       import std.ctype;
with:
       import std.utype;
and they'll get the unicode-capable versions of the same functions.

Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit? Thxs for your answer in advance. :))
Jun 04 2004
parent "Walter" <newshound digitalmars.com> writes:
"David L. Davis" <SpottedTiger yahoo.com> wrote in message
news:c9qmr7$1nrj$1 digitaldaemon.com...
 Walter: The above sounds like a good idea for the dchar character(s) in
 std.ctype, but what about for strings that use std.string functions and are
 defined as char[], or is there a dchar[] string type I've missed somewhere? And
 if there isn't, shouldn't the strings really be defined as dchar[] to work with
 unicode 32-bit?

Check out the std.utf package, which will decode char[] into a dchar.
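
What "decode char[] into a dchar" amounts to is turning a run of UTF-8 code units into one code point at a time. The sketch below is Python, not std.utf: the function decode and its (code point, next index) return convention are assumptions made for illustration, and the continuation bytes are validated by Python's built-in strict UTF-8 codec rather than by hand.

```python
def decode(units, index):
    """Decode one code point from UTF-8 code units starting at index.
    Returns (code_point, index_of_next_code_point)."""
    lead = units[index]
    if lead < 0x80:
        length = 1
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02X}")
    # The strict codec rejects truncated or malformed sequences.
    ch = units[index:index + length].decode("utf-8")
    return ord(ch), index + length

text = "caf\u00E9".encode("utf-8")   # 5 code units, 4 characters
i, points = 0, []
while i < len(text):
    cp, i = decode(text, i)
    points.append(cp)
print(len(text), [hex(p) for p in points])
```

The string stays compact as 8-bit code units, and the per-character functions only ever see whole code points.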
Jun 04 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:c9pneo$91a$1 digitaldaemon.com...
 I ask because, as I'm sure you know, the Unicode character sequence
 '\u0065\u0301' (lowercase e followed by combining acute accent) should compare as
 equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they
 won't compare as equal in a straightforward dchar[] == test. (Even the lengths
 are different).

Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.
Jun 04 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9qh22$1fdh$1 digitaldaemon.com>, Walter says...
Oh durn, even with 20 bit unicode they are *still* having multicharacter
sequences? ARRRRGGGGHHH.

It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there is a distinction between characters and glyphs (or, if you want to get technical, "default grapheme clusters"). One character equals one dchar, no questions there, but there is not a one-to-one correspondence between characters and glyphs, and there may be several different "spellings" of the same glyph. The combining characters allow you, for example, to put an acute accent over any character.

It's all cunning stuff, and of course something of a nightmare for those who design fonts, make text editors, and so on. But fortunately for us, font design is not an issue, just implementation of a few basic algorithms which someone else has already worked out for us. (Although of course, things are never that straightforward. The Consortium's algorithms are kind of "proof of concept". /Real/ implementations would have to throw in a bit of speed optimization).

No need for the aaargh, though. Once you get your head around the character/glyph distinction, it all makes complete sense. D's dchars are *characters*, and for that purpose, they are exactly what they are designed to be. D has got it right. And no - there's no need to introduce a glyph type, before anyone asks. Glyphs are only important to people who write rendering algorithms. Glyph /boundaries/ are important, but the algorithms will cover that.

I'm sure someone will take up the challenge. It's a fascinating area.

Arcane Jill
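
Both points, the 21-bit range and the two "spellings" of e-acute, can be seen in a few lines. This is a Python sketch using its standard unicodedata module; same_text is a hypothetical helper, and NFC normalization is used here as one standard way to compare canonically equivalent strings, not as the only one.

```python
import sys
import unicodedata

decomposed = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00E9"    # U+00E9, the single-character spelling

print(decomposed == precomposed)            # False: a plain comparison sees different characters
print(len(decomposed), len(precomposed))    # 2 1

def same_text(a, b):
    """Compare after bringing both strings to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(decomposed, precomposed))   # True

# The top code point is 0x10FFFF, which needs 21 bits.
print(hex(sys.maxunicode), sys.maxunicode.bit_length())   # 0x10ffff 21
```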
Jun 04 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:c9p8dn$2j2i$1 digitaldaemon.com...
 If I had the time, I'd implement all of this myself, but I'm working on
 something else right now. I do hope, however, that D doesn't do a

 and not be standards-compliant with the defined Unicode algorithms. I'm

 what Walter says in the D manual on this one:

 Unicode is the future.

Yes. Thanks for the excellent references. Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII. If an ambitious person wishes to fix the implementations so they work with unicode, I'll incorporate them.
Jun 04 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...
Right now, the std.ctype functions
all take an argument of 'dchar'. This means the interface is correct for
unicode, even if the current implementation fails to work on anything but
ASCII.

7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao
Jun 07 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...
In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...
Right now, the std.ctype functions
all take an argument of 'dchar'. This means the interface is correct for
unicode, even if the current implementation fails to work on anything but
ASCII.

7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao

Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale. WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-) Jill
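
The 0x80 to 0x9F conflict is easy to enumerate. A minimal sketch in Python, relying on its bundled cp1252 and latin-1 codecs; the helper name differs is hypothetical, and bytes that the cp1252 codec leaves unassigned are counted as conflicts too.

```python
def differs(byte):
    """True when WINDOWS-1252 and ISO-8859-1 disagree about this byte."""
    raw = bytes([byte])
    try:
        return raw.decode("cp1252") != raw.decode("latin-1")
    except UnicodeDecodeError:
        return True   # unassigned in this codec's WINDOWS-1252 table

conflicts = [b for b in range(256) if differs(b)]
print(hex(min(conflicts)), hex(max(conflicts)))

# 0x80 is the euro sign in WINDOWS-1252 but a control character in ISO-8859-1.
print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("latin-1")))
```

Every byte outside that block decodes to the same code point under both encodings, so the ASCII range and the accented letters from 0xA0 upward are unaffected.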
Jun 07 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <ca173t$20v3$1 digitaldaemon.com>, Arcane Jill says...
In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...
In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...
Right now, the std.ctype functions
all take an argument of 'dchar'. This means the interface is correct for
unicode, even if the current implementation fails to work on anything but
ASCII.

7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao

Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale.

I know. It's only that I'm Italian, and the Italian language needs at least ISO-8859-1 (with collation, etc.); ASCII is not sufficient. Supporting only ASCII means supporting only English. While this can be understandable for English-speaking people, I think that it's worth adding a single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French, Portuguese, German, Italian, etc.
WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't
recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have
taken over enough of the world as it is without their invading D as well. ;-)

I don't know how D handles the interface with the OS, but I think Windows would pass CP1252-encoded characters to getchar(), for example.

Ciao
Jun 08 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca3pe5$24v$1 digitaldaemon.com>, Roberto Mariottini says...

I know. It's only that I'm Italian, and the Italian language needs at least
ISO-8859-1 (with collation, etc.); ASCII is not sufficient.
Supporting only ASCII means supporting only English. While this can be
understandable for English-speaking people, I think that it's worth adding a
single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French,
Portuguese, German, Italian, etc.

Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backward compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so). That, in conjunction with the real Unicode functions which he has also supplied, should solve all your problems.

However, there is no way I would support adding explicit support to D for ISO-8859-1. I am also European, and I also use non-ASCII characters, but when I step outside the bounds of ASCII, I use Unicode, not ISO-8859-1.

Jill

PS. Unicode is a superset of ISO-8859-1 with codepoint equivalence. In this sense only, ISO-8859-1 has special status compared with, say, ISO-8859-2. (Unicode is a superset of ISO-8859-2 as well, of course, but the codepoints are different). So anything which works for Unicode will work for ISO-8859-1, codepoint for codepoint. But that's not the same as restricting it to that range.
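
The codepoint equivalence in the PS can be verified mechanically. A short Python sketch using its bundled latin-1 and iso8859_2 codecs; the variable names are only for illustration.

```python
# Every ISO-8859-1 byte value decodes to the Unicode code point with the same number.
latin1_matches = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(latin1_matches)

# ISO-8859-2 is also covered by Unicode, but the numbers no longer line up:
# byte 0xA1 is LATIN CAPITAL LETTER A WITH OGONEK, which lives at U+0104.
latin2_char = bytes([0xA1]).decode("iso8859_2")
print(hex(ord(latin2_char)))
```

So ISO-8859-1 data can be widened to Unicode by copying the values, while any other 8-bit encoding needs a lookup table.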
Jun 08 2004
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
I know. It's only that I'm Italian, and the Italian language needs at least
ISO-8859-1 (with collation, etc.); ASCII is not sufficient.
Supporting only ASCII means supporting only English. While this can be
understandable for English-speaking people, I think that it's worth adding a
single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French,
Portuguese, German, Italian, etc.

Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).

It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). Hauke
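
An exhaustive check of that kind is cheap, since there are only 128 ASCII code points. The following Python sketch shows the shape of such a test; it is not Hauke's unittest, and all four functions are hypothetical stand-ins (two ASCII reference versions and two Unicode-aware versions) covering just isdigit and toupper.

```python
import unicodedata

def ascii_isdigit(code):
    return 0x30 <= code <= 0x39

def ascii_toupper(code):
    return code - 0x20 if 0x61 <= code <= 0x7A else code

def uni_isdigit(code):
    """Unicode decimal digit: general category Nd."""
    return unicodedata.category(chr(code)) == "Nd"

def uni_toupper(code):
    """Simple one-to-one uppercasing; leaves the character alone otherwise."""
    up = chr(code).upper()
    return ord(up) if len(up) == 1 else code

def ascii_mismatches():
    """Compare both families on every ASCII code point and list disagreements."""
    bad = []
    for code in range(0x80):
        if uni_isdigit(code) != ascii_isdigit(code):
            bad.append(("isdigit", code))
        if uni_toupper(code) != ascii_toupper(code):
            bad.append(("toupper", code))
    return bad

print(ascii_mismatches())   # an empty list means the two families agree on ASCII
```

A predicate such as isspace would need an explicit ASCII branch to pass the same loop, because the Unicode space property disagrees with ctype on characters like '\n'.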
Jun 08 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca3v8c$fai$1 digitaldaemon.com>, Hauke Duden says...

 Hauke has now implemented utype - a drop-in replacement for ctype, which now
 supports all Unicode characters. (I don't know how he did it. I'm not completely
 convinced that it's backwardly compatible with ctype in the ASCII range, but
 even if it isn't, I'm sure it could be made so).

It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). Hauke

Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it). When I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit. All in all, I'm thoroughly impressed with this. Nice one!

Jill

PS. Did you omit charToCasefold(), or did I just miss it?
Jun 08 2004
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
Hauke has now implemented utype - a drop-in replacement for ctype, which now
supports all Unicode characters. (I don't know how he did it. I'm not completely
convinced that it's backwardly compatible with ctype in the ASCII range, but
even if it isn't, I'm sure it could be made so).

It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). Hauke

Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it).

It is there, in the module description.
 When I read the docs for utype.isspace() I kinda got the
 impression that it just called charIsSpace(), which obviously would not be
 compatible with ctype. Perhaps you could make the documentation more explicit.

The documentation of isspace states that it is equivalent to charIsSeparator. But I will make it a little more obvious.
 All in all, I'm thoroughly impressed with this. Nice one!

Thanks :).
 PS. Did you omit charToCasefold(), or did I just miss it?

No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module. If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent. Hauke
Jun 08 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...

 PS. Did you omit charToCasefold(), or did I just miss it?

No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.

Yes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.
If you want to do simple one-to-one case folding then calling 
charToLower on both characters should be equivalent.

I know, but basically, I'm saying that code which reads:
       if (charToCaseFold(c) == charToCaseFold(d))

is more self-documenting than code which reads:
       if (charToLower(c) == charToLower(d))

and it gets people to start thinking in the Unicode way. So - even if it does nothing useful, I think it's still a good function to have. Jill
Jun 08 2004
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:

 In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...
 
 
PS. Did you omit charToCasefold(), or did I just miss it?

No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.

Yes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.

But the interface would have to be changed to return a string instead of a single character. That would break all code that uses it. Hauke
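
The one-to-many problem shows up with a single character. A minimal Python sketch, using its built-in str.casefold() as a stand-in for full Unicode case folding; caseless_equal is a hypothetical helper and nothing here reflects the interface of Hauke's module.

```python
sharp_s = "\u00DF"   # LATIN SMALL LETTER SHARP S

# Full case folding maps this one character to two, so a function that
# takes a character and returns a character cannot express it.
print(len(sharp_s), len(sharp_s.casefold()))

def caseless_equal(a, b):
    """Case-insensitive comparison on whole strings via full case folding."""
    return a.casefold() == b.casefold()

print(caseless_equal("STRASSE", "stra\u00DFe"))          # folding matches them
print("STRASSE".lower() == "stra\u00DFe".lower())        # lowercasing does not
```

Working on whole strings sidesteps the return-type question, at the cost of no longer being a drop-in per-character function.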
Jun 08 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
Okay, cancel that. I've just realized I was talking complete rubbish. You were
right. I was wrong. Case folding comes into play during special casing, not
simple casing. (I was thinking it was in UnicodeData.txt, but of course it
isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize
for questioning you, and now I'm going to go and hide in a corner until I stop
feeling such a prat.

Jill (embarrassed).
Jun 08 2004
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:

 Okay, cancel that. I've just realized I was talking complete rubbish. You were
 right. I was wrong. Case folding comes into play during special casing, not
 simple casing. (I was thinking it was in UnicodeData.txt, but of course it
 isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize
 for questioning you, and now I'm going to go and hide in a corner until I stop
 feeling such a prat.
 
 Jill (embarrassed).

Lol. Come on, don't be sad... ;) It's good practice to question other people's work. They could be wrong just as easily as you could. At the very least it will keep both you and the other one thinking, which is always a good thing. Hauke
Jun 08 2004