
digitalmars.D - Re: Why UTF-8/16 character encodings?

On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
[...]
 The vast majority of non-english alphabets in UCS can be encoded in
 a single byte.  It is your exceptions that are not relevant.

I'll have you know that speakers of Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage. [...]
The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.
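
To make the concatenation problem concrete, here is a minimal Python sketch. It assumes Latin-1 and ISO-8859-7 as two example single-byte code pages (neither is named in the thread) and shows that once the bytes of two differently encoded strings are joined, no single code page decodes the result correctly.

```python
# Illustration only: Latin-1 and ISO-8859-7 stand in for "code pages".
french = "cliché".encode("latin-1")    # é is the single byte 0xE9
greek = "π".encode("iso8859_7")        # π is the single byte 0xF0

# Concatenating the raw bytes loses track of which code page each part used.
combined = french + greek

as_latin1 = combined.decode("latin-1")    # the Greek letter is misread
as_greek = combined.decode("iso8859_7")   # the French letter is misread
print(as_latin1)
print(as_greek)

# With one universal encoding, concatenation is unambiguous.
utf8_roundtrip = ("cliché" + "π").encode("utf-8").decode("utf-8")
print(utf8_roundtrip)
```

Whichever code page the reader picks, one of the two non-ASCII letters comes out wrong, whereas the UTF-8 version round-trips unchanged.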

made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world is at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages.

And yes, you can say "well just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is no longer encodable. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8.

The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish.

I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing.

T

-- 
Живёшь только однажды.
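
The slicing point can be illustrated with a short Python sketch. ISO-2022-JP is used here only as a stand-in for an escape-sequence scheme (it is an assumption, not the scheme proposed in the thread): a slice that drops the leading escape sequence silently decodes to unrelated ASCII, while a badly placed slice of UTF-8 damages only the one character that was cut.

```python
# Stand-in for an escape-sequence scheme: ISO-2022-JP switches character
# sets with a 3-byte escape sequence at the start of the Japanese run.
original = "日本語"
escaped = original.encode("iso2022_jp")

# Slice off the leading escape sequence: the remaining bytes still decode
# without error, but as plain ASCII that has nothing to do with the text.
mangled = escaped[3:].decode("iso2022_jp")
print(repr(mangled))

# UTF-8: cutting in the middle of "ï" (two bytes) leaves a stray
# continuation byte, which a decoder can detect and replace locally.
utf8 = "naïve café".encode("utf-8")
utf8_tail = utf8[3:].decode("utf-8", errors="replace")
print(repr(utf8_tail))
```

A bad slice of UTF-8 is still an error, but it is detectable and limited to one character, whereas the escape-based slice yields plausible-looking garbage with no error at all.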
May 25 2013