
digitalmars.D - Questions about Unicode, particularly Japanese

reply "Nick Sabalausky" <a a.a> writes:
The "Wide character support in D" thread got me to question and double-check 
some of my assumptions about unicode. From double-checking the UTF-8 
encoding, and looking at the charts at ( http://www.unicode.org/charts/ ), I 
realized that Japanese, Chinese and Korean characters are almost entirely 
(if not entirely) 3 bytes on UTF-8. For some reason I had been under the 
impression that the Japanese -kanas and at least a few of the Chinese 
characters were 2 bytes on UTF-8. Turns out that's not the case. I thought 
I'd share that in case any one else didn't know. Also, FWIW, Cyrillic (ex, 
Russian, AIUI), and Greek appear to be primarily, if not entirely, 2 bytes 
in UTF-8.
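For instance, in D, where string is UTF-8 and wstring is UTF-16, the
code-unit counts can be checked directly (a quick sketch; the sample
characters are my own picks):

import std.stdio;

void main()
{
    // Hiragana "so" (U+305D): 3 bytes in UTF-8, 1 UTF-16 code unit.
    string  u8  = "\u305D";
    wstring u16 = "\u305D"w;
    writeln(u8.length);  // 3 (UTF-8 code units, i.e. bytes)
    writeln(u16.length); // 1 (UTF-16 code units)

    // Cyrillic "д" (U+0434): 2 bytes in UTF-8.
    writeln("\u0434".length); // 2
}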

But then I noticed something on the charts for the Japanese -kanas (ex: 
http://www.unicode.org/charts/PDF/U3040.pdf ). Umm, first of all, for those 
unfamiliar with Japanese: There are two phonetic alphabets, hiragana and 
katakana (in addition to the chinese characters), and they're based more on 
syllables than the individual sounds of western-style letters. Also, some of 
the sounds are formed by adding a modifier to a symbol for a similar sound. 
For instance: そ (U+305D, hiragana "so") is the sound "so", and to make "zo" 
you add what looks like a double-quote to it: ぞ (U+305E, hiragana "zo") (You 
may need to increase your font size to see it well). That same modifier 
converts most of the "s"'s to "z"'s, or any of the "h"'s to "b"'s, etc. And 
there's also another modifier that converts the "h"'s to "p"'s (looks like a 
little circle).

The thing is, there appears to also be Unicode code points for these 
modifiers by themselves (U+3099 and U+309A). Maybe I'm understanding it 
wrong, but according to Page 3 in the document I linked to above, it looks 
like these are intended to be used in conjunction with the regular letters 
in order to modify them. So, it seems that there are two valid ways to 
encode a single character like ぞ ("zo"): either (U+305E) or (U+305D, 
U+3099).
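A quick sketch in D shows that the two forms really are different code-unit
sequences, so a naive comparison treats them as different strings:

import std.stdio;

void main()
{
    string precomposed = "\u305E";       // ぞ as a single code point
    string decomposed  = "\u305D\u3099"; // そ + combining voicing mark
    writeln(precomposed == decomposed);  // false: different code units
    writeln(precomposed.length, " vs ", decomposed.length); // 3 vs 6
}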

I think these are what people call "combining characters" but every 
explanation of Unicode I've ever seen that actually mentions such things 
always just hand-waves it away with "oh, yea, and then there's something 
called 'combining characters' that can complicate things", and that's all 
they ever say.

So, my questions:

1. Am I correct in all of that?

2. Is there a proper way to encode that modifier character by itself? For 
instance, if you wanted to write "Japanese has a (the modifier by itself 
here) that changes a sound".

3. A text editor, for instance, is intended to treat something like (U+305D, 
U+3099) as a single character, right?

4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to 
compare as equal?






7. I assume Unicode doesn't have any provisions for Furigana, right? I 
assume that would be outside the scope of Unicode, but I thought I'd ask.
Jun 08 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Nick Sabalausky:

 3. A text editor, for instance, is intended to treat something like (U+305D, 
 U+3099) as a single character, right?
Languages are a product of biology, and in biology it's usually hard to put 
absolute limits between things; all definitions must be flexible and a 
little fuzzy if they want to grasp enough of the reality and be useful. So I 
think the answer to this question is yes.

When you iterate with D's foreach over a string that contains those, what is 
the right way to split the characters? Returning a single "character" 8 
bytes long (that is, a string of two 32-bit chars) that contains them both 
is not wrong (but probably not expected) :-)
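For instance, a little sketch (a foreach with a dchar loop variable decodes
the UTF-8, so the combining mark comes out as its own code point):

import std.stdio;

void main()
{
    string s = "\u305D\u3099"; // そ followed by the combining voicing mark
    foreach (dchar c; s)       // decodes UTF-8 into code points
        writefln("U+%04X", c); // two iterations: U+305D, then U+3099
}

Bye,
bearophile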
Jun 08 2010
prev sibling next sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
On 2010-06-08 22:27, Nick Sabalausky wrote:
<snip>
 1. Am I correct in all of that?
Yes. In particular, the three-byteness of CJK characters is an often-cited reason to use UTF-16 instead of UTF-8.
 2. Is there a proper way to encode that modifier character by itself? For
 instance, if you wanted to write "Japanese has a (the modifier by itself
 here) that changes a sound".
You can combine it with a space, but yes: that mark, called the dakuten or voicing mark, can be encoded by itself as U+309B. I recommend http://rishida.net/scripts/uniview/ for searching through Unicode.
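For instance (a sketch; U+309B is the standalone mark, U+3099 the combining
one, here attached to a plain space):

import std.stdio;

void main()
{
    writeln("\u309B");  // standalone (non-combining) voicing mark
    writeln(" \u3099"); // a space carrying the combining voicing mark
}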
 3. A text editor, for instance, is intended to treat something like (U+305D,
 U+3099) as a single character, right?
Yes, I'd say so. I suppose it could allow for removing only the modifier (or the modified), but that doesn't seem like it should be the default behaviour.
 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
 compare as equal?
Yes. You might want to read about equivalence and normalization in Unicode: http://en.wikipedia.org/wiki/Unicode_equivalence
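For instance, with today's std.uni (its normalize function was added to
Phobos after this thread), the check is a one-liner; a sketch:

import std.stdio;
import std.uni;

void main()
{
    string precomposed = "\u305E";       // ぞ as one code point
    string decomposed  = "\u305D\u3099"; // そ + combining voicing mark
    writeln(precomposed == decomposed);  // false: raw code units differ
    writeln(normalize!NFC(decomposed) == precomposed); // true after NFC
}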

AFAIK, neither supports normalization of any kind.


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
There's:

U+FFF9 INTERLINEAR ANNOTATION ANCHOR
U+FFFA INTERLINEAR ANNOTATION SEPARATOR
U+FFFB INTERLINEAR ANNOTATION TERMINATOR

But it's usually recommended to use some kind of ruby markup instead. See:
http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Matti Niemenmaa" <see_signature for.real.address> wrote in message 
news:hum6ft$2jar$1 digitalmars.com...
 On 2010-06-08 22:27, Nick Sabalausky wrote:
 <snip>
Thanks for the helpful response :)
 I recommend http://rishida.net/scripts/uniview/ for searching through 
 Unicode.
Ahh, I'd been wanting a good Unicode equivalent to an ASCII chart. That seems to do nicely.
 6. Are there other languages with similar things for which the answers to 


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
Actually, I meant other human languages. Like, are there other combining 
characters for some language other than Japanese that are intended to be 
compared as unequal to their corresponding single-code-point version?
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
There's: U+FFF9 INTERLINEAR ANNOTATION ANCHOR U+FFFA INTERLINEAR ANNOTATION SEPARATOR U+FFFB INTERLINEAR ANNOTATION TERMINATOR But it's usually recommended to use some kind of ruby markup instead. See: http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode
Thanks. I was wondering about those being there but not being recommended, 
so I followed that link and the footnote, and found the following very 
helpful explanation:

http://www.unicode.org/reports/tr20/#Interlinear

Their explanation is easy to understand, but basically, those characters are 
there as a convenience for internal use by an application. They don't 
provide other information that would normally be important for markup, such 
as where to position the annotation. And they're not easily displayable in 
plain-text-only modes without the risk of subtly changing the meaning.

Any idea if "Ruby markup" has anything to do with the Ruby programming 
language? It's not clear from that Wikipedia article.
Jun 08 2010
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
On 2010-06-08 23:16, Nick Sabalausky wrote:
 "Matti Niemenmaa"<see_signature for.real.address>  wrote in message
 news:hum6ft$2jar$1 digitalmars.com...
 On 2010-06-08 22:27, Nick Sabalausky wrote:
 6. Are there other languages with similar things for which the answers to


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
Actually, I meant other human-languages. Like, are there other combining characters for some language other than Japanese that are indended to be compared as unequal to their corresponding singe-code-point version?
Ah, sorry for the misunderstanding. :-) I don't think so, no. The Unicode FAQ at http://www.unicode.org/faq/normalization.html says "Programs should always compare canonical-equivalent Unicode strings as equal".
 Any idea if "Ruby markup" has anything to do with the Ruby programming
 language? It's not clear from that Wikipedia article.
No, they're completely unrelated.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Matti Niemenmaa" <see_signature for.real.address> wrote in message 
news:hum8us$2o7m$1 digitalmars.com...
 Any idea if "Ruby markup" has anything to do with the Ruby programming
 language? It's not clear from that Wikipedia article.
No, they're completely unrelated.
Heh, you know, that would have been perfectly obvious from the article if I had just scrolled up a bit :)
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Sorry if it's a top post again in your mail clients. I'll try to figure out
what's going on later today.


 
 1. Am I correct in all of that?
Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
It really depends on the situation. The advantage is not only space but also 
faster processing speed (even for 2-byte letters: Greek, Cyrillic, etc.), 
since those 2 bytes can be read in one memory access, as opposed to UTF-8.

Also, consider another thing: it's easier (and cheaper) to convert from ANSI 
to UTF-16, since a direct table can be created. Whereas for UTF-8, you'll 
have to do some shifts to build the multi-byte sequence for non-ASCII 
letters (even for accented Latin ones).

Which encoding is better depends on your taste, language, applications, etc. 
I was simply pointing out that it's quite nice to have a universal 'tchar' 
type. My argument was never about which encoding is better - it's hard to 
tell in general. Besides, many people still use ANSI and not UTF-8.
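As a sketch of that direct-table idea (the helper names and the mostly
unfilled table here are hypothetical; a real table would be generated from
the full code page definition, e.g. CP-1251):

import std.stdio;

// One UTF-16 code unit per byte value of a single-byte code page.
wchar[256] makeTable()
{
    wchar[256] t;
    foreach (i; 0 .. 256)
        t[i] = cast(wchar) i; // identity mapping is correct for Latin-1
    // For other code pages, patch the upper half, e.g. for CP-1251:
    t[0xE0] = '\u0430';       // 0xE0 -> Cyrillic small letter a
    return t;
}

wchar[] ansiToUtf16(const(ubyte)[] ansi, ref const wchar[256] table)
{
    auto result = new wchar[](ansi.length); // one code unit per byte
    foreach (i, b; ansi)
        result[i] = table[b];               // a single lookup per byte
    return result;
}

void main()
{
    auto table = makeTable();
    ubyte[] input = [0x48, 0x69, 0xE0]; // "Hi" plus CP-1251 0xE0
    writeln(ansiToUtf16(input, table)); // prints: Hiа
}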
Jun 08 2010
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 08 Jun 2010 16:18:54 -0400, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Sorry if it's a top post again in your mail clients. I'll try to figure 
 out what's going on later today.
It appears as a top-post in my newsreader too.
 1. Am I correct in all of that?
 Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
 It really depends on the situation. The advantage is not only space but 
 also faster processing speed (even for 2-byte letters: Greek, Cyrillic, 
 etc.), since those 2 bytes can be read in one memory access, as opposed to 
 UTF-8.

 Also, consider another thing: it's easier (and cheaper) to convert from 
 ANSI to UTF-16, since a direct table can be created. Whereas for UTF-8, 
 you'll have to do some shifts to build the multi-byte sequence for 
 non-ASCII letters (even for accented Latin ones).

 Which encoding is better depends on your taste, language, applications, 
 etc. I was simply pointing out that it's quite nice to have a universal 
 'tchar' type. My argument was never about which encoding is better - it's 
 hard to tell in general. Besides, many people still use ANSI and not UTF-8.
Wouldn't this suggest that the decision of what character type to use would 
be more suited to what language you speak than what OS you are running?

-Steve
Jun 08 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.138.1276028343.24349.digitalmars-d puremagic.com...
 Sorry if it's a top post again in your mail clients. I'll try to figure 
 out what's going on later today.


 1. Am I correct in all of that?
 Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
 It really depends on the situation. The advantage is not only space but 
 also faster processing speed (even for 2-byte letters: Greek, Cyrillic, 
 etc.), since those 2 bytes can be read in one memory access, as opposed to 
 UTF-8. Also, consider another thing: it's easier (and cheaper) to convert 
 from ANSI to UTF-16, since a direct table can be created. Whereas for 
 UTF-8, you'll have to do some shifts to build the multi-byte sequence for 
 non-ASCII letters (even for accented Latin ones).
Yea, I need to remember not to try to post late at night ;)
Jun 08 2010
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-08 15:27:10 -0400, "Nick Sabalausky" <a a.a> said:

 So, my questions:
 
 1. Am I correct in all of that?
Yes. Note that combining characters exist for a variety of glyphs. There 
is, for instance, a "combining acute accent" that can be combined with an 
"e", so you could use two code points to write "é" if you wanted, instead 
of the single-code-point "pre-combined" form.
 2. Is there a proper way to encode that modifier character by itself? For
 instance, if you wanted to write "Japanese has a (the modifier by itself
 here) that changes a sound".
Sometimes there is a separate (non-combining) character for that. For 
instance, there is a non-combining acute accent as a standalone character. 
Perhaps you can use a combining character with a no-break space?
 3. A text editor, for instance, is intended to treat something like (U+305D,
 U+3099) as a single character, right?
Yes. The two are equivalent, and they'll normalize to the same form.
 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
 compare as equal?
Yes, well, it depends on what you're trying to do. Say you're searching for 
"é" in a text editor: it should match both the precomposed and the combining 
version. In your code, it depends on what you want to do (if you want to 
replace U+305D U+3099 with U+305E, then obviously you search by code point). 
I think the proper way to do this is to perform Unicode normalization on 
both strings before comparing code points.

Probably not. But again, in some cases making a literal code-point search 
might be what you want. It'd be interesting if someone could make a Unicode 
normalizer in the form of a range in Phobos 2. That way you could compare 
both strings by comparing code points from the normalizer ranges, all 
without having to create a normalized copy.
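As an eager stand-in for that lazy range (a sketch; canonEqual is a made-up
helper, and std.uni.normalize arrived in Phobos after this thread):

import std.algorithm : equal;
import std.stdio;
import std.uni;

// Normalize both strings to NFD, then compare code point by code point.
bool canonEqual(string a, string b)
{
    return equal(normalize!NFD(a), normalize!NFD(b));
}

void main()
{
    writeln(canonEqual("\u305E", "\u305D\u3099")); // true
    writeln(canonEqual("\u00E9", "e\u0301"));      // true: é both ways
}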


Not all combinations have a pre-combined form, so you can't always convert 
them to a single code point. But besides that, when there is a pre-combined 
form, they should be treated as equivalent.
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
I'm pretty sure furigana is out of scope.

References:
<http://en.wikipedia.org/wiki/Combining_character>
<http://en.wikipedia.org/wiki/Unicode_normalization>

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jun 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
Thanks all for the helpful responses. Since we seem to have some real 
Unicode-knowledge people here, I'd like to repost a question I had asked 
elsewhere awhile back, but didn't get an answer:

--------------------------------------------------------------------------------
Can someone explain how folding-case differs from lower-case and why it 
should be used for case-insensitive matching instead of lower-case?

I was looking at this document, but still don't get it: 
http://www.unicode.org/reports/tr21/tr21-5.html

The only part I see that directly addresses that is this:

      Case-folding is more than just conversion to lowercase.
      For example, it handles cases such as the Greek sigma,
      so that "?????" and "????S" will match correctly.

Which references what it says earlier about sigma:

      Characters may also have different case mappings,
      depending on the context.

      For example, U+03A3 "Σ" capital sigma lowercases to
      U+03C3 "σ" small sigma if it is followed by another
      letter, but lowercases to U+03C2 "ς" small
      final sigma if it is not.

But I still don't see how that demonstrates a need for anything other than 
toLower provided that the given toLower routine is already properly handling 
the "end of word"/"not end of word" difference.
--------------------------------------------------------------------------------

Unless it's just extra speed due to not having to handle things like the 
"end of word"/"not end of word" difference?

BTW, if those characters don't show up right on the newsgroup, the original 
quesion (where they are showing up right, at least for me) is here: 
http://www.dsource.org/projects/tango/forums/topic/782
Jun 08 2010
parent Don <nospam nospam.com> writes:
Nick Sabalausky wrote:
 Thanks all for the helpful responses. Since we seem to have some real 
 Unicode-knowledge people here, I'd like to repost a question I had asked 
 elsewhere awhile back, but didn't get an answer:
 
 --------------------------------------------------------------------------------
 Can someone explain how folding-case differs from lower-case and why it 
 should be used for case-insensitive matching instead of lower-case?
 
 I was looking at this document, but still don't get it: 
 http://www.unicode.org/reports/tr21/tr21-5.html
 
 The only part I see that directly addresses that is this:
 
       Case-folding is more than just conversion to lowercase.
       For example, it handles cases such as the Greek sigma,
       so that "?????" and "????S" will match correctly.
 
 Which references what it says earlier about sigma:
 
       Characters may also have different case mappings,
       depending on the context.
 
        For example, U+03A3 "Σ" capital sigma lowercases to
        U+03C3 "σ" small sigma if it is followed by another
        letter, but lowercases to U+03C2 "ς" small
        final sigma if it is not.
 
 But I still don't see how that demonstrates a need for anything other than 
 toLower provided that the given toLower routine is already properly handling 
 the "end of word"/"not end of word" difference.
 --------------------------------------------------------------------------------
 
 Unless, it's just extra speed due to not having to handle things like the 
 "end of word"/"not end of word" difference?
If you want to case-insensitively find "as" in " basdaS " in English, you 
can just convert both strings to lower case, and you'll find both 
occurrences.

Now suppose you want to find "as" in the string " basdas ", where it's all 
in Greek. It still occurs twice, but if you convert the string to lower 
case, each s ends up as a different character: the one inside the word 
becomes medial sigma (σ), while the one at the end of the word becomes final 
sigma (ς). A lowercased search string can match only one of them, so 
toLower() doesn't work. Case folding maps both forms to the same character, 
so it does.
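With today's std.uni (which postdates this thread), you can see the
difference; a sketch:

import std.stdio;
import std.uni : icmp;

void main()
{
    string medial = "\u03C3"; // σ, the sigma used inside a word
    string finals = "\u03C2"; // ς, the sigma used at the end of a word
    writeln(medial == finals);     // false: different code points
    writeln(icmp(medial, finals)); // 0: case folding maps both to σ

    // "ΟΔΟΣ" vs "οδος" (the latter ending in final sigma):
    writeln(icmp("\u039F\u0394\u039F\u03A3",
                 "\u03BF\u03B4\u03BF\u03C2")); // 0: equal under folding
}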
Jun 09 2010