
digitalmars.D - Questions about Unicode, particularly Japanese

reply "Nick Sabalausky" <a a.a> writes:
The "Wide character support in D" thread got me to question and double-check 
some of my assumptions about unicode. From double-checking the UTF-8 
encoding, and looking at the charts at ( http://www.unicode.org/charts/ ), I 
realized that Japanese, Chinese and Korean characters are almost entirely 
(if not entirely) 3 bytes on UTF-8. For some reason I had been under the 
impression that the Japanese -kanas and at least a few of the Chinese 
characters were 2 bytes on UTF-8. Turns out that's not the case. I thought 
I'd share that in case any one else didn't know. Also, FWIW, Cyrillic (ex, 
Russian, AIUI), and Greek appear to be primarily, if not entirely, 2 bytes 
in UTF-8.
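For instance, in D, where string is UTF-8 and wstring is UTF-16, the
code-unit counts can be checked directly (a quick sketch; the sample
characters are my own picks):

import std.stdio;

void main()
{
    // Hiragana "so" (U+305D): 3 bytes in UTF-8, 1 UTF-16 code unit.
    string  u8  = "\u305D";
    wstring u16 = "\u305D"w;
    writeln(u8.length);  // 3 (UTF-8 code units, i.e. bytes)
    writeln(u16.length); // 1 (UTF-16 code units)

    // Cyrillic "д" (U+0434): 2 bytes in UTF-8.
    writeln("\u0434".length); // 2
}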

But then I noticed something on the charts for the Japanese -kanas (ex: 
http://www.unicode.org/charts/PDF/U3040.pdf ). Umm, first of all, for those 
unfamiliar with Japanese: There are two phonetic alphabets, hiragana and 
katakana (in addition to the chinese characters), and they're based more on 
syllables than the individual sounds of western-style letters. Also, some of 
the sounds are formed by adding a modifier to a symbol for a similar sound. 
For instance: そ (U+305D, hiragana "so") is the sound "so", and to make "zo" 
you add what looks like a double-quote to it: ぞ (U+305E, hiragana "zo") (You 
may need to increase your font size to see it well). That same modifier 
converts most of the "s"'s to "z"'s, or any of the "h"'s to "b"'s, etc. And 
there's also another modifier that converts the "h"'s to "p"'s (looks like a 
little circle).

The thing is, there appears to also be Unicode code points for these 
modifiers by themselves (U+3099 and U+309A). Maybe I'm understanding it 
wrong, but according to Page 3 in the document I linked to above, it looks 
like these are intended to be used in conjunction with the regular letters 
in order to modify them. So, it seems that there are two valid ways to 
encode a single character like ぞ ("zo"): either (U+305E) or (U+305D, 
U+3099).
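A quick sketch in D shows that the two forms really are different code-unit
sequences, so a naive comparison treats them as different strings:

import std.stdio;

void main()
{
    string precomposed = "\u305E";       // ぞ as a single code point
    string decomposed  = "\u305D\u3099"; // そ + combining voicing mark
    writeln(precomposed == decomposed);  // false: different code units
    writeln(precomposed.length, " vs ", decomposed.length); // 3 vs 6
}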

I think these are what people call "combining characters" but every 
explanation of Unicode I've ever seen that actually mentions such things 
always just hand-waves it away with "oh, yea, and then there's something 
called 'combining characters' that can complicate things", and that's all 
they ever say.

So, my questions:

1. Am I correct in all of that?

2. Is there a proper way to encode that modifier character by itself? For 
instance, if you wanted to write "Japanese has a (the modifier by itself 
here) that changes a sound".

3. A text editor, for instance, is intended to treat something like (U+305D, 
U+3099) as a single character, right?

4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to 
compare as equal?






7. I assume Unicode doesn't have any provisions for Furigana, right? I 
assume that would be outside the scope of Unicode, but I thought I'd ask.
Jun 08 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Nick Sabalausky:

 3. A text editor, for instance, is intended to treat something like (U+305D, 
 U+3099) as a single character, right?
Languages are a product of biology, and in biology it's usually hard to put 
absolute limits between things; all definitions must be flexible and a 
little fuzzy if they want to grasp enough of the reality and be useful. So I 
think the answer to this question is yes.

When you iterate with D's foreach over a string that contains those, what is 
the right way to split the characters? Returning a single "character" 8 
bytes long (that is, a string of two 32-bit chars) that contains them both 
is not wrong (but probably not expected) :-)
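For instance, a little sketch (a foreach with a dchar loop variable decodes
the UTF-8, so the combining mark comes out as its own code point):

import std.stdio;

void main()
{
    string s = "\u305D\u3099"; // そ followed by the combining voicing mark
    foreach (dchar c; s)       // decodes UTF-8 into code points
        writefln("U+%04X", c); // two iterations: U+305D, then U+3099
}

Bye,
bearophile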
Jun 08 2010
prev sibling next sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
On 2010-06-08 22:27, Nick Sabalausky wrote:
<snip>
 1. Am I correct in all of that?
Yes. In particular, the three-byteness of CJK characters is an often-cited reason to use UTF-16 instead of UTF-8.
 2. Is there a proper way to encode that modifier character by itself? For
 instance, if you wanted to write "Japanese has a (the modifier by itself
 here) that changes a sound".
You can combine it with a space, but yes: that mark, called the dakuten or voicing mark, can be encoded by itself as U+309B. I recommend http://rishida.net/scripts/uniview/ for searching through Unicode.
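For instance (a sketch; U+309B is the standalone mark, U+3099 the combining
one, here attached to a plain space):

import std.stdio;

void main()
{
    writeln("\u309B");  // standalone (non-combining) voicing mark
    writeln(" \u3099"); // a space carrying the combining voicing mark
}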
 3. A text editor, for instance, is intended to treat something like (U+305D,
 U+3099) as a single character, right?
Yes, I'd say so. I suppose it could allow for removing only the modifier (or the modified), but that doesn't seem like it should be the default behaviour.
 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
 compare as equal?
Yes. You might want to read about equivalence and normalization in Unicode: http://en.wikipedia.org/wiki/Unicode_equivalence
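For instance, with today's std.uni (its normalize function was added to
Phobos after this thread), the check is a one-liner; a sketch:

import std.stdio;
import std.uni;

void main()
{
    string precomposed = "\u305E";       // ぞ as one code point
    string decomposed  = "\u305D\u3099"; // そ + combining voicing mark
    writeln(precomposed == decomposed);  // false: raw code units differ
    writeln(normalize!NFC(decomposed) == precomposed); // true after NFC
}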

AFAIK, neither supports normalization of any kind.


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
There's:

U+FFF9 INTERLINEAR ANNOTATION ANCHOR
U+FFFA INTERLINEAR ANNOTATION SEPARATOR
U+FFFB INTERLINEAR ANNOTATION TERMINATOR

But it's usually recommended to use some kind of ruby markup instead. See:
http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Matti Niemenmaa" <see_signature for.real.address> wrote in message 
news:hum6ft$2jar$1 digitalmars.com...
 On 2010-06-08 22:27, Nick Sabalausky wrote:
 <snip>
Thanks for the helpful response :)
 I recommend http://rishida.net/scripts/uniview/ for searching through 
 Unicode.
Ahh, I'd been wanting a good Unicode equivalent to an ASCII chart. That seems to do nicely.
 6. Are there other languages with similar things for which the answers to 


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
Actually, I meant other human languages. Like, are there other combining 
characters for some language other than Japanese that are intended to be 
compared as unequal to their corresponding single-code-point version?
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
There's: U+FFF9 INTERLINEAR ANNOTATION ANCHOR U+FFFA INTERLINEAR ANNOTATION SEPARATOR U+FFFB INTERLINEAR ANNOTATION TERMINATOR But it's usually recommended to use some kind of ruby markup instead. See: http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode
Thanks. I was wondering about those being there but not being recommended, 
so I followed that link and the footnote, and found the following very 
helpful explanation:

http://www.unicode.org/reports/tr20/#Interlinear

Their explanation is easy to understand, but basically, those characters are 
there as a convenience for internal use by an application. They don't 
provide other information that would normally be important for markup, such 
as where to position the annotation. And they're not easily displayable in 
plain-text-only modes without the risk of subtly changing the meaning.

Any idea if "Ruby markup" has anything to do with the Ruby programming 
language? It's not clear from that Wikipedia article.
Jun 08 2010
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
On 2010-06-08 23:16, Nick Sabalausky wrote:
 "Matti Niemenmaa"<see_signature for.real.address>  wrote in message
 news:hum6ft$2jar$1 digitalmars.com...
 On 2010-06-08 22:27, Nick Sabalausky wrote:
 6. Are there other languages with similar things for which the answers to


Factor has pretty good support for Unicode: http://docs.factorcode.org/content/article-unicode.html
Actually, I meant other human-languages. Like, are there other combining characters for some language other than Japanese that are indended to be compared as unequal to their corresponding singe-code-point version?
Ah, sorry for the misunderstanding. :-) I don't think so, no. The Unicode FAQ at http://www.unicode.org/faq/normalization.html says "Programs should always compare canonical-equivalent Unicode strings as equal".
 Any idea if "Ruby markup" has anything to do with the Ruby programming
 language? It's not clear from that Wikipedia article.
No, they're completely unrelated.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Matti Niemenmaa" <see_signature for.real.address> wrote in message 
news:hum8us$2o7m$1 digitalmars.com...
 Any idea if "Ruby markup" has anything to do with the Ruby programming
 language? It's not clear from that Wikipedia article.
No, they're completely unrelated.
Heh, you know, that would have been perfectly obvious from the article if I had just scrolled up a bit :)
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Sorry if it's a top post again in your mail clients. I'll try to figure out
what's going on later today.


 
 1. Am I correct in all of that?
Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
It really depends on the situation. The advantage is not only space but also 
faster processing speed (even for 2-byte letters: Greek, Cyrillic, etc.), 
since those 2 bytes can be read in one memory access, as opposed to UTF-8.

Also, consider another thing: it's easier (and cheaper) to convert from ANSI 
to UTF-16, since a direct table can be created. Whereas for UTF-8, you'll 
have to do some shifts to build the multi-byte sequence for non-ASCII 
letters (even for accented Latin ones).

Which encoding is better depends on your taste, language, applications, etc. 
I was simply pointing out that it's quite nice to have a universal 'tchar' 
type. My argument was never about which encoding is better - it's hard to 
tell in general. Besides, many people still use ANSI and not UTF-8.
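As a sketch of that direct-table idea (the helper names and the mostly
unfilled table here are hypothetical; a real table would be generated from
the full code page definition, e.g. CP-1251):

import std.stdio;

// One UTF-16 code unit per byte value of a single-byte code page.
wchar[256] makeTable()
{
    wchar[256] t;
    foreach (i; 0 .. 256)
        t[i] = cast(wchar) i; // identity mapping is correct for Latin-1
    // For other code pages, patch the upper half, e.g. for CP-1251:
    t[0xE0] = '\u0430';       // 0xE0 -> Cyrillic small letter a
    return t;
}

wchar[] ansiToUtf16(const(ubyte)[] ansi, ref const wchar[256] table)
{
    auto result = new wchar[](ansi.length); // one code unit per byte
    foreach (i, b; ansi)
        result[i] = table[b];               // a single lookup per byte
    return result;
}

void main()
{
    auto table = makeTable();
    ubyte[] input = [0x48, 0x69, 0xE0]; // "Hi" plus CP-1251 0xE0
    writeln(ansiToUtf16(input, table)); // prints: Hiа
}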
Jun 08 2010
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 08 Jun 2010 16:18:54 -0400, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Sorry if it's a top post again in your mail clients. I'll try to figure 
 out what's going on later today.
It appears as a top-post in my newsreader too.
 1. Am I correct in all of that?
 Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
 It really depends on the situation. The advantage is not only space but 
 also faster processing speed (even for 2-byte letters: Greek, Cyrillic, 
 etc.), since those 2 bytes can be read in one memory access, as opposed to 
 UTF-8.

 Also, consider another thing: it's easier (and cheaper) to convert from 
 ANSI to UTF-16, since a direct table can be created. Whereas for UTF-8, 
 you'll have to do some shifts to build the multi-byte sequence for 
 non-ASCII letters (even for accented Latin ones).

 Which encoding is better depends on your taste, language, applications, 
 etc. I was simply pointing out that it's quite nice to have a universal 
 'tchar' type. My argument was never about which encoding is better - it's 
 hard to tell in general. Besides, many people still use ANSI and not UTF-8.
Wouldn't this suggest that the decision of what character type to use would 
be more suited to what language you speak than what OS you are running?

-Steve
Jun 08 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.138.1276028343.24349.digitalmars-d puremagic.com...
 Sorry if it's a top post again in your mail clients. I'll try to figure 
 out what's going on later today.


 1. Am I correct in all of that?
 Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. 
 It really depends on the situation. The advantage is not only space but 
 also faster processing speed (even for 2-byte letters: Greek, Cyrillic, 
 etc.), since those 2 bytes can be read in one memory access, as opposed to 
 UTF-8. Also, consider another thing: it's easier (and cheaper) to convert 
 from ANSI to UTF-16, since a direct table can be created. Whereas for 
 UTF-8, you'll have to do some shifts to build the multi-byte sequence for 
 non-ASCII letters (even for accented Latin ones).
Yea, I need to remember not to try to post late at night ;)
Jun 08 2010
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-08 15:27:10 -0400, "Nick Sabalausky" <a a.a> said:

 So, my questions:
 
 1. Am I correct in all of that?
Yes. Note that combining characters exist for a variety of glyphs. There 
is, for instance, a "combining acute accent" that can be combined with an 
"e", so you could use two code points to write "é" if you wanted, instead 
of the single-code-point "pre-combined" form.
 2. Is there a proper way to encode that modifier character by itself? For
 instance, if you wanted to write "Japanese has a (the modifier by itself
 here) that changes a sound".
Sometimes there is a separate (non-combining) character for that. For 
instance, there is a non-combining acute accent as a standalone character. 
Perhaps you can use a combining character with a no-break space?
 3. A text editor, for instance, is intended to treat something like (U+305D,
 U+3099) as a single character, right?
Yes. The two are equivalent, and they'll normalize to the same form.
 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
 compare as equal?
Yes, well, it depends on what you're trying to do. Say you're searching for 
"é" in a text editor: it should match both the precomposed and the combining 
version. In your code, it depends on what you want to do (if you want to 
replace U+305D U+3099 with U+305E, then obviously you search by code point). 
I think the proper way to do this is to perform Unicode normalization on 
both strings before comparing code points.

Probably not. But again, in some cases making a literal code-point search 
might be what you want. It'd be interesting if someone could make a Unicode 
normalizer in the form of a range in Phobos 2. That way you could compare 
both strings by comparing code points from the normalizer ranges, all 
without having to create a normalized copy.
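As an eager stand-in for that lazy range (a sketch; canonEqual is a made-up
helper, and std.uni.normalize arrived in Phobos after this thread):

import std.algorithm : equal;
import std.stdio;
import std.uni;

// Normalize both strings to NFD, then compare code point by code point.
bool canonEqual(string a, string b)
{
    return equal(normalize!NFD(a), normalize!NFD(b));
}

void main()
{
    writeln(canonEqual("\u305E", "\u305D\u3099")); // true
    writeln(canonEqual("\u00E9", "e\u0301"));      // true: é both ways
}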


Not all combinations have a pre-combined form, so you can't always convert 
them to a single code point. But besides that, when there is a pre-combined 
form, they should be treated as equivalent.
 7. I assume Unicode doesn't have any provisions for Furigana, right? I
 assume that would be outside the scope of Unicode, but I thought I'd ask.
I'm pretty sure furigana is out of scope.

References:
<http://en.wikipedia.org/wiki/Combining_character>
<http://en.wikipedia.org/wiki/Unicode_normalization>

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jun 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
Thanks all for the helpful responses. Since we seem to have some real 
Unicode-knowledge people here, I'd like to repost a question I had asked 
elsewhere awhile back, but didn't get an answer:

--------------------------------------------------------------------------------
Can someone explain how folding-case differs from lower-case and why it 
should be used for case-insensitive matching instead of lower-case?

I was looking at this document, but still don't get it: 
http://www.unicode.org/reports/tr21/tr21-5.html

The only part I see that directly addresses that is this:

      Case-folding is more than just conversion to lowercase.
      For example, it handles cases such as the Greek sigma,
      so that "?????" and "????S" will match correctly.

Which references what it says earlier about sigma:

      Characters may also have different case mappings,
      depending on the context.

      For example, U+03A3 "Σ" capital sigma lowercases to
      U+03C3 "σ" small sigma if it is followed by another
      letter, but lowercases to U+03C2 "ς" small
      final sigma if it is not.

But I still don't see how that demonstrates a need for anything other than 
toLower provided that the given toLower routine is already properly handling 
the "end of word"/"not end of word" difference.
--------------------------------------------------------------------------------

Unless it's just extra speed due to not having to handle things like the 
"end of word"/"not end of word" difference?

BTW, if those characters don't show up right on the newsgroup, the original 
quesion (where they are showing up right, at least for me) is here: 
http://www.dsource.org/projects/tango/forums/topic/782
Jun 08 2010
parent Don <nospam nospam.com> writes:
Nick Sabalausky wrote:
 Thanks all for the helpful responses. Since we seem to have some real 
 Unicode-knowledge people here, I'd like to repost a question I had asked 
 elsewhere awhile back, but didn't get an answer:
 
 --------------------------------------------------------------------------------
 Can someone explain how folding-case differs from lower-case and why it 
 should be used for case-insensitive matching instead of lower-case?
 
 I was looking at this document, but still don't get it: 
 http://www.unicode.org/reports/tr21/tr21-5.html
 
 The only part I see that directly addresses that is this:
 
       Case-folding is more than just conversion to lowercase.
       For example, it handles cases such as the Greek sigma,
       so that "?????" and "????S" will match correctly.
 
 Which references what it says earlier about sigma:
 
       Characters may also have different case mappings,
       depending on the context.
 
        For example, U+03A3 "Σ" capital sigma lowercases to
        U+03C3 "σ" small sigma if it is followed by another
        letter, but lowercases to U+03C2 "ς" small
        final sigma if it is not.
 
 But I still don't see how that demonstrates a need for anything other than 
 toLower provided that the given toLower routine is already properly handling 
 the "end of word"/"not end of word" difference.
 --------------------------------------------------------------------------------
 
 Unless, it's just extra speed due to not having to handle things like the 
 "end of word"/"not end of word" difference?
If you want to case-insensitively find "as" in " basdaS " in English, you 
can just convert both strings to lower case, and you'll find both 
occurrences.

Now suppose you want to find "as" in the string " basdas ", where it's all 
in Greek. It still occurs twice, but if you convert the string to lower 
case, each s ends up as a different character: the one inside the word 
becomes medial sigma (σ), while the one at the end of the word becomes final 
sigma (ς). A lowercased search string can match only one of them, so 
toLower() doesn't work. Case folding maps both forms to the same character, 
so it does.
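With today's std.uni (which postdates this thread), you can see the
difference; a sketch:

import std.stdio;
import std.uni : icmp;

void main()
{
    string medial = "\u03C3"; // σ, the sigma used inside a word
    string finals = "\u03C2"; // ς, the sigma used at the end of a word
    writeln(medial == finals);     // false: different code points
    writeln(icmp(medial, finals)); // 0: case folding maps both to σ

    // "ΟΔΟΣ" vs "οδος" (the latter ending in final sigma):
    writeln(icmp("\u039F\u0394\u039F\u03A3",
                 "\u03BF\u03B4\u03BF\u03C2")); // 0: equal under folding
}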
Jun 09 2010