digitalmars.D - More Unicode fun

foobar <foo bar.com> writes:
After reading most of the posts above on the subject, I wanted to first thank
Spir, Michel and others who brought this topic to light.
Since Andrei and others asked for more information on the subject, I wanted to
contribute what I know to this discussion:

1. Regarding combining marks, Hebrew (and Arabic) make extensive use of them.

Hebrew has letters only for consonants; vowels are optional combining marks. In
addition, some letters take diacritics (e.g. an 's' sound vs. a 'sh' sound is
distinguished by whether a diacritic dot sits on the top left or the top right
corner of the letter 'ש').
In addition to that, some punctuation also consists of combining marks (you add
a middle dot to emphasize a consonant, whereas in western languages you double
the letter, as in the word 'letter').
On top of that, biblical text has an additional set of marks to represent the
chanting rhythm.

So it's definitely possible in Hebrew to have more than one combining mark on
the same base letter. When comparing such letters, the order of the combining
marks should not matter, and I think there's a default normalized order in such
cases.
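
To illustrate the point about mark order, here is a minimal Python sketch (Python is only a stand-in here, the thread itself has no code). It assumes the standard `unicodedata` module; `canonically_equal` is a hypothetical helper name, not something from any library discussed in this thread. It puts a shin dot and a dagesh (the middle dot) on the same letter in both orders and compares the results before and after canonical decomposition (NFD).

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT (top right, 'sh')
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ (the middle dot)

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The same letter with the same two marks, typed in different orders.
dot_first = SHIN + SHIN_DOT + DAGESH
dagesh_first = SHIN + DAGESH + SHIN_DOT

print(dot_first == dagesh_first)                   # False: raw code points differ
print(canonically_equal(dot_first, dagesh_first))  # True: same order after NFD
```

A plain `==` on the raw strings treats the two spellings as different, so any comparison that is meant to ignore mark order would have to normalize first.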

2. Case depends on locale. In Turkish, for instance, there are two 'i' letters,
one with a dot and one without. Therefore the Turkish upper case of 'i' is a
capital 'I' with a dot, unlike in English.
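
A small Python sketch of the Turkish case, for illustration only. Python's built-in `str.upper()` does not take a locale, so the Turkish rule has to be layered on top; `TURKISH_UPPER` and `upper_tr` are hypothetical names made up for this example, and the mapping covers only the two 'i' letters mentioned above.

```python
# Hypothetical Turkish-specific mapping, applied before the default uppercasing.
TURKISH_UPPER = {
    ord("i"): "\u0130",   # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    ord("\u0131"): "I",   # dotless i -> plain capital I
}

def upper_tr(text: str) -> str:
    """Uppercase text using the Turkish rule for the two 'i' letters."""
    return text.translate(TURKISH_UPPER).upper()

print("istanbul".upper())    # ISTANBUL (locale-independent default)
print(upper_tr("istanbul"))  # İSTANBUL (capital I with a dot)
```

The same input gives two different, equally "correct" results depending on the language, which is why a single built-in uppercasing function cannot serve every locale.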
Jan 14 2011
spir <denis.spir gmail.com> writes:
On 01/14/2011 11:29 PM, foobar wrote:
 So it's definitely possible in Hebrew to have more than one combining mark on
the same base letter. When comparing such letters, the order of the combining
marks should not matter, and I think there's a default normalized order in such
cases.

Unicode defines a standard order for combining marks _of different kinds_ inside a given "grapheme". "Different kinds" mainly means how rendering engines are supposed to place them relative to the base character. Combining marks of the same kind in a different order are supposed to describe a different character: for instance, <e>+<acute accent above>+<grave accent above> is not the same character for Unicode as <e>+<grave accent above>+<acute accent above> (there is a subtle placement difference). But <e>+<acute accent above>+<grave accent below> is equal to <e>+<grave accent below>+<acute accent above>: reordering will happen. This order is not imposed on users or on text-producing software, so an ordering phase is needed at the end of any normalisation process, at least if the goal is to produce a unique character representation that allows direct comparison.
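
The two <e> examples above can be checked with a short Python sketch. It assumes the standard `unicodedata` module, where `unicodedata.combining()` returns a mark's canonical combining class (the "kind" in the sense used here); `nfd` is just a local shorthand for this example.

```python
import unicodedata

ACUTE = "\u0301"        # COMBINING ACUTE ACCENT (above)
GRAVE = "\u0300"        # COMBINING GRAVE ACCENT (above)
GRAVE_BELOW = "\u0316"  # COMBINING GRAVE ACCENT BELOW

def nfd(s: str) -> str:
    """Local shorthand for canonical decomposition plus canonical ordering."""
    return unicodedata.normalize("NFD", s)

# Marks above share a combining class; the mark below has a different one.
classes = [unicodedata.combining(m) for m in (ACUTE, GRAVE, GRAVE_BELOW)]
print(classes)  # [230, 230, 220]

# Same kind: the order is significant, so the two sequences stay distinct.
same_kind_equal = nfd("e" + ACUTE + GRAVE) == nfd("e" + GRAVE + ACUTE)
print(same_kind_equal)  # False

# Different kinds: normalization reorders them, so the sequences compare equal.
diff_kind_equal = nfd("e" + ACUTE + GRAVE_BELOW) == nfd("e" + GRAVE_BELOW + ACUTE)
print(diff_kind_equal)  # True
```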
 2. Case depends on locale. In Turkish, for instance, there are two 'i' letters,
one with a dot and one without. Therefore the Turkish upper case of 'i' is a
capital 'I' with a dot, unlike in English.

Casing issues are very complicated and, as you say, language-specific. And not only that: in French, for instance, there is no single uppercasing rule applied to accented letters (even in official texts or newspapers). This is why I think casing simply doesn't belong in a general-purpose text manipulation type. Instead, tools to help define language-, script-, or culture-specific casing algorithms (or app- or domain-specific ones) should be made available in a Unicode toolkit library. But that's just me.

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 17 2011