digitalmars.D - Turkish 'I's can't D either

Ali Cehreli <acehreli yahoo.com> writes:
You may be aware of the problems related to the consistency of the two separate
letter 'I's in the Turkish alphabet (and the alphabets that are based on the
Turkish alphabet).

Lowercase and uppercase versions of the two are consistent in whether they have
a dot or not:

  http://en.wikipedia.org/wiki/Turkish_I

The Turkish alphabet is close to the western alphabets, but not close enough,
which puts it in a strange position. (Strangely, the same applies
geographically, politically, socially, etc. as well... ;))

Computer systems *almost* work for Turkish, but not for those two letters.

I love the fact that D allows Unicode letters in the source code and that it
natively supports Unicode. I cannot stress enough how important this is. That
is the single biggest reason why I decided to finally write a programming
tutorial. Thank you to all who proposed and implemented those features!

Back to the Turquois 'I's... What is a programmer to do when writing programs
that deal with Turkish letters?

a) Accept that Phobos too has this age-old behavior that is a result of
premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))

b) Accept that the problem is unsolvable because the letter I has two
minuscules, and the letter i has two majuscules anyway, and that the intent is
not always clear

c) Accept the Turkish alphabet as being pathological (merely for being in the
minority!), and use a Turkish version of Phobos or some other library

d) Solve the problem with locale support

Is option d possible with today's systems? Whose responsibility is this anyway?
OS? Language? Program? Something else?

Since alphanumerical ordering is also of interest, I think this has something
to do with locales.

Is there a way for a program to work with Turkish letters and ensure that the
following program produces the expected output of 'dotless i', 'I with dot',
and 0?

import std.stdio;
import std.string;
import std.c.locale;
import std.uni;

void main()
{
    const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
    assert(result);

    writeln(toUniLower('I'));
    writeln(toUniUpper('i'));
    writeln(indexOf("I",
                    '\u0131',               // dotless i
                    (CaseSensitive).no));
}

This is a practical question. I really want to be able to work with Turkish...
:)

Thank you,
Ali
Aug 24 2009
Rainer Deyke <rainerd eldwood.com> writes:
Ali Cehreli wrote:
 Is option d possible with today's systems? Whose responsibility is
 this anyway? OS? Language? Program? Something else?

This appears to be a library issue to me. If Phobos can't do this properly,
your options are basically:

1. Find a third party library solution.
2. Write your own solution.
2a. Write your own solution, and try to get it into Phobos.
2b. Write your own solution, and release it as a third-party library so that
other people can use it.

I know ICU can use different case mappings for different locales, but I
don't think it has D bindings.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 24 2009
Ali Cehreli <acehreli yahoo.com> writes:
Rainer Deyke Wrote:

 This appears to be a library issue to me.

I started to see this at a more fundamental level. The Unicode letter I (dotless capital i) has two possible lowercases and letter i has two possible uppercases. The chain of some historical events appears to have produced a crippled system: the application can't know how to lowercase or uppercase those. Having three separate 'i's would keep things elegant and correct, but the ASCII i and I have been in use for Turkish documents for decades now.
 I know ICU can use different case mappings for different locales, but I
 don't think it has D bindings.

In my limited understanding, that seems to contradict what Walter mentions in
the other comment: locales or no locales? I will investigate more. :)

Thanks!
Ali
Aug 25 2009
Rainer Deyke <rainerd eldwood.com> writes:
Ali Cehreli wrote:
 Rainer Deyke Wrote:
 
 This appears to be a library issue to me.

 I started to see this at a more fundamental level. The Unicode letter I
 (dotless capital i) has two possible lowercases

According to ICU, Lithuanian sometimes uses a third.
 and letter i has
 two possible uppercases. The chain of some historical events appears
 to have produced a crippled system: the application can't know how to
 lowercase or uppercase those.

That's hardly the only case where Unicode behavior is locale-dependent. For
example, collating order varies widely between languages.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 25 2009
Frank Benoit <keinfarbton googlemail.com> writes:
Rainer Deyke schrieb:
 I know ICU can use different case mappings for different locales, but I
 don't think it has D bindings.
 
 

There are existing ICU bindings in the mango project, see dsource.org
Aug 25 2009
Walter Bright <newshound1 digitalmars.com> writes:
I think it's great that you're doing a Turkish programming tutorial! I 
can't help you, though, with details of the Turkish language because I 
have no idea how it works.

The only thing I can suggest is that using setlocale for Turkish is the
wrong way, as Unicode was supposed to get away from that.

tolower really is only for ASCII. But the toUniLower should work right 
with Turkish, though I don't know what right is for that case.
Aug 24 2009
Ali Cehreli <acehreli yahoo.com> writes:
Walter Bright Wrote:

 with details of the Turkish language because I 
 have no idea how it works.

It is a very interesting story. The Turkish 'i's have caused lots of trouble,
including hardcoded conditionals in at least the early Java libraries that
checked whether the locale was Turkish. Even Unicode is in a strange position
because two of its code points, I and i, each have two possible case
counterparts. (I don't know whether there are other alphabets in such a
situation.)
 tolower really is only for ASCII. But the toUniLower should work right 
 with Turkish, though I don't know what right is for that case.

The current implementation of toUniLower() favors the ASCII lowercasing of 'I'
over the Turkish one (similarly with toUniUpper() for 'i'):

    dchar toUniLower(dchar c)
    {
        if (c >= 'A' && c <= 'Z')
        {
            c += 32;
        }

An application would need a separate set of toUniLower() and friends to be
able to work in Turkish. I don't think the issue is big enough for Phobos to
tackle with a solution similar to CaseSensitive:

    toUniLower('I', (Alphabet).tr);

Instead, a wrapper around toUniLower() should be used...

Ali
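
The wrapper idea can be sketched roughly like this. It is only a minimal
sketch, not Phobos code: toLowerTr and toUpperTr are hypothetical names, and it
assumes a current Phobos in which std.uni offers toLower and toUpper for single
code points (the toUniLower/toUniUpper names used in this thread come from the
2009 library).

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Hypothetical helper (not part of Phobos): lowercase one code point
// using the Turkish rules for the two 'I's, and defer to std.uni otherwise.
dchar toLowerTr(dchar c)
{
    switch (c)
    {
        case 'I':      return '\u0131'; // I -> dotless ı
        case '\u0130': return 'i';      // İ -> i
        default:       return toLower(c);
    }
}

// Hypothetical helper (not part of Phobos): the uppercase counterpart.
dchar toUpperTr(dchar c)
{
    switch (c)
    {
        case 'i':      return '\u0130'; // i -> İ
        case '\u0131': return 'I';      // dotless ı -> I
        default:       return toUpper(c);
    }
}

void main()
{
    writeln(toLowerTr('I')); // ı
    writeln(toUpperTr('i')); // İ
    writeln(toLowerTr('A')); // a
}
```

Only the four special code points are intercepted; everything else still goes
through the library, so the wrapper stays thin.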
Aug 25 2009
Daniel Keep <daniel.keep.lists gmail.com> writes:
Ali Cehreli wrote:
 Walter Bright Wrote:
 
 with details of the Turkish language because I 
 have no idea how it works.

 It is a very interesting story. The Turkish 'i's have caused lots of trouble,
 including hardcoded conditionals in at least the early Java libraries that
 checked whether the locale was Turkish. Even Unicode is in a strange position
 because two of its code points, I and i, each have two possible case
 counterparts. (I don't know whether there are other alphabets in such a
 situation.)
 tolower really is only for ASCII. But the toUniLower should work right 
 with Turkish, though I don't know what right is for that case.

 The current implementation of toUniLower() favors the ASCII lowercasing of
 'I' over the Turkish one (similarly with toUniUpper() for 'i'):

     dchar toUniLower(dchar c)
     {
         if (c >= 'A' && c <= 'Z')
         {
             c += 32;
         }

 An application would need a separate set of toUniLower() and friends to be
 able to work in Turkish. I don't think the issue is big enough for Phobos to
 tackle with a solution similar to CaseSensitive:

     toUniLower('I', (Alphabet).tr);

 Instead, a wrapper around toUniLower() should be used...

 Ali

To me, it seems that the issue is that the library routines don't have enough
context to be able to correctly work out how to lowercase a string.

Having it locale-dependent seems like a bad idea; let's say I'm processing
some internal data that uses string names; case is irrelevant so I lowercase
them and look them up in a hash table. If there's an I in there, but the
hash table is stored as i, the program will break if run in a Turkish locale.

One thing I think the type system should be used more for is attaching more
semantic information to data. So maybe the solution is to introduce something
like a Text type that also stores the language of the text. Then the library
methods WILL have the right context to know how to act.

Just a thought.
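
A rough sketch of what such a Text type might look like, under loud
assumptions: Text, Language and lowercase are hypothetical names, only the two
Turkish special cases are handled, and std.uni.toLower from a current Phobos
does the rest.

```d
import std.stdio : writeln;
import std.uni : toLower;

// Hypothetical: the language a piece of text is written in.
enum Language { neutral, turkish }

// Hypothetical Text type: a string that carries its language with it,
// so case conversion does not depend on any global locale.
struct Text
{
    string data;
    Language language;

    Text lowercase() const
    {
        string result;
        foreach (dchar c; data)
        {
            dchar lowered;
            if (language == Language.turkish && c == 'I')
                lowered = '\u0131';   // I -> dotless ı
            else if (language == Language.turkish && c == '\u0130')
                lowered = 'i';        // İ -> i
            else
                lowered = toLower(c); // language-neutral mapping
            result ~= lowered;
        }
        return Text(result, language);
    }
}

void main()
{
    // Internal identifiers stay language-neutral, so hash keys are stable.
    auto key = Text("TITLE", Language.neutral);
    // Text meant for Turkish readers carries its language.
    auto word = Text("ILIK", Language.turkish);

    writeln(key.lowercase().data);  // title
    writeln(word.lowercase().data); // ılık
}
```

The internal key is lowercased the same way on every machine, while the
Turkish word gets the Turkish mapping, whatever the user's locale is.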
Aug 25 2009
"Ameer Armaly" <ameer.armaly furman.edu> writes:
"Daniel Keep" <daniel.keep.lists gmail.com> wrote in message 
news:h70aup$cjn$1 digitalmars.com...
One thing I think the typesystem should be used more for is attaching
 more semantic information to data.  So maybe the solution is to
 introduce something like a Text type that also stores the language of
 the text.  Then the library methods WILL have the right context to know
 how to act.

 Just a thought.

If I'm understanding you correctly, then the hash function would treat Turkish
i's the same as any other letter i, because the focus is on internal
processing, but writef and friends would make the distinction because the text
is meant to be read. Am I right?

Ameer
Sep 03 2009
Stewart Gordon <smjg_1998 yahoo.com> writes:
Walter Bright wrote:
 I think it's great that you're doing a Turkish programming tutorial! I 
 can't help you, though, with details of the Turkish language because I 
 have no idea how it works.

It's quite simple actually.

I is the uppercase form of ı.
İ is the uppercase form of i.

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
lists them as

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

but this is inadequate: while it tells you how to case-convert ı and İ
(that's what the 0049 and 0069 at the end are), you need to add a
locale-specific rule to all this to convert I and i in Turkish.

Stewart.
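
To make the field layout concrete, here is a small sketch that pulls the
simple case mappings out of one UnicodeData.txt record. Assumptions: the
records are semicolon-separated, and fields 12 and 13 (counting from 0) hold
the simple uppercase and simple lowercase mappings; SimpleCase and
parseUnicodeDataLine are hypothetical names, not library functions.

```d
import std.array : split;
import std.conv : to;
import std.stdio : writefln;

// Hypothetical record: a code point with its simple case mappings.
// A mapping of 0 means the field was empty (no mapping listed).
struct SimpleCase
{
    dchar code;
    dchar upper;
    dchar lower;
}

// Convert a hexadecimal field such as "0069" to a code point.
dchar hexField(string s)
{
    if (s.length == 0)
        return '\0';
    return cast(dchar) to!uint(s, 16);
}

// Hypothetical parser for a single UnicodeData.txt line.
SimpleCase parseUnicodeDataLine(string line)
{
    auto f = line.split(";");
    assert(f.length >= 14, "expected a UnicodeData.txt record");
    return SimpleCase(hexField(f[0]), hexField(f[12]), hexField(f[13]));
}

void main()
{
    auto records = [
        "0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;",
        "0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049",
        "0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049",
    ];
    foreach (line; records)
    {
        auto r = parseUnicodeDataLine(line);
        writefln("U+%04X upper=U+%04X lower=U+%04X",
                 cast(uint) r.code, cast(uint) r.upper, cast(uint) r.lower);
    }
}
```

Each record carries at most one uppercase and one lowercase target, so a table
built from this file alone cannot send I to ı or i to İ; that has to come from
a separate, language-specific rule.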
Sep 04 2009
Ali Cehreli <acehreli yahoo.com> writes:
Stewart Gordon Wrote:

 I is the uppercase form of ı.
 İ is the uppercase form of i.
 
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 lists them as
 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN 
 CAPITAL LETTER I DOT;;;0069;
 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
 
 but this is inadequate: while it tells you how to case-convert ı and İ 
 (that's what the 0049 and 0069 at the end are), you need to add a 
 locale-specific rule to all this to convert I and i in Turkish.

I think there should be three i's to solve problems like being able to
capitalize strings that contain words from two languages as in e.g. an
imaginary company name "Ali & Jim". The two lowercase i's should have been
separate to be able to work with them correctly. The problem stems from
Unicode...

A group of us are about to start a small project that involves thin wrappers
around Phobos to favor the Turkish behavior for character and string
processing. That should help with applications that are happy to use Turkish
only. More complex applications could use libraries like IBM's ICU.

Ali
Sep 05 2009
Michel Fortin <michel.fortin michelf.com> writes:
On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehreli yahoo.com> said:

 You may be aware of the problems related to the consistency of the two 
 separate letter 'I's in the Turkish alphabet (and the alphabets that 
 are based on the Turkish alphabet).
 
 Lowercase and uppercase versions of the two are consistent in whether 
 they have a dot or not:
 
   http://en.wikipedia.org/wiki/Turkish_I
 
 The Turkish alphabet is close to the western alphabets, but not close 
 enough, which puts it in a strange position. (Strangely, the same 
 applies geographically, politically, socially, etc. as well... ;))
 
 Computer systems *almost* work for Turkish, but not for those two letters.
 
 I love the fact that D allows Unicode letters in the source code and 
 that it natively supports Unicode. I cannot stress enough how important 
 this is. That is the single biggest reason why I decided to finally 
 write a programming tutorial. Thank you to all who proposed and 
 implemented those features!
 
 Back to the Turquois 'I's... What is a programmer to do when writing 
 programs that deal with Turkish letters?
 
 a) Accept that Phobos too has this age-old behavior that is a result of 
 premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 
 'A'))
 
 b) Accept that the problem is unsolvable because the letter I has two 
 minuscules, and the letter i has two majuscules anyway, and that the 
 intent is not always clear
 
 c) Accept the Turkish alphabet as being pathological (merely for being 
 in the minority!), and use a Turkish version of Phobos or some other 
 library
 
 d) Solve the problem with locale support
 
 Is option d possible with today's systems? Whose responsibility is this 
 anyway? OS? Language? Program? Something else?
 
 Since alphanumerical ordering is also of interest, I think this has 
 something to do with locales.
 
 Is there a way for a program to work with Turkish letters and ensure 
 that the following program produces the expected output of 'dotless i', 
 'I with dot', and 0?
 
 import std.stdio;
 import std.string;
 import std.c.locale;
 import std.uni;
 
 void main()
 {
     const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
     assert(result);
 
     writeln(toUniLower('I'));
     writeln(toUniUpper('i'));
     writeln(indexOf("I",
                     '\u0131',               // dotless i
                     (CaseSensitive).no));
 }
 
 This is a practical question. I really want to be able to work with 
 Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a locale
argument to many string methods (unfortunately, not lowercaseString or
uppercaseString) to get the desired result. For instance, the
"rangeOfString:options:range:locale:" method can search for substrings
case-insensitively, and its documentation specifically discusses the Turkish
“ı” character under the locale parameter.

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/rangeOfString:options:range:locale:

It's also interesting to see that when you search for ß in a webpage using
Safari, it also matches every instance of SS (whatever your locale). ß is a
German character that becomes SS in uppercase.

- - -

What I'd like to see is a base class representing a locale. Then you can
instantiate the locale you want (from a config file, by coding it directly,
having bindings to system APIs, or a mix of all this) and use the locale.
Something like:

    class Locale
    {
    immutable:
        string lowercase(string s);
        string uppercase(string s);

        int compare(string a, string b);

        // number & date formatting, etc.
    }

    immutable(Locale) systemLocale(); // get default system locale
    immutable(Locale) locale(string localeName); // get best matching locale

    void main()
    {
        auto turkish = locale("tr-TR");
        writeln(turkish.lowercase("I")); // writes "ı"
        writeln(turkish.uppercase("i")); // writes "İ"

        auto english = locale("en-US");
        writeln(english.lowercase("I")); // writes "i"
        writeln(english.uppercase("i")); // writes "I"

        writeln(systemLocale.lowercase("I")); // depends on user settings
        writeln(systemLocale.uppercase("i")); // depends on user settings
    }

This way you can work with many locales at once. And there's no reliance on a
global state.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 25 2009
Daniel Keep <daniel.keep.lists gmail.com> writes:
Michel Fortin wrote:
 ...
 
 What I'd like to see is a base class representing a locale. Then you
 can instantiate the locale you want (from a config file, by coding it
 directly, having bindings to system APIs, or a mix of all this) and use
 the locale. Something like:
 
     class Locale
     {
     immutable:
         string lowercase(string s);
         string uppercase(string s);
 
         int compare(string a, string b);
 
         // number & date formatting, etc.
     }
 
     ...
 
 This way you can work with many locales at once. And there's no reliance
 on a global state.

You're assuming it's possible and practical to write every method that is
locale-dependent at once in a single class. I personally think that's
somewhat unlikely...
Aug 25 2009
Michel Fortin <michel.fortin michelf.com> writes:
On 2009-08-25 08:04:44 -0400, Daniel Keep <daniel.keep.lists gmail.com> said:

 You're assuming it's possible and practical to write every method that
 is locale-dependent at once in a single class.

No, only the base methods. You can build on them to create other things. With a compare method you can do sorting according to various collations for instance, but the sorting algorithm doesn't need to be part of the locale, it just needs the locale as an argument.
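
As a sketch of that separation: the sort below knows nothing about Turkish and
only receives an object with a compare method. Everything here is hypothetical
(Collation, AlphabetCollation, sortedBy), and ranking letters by their position
in an alphabet string is a deliberately naive stand-in for real collation
data; it assumes lowercase input.

```d
import std.algorithm.searching : countUntil;
import std.algorithm.sorting : sort;
import std.conv : to;
import std.stdio : writeln;

// Hypothetical base type: the only locale-specific piece is compare().
abstract class Collation
{
    abstract int compare(string a, string b);
}

// Hypothetical collation that orders letters by their position in a
// given alphabet. Letters outside the alphabet sort after all others.
final class AlphabetCollation : Collation
{
    private dstring alphabet;

    this(dstring alphabet) { this.alphabet = alphabet; }

    private size_t rank(dchar c)
    {
        auto i = alphabet.countUntil(c);
        if (i < 0)
            return alphabet.length + c;
        return i;
    }

    override int compare(string a, string b)
    {
        auto da = a.to!dstring;
        auto db = b.to!dstring;
        auto n = da.length < db.length ? da.length : db.length;
        foreach (i; 0 .. n)
        {
            auto ra = rank(da[i]);
            auto rb = rank(db[i]);
            if (ra != rb)
                return ra < rb ? -1 : 1;
        }
        if (da.length == db.length)
            return 0;
        return da.length < db.length ? -1 : 1;
    }
}

// The sorting algorithm is generic; the collation is just an argument.
string[] sortedBy(string[] words, Collation c)
{
    auto result = words.dup;
    sort!((x, y) => c.compare(x, y) < 0)(result);
    return result;
}

void main()
{
    auto turkish = new AlphabetCollation("abcçdefgğhıijklmnoöprsştuüvyz"d);
    auto words = ["izmir", "ılık", "hasan", "çay", "dal"];
    writeln(sortedBy(words, turkish));
}
```

With the Turkish alphabet passed in, the words come out as çay, dal, hasan,
ılık, izmir; a plain code-point sort would put çay and ılık last.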
 I personally think that's somewhat unlikely...

Every attempt at defining locales in an operating system attempts to
centralize the data at some place. That said, a Locale class should know its
locale name, which means in turn that by passing an instance of Locale to some
function, that function can take the locale name and do its own thing.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 25 2009