digitalmars.D - Turkish 'I's can't D either

Ali Cehreli (31/31) Aug 24 2009 You may be aware of the problems related to the consistency of the two s...

Rainer Deyke (12/14) Aug 24 2009 This appears to be a library issue to me. If Phobos can't do this

Ali Cehreli (6/9) Aug 25 2009 I started to see this at a more fundamental level. The Unicode letter I ...

Rainer Deyke (6/16) Aug 25 2009 That's hardly the only case where unicode behavior is locale-dependent.

Frank Benoit (2/6) Aug 25 2009 There are existing ICU bindings in the mango project, see dsource.org

Walter Bright (7/7) Aug 24 2009 I think it's great that you're doing a Turkish programming tutorial! I

Ali Cehreli (15/19) Aug 25 2009 It is a very interesting story. The Turkish 'i's have caused lots of tro...

Daniel Keep (14/44) Aug 25 2009 To me, it seems that the issue is that the library routines don't have

Ameer Armaly (8/14) Sep 03 2009 If I'm understanding you correctly, then the hash function would treat

Stewart Gordon (16/19) Sep 04 2009

Ali Cehreli (4/18) Sep 05 2009 I think there should be three i's to solve problems like being able to c...

Michel Fortin (45/112) Aug 25 2009 Perhaps this could be of some inspiration. In Cocoa you can pass a

Daniel Keep (4/27) Aug 25 2009 You're assuming it's possible and practical to write every method that

Michel Fortin (14/17) Aug 25 2009 No, only the base methods. You can build on them to create other

Ali Cehreli <acehreli yahoo.com> writes:

You may be aware of the problems related to the consistency of the two separate
letter 'I's in the Turkish alphabet (and the alphabets that are based on the
Turkish alphabet).

Lowercase and uppercase versions of the two are consistent in whether they have
a dot or not:

  http://en.wikipedia.org/wiki/Turkish_I

The Turkish alphabet is close to the western alphabets, but not close enough,
which puts it in a strange position. (Strangely, the same applies
geographically, politically, socially, etc. as well... ;))

Computer systems *almost* work for Turkish, but not for those two letters.

I love the fact that D allows Unicode letters in the source code and that it
natively supports Unicode. I cannot stress enough how important this is. That
is the single biggest reason why I decided to finally write a programming
tutorial. Thank you to all who proposed and implemented those features!

Back to the Turquoise 'I's... What is a programmer to do when writing programs
that deal with Turkish letters?

a) Accept that Phobos too has this age old behavior that is a result of
premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))

b) Accept that the problem is unsolvable because the letter I has two
minuscules, and the letter i has two majuscules anyway, and that the intent is
not always clear

c) Accept Turkish alphabet as being pathological (merely for being in the
minority!), and use a Turkish version of Phobos or some other library

d) Solve the problem with locale support

Is option d possible with today's systems? Whose responsibility is this anyway?
OS? Language? Program? Something else?

Since alphabetical ordering is also of interest, I think this has
something to do with locales.

Is there a way for a program to work with Turkish letters and ensure that the
following program produces the expected output of 'dotless i', 'I with dot',
and 0?

import std.stdio;
import std.string;
import std.c.locale;
import std.uni;

void main()
{
    const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
    assert(result);

    writeln(toUniLower('I'));
    writeln(toUniUpper('i'));
    writeln(indexOf("I",
                    '\u0131',               // dotless i
                    (CaseSensitive).no));
}

This is a practical question. I really want to be able to work with Turkish...
:)

Thank you,
Ali

Aug 24 2009

Rainer Deyke <rainerd eldwood.com> writes:

Ali Cehreli wrote:
 Is option d possible with today's systems? Whose responsibility is
 this anyway? OS? Language? Program? Something else?

This appears to be a library issue to me.  If Phobos can't do this
properly, your options are basically:
  1. Find a third party library solution.
  2. Write your own solution.
  2a. Write your own solution, and try to get it into Phobos.
  2b. Write your own solution, and release it as a third-party library
so that other people can use it.

I know ICU can use different case mappings for different locales, but I
don't think it has D bindings.


-- 
Rainer Deyke - rainerd eldwood.com

Aug 24 2009

Ali Cehreli <acehreli yahoo.com> writes:

Rainer Deyke Wrote:

 This appears to be a library issue to me.

I started to see this at a more fundamental level. The Unicode letter I
(dotless capital i) has two possible lowercases and letter i has two possible
uppercases. The chain of some historical events appears to have produced a
crippled system: the application can't know how to lowercase or uppercase those.

Having three separate 'i's would keep things elegant and correct, but the ASCII
i and I have been in use for Turkish documents for decades now.

 I know ICU can use different case mappings for different locales, but I
 don't think it has D bindings.

Under my limited understanding, that seems to be in contradiction with what
Walter mentions in the other comment: locales or no locales? I will investigate
more. :)

Thanks!
Ali

Aug 25 2009

Rainer Deyke <rainerd eldwood.com> writes:

Ali Cehreli wrote:
 Rainer Deyke Wrote:
 
 This appears to be a library issue to me.

 
 I started to see this at a more fundamental level. The Unicode letter
 I (dotless capital i) has two possible lowercases

According to ICU, Lithuanian sometimes uses a third.

 and letter i has
 two possible uppercases. The chain of some historical events appears
 to have produced a crippled system: the application can't know how to
 lowercase or uppercase those.

That's hardly the only case where Unicode behavior is locale-dependent.
For example, collating order varies widely between languages.


-- 
Rainer Deyke - rainerd eldwood.com

Aug 25 2009

Frank Benoit <keinfarbton googlemail.com> writes:

Rainer Deyke schrieb:
 I know ICU can use different case mappings for different locales, but I
 don't think it has D bindings.
 
 

There are existing ICU bindings in the mango project, see dsource.org

Aug 25 2009

Walter Bright <newshound1 digitalmars.com> writes:

I think it's great that you're doing a Turkish programming tutorial! I 
can't help you, though, with details of the Turkish language because I 
have no idea how it works.

The only thing I can suggest is that using setlocale for Turkish is the 
wrong way, as Unicode was supposed to get away from that.

tolower really is only for ASCII. But the toUniLower should work right 
with Turkish, though I don't know what right is for that case.

Aug 24 2009

Ali Cehreli <acehreli yahoo.com> writes:

Walter Bright Wrote:

 with details of the Turkish language because I 
 have no idea how it works.

It is a very interesting story. The Turkish 'i's have caused lots of trouble,
even hardcoded conditionals in at least the early Java libraries that checked
whether the locale was Turkish.

Even Unicode is in a strange position, because two of its code points (I and i)
each have two possible case mappings. (I don't know whether there are other
alphabets in such a situation.)

 tolower really is only for ASCII. But the toUniLower should work right 
 with Turkish, though I don't know what right is for that case.

The current implementation of toUniLower() favors the ASCII lowercasing of 'I'
over the Turkish one (similar with toUniUpper() for i):

dchar toUniLower(dchar c)
{
    if (c >= 'A' && c <= 'Z')
    {
        c += 32;    // ASCII rule: 'I' always becomes 'i'
    }
    // ... (rest of the function omitted)
}
An application would need a separate set of toUniLower() and friends to be able
to work in Turkish.

I don't think the issue is big enough for Phobos to tackle with a solution
similar to CaseSensitive:

   toUniLower('I', (Alphabet).tr);

Instead, a wrapper around toUniLower() should be used...
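
For illustration, such a wrapper could look roughly like the sketch below. It
assumes a current Phobos, where std.uni.toLower and std.uni.toUpper have taken
the place of toUniLower and toUniUpper. The names trLower and trUpper are made
up for this example and are not part of Phobos.

```d
import std.uni : toLower, toUpper;

// Hypothetical wrapper: lowercases one code point with Turkish rules.
dchar trLower(dchar c)
{
    switch (c)
    {
        case 'I':      return '\u0131'; // I -> dotless ı
        case '\u0130': return 'i';      // İ -> i
        default:       return toLower(c);
    }
}

// Hypothetical wrapper: uppercases one code point with Turkish rules.
dchar trUpper(dchar c)
{
    switch (c)
    {
        case 'i':      return '\u0130'; // i -> İ
        case '\u0131': return 'I';      // dotless ı -> I
        default:       return toUpper(c);
    }
}

// String version: decode to code points, map each one, re-encode as UTF-8.
string trLower(string s)
{
    string result;
    foreach (dchar c; s)
        result ~= trLower(c);
    return result;
}

unittest
{
    assert(trLower('I') == '\u0131');
    assert(trUpper('i') == '\u0130');
    assert(trLower("DİYARBAKIR") == "diyarbakır");
}
```

The unittest block can be run with something like
"dmd -unittest -main -run file.d". Only the four code points that differ from
the default mapping are special-cased; everything else is delegated to Phobos.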

Ali

Aug 25 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Ali Cehreli wrote:
 Walter Bright Wrote:
 
 with details of the Turkish language because I 
 have no idea how it works.

 
 It is a very interesting story. The Turkish 'i's have caused lots of trouble,
even hardcoded conditionals in at least the early Java libraries that checked
whether the locale was Turkish.
 
 Even the Unicode is in a strange position because two Unicode code points have
two separate upper and lower cases. (I don't know whether there are other
alphabets in such a situation.)
 
 tolower really is only for ASCII. But the toUniLower should work right 
 with Turkish, though I don't know what right is for that case.

 
 The current implementation of toUniLower() favors the ASCII lowercasing of 'I'
over the Turkish one (similar with toUniUpper() for i):
 
 dchar toUniLower(dchar c)
 {
     if (c >= 'A' && c <= 'Z')
     {
         c += 32;
     }
 
 An application would need a separate set of toUniLower() and friends to be
able to work in Turkish.
 
 I don't think the issue is big enough for Phobos to tackle with a solution
similar to CaseSensitive:
 
    toUniLower('I', (Alphabet).tr);
 
 Instead, a wrapper around toUniLower() should be used...
 
 Ali

To me, it seems that the issue is that the library routines don't have
enough context to be able to correctly work out how to lowercase a string.

Having it locale-dependent seems like a bad idea; let's say I'm
processing some internal data that uses string names; case is irrelevant
so I lowercase them and look them up in a hash table.

If there's an I in there, but the hashtable is stored as i, the program
will break if run in a Turkish locale.

One thing I think the typesystem should be used more for is attaching
more semantic information to data.  So maybe the solution is to
introduce something like a Text type that also stores the language of
the text.  Then the library methods WILL have the right context to know
how to act.
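
A very rough sketch of that idea, assuming a current Phobos (std.uni.toLower).
Text and Language are hypothetical names invented for this example, and only
the Turkish special case is handled:

```d
import std.uni : toLower;

// Hypothetical language tag; a real design would cover far more languages.
enum Language { neutral, turkish }

// Hypothetical Text type: a string plus the language it is written in.
struct Text
{
    string data;
    Language lang = Language.neutral;

    // Lowercase according to the language carried by the value itself,
    // not according to any global locale.
    Text lower() const
    {
        immutable turkish = (lang == Language.turkish);
        string result;
        foreach (dchar c; data)
        {
            dchar lowered;
            if (turkish && c == 'I')
                lowered = '\u0131';   // I -> dotless ı
            else if (turkish && c == '\u0130')
                lowered = 'i';        // İ -> i
            else
                lowered = toLower(c); // default mapping from Phobos
            result ~= lowered;
        }
        return Text(result, lang);
    }
}

unittest
{
    // Internal identifiers stay language-neutral, so hash lookups are stable.
    int[string] table = ["title": 1];
    assert(Text("TITLE").lower().data in table);

    // Text meant for Turkish readers gets the Turkish mapping.
    assert(Text("TITLE", Language.turkish).lower().data == "tıtle");
}
```

With this shape the hash table case keeps working on a Turkish system,
because the result depends only on the value's own tag.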

Just a thought.

Aug 25 2009

"Ameer Armaly" <ameer.armaly furman.edu> writes:

"Daniel Keep" <daniel.keep.lists gmail.com> wrote in message 
news:h70aup$cjn$1 digitalmars.com...
One thing I think the typesystem should be used more for is attaching
 more semantic information to data.  So maybe the solution is to
 introduce something like a Text type that also stores the language of
 the text.  Then the library methods WILL have the right context to know
 how to act.

 Just a thought.

If I'm understanding you correctly, then the hash function would treat 
Turkish i's the same as any other letter i, because the focus is on internal 
processing, but writef and friends would make the distinction because the 
text is meant to be read.
Am I right?

Ameer

Sep 03 2009

Stewart Gordon <smjg_1998 yahoo.com> writes:

Walter Bright wrote:
 I think it's great that you're doing a Turkish programming tutorial! I 
 can't help you, though, with details of the Turkish language because I 
 have no idea how it works.

<snip>

It's quite simple actually.

I is the uppercase form of ı.
İ is the uppercase form of i.

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
lists them as
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

but this is inadequate: while it tells you how to case-convert ı and İ 
(that's what the 0049 and 0069 at the end are), you need to add a 
locale-specific rule to all this to convert I and i in Turkish.
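
To make the field layout concrete: in UnicodeData.txt the semicolon-separated
fields 12, 13 and 14 (counting from 0) hold the simple uppercase, lowercase
and titlecase mappings. Below is a small sketch that pulls the first two out
of a record. CaseRecord, parseHex and parseCaseRecord are names invented for
this example, and '\0' is used here to mean "no mapping listed".

```d
import std.array : split;
import std.conv : to;
import std.exception : enforce;

// Hypothetical holder for the simple case mappings of one code point.
struct CaseRecord
{
    dchar code;
    dchar upper; // '\0' when the field is empty
    dchar lower; // '\0' when the field is empty
}

// Parses a hexadecimal code point field; an empty field yields '\0'.
dchar parseHex(string field)
{
    if (field.length == 0)
        return '\0';
    return cast(dchar) to!uint(field, 16);
}

// Extracts the code point and its simple upper/lower mappings from one line.
CaseRecord parseCaseRecord(string line)
{
    auto f = line.split(";");
    enforce(f.length >= 14, "not a UnicodeData.txt record");
    return CaseRecord(parseHex(f[0]), parseHex(f[12]), parseHex(f[13]));
}

unittest
{
    auto dotless = parseCaseRecord(
        "0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049");
    assert(dotless.upper == 'I');   // ı uppercases to plain I
    assert(dotless.lower == '\0');  // already lowercase

    auto capI = parseCaseRecord(
        "0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;");
    assert(capI.lower == 'i');      // the table only knows I -> i
}
```

The second assertion is exactly the gap described above: the record for I can
only name one lowercase form, so the Turkish I -> ı rule has to come from
somewhere else.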

Stewart.

Sep 04 2009

Ali Cehreli <acehreli yahoo.com> writes:

Stewart Gordon Wrote:

 I is the uppercase form of ı.
 İ is the uppercase form of i.
 
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 lists them as
 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
 
 but this is inadequate: while it tells you how to case-convert ı and İ 
 (that's what the 0049 and 0069 at the end are), you need to add a 
 locale-specific rule to all this to convert I and i in Turkish.

I think there should be three i's to solve problems like being able to
capitalize strings that contain words from two languages as in e.g. an
imaginary company name "Ali & Jim". The two lowercase i's should have been
separate to be able to work with them correctly. The problem stems from
Unicode...

A group of us are about to start a small project that involves thin wrappers
around Phobos to favor the Turkish behavior for character and string
processing. That should help with applications that are happy to use Turkish
only. More complex applications could use libraries like IBM's ICU.

Ali

Sep 05 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehreli yahoo.com> said:

 You may be aware of the problems related to the consistency of the two 
 separate letter 'I's in the Turkish alphabet (and the alphabets that 
 are based on the Turkish alphabet).
 
 Lowercase and uppercase versions of the two are consistent in whether 
 they have a dot or not:
 
   http://en.wikipedia.org/wiki/Turkish_I
 
 Turkish alphabet being in a position so close to the western alphabets, 
 but not close enough, puts it in a strange position. (Strangely; the 
 same applies geographically, politically, socially, etc. as well... ;))
 
 Computer systems *almost* work for Turkish, but not for those two letters.
 
 I love the fact that D allows Unicode letters in the source code and 
 that it natively supports Unicode. I cannot stress enough how important 
 this is. That is the single biggest reason why I decided to finally 
 write a programming tutorial. Thank you to all who proposed and 
 implemented those features!
 
 Back to the Turquoise 'I's... What is a programmer to do when writing 
 programs that deal with Turkish letters?
 
 a) Accept that Phobos too has this age old behavior that is a result of 
 premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 
 'A'))
 
 b) Accept that the problem is unsolvable because the letter I has two 
 minuscules, and the letter i has two majuscules anyway, and that the 
 intent is not always clear
 
 c) Accept Turkish alphabet as being pathological (merely for being in 
 the minority!), and use a Turkish version of Phobos or some other 
 library
 
 d) Solve the problem with locale support
 
 Is option d possible with today's systems? Whose responsibility is this 
 anyway? OS? Language? Program? Something else?
 
 The fact that alphanumerical ordering is also of interest, I think this 
 has something to do with locales.
 
 Is there a way for a program to work with Turkish letters and ensure 
 that the following program produces the expected output of 'dotless i', 
 'I with dot', and 0?
 
 import std.stdio;
 import std.string;
 import std.c.locale;
 import std.uni;
 
 void main()
 {
     const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
     assert(result);
 
     writeln(toUniLower('I'));
     writeln(toUniUpper('i'));
     writeln(indexOf("I",
                     '\u0131',               // dotless i
                     (CaseSensitive).no));
 }
 
 This is a practical question. I really want to be able to work with 
 Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a 
locale argument to many string methods (unfortunately, not 
lowercaseString or uppercaseString) to get the desired result. For 
instance, the "rangeOfString:options:range:locale:" method can search 
for substrings case-insensitively, and its documentation specifically 
discusses the Turkish “ı” character under the locale parameter.

It's also interesting to see that when you search for ß in a webpage using 
Safari, it also matches every instance of SS (whatever your locale). ß 
is a German character that becomes SS in uppercase.

 - - -

What I'd like to see is a base class representing a locale. Then you 
can instantiate the locale you want (from a config file, by coding it 
directly, having bindings to system APIs, or a mix of all this) and use 
the locale. Something like:

	import std.stdio; // for writeln

	class Locale
	{
	immutable:
		string lowercase(string s);
		string uppercase(string s);

		int compare(string a, string b);

		// number & date formatting, etc.
	}

	immutable(Locale) systemLocale();            // get default system locale
	immutable(Locale) locale(string localeName); // get best matching locale

	void main()
	{
		auto turkish = locale("tr-TR");
		writeln(turkish.lowercase("I")); // writes "ı"
		writeln(turkish.uppercase("i")); // writes "İ"

		auto english = locale("en-US");
		writeln(english.lowercase("I")); // writes "i"
		writeln(english.uppercase("i")); // writes "I"

		writeln(systemLocale.lowercase("I")); // depends on user settings
		writeln(systemLocale.uppercase("i")); // depends on user settings
	}

This way you can work with many locales at once. And there's no 
reliance on a global state.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Aug 25 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Michel Fortin wrote:
 ...
 
 What I'd like to see is a base class representing a locale. Then you
 can instantiate the locale you want (from a config file, by coding it
 directly, having bindings to system APIs, or a mix of all this) and use
 the locale. Something like:
 
     class Locale
     {
     immutable:
         string lowercase(string s);
         string uppercase(string s);
 
         int compare(string a, string b);
 
         // number & date formatting, etc.
     }
 
     ...
 
 This way you can work with many locales at once. And there's no reliance
 on a global state.

You're assuming it's possible and practical to write every method that
is locale-dependent at once in a single class.

I personally think that's somewhat unlikely...

Aug 25 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-08-25 08:04:44 -0400, Daniel Keep <daniel.keep.lists gmail.com> said:

 You're assuming it's possible and practical to write every method that
 is locale-dependent at once in a single class.

No, only the base methods. You can build on them to create other 
things. With a compare method you can do sorting according to various 
collations for instance, but the sorting algorithm doesn't need to be 
part of the locale, it just needs the locale as an argument.
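
As a sketch of what that means, the sort routine below knows nothing about
Turkish; it only receives a compare function. The trCompare function stands
in for a locale's compare method. All names here (trAlphabet, rank,
trCompare, sortedBy) are made up for the example, and the comparison is
deliberately naive: lowercase letters only, with unknown characters ranked
last.

```d
import std.algorithm : min, sort;
import std.conv : to;

// Lowercase Turkish alphabet in collation order (note ı before i).
immutable dstring trAlphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

// Position of a letter in the Turkish alphabet; unknown characters sort last.
int rank(dchar c)
{
    foreach (i, dchar a; trAlphabet)
        if (a == c)
            return cast(int) i;
    return int.max;
}

// Naive Turkish comparison: negative, zero or positive, like strcmp.
int trCompare(string a, string b)
{
    auto da = a.to!dstring; // work on code points, not UTF-8 bytes
    auto db = b.to!dstring;
    foreach (i; 0 .. min(da.length, db.length))
    {
        immutable ra = rank(da[i]);
        immutable rb = rank(db[i]);
        if (ra != rb)
            return ra < rb ? -1 : 1;
    }
    // Equal up to the shorter length: the shorter string comes first.
    if (da.length == db.length)
        return 0;
    return da.length < db.length ? -1 : 1;
}

// Generic sort: the collation rule is passed in, not built in.
string[] sortedBy(string[] words, int function(string, string) compare)
{
    auto copy = words.dup;
    sort!((x, y) => compare(x, y) < 0)(copy);
    return copy;
}

unittest
{
    auto words = ["iz", "ız", "jet", "hız"];
    assert(sortedBy(words, &trCompare) == ["hız", "ız", "iz", "jet"]);
}
```

A plain code-point sort would put "ız" after "jet", since ı (U+0131) lies
outside ASCII.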

 I personally think that's somewhat unlikely...

Every attempt at defining locales in an operating system attempts to 
centralize the data at some place. That said, a Locale class should 
know its locale name, which means in turn that by passing an instance 
of Locale to some function, that function can take the locale name and 
do its own thing.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Aug 25 2009
