digitalmars.D - Table of strings sorting problem

reply Aarti <aarti interia.pl> writes:
Hello all D-Fans!

I encountered a problem with string sorting according to Polish language 
rules. Here is a simple test program:

// ----------------------------------
import std.stdio;
void main() {
	char[][] table;
	table.length=15;
	
	table[0]="ą";
	table[1]="a";
	table[2]="ć";
	table[3]="c";
	table[4]="ę";
	table[5]="e";
	table[6]="ń";
	table[7]="n";
	table[6]="ł";
	table[7]="l";
	table[8]="ó";
	table[9]="o";
	table[10]="ś";
	table[11]="s";
	table[12]="ź";
	table[13]="ż";
	table[14]="z";

	table.sort;

	foreach(char[] s; table) {
		writef(s);
	}
	writefln();
}
// ----------------------------------

Output of this test is:
aceloszóąćęłśźż

when it should be:
aącćeęlłoósśzźż

It looks like sort doesn't sort properly according to language rules.

Is it a known issue? How to sort strings in D according to language rules?

PS. Possibility of using Polish characters in class identifiers is for 
me really cool. In C++ books in examples you can see all the time 
Trojkat instead of Trójkąt (triangle) and it looks awful.

Regards
Marcin Kuszczak
Mar 10 2006
next sibling parent reply S. Chancellor <dnewsgr mephit.kicks-ass.org> writes:
On 2006-03-10 17:20:35 -0800, Aarti <aarti interia.pl> said:

 Hello all D-Fans!
 
 I encountered a problem with string sorting according to Polish 
 language rules. Here is a simple test program:
 
 // ----------------------------------
 import std.stdio;
 void main() {
 	char[][] table;
 	table.length=15;
 	
 	table[0]="ą";
 	table[1]="a";
 	table[2]="ć";
 	table[3]="c";
 	table[4]="ę";
 	table[5]="e";
 	table[6]="ń";
 	table[7]="n";
 	table[6]="ł";
 	table[7]="l";
 	table[8]="ó";
 	table[9]="o";
 	table[10]="ś";
 	table[11]="s";
 	table[12]="ź";
 	table[13]="ż";
 	table[14]="z";
 
 	table.sort;
 
 	foreach(char[] s; table) {
 		writef(s);
 	}
 	writefln();
 }
 // ----------------------------------
 
 Output of this test is:
 aceloszóąćęłśźż
 
 when it should be:
 aącćeęlłoósśzźż
 
 It looks like sort doesn't sort properly according to language rules.
 
 Is it a known issue? How to sort strings in D according to language rules?
 
 PS. Possibility of using Polish characters in class identifiers is for 
 me really cool. In C++ books in examples you can see all the time 
 Trojkat instead of Trójkąt (triangle) and it looks awful.
 
 Regards
 Marcin Kuszczak

Sort works off the binary value of each character. A sort for the Polish language would have to be implemented manually: you would need to specify a map from each character to its sort order and sort based on that. I'm not sure if the sort property takes a delegate; that was something that was proposed before. You could mainly say it's a coincidence that the Latin characters fall in order numerically. (It was probably done on purpose by whoever decided the ASCII character values, though.)

-S.
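A minimal sketch of the "map from character to sort order" idea, for illustration only. It assumes a current D compiler with Phobos (`std.algorithm.sort` with a custom predicate, which is not the built-in `.sort` property used in the thread). The names `polishOrder`, `rank` and `polishLess` are hypothetical, the table covers lowercase letters only, and it is a plain per-character ranking rather than a full collation algorithm.

```d
import std.algorithm : sort;
import std.array : join;
import std.stdio : writeln;
import std.utf : decode;

// Assumed ordering table (lowercase only): a letter's index is its sort weight.
immutable dstring polishOrder = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"d;

// Returns the sort weight of c. Characters missing from the table
// are placed after the alphabet, ordered by code point.
uint rank(dchar c)
{
    foreach (i, dchar p; polishOrder)
        if (p == c)
            return cast(uint) i;
    return cast(uint)(polishOrder.length + c);
}

// Compares two UTF-8 strings character by character using rank().
bool polishLess(string a, string b)
{
    size_t i = 0, j = 0;
    while (i < a.length && j < b.length)
    {
        immutable ra = rank(decode(a, i)); // decode advances the index
        immutable rb = rank(decode(b, j));
        if (ra != rb)
            return ra < rb;
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return (a.length - i) < (b.length - j);
}

void main()
{
    // The 15 strings that end up in the table of the test program.
    string[] table = ["ą", "a", "ć", "c", "ę", "e", "ł", "l",
                      "ó", "o", "ś", "s", "ź", "ż", "z"];
    sort!polishLess(table);
    assert(table.join == "aącćeęlłoósśzźż");
    writeln(table.join);
}
```

A lookup through a linear scan is fine for a handful of strings; for larger inputs an associative array from `dchar` to weight would be the more natural choice.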
Mar 10 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
And note that the output
 aceloszóąćęłśźż
prints "english" characters first!!
 acelosz
Mar 10 2006
parent James Dunne <james.jdunne gmail.com> writes:
Hasan Aljudy wrote:
 And note that the output
  aceloszóąćęłśźż
 prints "english" characters first!!
  acelosz

Correction: ASCII characters first, because they are in the range 0-127. Look at the Unicode tables; they're publicly available. Other Latin-script languages use the ASCII characters too.

The problem is language- and culture-specific collation. It is a very difficult problem to solve generically, since each language has many subcultures and each subculture agrees on different rules for collating text. See the discussions on ICU in the archives.

If one is looking for an explanation of the problem along with a collation solution, I would recommend:
http://www.unicode.org/reports/tr10/

--
Regards,
James Dunne
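To see why the ASCII letters come out first, it can help to print the code points. The sketch below assumes a current D compiler with Phobos; `sortByCodePoint` is a hypothetical helper written for this illustration. It sorts by code point, which gives the same order as the byte-wise comparison of UTF-8 strings seen in the test program.

```d
import std.algorithm : sort;
import std.conv : to;
import std.stdio : writefln;

// Sorts the characters of s by their Unicode code point values.
string sortByCodePoint(string s)
{
    dchar[] letters = s.to!dstring.dup; // one array element per code point
    sort(letters);                      // plain numeric ordering
    return letters.to!string;
}

void main()
{
    // The letters from the test program, given in Polish alphabetical order.
    immutable sorted = sortByCodePoint("aącćeęlłoósśzźż");
    assert(sorted == "aceloszóąćęłśźż");

    foreach (dchar c; sorted)
    {
        // Code points below 128 are ASCII and therefore sort first.
        writefln("%s  U+%04X  %s", c, cast(uint) c,
                 c < 128 ? "ASCII" : "non-ASCII");
    }
}
```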
Mar 11 2006
prev sibling parent reply John C <johnch_atms hotmail.com> writes:
As others have implied, D's standard library isn't culturally aware.

I've been working on a locale package for Mango that will eventually allow correct string sorting for specific languages. This is how you'd sort a list of Polish characters:

const char[][] table = [ "a","ą","c","ć","e","ę" ];
Culture.current = Culture.getCulture("pl-PL");
table.sort();
Mar 11 2006
parent reply Aarti <aarti interia.pl> writes:
John C wrote:

 As others have implied, D's standard library isn't culturally aware.
 
 I've been working on a locale package for Mango that will eventually 
 allow correct string sorting for specific languages. This is how you'd 
 sort a list of Polish characters:
 
 const char[][] table = [ "a","ą","c","ć","e","ę" ];
 Culture.current = Culture.getCulture("pl-PL");
 table.sort();

It would be really helpful! Does it already work? In particular, I don't understand how I can change the standard behaviour of the table sort property.

I think that internationalization support is one of the most important areas that could increase D's acceptance all over the world.

In C++ it's not as easy as it should be, but it's still easier than writing your own sort function, especially when I want my program to support sorting according to the rules of _many_ different languages.

Another problem is that the D documentation does not say anywhere that D sorts tables only in binary order. There should also be a hint on how to implement custom sorters for a table, because right now the language does not behave as expected for strings.

Regards
Marcin Kuszczak
Mar 11 2006
parent John C <johnch_atms hotmail.com> writes:
Aarti wrote:
It would be really helpful! Does it already work? I especially don't understand how can I change standard behaviour of table sort property?

Well, I've written an implementation that works, but it's not yet ready to be unleashed on the public. It might be possible to override _adSort ... not tried it yet. Currently it's just a free function, which can be called as if an array property.
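For readers unfamiliar with the "free function called as if an array property" mechanism, here is a small sketch. It assumes a current D compiler with Phobos and is not the Mango implementation; `sortIgnoringCase` is a hypothetical name, and the case-insensitive comparison is only a stand-in for a real locale-aware one.

```d
import std.algorithm : sort;
import std.uni : icmp;

// Hypothetical free function taking the array as its first parameter.
// It sorts case-insensitively; a culture-aware comparison would go here.
void sortIgnoringCase(string[] arr)
{
    sort!((a, b) => icmp(a, b) < 0)(arr);
}

void main()
{
    auto words = ["b", "C", "a"];
    // Property-style call; equivalent to sortIgnoringCase(words).
    words.sortIgnoringCase();
    assert(words == ["a", "b", "C"]);
}
```

This leaves the built-in ordering untouched and makes the custom ordering an explicit choice at the call site.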
 
Mar 11 2006