
digitalmars.D.learn - Sorting with non-ASCII characters

"Chris" <wendlec tcd.ie> writes:
Short question in case anyone knows the answer straight away:

How do I sort text so that non-ascii characters like "á" are 
treated in the same way as "a"?

Now I'm getting this:

[wow, ara, ába, marca]

===> sort(listAbove);

[ara, marca, wow, ába]

I'd like to get:

[ ába, ara, marca, wow]

Thanks.
Sep 19 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm":

http://d.puremagic.com/issues/show_bug.cgi?id=10566

Unfortunately, I don't know of any workarounds for you :/
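
For context, here is a minimal sketch of why the default sort puts "ába" last: without a collation algorithm, strings are compared element by element on their encoded values, and "á" (U+00E1) has a higher value than any ASCII letter. The helper name `defaultSorted` is only an illustration and does not come from Phobos.

```d
import std.algorithm : sort;
import std.stdio : writeln;

// Hypothetical helper: returns a copy sorted with the default "a < b" comparison.
string[] defaultSorted(string[] words)
{
    auto copy = words.dup;
    sort(copy); // plain lexicographic comparison of the encoded strings
    return copy;
}

void main()
{
    // 'á' is U+00E1, which is greater than 'w' (U+0077), so it sorts after it.
    assert('á' > 'w');
    writeln(defaultSorted(["wow", "ara", "ába", "marca"]));
    // prints ["ara", "marca", "wow", "ába"]
}
```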
Sep 19 2013
"Chris" <wendlec tcd.ie> writes:
Good that I asked! Imagine the time I would have wasted.
Sep 19 2013
"bearophile" <bearophileHUGS lycos.com> writes:
The correct solution is a well implemented, well updated and well debugged Unicode Collation Algorithm, as already answered. But if an approximation is enough then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a schwartzSort:

http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py

(If you translate that function to D, you could also add it to Dub later.)

Bye,
bearophile
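
As a rough sketch of the "approximate key plus schwartzSort" idea, the snippet below does not translate the AsciiDammit table. Instead it swaps in a different technique: it decomposes each string to NFD with std.uni and drops the combining marks. The helper name `foldKey` is hypothetical. Letters without a canonical decomposition (for example "ø" or "ß") pass through unchanged, so this is only an approximation and not a substitute for the Unicode Collation Algorithm.

```d
import std.algorithm : filter, schwartzSort;
import std.array : array;
import std.stdio : writeln;
import std.uni : combiningClass, normalize, NormalizationForm, toLower;

// Hypothetical helper (not from the thread): builds an approximate sort key by
// lower-casing, decomposing to NFD, and dropping combining marks such as U+0301.
dchar[] foldKey(string s)
{
    return s.toLower
            .normalize!(NormalizationForm.NFD)
            .filter!(c => combiningClass(c) == 0) // keep base characters only
            .array;
}

void main()
{
    auto words = ["wow", "ara", "ába", "marca"];
    // schwartzSort computes each key once, then sorts by the keys.
    schwartzSort!foldKey(words);
    writeln(words); // prints ["ába", "ara", "marca", "wow"]
}
```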
Sep 19 2013
"Chris" <wendlec tcd.ie> writes:
Ok, thanks. We'll see.
Sep 19 2013
Ali Çehreli <acehreli yahoo.com> writes:
I have a project that tries to do exactly that:

https://code.google.com/p/trileri/source/browse/trunk/tr/dizgi.d#823

However, it is in Turkish and in need of a rewrite. :/

For the whole thing to work, every character must be of a certain alphabet. Here is the English alphabet:

https://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d#747

Here is how I define e.g. á to be an accented version of a:

https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#23

However, some characters stand individually as they are not accents but proper letters themselves (e.g. ç of the Turkish alphabet):

https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#44

Well... I hope to get back to it at some point, taking advantage of the new std.uni as well.

Ali
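
To illustrate the alphabet-based idea in general terms (this is not the code from the linked project), here is a minimal sketch that sorts by each letter's position in an explicitly listed alphabet. The names `turkishAlphabet` and `alphabetKey` are hypothetical, and only lower-case letters are handled.

```d
import std.algorithm : countUntil, map, schwartzSort;
import std.array : array;
import std.stdio : writeln;

// The 29 lower-case letters of the Turkish alphabet, in dictionary order.
// Here 'ç' is a letter of its own that sorts between 'c' and 'd'.
immutable dstring turkishAlphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

// Hypothetical helper: maps each letter to its position in the alphabet.
// Characters outside the alphabet map to -1 and therefore sort first.
ptrdiff_t[] alphabetKey(string word)
{
    return word.map!(c => turkishAlphabet.countUntil(c)).array;
}

void main()
{
    auto words = ["çay", "dere", "cam", "zil"];
    schwartzSort!alphabetKey(words);
    writeln(words); // prints ["cam", "çay", "dere", "zil"]
}
```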
Sep 19 2013
Jos van Uden <usenet fwend.com> writes:
If you only need to process extended ascii, then you could perhaps make do with a transliterated sort, something like:

    import std.stdio, std.string, std.algorithm, std.uni;

    void main() {
        auto sa = ["wow", "ara", "ába", "Marca"];
        writeln(sa);
        trSort(sa);
        writeln(sa);
    }

    void trSort(C, alias less = "a < b")(C[] arr) {
        static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
        static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
        schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
    }
Sep 19 2013
"Chris" <wendlec tcd.ie> writes:
Ok, thanks, will try that. I'll let you know if it worked.
Sep 19 2013
"Chris" <wendlec tcd.ie> writes:
Thanks a million, Jos! This does the trick for me.
Sep 24 2013
Jos van Uden <usenet fwend.com> writes:
Great. Be aware that the above code does a case-insensitive sort. If you need case-sensitive, you can use something like:

    import std.stdio, std.string, std.algorithm, std.uni;

    void main() {
        auto sa = ["wow", "ara", "ába", "Marca"];
        writeln(sa);
        trSort(sa, CaseSensitive.no);
        writeln(sa);

        writeln;

        sa = ["wow", "ara", "ába", "Marca"];
        writeln(sa);
        trSort(sa, CaseSensitive.yes);
        writeln(sa);
    }

    void trSort(C, alias less = "a < b")(C[] arr,
            CaseSensitive cs = CaseSensitive.yes) {
        static c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸ"d;
        static c2 = "aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUYY"d;

        if (cs == CaseSensitive.no)
            arr.schwartzSort!(a => a.toLower.tr(c1, c2), less);
        else
            arr.schwartzSort!(a => a.tr(c1, c2), less);
    }
Sep 24 2013
"Chris" <wendlec tcd.ie> writes:
Ah, yes of course. I will keep that in mind. At the moment I only need case insensitive, but you never know. Thanks again.
Sep 24 2013