digitalmars.D.learn - Sorting char arrays

Magnus Lie Hetland (14/14) Mar 12 2012 The following fails, which I guess is a bug?

Dmitry Olshansky (8/18) Mar 12 2012 Mm it should perform sort on UTF-8 buffer? Tricky thing but worths an

Magnus Lie Hetland (14/20) Mar 12 2012 Humm -- dunno ;) The UTF-8-semantics of single characters sort of

Jonathan M Davis (10/32) Mar 12 2012 The problem is that sort requires a random access range, and narrow stri...

Magnus Lie Hetland (21/30) Mar 13 2012 I'd certainly think that'd be posible. (Might, in fact, be a nice

Jonathan M Davis (12/44) Mar 13 2012 If you use ubyte[] instead of cast it to ubyte[], then _that_ is a rando...

bearophile (11/17) Mar 12 2012 It's not a bug, char is meant to be a UTF-8. Two workarounds:

Magnus Lie Hetland (8/10) Mar 12 2012 Thanks. I'm doing the sorting in a template, so this won't work -- but

=?UTF-8?B?QWxpIMOHZWhyZWxp?= (27/34) Mar 12 2012 You can use isNarrowString to either disallow or provide special

Magnus Lie Hetland (5/7) Mar 13 2012 Ah -- useful, thanks!

Magnus Lie Hetland <magnus hetland.org> writes:

The following fails, which I guess is a bug?

import std.algorithm;
void main() {
	char[] a = ['a', 'b', 'c'];
	sort(a);
}

I thought maybe I'd report it -- sort of surprises me that it hasn't 
been reported before, but I couldn't find it (although I found some 
similar reports) in the Bugzilla. (No biggie for me, though; the Phobos 
sort seems to fail with all kinds of things, so I have my own anyway... 
;)

-- 
Magnus Lie Hetland
http://hetland.org

Mar 12 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 12.03.2012 16:51, Magnus Lie Hetland wrote:
 The following fails, which I guess is a bug?

 import std.algorithm;
 void main() {
 char[] a = ['a', 'b', 'c'];
 sort(a);
 }

 I thought maybe I'd report it -- sort of surprises me that it hasn't
 been reported before, but I couldn't find it (although I found some
 similar reports) in the Bugzilla. (No biggie for me, though; the Phobos
 sort seems to fail with all kinds of things, so I have my own anyway... ;)

Mm it should perform sort on UTF-8 buffer? Tricky thing but worths an 
enhancement request.

If it's ASCII then try:
sort(a.representation)

representation is in std.string IRC.
-- 
Dmitry Olshansky

Mar 12 2012

Magnus Lie Hetland <magnus hetland.org> writes:

On 2012-03-12 13:09:18 +0000, Dmitry Olshansky said:

 Mm it should perform sort on UTF-8 buffer?

Humm -- dunno ;) The UTF-8-semantics of single characters sort of 
slipped my mind :)

 Tricky thing but worths an enhancement request.

I'm just thinking an array of anything that can be compared should 
probably be sort-able. But comparing individual UTF-8-bytes can be 
weird, indeed. So, yeah. I guess the weirdness follows from the fact 
that individual characters are considered UTF-8 :)

 If it's ASCII then try:
 sort(a.representation)
 
 representation is in std.string IRC.

The thing is, I'm using sort() in a template, and I just happen to use 
char as the template parameter in a unit test. And since I have no 
special UTF-8 values in there, my own sort() works just fine. (Although 
maybe it shouldn't? ;)

-- 
Magnus Lie Hetland
http://hetland.org

Mar 12 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Monday, March 12, 2012 16:04:37 Magnus Lie Hetland wrote:
 On 2012-03-12 13:09:18 +0000, Dmitry Olshansky said:
 Mm it should perform sort on UTF-8 buffer?

 
 Humm -- dunno ;) The UTF-8-semantics of single characters sort of
 slipped my mind :)
 
 Tricky thing but worths an enhancement request.

 
 I'm just thinking an array of anything that can be compared should
 probably be sort-able. But comparing individual UTF-8-bytes can be
 weird, indeed. So, yeah. I guess the weirdness follows from the fact
 that individual characters are considered UTF-8 :)
 
 If it's ASCII then try:
 sort(a.representation)
 
 representation is in std.string IRC.

 
 The thing is, I'm using sort() in a template, and I just happen to use
 char as the template parameter in a unit test. And since I have no
 special UTF-8 values in there, my own sort() works just fine. (Although
 maybe it shouldn't? ;)

The problem is that sort requires a random access range, and narrow string 
(string and wstring) aren't, because they're variable length encoded. I'm not 
sure that strings _can_ be efficiently sorted, because of that, but maybe 
there's a sorting algorthm that could do it reasonably efficiently, and we
could 
special case sort for narrow strings to use that one, but it's a while since I 
messed with sorting algorithms, so I don't remember all of their 
characteristics off of the top of my head. Certainly, with how sort is currenly 
implemented, it can't handle any range which isn't a random access range.

- Jonathan M Davis

Mar 12 2012

Magnus Lie Hetland <magnus hetland.org> writes:

On 2012-03-12 17:24:56 +0000, Jonathan M Davis said:

 The problem is that sort requires a random access range, and narrow string
 (string and wstring) aren't, because they're variable length encoded. I'm not
 sure that strings _can_ be efficiently sorted, because of that, but maybe
 there's a sorting algorthm that could do it reasonably efficiently,

I'd certainly think that'd be posible. (Might, in fact, be a nice 
problem for the students in my algorithms class ;)

I guess I'm just tripped up by the fact that char[] is treated as an 
"almost-string", and hence is "infected" by the variable-length 
encoding of string (i.e., const char[]).

This all makes sense, for sure -- at least as a practical tradeoff or 
the like. (After all, UTF-8 is a super-elegant solution to a very 
difficult problem.) It's just so easy to assume that T[] is a 
random-access range. It seems so *obvious*. And it is true for any type 
T except the variable-length chars... :)

In my case, I was just using char literals (and strings) as an easy way 
of encoding test cases for a template class, and the sorting (+ uniq) 
was just a way of removing duplicates. (Could've used a hash, of 
course.) So I was really just treating it as a ubyte[]. Slightly 
Evil[tm], I guess.

 and we could
 special case sort for narrow strings to use that one, but it's a while since I
 messed with sorting algorithms, so I don't remember all of their
 characteristics off of the top of my head. Certainly, with how sort is currenly
 implemented, it can't handle any range which isn't a random access range.

No, I get that. I was just assuming that any T[] could be treated as a 
random access range if I really wanted to ;)

-- 
Magnus Lie Hetland
http://hetland.org

Mar 13 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, March 13, 2012 10:13:23 Magnus Lie Hetland wrote:
 On 2012-03-12 17:24:56 +0000, Jonathan M Davis said:
 The problem is that sort requires a random access range, and narrow string
 (string and wstring) aren't, because they're variable length encoded. I'm
 not sure that strings _can_ be efficiently sorted, because of that, but
 maybe there's a sorting algorthm that could do it reasonably efficiently,

 I'd certainly think that'd be posible. (Might, in fact, be a nice
 problem for the students in my algorithms class ;)
 
 I guess I'm just tripped up by the fact that char[] is treated as an
 "almost-string", and hence is "infected" by the variable-length
 encoding of string (i.e., const char[]).
 
 This all makes sense, for sure -- at least as a practical tradeoff or
 the like. (After all, UTF-8 is a super-elegant solution to a very
 difficult problem.) It's just so easy to assume that T[] is a
 random-access range. It seems so *obvious*. And it is true for any type
 T except the variable-length chars... :)
 
 In my case, I was just using char literals (and strings) as an easy way
 of encoding test cases for a template class, and the sorting (+ uniq)
 was just a way of removing duplicates. (Could've used a hash, of
 course.) So I was really just treating it as a ubyte[]. Slightly
 Evil[tm], I guess.
 
 and we could
 special case sort for narrow strings to use that one, but it's a while
 since I messed with sorting algorithms, so I don't remember all of their
 characteristics off of the top of my head. Certainly, with how sort is
 currenly implemented, it can't handle any range which isn't a random
 access range.

 No, I get that. I was just assuming that any T[] could be treated as a
 random access range if I really wanted to ;)

If you use ubyte[] instead of cast it to ubyte[], then _that_ is a random-
access range. So, as long as you don't mind operating on code units instead of 
code points (which really only works with ASCII and nothing else for char), 
then you can use functions like sort on it. But, of course, you're screwed as 
soon as you have a non-ASCII character, so you have to be careful. Depending 
on what your requirements are (with regards to efficiency and whatnot), it
might 
just be safer to either using dchar[] or to convert to dchar[] and back again 
for operations that require it (such as sort). But if you _know_ that you're 
only dealing with ASCII, it works just fine to cast to ubyte[] for operations 
that need a random-access range.

- Jonathan M Davis

Mar 13 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Magnus Lie Hetland:

 The following fails, which I guess is a bug?

 import std.algorithm;
 void main() {
 	char[] a = ['a', 'b', 'c'];
 	sort(a);
 }

It's not a bug, char is meant to be a UTF-8. Two workarounds:

import std.algorithm;
void main() {
	dchar[] a1 = ['a', 'b', 'c'];
	sort(a1);
	char[] a2 = ['a', 'b', 'c'];
	sort(cast(ubyte[])a2);
}

Bye,
bearophile

Mar 12 2012

Magnus Lie Hetland <magnus hetland.org> writes:

On 2012-03-12 13:56:15 +0000, bearophile said:

 It's not a bug, char is meant to be a UTF-8.

Right.

 Two workarounds:

Thanks. I'm doing the sorting in a template, so this won't work -- but 
I guess I just can't use char as a type parameter to my template 
either, then :)

-- 
Magnus Lie Hetland
http://hetland.org

Mar 12 2012

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 03/12/2012 08:06 AM, Magnus Lie Hetland wrote:
 On 2012-03-12 13:56:15 +0000, bearophile said:

 It's not a bug, char is meant to be a UTF-8.

 Right.

 Two workarounds:

 Thanks. I'm doing the sorting in a template, so this won't work -- but I
 guess I just can't use char as a type parameter to my template either,
 then :)

You can use isNarrowString to either disallow or provide special 
implementation for narrow strings (char[] and wchar[]):

import std.traits;

void foo(T)(T t)
     if (!isNarrowString!T)
{
     // ...
}

void bar(T)(T t)
{
     static if (isNarrowString!T){
         // ...

     } else {
         // ...
     }
}

void main()
{
     char[] sc;
     dchar[] sd;
     // foo(sc);     // <-- compilation error
     foo(sd);

     bar(sc);
     bar(sd);
}

Ali

Mar 12 2012

Magnus Lie Hetland <magnus hetland.org> writes:

On 2012-03-12 17:33:35 +0000, Ali �ehreli said:

 You can use isNarrowString to either disallow or provide special 
 implementation for narrow strings (char[] and wchar[]):

Ah -- useful, thanks!

-- 
Magnus Lie Hetland
http://hetland.org

Mar 13 2012

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Sorting char arrays