www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Faster sort?

reply Andrea Fontana <nospam example.com> writes:
I did a couple of tries using a "specialized" implementation of 
sort!less.

For example using a counting sort for bool seems a good idea. 
This code:

auto boolSort(alias less = "a<b")(in bool[] data)
{
    import std.array        : uninitializedArray;
    import std.functional   : binaryFun;

    // Check if order is true/false or false/true
    immutable test = binaryFun!less(true, false);
    size_t count;

    foreach(ref x; data)
       if (x == test) count++;

    return chain(repeat(test,count), repeat(!test, data.length - 
count));
}

Count number of false (or true, it depends on result of "less"), 
and then chain them.

I did a test over 10000 randomized array of 2,4,8,16...65536 
elements.

rdmd -O -release -noboundscheck -inline sort.d :

2 inputs:
Faster: 229 μs and 9 hnsecs
Phobos: 215 μs and 5 hnsecs

...

64 inputs:
Faster: 513 μs and 4 hnsecs
Phobos: 2 ms, 143 μs, and 5 hnsecs

...

65536 inputs:
Faster: 2 secs, 700 ms, and 696 μs
Phobos: 6 secs, 835 ms, and 319 μs

Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort  
instead:

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs


The same thing works for byte[] and ubyte[] array using a few 
tricks (eg: using counting sort and then sorting only keys).

So for ubyte[]/byte[] (ldc):

2 inputs:
1 ms, 305 μs, and 4 hnsecs
31 μs and 4 hnsecs

...

256 inputs:
8 ms, 76 μs, and 3 hnsecs
8 ms, 807 μs, and 3 hnsecs

...

65536 inputs:
347 ms, 330 μs, and 9 hnsecs
14 secs, 214 ms, 66 μs, and 5 hnsecs

Anyway implementation for byte/ubyte should be improved as long 
as it probably won't work fine if less is a functor or delegate:

auto byteSort(alias less = "a<b", T)(in T[] data)
if (is(T == byte) || is(T == ubyte))
{
    import std.range     : iota;
    import std.algorithm : map;
    import std.array     : uninitializedArray, array;

    // It needs 256*size_t.sizeof bytes. Maybe not suitables for 
embedded platforms?
    size_t[256] buckets;

    foreach(ref x; data)
       buckets[cast(ubyte)x]++;

    // Sort 256 keys using less function at compile time (less 
should work at compile time)
    static immutable sortedKeys = iota(0, 256).map!(x => 
cast(T)x).array.sort!less.array;

    auto result = uninitializedArray!(T[])(data.length);
    size_t currentIdx = 0;

    // bucket[0] = 3, bucket[1] = 0, bucket[2] = 1 => 0,0,0,2
    foreach(ref k; sortedKeys)
    {
       auto b = buckets[cast(ubyte)k];

       if (b)
       {
          result[currentIdx .. currentIdx + b] = cast(T)k;
          currentIdx += b;
          if (currentIdx >= data.length) return result;
       }
    }

    assert(0);
}


Maybe those implementations could be useful to improve phobos 
sorting?

Andrea
Apr 06 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs
Can you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
Apr 06 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
 On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana 
 wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs
Can you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
A simple test just written: Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; boolSort(arr); total += sw.peek.to!Duration; sw.stop; } andrea ububocs:/tmp$ ./sort 0 hnsecs I don't think compiler can remove a random generated array... I think times are too small to be measured with a good accuracy, using start() stop() to resume stopwatch seems to add a (fixed) overhead (some microsecs over the 10000 cycles) that doesn't depend on size of array tested. Anyway, it doesn't matter if it is 0 hnsec or some microsecs, in my opinion. It's still faster and faster than original sort (of course it's n vs n*log(n)). Here the same code with "sort(arr)" instead of "boolSort(arr)": andrea ububocs:/tmp$ ./sort 10 secs, 620 ms, 414 μs, and 2 hnsecs Andrea
Apr 07 2016
next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
 On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
 On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana 
 wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && 
 ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs
Can you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
A simple test just written: Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; boolSort(arr); total += sw.peek.to!Duration; sw.stop; } andrea ububocs:/tmp$ ./sort 0 hnsecs I don't think compiler can remove a random generated array...
But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work. Laborious approach that defeats the optimisations you don't want while keeping the ones you do: % cat modA.d float[] codeToBenchmark(int someParam, float[] someOtherParam) { /* blah blah code */ } % cat modB.d // Do not import modA here auto codeToBenchmark(int, float[]); void main() { // start loop and timers: codeToBenchmark(/*blah params */); // end timers and loop } % ldc2 -c <optimisation options here> modA.d % ldc2 <opt options, less important> modA.o modB.d % ./modB <reliable results come out here> I really need to write an article about bencharking in D...
Apr 07 2016
next sibling parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.
But it should remove result if I replace boolSort() with sort() too, instead it take 10 seconds to run.
Apr 07 2016
parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 7 April 2016 at 08:33:40 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.
But it should remove result if I replace boolSort() with sort() too, instead it take 10 seconds to run.
Not necessarily. It totally depends on the implementation details and the exact way the optimiser works. It might be interesting and informative for you to explore exactly why a particular version of a particular compiler with particular compilation flags will inline and elide one sort function but not another, but I would recommend just not letting the compiler see the source code of the benchmark and the sorting at the same time*, then you know neither will be inlined and also no extra attributes will be inferred and unrealistically taken advantage of. *hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.
Apr 07 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file and 
 then compiling the other and linking them, without ever 
 importing one from the other.
If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
Apr 07 2016
next sibling parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.
If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
What about this? Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; auto t = boolSort(arr); sw.stop; size_t sum = 0; foreach(x; t.map!(x => x?1:0)) sum+=x; writeln(sum); } writeln(total); This outputs: 32784 ... ... ... 32819 32648 32649 32716 32972 0 hnsecs
Apr 07 2016
next sibling parent Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:
 What about this?

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
 cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       auto t = boolSort(arr);
Missed total+=...
       sw.stop;
       size_t sum = 0;
       foreach(x; t.map!(x => x?1:0)) sum+=x;
       writeln(sum);
    }
whoops i missed a line there. It's 300ms but i wonder if test is valid. Anyway, that's not the point of my original post :)
Apr 07 2016
prev sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:
 What about this?

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
 cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       auto t = boolSort(arr);
       sw.stop;
       size_t sum = 0;
       foreach(x; t.map!(x => x?1:0)) sum+=x;
       writeln(sum);
    }

    writeln(total);

 This outputs:
 32784
 ...
 ...
 ...
 32819
 32648
 32649
 32716
 32972
 0 hnsecs
A time of "zero" means the benchmark is broken. In this case, you forgot to actually add the stopwatch time to the `total` variable. You don't need to print the sum every time; just accumulate it in a variable declared outside the loop and print it once at the end (like with `total`). See my benchmark code that I posted a few minutes ago.
Apr 07 2016
prev sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.
If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
You could create a dummy function (or even the real function, just with a different name) that creates the same type and use typeof(myDummyFunction(myDummyArgs)) when making the definition.
Apr 07 2016
parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 7 April 2016 at 09:25:46 UTC, John Colvin wrote:
 On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.
If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
You could create a dummy function (or even the real function, just with a different name) that creates the same type and use typeof(myDummyFunction(myDummyArgs)) when making the definition.
Correction, use http://dlang.org/phobos/std_traits.html#ReturnType instead of all that typeof mess.
Apr 07 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

 Laborious approach that defeats the optimisations you don't 
 want while keeping the ones you do:
It's easier to just output the result in some form. I typically calculate and display a checksum of some sort.
Apr 07 2016
parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 7 April 2016 at 09:01:14 UTC, tsbockman wrote:
 On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

 Laborious approach that defeats the optimisations you don't 
 want while keeping the ones you do:
It's easier to just output the result in some form. I typically calculate and display a checksum of some sort.
Take care with this approach. For example, calling a pure function (whether D pure or some optimiser inferred sort of purity) repeatedly in a loop can sometimes cause the loop to be reduced to a single iteration (doesn't apply in this case, because of randomly generating data each iteration).
Apr 07 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
 I don't think compiler can remove a random generated array...
 I think times are too small to be measured with a good 
 accuracy, using start() stop() to resume stopwatch seems to add 
 a (fixed) overhead (some microsecs over the 10000 cycles) that 
 doesn't depend on size of array tested.
`boolSort(arr)` has no side effects, and in your benchmark the return value is discarded. This means that the compiler is free to simply skip that function call entirely. Because of this, with warnings-as-errors your code won't even compile: "Warning: calling app.boolSort!"a<b".boolSort without side effects discards return value of type Result, prepend a cast(void) if intentional"
 Anyway, it doesn't matter if it is 0 hnsec or some microsecs, 
 in my opinion. It's still faster and faster than original sort 
 (of course it's n vs n*log(n)).
A time of "zero" means that your benchmark is broken, and tells you nothing about how fast your code actually is. Here's a less broken benchmark (it's still got lots of room for improvement): module app; import std.algorithm, std.conv, std.range, std.random, std.stdio; auto boolSort(alias less = "a<b")(in bool[] data) { import std.array : uninitializedArray; import std.functional : binaryFun; // Check if order is true/false or false/true immutable test = binaryFun!less(true, false); size_t count; foreach(ref x; data) if (x == test) count++; return chain(repeat(test,count), repeat(!test, data.length - count)); } void main() { import std.datetime : Duration, StopWatch; Duration boolSortDur, stdSortDur; ulong boolSortKS, stdSortKS; foreach(_; 0..10_000) { auto arr = generate!( () => uniform(0,2)) .map!(x => cast(bool)x) .take(65_536) .array; { StopWatch sw; sw.start; auto sorted = boolSort(arr); sw.stop; boolSortDur += sw.peek.to!Duration; /* Scan the sorted result so that the compiler doesn't elide the boolSort(arr) call: */ foreach(bool b; sorted) boolSortKS += b; } { StopWatch sw; sw.start; auto sorted = sort(arr); sw.stop; stdSortDur += sw.peek.to!Duration; /* Scan the sorted result so that the compiler doesn't elide the sort(arr) call: */ foreach(bool b; sorted) stdSortKS += b; } } /* Make some I/O conditional upon the sorted results to establish a direct causal chain between the code being benchmarked, and something that we know cannot ever be removed by the optimizer: */ if(boolSortKS != stdSortKS) writeln("Keep sum mismatch!"); writefln("boolSort(): %s", boolSortDur); writefln("std sort(): %s", stdSortDur); } Results on my 3.2GHz Haswell 64-bit Linux system: DMD 64-bit: boolSort(): 2 secs, 886 ms, 915 μs, and 5 hnsecs std sort(): 15 secs, 135 ms, 900 μs, and 8 hnsecs DMD 32-bit: boolSort(): 2 secs, 827 ms, 13 μs, and 6 hnsecs std sort(): 16 secs, 668 ms, 900 μs, and 2 hnsecs LDC 64-bit: boolSort(): 369 ms, 590 μs, and 6 hnsecs std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs So, your boolSort() is faster as expected - but not infinitely so. Andrei gave a lecture on optimization a while back: http://forum.dlang.org/thread/n6odlg$j2n$1 digitalmars.com Among the key points he made was basically, "Never assume: measure. And, measuring correctly is harder than it looks."
Apr 07 2016
parent Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 08:48:41 UTC, tsbockman wrote:
 LDC 64-bit:
    boolSort(): 369 ms, 590 μs, and 6 hnsecs
    std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs
 So, your boolSort() is faster as expected - but not infinitely 
 so.
Yes point of my original discussion wasn't the obviusly wrong mesure but the greate gain over standard sort() 369ms vs 13 secs is still a lot (35x: and it increase more and more for bigger length... it's linear vs nlogn)
Apr 07 2016