digitalmars.D - Faster sort?

Andrea Fontana (90/90) Apr 06 2016 I did a couple of tries using a "specialized" implementation of

tsbockman (5/14) Apr 06 2016 Can you share the benchmark code?

Andrea Fontana (27/45) Apr 07 2016 A simple test just written:

John Colvin (27/64) Apr 07 2016 But it definitely can eliminate an unused result. My prediction:

Andrea Fontana (3/8) Apr 07 2016 But it should remove result if I replace boolSort() with sort()

John Colvin (14/22) Apr 07 2016 Not necessarily. It totally depends on the implementation details

Andrea Fontana (5/8) Apr 07 2016 If i move boolSort() on another module but I can't use auto as

Andrea Fontana (27/35) Apr 07 2016 What about this?

Andrea Fontana (4/18) Apr 07 2016 whoops i missed a line there. It's 300ms but i wonder if test is
tsbockman (7/33) Apr 07 2016 A time of "zero" means the benchmark is broken. In this case, you

John Colvin (4/12) Apr 07 2016 You could create a dummy function (or even the real function,

John Colvin (4/18) Apr 07 2016 Correction, use

tsbockman (3/10) Apr 07 2016 It's easier to just output the result in some form. I typically

John Colvin (6/17) Apr 07 2016 Take care with this approach. For example, calling a pure

tsbockman (85/93) Apr 07 2016 `boolSort(arr)` has no side effects, and in your benchmark the

Andrea Fontana (5/10) Apr 07 2016 Yes point of my original discussion wasn't the obviusly wrong

Andrea Fontana <nospam example.com> writes:

I did a couple of tries using a "specialized" implementation of 
sort!less.

For example using a counting sort for bool seems a good idea. 
This code:

auto boolSort(alias less = "a<b")(in bool[] data)
{
    import std.array        : uninitializedArray;
    import std.functional   : binaryFun;

    // Check if order is true/false or false/true
    immutable test = binaryFun!less(true, false);
    size_t count;

    foreach(ref x; data)
       if (x == test) count++;

    return chain(repeat(test,count), repeat(!test, data.length - 
count));
}

Count number of false (or true, it depends on result of "less"), 
and then chain them.

I did a test over 10000 randomized array of 2,4,8,16...65536 
elements.

rdmd -O -release -noboundscheck -inline sort.d :

2 inputs:
Faster: 229 μs and 9 hnsecs
Phobos: 215 μs and 5 hnsecs

...

64 inputs:
Faster: 513 μs and 4 hnsecs
Phobos: 2 ms, 143 μs, and 5 hnsecs

...

65536 inputs:
Faster: 2 secs, 700 ms, and 696 μs
Phobos: 6 secs, 835 ms, and 319 μs

Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort  
instead:

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs


The same thing works for byte[] and ubyte[] array using a few 
tricks (eg: using counting sort and then sorting only keys).

So for ubyte[]/byte[] (ldc):

2 inputs:
1 ms, 305 μs, and 4 hnsecs
31 μs and 4 hnsecs

...

256 inputs:
8 ms, 76 μs, and 3 hnsecs
8 ms, 807 μs, and 3 hnsecs

...

65536 inputs:
347 ms, 330 μs, and 9 hnsecs
14 secs, 214 ms, 66 μs, and 5 hnsecs

Anyway implementation for byte/ubyte should be improved as long 
as it probably won't work fine if less is a functor or delegate:

auto byteSort(alias less = "a<b", T)(in T[] data)
if (is(T == byte) || is(T == ubyte))
{
    import std.range     : iota;
    import std.algorithm : map;
    import std.array     : uninitializedArray, array;

    // It needs 256*size_t.sizeof bytes. Maybe not suitables for 
embedded platforms?
    size_t[256] buckets;

    foreach(ref x; data)
       buckets[cast(ubyte)x]++;

    // Sort 256 keys using less function at compile time (less 
should work at compile time)
    static immutable sortedKeys = iota(0, 256).map!(x => 
cast(T)x).array.sort!less.array;

    auto result = uninitializedArray!(T[])(data.length);
    size_t currentIdx = 0;

    // bucket[0] = 3, bucket[1] = 0, bucket[2] = 1 => 0,0,0,2
    foreach(ref k; sortedKeys)
    {
       auto b = buckets[cast(ubyte)k];

       if (b)
       {
          result[currentIdx .. currentIdx + b] = cast(T)k;
          currentIdx += b;
          if (currentIdx >= data.length) return result;
       }
    }

    assert(0);
}


Maybe those implementations could be useful to improve phobos 
sorting?

Andrea

Apr 06 2016

tsbockman <thomas.bockman gmail.com> writes:

On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

Can you share the benchmark code?

"0 hnsecs" results generally mean that your test was too simple 
and/or didn't have any obvious side effects, and so the optimizer 
just removed it completely.

Apr 06 2016

Andrea Fontana <nospam example.com> writes:

On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
 On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana 
 wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

 Can you share the benchmark code?

 "0 hnsecs" results generally mean that your test was too simple 
 and/or didn't have any obvious side effects, and so the 
 optimizer just removed it completely.

A simple test just written:

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       boolSort(arr);
       total += sw.peek.to!Duration;
       sw.stop;

    }


andrea ububocs:/tmp$ ./sort
0 hnsecs

I don't think compiler can remove a random generated array...
I think times are too small to be measured with a good accuracy, 
using start() stop() to resume stopwatch seems to add a (fixed) 
overhead (some microsecs over the 10000 cycles) that doesn't 
depend on size of array tested.
Anyway, it doesn't matter if it is 0 hnsec or some microsecs, in 
my opinion. It's still faster and faster than original sort (of 
course it's n vs n*log(n)). Here the same code with "sort(arr)" 
instead of "boolSort(arr)":

andrea ububocs:/tmp$ ./sort
10 secs, 620 ms, 414 μs, and 2 hnsecs

Andrea

Apr 07 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
 On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
 On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana 
 wrote:
 Using ldmd2 -O -release -noboundscheck -inline sort.d && 
 ./sort
  instead:

 2 inputs:
 Faster: 0 hnsecs
 Phobos: 33 μs and 5 hnsecs

 ...

 65536 inputs:
 Faster: 0 hnsecs (???)
 Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

 Can you share the benchmark code?

 "0 hnsecs" results generally mean that your test was too 
 simple and/or didn't have any obvious side effects, and so the 
 optimizer just removed it completely.

 A simple test just written:

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
 cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       boolSort(arr);
       total += sw.peek.to!Duration;
       sw.stop;

    }


 andrea ububocs:/tmp$ ./sort
 0 hnsecs

 I don't think compiler can remove a random generated array...

But it definitely can eliminate an unused result. My prediction: 
you took an array and sorted it, then did nothing with the 
result, so it rightly concluded that there was no point doing the 
sort. In any given case the compiler could be removing some or 
all of the work.

Laborious approach that defeats the optimisations you don't want 
while keeping the ones you do:

% cat modA.d
float[] codeToBenchmark(int someParam, float[] someOtherParam)
{
     /* blah blah code */
}

% cat modB.d
// Do not import modA here

auto codeToBenchmark(int, float[]);

void main()
{
     // start loop and timers:
         codeToBenchmark(/*blah params */);
     // end timers and loop
}
% ldc2 -c <optimisation options here> modA.d
% ldc2 <opt options, less important> modA.o modB.d
% ./modB
<reliable results come out here>


I really need to write an article about bencharking in D...

Apr 07 2016

Andrea Fontana <nospam example.com> writes:

On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

But it should remove result if I replace boolSort() with sort() 
too, instead it take 10 seconds to run.

Apr 07 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 7 April 2016 at 08:33:40 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

 But it should remove result if I replace boolSort() with sort() 
 too, instead it take 10 seconds to run.

Not necessarily. It totally depends on the implementation details 
and the exact way the optimiser works. It might be interesting 
and informative for you to explore exactly why a particular 
version of a particular compiler with particular compilation 
flags will inline and elide one sort function but not another, 
but I would recommend just not letting the compiler see the 
source code of the benchmark and the sorting at the same time*, 
then you know neither will be inlined and also no extra 
attributes will be inferred and unrealistically taken advantage 
of.

*hench my example of compiling one module to an object file and 
then compiling the other and linking them, without ever importing 
one from the other.

Apr 07 2016

Andrea Fontana <nospam example.com> writes:

On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file and 
 then compiling the other and linking them, without ever 
 importing one from the other.

If i move boolSort() on another module but I can't use auto as 
return type.
Return type is a "voldemort" type returned by std.range.chain.

How can I define it?

Apr 07 2016

Andrea Fontana <nospam example.com> writes:

On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.

 If i move boolSort() on another module but I can't use auto as 
 return type.
 Return type is a "voldemort" type returned by std.range.chain.

 How can I define it?

What about this?

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       auto t = boolSort(arr);
       sw.stop;
       size_t sum = 0;
       foreach(x; t.map!(x => x?1:0)) sum+=x;
       writeln(sum);
    }

    writeln(total);

This outputs:
32784
...
...
...
32819
32648
32649
32716
32972
0 hnsecs

Apr 07 2016

Andrea Fontana <nospam example.com> writes:

On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:
 What about this?

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
 cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       auto t = boolSort(arr);

Missed total+=...

       sw.stop;
       size_t sum = 0;
       foreach(x; t.map!(x => x?1:0)) sum+=x;
       writeln(sum);
    }


whoops i missed a line there. It's 300ms but i wonder if test is 
valid. Anyway, that's not the point of my original post :)

Apr 07 2016

tsbockman <thomas.bockman gmail.com> writes:

On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:
 What about this?

    Duration total;

    foreach(_; 0..10000)
    {
       auto arr = generate!( () => uniform(0,2)).map!(x => 
 cast(bool)x).take(65536).array;
       StopWatch sw;
       sw.start;
       auto t = boolSort(arr);
       sw.stop;
       size_t sum = 0;
       foreach(x; t.map!(x => x?1:0)) sum+=x;
       writeln(sum);
    }

    writeln(total);

 This outputs:
 32784
 ...
 ...
 ...
 32819
 32648
 32649
 32716
 32972
 0 hnsecs

A time of "zero" means the benchmark is broken. In this case, you 
forgot to actually add the stopwatch time to the `total` variable.

You don't need to print the sum every time; just accumulate it in 
a variable declared outside the loop and print it once at the end 
(like with `total`). See my benchmark code that I posted a few 
minutes ago.

Apr 07 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.

 If i move boolSort() on another module but I can't use auto as 
 return type.
 Return type is a "voldemort" type returned by std.range.chain.

 How can I define it?

You could create a dummy function (or even the real function, 
just with a different name) that creates the same type and use 
typeof(myDummyFunction(myDummyArgs)) when making the definition.

Apr 07 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 7 April 2016 at 09:25:46 UTC, John Colvin wrote:
 On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
 *hench my example of compiling one module to an object file 
 and then compiling the other and linking them, without ever 
 importing one from the other.

 If i move boolSort() on another module but I can't use auto as 
 return type.
 Return type is a "voldemort" type returned by std.range.chain.

 How can I define it?

 You could create a dummy function (or even the real function, 
 just with a different name) that creates the same type and use 
 typeof(myDummyFunction(myDummyArgs)) when making the definition.

Correction, use 
http://dlang.org/phobos/std_traits.html#ReturnType instead of all 
that typeof mess.

Apr 07 2016

tsbockman <thomas.bockman gmail.com> writes:

On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

 Laborious approach that defeats the optimisations you don't 
 want while keeping the ones you do:

It's easier to just output the result in some form. I typically 
calculate and display a checksum of some sort.

Apr 07 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 7 April 2016 at 09:01:14 UTC, tsbockman wrote:
 On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
 But it definitely can eliminate an unused result. My 
 prediction: you took an array and sorted it, then did nothing 
 with the result, so it rightly concluded that there was no 
 point doing the sort. In any given case the compiler could be 
 removing some or all of the work.

 Laborious approach that defeats the optimisations you don't 
 want while keeping the ones you do:

 It's easier to just output the result in some form. I typically 
 calculate and display a checksum of some sort.

Take care with this approach. For example, calling a pure 
function (whether D pure or some optimiser inferred sort of 
purity) repeatedly in a loop can sometimes cause the loop to be 
reduced to a single iteration (doesn't apply in this case, 
because of randomly generating data each iteration).

Apr 07 2016

tsbockman <thomas.bockman gmail.com> writes:

On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
 I don't think compiler can remove a random generated array...
 I think times are too small to be measured with a good 
 accuracy, using start() stop() to resume stopwatch seems to add 
 a (fixed) overhead (some microsecs over the 10000 cycles) that 
 doesn't depend on size of array tested.

`boolSort(arr)` has no side effects, and in your benchmark the 
return value is discarded. This means that the compiler is free 
to simply skip that function call entirely.

Because of this, with warnings-as-errors your code won't even 
compile: "Warning: calling app.boolSort!"a<b".boolSort without 
side effects discards return value of type Result, prepend a 
cast(void) if intentional"

 Anyway, it doesn't matter if it is 0 hnsec or some microsecs, 
 in my opinion. It's still faster and faster than original sort 
 (of course it's n vs n*log(n)).

A time of "zero" means that your benchmark is broken, and tells 
you nothing about how fast your code actually is.

Here's a less broken benchmark (it's still got lots of room for 
improvement):

module app;

import std.algorithm, std.conv, std.range, std.random, std.stdio;

auto boolSort(alias less = "a<b")(in bool[] data)
{
    import std.array        : uninitializedArray;
    import std.functional   : binaryFun;

    // Check if order is true/false or false/true
    immutable test = binaryFun!less(true, false);
    size_t count;

    foreach(ref x; data)
       if (x == test) count++;

    return chain(repeat(test,count), repeat(!test, data.length - 
count));
}

void main()
{
     import std.datetime : Duration, StopWatch;

     Duration boolSortDur, stdSortDur;
     ulong boolSortKS, stdSortKS;

     foreach(_; 0..10_000)
     {
         auto arr =
             generate!( () => uniform(0,2))
             .map!(x => cast(bool)x)
             .take(65_536)
             .array;

         {
             StopWatch sw;
             sw.start;
             auto sorted = boolSort(arr);
             sw.stop;
             boolSortDur += sw.peek.to!Duration;

             /* Scan the sorted result so that the compiler
                doesn't elide the boolSort(arr) call: */
             foreach(bool b; sorted)
                 boolSortKS += b;
         }

         {
             StopWatch sw;
             sw.start;
             auto sorted = sort(arr);
             sw.stop;
             stdSortDur += sw.peek.to!Duration;

             /* Scan the sorted result so that the compiler
                doesn't elide the sort(arr) call: */
             foreach(bool b; sorted)
                 stdSortKS += b;
         }
     }

     /* Make some I/O conditional upon the sorted results
        to establish a direct causal chain between the
        code being benchmarked, and something that we
        know cannot ever be removed by the optimizer: */
     if(boolSortKS != stdSortKS)
         writeln("Keep sum mismatch!");

     writefln("boolSort(): %s", boolSortDur);
     writefln("std sort(): %s", stdSortDur);
}

Results on my 3.2GHz Haswell 64-bit Linux system:

DMD 64-bit:
     boolSort(): 2 secs, 886 ms, 915 μs, and 5 hnsecs
     std sort(): 15 secs, 135 ms, 900 μs, and 8 hnsecs

DMD 32-bit:
     boolSort(): 2 secs, 827 ms, 13 μs, and 6 hnsecs
     std sort(): 16 secs, 668 ms, 900 μs, and 2 hnsecs

LDC 64-bit:
     boolSort(): 369 ms, 590 μs, and 6 hnsecs
     std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs

So, your boolSort() is faster as expected - but not infinitely so.

Andrei gave a lecture on optimization a while back:
     http://forum.dlang.org/thread/n6odlg$j2n$1 digitalmars.com
Among the key points he made was basically, "Never assume: 
measure. And, measuring correctly is harder than it looks."

Apr 07 2016

Andrea Fontana <nospam example.com> writes:

On Thursday, 7 April 2016 at 08:48:41 UTC, tsbockman wrote:
 LDC 64-bit:
    boolSort(): 369 ms, 590 μs, and 6 hnsecs
    std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs
 So, your boolSort() is faster as expected - but not infinitely 
 so.

Yes point of my original discussion wasn't the obviusly wrong 
mesure but the greate gain over standard sort()

369ms vs 13 secs is still a lot (35x: and it increase more and 
more for bigger length... it's linear vs nlogn)

Apr 07 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Faster sort?