## digitalmars.D - Faster sort?

• Andrea Fontana (90/90) Apr 06 2016 I did a couple of tries using a "specialized" implementation of
• tsbockman (5/14) Apr 06 2016 Can you share the benchmark code?
• Andrea Fontana (27/45) Apr 07 2016 A simple test just written:
• John Colvin (27/64) Apr 07 2016 But it definitely can eliminate an unused result. My prediction:
• Andrea Fontana (3/8) Apr 07 2016 But it should remove result if I replace boolSort() with sort()
• John Colvin (14/22) Apr 07 2016 Not necessarily. It totally depends on the implementation details
• Andrea Fontana (5/8) Apr 07 2016 If i move boolSort() on another module but I can't use auto as
• Andrea Fontana (4/18) Apr 07 2016 whoops i missed a line there. It's 300ms but i wonder if test is
• tsbockman (7/33) Apr 07 2016 A time of "zero" means the benchmark is broken. In this case, you
• John Colvin (4/12) Apr 07 2016 You could create a dummy function (or even the real function,
• tsbockman (3/10) Apr 07 2016 It's easier to just output the result in some form. I typically
• John Colvin (6/17) Apr 07 2016 Take care with this approach. For example, calling a pure
• tsbockman (85/93) Apr 07 2016 `boolSort(arr)` has no side effects, and in your benchmark the
• Andrea Fontana (5/10) Apr 07 2016 Yes point of my original discussion wasn't the obviusly wrong
Andrea Fontana <nospam example.com> writes:
```I did a couple of tries using a "specialized" implementation of
sort!less.

For example using a counting sort for bool seems a good idea.
This code:

auto boolSort(alias less = "a<b")(in bool[] data)
{
import std.array        : uninitializedArray;
import std.functional   : binaryFun;

// Check if order is true/false or false/true
immutable test = binaryFun!less(true, false);
size_t count;

foreach(ref x; data)
if (x == test) count++;

return chain(repeat(test,count), repeat(!test, data.length -
count));
}

Count number of false (or true, it depends on result of "less"),
and then chain them.

I did a test over 10000 randomized array of 2,4,8,16...65536
elements.

rdmd -O -release -noboundscheck -inline sort.d :

2 inputs:
Faster: 229 μs and 9 hnsecs
Phobos: 215 μs and 5 hnsecs

...

64 inputs:
Faster: 513 μs and 4 hnsecs
Phobos: 2 ms, 143 μs, and 5 hnsecs

...

65536 inputs:
Faster: 2 secs, 700 ms, and 696 μs
Phobos: 6 secs, 835 ms, and 319 μs

Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

The same thing works for byte[] and ubyte[] array using a few
tricks (eg: using counting sort and then sorting only keys).

So for ubyte[]/byte[] (ldc):

2 inputs:
1 ms, 305 μs, and 4 hnsecs
31 μs and 4 hnsecs

...

256 inputs:
8 ms, 76 μs, and 3 hnsecs
8 ms, 807 μs, and 3 hnsecs

...

65536 inputs:
347 ms, 330 μs, and 9 hnsecs
14 secs, 214 ms, 66 μs, and 5 hnsecs

Anyway implementation for byte/ubyte should be improved as long
as it probably won't work fine if less is a functor or delegate:

auto byteSort(alias less = "a<b", T)(in T[] data)
if (is(T == byte) || is(T == ubyte))
{
import std.range     : iota;
import std.algorithm : map;
import std.array     : uninitializedArray, array;

// It needs 256*size_t.sizeof bytes. Maybe not suitables for
embedded platforms?
size_t[256] buckets;

foreach(ref x; data)
buckets[cast(ubyte)x]++;

// Sort 256 keys using less function at compile time (less
should work at compile time)
static immutable sortedKeys = iota(0, 256).map!(x =>
cast(T)x).array.sort!less.array;

auto result = uninitializedArray!(T[])(data.length);
size_t currentIdx = 0;

// bucket[0] = 3, bucket[1] = 0, bucket[2] = 1 => 0,0,0,2
foreach(ref k; sortedKeys)
{
auto b = buckets[cast(ubyte)k];

if (b)
{
result[currentIdx .. currentIdx + b] = cast(T)k;
currentIdx += b;
if (currentIdx >= data.length) return result;
}
}

assert(0);
}

Maybe those implementations could be useful to improve phobos
sorting?

Andrea
```
Apr 06 2016
tsbockman <thomas.bockman gmail.com> writes:
```On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:
Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

Can you share the benchmark code?

"0 hnsecs" results generally mean that your test was too simple
and/or didn't have any obvious side effects, and so the optimizer
just removed it completely.
```
Apr 06 2016
Andrea Fontana <nospam example.com> writes:
```On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana
wrote:
Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

Can you share the benchmark code?

"0 hnsecs" results generally mean that your test was too simple
and/or didn't have any obvious side effects, and so the
optimizer just removed it completely.

A simple test just written:

Duration total;

foreach(_; 0..10000)
{
auto arr = generate!( () => uniform(0,2)).map!(x =>
cast(bool)x).take(65536).array;
StopWatch sw;
sw.start;
boolSort(arr);
total += sw.peek.to!Duration;
sw.stop;

}

andrea ububocs:/tmp\$ ./sort
0 hnsecs

I don't think compiler can remove a random generated array...
I think times are too small to be measured with a good accuracy,
using start() stop() to resume stopwatch seems to add a (fixed)
overhead (some microsecs over the 10000 cycles) that doesn't
depend on size of array tested.
Anyway, it doesn't matter if it is 0 hnsec or some microsecs, in
my opinion. It's still faster and faster than original sort (of
course it's n vs n*log(n)). Here the same code with "sort(arr)"

andrea ububocs:/tmp\$ ./sort
10 secs, 620 ms, 414 μs, and 2 hnsecs

Andrea
```
Apr 07 2016
John Colvin <john.loughran.colvin gmail.com> writes:
```On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:
On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana
wrote:
Using ldmd2 -O -release -noboundscheck -inline sort.d &&
./sort

2 inputs:
Faster: 0 hnsecs
Phobos: 33 μs and 5 hnsecs

...

65536 inputs:
Faster: 0 hnsecs (???)
Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs

Can you share the benchmark code?

"0 hnsecs" results generally mean that your test was too
simple and/or didn't have any obvious side effects, and so the
optimizer just removed it completely.

A simple test just written:

Duration total;

foreach(_; 0..10000)
{
auto arr = generate!( () => uniform(0,2)).map!(x =>
cast(bool)x).take(65536).array;
StopWatch sw;
sw.start;
boolSort(arr);
total += sw.peek.to!Duration;
sw.stop;

}

andrea ububocs:/tmp\$ ./sort
0 hnsecs

I don't think compiler can remove a random generated array...

But it definitely can eliminate an unused result. My prediction:
you took an array and sorted it, then did nothing with the
result, so it rightly concluded that there was no point doing the
sort. In any given case the compiler could be removing some or
all of the work.

Laborious approach that defeats the optimisations you don't want
while keeping the ones you do:

% cat modA.d
float[] codeToBenchmark(int someParam, float[] someOtherParam)
{
/* blah blah code */
}

% cat modB.d
// Do not import modA here

auto codeToBenchmark(int, float[]);

void main()
{
// start loop and timers:
codeToBenchmark(/*blah params */);
// end timers and loop
}
% ldc2 -c <optimisation options here> modA.d
% ldc2 <opt options, less important> modA.o modB.d
% ./modB
<reliable results come out here>

I really need to write an article about bencharking in D...
```
Apr 07 2016
Andrea Fontana <nospam example.com> writes:
```On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
But it definitely can eliminate an unused result. My
prediction: you took an array and sorted it, then did nothing
with the result, so it rightly concluded that there was no
point doing the sort. In any given case the compiler could be
removing some or all of the work.

But it should remove result if I replace boolSort() with sort()
too, instead it take 10 seconds to run.
```
Apr 07 2016
John Colvin <john.loughran.colvin gmail.com> writes:
```On Thursday, 7 April 2016 at 08:33:40 UTC, Andrea Fontana wrote:
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
But it definitely can eliminate an unused result. My
prediction: you took an array and sorted it, then did nothing
with the result, so it rightly concluded that there was no
point doing the sort. In any given case the compiler could be
removing some or all of the work.

But it should remove result if I replace boolSort() with sort()
too, instead it take 10 seconds to run.

Not necessarily. It totally depends on the implementation details
and the exact way the optimiser works. It might be interesting
and informative for you to explore exactly why a particular
version of a particular compiler with particular compilation
flags will inline and elide one sort function but not another,
but I would recommend just not letting the compiler see the
source code of the benchmark and the sorting at the same time*,
then you know neither will be inlined and also no extra
attributes will be inferred and unrealistically taken advantage
of.

*hench my example of compiling one module to an object file and
then compiling the other and linking them, without ever importing
one from the other.
```
Apr 07 2016
Andrea Fontana <nospam example.com> writes:
```On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
*hench my example of compiling one module to an object file and
then compiling the other and linking them, without ever
importing one from the other.

If i move boolSort() on another module but I can't use auto as
return type.
Return type is a "voldemort" type returned by std.range.chain.

How can I define it?
```
Apr 07 2016
Andrea Fontana <nospam example.com> writes:
```On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
*hench my example of compiling one module to an object file
and then compiling the other and linking them, without ever
importing one from the other.

If i move boolSort() on another module but I can't use auto as
return type.
Return type is a "voldemort" type returned by std.range.chain.

How can I define it?

Duration total;

foreach(_; 0..10000)
{
auto arr = generate!( () => uniform(0,2)).map!(x =>
cast(bool)x).take(65536).array;
StopWatch sw;
sw.start;
auto t = boolSort(arr);
sw.stop;
size_t sum = 0;
foreach(x; t.map!(x => x?1:0)) sum+=x;
writeln(sum);
}

writeln(total);

This outputs:
32784
...
...
...
32819
32648
32649
32716
32972
0 hnsecs
```
Apr 07 2016
Andrea Fontana <nospam example.com> writes:
```On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:

Duration total;

foreach(_; 0..10000)
{
auto arr = generate!( () => uniform(0,2)).map!(x =>
cast(bool)x).take(65536).array;
StopWatch sw;
sw.start;
auto t = boolSort(arr);

Missed total+=...

sw.stop;
size_t sum = 0;
foreach(x; t.map!(x => x?1:0)) sum+=x;
writeln(sum);
}

whoops i missed a line there. It's 300ms but i wonder if test is
valid. Anyway, that's not the point of my original post :)
```
Apr 07 2016
tsbockman <thomas.bockman gmail.com> writes:
```On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:

Duration total;

foreach(_; 0..10000)
{
auto arr = generate!( () => uniform(0,2)).map!(x =>
cast(bool)x).take(65536).array;
StopWatch sw;
sw.start;
auto t = boolSort(arr);
sw.stop;
size_t sum = 0;
foreach(x; t.map!(x => x?1:0)) sum+=x;
writeln(sum);
}

writeln(total);

This outputs:
32784
...
...
...
32819
32648
32649
32716
32972
0 hnsecs

A time of "zero" means the benchmark is broken. In this case, you
forgot to actually add the stopwatch time to the `total` variable.

You don't need to print the sum every time; just accumulate it in
a variable declared outside the loop and print it once at the end
(like with `total`). See my benchmark code that I posted a few
minutes ago.
```
Apr 07 2016
John Colvin <john.loughran.colvin gmail.com> writes:
```On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
*hench my example of compiling one module to an object file
and then compiling the other and linking them, without ever
importing one from the other.

If i move boolSort() on another module but I can't use auto as
return type.
Return type is a "voldemort" type returned by std.range.chain.

How can I define it?

You could create a dummy function (or even the real function,
just with a different name) that creates the same type and use
typeof(myDummyFunction(myDummyArgs)) when making the definition.
```
Apr 07 2016
John Colvin <john.loughran.colvin gmail.com> writes:
```On Thursday, 7 April 2016 at 09:25:46 UTC, John Colvin wrote:
On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:
On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:
*hench my example of compiling one module to an object file
and then compiling the other and linking them, without ever
importing one from the other.

If i move boolSort() on another module but I can't use auto as
return type.
Return type is a "voldemort" type returned by std.range.chain.

How can I define it?

You could create a dummy function (or even the real function,
just with a different name) that creates the same type and use
typeof(myDummyFunction(myDummyArgs)) when making the definition.

Correction, use
that typeof mess.
```
Apr 07 2016
tsbockman <thomas.bockman gmail.com> writes:
```On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
But it definitely can eliminate an unused result. My
prediction: you took an array and sorted it, then did nothing
with the result, so it rightly concluded that there was no
point doing the sort. In any given case the compiler could be
removing some or all of the work.

Laborious approach that defeats the optimisations you don't
want while keeping the ones you do:

It's easier to just output the result in some form. I typically
calculate and display a checksum of some sort.
```
Apr 07 2016
John Colvin <john.loughran.colvin gmail.com> writes:
```On Thursday, 7 April 2016 at 09:01:14 UTC, tsbockman wrote:
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:
But it definitely can eliminate an unused result. My
prediction: you took an array and sorted it, then did nothing
with the result, so it rightly concluded that there was no
point doing the sort. In any given case the compiler could be
removing some or all of the work.

Laborious approach that defeats the optimisations you don't
want while keeping the ones you do:

It's easier to just output the result in some form. I typically
calculate and display a checksum of some sort.

Take care with this approach. For example, calling a pure
function (whether D pure or some optimiser inferred sort of
purity) repeatedly in a loop can sometimes cause the loop to be
reduced to a single iteration (doesn't apply in this case,
because of randomly generating data each iteration).
```
Apr 07 2016
tsbockman <thomas.bockman gmail.com> writes:
```On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:
I don't think compiler can remove a random generated array...
I think times are too small to be measured with a good
accuracy, using start() stop() to resume stopwatch seems to add
a (fixed) overhead (some microsecs over the 10000 cycles) that
doesn't depend on size of array tested.

`boolSort(arr)` has no side effects, and in your benchmark the
return value is discarded. This means that the compiler is free
to simply skip that function call entirely.

Because of this, with warnings-as-errors your code won't even
compile: "Warning: calling app.boolSort!"a<b".boolSort without
side effects discards return value of type Result, prepend a
cast(void) if intentional"

Anyway, it doesn't matter if it is 0 hnsec or some microsecs,
in my opinion. It's still faster and faster than original sort
(of course it's n vs n*log(n)).

A time of "zero" means that your benchmark is broken, and tells

Here's a less broken benchmark (it's still got lots of room for
improvement):

module app;

import std.algorithm, std.conv, std.range, std.random, std.stdio;

auto boolSort(alias less = "a<b")(in bool[] data)
{
import std.array        : uninitializedArray;
import std.functional   : binaryFun;

// Check if order is true/false or false/true
immutable test = binaryFun!less(true, false);
size_t count;

foreach(ref x; data)
if (x == test) count++;

return chain(repeat(test,count), repeat(!test, data.length -
count));
}

void main()
{
import std.datetime : Duration, StopWatch;

Duration boolSortDur, stdSortDur;
ulong boolSortKS, stdSortKS;

foreach(_; 0..10_000)
{
auto arr =
generate!( () => uniform(0,2))
.map!(x => cast(bool)x)
.take(65_536)
.array;

{
StopWatch sw;
sw.start;
auto sorted = boolSort(arr);
sw.stop;
boolSortDur += sw.peek.to!Duration;

/* Scan the sorted result so that the compiler
doesn't elide the boolSort(arr) call: */
foreach(bool b; sorted)
boolSortKS += b;
}

{
StopWatch sw;
sw.start;
auto sorted = sort(arr);
sw.stop;
stdSortDur += sw.peek.to!Duration;

/* Scan the sorted result so that the compiler
doesn't elide the sort(arr) call: */
foreach(bool b; sorted)
stdSortKS += b;
}
}

/* Make some I/O conditional upon the sorted results
to establish a direct causal chain between the
code being benchmarked, and something that we
know cannot ever be removed by the optimizer: */
if(boolSortKS != stdSortKS)
writeln("Keep sum mismatch!");

writefln("boolSort(): %s", boolSortDur);
writefln("std sort(): %s", stdSortDur);
}

Results on my 3.2GHz Haswell 64-bit Linux system:

DMD 64-bit:
boolSort(): 2 secs, 886 ms, 915 μs, and 5 hnsecs
std sort(): 15 secs, 135 ms, 900 μs, and 8 hnsecs

DMD 32-bit:
boolSort(): 2 secs, 827 ms, 13 μs, and 6 hnsecs
std sort(): 16 secs, 668 ms, 900 μs, and 2 hnsecs

LDC 64-bit:
boolSort(): 369 ms, 590 μs, and 6 hnsecs
std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs

So, your boolSort() is faster as expected - but not infinitely so.

Andrei gave a lecture on optimization a while back:
Among the key points he made was basically, "Never assume:
measure. And, measuring correctly is harder than it looks."
```
Apr 07 2016
Andrea Fontana <nospam example.com> writes:
```On Thursday, 7 April 2016 at 08:48:41 UTC, tsbockman wrote:
LDC 64-bit:
boolSort(): 369 ms, 590 μs, and 6 hnsecs
std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs
So, your boolSort() is faster as expected - but not infinitely
so.

Yes point of my original discussion wasn't the obviusly wrong
mesure but the greate gain over standard sort()

369ms vs 13 secs is still a lot (35x: and it increase more and
more for bigger length... it's linear vs nlogn)
```
Apr 07 2016