digitalmars.D - A few experiments with partial unrolling

Andrei Alexandrescu (108/108) Dec 25 2010 I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and

bearophile (4/6) Dec 25 2010 But this is true for a microbenchmark. In a real program the code half p...

Andrei Alexandrescu (6/10) Dec 25 2010 Yah, that's what I think unroll should be a generic function leaving it

Andrei Alexandrescu (6/9) Dec 25 2010 [snip]

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and 
thought I'd experiment a bit with a generic unrolling routine (code at 
the end of this). So I wrote a general parameterized unroll function 
that accepts a function, a number of times to unroll, and a 
random-access range. The idea was to experiment with unroll and see how 
it works for sums. I also wrote two baselines - a brute force one and 
one that unrolls but is specialized for addition.

The results are interesting. First, partial unrolling does help a lot - 
the brute force baseline takes 1.2s on my machine and unrollSum takes 
only 0.2s.

Second, there is an issue with inlining: unroll takes 0.37s, almost 
twice as much as the hand-unrolled version.

Third, it looks like larger unrolling limits is better - I only got a 
plateau at 128!


Andrei


import std.algorithm, std.conv, std.date, std.functional, std.range, 
std.stdio, std.string, std.traits;

ElementType!R unroll(alias fun, size_t times, R)(R r) if 
(isRandomAccessRange!R && hasLength!R)
{
     static assert(times > 1 && !(times & (times - 1)));

     static string repeat(string s, string indexPlaceholder, size_t n)
     {
         string r = "";
         foreach (i; 0 .. n)
         {
             r ~= replace(s, indexPlaceholder, .to!string(i));
         }
         return r;
     }

     alias binaryFun!fun f;
     typeof(return) result = 0;
     immutable adjusted = r.length & ~cast(size_t) (times - 1);
     size_t i = 0;
     for (; i < adjusted; i += times)
     {
         ElementType!R[times - 1] temp = void;
         mixin(repeat("temp[X] = f(r[i + (X + X)], r[i + (X + X + 
1)]);", "X", times / 2));
         mixin(repeat("temp[X + times / 2] = f(temp[X + X], temp[X + X + 
1]);", "X", times / 2 - 1));
         result = f(result, temp[$ - 1]);
     }
     for (; i != r.length; ++i)
     {
         result += r[i];
     }
     return result;
}

ElementType!R unrollSum(size_t times, R)(R r) if (isRandomAccessRange!R 
&& hasLength!R)
{
     static assert(times > 1 && !(times & (times - 1)));

     static string repeat(string s, string indexPlaceholder, size_t n)
     {
         string r = "";
         foreach (i; 0 .. n)
         {
             r ~= replace(s, indexPlaceholder, .to!string(i));
         }
         return r;
     }

     typeof(return) result = 0;
     immutable adjusted = r.length & ~cast(size_t) (times - 1);
     size_t i = 0;
     for (; i < adjusted; i += times)
     {
         ElementType!R[times - 1] temp = void;
         mixin(repeat("temp[X] = r[i + (X + X)] + r[i + (X + X + 1)];", 
"X", times / 2));
         mixin(repeat("temp[X + times / 2] = temp[X + X] + temp[X + X + 
1];", "X", times / 2 - 1));
         result += temp[$ - 1];
     }
     for (; i != r.length; ++i)
     {
         result += r[i];
     }
     return result;
}

ElementType!R sum0(R)(R r) if (isInputRange!R)
{
     typeof(return) result = 0;
     for (; !r.empty; r.popFront()) {
         result += r.front;
     }
     return result;
}

void main(string[] args)
{
     enum times = 8;
     auto a = array(iota(100_000_001));
     double t = getUTCtime();
     if (args.length == 1) {
         writeln("baseline");
         writeln(sum0(a));
     } else if (args[1] == "unroll") {
         writeln(args[1]);
         writeln(unroll!("a + b", times)(a));
     } else if (args[1] == "unrollSum") {
         writeln(args[1]);
         writeln(unrollSum!(times)(a));
     } else {
         assert(0);
     }
     writeln((getUTCtime() - t) / ticksPerSecond);
}

Dec 25 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei:

 Third, it looks like larger unrolling limits is better - I only got a 
 plateau at 128!

But this is true for a microbenchmark. In a real program the code half part of
the CPU L1 cache is quite limited, so the more code you have to push through
that little cache (code of different  functions), the more cache misses you
have, and this slows down the code. This is why too much unrolling or too much
inlining is bad, and this is why I have unrolled my sum() only once.

Bye,
bearophile

Dec 25 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 12/25/10 3:07 PM, bearophile wrote:
 Andrei:

 Third, it looks like larger unrolling limits is better - I only got a
 plateau at 128!

 But this is true for a microbenchmark. In a real program the code half part of
the CPU L1 cache is quite limited, so the more code you have to push through
that little cache (code of different  functions), the more cache misses you
have, and this slows down the code. This is why too much unrolling or too much
inlining is bad, and this is why I have unrolled my sum() only once.

Yah, that's what I think unroll should be a generic function leaving it 
to the user to choose the parameters. Before that I'd like to generalize 
the function a bit more - right now it can only do reduce-type workloads 
on associative functions.

Andrei

Dec 25 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 12/25/10 2:54 PM, Andrei Alexandrescu wrote:
 I was looking at http     b://d.puremagic.com/issues/show_bug.cgi?id=4725 and
 thought I'd experiment a bit with a generic unrolling routine (code at
 the end of this).

[snip]

Ignore my previous numbers, I forgot -O -release -inline... with those, 
improvements are still considerable, from 220ms to 72ms, and unroll does 
as well as unrollSum.

Andrei

Dec 25 2010

D Programming

C/C++ Programming

Other

digitalmars.D - A few experiments with partial unrolling