www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - A few experiments with partial unrolling

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and 
thought I'd experiment a bit with a generic unrolling routine (code at 
the end of this). So I wrote a general parameterized unroll function 
that accepts a function, a number of times to unroll, and a 
random-access range. The idea was to experiment with unroll and see how 
it works for sums. I also wrote two baselines - a brute force one and 
one that unrolls but is specialized for addition.

The results are interesting. First, partial unrolling does help a lot - 
the brute force baseline takes 1.2s on my machine and unrollSum takes 
only 0.2s.

Second, there is an issue with inlining: unroll takes 0.37s, almost 
twice as much as the hand-unrolled version.

Third, it looks like larger unrolling limits is better - I only got a 
plateau at 128!


Andrei

#!/usr/bin/env rdmd
import std.algorithm, std.conv, std.date, std.functional, std.range, 
std.stdio, std.string, std.traits;

ElementType!R unroll(alias fun, size_t times, R)(R r) if 
(isRandomAccessRange!R && hasLength!R)
{
     static assert(times > 1 && !(times & (times - 1)));

     static string repeat(string s, string indexPlaceholder, size_t n)
     {
         string r = "";
         foreach (i; 0 .. n)
         {
             r ~= replace(s, indexPlaceholder, .to!string(i));
         }
         return r;
     }

     alias binaryFun!fun f;
     typeof(return) result = 0;
     immutable adjusted = r.length & ~cast(size_t) (times - 1);
     size_t i = 0;
     for (; i < adjusted; i += times)
     {
         ElementType!R[times - 1] temp = void;
         mixin(repeat("temp[X] = f(r[i + (X + X)], r[i + (X + X + 
1)]);", "X", times / 2));
         mixin(repeat("temp[X + times / 2] = f(temp[X + X], temp[X + X + 
1]);", "X", times / 2 - 1));
         result = f(result, temp[$ - 1]);
     }
     for (; i != r.length; ++i)
     {
         result += r[i];
     }
     return result;
}

ElementType!R unrollSum(size_t times, R)(R r) if (isRandomAccessRange!R 
&& hasLength!R)
{
     static assert(times > 1 && !(times & (times - 1)));

     static string repeat(string s, string indexPlaceholder, size_t n)
     {
         string r = "";
         foreach (i; 0 .. n)
         {
             r ~= replace(s, indexPlaceholder, .to!string(i));
         }
         return r;
     }

     typeof(return) result = 0;
     immutable adjusted = r.length & ~cast(size_t) (times - 1);
     size_t i = 0;
     for (; i < adjusted; i += times)
     {
         ElementType!R[times - 1] temp = void;
         mixin(repeat("temp[X] = r[i + (X + X)] + r[i + (X + X + 1)];", 
"X", times / 2));
         mixin(repeat("temp[X + times / 2] = temp[X + X] + temp[X + X + 
1];", "X", times / 2 - 1));
         result += temp[$ - 1];
     }
     for (; i != r.length; ++i)
     {
         result += r[i];
     }
     return result;
}

ElementType!R sum0(R)(R r) if (isInputRange!R)
{
     typeof(return) result = 0;
     for (; !r.empty; r.popFront()) {
         result += r.front;
     }
     return result;
}

void main(string[] args)
{
     enum times = 8;
     auto a = array(iota(100_000_001));
     double t = getUTCtime();
     if (args.length == 1) {
         writeln("baseline");
         writeln(sum0(a));
     } else if (args[1] == "unroll") {
         writeln(args[1]);
         writeln(unroll!("a + b", times)(a));
     } else if (args[1] == "unrollSum") {
         writeln(args[1]);
         writeln(unrollSum!(times)(a));
     } else {
         assert(0);
     }
     writeln((getUTCtime() - t) / ticksPerSecond);
}
Dec 25 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

 Third, it looks like larger unrolling limits is better - I only got a 
 plateau at 128!

But this is true for a microbenchmark. In a real program the code half part of the CPU L1 cache is quite limited, so the more code you have to push through that little cache (code of different functions), the more cache misses you have, and this slows down the code. This is why too much unrolling or too much inlining is bad, and this is why I have unrolled my sum() only once. Bye, bearophile
Dec 25 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/25/10 3:07 PM, bearophile wrote:
 Andrei:

 Third, it looks like larger unrolling limits is better - I only got a
 plateau at 128!

But this is true for a microbenchmark. In a real program the code half part of the CPU L1 cache is quite limited, so the more code you have to push through that little cache (code of different functions), the more cache misses you have, and this slows down the code. This is why too much unrolling or too much inlining is bad, and this is why I have unrolled my sum() only once.

Yah, that's what I think unroll should be a generic function leaving it to the user to choose the parameters. Before that I'd like to generalize the function a bit more - right now it can only do reduce-type workloads on associative functions. Andrei
Dec 25 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/25/10 2:54 PM, Andrei Alexandrescu wrote:
 I was looking at http     b://d.puremagic.com/issues/show_bug.cgi?id=4725 and
 thought I'd experiment a bit with a generic unrolling routine (code at
 the end of this).

Ignore my previous numbers, I forgot -O -release -inline... with those, improvements are still considerable, from 220ms to 72ms, and unroll does as well as unrollSum. Andrei
Dec 25 2010