
digitalmars.D.learn - Performance of ranges versus direct

"Frustrated" <c1514843 drdrb.com> writes:
Has anyone done any work on comparing the performance of ranges 
vs direct, straightforward code (hand-optimized along the same 
lines as the range implementation)?

e.g., how does filter compare to a simple loop over the elements 
that does the comparison itself? How does a chain of UFCS calls 
compare to doing the same in a simple loop?
Dec 10 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 11 Dec 2013 08:29:38 +0100
schrieb "Frustrated" <c1514843 drdrb.com>:

 
 Has anyone done any work on comparing the performance of ranges 
 vs using direct straightforward code(optimized in the sense of 
 the way ranges work but by hand).
 
 e.g., How does filter compare to a simple loop over the elements 
 and comparing it. How does a a chain of UFCS compare to doing the 
 same in a simple loop.
Just use hand-written loops if you are worried about this, or benchmark on a per-case basis. There is no one-size-fits-all answer.

Not all code executes a million times per second though, so feel free to use ranges now and then, where concise code is valued more highly than premature optimization. ;)

-- Marco
Dec 11 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 11 December 2013 at 07:29:39 UTC, Frustrated wrote:
 Has anyone done any work on comparing the performance of ranges 
 vs using direct straightforward code(optimized in the sense of 
 the way ranges work but by hand).

 e.g., How does filter compare to a simple loop over the 
 elements and comparing it. How does a a chain of UFCS compare 
 to doing the same in a simple loop.
In my very unscientific experiments: range-based, functional-style D code runs at about 75-100% of the speed of the equivalent explicit iterative version.

A lot of the performance loss is down to missed optimisations, in particular inlining. gdc/ldc are good at this; dmd not so good, so you'll likely see bigger performance gaps with dmd than with the other compilers.

General advice: use ranges, std.range and std.algorithm wherever possible. Profile the resulting code and write out the hotspots in an imperative style to see if you can squeeze out some extra speed.
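To make the comparison concrete, here is a small sketch of the two styles being discussed. The task (sum of squares of even numbers) and all names are mine, chosen only for illustration, not taken from any benchmark in this thread:

```d
import std.algorithm : filter, map, sum;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // Range/functional style: a lazy UFCS chain over std.algorithm.
    auto total1 = iota(0, 100)
        .filter!(a => a % 2 == 0)
        .map!(a => a * a)
        .sum;

    // Equivalent hand-written imperative loop.
    int total2 = 0;
    foreach (i; 0 .. 100)
    {
        if (i % 2 == 0)
            total2 += i * i;
    }

    assert(total1 == total2);
    writeln(total1); // prints 161700
}
```

Whether the two versions compile to comparable code depends on how well the compiler inlines the filter/map adapters, which is exactly the point made above.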
Dec 11 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/12/13 11:10, John Colvin wrote:
 A lot of the performance loss is down to missed optimisations, in particular
 inlining.
Simple example:

    foreach(i; iota(0, 10)) { ... }

should be as fast as

    foreach(i; 0 .. 10) { ... }

but isn't. I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.

[* If I recall right, it's achievable by special-casing iota when the increment is 1, but don't quote me on that.]
 General advice: use ranges, std.range and std.algorithm wherever possible.
 Profile the resulting code and write out the hotspots in an imperative style to
 see if you can squeeze out some extra speed.
Yes -- use ranges as much as possible because the resulting code will be so much easier to read and maintain. Switching to imperative style is only worth it if there's a speed problem (read: "The speed is unsatisfactory for your use case and/or inferior to rival programs doing the same thing.").
Dec 11 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 11 Dec 2013 17:40:39 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 On 11/12/13 11:10, John Colvin wrote:
 A lot of the performance loss is down to missed optimisations, in particular
 inlining.
 Simple example: foreach(i; iota(0, 10)) { ... } should be as fast as foreach(i; 0 .. 10) { ... } but isn't. I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.
As far as I remember it *is* the same speed using GDC or LDC. But as long as this issue remains with DMD, I'm confident that foreach_reverse(i; 0 .. 10) will stay :)

-- Marco
Dec 12 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton 
Wakeling wrote:
 [* If I recall right, it's achievable by special-casing iota 
 when the increment is 1, but don't quote me on that.]
That shouldn't be necessary if the iota operations are inlined (and if they're not, that is a much greater concern!). The increment is then known to be a particular compile-time constant, and code is generated appropriately.

Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't expect to see any difference at that level anyway.
Dec 12 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 12 Dec 2013 16:02:28 +0100
schrieb "John Colvin" <john.loughran.colvin gmail.com>:

 On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton 
 Wakeling wrote:
 [* If I recall right, it's achievable by special-casing iota 
 when the increment is 1, but don't quote me on that.]
 That shouldn't be necessary if the iota operations are inlined (and if they're not, that is a much greater concern!). The increment is then known to be a particular compile-time constant and code generated appropriately. Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't even expect to see any difference at that level.
This doesn't seem to be the general case...

http://stackoverflow.com/questions/12163610/why-inc-and-add-1-have-different-performances

"For the x86 architecture, INC updates only a subset of the condition codes, whereas ADD updates the entire set of condition codes. (Other architectures have different rules, so this discussion may or may not apply.) So an INC instruction must wait for other previous instructions that update the condition code bits to finish before it can modify that previous value to produce its final condition code result. ADD can produce the final condition code bits without regard to previous values of the condition codes, so it doesn't need to wait for previous instructions to finish computing their value of the condition codes."

- and -

"On x86 architectures, with variable-length instruction encoding, there could be occasions where either one is preferable over the other. If the shorter one would fit in a cache line or decode block where the larger wouldn't, then it will come out ahead. If the shorter one would leave half of the next instruction in the window, and the remaining half in the next window, the larger one might be better by aligning its successor nicely."

-- Marco
Dec 13 2013
"Nikolay" <sibnick gmail.com> writes:
I found that the performance of ranges is miserable in many 
cases. You should not use them if there is any chance of 
big/critical computations. Actually, I don't like having two 
subsets of the language: one with good performance, and the other 
with good maintainability/readability. I have a couple of 
thoughts about this situation. Maybe we can use CTFE and mixin to 
inline some range-based code at the library level?

FYI, one of my benchmarks (calculating the sum of two simple ranges):
#!/usr/bin/rdmd

module zip_test;

import std.range;
import std.datetime;
import std.getopt;
import std.stdio;
import std.algorithm;


struct Range {
    uint count = 0;
    @property bool empty() {
        return count >= limit;
    }
    @property int front() {
        return count;
    }
    bool popFront() {
        count++;
        return count < limit;
    }
}

uint testForEach() {
    Range r1, r2;
    uint rs = 0;
    for (int i = 0; !r1.empty && !r2.empty; i++) {
        rs += r1.front + r2.front;
        r1.popFront();
        r2.popFront();
    }
    writefln("foreach: %s", rs);
    return rs;
}

uint testZip() {
    Range r1, r2;
    auto tmpRange = step1(r1, r2);
    auto rs = reduce!(q{ a + b })(0U, tmpRange);
//  auto rs = reduce!((a, b) => a + b)(0U, tmpRange);
    writefln("zip: %s", rs);
    return rs;
}

auto step1(Range r1, Range r2) {
    //return zip(r1, r2).map!(a => a[0] + a[1]);
    return zip(r1, r2).map!(q{ a[0] + a[1] });
}

uint limit = 10_000;

void main(string[] args) {
    getopt(args, "limit", &limit);
    auto r = benchmark!(testForEach, testZip)(10);
    foreach (idx, rs; r) {
        writefln("%s - %s", idx, rs.msecs);
    }
}

Results:
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline zip_test.d
./zip_test --limit 10000000
…........
0 - 0
1 - 764
Dec 13 2013
"lomereiter" <lomereiter gmail.com> writes:
On Friday, 13 December 2013 at 18:03:28 UTC, Nikolay wrote:
 Results:
 ~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline 
 zip_test.d
 ./zip_test --limit 10000000
 …........
 0 - 0
 1 - 764
And now try -inline-threshold=1024 ;-)
Reading -help-hidden sometimes helps.
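Spelled out against the command line from the post above, the suggested invocation would look something like this (the install path is the one Nikolay used and will differ per machine; the flag is LDC-specific):

```shell
# Raise the inliner's cost threshold so the zip/map/reduce adapters get inlined.
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline -inline-threshold=1024 zip_test.d
./zip_test --limit 10000000
```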
Dec 13 2013
"Nikolay" <sibnick gmail.com> writes:
 And now try -inline-threshold=1024 ;-)
 Reading -help-hidden sometimes helps.
Thanks. It is really a game changer. I take back what I said about ranges.
Dec 14 2013