
digitalmars.D.learn - Performance of ranges versus direct

"Frustrated" <c1514843 drdrb.com> writes:
Has anyone done any work on comparing the performance of ranges 
vs direct, straightforward code (hand-optimized along the same 
lines as the range implementation)?

e.g., how does filter compare to a simple loop over the elements 
that does the comparison itself? How does a chain of UFCS calls 
compare to doing the same in a simple loop?
Dec 10 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 11 Dec 2013 08:29:38 +0100
schrieb "Frustrated" <c1514843 drdrb.com>:

 
 Has anyone done any work on comparing the performance of ranges 
 vs using direct straightforward code(optimized in the sense of 
 the way ranges work but by hand).
 
 e.g., How does filter compare to a simple loop over the elements 
 and comparing it. How does a a chain of UFCS compare to doing the 
 same in a simple loop.
Just use hand-written loops if you are worried about this, or benchmark on a per-case basis. There is no one-size-fits-all answer.

Not all code executes a million times per second though, so feel free to use ranges now and then, where concise code is valued more highly than premature optimization. ;)

-- Marco
Dec 11 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 11 December 2013 at 07:29:39 UTC, Frustrated wrote:
 Has anyone done any work on comparing the performance of ranges 
 vs using direct straightforward code(optimized in the sense of 
 the way ranges work but by hand).

 e.g., How does filter compare to a simple loop over the 
 elements and comparing it. How does a a chain of UFCS compare 
 to doing the same in a simple loop.
In my very unscientific experiments: range-based, functional-style D code runs at about 75-100% of the speed of the equivalent explicit iterative version.

A lot of the performance loss is down to missed optimisations, in particular inlining. gdc/ldc are good at this; dmd not so good, so you'll likely see bigger performance gaps with dmd than with the other compilers.

General advice: use ranges, std.range and std.algorithm wherever possible. Profile the resulting code and write out the hotspots in an imperative style to see if you can squeeze out some extra speed.
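To make the comparison concrete, here is a small sketch of the two styles being discussed. The task (sum of squares of even numbers) and all names are mine, chosen only for illustration, not taken from any benchmark in this thread:

```d
import std.algorithm : filter, map, sum;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // Range/functional style: a lazy UFCS chain over std.algorithm.
    auto total1 = iota(0, 100)
        .filter!(a => a % 2 == 0)
        .map!(a => a * a)
        .sum;

    // Equivalent hand-written imperative loop.
    int total2 = 0;
    foreach (i; 0 .. 100)
    {
        if (i % 2 == 0)
            total2 += i * i;
    }

    assert(total1 == total2);
    writeln(total1); // prints 161700
}
```

Whether the two versions compile to comparable code depends on how well the compiler inlines the filter/map adapters, which is exactly the point made above.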
Dec 11 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/12/13 11:10, John Colvin wrote:
 A lot of the performance loss is down to missed optimisations, in particular
 inlining.
Simple example:

    foreach(i; iota(0, 10)) { ... }

should be as fast as

    foreach(i; 0 .. 10) { ... }

but isn't. I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.

[* If I recall right, it's achievable by special-casing iota when the increment is 1, but don't quote me on that.]
 General advice: use ranges, std.range and std.algorithm wherever possible.
 Profile the resulting code and write out the hotspots in an imperative style to
 see if you can squeeze out some extra speed.
Yes -- use ranges as much as possible because the resulting code will be so much easier to read and maintain. Switching to imperative style is only worth it if there's a speed problem (read: "The speed is unsatisfactory for your use case and/or inferior to rival programs doing the same thing.").
Dec 11 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 11 Dec 2013 17:40:39 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 On 11/12/13 11:10, John Colvin wrote:
 A lot of the performance loss is down to missed optimisations, in particular
 inlining.
 Simple example: foreach(i; iota(0, 10)) { ... } should be as fast as foreach(i; 0 .. 10) { ... } but isn't. I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.
As far as I remember it *is* the same speed using GDC or LDC. But as long as this issue remains with DMD, I'm confident that foreach_reverse(i; 0 .. 10) will stay :)

-- Marco
Dec 12 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton 
Wakeling wrote:
 [* If I recall right, it's achievable by special-casing iota 
 when the increment is 1, but don't quote me on that.]
That shouldn't be necessary if the iota operations are inlined (and if they're not, that is a much greater concern!). The increment is then known to be a particular compile-time constant, and code is generated appropriately.

Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't expect to see any difference at that level anyway.
Dec 12 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 12 Dec 2013 16:02:28 +0100
schrieb "John Colvin" <john.loughran.colvin gmail.com>:

 On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton 
 Wakeling wrote:
 [* If I recall right, it's achievable by special-casing iota 
 when the increment is 1, but don't quote me on that.]
 That shouldn't be necessary if the iota operations are inlined (and if they're not, that is a much greater concern!). The increment is then known to be a particular compile-time constant and code generated appropriately. Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't even expect to see any difference at that level.
This doesn't seem to be the general case...

http://stackoverflow.com/questions/12163610/why-inc-and-add-1-have-different-performances

"For the x86 architecture, INC updates only a subset of the condition codes, whereas ADD updates the entire set of condition codes. (Other architectures have different rules, so this discussion may or may not apply.) So an INC instruction must wait for other previous instructions that update the condition code bits to finish before it can modify that previous value to produce its final condition code result. ADD can produce the final condition code bits without regard to previous values of the condition codes, so it doesn't need to wait for previous instructions to finish computing their value of the condition codes."

- and -

"On x86 architectures, with variable-length instruction encoding, there could be occasions where either one is preferable over the other. If the shorter one would fit in a cache line or decode block where the larger wouldn't, then it will come out ahead. If the shorter one would leave half of the next instruction in the window, and the remaining half in the next window, the larger one might be better by aligning its successor nicely."

-- Marco
Dec 13 2013
"Nikolay" <sibnick gmail.com> writes:
I found that the performance of ranges is miserable in many 
cases. You should not use them if there is any chance of 
big/critical computations. Actually, I don't like having two 
subsets of the language: one with good performance, and the other 
with good maintainability/readability. I have a couple of 
thoughts about this situation. Maybe we can use CTFE and mixin to 
inline some range-based code at the library level?

FYI, one of my benchmarks (calculating the sum of two simple ranges):
#!/usr/bin/rdmd

module zip_test;

import std.range;
import std.datetime;
import std.getopt;
import std.stdio;
import std.algorithm;


struct Range {
    uint count = 0;
    @property bool empty() {
        return count >= limit;
    }
    @property int front() {
        return count;
    }
    bool popFront() {
        count++;
        return count < limit;
    }
}

uint testForEach() {
    Range r1, r2;
    uint rs = 0;
    for (int i = 0; !r1.empty && !r2.empty; i++) {
        rs += r1.front + r2.front;
        r1.popFront();
        r2.popFront();
    }
    writefln("foreach: %s", rs);
    return rs;
}

uint testZip() {
    Range r1, r2;
    auto tmpRange = step1(r1, r2);
    auto rs = reduce!(q{ a + b })(0U, tmpRange);
//  auto rs = reduce!((a, b) => a + b)(0U, tmpRange);
    writefln("zip: %s", rs);
    return rs;
}

auto step1(Range r1, Range r2) {
    //return zip(r1, r2).map!(a => a[0] + a[1]);
    return zip(r1, r2).map!(q{ a[0] + a[1] });
}

uint limit = 10_000;

void main(string[] args) {
    getopt(args, "limit", &limit);
    auto r = benchmark!(testForEach, testZip)(10);
    foreach (idx, rs; r) {
        writefln("%s - %s", idx, rs.msecs);
    }
}

Results:
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline zip_test.d
./zip_test --limit 10000000
…........
0 - 0
1 - 764
Dec 13 2013
"lomereiter" <lomereiter gmail.com> writes:
On Friday, 13 December 2013 at 18:03:28 UTC, Nikolay wrote:
 Results:
 ~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline 
 zip_test.d
 ./zip_test --limit 10000000
 …........
 0 - 0
 1 - 764
And now try -inline-threshold=1024 ;-)
Reading -help-hidden sometimes helps.
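Spelled out against the command line from the post above, the suggested invocation would look something like this (the install path is the one Nikolay used and will differ per machine; the flag is LDC-specific):

```shell
# Raise the inliner's cost threshold so the zip/map/reduce adapters get inlined.
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline -inline-threshold=1024 zip_test.d
./zip_test --limit 10000000
```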
Dec 13 2013
"Nikolay" <sibnick gmail.com> writes:
 And now try -inline-threshold=1024 ;-)
 Reading -help-hidden sometimes helps.
Thanks. It is really a game changer. I take back what I said about ranges.
Dec 14 2013