
digitalmars.D - Ranges, Performance, and opApply

reply Kapps <Kapps NotValidEmail.com> writes:
As it stands right now, ranges are quite inefficient. Phobos is designed 
with an emphasis on ranges, including in performance-sensitive code such 
as std.algorithm. No matter how optimized the range is, it will always 
be slower than the equivalent opApply function (and, as far as I can 
tell, offers no benefit for just iteration).

The fastest way of iterating over anything currently is foreach over an 
array. It beats even for(size_t i = 0; i < length; i++) by a decent bit. The 
lowest level range for all ranges is almost always an array (that is, if 
you have a range using a range ... using a range, it is likely that the 
last range is an array or uses an array). By implementing an opApply for 
these ranges that calls opApply on the wrapped range or wrapped array 
(or other), you can roughly double the performance of iterating 
(for many iterations over a small body). It is not possible to make the 
compiler-generated opApply make this optimization, as it does not know 
what the underlying data for the range is.
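
The forwarding idea above can be sketched like this (a minimal illustration only; the ArrayWrapper name and layout are made up and this is not Phobos code):

```d
import std.stdio;

// Hypothetical wrapper range over an array (name made up for illustration).
struct ArrayWrapper(T)
{
    T[] data;

    // Range primitives (external iteration).
    bool empty() const { return data.length == 0; }
    ref T front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }

    // Manual opApply: forwards straight to the underlying array, so
    // foreach over the wrapper iterates the array directly.
    int opApply(scope int delegate(ref T) dg)
    {
        foreach (ref e; data)
            if (auto r = dg(e))
                return r;
        return 0;
    }
}

void main()
{
    auto w = ArrayWrapper!int([1, 2, 3, 4]);
    int sum;
    // When both opApply and range primitives exist, foreach uses opApply.
    foreach (e; w)
        sum += e;
    writeln(sum); // 10
}
```

Since foreach prefers opApply over the range primitives when both are present, adding one does not break existing foreach loops over the range.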

Creating an opApply can be done on a per-range basis for ranges that are 
used in performance-sensitive code (such as countUntil). If the manual 
opApply foreach's over the input range, and the input range has a 
similar manual opApply, there are fair performance benefits. If the 
underlying range does not have a manual opApply, there will still be 
some performance benefits, and does not break code in any way. If all 
ranges being used have opApply and the last range is an array, then 
performance benefits are very significant.

As an example, I made a test of 5 different ways of implementing 
std.algorithm.filter, and two tests of another filter on top of the 
output from the first one, one with a manual opApply, and one without. 
Basically, a manual opApply was over twice as fast. The code and results 
can be found at http://pastie.org/2781723. Ultimately, nothing is as 
fast as just doing a foreach yourself, but that's to be expected. It 
can, however, easily be made so that the difference is not as extreme as 
it currently is.
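
For reference, the kind of manual opApply described above might look like this (a simplified sketch with a made-up MyFilter name, not the actual std.algorithm.filter implementation):

```d
import std.range, std.stdio;

// Simplified filter with both range primitives and a manual opApply.
// Illustrative sketch only; not the Phobos implementation.
struct MyFilter(alias pred, Range)
{
    Range input;

    this(Range r)
    {
        input = r;
        skip();
    }

    private void skip()
    {
        while (!input.empty && !pred(input.front))
            input.popFront();
    }

    // Range primitives: each step may test the predicate several times.
    bool empty() { return input.empty; }
    auto front() { return input.front; }
    void popFront() { input.popFront(); skip(); }

    // Internal iteration: one pass over the input, predicate tested once
    // per element, delegate called only for matches. If the input has its
    // own opApply, the foreach below uses it.
    int opApply(scope int delegate(ref ElementType!Range) dg)
    {
        foreach (ref e; input)
            if (pred(e))
                if (auto r = dg(e))
                    return r;
        return 0;
    }
}

void main()
{
    int[] data = [0, 1, 2, 3, 4, 5];
    auto evens = MyFilter!(a => a % 2 == 0, int[])(data);
    int sum;
    foreach (e; evens) // uses opApply
        sum += e;
    writeln(sum); // 0 + 2 + 4 = 6
}
```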

So, should Phobos ranges have a manual opApply where feasible, for 
ranges used in performance-sensitive code?
Oct 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

The test is flawed in a subtle way and no conclusion can be drawn from 
it. The test filters a checkerboard distribution (one element 
satisfies/the next doesn't satisfy the filter). As such, the loop on 
lines 26-29 will always test exactly two times to take one step. In 
contrast, the loop at lines 75-81 is integrated and only makes the 
minimum amount of tests. Calling the delegate is expensive, but there 
are only half as many calls and the saved tests compensate for that.

I believe there is nothing inherent in the design of ranges that makes 
them generally less efficient than various alternatives. Yes, there will 
be the corner cases where internal iteration will be better, but overall 
I don't think they form a large fraction of all cases.


Andrei
Oct 30 2011
next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.

In fact the opApply version does ONE additional check at loop start, which doesn't matter for 2.5M iterations.
 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there will  
 be the corner cases where internal iteration will be better, but overall  
 I don't think they form a large fraction of all cases.


 Andrei

The test still doesn't measure anything but spurious effects of 
different inline decisions. I get a ~10% hit for range foreach over 
opApply. When looping without a predicate, popFront is inlined and the 
result is a ~10% hit for opApply. As the opApply body can never be 
inlined, it is a worse choice in general.

martin
Oct 30 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/30/11 11:45 AM, Martin Nowak wrote:
 On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.


Indeed you're right. In the quiescent state both implementations call the support functions the same number of times. Thanks.
 In fact the opApply version does ONE additional check at loop start, which
 doesn't matter for 2.5M iterations.

 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there
 will be the corner cases where internal iteration will be better, but
 overall I don't think they form a large fraction of all cases.


 Andrei

 The test still doesn't measure anything but spurious effects of
 different inline decisions. I get a ~10% hit for range foreach over
 opApply. When looping without a predicate, popFront is inlined and the
 result is a ~10% hit for opApply. As the opApply body can never be
 inlined, it is a worse choice in general.

That was my intuition too.

Andrei
Oct 30 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.
Oct 30 2011
prev sibling parent Kapps <Kapps NotValidEmail.com> writes:
Thanks for the explanation guys. That definitely makes me feel better 
about using ranges in my code (though for plain iteration opApply does 
seem easier to implement, since it doesn't force you to conform to a 
specific naming convention, just one operator). When replacing modulus 2 
with modulus 100, the results are indeed very close, showing that the 
performance issue is not related to ranges vs opApply, which is rather 
surprising to me actually. Optimization ftw I guess. :P

On 30/10/2011 10:45 AM, Martin Nowak wrote:
 On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.

In fact the opApply version does ONE additional check at loop start, which doesn't matter for 2.5M iterations.
 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there
 will be the corner cases where internal iteration will be better, but
 overall I don't think they form a large fraction of all cases.


 Andrei

 The test still doesn't measure anything but spurious effects of
 different inline decisions. I get a ~10% hit for range foreach over
 opApply. When looping without a predicate, popFront is inlined and the
 result is a ~10% hit for opApply. As the opApply body can never be
 inlined, it is a worse choice in general.

 martin

Oct 31 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 30 Oct 2011 20:23:59 +0100, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

Right, it should have been 'is mostly not inlined'.
Oct 30 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

The compiler should almost always be able to inline opApply, as the code 
for the opApply body is always available. There are a few exceptions, 
such as when opApply is not final, or when it's recursive. I wonder if 
even in these cases, some sort of "virtual inlining" such as pushing the 
code onto the stack and avoiding function calls would be possible 
(speaking from naivety, maybe this does nothing). Being able to exploit 
the fact that a delegate literal is always fully available would be nice.

Indeed, opApply should beat the pants off of ranges when the range 
primitives are virtual functions. But ranges should win when inlining is 
possible in some cases. There are always going to be use cases for 
opApply that ranges cannot duplicate (such as recursion), and vice versa 
(such as parallel iteration).

-Steve
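
The recursion case mentioned above can be sketched with a toy binary tree (made up for illustration): opApply recurses on the call stack, whereas a range would have to maintain an explicit stack of nodes to do the same.

```d
import std.stdio;

// Toy binary tree; in-order traversal via a recursive opApply.
class Node
{
    int value;
    Node left, right;

    this(int v, Node l = null, Node r = null)
    {
        value = v;
        left = l;
        right = r;
    }

    // Recurse left, visit this node, recurse right; any non-zero return
    // from the delegate (e.g. a break in the foreach body) propagates up.
    int opApply(scope int delegate(ref int) dg)
    {
        if (left)
            if (auto r = left.opApply(dg))
                return r;
        if (auto r = dg(value))
            return r;
        if (right)
            if (auto r = right.opApply(dg))
                return r;
        return 0;
    }
}

void main()
{
    //     2
    //    / \
    //   1   3
    auto root = new Node(2, new Node(1), new Node(3));
    foreach (v; root)
        writef("%s ", v); // 1 2 3
    writeln();
}
```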
Oct 31 2011
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 31 Oct 2011 13:56:21 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

 The compiler should almost always be able to inline opApply, as the
 code for the opApply body is always available. There are a few
 exceptions, such as when opApply is not final, or when it's recursive.
 I wonder if even in these cases, some sort of "virtual inlining" such
 as pushing the code onto the stack and avoiding function calls would be
 possible (speaking from naivety, maybe this does nothing). Being able
 to exploit the fact that a delegate literal is always fully available
 would be nice.

 Indeed, opApply should beat the pants off of ranges when the range
 primitives are virtual functions. But ranges should win when inlining
 is possible in some cases. There are always going to be use cases for
 opApply that ranges cannot duplicate (such as recursion), and vice
 versa (such as parallel iteration).

 -Steve

Inlining the body would necessitate one parameterized function per 
caller. Quite a lot of code, and it doesn't even have defined mangling, 
does it? It's a different topic when both the opApply function and the 
body are inlined at the call site. Also, when inlining the body it is 
much more difficult to assign registers to variables from the calling 
stack unless you can prove that nobody can refer to them.

I think a better approach is to revive the templated 'opApply(alias 
fbody)' idea as an alternative to the delegate one. But there are 
currently too many issues to do this. Constructing similar functionality 
requires quite some C++-ish gymnastics, partly due to bugs and for 
reasons discussed here: 
http://d.puremagic.com/issues/show_bug.cgi?id=5710.

----
import std.range, std.stdio;

struct Context
{
    // doesn't even work as a static method
    alias Context_forEach forEach;

    uint[] data;
}

// doesn't work at module scope when erroneously prefixed with static ???
/*static*/ void Context_forEach(alias fbody)(ref Context self)
{
    foreach(e; self.data)
        fbody(e);
}

void main()
{
    size_t sum;
    auto ctx = Context(array(iota(0U, 500U)));
    Context.forEach!( (a) { sum += a; } )(ctx);
    writeln(sum);
}
----

martin
Oct 31 2011