
digitalmars.D - Ranges, Performance, and opApply

reply Kapps <Kapps NotValidEmail.com> writes:
As it stands right now, ranges are quite inefficient. Phobos is designed 
with an emphasis on ranges, including in performance-sensitive code such 
as std.algorithm. No matter how optimized the range is, it will always 
be slower than the equivalent opApply function (and, as far as I can 
tell, offers no benefit for just iteration).

The fastest way of iterating over anything currently is foreach over an 
array. It beats even for(size_t i = 0; i < length; i++) by a decent bit. The 
lowest level range for all ranges is almost always an array (that is, if 
you have a range using a range ... using a range, it is likely that the 
last range is an array or uses an array). By implementing an opApply for 
these ranges that calls opApply on the wrapped range or wrapped array 
(or other), you can roughly double the performance of iterating 
(for many iterations over a small body). It is not possible to make the 
compiler-generated opApply make this optimization, as it does not know 
what the underlying data for the range is.
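
The forwarding idea above can be sketched like this (a minimal illustration only; the ArrayWrapper name and layout are made up and this is not Phobos code):

```d
import std.stdio;

// Hypothetical wrapper range over an array (name made up for illustration).
struct ArrayWrapper(T)
{
    T[] data;

    // Range primitives (external iteration).
    bool empty() const { return data.length == 0; }
    ref T front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }

    // Manual opApply: forwards straight to the underlying array, so
    // foreach over the wrapper iterates the array directly.
    int opApply(scope int delegate(ref T) dg)
    {
        foreach (ref e; data)
            if (auto r = dg(e))
                return r;
        return 0;
    }
}

void main()
{
    auto w = ArrayWrapper!int([1, 2, 3, 4]);
    int sum;
    // When both opApply and range primitives exist, foreach uses opApply.
    foreach (e; w)
        sum += e;
    writeln(sum); // 10
}
```

Since foreach prefers opApply over the range primitives when both are present, adding one does not break existing foreach loops over the range.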

Creating an opApply can be done on a per-range basis for ranges that are 
used in performance-sensitive code (such as countUntil). If the manual 
opApply foreach's over the input range, and the input range has a 
similar manual opApply, there are fair performance benefits. If the 
underlying range does not have a manual opApply, there will still be 
some performance benefits, and does not break code in any way. If all 
ranges being used have opApply and the last range is an array, then 
performance benefits are very significant.

As an example, I made a test of 5 different ways of implementing 
std.algorithm.filter, and two tests of another filter on top of the 
output from the first one, one with a manual opApply, and one without. 
Basically, a manual opApply was over twice as fast. The code and results 
can be found at http://pastie.org/2781723. Ultimately, nothing is as 
fast as just doing a foreach yourself, but that's to be expected. It 
can, however, easily be made so that the difference is not as extreme as 
it currently is.
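
For reference, the kind of manual opApply described above might look like this (a simplified sketch with a made-up MyFilter name, not the actual std.algorithm.filter implementation):

```d
import std.range, std.stdio;

// Simplified filter with both range primitives and a manual opApply.
// Illustrative sketch only; not the Phobos implementation.
struct MyFilter(alias pred, Range)
{
    Range input;

    this(Range r)
    {
        input = r;
        skip();
    }

    private void skip()
    {
        while (!input.empty && !pred(input.front))
            input.popFront();
    }

    // Range primitives: each step may test the predicate several times.
    bool empty() { return input.empty; }
    auto front() { return input.front; }
    void popFront() { input.popFront(); skip(); }

    // Internal iteration: one pass over the input, predicate tested once
    // per element, delegate called only for matches. If the input has its
    // own opApply, the foreach below uses it.
    int opApply(scope int delegate(ref ElementType!Range) dg)
    {
        foreach (ref e; input)
            if (pred(e))
                if (auto r = dg(e))
                    return r;
        return 0;
    }
}

void main()
{
    int[] data = [0, 1, 2, 3, 4, 5];
    auto evens = MyFilter!(a => a % 2 == 0, int[])(data);
    int sum;
    foreach (e; evens) // uses opApply
        sum += e;
    writeln(sum); // 0 + 2 + 4 = 6
}
```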

So, should Phobos ranges have a manual opApply where feasible, for 
ranges used in performance-sensitive code?
Oct 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

The test is flawed in a subtle way and no conclusion can be drawn from 
it. The test filters a checkerboard distribution (one element 
satisfies/the next doesn't satisfy the filter). As such, the loop on 
lines 26-29 will always test exactly two times to take one step. In 
contrast, the loop at lines 75-81 is integrated and only makes the 
minimum amount of tests. Calling the delegate is expensive, but there 
are only half as many calls and the saved tests compensate for that.

I believe there is nothing inherent in the design of ranges that makes 
them generally less efficient than various alternatives. Yes, there will 
be the corner cases where internal iteration will be better, but overall 
I don't think they form a large fraction of all cases.


Andrei
Oct 30 2011
next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.

In fact the opApply version does ONE additional check at loop start, which doesn't matter for 2.5M iterations.
 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there will  
 be the corner cases where internal iteration will be better, but overall  
 I don't think they form a large fraction of all cases.


 Andrei

The test still doesn't measure anything but spurious effects of 
different inline decisions. I get a ~10% hit for range foreach over 
opApply. When looping without a predicate, popFront is inlined and the 
result is a ~10% hit for opApply. As the opApply body can never be 
inlined, it is a worse choice in general.

martin
Oct 30 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/30/11 11:45 AM, Martin Nowak wrote:
 On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.


Indeed you're right. In the quiescent state both implementations call the support functions the same number of times. Thanks.
 In fact the opApply version does ONE additional check at loop start, which
 doesn't matter for 2.5M iterations.

 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there
 will be the corner cases where internal iteration will be better, but
 overall I don't think they form a large fraction of all cases.


 Andrei

 The test still doesn't measure anything but spurious effects of
 different inline decisions. I get a ~10% hit for range foreach over
 opApply. When looping without a predicate, popFront is inlined and the
 result is a ~10% hit for opApply. As the opApply body can never be
 inlined, it is a worse choice in general.

That was my intuition too.

Andrei
Oct 30 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.
Oct 30 2011
prev sibling parent Kapps <Kapps NotValidEmail.com> writes:
Thanks for the explanation guys. That definitely makes me feel better 
about using ranges in my code (though for plain iteration opApply does 
seem easier to implement, since it doesn't force you to conform to a 
specific naming convention, just one operator). When replacing modulus 2 
with modulus 100, the results are indeed very close, showing that the 
performance issue is not related to ranges vs opApply, which is rather 
surprising to me actually. Optimization ftw I guess. :P

On 30/10/2011 10:45 AM, Martin Nowak wrote:
 On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/30/11 3:02 AM, Kapps wrote:
 As it stands right now, ranges are quite inefficient. Phobos is
 designed with an emphasis on ranges, including in
 performance-sensitive code such as std.algorithm. No matter how
 optimized the range is, it will always be slower than the equivalent
 opApply function (and, as far as I can tell, offers no benefit for
 just iteration).

 The test is flawed in a subtle way and no conclusion can be drawn from
 it. The test filters a checkerboard distribution (one element
 satisfies/the next doesn't satisfy the filter). As such, the loop on
 lines 26-29 will always test exactly two times to take one step. In
 contrast, the loop at lines 75-81 is integrated and only makes the
 minimum amount of tests. Calling the delegate is expensive, but there
 are only half as many calls and the saved tests compensate for that.

In fact the opApply version does ONE additional check at loop start, which doesn't matter for 2.5M iterations.
 I believe there is nothing inherent in the design of ranges that makes
 them generally less efficient than various alternatives. Yes, there
 will be the corner cases where internal iteration will be better, but
 overall I don't think they form a large fraction of all cases.


 Andrei

 The test still doesn't measure anything but spurious effects of
 different inline decisions. I get a ~10% hit for range foreach over
 opApply. When looping without a predicate, popFront is inlined and the
 result is a ~10% hit for opApply. As the opApply body can never be
 inlined, it is a worse choice in general.

 martin

Oct 31 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 30 Oct 2011 20:23:59 +0100, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

Right, it should have been 'is mostly not inlined'.
Oct 30 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

The compiler should almost always be able to inline opApply, as the code 
for the opApply body is always available. There are a few exceptions, 
such as when opApply is not final, or when it's recursive. I wonder if 
even in these cases, some sort of "virtual inlining" such as pushing the 
code onto the stack and avoiding function calls would be possible 
(speaking from naivety, maybe this does nothing). Being able to exploit 
the fact that a delegate literal is always fully available would be nice.

Indeed, opApply should beat the pants off of ranges when the range 
primitives are virtual functions. But ranges should win when inlining is 
possible in some cases. There are always going to be use cases for 
opApply that ranges cannot duplicate (such as recursion), and vice versa 
(such as parallel iteration).

-Steve
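
The recursion case mentioned above can be sketched with a toy binary tree (made up for illustration): opApply recurses on the call stack, whereas a range would have to maintain an explicit stack of nodes to do the same.

```d
import std.stdio;

// Toy binary tree; in-order traversal via a recursive opApply.
class Node
{
    int value;
    Node left, right;

    this(int v, Node l = null, Node r = null)
    {
        value = v;
        left = l;
        right = r;
    }

    // Recurse left, visit this node, recurse right; any non-zero return
    // from the delegate (e.g. a break in the foreach body) propagates up.
    int opApply(scope int delegate(ref int) dg)
    {
        if (left)
            if (auto r = left.opApply(dg))
                return r;
        if (auto r = dg(value))
            return r;
        if (right)
            if (auto r = right.opApply(dg))
                return r;
        return 0;
    }
}

void main()
{
    //     2
    //    / \
    //   1   3
    auto root = new Node(2, new Node(1), new Node(3));
    foreach (v; root)
        writef("%s ", v); // 1 2 3
    writeln();
}
```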
Oct 31 2011
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 31 Oct 2011 13:56:21 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
 As the opApply body can never be inlined it is a worse choice in  
 general.

That was my intuition too. Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.

 The compiler should almost always be able to inline opApply, as the
 code for the opApply body is always available. There are a few
 exceptions, such as when opApply is not final, or when it's recursive.
 I wonder if even in these cases, some sort of "virtual inlining" such
 as pushing the code onto the stack and avoiding function calls would be
 possible (speaking from naivety, maybe this does nothing). Being able
 to exploit the fact that a delegate literal is always fully available
 would be nice.

 Indeed, opApply should beat the pants off of ranges when the range
 primitives are virtual functions. But ranges should win when inlining
 is possible in some cases. There are always going to be use cases for
 opApply that ranges cannot duplicate (such as recursion), and vice
 versa (such as parallel iteration).

 -Steve

Inlining the body would necessitate one parameterized function per 
caller. Quite a lot of code, and it doesn't even have defined mangling, 
does it? It's a different topic when both the opApply function and the 
body are inlined at the call site. Also, when inlining the body it is 
much more difficult to assign registers to variables from the calling 
stack unless you can prove that nobody can refer to them.

I think a better approach is to revive the templated 'opApply(alias 
fbody)' idea as an alternative to the delegate one. But there are 
currently too many issues to do this. Constructing similar functionality 
requires quite some C++-ish gymnastics, partly due to bugs and for 
reasons discussed here: 
http://d.puremagic.com/issues/show_bug.cgi?id=5710.

----
import std.range, std.stdio;

struct Context
{
    // doesn't even work as a static method
    alias Context_forEach forEach;

    uint[] data;
}

// doesn't work at module scope when erroneously prefixed with static ???
/*static*/ void Context_forEach(alias fbody)(ref Context self)
{
    foreach(e; self.data)
        fbody(e);
}

void main()
{
    size_t sum;
    auto ctx = Context(array(iota(0U, 500U)));
    Context.forEach!( (a) { sum += a; } )(ctx);
    writeln(sum);
}
----

martin
Oct 31 2011