
digitalmars.D.learn - I need some help benchmarking SoA vs AoS

reply maik klein <maikklein googlemail.com> writes:
I recently wrote an article on SoA
https://maikklein.github.io/post/soa-d/

But now I wanted to actually benchmark SoA vs AoS and it is so 
much harder than I thought.

In DMD, SoA basically always beats AoS by a huge margin: SoA is 
always at least twice as fast as AoS.
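
In short, the difference between the two layouts looks like this 
(a minimal sketch; the Particle type and its fields are 
illustrative, not the ones from my benchmark):

// Array of Structures (AoS): the fields of each element are interleaved.
struct Particle { float x, y, z, mass; }

// Structure of Arrays (SoA): each field lives in its own contiguous array.
struct Particles { float[] x, y, z, mass; }

void main()
{
    auto aos = new Particle[](1_000);
    Particles soa;
    soa.x.length = 1_000;
    soa.y.length = 1_000;
    soa.z.length = 1_000;
    soa.mass.length = 1_000;
    // A loop that only reads mass touches one contiguous array in the SoA
    // case, but strides over whole 16-byte structs in the AoS case.
}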

But with LDC it is just really hard to get reliable results.

I have created an example without any dependencies:

http://dpaste.dzfl.pl/877a925e0a33

dub run -b release --compiler=dmd

benchmarking complete access
AoS: 419 ms, 938 μs, and 2 hnsecs
SoA: 11 μs and 8 hnsecs
benchmarking partial access
AoS: 521 ms, 381 μs, and 3 hnsecs
SoA: 5 μs


dub run -b release --compiler=ldc

benchmarking complete access
AoS: 1 μs and 6 hnsecs
SoA: 1 hnsec
benchmarking partial access
AoS: 1 hnsec
SoA: 1 hnsec


The problem I have is that LDC always seems to optimize the 
functions too aggressively. At least one function always executes 
in "1 hnsec".

When I do manage to create a test case where AoS and SoA take a 
reasonable amount of time, they are equally fast, no matter what 
I do.

What I wanted to see:

If I iterate over a collection and always access all members, I 
would expect AoS and SoA to be equally fast.

If I iterate over a collection whose elements are large and I 
only access some of the members, I would expect SoA to be much 
faster.
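
In code, the two access patterns look roughly like this (a 
sketch, not the exact benchmark code; Particle is the 
illustrative type from the sketch above):

struct Particle { float x, y, z, mass; }

// Complete access: every field of every element is read, so both layouts
// stream the same total number of bytes through the cache.
float sumAll(const Particle[] items)
{
    float s = 0;
    foreach (ref it; items)
        s += it.x + it.y + it.z + it.mass;
    return s;
}

// Partial access: only mass is read. With AoS the three unused floats per
// element still travel through the cache; the SoA version would loop over
// just the contiguous mass array instead.
float sumMass(const Particle[] items)
{
    float s = 0;
    foreach (ref it; items)
        s += it.mass;
    return s;
}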

Any help or tips would be greatly appreciated.
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 14:47, maik klein wrote:
 The problem I have is that LDC always seems to optimize the functions
 too aggressively. At least one function always executes in "1 hnsec".
Let the output depend on the results somehow. Simply printing them out 
should do the trick. You can also try throwing an Exception on wrong 
results. Else the calculations will be optimized away completely.

Also make sure that data that's supposed to be dynamic actually comes 
from some kind of input. Else the program may just print precalculated 
results.
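
For instance, something along these lines (a sketch; compute is a 
hypothetical stand-in for whichever function is being timed):

import std.conv : to;
import std.exception : enforce;
import std.stdio : writeln;

// Hypothetical stand-in for one of the benchmarked functions.
double compute(size_t n)
{
    double s = 0;
    foreach (i; 0 .. n)
        s += i;
    return s;
}

void main(string[] args)
{
    // The size comes from program input, so the result can't be precalculated.
    immutable n = args.length > 1 ? args[1].to!size_t : 100_000;
    immutable result = compute(n);
    enforce(result > 0, "wrong result"); // check the result ...
    writeln(result);                     // ... and print it, so the work can't be elided
}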
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 17:31, ag0aep6g wrote:
 Let the output depend on the results somehow. Simply printing them out
 should do the trick. You can also try throwing an Exception on wrong
 results. Else the calculations will be optimized away completely.

 Also make sure that data that's supposed to be dynamic actually comes
 from some kind of input. Else the program may just print precalculated
 results.
I've crudely applied my advice here:
https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
(Can I somehow link to one specific commit of a gist?)

size is now read from the program arguments instead of being an enum, 
and the results are checked with enforce. ldmd can no longer avoid 
calculating anything or precalculate the results completely.

It may still do ridiculous optimizations that wouldn't be possible with 
real-world data, because the data is so formulaic. To avoid that you 
could prepare a file with test data and read it from there in the 
benchmark program. That way ldmd can't make predictions about it.

Note that I quick-and-dirty worked around two correctness issues in 
your code:
1) SOA didn't set its length. I've added it to allocate. No idea if 
that's the right spot.
2) SOA doesn't set the correct initial values. I'm just setting them 
from the outside here.
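
Reading the test data from a file could look like this (a sketch; 
testdata.bin is a made-up file name, assumed to contain raw floats):

import std.stdio : File, writeln;

void main()
{
    // "testdata.bin" is a made-up name; the point is that the contents
    // are unknown at compile time, so ldmd can't const-fold the benchmark.
    auto f = File("testdata.bin", "rb");
    auto data = new float[](cast(size_t)(f.size / float.sizeof));
    data = f.rawRead(data);
    writeln("loaded ", data.length, " floats");
}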
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Mar 26 2016
parent reply maik klein <maikklein googlemail.com> writes:
On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Mar 26 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 26 Mar 2016 17:43:48 +0000
schrieb maik klein <maikklein googlemail.com>:

 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because 
I'm impatient.

Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
That looks more like it. :)

There are a few things to keep in mind. When you use constant data and 
don't use the result, compilers can:

- Const-fold computations away.
- Specialize functions on compile-time known arguments. That works 
mostly as if the argument was a template argument. A new instance of 
the function is created for each invocation with a compile-time known 
value. (Disabling inlining won't prevent this.)
- Call pure functions with the same argument only once in a loop of 
1_000_000.
- Replace 1_000_000 additions of the number X in a loop with the 
expression 1_000_000*X.

In addition to these real-world optimizations, when you don't 
accumulate the result of the function call and print it or store it in 
some global variable, the whole computation may be removed as having 
"no side-effect", as others have pointed out.

When inlining is used, the compiler may also see through attempts to 
only use a part of the result and remove instructions that lead to the 
rest of it. For example, when you return a struct with two fields - a 
and b - and store the sum of a, but ignore b, then the compiler may 
remove computations that are only needed for b!

Try to generate input from random number generators or external files. 
Disable inlining for the benchmarked function via attribute or 
pragma(inline, false), or otherwise make sure that the compiler cannot 
guess what any of the arguments are and perform const-folding after 
inlining. When the result is returned, make sure you use so much of it 
that the compiler cannot elide instructions after inlining. It is often 
enough to just store it in a global variable.

-- 
Marco
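
Put together, those tips might look like this (a sketch; work is a 
placeholder for the benchmarked function):

import std.random : uniform;
import std.stdio : writeln;

float sink; // global: storing the result here keeps it "used"

// pragma(inline, false) keeps the call opaque, so the optimizer can't
// const-fold the body against compile-time-known arguments.
pragma(inline, false)
float work(const float[] data)
{
    float s = 0;
    foreach (x; data)
        s += x;
    return s;
}

void main()
{
    auto data = new float[](1_000_000);
    foreach (ref x; data)
        x = uniform(0.0f, 1.0f); // random input, nothing formulaic to predict
    sink = work(data); // use the whole result
    writeln(sink);
}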
Mar 26 2016
prev sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:
 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should have 
roughly the same API, except that mine doesn't support pre-allocation 
(reserving capacity) - I wanted to keep it minimalistic.

During access it should have the same performance; however, the 
insertion behavior should be noticeably different. I'm interested to 
see if one larger allocation on insertion would be faster than a 
number of small ones (or vice versa). Though, AoS should be the 
fastest in this regard (growth performance).
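
A growth benchmark along these lines could answer that (a sketch with 
a made-up Element type, not the actual containers):

import core.time : MonoTime;
import std.stdio : writeln;

struct Element { float a, b, c, d; } // illustrative element type

void main()
{
    enum n = 1_000_000;

    // AoS-style growth: one array, one growing allocation.
    auto t0 = MonoTime.currTime;
    Element[] aos;
    foreach (i; 0 .. n)
        aos ~= Element(i, i, i, i);
    writeln("AoS growth: ", MonoTime.currTime - t0);

    // SoA-style growth: four arrays, four smaller growing allocations.
    auto t1 = MonoTime.currTime;
    float[] a, b, c, d;
    foreach (i; 0 .. n)
    {
        a ~= i; b ~= i; c ~= i; d ~= i;
    }
    writeln("SoA growth: ", MonoTime.currTime - t1);
}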
Mar 29 2016
parent maik klein <maikklein googlemail.com> writes:
On Tuesday, 29 March 2016 at 21:27:15 UTC, ZombineDev wrote:
 On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:
 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should have 
roughly the same API, except that mine doesn't support pre-allocation 
(reserving capacity) - I wanted to keep it minimalistic.

During access it should have the same performance; however, the 
insertion behavior should be noticeably different. I'm interested to 
see if one larger allocation on insertion would be faster than a 
number of small ones (or vice versa). Though, AoS should be the 
fastest in this regard (growth performance).
Yes, I'll do it tomorrow, though I am not sure how much you can rely 
on the data. I think it really depends on how fragmented the heap 
currently is.

I will also benchmark memory access, just to see if there is any 
difference. I recently asked a question about it on SO:
https://stackoverflow.com/questions/36296960/pros-and-cons-of-one-big-buffer-vs-many-smaller-buffer?
Doesn't seem to be well received though.
Mar 29 2016