
digitalmars.D.learn - I need some help benchmarking SoA vs AoS

reply maik klein <maikklein googlemail.com> writes:
I recently wrote an article on SoA
https://maikklein.github.io/post/soa-d/

But now I wanted to actually benchmark SoA vs AoS and it is so 
much harder than I thought.

In DMD, SoA basically always beats AoS by a huge margin: SoA is 
always at least twice as fast as AoS.
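
In short, the difference between the two layouts looks like this 
(a minimal sketch; the Particle type and its fields are 
illustrative, not the ones from my benchmark):

// Array of Structures (AoS): the fields of each element are interleaved.
struct Particle { float x, y, z, mass; }

// Structure of Arrays (SoA): each field lives in its own contiguous array.
struct Particles { float[] x, y, z, mass; }

void main()
{
    auto aos = new Particle[](1_000);
    Particles soa;
    soa.x.length = 1_000;
    soa.y.length = 1_000;
    soa.z.length = 1_000;
    soa.mass.length = 1_000;
    // A loop that only reads mass touches one contiguous array in the SoA
    // case, but strides over whole 16-byte structs in the AoS case.
}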

But with LDC it is just really hard to get reliable results.

I have created an example without any dependencies:

http://dpaste.dzfl.pl/877a925e0a33

dub run -b release --compiler=dmd

benchmarking complete access
AoS: 419 ms, 938 μs, and 2 hnsecs
SoA: 11 μs and 8 hnsecs
benchmarking partial access
AoS: 521 ms, 381 μs, and 3 hnsecs
SoA: 5 μs


dub run -b release --compiler=ldc

benchmarking complete access
AoS: 1 μs and 6 hnsecs
SoA: 1 hnsec
benchmarking partial access
AoS: 1 hnsec
SoA: 1 hnsec


The problem I have is that LDC always seems to optimize the 
functions too aggressively. At least one function always executes 
in "1 hnsec".

When I do manage to create a test case where AoS and SoA take a 
reasonable amount of time, they are equally fast, no matter what 
I do.

What I wanted to see:

If I iterate over a collection and always access all members, I 
would expect AoS and SoA to be equally fast.

If I iterate over a collection whose elements are large and I 
only access some of the members, I would expect SoA to be much 
faster.
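
In code, the two access patterns look roughly like this (a 
sketch, not the exact benchmark code; Particle is the 
illustrative type from the sketch above):

struct Particle { float x, y, z, mass; }

// Complete access: every field of every element is read, so both layouts
// stream the same total number of bytes through the cache.
float sumAll(const Particle[] items)
{
    float s = 0;
    foreach (ref it; items)
        s += it.x + it.y + it.z + it.mass;
    return s;
}

// Partial access: only mass is read. With AoS the three unused floats per
// element still travel through the cache; the SoA version would loop over
// just the contiguous mass array instead.
float sumMass(const Particle[] items)
{
    float s = 0;
    foreach (ref it; items)
        s += it.mass;
    return s;
}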

Any help or tips would be greatly appreciated.
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 14:47, maik klein wrote:
 The problem I have is that LDC always seems to optimize the functions
 too aggressively. At least one function always executes in "1 hnsec".
Let the output depend on the results somehow. Simply printing them out 
should do the trick. You can also try throwing an Exception on wrong 
results. Else the calculations will be optimized away completely.

Also make sure that data that's supposed to be dynamic actually comes 
from some kind of input. Else the program may just print precalculated 
results.
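
For instance, something along these lines (a sketch; compute is a 
hypothetical stand-in for whichever function is being timed):

import std.conv : to;
import std.exception : enforce;
import std.stdio : writeln;

// Hypothetical stand-in for one of the benchmarked functions.
double compute(size_t n)
{
    double s = 0;
    foreach (i; 0 .. n)
        s += i;
    return s;
}

void main(string[] args)
{
    // The size comes from program input, so the result can't be precalculated.
    immutable n = args.length > 1 ? args[1].to!size_t : 100_000;
    immutable result = compute(n);
    enforce(result > 0, "wrong result"); // check the result ...
    writeln(result);                     // ... and print it, so the work can't be elided
}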
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 17:31, ag0aep6g wrote:
 Let the output depend on the results somehow. Simply printing them out
 should do the trick. You can also try throwing an Exception on wrong
 results. Else the calculations will be optimized away completely.

 Also make sure that data that's supposed to be dynamic actually comes
 from some kind of input. Else the program may just print precalculated
 results.
I've crudely applied my advice here:
https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
(Can I somehow link to one specific commit of a gist?)

size is now read from the program arguments instead of being an enum, 
and the results are checked with enforce. ldmd can no longer avoid 
calculating anything or precalculate the results completely.

It may still do ridiculous optimizations that wouldn't be possible with 
real-world data, because the data is so formulaic. To avoid that you 
could prepare a file with test data and read it from there in the 
benchmark program. That way ldmd can't make predictions about it.

Note that I quick-and-dirty worked around two correctness issues in 
your code:
1) SOA didn't set its length. I've added it to allocate. No idea if 
that's the right spot.
2) SOA doesn't set the correct initial values. I'm just setting them 
from the outside here.
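
Reading the test data from a file could look like this (a sketch; 
testdata.bin is a made-up file name, assumed to contain raw floats):

import std.stdio : File, writeln;

void main()
{
    // "testdata.bin" is a made-up name; the point is that the contents
    // are unknown at compile time, so ldmd can't const-fold the benchmark.
    auto f = File("testdata.bin", "rb");
    auto data = new float[](cast(size_t)(f.size / float.sizeof));
    data = f.rawRead(data);
    writeln("loaded ", data.length, " floats");
}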
Mar 26 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Mar 26 2016
parent reply maik klein <maikklein googlemail.com> writes:
On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Mar 26 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 26 Mar 2016 17:43:48 +0000
schrieb maik klein <maikklein googlemail.com>:

 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because 
I'm impatient.

Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
That looks more like it. :)

There are a few things to keep in mind. When you use constant data and 
don't use the result, compilers can:

- Const-fold computations away.
- Specialize functions on compile-time known arguments. That works 
mostly as if the argument was a template argument. A new instance of 
the function is created for each invocation with a compile-time known 
value. (Disabling inlining won't prevent this.)
- Call pure functions with the same argument only once in a loop of 
1_000_000.
- Replace 1_000_000 additions of the number X in a loop with the 
expression 1_000_000*X.

In addition to these real-world optimizations, when you don't 
accumulate the result of the function call and print it or store it in 
some global variable, the whole computation may be removed as having 
"no side-effect", as others have pointed out.

When inlining is used, the compiler may also see through attempts to 
only use a part of the result and remove instructions that lead to the 
rest of it. For example, when you return a struct with two fields - a 
and b - and store the sum of a, but ignore b, then the compiler may 
remove computations that are only needed for b!

Try to generate input from random number generators or external files. 
Disable inlining for the benchmarked function via attribute or 
pragma(inline, false), or otherwise make sure that the compiler cannot 
guess what any of the arguments are and perform const-folding after 
inlining. When the result is returned, make sure you use so much of it 
that the compiler cannot elide instructions after inlining. It is often 
enough to just store it in a global variable.

-- 
Marco
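
Put together, those tips might look like this (a sketch; work is a 
placeholder for the benchmarked function):

import std.random : uniform;
import std.stdio : writeln;

float sink; // global: storing the result here keeps it "used"

// pragma(inline, false) keeps the call opaque, so the optimizer can't
// const-fold the body against compile-time-known arguments.
pragma(inline, false)
float work(const float[] data)
{
    float s = 0;
    foreach (x; data)
        s += x;
    return s;
}

void main()
{
    auto data = new float[](1_000_000);
    foreach (ref x; data)
        x = uniform(0.0f, 1.0f); // random input, nothing formulaic to predict
    sink = work(data); // use the whole result
    writeln(sink);
}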
Mar 26 2016
prev sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:
 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should have 
roughly the same API, except that mine doesn't support pre-allocation 
(reserving capacity) - I wanted to keep it minimalistic.

During access it should have the same performance; however, the 
insertion behavior should be noticeably different. I'm interested to 
see if one larger allocation on insertion would be faster than a 
number of small ones (or vice versa). Though, AoS should be the 
fastest in this regard (growth performance).
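
A growth benchmark along these lines could answer that (a sketch with 
a made-up Element type, not the actual containers):

import core.time : MonoTime;
import std.stdio : writeln;

struct Element { float a, b, c, d; } // illustrative element type

void main()
{
    enum n = 1_000_000;

    // AoS-style growth: one array, one growing allocation.
    auto t0 = MonoTime.currTime;
    Element[] aos;
    foreach (i; 0 .. n)
        aos ~= Element(i, i, i, i);
    writeln("AoS growth: ", MonoTime.currTime - t0);

    // SoA-style growth: four arrays, four smaller growing allocations.
    auto t1 = MonoTime.currTime;
    float[] a, b, c, d;
    foreach (i; 0 .. n)
    {
        a ~= i; b ~= i; c ~= i; d ~= i;
    }
    writeln("SoA growth: ", MonoTime.currTime - t1);
}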
Mar 29 2016
parent maik klein <maikklein googlemail.com> writes:
On Tuesday, 29 March 2016 at 21:27:15 UTC, ZombineDev wrote:
 On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:
 On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
 On 26.03.2016 18:04, ag0aep6g wrote:
 https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm impatient.
Thanks, okay that gives me more reliable results.

For 1_000_000:

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks now. 
I'll probably also randomize the inputs.
Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should have 
roughly the same API, except that mine doesn't support pre-allocation 
(reserving capacity) - I wanted to keep it minimalistic.

During access it should have the same performance; however, the 
insertion behavior should be noticeably different. I'm interested to 
see if one larger allocation on insertion would be faster than a 
number of small ones (or vice versa). Though, AoS should be the 
fastest in this regard (growth performance).
Yes, I'll do it tomorrow, though I am not sure how much you can rely 
on the data. I think it really depends on how fragmented the heap 
currently is.

I will also benchmark memory access, just to see if there is any 
difference. I recently asked a question about it on SO:
https://stackoverflow.com/questions/36296960/pros-and-cons-of-one-big-buffer-vs-many-smaller-buffer?
Doesn't seem to be well received though.
Mar 29 2016