www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - benchmark on binary trees

reply Alex <sascha.orlov gmail.com> writes:
Hi everybody,
this is going to be a learning by doing a benchmark test - post.

Found this "game" on the web
http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees
and wanted to experiment on my self, I tried to reimplement some 
code in D.

This:
http://dpaste.dzfl.pl/4d8e0ceb867c
I did by reimplementing the C# version
and this:
http://dpaste.dzfl.pl/2fd3841d85ef
by reimplementing the C++ version.

Besides the questions of improving both of them, which are 
important but only because I would... maybe... at some point... 
upload my code. For some DLang advertising effect ;)

The learning effect - questions are the following:

1. I wrote the C++ inspired version after the C# inspired, hoping 
it would be faster. This was not the case. Why?
This is a big question, as there could be just some silly 
mistakes of kind
"do not use this call, use this, and your program will be twice 
as fast."
to some mistakes of general kind like
"why were you so foolish, and didn't wrote the whole code in just 
one line."

2. I failed trying to implement some kind of parallelism in the 
second version. I tried something like

auto depthind = iota(min_depth, max_depth+1, 2);
foreach(dummy_i, d; taskPool.parallel(depthind))

for the for-loop in the main function, but then, the program 
never ends. Do you have a hint what was wrong?

3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?

4. This is some ambiguous question. I come originally from the C# 
corner, so I natively think in terms of the first approach. Can 
one general make the statement, that for D one of the approaches 
above will be always faster then the other?
Dec 04 2015
next sibling parent CraigDillabaugh <craig.dillabaugh gmail.com> writes:
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
 Hi everybody,
 this is going to be a learning by doing a benchmark test - post.
clip
 3. The compilation was done by:
 dmd -O -release -boundscheck=off [filename.d]
 Is there anything else to improve performance significantly?
If you have access to LDC or GDC they typically produce significantly faster code.
Dec 04 2015
prev sibling next sibling parent reply anonymous <anonymous example.com> writes:
On 04.12.2015 15:06, Alex wrote:
 1. I wrote the C++ inspired version after the C# inspired, hoping it
 would be faster. This was not the case. Why?
[...] Why did you expect the C++ inspired version to be faster? Just because the original was written in C++? From a quick skim the two versions seem to be pretty much identical, aside from names and struct vs class. Names don't make a difference of course. It would be easier to compare the two versions if you used the same names, though. The differences between structs on the heap and classes are subtle. It's not surprising that they don't have an impact here. If there are substantial differences between the two versions, please point them out.
 2. I failed trying to implement some kind of parallelism in the second
 version. I tried something like

 auto depthind = iota(min_depth, max_depth+1, 2);
 foreach(dummy_i, d; taskPool.parallel(depthind))

 for the for-loop in the main function, but then, the program never ends.
 Do you have a hint what was wrong?
Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.
 3. The compilation was done by:
 dmd -O -release -boundscheck=off [filename.d]
 Is there anything else to improve performance significantly?
The other compilers, ldc and gdc, usually produce faster code than dmd.
 4. This is some ambiguous question. I come originally from the C#
 corner, so I natively think in terms of the first approach. Can one
 general make the statement, that for D one of the approaches above will
 be always faster then the other?
Just reiterating what I said re the first question: I don't really see a difference. If you think there is, please point it out. Or if you're not sure, feel free to ask about specific parts.
Dec 04 2015
parent reply Alex <sascha.orlov gmail.com> writes:
On Friday, 4 December 2015 at 19:31:22 UTC, anonymous wrote:
 Why did you expect the C++ inspired version to be faster? Just 
 because the original was written in C++?

 From a quick skim the two versions seem to be pretty much 
 identical, aside from names and struct vs class.

 Names don't make a difference of course. It would be easier to 
 compare the two versions if you used the same names, though.

 The differences between structs on the heap and classes are 
 subtle. It's not surprising that they don't have an impact here.

 If there are substantial differences between the two versions, 
 please point them out.
Yes, I missed this, sorry. The main part of the question was probably about the class and struct difference. I thought handling with structs and pointers would be faster then with classes.
 2. auto depthind = iota(min_depth, max_depth+1, 2);
 foreach(dummy_i, d; taskPool.parallel(depthind))
Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.
Ok, this was strange, but I found the crux. The correct question is: Why the parallel version is slower then the sequential? If you set int n = 14 in the main function the parallel version is MUCH slower then the sequential. At my machine 7x slower. Shouldn't it be the other way round?
 3. The compilation was done by:
 dmd -O -release -boundscheck=off [filename.d]
 Is there anything else to improve performance significantly?
The other compilers, ldc and gdc, usually produce faster code than dmd.
Thanks for the hint! As ldc doesn't have the experimental part of the includes, compared on the first version. The result was: program compiled with ldc2, same params, was 5% slower... nothing crucial, I think...
 Just reiterating what I said re the first question: I don't 
 really see a difference. If you think there is, please point it 
 out. Or if you're not sure, feel free to ask about specific 
 parts.
Yeah... so the answer here for me, is that I can stay with my way of thinking in c# style. :)
Dec 04 2015
parent reply anonymous <anonymous example.com> writes:
On 04.12.2015 21:30, Alex wrote:
 Yes, I missed this, sorry. The main part of the question was probably
 about the class and struct difference. I thought handling with structs
 and pointers would be faster then with classes.
When you use a struct directly, without going through a pointer, that can be significantly faster than using a class. But structs through pointers are pretty much the same as classes, performance-wise. [...]
 Why the parallel version is slower then the sequential?
 If you set
 int n = 14 in the main function
 the parallel version is MUCH slower then the sequential. At my machine
 7x slower. Shouldn't it be the other way round?
I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role. [...]
 As ldc doesn't have the experimental part of the includes, compared on
 the first version. The result was: program compiled with ldc2, same
 params, was 5% slower... nothing crucial, I think...
"The experimental part" is std.experimental.allocator, right? I may be wrong here, but I think the default allocator is essentially just `new`. So that wouldn't give you any performance boost. [...]
 Yeah... so the answer here for me, is that I can stay with my way of
 thinking in c# style. :)
In this case, I think you're fine. Generally, be aware that D doesn't shine when you create lots of throw-away objects. The GC can't compete with those of C# or Java, so when you translate code from those languages too closely, performance may be worse.
Dec 04 2015
parent reply Alex <sascha.orlov gmail.com> writes:
On Friday, 4 December 2015 at 23:23:37 UTC, anonymous wrote:
 Why the parallel version is slower then the sequential?
 If you set
 int n = 14 in the main function
 the parallel version is MUCH slower then the sequential. At my 
 machine
 7x slower. Shouldn't it be the other way round?
I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role.
found and tried out the -vgc option... Is there a way to deactivate the GC, if it stands in way?
 In this case, I think you're fine. Generally, be aware that D 
 doesn't shine when you create lots of throw-away objects. The 
 GC can't compete with those of C# or Java, so when you 
 translate code from those languages too closely, performance 
 may be worse.
Yes, I thought in the same direction. That's why I tried to reimplement the c++ version. The idea was: as I can't compete with the GC of C#, I could try to compete by applying another approach. I don't try to write something which compete with c++ either (I would have to take c++, then? ;) ), but something which clearly outperforms the languages with a virtual machine... tried the -inline option... no time win...
Dec 04 2015
parent anonymous <anonymous example.com> writes:
On 05.12.2015 01:40, Alex wrote:
 found and tried out the -vgc option...
 Is there a way to deactivate the GC, if it stands in way?
You can call core.memory.GC.disable to disable automatic collections. .enable to turn them on again. http://dlang.org/phobos/core_memory.html#.GC
 Yes, I thought in the same direction. That's why I tried to reimplement
 the c++ version. The idea was: as I can't compete with the GC of C#, I
 could try to compete by applying another approach. I don't try to write
 something which compete with c++ either (I would have to take c++, then?
 ;) ), but something which clearly outperforms the languages with a
 virtual machine...
Your C++ inspired version still allocated via the GC, though. If that eats performance, then you'd have to mirror more closely what the C++ version actually does. It most probably doesn't use a GC. I presume this is the C++ version you took as inspiration: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6 That uses a boost::object_pool for the nodes. Assuming that that's being used for a reason, you could probably get a performance boost by doing something similar. I'm not really familiar with std.experimental.allocator, maybe there's something like object_pool in there. Otherwise, you'd have to implement it yourself. Generally, I think most of the time you can write a program in D that's as fast as one written in C++. But you may have to give up on some convenience, and the libraries may not be there to support you.
Dec 05 2015
prev sibling next sibling parent anonymous <anonymous example.com> writes:
On 04.12.2015 15:06, Alex wrote:
 3. The compilation was done by:
 dmd -O -release -boundscheck=off [filename.d]
 Is there anything else to improve performance significantly?
You forgot -inline. By the way, I'm not a fan of using -boundscheck=off like a general optimization flag. It undermines safe, and as the documentation says, it should be used with caution. http://dlang.org/dmd-windows.html#switch-boundscheck
Dec 04 2015
prev sibling next sibling parent JR <sunspyre gmail.com> writes:
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
[...]
 hoping it would be faster. This was not the case. Why?
[...]
 Is there anything else to improve performance significantly?
Profile. Profile profile profile. Callgrind. Find bottlenecks instead of guessing them.
Dec 05 2015
prev sibling parent reply Alex <sascha.orlov gmail.com> writes:
Ok, lets conclude this post for now. Did some comparison runs 
with the original C++ code. Missed this at the beginning.
The results so far are as following:

Here is the original code in C++.
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6

With modifications to be able to run it on my mac os machine this 
results in the code available here:
http://pastebin.com/G5cYESdX
compilation done with
g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp 
main.cpp -o main.o && g++ main.o -o main.run -fopenmp 
-lboost_system

Here you can find the last version of my D code:
http://dpaste.dzfl.pl/8c8ab00699b5
compilation done with
dmd -release -O -boundscheck=off -inline main.d

time comparison, just with
time ./main
yields

for C++
real 0m8.255s
user 0m7.342s
sys 0m0.905s

for D
real 0m35.356s
user 0m35.149s
sys 0m0.154s

so the overall factor is round about 5.

Thanks for commenting to everyone! If anybody has further ideas - 
all of them would be appreciated :)
The original site is not interested in any further languages to 
be tested, so my experiment ends here for now...
Dec 06 2015
next sibling parent reply Rikki Cattermole <alphaglosined gmail.com> writes:
On 06/12/15 9:35 PM, Alex wrote:
 Ok, lets conclude this post for now. Did some comparison runs with the
 original C++ code. Missed this at the beginning.
 The results so far are as following:

 Here is the original code in C++.
 http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6


 With modifications to be able to run it on my mac os machine this
 results in the code available here:
 http://pastebin.com/G5cYESdX
 compilation done with
 g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o
 main.o && g++ main.o -o main.run -fopenmp -lboost_system

 Here you can find the last version of my D code:
 http://dpaste.dzfl.pl/8c8ab00699b5
 compilation done with
 dmd -release -O -boundscheck=off -inline main.d

 time comparison, just with
 time ./main
 yields

 for C++
 real 0m8.255s
 user 0m7.342s
 sys 0m0.905s

 for D
 real 0m35.356s
 user 0m35.149s
 sys 0m0.154s

 so the overall factor is round about 5.

 Thanks for commenting to everyone! If anybody has further ideas - all of
 them would be appreciated :)
 The original site is not interested in any further languages to be
 tested, so my experiment ends here for now...
Why is TreeNode not final? Also yours does not use threads in any way. If you cannot add multithreading on D, remove it from c++ before comparing. They are not equal.
Dec 06 2015
parent Alex <sascha.orlov gmail.com> writes:
On Sunday, 6 December 2015 at 08:45:10 UTC, Rikki Cattermole 
wrote:
 Why is TreeNode not final?
This is an interesting hint! Just after adding final the program takes two seconds less... This is roughly 5%. Do you have another hints of this kind? ;)
 Also yours does not use threads in any way.

 If you cannot add multithreading on D, remove it from c++ 
 before comparing. They are not equal.
Yes... I'm aware of this discrepancy, and I tried how the time statistics differ with and without the #pragma statement. With the statement it is half a second slower then without. This seems strange to me, and I don't have a clue why it is so. But as this doesn't change very much for the D/C++ comparison and I think the correct goal is to have a parallel version in D I let the #pragma statement inside.
Dec 07 2015
prev sibling parent reply visitor <visitor gmail.com> writes:
On Sunday, 6 December 2015 at 08:35:33 UTC, Alex wrote:
 Thanks for commenting to everyone! If anybody has further ideas 
 - all of them would be appreciated :)
 The original site is not interested in any further languages to 
 be tested, so my experiment ends here for now...
Hello, interesting exercise for me to learn about allocators :-) i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version : http://dpaste.dzfl.pl/6c3e6edcff59 BUT it consumes insane memory, don't try with argument more than 17 !!! so either i'm doing something stupid (in parallel code, i guess) or the FreeList allocator is totally not the right tool ... (both ?) some light-shedding would be really appreciated, Thanks
Dec 06 2015
parent reply Alex <sascha.orlov gmail.com> writes:
On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
 Hello, interesting exercise for me to learn about allocators :-)
Nice to know, a novice can inspire someone :)
 i managed to parallelize the code reaching similar performance, 
 in terms of speed, as the non parallel version :
 http://dpaste.dzfl.pl/6c3e6edcff59
Cool! This is what I looked for!
 BUT it consumes insane memory, don't try with argument more 
 than 17 !!!

 so either i'm doing something stupid (in parallel code, i 
 guess) or the FreeList allocator is totally not the right tool 
 ... (both ?)
I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now. And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have... But your solution is really impressive, it reduces the factor from 5 to 3 immediately :) And I'm going to read your code carefully...
 some light-shedding would be really appreciated, Thanks
Dec 07 2015
parent reply visitor <visitor gmail.com> writes:
On Monday, 7 December 2015 at 10:55:25 UTC, Alex wrote:
 On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
 Hello, interesting exercise for me to learn about allocators 
 :-)
Nice to know, a novice can inspire someone :)
 i managed to parallelize the code reaching similar 
 performance, in terms of speed, as the non parallel version :
 http://dpaste.dzfl.pl/6c3e6edcff59
Cool! This is what I looked for!
 BUT it consumes insane memory, don't try with argument more 
 than 17 !!!
I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now. And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have...
i've got more speed improvement with "taskPool.parallel(depthind, 2)" in the foreach parallel loop : second argument are workUnits (2 for me, on a quad core gave best results) Also using directly "FreeList!(Mallocator, Tree_node.sizeof)" without wrapping it in an allocatorObject gives speed improvement (with changes to makeTree method) i took inspiration from the C code, they use a memory pool management, like anonymous already pointed in c++ version, which i think could (must?) be achieved with allocators, to gain speed i think it's a key point, no GC !! FreeList allocator appears (to me) as a good candidate for this. but as i'm new to this, i'm sure to not doing it the right way ! i tried the exact same FreeList allocator but backed with the GCAllocator (not the Mallocator used in my code), then memory consumption is very good but of course it"s slow ! i tried a lot of other allocators, variations on the presented code, but memory management is awful :(
Dec 07 2015
parent reply visitor <visitor gmail.com> writes:
using Apache Portable Runtime(APR) like in the C version :
http://dpaste.dzfl.pl/6ca8b5ffd6dc

works like a charm, 2.061s on my machine !

if file name is binarytrees.d
dmd -w -inline -O -release -I/usr/include/apr-1.0 
-L/usr/lib/x86_64-linux-gnu/libapr-1.so -of"binarytrees" 
"binarytrees.d"

measured with "time ./binarytrees 20" on command line:
real	0m2.061s
user	0m8.156s
sys	0m0.181s

C version gives :
real	0m1.892s
user	0m9.087s
sys	0m0.188s

Not bad !!! :-D

Still i would like to know what i'm doing wrong with the 
allocator's version, besides the fact that apr tools are doing 
more sophisticated work than my naive approach :-D ?
Dec 08 2015
parent reply visitor <visitor gmail.com> writes:
C++ version :
real	0m3.587s
user	0m9.211s
sys	0m7.341s
Dec 08 2015
parent reply visitor <visitor gmail.com> writes:
version with apr (like in c version)
http://dpaste.dzfl.pl/68c0157225e7
compiled with ldc it's indeed a bit faster on average :
real	0m1.999s
user	0m9.810s
sys	0m0.148

btw Rust version is even faster than my little bit outdated gcc 
(4.9)

latest try with allocators :
http://dpaste.dzfl.pl/86b9b3c4ad71
swallows huge memory, slow !

what am i doing wrong ? (concurrency problem ?)
Dec 09 2015
parent visitor <visitor gmail.com> writes:
ok, i have a working version (memory is nice, twice the speed as 
non parallel) ;
http://dpaste.dzfl.pl/504a652c6c47

real	0m14.427s
user	1m19.347s
sys     0m0.124s

i've got similar performances, without Allocators, using directly 
malloc and free

i had to recursively deallocate ...

though, still far from competing with the fastest, any advice ?
i guess FreeList allocator is not the better tool here ?
Dec 11 2015