
digitalmars.D - Perlin noise benchmark speed

reply Nick Treleaven <ntrel-public yahoo.co.uk> writes:
Hi,
A Perlin noise benchmark was quoted in this reddit thread:

http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

It apparently shows the 3 main D compilers producing slower code than 
Go, Rust, gcc, clang, Nimrod:

https://github.com/nsf/pnoise#readme

I initially wondered about std.random, but got this response:

"Yeah, but std.random is not used in that benchmark, it just initializes 
256 random vectors and permutates 256 sequential integers. What spins in 
a loop is just plain FP math and array read/writes. I'm sure it can be 
done faster, maybe D compilers are bad at automatic inlining or something. "

Obviously this is only one person's benchmark, but I wondered if people 
would like to check their code and suggest reasons for the speed deficit.
Jun 20 2014
next sibling parent reply Nick Treleaven <ntrel-public yahoo.co.uk> writes:
On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
Jun 20 2014
next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower 
 code than
 Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

-release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David
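
For readers less familiar with D, the point about virtual calls can be sketched as follows. This is only an illustration with hypothetical names (VirtualContext and FinalContext are stand-ins, not the benchmark's Noise2DContext): class methods in D are virtual by default, and marking the class final tells the compiler that get cannot be overridden.

```d
import std.stdio : writeln;

// Hypothetical stand-in: methods of a plain class are virtual by default,
// so a call through a class reference may go through the vtable.
class VirtualContext
{
    float get(float x, float y) { return x + y; }
}

// Marking the class final means get() cannot be overridden, so calls
// can be dispatched statically and become candidates for inlining.
final class FinalContext
{
    float get(float x, float y) { return x + y; }
}

void main()
{
    auto v = new VirtualContext;
    auto f = new FinalContext;
    writeln(v.get(1.0f, 2.0f), " ", f.get(1.0f, 2.0f));
}
```

Whether this changes the benchmark numbers noticeably is a separate question that only measurement can answer.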
Jun 20 2014
prev sibling next sibling parent "MrSmith" <mrsmith33 yandex.ru> writes:
On Friday, 20 June 2014 at 12:56:46 UTC, David Nadlinger wrote:
 On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower 
 code than
 Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

-release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David

struct can be used instead of class
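
A minimal sketch of what the struct suggestion could look like, assuming a much simplified context (Vec2 and Context here are hypothetical, not the benchmark's types). A struct is a value type with no vtable, so its methods are never virtual and no heap allocation is needed to create one.

```d
import std.stdio : writeln;

// Explicit initializers: D floats default to NaN otherwise.
struct Vec2 { float x = 0, y = 0; }

// Hypothetical value-type context: no vtable, no `new` required.
struct Context
{
    Vec2[2] grads = [Vec2(1, 0), Vec2(0, 1)];

    // Struct methods are always statically dispatched.
    float get(float x, float y) const
    {
        return grads[0].x * x + grads[1].y * y;
    }
}

void main()
{
    Context ctx;                    // lives on the stack
    writeln(ctx.get(2.0f, 3.0f));
}
```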
Jun 20 2014
prev sibling next sibling parent Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/20/2014 02:34 PM, Nick Treleaven via Digitalmars-d wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

Jun 20 2014
prev sibling next sibling parent Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/20/2014 02:56 PM, David Nadlinger via Digitalmars-d wrote:
 On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

-release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David

Jun 20 2014
prev sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 13:20:16 UTC, Robert Schadek via 
Digitalmars-d wrote:
 I added some final pure @safe stuff

Thanks. As a general comment, I'd be careful with suggesting the use of pure/@safe/… for performance improvements in microbenchmarks. While it is certainly good D style to use them wherever possible, it might lead people less familiar with D to believe that fast D code needs a lot of annotations. David
Jun 20 2014
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 20.06.2014 14:32, schrieb Nick Treleaven:
 Hi,
 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

 https://github.com/nsf/pnoise#readme

 I initially wondered about std.random, but got this response:

 "Yeah, but std.random is not used in that benchmark, it just initializes
 256 random vectors and permutates 256 sequential integers. What spins in
 a loop is just plain FP math and array read/writes. I'm sure it can be
 done faster, maybe D compilers are bad at automatic inlining or something. "

 Obviously this is only one person's benchmark, but I wondered if people
 would like to check their code and suggest reasons for the speed deficit.

write, printf etc. performance is benchmarked also - so not clear if pnoise is super-fast but write is super-slow etc...
Jun 20 2014
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 20.06.2014 15:14, schrieb dennis luehring:
 Am 20.06.2014 14:32, schrieb Nick Treleaven:
 Hi,
 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

 https://github.com/nsf/pnoise#readme

 I initially wondered about std.random, but got this response:

 "Yeah, but std.random is not used in that benchmark, it just initializes
 256 random vectors and permutates 256 sequential integers. What spins in
 a loop is just plain FP math and array read/writes. I'm sure it can be
 done faster, maybe D compilers are bad at automatic inlining or something. "

 Obviously this is only one person's benchmark, but I wondered if people
 would like to check their code and suggest reasons for the speed deficit.

write, printf etc. performance is benchmarked also - so not clear if pnoise is super-fast but write is super-slow etc...

using perf with 10 is maybe too small to give a good average result, and runtime startup etc. is also measured - it is not clear what is slower

these benchmarks should be separated into 3 parts:

runtime startup
pure pnoise
result output - needed only once for verification; returning dummy output would fit better to test the pnoise speed

are array bounds checks active?
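
The separation suggested above could be sketched like this. It is an assumption-laden illustration, not the benchmark's code: kernel and fill are hypothetical stand-ins for the noise computation, and the timing uses Phobos's std.datetime.stopwatch so that only the compute loop is measured, with the result printed once afterwards.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : stderr, writeln;

// Hypothetical stand-in for the noise kernel; not the benchmark's code.
float kernel(float x, float y) { return x * 0.5f + y * 0.25f; }

// Fills a size x size image; this is the only part that gets timed.
void fill(float[] pixels, size_t size)
{
    foreach (y; 0 .. size)
        foreach (x; 0 .. size)
            pixels[y * size + x] = kernel(x * 0.1f, y * 0.1f);
}

void main()
{
    enum size_t size = 256;
    auto pixels = new float[](size * size);

    // Runtime startup and allocation happen before the stopwatch starts.
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. 100)
        fill(pixels, size);
    sw.stop();

    // Timing goes to stderr; a single value goes to stdout for verification.
    stderr.writeln("compute only: ", sw.peek.total!"msecs", " ms");
    writeln(pixels[$ - 1]);
}
```

Printing one value keeps the optimizer from discarding the loop while keeping I/O cost out of the measured region.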
Jun 20 2014
prev sibling next sibling parent "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
 write, printf etc. performance is benchmarked also - so not 
 clear
 if pnoise is super-fast but write is super-slow etc...

Indeed, and on Windows (at least 8) the size of the command window (CMD) affects the result drastically. For example, running this test with the console maximized takes 2.58s, while the same test in a small window takes 2.11s! Matheus.
Jun 20 2014
prev sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 13:46:26 UTC, Mattcoder wrote:
 On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
 write, printf etc. performance is benchmarked also - so not 
 clear
 if pnoise is super-fast but write is super-slow etc...

Indeed and using Windows (At least 8), the size of command-window (CMD) interferes in the result drastically... for example: running this test with console maximized will take: 2.58s while the same test but in small window: 2.11s!

Before I wrote the above, I briefly ran the benchmark on my local (OS X) machine, and verified that the bulk of the time is indeed spent in the noise calculation loop (with stdout piped into /dev/null). Still, the LDC-compiled code is only about half as fast as the Clang-compiled version, and there is no good reason why it should be. My new guess is a difference in inlining heuristics (note also that the Rust version uses inlining hints). The big difference between GCC and Clang might be a hint that the performance drop is caused by a rather minute difference in optimizer tuning. Thus, we really need somebody to sit down with a profiler/disassembler and figure out what is going on. David
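
As an illustration of what an inlining hint looks like on the D side, here is a minimal sketch. It assumes a compiler version that supports pragma(inline, true), and lerp is a hypothetical helper in the style of a noise kernel, not code taken from the benchmark.

```d
import std.stdio : writeln;

// Hypothetical helper; the pragma asks the compiler to always inline it.
pragma(inline, true)
float lerp(float a, float b, float v)
{
    return a * (1.0f - v) + b * v;
}

void main()
{
    writeln(lerp(0.0f, 10.0f, 0.5f));
}
```

Whether a hint like this closes the gap would still need to be confirmed with a profiler or disassembler, as suggested above.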
Jun 20 2014
prev sibling next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 6/20/14, 9:32 AM, Nick Treleaven wrote:
 Hi,
 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr


 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:

 https://github.com/nsf/pnoise#readme

 I initially wondered about std.random, but got this response:

 "Yeah, but std.random is not used in that benchmark, it just initializes
 256 random vectors and permutates 256 sequential integers. What spins in
 a loop is just plain FP math and array read/writes. I'm sure it can be
 done faster, maybe D compilers are bad at automatic inlining or
 something. "

 Obviously this is only one person's benchmark, but I wondered if people
 would like to check their code and suggest reasons for the speed deficit.

I just tried it with ldc and it's faster (faster than Go, slower than Ni. But this is still slower than other languages. And other languages keep the array bounds check on...
Jun 20 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version:

http://dpaste.dzfl.pl/8d2ff04b62d3

I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough).

Bye,
bearophile
Jun 20 2014
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 20.06.2014 17:09, schrieb bearophile:
 Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version: http://dpaste.dzfl.pl/8d2ff04b62d3 I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough). Bye, bearophile

it does not make sense to "optimize" this example more and more - it should be fast with the original version (except the missing finals on the virtuals)
Jun 20 2014
parent dennis luehring <dl.soluz gmx.net> writes:
Am 20.06.2014 22:44, schrieb bearophile:
 dennis luehring:

 it does not make sense to "optimize" this example more and
 more - it should be fast with the original version

But the original code is not fast. So someone has to find what's broken. I have shown part of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, also because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add ton of immutable/const, and annotations, because immutability is not the default), so a code fix is good. Bye, bearophile

as long as you find out it's a library thing - the C version, without any annotations and immutable/const, is the fastest. so what's the problem with D here? it can't (shouldn't) be that one needs to work/change that much on such simple code to reach C speed
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 http://dpaste.dzfl.pl/8d2ff04b62d3

Sorry for the awful tabs. Bye, bearophile
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
If I add this import in Noise2DContext.getGradients the run-time 
decreases a lot (I am now just two times slower than gcc with 
-Ofast):

import core.stdc.math: floor;
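
A minimal sketch of the kind of local import described here, for readers who want to try it. The function names are hypothetical; the only thing taken from the post is the import itself. A function-local import makes floor inside that scope refer to the C library function rather than Phobos's std.math.floor.

```d
import std.stdio : writeln;

float floorViaC(float x)
{
    // Local import: within this function, `floor` is the C library's
    // double floor(double).
    import core.stdc.math : floor;
    return cast(float) floor(x);
}

float floorViaPhobos(float x)
{
    // Same call site, but resolved to Phobos's implementation.
    import std.math : floor;
    return floor(x);
}

void main()
{
    writeln(floorViaC(-1.5f), " ", floorViaPhobos(-1.5f));
}
```

Both compute the same value; any speed difference depends on the compiler and library versions in use.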

Bye,
bearophile
Jun 20 2014
prev sibling next sibling parent "whassup" <Whasss yahoo.com> writes:
  GO BEAROPHILE YOU CAN DO IT

On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
 If I add this import in Noise2DContext.getGradients the 
 run-time decreases a lot (I am now just two times slower than 
 gcc with -Ofast):

 import core.stdc.math: floor;

 Bye,
 bearophile

Jun 20 2014
prev sibling next sibling parent "JR" <zorael gmail.com> writes:
On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
 If I add this import in Noise2DContext.getGradients the 
 run-time decreases a lot (I am now just two times slower than 
 gcc with -Ofast):

 import core.stdc.math: floor;

 Bye,
 bearophile

Was just about to post that if I cheat and replace usage of floor(x) with cast(float)cast(int)x, ldc2 is almost down to gcc speeds (119.6ms average over 100 full executions vs gcc 102.7ms). It stood out in the callgraph. Because profiling before optimizing.
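
A short sketch of why the cast is a "cheat": converting to int truncates toward zero, while floor rounds toward negative infinity, so the two only agree for non-negative inputs. The helper name truncCast is hypothetical.

```d
import std.math : floor;

// The replacement described above: truncates toward zero.
float truncCast(float x)
{
    return cast(float) cast(int) x;
}

void main()
{
    // Non-negative input: both give the same answer.
    assert(truncCast(2.7f) == 2.0f);
    assert(floor(2.7f) == 2.0f);

    // Negative input: truncation and floor differ.
    assert(truncCast(-2.7f) == -2.0f);
    assert(floor(-2.7f) == -3.0f);
}
```

So the substitution is only safe if the noise coordinates are known never to be negative (and to fit in an int).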
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
So this is the best so far version:

http://dpaste.dzfl.pl/8dae9b359f27

I don't show the version with manually inlined function.

(I have also seen that GCC generates on my cpu a little faster 
code if I don't use sse registers.)

Bye,
bearophile
Jun 20 2014
prev sibling next sibling parent "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
 So this is the best so far version:

 http://dpaste.dzfl.pl/8dae9b359f27

Just one note, with the last version of DMD:

dmd -O -noboundscheck -inline -release pnoise.d
pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'

Matheus.
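
The error comes from D's purity check: a function marked pure may only call other pure functions. One way to keep the annotation, sketched here under the assumption that Phobos's std.math.floor (which is declared pure) is an acceptable substitute, is shown below. cellOrigin is a hypothetical helper, not the benchmark's getGradients, and whether this floor is as fast as the C one is exactly the open question in this thread.

```d
import std.math : floor;

// Hypothetical pure helper: compiles because std.math.floor is pure.
float cellOrigin(float x) pure nothrow @safe
{
    return floor(x);
}

void main()
{
    assert(cellOrigin(3.25f) == 3.0f);
    assert(cellOrigin(-0.5f) == -1.0f);
}
```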
Jun 20 2014
prev sibling next sibling parent "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 18:29:35 UTC, Mattcoder wrote:
 On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
 So this is the best so far version:

 http://dpaste.dzfl.pl/8dae9b359f27

Just one note, with the last version of DMD:

dmd -O -noboundscheck -inline -release pnoise.d
pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'

Matheus.

Sorry, I forgot this: besides the error above, for which I'm currently using:

immutable float x0f = cast(int)x; //x.floor;
immutable float y0f = cast(int)y; //y.floor;

just to compile, your version here is twice as fast as the original one.

Matheus.
Jun 20 2014
prev sibling next sibling parent "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 18:32:22 UTC, dennis luehring wrote:
 it does not make sense to "optimize" this example more and 
 more - it should be fast with the original version (except the 
 missing finals on the virtuals)

Oh please, let him continue, I'm really learning a lot with these optimizations. Matheus.
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Mattcoder:

 Just one note, with the last version of DMD:

Yes, I know, at the top of the file I have specified it's for ldc2. Bye, bearophile
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Mattcoder:

 Beside the error above, which for now I'm using:

 immutable float x0f = cast(int)x; //x.floor;
 immutable float y0f = cast(int)y; //y.floor;

 Just to compile,

If you remove the calls to floor, you are avoiding the main problem to fix. Bye, bearophile
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
dennis luehring:

 it does not make sense to "optimize" this example more and 
 more - it should be fast with the original version

But the original code is not fast. So someone has to find what's broken. I have shown part of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, partly because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add a ton of immutable/const and annotations, because immutability is not the default), so a code fix is good. Bye, bearophile
Jun 20 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:

And a simple benchmark for D ranges/parallelism: http://www.reddit.com/r/programming/comments/28mub4/clash_of_the_lambdas_comparing_lambda_performance/ Bye, bearophile
Jun 20 2014
prev sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Saturday, 21 June 2014 at 05:00:25 UTC, dennis luehring wrote:
 Am 20.06.2014 22:44, schrieb bearophile:
 dennis luehring:

 it does not make sense to "optimize" this example more and
 more - it should be fast with the original version

But the original code is not fast. So someone has to find what's broken. I have shown part of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, also because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add ton of immutable/const, and annotations, because immutability is not the default), so a code fix is good. Bye, bearophile

as long as you find out it's a library thing - the C version, without any annotations and immutable/const, is the fastest. so what's the problem with D here? it can't (shouldn't) be that one needs to work/change that much on such simple code to reach C speed

bearophile's work is very valuable regardless of what the cause is, as it provides a pretty decent hint of what could be improved for anybody investigating the issue. This is not to say that we wouldn't need to fix our compilers (in end user terms, i.e. compiler + standard library) to make those examples fast – zero-cost abstractions are one of the main strengths of D. David
Jun 21 2014