
digitalmars.D - Perlin noise benchmark speed

reply Nick Treleaven <ntrel-public yahoo.co.uk> writes:
Hi,
A Perlin noise benchmark was quoted in this reddit thread:

http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

It apparently shows the 3 main D compilers producing slower code than 
Go, Rust, gcc, clang, Nimrod:

https://github.com/nsf/pnoise#readme

I initially wondered about std.random, but got this response:

"Yeah, but std.random is not used in that benchmark, it just initializes 
256 random vectors and permutates 256 sequential integers. What spins in 
a loop is just plain FP math and array read/writes. I'm sure it can be 
done faster, maybe D compilers are bad at automatic inlining or something. "

Obviously this is only one person's benchmark, but I wondered if people 
would like to check their code and suggest reasons for the speed deficit.
Jun 20 2014
next sibling parent reply Nick Treleaven <ntrel-public yahoo.co.uk> writes:
On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:
Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
Jun 20 2014
next sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower 
 code than
 Go, Rust, gcc, clang, Nimrod:
 Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

-release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, which makes the calls to get() virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely.

David
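
The point about virtual calls can be sketched as follows. This is only an illustration, not the benchmark's code: the three types and their trivial get() bodies are hypothetical stand-ins for Noise2DContext, and whether the change matters for speed depends on the compiler.

```d
import std.stdio : writeln;

// Hypothetical stand-ins for the benchmark's Noise2DContext; only the
// dispatch behaviour matters here, not the noise math.
class VirtualContext
{
    // Non-final, non-private class methods are virtual by default in D.
    float get(float x, float y) { return x + y; }
}

class FinalContext
{
    // A final method cannot be overridden, so calls to it can be bound
    // statically and become candidates for inlining.
    final float get(float x, float y) { return x + y; }
}

struct StructContext
{
    // Struct methods are never virtual.
    float get(float x, float y) { return x + y; }
}

void main()
{
    // Ask the compiler how it classifies each class method.
    static assert( __traits(isVirtualMethod, VirtualContext.get));
    static assert(!__traits(isVirtualMethod, FinalContext.get));

    auto v = new VirtualContext;
    auto f = new FinalContext;
    StructContext s;
    writeln(v.get(1, 2), " ", f.get(1, 2), " ", s.get(1, 2));
}
```

All three variants compute the same value; the only difference is how the call is dispatched.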
Jun 20 2014
next sibling parent "MrSmith" <mrsmith33 yandex.ru> writes:
On Friday, 20 June 2014 at 12:56:46 UTC, David Nadlinger wrote:
 On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower 
 code than
 Go, Rust, gcc, clang, Nimrod:
 Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
 -release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, which makes the calls to get() virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David
struct can be used instead of class
Jun 20 2014
prev sibling parent Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/20/2014 02:56 PM, David Nadlinger via Digitalmars-d wrote:
 On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:
 Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
 -release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, which makes the calls to get() virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David
I converted Noise2DContext into a struct; I'm going to add some more to my patch.
Jun 20 2014
prev sibling parent reply Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/20/2014 02:34 PM, Nick Treleaven via Digitalmars-d wrote:
 On 20/06/2014 13:32, Nick Treleaven wrote:
 It apparently shows the 3 main D compilers producing slower code than
 Go, Rust, gcc, clang, Nimrod:
 Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
I added some final, pure and @safe stuff.
Jun 20 2014
parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 13:20:16 UTC, Robert Schadek via 
Digitalmars-d wrote:
 I added some final, pure and @safe stuff
Thanks. As a general comment, I'd be careful with suggesting the use of pure/@safe/… for performance improvements in microbenchmarks. While it is certainly good D style to use them wherever possible, it might lead people less familiar with D to believe that fast D code needs a lot of annotations.

David
Jun 20 2014
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
The performance of write, printf etc. is benchmarked as well, so it is not clear whether pnoise is super-fast and write is super-slow, etc.
Jun 20 2014
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
On 20.06.2014 15:14, dennis luehring wrote:
 write, printf etc. performance is benchmarked also - so not clear if pnoise is super-fast but write is super-slow etc...
Using perf with 10 runs is maybe too few to give a good average, and runtime startup etc. is measured as well, so it is not clear what is slower.

These benchmarks should be separated into 3 parts:

- runtime startup
- pure pnoise
- result output (needed only once for verification; returning a dummy output would be a better fit for testing the pnoise speed)

Are array bounds checks active?
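
One way to measure the "pure pnoise" part on its own is to time only the computation loop inside the program. The following is a rough sketch under assumptions: kernel() is a made-up stand-in for the noise function, and the image size is arbitrary. It uses StopWatch from std.datetime.stopwatch, which is where current Phobos keeps it.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical stand-in for the noise kernel; the real benchmark would
// call its Noise2DContext here.
float kernel(float x, float y)
{
    return x * 0.5f + y * 0.25f;
}

void main()
{
    enum size = 256;
    auto pixels = new float[size * size];

    // Time only the computation: runtime startup has already happened
    // and no output has been written yet.
    auto sw = StopWatch(AutoStart.yes);
    foreach (y; 0 .. size)
        foreach (x; 0 .. size)
            pixels[y * size + x] = kernel(x * 0.1f, y * 0.1f);
    sw.stop();

    // Reduce the result to one number so the work is actually used,
    // and print it outside the timed region.
    double checksum = 0;
    foreach (p; pixels)
        checksum += p;

    writefln("kernel: %s us, checksum: %s", sw.peek.total!"usecs", checksum);
}
```

Startup cost and output cost can then be estimated separately, for example by comparing against the wall-clock time of the whole process.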
Jun 20 2014
prev sibling parent reply "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
 write, printf etc. performance is benchmarked also - so not 
 clear
 if pnoise is super-fast but write is super-slow etc...
Indeed. On Windows (at least 8), the size of the command window (CMD) affects the result drastically. For example, running this test with the console maximized takes 2.58s, while the same test in a small window takes 2.11s!

Matheus.
Jun 20 2014
parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 20 June 2014 at 13:46:26 UTC, Mattcoder wrote:
 On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
 write, printf etc. performance is benchmarked also - so not 
 clear
 if pnoise is super-fast but write is super-slow etc...
 Indeed and using Windows (At least 8), the size of command-window (CMD) interferes in the result drastically... for example: running this test with console maximized will take: 2.58s while the same test but in small window: 2.11s!
Before I wrote the above, I briefly ran the benchmark on my local (OS X) machine, and verified that the bulk of the time is indeed spent in the noise calculation loop (with stdout piped into /dev/null). Still, the LDC-compiled code is only about half as fast as the Clang-compiled version, and there is no good reason why it should be.

My new guess is a difference in inlining heuristics (note also that the Rust version uses inlining hints). The big difference between GCC and Clang might be a hint that the performance drop is caused by a rather minute difference in optimizer tuning.

Thus, we really need somebody to sit down with a profiler/disassembler and figure out what is going on.

David
Jun 20 2014
prev sibling next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
I just tried it with ldc and it's faster (faster than Go, slower than Nimrod). But this is still slower than other languages. And other languages keep the array bounds check on...
Jun 20 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version:

http://dpaste.dzfl.pl/8d2ff04b62d3

I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough).

Bye,
bearophile
Jun 20 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 http://dpaste.dzfl.pl/8d2ff04b62d3
Sorry for the awful tabs. Bye, bearophile
Jun 20 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
So this is the best version so far:

http://dpaste.dzfl.pl/8dae9b359f27

I don't show the version with manually inlined function.

(I have also seen that GCC generates on my cpu a little faster 
code if I don't use sse registers.)

Bye,
bearophile
Jun 20 2014
parent reply "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
 So this is the best so far version:

 http://dpaste.dzfl.pl/8dae9b359f27
Just one note, with the latest version of DMD:

dmd -O -noboundscheck -inline -release pnoise.d
pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'

Matheus.
Jun 20 2014
next sibling parent reply "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 18:29:35 UTC, Mattcoder wrote:
 On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
 So this is the best so far version:

 http://dpaste.dzfl.pl/8dae9b359f27
 Just one note, with the last version of DMD:
 dmd -O -noboundscheck -inline -release pnoise.d
 pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
 pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
 Matheus.
Sorry, I forgot this:

Besides the error above, which for now I am working around with:

immutable float x0f = cast(int)x; //x.floor;
immutable float y0f = cast(int)y; //y.floor;

just to compile, your version here is twice as fast as the original one.

Matheus.
Jun 20 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Mattcoder:

 Beside the error above, which for now I'm using:

 immutable float x0f = cast(int)x; //x.floor;
 immutable float y0f = cast(int)y; //y.floor;

 Just to compile,
If you remove the calls to floor, you are avoiding the main problem to fix.

Bye,
bearophile
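
Apart from hiding the slow floor, the cast is also not a drop-in replacement, because it rounds toward zero instead of toward negative infinity. A minimal sketch of the difference (the helper name truncate is made up for this example; whether the benchmark ever passes negative coordinates is not checked here):

```d
import std.math : floor;
import std.stdio : writefln;

// The workaround from the post above: a cast rounds toward zero.
float truncate(float x)
{
    return cast(float) cast(int) x;
}

void main()
{
    foreach (x; [1.75f, 0.25f, -0.25f, -1.75f])
        writefln("x = %5.2f  floor = %5.2f  cast = %5.2f", x, floor(x), truncate(x));

    // Identical for non-negative inputs ...
    assert(floor(1.75f) == truncate(1.75f));
    // ... but different for negative non-integers.
    assert(floor(-0.25f) == -1.0f);
    assert(truncate(-0.25f) == 0.0f);
}
```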
Jun 20 2014
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Mattcoder:

 Just one note, with the last version of DMD:
Yes, I know, at the top of the file I have specified it's for ldc2. Bye, bearophile
Jun 20 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
If I add this import in Noise2DContext.getGradients the run-time 
decreases a lot (I am now just two times slower than gcc with 
-Ofast):

import core.stdc.math: floor;

Bye,
bearophile
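
For readers unfamiliar with scoped imports, here is a small sketch of what such a function-local import does. The two function names are hypothetical stand-ins for Noise2DContext.getGradients; the static asserts only show which floor the name resolves to, and say nothing about speed.

```d
import std.math; // module-level import: provides the std.math.floor overloads

// Hypothetical stand-in for Noise2DContext.getGradients.
double cellOrigin(float x)
{
    // The function-local selective import takes precedence inside this
    // scope, so `floor` here is the C library's double floor(double).
    import core.stdc.math : floor;
    static assert(is(typeof(floor(x)) == double));
    return floor(x);
}

float cellOriginPhobos(float x)
{
    // Without the local import, std.math.floor's float overload is chosen.
    static assert(is(typeof(floor(x)) == float));
    return floor(x);
}

void main()
{
    assert(cellOrigin(2.5f) == 2.0);
    assert(cellOriginPhobos(2.5f) == 2.0f);
}
```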
Jun 20 2014
next sibling parent "whassup" <Whasss yahoo.com> writes:
  GO BEAROPHILE YOU CAN DO IT

On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
 If I add this import in Noise2DContext.getGradients the 
 run-time decreases a lot (I am now just two times slower than 
 gcc with -Ofast):

 import core.stdc.math: floor;

 Bye,
 bearophile
Jun 20 2014
prev sibling parent "JR" <zorael gmail.com> writes:
On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
 If I add this import in Noise2DContext.getGradients the 
 run-time decreases a lot (I am now just two times slower than 
 gcc with -Ofast):

 import core.stdc.math: floor;

 Bye,
 bearophile
Was just about to post that if I cheat and replace usage of floor(x) with cast(float)cast(int)x, ldc2 is almost down to gcc speeds (119.6ms average over 100 full executions vs gcc 102.7ms). It stood out in the callgraph. Because profiling before optimizing.
Jun 20 2014
prev sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 20.06.2014 17:09, bearophile wrote:
 Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:

 http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
 This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version: http://dpaste.dzfl.pl/8d2ff04b62d3 I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough). Bye, bearophile
it does not make sense to "optimize" this example more and more - it should be fast with the original version (except for the missing finals on the virtuals)
Jun 20 2014
next sibling parent "Mattcoder" <fromtheotherside mail.com> writes:
On Friday, 20 June 2014 at 18:32:22 UTC, dennis luehring wrote:
 it does not make sense to "optimize" this example more and 
 more - it should be fast with the original version (except the 
 missing finals on the virtuals)
Oh please, let him continue, I'm really learning a lot with these optimizations. Matheus.
Jun 20 2014
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
dennis luehring:

 it does not make sense to "optimize" this example more and 
 more - it should be fast with the original version
But the original code is not fast. So someone has to find what's broken. I have shown part of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, partly because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add a ton of immutable/const and annotations, because immutability is not the default), so a code fix is good.

Bye,
bearophile
Jun 20 2014
parent reply dennis luehring <dl.soluz gmx.net> writes:
On 20.06.2014 22:44, bearophile wrote:
 dennis luehring:

 it does not make sense to "optimize" this example more and
 more - it should be fast with the original version
 But the original code is not fast. So someone has to find what's broken. I have shown part of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, also because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add ton of immutable/const, and annotations, because immutability is not the default), so a code fix is good. Bye, bearophile
as long as you find out it's a library thing, OK. but the C version is the fastest without any annotations or immutable/const - so what's the problem with D here? it can't (shouldn't) be that one needs to work on/change such simple code that much to reach C speed
Jun 20 2014
parent "David Nadlinger" <code klickverbot.at> writes:
On Saturday, 21 June 2014 at 05:00:25 UTC, dennis luehring wrote:
 as long as you find out its a library thing the c version is without any annotations and immutable/const the fastest - so whats the problem with D here, it can't(shouln't) be that one needs to work/change that much on such simple code to reach c speed
bearophile's work is very valuable regardless of what the cause is, as it provides a pretty decent hint of what could be improved for anybody investigating the issue.

This is not to say that we wouldn't need to fix our compilers (in end user terms, i.e. compiler + standard library) to make those examples fast – zero-cost abstractions are one of the main strengths of D.

David
Jun 21 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Nick Treleaven:

 A Perlin noise benchmark was quoted in this reddit thread:
And a simple benchmark for D ranges/parallelism:

http://www.reddit.com/r/programming/comments/28mub4/clash_of_the_lambdas_comparing_lambda_performance/

Bye,
bearophile
Jun 20 2014
prev sibling parent reply "weaselcat" <weaselcat gmail.com> writes:
I saw this thread when searching for something on the site. It's been a few months since anyone posted, so here's an update: I fixed the D flags, and gdc is now about 15% faster than the second fastest entry in the benchmark (C with gcc), which obviously puts D in first place.

Some notes:

LDC is missing _tons_ of inlining opportunities, which kills it in comparison to GDC. I think GDC inlined pretty much everything. LDC is about 50% slower.

Also, AFAICT there's no fast-math switch for LDC (enabling this for GDC might actually be compromising it though : ) ).

I think LDC turns the floor in std.math into the same as the stdc one, but GDC does not. std.math.floor is still abysmally slow; I thought it was because it was still using reals, but that does not seem to be the case. GDC slows to a crawl (10-20x slower) if you replace the stdc floor with the one in std.math (just remove the alias).

I thought this might be interesting to someone (i.e. LDC/GDC folks or Phobos math folks).

bye.
Mar 23 2015
parent Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
I'd suspect stdc.math to be SSE3/SSE4-optimised assembly, whereas std.math
uses a very generic (works on almost every float format) implementation
that is at least 'pure'.

Iain.
Mar 23 2015