digitalmars.D.learn - Finding large difference b/w execution time of c++ and D codes for
- Sparsh Mittal (105/105) Feb 12 2013 I am writing Julia sets program in C++ and D; exactly same way as
- Sparsh Mittal (1/1) Feb 12 2013 I am finding C++ code is much faster than D code.
- monarch_dodra (3/4) Feb 12 2013 dmd (AFAIK) is known to be slower. try LDC or GDC if speed is
- Dmitry Olshansky (9/10) Feb 12 2013 Seems like DMD's floating point issue. The issue being that it always
- Sparsh Mittal (3/3) Feb 12 2013 Pardon me, can you please point me to suitable reference or tell
- Sparsh Mittal (1/1) Feb 12 2013 OK. I found it.
- Dmitry Olshansky (21/24) Feb 12 2013 GDC, seems like its mostly "build from source" kind of thing.
- H. S. Teoh (18/28) Feb 12 2013 [...]
- Sparsh Mittal (1/1) Feb 12 2013 Thanks for your insights. It was very helpful.
- FG (10/11) Feb 12 2013 I had a look, but first had to make juliaValue global, because g++ had o...
- Sparsh Mittal (2/4) Feb 12 2013 Brilliant! Yes, that is why the time was coming out to be zero,
- FG (2/6) Feb 12 2013 LOL. For a while you thought that C++ could be that much faster than D? ...
- Sparsh Mittal (4/6) Feb 12 2013 I was stunned and shared it with others who could not find. It
- Rob T (10/21) Feb 12 2013 Well technically it was that much faster because it did optimize
- Marco Leise (78/78) Feb 13 2013 I like optimization challenges. This is an excellent test
- FG (2/9) Feb 13 2013 Why aren't r and i of type TReal?
- Marco Leise (7/17) Feb 13 2013 They are actual storage in memory, where every increase in
- Joseph Rushton Wakeling (2/4) Feb 13 2013 When I replaced with TReal, it sped things up for double.
- Marco Leise (8/13) Feb 13 2013 Give me that stuff, your northbridge is on!
- Marco Leise (15/20) Feb 13 2013 Oh this gets even better... I only added double as last step
- FG (2/5) Feb 13 2013 I'll play it safe and only bet my opDollar. :)
- Joseph Rushton Wakeling (3/6) Feb 13 2013 I have to say, it's not been my experience that using real improves spee...
- Marco Leise (6/13) Feb 13 2013 The target is Linux, AMD64 and the compiler arguments are:
- Joseph Rushton Wakeling (17/23) Feb 13 2013 Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-G...
- Marco Leise (18/38) Feb 13 2013 Ok, I get pretty much the same numbers as before with:
- Joseph Rushton Wakeling (11/25) Feb 13 2013 My experience has been that the higher -O values of ldc don't do much, b...
- FG (5/5) Feb 13 2013 Good point about choosing the right type of floating point numbers.
- Joseph Rushton Wakeling (2/7) Feb 13 2013 Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
- Marco Leise (95/100) Feb 13 2013 Yeah we are living in the 32-bit past ;)
- Joseph Rushton Wakeling (34/41) Feb 13 2013 Just to update on times. I was running another large job at the same ti...
- Marco Leise (26/71) Feb 13 2013 e as=20
- jerro (100/100) Feb 13 2013 When you are comparing LDC and GDC, you should either use
- Sparsh Mittal (1/1) Feb 14 2013 Thanks a lot for your reply.
- Joseph Rushton Wakeling (2/4) Feb 13 2013 ... try adding -frelease to the gdc call?
I am writing a Julia set program in C++ and D, written the same way as far as possible. On executing them I find a large difference in their execution times. Can you comment on what I am doing wrong, or is this expected?

//===============C++ code, compiled with -O3 ==============
#include <sys/time.h>
#include <iostream>
using namespace std;

const int DIM = 4194304;

struct complexClass {
    float r;
    float i;
    complexClass( float a, float b ) { r = a; i = b; }
    float squarePlusMag(complexClass another) {
        float r1 = r*r - i*i + another.r;
        float i1 = 2.0*i*r + another.i;
        r = r1;
        i = i1;
        return (r1*r1 + i1*i1);
    }
};

int juliaFunction( int x, int y )
{
    complexClass a(x, y);
    complexClass c(-0.8, 0.156);
    for (int i = 0; i < 200; i++) {
        if (a.squarePlusMag(c) > 1000)
            return 0;
    }
    return 1;
}

void kernel() {
    for (int x = 0; x < DIM; x++) {
        for (int y = 0; y < DIM; y++) {
            int offset = x + y * DIM;
            int juliaValue = juliaFunction( x, y );
            // juliaValue will be used by some function.
        }
    }
}

int main() {
    struct timeval start, end;
    gettimeofday(&start, NULL);
    kernel();
    gettimeofday(&end, NULL);
    float delta = ((end.tv_sec - start.tv_sec) * 1000000u
                   + end.tv_usec - start.tv_usec) / 1.e6;
    cout << " C++ code with dimension " << DIM << " Total time: " << delta << "[sec]\n";
}

//===============D code, compiled with -O -release -inline==============
import std.stdio;
import std.datetime;

immutable int DIM = 4194304;

struct complexClass {
    float r;
    float i;
    float squarePlusMag(complexClass another) {
        float r1 = r*r - i*i + another.r;
        float i1 = 2.0*i*r + another.i;
        r = r1;
        i = i1;
        return (r1*r1 + i1*i1);
    }
}

int juliaFunction( int x, int y )
{
    complexClass c = complexClass(0.8, 0.156);
    complexClass a = complexClass(x, y);
    for (int i = 0; i < 200; i++) {
        if (a.squarePlusMag(c) > 1000)
            return 0;
    }
    return 1;
}

void kernel() {
    for (int x = 0; x < DIM; x++) {
        for (int y = 0; y < DIM; y++) {
            int offset = x + y * DIM;
            int juliaValue = juliaFunction( x, y );
            // juliaValue will be used by some function.
        }
    }
}

void main() {
    StopWatch sw;
    sw.start();
    kernel();
    sw.stop();
    writeln(" D code serial with dimension ", DIM, " Total time: ", (sw.peek().msecs/1000), "[sec]");
}
//================

I would appreciate any help.
Feb 12 2013
I am finding C++ code is much faster than D code.
Feb 12 2013
On Tuesday, 12 February 2013 at 20:39:36 UTC, Sparsh Mittal wrote:
> I am finding C++ code is much faster than D code.

dmd (AFAIK) is known to be slower. Try LDC or GDC if speed is your major concern.
Feb 12 2013
13-Feb-2013 00:39, Sparsh Mittal wrote:
> I am finding C++ code is much faster than D code.

Seems like DMD's floating point issue. The issue being that it always works with floats as full-width reals + rounding. Basically, if nothing changed (and I doubt it changed), then DMD with floating point code is about two (or more) times slower than GDC/LDC. The cure is using the GDC/LDC compilers, as they are pretty stable and up to date on the front-end side these days.

-- 
Dmitry Olshansky
Feb 12 2013
Pardon me, can you please point me to a suitable reference, or just tell me the command here? Searching on Google, I could not find anything yet. Performance is my main concern.
Feb 12 2013
13-Feb-2013 01:09, Sparsh Mittal wrote:
> Pardon me, can you please point me to suitable reference or tell just
> command here. Searching on google, I could not find anything yet.
> Performance is my main concern.

GDC seems like it's mostly a "build from source" kind of thing. Moved to github: https://github.com/D-Programming-GDC (see also the newsgroup digitalmars.d.D.gnu).

GDC binaries for the Windows TDM-GCC toolchain are still available here:
https://bitbucket.org/goshawk/gdc/downloads
AFAIK it needs the 4.6.1 version of the TDM toolset.

LDC(2), recent release with binaries:
https://github.com/downloads/ldc-developers/ldc/ldc-0.10.0-src.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.xz
(See also the announce on the newsgroup digitalmars.d.D.ldc.)

Both compilers ship a dmd-style compiler driver, called gdmd or ldmd2. Speed is mostly what you'd expect of GCC and LLVM respectively.

-- 
Dmitry Olshansky
Feb 12 2013
On Wed, Feb 13, 2013 at 12:56:01AM +0400, Dmitry Olshansky wrote:
> Seems like DMD's floating point issue. The issue being that it always
> works with floats as full-width reals + rounding. Basically if nothing
> changed (and I doubt it changed) then DMD with floating point code is
> about two (or more) times slower then GDC/LDC. The cure is using
> GDC/LDC compiler as they are pretty stable and up to date on the
> front-end side these days.
[...]

I did a few benchmarks somewhat recently where I compared the performance of code produced by GDC with DMD. Code produced by GDC consistently outperforms code produced by DMD by about 20-30% or so. This is across the board, with floats, reals, and applications that don't do heavy arithmetic (just basic looping/recursion constructs).

I didn't investigate the cause of this difference in detail, but the last time I looked at the assembly code generated by both compilers, I noticed that GDC's optimizer is far more advanced than DMD's, esp. when it comes to loop unrolling, strength reduction, inlining, etc. For non-trivial code, GDC pretty much consistently produces superior code in general (not just in floating-point operations).

So if performance is a concern, I'd say definitely look into GDC or LDC instead of DMD.

T

-- 
Two wrongs don't make a right; but three rights do make a left...
Feb 12 2013
Thanks for your insights. It was very helpful.
Feb 12 2013
On 2013-02-12 21:39, Sparsh Mittal wrote:
> I am finding C++ code is much faster than D code.

I had a look, but first had to make juliaValue global, because g++ had optimized all the calculations away. :) Also changed DIM to 32 * 1024.

13.2s -- g++ -O3
16.0s -- g++ -O2
15.9s -- gdc -O3
15.9s -- gdc -O2
16.2s -- dmd -O -release -inline (v.2.060)

Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast. Interesting how gdc -O3 gave no extra boost vs. -O2.
Feb 12 2013
> I had a look, but first had to make juliaValue global, because g++ had
> optimized all the calculations away.

Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I put. Thank you very very much.
Feb 12 2013
On 2013-02-13 00:06, Sparsh Mittal wrote:
> Brilliant! Yes, that is why the time was coming out to be zero,
> regardless of what value of DIM I put. Thank you very very much.

LOL. For a while you thought that C++ could be that much faster than D? :D
Feb 12 2013
> LOL. For a while you thought that C++ could be that much faster than D? :D

I was stunned and shared it with others, who could not find the cause either. It was like a scientist discovering a phenomenon which goes against established laws. Good that I was wrong and the right person pointed it out.
Feb 12 2013
On Tuesday, 12 February 2013 at 23:31:17 UTC, FG wrote:
> LOL. For a while you thought that C++ could be that much faster than D? :D

Well, technically it was that much faster, because it did optimize away the useless calculations. It's not that C++ is faster than D or vice versa; it's that the two compilers did different optimizations, and in this case one of the optimizations that g++ did (removing redundancies) had a large effect on the outcome. It's entirely possible that DMD can still beat g++ under different circumstances.

--rt
Feb 12 2013
I like optimization challenges. This is an excellent test program to check the effect of different floating point types on intermediate values. Remember that when you store values in a float variable, the FPU actually has to round it down to that precision, store it in a 32-bit memory location, then load it back in and expand it - you _asked_ for that. I compiled with LDC2 and these are the results:

D code serial with dimension 32768 ...
  using floats  Total time: 13.399 [sec]
  using doubles Total time: 9.429 [sec]
  using reals   Total time: 8.909 [sec]  // <- !!!

You get both, 50% more speed and more precision! It is a win-win situation. Also take a look at Phobos' std.math that returns real everywhere.

Modified code:
---8<-------------------------------
module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;

enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
    struct ComplexStruct
    {
        float r;
        float i;

        TReal squarePlusMag(const ComplexStruct another)
        {
            TReal r1 = r*r - i*i + another.r;
            TReal i1 = 2.0*i*r + another.i;
            r = r1;
            i = i1;
            return (r1*r1 + i1*i1);
        }
    }

    int juliaFunction( int x, int y )
    {
        auto c = ComplexStruct(0.8, 0.156);
        auto a = ComplexStruct(x, y);
        foreach (i; 0 .. 200)
            if (a.squarePlusMag(c) > 1000) return 0;
        return 1;
    }

    void kernel()
    {
        foreach (x; 0 .. DIM) {
            foreach (y; 0 .. DIM) {
                juliaValue = juliaFunction( x, y );
            }
        }
    }
}

void main()
{
    writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
    StopWatch sw;
    foreach (Math; TypeTuple!(float, double, real))
    {
        sw.start();
        Julia!(Math).kernel();
        sw.stop();
        writefln("  using %ss  Total time: %s [sec]",
                 Math.stringof, (sw.peek().msecs * 0.001));
        sw.reset();
    }
}
------------------------------->8---

-- 
Marco
Feb 13 2013
On 2013-02-13 14:26, Marco Leise wrote:
> template Julia(TReal)
> {
>     struct ComplexStruct
>     {
>         float r;
>         float i;
> ...

Why aren't r and i of type TReal?
Feb 13 2013
Am Wed, 13 Feb 2013 14:44:36 +0100, FG <home fgda.pl> wrote:
> Why aren't r and i of type TReal?

They are actual storage in memory, where every increase in size hurts. And they cannot be optimized away, like temporary reals, which can be kept on the FPU stack.

-- 
Marco
Feb 13 2013
On 02/13/2013 03:29 PM, Marco Leise wrote:
> They are actual storage in memory, where every increase in size hurts.

When I replaced float with TReal, it sped things up for double.
Feb 13 2013
Am Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:
> When I replaced with TReal, it sped things up for double.

Give me that stuff your northbridge is on! But I still want to rule out the LLVM version, since GDC seems to produce code with similar runtime on both our systems, but LDC2 diverges so much.

-- 
Marco
Feb 13 2013
Am Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:
> When I replaced with TReal, it sped things up for double.

Oh, this gets even better... I only added double as a last step to that code, so I didn't notice this effect. Looks like we've got:
- CPUs that are good at converting to double
- 64-bit, so the size of a double matches
- only 16 bytes of memory in total

With double struct fields the 'double' case gains 50% speed for me, making it the overall fastest now (on LDC). I'd still bet a dollar that with an array of values floats would outperform doubles, when cache misses happen. (E.g. more or less random memory access.)

-- 
Marco
Feb 13 2013
On 2013-02-13 16:26, Marco Leise wrote:
> I'd still bet a dollar that with an array of values floats would
> outperform doubles, when cache misses happen. (E.g. more or less
> random memory access.)

I'll play it safe and only bet my opDollar. :)
Feb 13 2013
On 02/13/2013 02:26 PM, Marco Leise wrote:
> You get both, 50% more speed and more precision! It is a win-win
> situation. Also take a look at Phobos' std.math that returns real
> everywhere.

I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?
Feb 13 2013
Am Wed, 13 Feb 2013 14:48:21 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:
> I have to say, it's not been my experience that using real improves
> speed. Exactly what optimizations are you using when compiling?

The target is Linux, AMD64, and the compiler arguments are:

ldc2 -O5 -check-printf-calls -fdata-sections -ffunction-sections -release -singleobj -strip-debug -wi -L=--gc-sections -L=-s

-- 
Marco
Feb 13 2013
On 02/13/2013 02:26 PM, Marco Leise wrote:
> I compiled with LDC2 and these are the results:
> D code serial with dimension 32768 ...
>   using floats  Total time: 13.399 [sec]
>   using doubles Total time: 9.429 [sec]
>   using reals   Total time: 8.909 [sec]  // <- !!!
> You get both, 50% more speed and more precision!

Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub LDC, LLVM 3.2:

D code serial with dimension 32768 ...
  using floats  Total time: 4.751 [sec]
  using doubles Total time: 4.362 [sec]
  using reals   Total time: 5.95 [sec]

Using double is indeed marginally faster than float, but real is slower than both. What's disturbing is that when compiled instead with gdmd -O -inline -release, the code is dramatically slower:

D code serial with dimension 32768 ...
  using floats  Total time: 22.108 [sec]
  using doubles Total time: 21.203 [sec]
  using reals   Total time: 23.717 [sec]

It's the first time I've encountered such a dramatic difference between GDC and LDC, and I'm wondering whether it's down to a bug or some change between D releases 2.060 and 2.061.
Feb 13 2013
Am Wed, 13 Feb 2013 15:00:21 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:
> Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest
> from-GitHub LDC, LLVM 3.2:
> D code serial with dimension 32768 ...
>   using floats  Total time: 4.751 [sec]
>   using doubles Total time: 4.362 [sec]
>   using reals   Total time: 5.95 [sec]

Ok, I get pretty much the same numbers as before with:

ldmd2 -O -inline -release

It's even a bit faster than my loooong command line. Do these numbers tell us that there are such huge differences in the handling of floating point values between different AMD64 CPUs? I can't quite make a rhyme of it yet. What version of LLVM are you using? Mine is 3.1; 3.0 is the minimum and 3.2 is recommended for LDC2.

> It's the first time I've encountered such a dramatic difference
> between GDC and LDC, and I'm wondering whether it's down to a bug or
> some change between D releases 2.060 and 2.061.

_THAT_ I can reproduce with GDC!:

D code serial with dimension 32768 ...
  using floats  Total time: 24.415 [sec]
  using doubles Total time: 23.268 [sec]
  using reals   Total time: 25.168 [sec]

It's the exact same pattern.

-- 
Marco
Feb 13 2013
On 02/13/2013 03:56 PM, Marco Leise wrote:
> Ok, I get pretty much the same numbers as before with:
> ldmd2 -O -inline -release
> It's even a bit faster than my loooong command line.

My experience has been that the higher -O values of ldc don't do much, but of course, that's going to vary depending on your code. I think above -O3 it's all link-time, no?

> Do these numbers tell us, that there are such huge differences in the
> handling of floating point value between different AMD64 CPUs? I can't
> quite make a rhyme of it yet.

AMD vs Intel might make a difference (my machine is an i7).

> What version of LLVM are you using, mine is 3.1. 3.0 is minimum and
> 3.2 is recommended for LDC2.

LLVM 3.2.

> _THAT_ I can reproduce with GDC! :
> D code serial with dimension 32768 ...
>   using floats  Total time: 24.415 [sec]
>   using doubles Total time: 23.268 [sec]
>   using reals   Total time: 25.168 [sec]
> It's the exact same pattern.

I've never, EVER had ldc-compiled code run four times faster than GDC-compiled code. In fact, I don't think I've ever had LDC-compiled code run faster than GDC-compiled code at all, except where the choice of optimizations was different. That's what makes me concerned that there's some kind of bug in play here ....
Feb 13 2013
Good point about choosing the right type of floating point numbers. Conclusion: when there's enough space, always pick double over float. Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s. I thought to myself: cool, I almost beat the 13.4s I got with C++, until I changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Feb 13 2013
On 02/13/2013 04:17 PM, FG wrote:
> Good point about choosing the right type of floating point numbers.
> Conclusion: when there's enough space, always pick double over float.
> Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
> I thought to myself: cool, I almost beat the 13.4s I got with C++, until
> I changed the C++ code to also use doubles and... got a massive speedup: 7.1s!

Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
Feb 13 2013
Am Wed, 13 Feb 2013 16:17:12 +0100, FG <home fgda.pl> wrote:
> Good point about choosing the right type of floating point numbers.
> Conclusion: when there's enough space, always pick double over float.
> Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
> I thought to myself: cool, I almost beat the 13.4s I got with C++, until
> I changed the C++ code to also use doubles and... got a massive speedup: 7.1s!

Yeah, we are living in the 32-bit past ;) Still, be aware that we only write to 2 memory locations in that program! We have neither exceeded the L1 cache size with that, nor have we put any strain on the prefetcher and memory bandwidth. With the modification below it is more clear why I said "use float for storage". The result with LDC2 for me is:

D code serial with dimension 8192 ...
  using floats  Total time: 4.235 [sec]
  using doubles Total time: 5.58 [sec]   // ~+32% over float
  using reals   Total time: 6.432 [sec]

So all the in-CPU performance gain from using doubles is more than lost when you run out of bandwidth.

---8<-----------------------------------
module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;
import std.random;
import core.stdc.stdlib;

enum DIM = 8 * 1024;

int juliaValue;
size_t* randomAcc;

static this()
{
    randomAcc = cast(size_t*) malloc((DIM * DIM + 200) * size_t.sizeof);
    foreach (i; 0 .. DIM * DIM)
        randomAcc[i] = i;
    randomAcc[0 .. DIM * DIM].randomShuffle();
    randomAcc[DIM * DIM .. DIM * DIM + 200] = randomAcc[0 .. 200];
}

static ~this()
{
    free(randomAcc);
}

template Julia(TReal)
{
    TReal* squares;

    static this()
    {
        squares = cast(TReal*) malloc(DIM * DIM * TReal.sizeof);
    }

    static ~this()
    {
        free(squares);
    }

    struct ComplexStruct
    {
        TReal r;
        TReal i;

        TReal squarePlusMag(const ComplexStruct another)
        {
            TReal r1 = r*r - i*i + another.r;
            TReal i1 = 2.0*i*r + another.i;
            r = r1;
            i = i1;
            return (r1*r1 + i1*i1);
        }
    }

    int juliaFunction( int x, int y )
    {
        auto c = ComplexStruct(0.8, 0.156);
        auto a = ComplexStruct(x, y);
        foreach (i; 0 .. 200) {
            size_t idx = randomAcc[DIM * x + y + i];
            squares[idx] = a.squarePlusMag(c);
            if (squares[idx] > 1000) return 0;
        }
        return 1;
    }

    void kernel()
    {
        foreach (x; 0 .. DIM) {
            foreach (y; 0 .. DIM) {
                juliaValue = juliaFunction( x, y );
            }
        }
    }
}

void main()
{
    writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
    StopWatch sw;
    foreach (Math; TypeTuple!(float, double, real))
    {
        sw.start();
        Julia!(Math).kernel();
        sw.stop();
        writefln("  using %ss  Total time: %s [sec]",
                 Math.stringof, (sw.peek().msecs * 0.001));
        sw.reset();
    }
}
------------------------------->8---

-- 
Marco
Feb 13 2013
On 02/13/2013 04:41 PM, Joseph Rushton Wakeling wrote:
> Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).

Just to update on times. I was running another large job at the same time as doing all these tests, so there was some slowdown. Current results are:

-- with g++ -O3 and using double rather than float: about 4.3 s

-- with clang++ -O3 and using double rather than float: about 3.1 s

-- with gdmd -O -release -inline:
D code serial with dimension 32768 ...
  using floats  Total time: 17.179 [sec], Julia value: 0
  using doubles Total time: 10.298 [sec], Julia value: 0
  using reals   Total time: 17.126 [sec], Julia value: 0

-- with ldmd2 -O -release -inline:
D code serial with dimension 32768 ...
  using floats  Total time: 3.548 [sec], Julia value: 0
  using doubles Total time: 2.708 [sec], Julia value: 0
  using reals   Total time: 4.371 [sec], Julia value: 0

-- with dmd -O -release -inline:
D code serial with dimension 32768 ...
  using floats  Total time: 15.696 [sec], Julia value: 0
  using doubles Total time: 7.233 [sec], Julia value: 0
  using reals   Total time: 28.71 [sec], Julia value: 0

You'll note that I added a writeout of the global juliaValue in order to check that certain calculations weren't being optimized away.

It's striking that in this case GDC is slower not only than LDC but also DMD. Current GDC is based off 2.060 as far as I know, whereas current LDC has upgraded to 2.061, so are there some changes between D 2.060 and 2.061 that could explain this?

It's also interesting that clang++ produces a faster executable than g++, but it's not possible to make a direct LLVM vs GCC comparison here, as g++ is GCC 4.7.2 whereas GDC is based off a GCC snapshot. My guess would be that it's some combination of LLVM superiority in a particular case here, together with some 2.060 --> 2.061 change.

Are these results comparable to what other people are getting? I can confirm that where code of mine is concerned, GDC still seems to have the edge in terms of executable speed ...
Feb 13 2013
Am Wed, 13 Feb 2013 18:10:47 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:
> Just to update on times. I was running another large job at the same
> time as doing all these tests, so there was some slowdown.
[...]
> It's striking that in this case GDC is slower not only than LDC but
> also DMD.

??? Anyways, I upgraded to LLVM 3.2 - no change. You have an i7, I have a Core2. It would be really interesting to know what LDC does there, since GDC's output seems rather CPU agnostic, while LDC's output is better in every case but also exhibits system-specific differences harsher than I would ever have imagined possible. Should Intel have changed their CPU design so radically?

> It's also interesting that clang++ produces a faster executable than
> g++, but it's not possible to make a direct LLVM vs GCC comparison
> here, as g++ is GCC 4.7.2 whereas GDC is based off a GCC snapshot.

I've compiled GDC based on the same source that the Gentoo package manager built g++ 4.7.2 from, and I get similar numbers.

> I can confirm that where code of mine is concerned, GDC still seems to
> have the edge in terms of executable speed ...

I've seen a tête-à-tête between LDC and GDC in some of my code.

-- 
Marco
Feb 13 2013
When you are comparing LDC and GDC, you should either use -mcpu=generic for LDC or -march=native for GDC, because their default targets are different. GDC will by default produce code that works on most x86_64 CPUs (if you are on an x86_64 system), and LDC targets the host CPU. But this does not explain the difference in timings you are seeing here.

One reason why the code generated by GDC is slower is that squarePlusMag isn't inlined. It seems that the fact that its parameter is const is somehow preventing it from being inlined - I have no idea why. Removing const and adding -march=native to the gdc flags gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
  using floats  Total time: 8.283 [sec]
  using doubles Total time: 6.827 [sec]
  using reals   Total time: 6.795 [sec]

ldc2 -O3 -release -singleobj tmp.d -oftmp:
  using floats  Total time: 3.348 [sec]
  using doubles Total time: 3.08 [sec]
  using reals   Total time: 4.174 [sec]

The difference is smaller, but still pretty large. I have noticed that there are needless conversions in this code that are slowing down both GDC generated and LDC generated code. This code is a bit faster:

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;

enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
    struct ComplexStruct
    {
        TReal r;
        TReal i;

        TReal squarePlusMag(ComplexStruct another)
        {
            TReal r1 = r*r - i*i + another.r;
            TReal i1 = cast(TReal)2.0*i*r + another.i;
            r = r1;
            i = i1;
            return (r1*r1 + i1*i1);
        }
    }

    int juliaFunction( int x, int y )
    {
        auto c = ComplexStruct(0.8, 0.156);
        auto a = ComplexStruct(x, y);
        foreach (i; 0 .. 200)
            if (a.squarePlusMag(c) > cast(TReal) 1000) return 0;
        return 1;
    }

    void kernel()
    {
        foreach (x; 0 .. DIM) {
            foreach (y; 0 .. DIM) {
                juliaValue = juliaFunction( x, y );
            }
        }
    }
}

void main()
{
    writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
    StopWatch sw;
    foreach (Math; TypeTuple!(float, double, real))
    {
        sw.start();
        Julia!(Math).kernel();
        sw.stop();
        writefln("  using %ss  Total time: %s [sec]",
                 Math.stringof, (sw.peek().msecs * 0.001));
        sw.reset();
    }
}

This gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
  using floats  Total time: 6.746 [sec]
  using doubles Total time: 6.872 [sec]
  using reals   Total time: 5.226 [sec]

ldc2 -O3 -release -singleobj tmp.d -oftmp:
  using floats  Total time: 2.36 [sec]
  using doubles Total time: 2.535 [sec]
  using reals   Total time: 4.106 [sec]

At least part of the difference is due to the fact that juliaFunction still isn't getting inlined (but squarePlusMag is). Making juliaFunction a static method of ComplexStruct causes it to get inlined (again, I have no idea why). Moving juliaFunction inside ComplexStruct does not affect the performance of LDC generated code, but for GDC it gives me:

  using floats  Total time: 4.262 [sec]
  using doubles Total time: 4.251 [sec]
  using reals   Total time: 3.512 [sec]

There is still a large difference between LDC and GDC for floats and doubles and I can't explain it. But at least it is much smaller than it was initially. I ran all the benchmarks on 64-bit Linux, using a Core i5 2500K.
Feb 13 2013
On 02/12/2013 11:17 PM, FG wrote:
> Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast.
> Interesting how gdc -O3 gave no extra boost vs. -O2.

... try adding -frelease to the gdc call?
Feb 13 2013