
digitalmars.D.learn - Finding large difference b/w execution time of c++ and D codes for

reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
I am writing a Julia sets program in C++ and D, in exactly the same 
way as far as possible. On executing them, I find a large difference 
in their execution times. Can you comment on what I am doing wrong, 
or is this expected?


//===============C++ code, compiled with -O3 ==============
#include <sys/time.h>
#include <iostream>
using namespace std;
const  int DIM= 4194304;

struct complexClass {
   float r;
   float i;
   complexClass( float a, float b )
   {
     r = a;
     i = b;
   }


   float squarePlusMag(complexClass another)
   {
     float r1 = r*r - i*i + another.r;
     float i1 = 2.0*i*r + another.i;

     r = r1;
     i = i1;

     return (r1*r1+ i1*i1);
   }
};


int juliaFunction( int x, int y )
{

   complexClass a (x,y);

    complexClass c(-0.8, 0.156);

   int i = 0;

   for (i=0; i<200; i++) {
    if( a.squarePlusMag(c) > 1000)
       return 0;
   }

   return 1;
}


void kernel(  ){
   for (int x=0; x<DIM; x++) {
     for (int y=0; y<DIM; y++) {
       int offset = x + y * DIM;
       int juliaValue = juliaFunction( x, y );
	//juliaValue will be used by some function.
     }
   }
}


int main()
{

   struct timeval start, end;
   gettimeofday(&start, NULL);
   kernel();
   gettimeofday(&end, NULL);
   float delta = ((end.tv_sec  - start.tv_sec) * 1000000u +
          end.tv_usec - start.tv_usec) / 1.e6;


   cout<<" C++ code with dimension " << DIM <<" Total time: "<< 
delta << "[sec]\n";
}






//=====================D code, compiled with -O -release -inline=========

#!/usr/bin/env rdmd
import std.stdio;
import std.datetime;
immutable int DIM= 4194304;


struct complexClass {
   float r;
   float i;

   float squarePlusMag(complexClass another)
   {
     float r1 = r*r - i*i + another.r;
     float i1 = 2.0*i*r + another.i;

     r = r1;
     i = i1;

     return (r1*r1+ i1*i1);
   }
}


int juliaFunction( int x, int y )
{

   complexClass c = complexClass(-0.8, 0.156);
   complexClass a = complexClass(x, y);


   for (int i=0; i<200; i++) {

     if( a.squarePlusMag(c) > 1000)
       return 0;
   }
   return 1;
}


void kernel(  ){
   for (int x=0; x<DIM; x++) {
     for (int y=0; y<DIM; y++) {
       int offset = x + y * DIM;
       int juliaValue = juliaFunction( x, y );
       //juliaValue will be used by some function.	
     }
   }
}


void main()
{
   StopWatch sw;
   sw.start();
   kernel();
   sw.stop();
   writeln(" D code serial with dimension ", DIM, " Total time: ",
       (sw.peek().msecs / 1000.0), "[sec]");
}

//================
I would appreciate any help.
Feb 12 2013
parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
I am finding C++ code is much faster than D code.
Feb 12 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 20:39:36 UTC, Sparsh Mittal wrote:
 I am finding C++ code is much faster than D code.
dmd (AFAIK) is known to be slower. Try LDC or GDC if speed is your major concern.
Feb 12 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Feb-2013 00:39, Sparsh Mittal writes:
 I am finding C++ code is much faster than D code.
Seems like DMD's floating point issue. The issue being that it always works with floats as full-width reals + rounding.

Basically, if nothing changed (and I doubt it changed), then DMD with floating point code is about two (or more) times slower than GDC/LDC. The cure is using the GDC or LDC compiler, as they are pretty stable and up to date on the front-end side these days.

-- 
Dmitry Olshansky
Feb 12 2013
next sibling parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Pardon me, can you please point me to a suitable reference or just 
tell me the command here? Searching on Google, I could not find 
anything yet. Performance is my main concern.
Feb 12 2013
next sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
OK. I found it.
Feb 12 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Feb-2013 01:09, Sparsh Mittal writes:
 Pardon me, can you please point me to suitable reference or tell just
 command here. Searching on google, I could not find anything yet.
 Performance is my main concern.
GDC seems like it's mostly a "build from source" kind of thing. Moved to github:
https://github.com/D-Programming-GDC
(See also the newsgroup digitalmars.d.D.gnu)

GDC binaries for the Windows TDM-GCC toolchain are still available here:
https://bitbucket.org/goshawk/gdc/downloads
AFAIK it needs the 4.6.1 version of the TDM toolset.

LDC(2), recent release with binaries:
https://github.com/downloads/ldc-developers/ldc/ldc-0.10.0-src.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.xz
(See also the announce on the newsgroup digitalmars.d.D.ldc)

Both compilers ship a dmd-style compiler driver called gdmd or ldmd2. Speed is mostly what you'd expect of GCC and LLVM respectively.

-- 
Dmitry Olshansky
Feb 12 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 13, 2013 at 12:56:01AM +0400, Dmitry Olshansky wrote:
13-Feb-2013 00:39, Sparsh Mittal writes:
I am finding C++ code is much faster than D code.
Seems like DMD's floating point issue. The issue being that it always works with floats as full-width reals + rounding. Basically if nothing changed (and I doubt it changed) then DMD with floating point code is about two (or more) times slower then GDC/LDC. The cure is using GDC/LDC compiler as they are pretty stable and up to date on the front-end side these days.
[...] I did a few benchmarks somewhat recently where I compared the performance of code produced by GDC with DMD. Code produced by GDC consistently outperforms code produced by DMD by about 20-30% or so. This is across the board, with floats, reals, and applications that don't do heavy arithmetic (just basic looping/recursion constructs).

I didn't investigate the cause of this difference in detail, but the last time I looked at the assembly code generated by both compilers, I noticed that GDC's optimizer is far more advanced than DMD's, esp. when it comes to loop-unrolling, strength reduction, inlining, etc.. For non-trivial code, GDC pretty much consistently produces superior code in general (not just in floating-point operations).

So if performance is a concern, I'd say definitely look into GDC or LDC instead of DMD.

T

-- 
Two wrongs don't make a right; but three rights do make a left...
Feb 12 2013
parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Thanks for your insights. It was very helpful.
Feb 12 2013
prev sibling parent reply FG <home fgda.pl> writes:
On 2013-02-12 21:39, Sparsh Mittal wrote:
 I am finding C++ code is much faster than D code.
I had a look, but first had to make juliaValue global, because g++ had optimized all the calculations away. :) Also changed DIM to 32 * 1024.

13.2s -- g++ -O3
16.0s -- g++ -O2
15.9s -- gdc -O3
15.9s -- gdc -O2
16.2s -- dmd -O -release -inline (v.2.060)

Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast. Interesting how gdc -O3 gave no extra boost vs. -O2.
Feb 12 2013
next sibling parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
 I had a look, but first had to make juliaValue global, because 
 g++ had optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I put. Thank you very very much.
Feb 12 2013
parent reply FG <home fgda.pl> writes:
On 2013-02-13 00:06, Sparsh Mittal wrote:
 I had a look, but first had to make juliaValue global, because g++ had
 optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I put. Thank you very very much.
LOL. For a while you thought that C++ could be that much faster than D? :D
Feb 12 2013
next sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
 LOL. For a while you thought that C++ could be that much faster 
 than D?  :D
I was stunned and shared it with others, who could not find the cause either. It was like a scientist discovering a phenomenon which goes against established laws. Good that I was wrong and that the right person pointed it out.
Feb 12 2013
prev sibling parent reply "Rob T" <alanb ucora.com> writes:
Well technically it was that much faster, because it did optimize 
away the useless calculations.

On Tuesday, 12 February 2013 at 23:31:17 UTC, FG wrote:
 On 2013-02-13 00:06, Sparsh Mittal wrote:
 I had a look, but first had to make juliaValue global, 
 because g++ had
 optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I put. Thank you very very much.
LOL. For a while you thought that C++ could be that much faster than D? :D
Well technically it's not that C++ is faster than D or vice versa; it's that the two compilers did different optimizations, and in this case one of the optimizations that g++ did (removing redundancies) had a large effect on the outcome. It's entirely possible that DMD can still beat g++ under different circumstances.

--rt
Feb 12 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
I like optimization challenges. This is an excellent test
program to check the effect of different floating point types
on intermediate values. Remember that when you store values in
a float variable, the FPU actually has to round it down to
that precision, store it in a 32-bit memory location, then
load it back in and expand it - you _asked_ for that.

I compiled with LDC2 and these are the results:

D code serial with dimension 32768 ...
  using floats  Total time: 13.399 [sec]
  using doubles Total time:  9.429 [sec]
  using reals   Total time:  8.909 [sec] // <- !!!

You get both, 50% more speed and more precision! It is a
win-win situation. Also take a look at Phobos' std.math, which
returns real everywhere.

Modified code:
---8<-------------------------------

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
	struct ComplexStruct
	{
		float r;
		float i;
	
		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;
			
			r = r1;
			i = i1;
			
			return (r1*r1 + i1*i1);
		}
	}

	int juliaFunction( int x, int y )
	{
		auto c = ComplexStruct(0.8, 0.156);
		auto a = ComplexStruct(x, y);
	
		foreach (i; 0 .. 200)
			if (a.squarePlusMag(c) > 1000)
				return 0;
		return 1;
	}
	
	void kernel()
	{
		foreach (x; 0 .. DIM) {
			foreach (y; 0 .. DIM) {
				juliaValue = juliaFunction( x, y );
			}
		}
	}
}

void main()
{
	writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
	StopWatch sw;
	foreach (Math; TypeTuple!(float, double, real))
	{
		sw.start();
		Julia!(Math).kernel();
		sw.stop();
		writefln("  using %ss Total time: %s [sec]",
		         Math.stringof, (sw.peek().msecs * 0.001));
		sw.reset();
	}
}

------------------------------->8---

-- 
Marco
Feb 13 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-13 14:26, Marco Leise wrote:
 template Julia(TReal)
 {
 	struct ComplexStruct
 	{
 		float r;
 		float i;
 ... 	
Why aren't r and i of type TReal?
Feb 13 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 14:44:36 +0100
schrieb FG <home fgda.pl>:

 On 2013-02-13 14:26, Marco Leise wrote:
 template Julia(TReal)
 {
 	struct ComplexStruct
 	{
 		float r;
 		float i;
 ... 	
Why aren't r and i of type TReal?
They are actual storage in memory, where every increase in size hurts. And they cannot be optimized away, like temporary reals, which can be kept on the FPU stack. -- Marco
Feb 13 2013
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced with TReal, it sped things up for double.
Feb 13 2013
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 15:45:13 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced with TReal, it sped things up for double.
Give me that stuff, your northbridge is on! But I still want to rule out the LLVM version, since GDC seems to produce code with similar runtime on both our systems, but LDC2 diverges so much.

-- 
Marco
Feb 13 2013
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 15:45:13 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced with TReal, it sped things up for double.
Oh this gets even better... I only added double as last step to that code, so I didn't notice this effect. Looks like we've got: - CPUs that are good at converting to double - 64-bit, so the size of a double matches - only 16 bytes of memory in total With double struct fields the 'double' case gains 50% speed for me, making it the overall fastest now (on LDC). I'd still bet a dollar that with an array of values floats would outperform doubles, when cache misses happen. (E.g. more or less random memory access.) -- Marco
Feb 13 2013
parent FG <home fgda.pl> writes:
On 2013-02-13 16:26, Marco Leise wrote:
 I'd still bet a dollar that with an array of values floats would
 outperform doubles, when cache misses happen. (E.g. more or
 less random memory access.)
I'll play it safe and only bet my opDollar. :)
Feb 13 2013
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 02:26 PM, Marco Leise wrote:
 You get both, 50% more speed and more precision! It is a
 win-win situation. Also take a look at Phobos' std.math that
 returns real everywhere.
I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?
Feb 13 2013
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 14:48:21 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 On 02/13/2013 02:26 PM, Marco Leise wrote:
 You get both, 50% more speed and more precision! It is a
 win-win situation. Also take a look at Phobos' std.math that
 returns real everywhere.
I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?
The target is Linux, AMD64 and the compiler arguments are: ldc2 -O5 -check-printf-calls -fdata-sections -ffunction-sections -release -singleobj -strip-debug -wi -L=--gc-sections -L=-s -- Marco
Feb 13 2013
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 02:26 PM, Marco Leise wrote:
 I compiled with LDC2 and these are the results:

 D code serial with dimension 32768 ...
    using floats  Total time: 13.399 [sec]
    using doubles Total time:  9.429 [sec]
    using reals   Total time:  8.909 [sec] // <- !!!

 You get both, 50% more speed and more precision!
Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub LDC, LLVM 3.2:

D code serial with dimension 32768 ...
  using floats  Total time: 4.751 [sec]
  using doubles Total time: 4.362 [sec]
  using reals   Total time: 5.95 [sec]

Using double is indeed marginally faster than float, but real is slower than both.

What's disturbing is that when compiled instead with gdmd -O -inline -release the code is dramatically slower:

D code serial with dimension 32768 ...
  using floats  Total time: 22.108 [sec]
  using doubles Total time: 21.203 [sec]
  using reals   Total time: 23.717 [sec]

It's the first time I've encountered such a dramatic difference between GDC and LDC, and I'm wondering whether it's down to a bug or some change between D releases 2.060 and 2.061.
Feb 13 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 15:00:21 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub 
 LDC, LLVM 3.2:
 
    D code serial with dimension 32768 ...
      using floats Total time: 4.751 [sec]
      using doubles Total time: 4.362 [sec]
      using reals Total time: 5.95 [sec]
Ok, I get pretty much the same numbers as before with:

   ldmd2 -O -inline -release

It's even a bit faster than my loooong command line.

Do these numbers tell us that there are such huge differences in the handling of floating point values between different AMD64 CPUs? I can't quite make a rhyme of it yet.

What version of LLVM are you using? Mine is 3.1. 3.0 is the minimum and 3.2 is recommended for LDC2.
 Using double is indeed marginally faster than float, but real is slower than
both.
 
 What's disturbing is that when compiled instead with gdmd -O -inline -release 
 the code is dramatically slower:
 
    D code serial with dimension 32768 ...
      using floats Total time: 22.108 [sec]
      using doubles Total time: 21.203 [sec]
      using reals Total time: 23.717 [sec]
 
 It's the first time I've encountered such a dramatic difference between GDC
and 
 LDC, and I'm wondering whether it's down to a bug or some change between D 
 releases 2.060 and 2.061.
_THAT_ I can reproduce with GDC!:

D code serial with dimension 32768 ...
  using floats  Total time: 24.415 [sec]
  using doubles Total time: 23.268 [sec]
  using reals   Total time: 25.168 [sec]

It's the exact same pattern.

-- 
Marco
Feb 13 2013
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 03:56 PM, Marco Leise wrote:
 Ok, I get pretty much the same numbers as before with:
    ldmd2 -O -inline -release
 It's even a bit faster than my loooong command line.
My experience has been that the higher -O values of ldc don't do much, but of course, that's going to vary depending on your code. I think above -O3 it's all link-time, no?
 Do these numbers tell us, that there are such huge differences
 in the handling of floating point value between different
 AMD64 CPUs? I can't quite make a rhyme of it yet.
AMD vs Intel might make a difference (my machine is an i7).
 What version of LLVM are you using, mine is 3.1. 3.0 is
 minimum and 3.2 is recommended for LDC2.
LLVM 3.2.
 _THAT_ I can reproduce with GDC! :

 D code serial with dimension 32768 ...
    using floats Total time: 24.415 [sec]
    using doubles Total time: 23.268 [sec]
    using reals Total time: 25.168 [sec]

 It's the exact same pattern.
I've never, EVER had ldc-compiled code run four times faster than GDC-compiled code. In fact, I don't think I've ever had LDC-compiled code run faster than GDC-compiled code at all, except where the choice of optimizations was different. That's what makes me concerned that there's some kind of bug in play here ....
Feb 13 2013
prev sibling parent reply FG <home fgda.pl> writes:
Good point about choosing the right type of floating point numbers.
Conclusion: when there's enough space, always pick double over float.
Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
I thought to myself: cool, I almost beat the 13.4s I got with C++, until I 
changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Feb 13 2013
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 04:17 PM, FG wrote:
 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
Feb 13 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 16:17:12 +0100
schrieb FG <home fgda.pl>:

 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I 
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yeah, we are living in the 32-bit past ;) Still, be aware that we only write to 2 memory locations in that program! We have neither exceeded the L1 cache size with that, nor have we put any strain on the prefetcher and memory bandwidth.

With the modification below it is more clear why I said "use float for storage". The result with LDC2 for me is:

D code serial with dimension 8192 ...
  using floats  Total time: 4.235 [sec]
  using doubles Total time: 5.58 [sec]   // ~+32% over float
  using reals   Total time: 6.432 [sec]

So all the in-CPU performance gain from using doubles is more than lost when you run out of bandwidth.

---8<-----------------------------------

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;
import std.random;
import core.stdc.stdlib;

enum DIM = 8 * 1024;

int juliaValue;
size_t* randomAcc;

static this()
{
	randomAcc = cast(size_t*) malloc((DIM * DIM + 200) * size_t.sizeof);
	foreach (i; 0 .. DIM * DIM)
		randomAcc[i] = i;
	randomAcc[0 .. DIM * DIM].randomShuffle();
	randomAcc[DIM * DIM .. DIM * DIM + 200] = randomAcc[0 .. 200];
}

static ~this()
{
	free(randomAcc);
}

template Julia(TReal)
{
	TReal* squares;

	static this()
	{
		squares = cast(TReal*) malloc(DIM * DIM * TReal.sizeof);
	}

	static ~this()
	{
		free(squares);
	}

	struct ComplexStruct
	{
		TReal r;
		TReal i;

		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;

			r = r1;
			i = i1;

			return (r1*r1 + i1*i1);
		}
	}

	int juliaFunction( int x, int y )
	{
		auto c = ComplexStruct(0.8, 0.156);
		auto a = ComplexStruct(x, y);

		foreach (i; 0 .. 200)
		{
			size_t idx = randomAcc[DIM * x + y + i];
			squares[idx] = a.squarePlusMag(c);
			if (squares[idx] > 1000)
				return 0;
		}
		return 1;
	}

	void kernel()
	{
		foreach (x; 0 .. DIM) {
			foreach (y; 0 .. DIM) {
				juliaValue = juliaFunction( x, y );
			}
		}
	}
}

void main()
{
	writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
	StopWatch sw;
	foreach (Math; TypeTuple!(float, double, real))
	{
		sw.start();
		Julia!(Math).kernel();
		sw.stop();
		writefln("  using %ss Total time: %s [sec]",
		         Math.stringof, (sw.peek().msecs * 0.001));
		sw.reset();
	}
}

-- 
Marco
Feb 13 2013
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 04:41 PM, Joseph Rushton Wakeling wrote:
 On 02/13/2013 04:17 PM, FG wrote:
 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
Just to update on times. I was running another large job at the same time as doing all these tests, so there was some slowdown. Current results are:

-- with g++ -O3 and using double rather than float: about 4.3 s

-- with clang++ -O3 and using double rather than float: about 3.1 s

-- with gdmd -O -release -inline:

   D code serial with dimension 32768 ...
     using floats Total time: 17.179 [sec], Julia value: 0
     using doubles Total time: 10.298 [sec], Julia value: 0
     using reals Total time: 17.126 [sec], Julia value: 0

-- with ldmd2 -O -release -inline:

   D code serial with dimension 32768 ...
     using floats Total time: 3.548 [sec], Julia value: 0
     using doubles Total time: 2.708 [sec], Julia value: 0
     using reals Total time: 4.371 [sec], Julia value: 0

-- with dmd -O -release -inline:

   D code serial with dimension 32768 ...
     using floats Total time: 15.696 [sec], Julia value: 0
     using doubles Total time: 7.233 [sec], Julia value: 0
     using reals Total time: 28.71 [sec], Julia value: 0

You'll note that I added a writeout of the global juliaValue in order to check that certain calculations weren't being optimized away.

It's striking that in this case GDC is slower not only than LDC but also DMD. Current GDC is based off 2.060 as far as I know, whereas current LDC has upgraded to 2.061, so are there some changes between D 2.060 and 2.061 that could explain this?

It's also interesting that clang++ produces a faster executable than g++, but it's not possible to make a direct LLVM vs GCC comparison here, as g++ is GCC 4.7.2 whereas GDC is based off a GCC snapshot. My guess would be that it's some combination of LLVM superiority in a particular case here, together with some 2.060 --> 2.061 change.

Are these results comparable to what other people are getting?

I can confirm that where code of mine is concerned, GDC still seems to have the edge in terms of executable speed ...
Feb 13 2013
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Feb 2013 18:10:47 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling webdrake.net>:

 Just to update on times.  I was running another large job at the same time as
 doing all these tests, so there was some slowdown.  Current results are:

 -- with g++ -O3 and using double rather than float: about 4.3 s

 -- with clang++ -O3 and using double rather than float: about 3.1 s

 -- with gdmd -O -release -inline:

      D code serial with dimension 32768 ...
        using floats Total time: 17.179 [sec], Julia value: 0
        using doubles Total time: 10.298 [sec], Julia value: 0
        using reals Total time: 17.126 [sec], Julia value: 0

 -- with ldmd2 -O -release -inline:

      D code serial with dimension 32768 ...
        using floats Total time: 3.548 [sec], Julia value: 0
        using doubles Total time: 2.708 [sec], Julia value: 0
        using reals Total time: 4.371 [sec], Julia value: 0

 -- with dmd -O -release -inline:

      D code serial with dimension 32768 ...
        using floats Total time: 15.696 [sec], Julia value: 0
        using doubles Total time: 7.233 [sec], Julia value: 0
        using reals Total time: 28.71 [sec], Julia value: 0

 You'll note that I added a writeout of the global juliaValue in order to check
 that certain calculations weren't being optimized away.

 It's striking that in this case GDC is slower not only than LDC but also DMD.
 Current GDC is based off 2.060 as far as I know, whereas current LDC has
 upgraded to 2.061, so are there some changes between D 2.060 and 2.061 that
 could explain this?

??? Anyways, I upgraded to LLVM 3.2 - no change. You have an i7, I have a Core2. It would be really interesting to know what LDC does there, since GDC's output seems rather CPU-agnostic, while LDC's output is better in every case but also exhibits system-specific behavior to a degree I would never have imagined possible. Should Intel have changed their CPU design so radically?

 It's also interesting that clang++ produces a faster executable than g++, but
 it's not possible to make a direct LLVM vs GCC comparison here, as g++ is GCC
 4.7.2 whereas GDC is based off a GCC snapshot.

I've compiled GDC based on the same source that the Gentoo package manager built G++ 4.7.2 from, and I get similar numbers.

 My guess would be that it's some combination of LLVM superiority in a particular
 case here, together with some 2.060 --> 2.061 change.

 Are these results comparable to what other people are getting?

 I can confirm that where code of mine is concerned, GDC still seems to have the
 edge in terms of executable speed ...

I've seen a tête-à-tête between LDC and GDC in some of my code.

-- 
Marco
Feb 13 2013
prev sibling parent "jerro" <a a.com> writes:
When you are comparing LDC and GDC, you should either use 
-mcpu=generic for ldc or -march=native for GDC, because their 
default targets are different. GDC will produce code that works 
on most x86_64 (if you are on a x86_64 system) CPUs by default, 
and LDC targets the host CPU. But this does not explain the 
difference in timings you are seeing here.

One reason why the code generated by GDC is slower is that 
squarePlusMag isn't inlined. It seems that the fact that its 
parameter is const is somehow preventing it from being inlined - 
I have no idea why. Removing const and adding -march=native to 
gdc flags gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 8.283 [sec]
   using doubles Total time: 6.827 [sec]
   using reals Total time: 6.795 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 3.348 [sec]
   using doubles Total time: 3.08 [sec]
   using reals Total time: 4.174 [sec]

The difference is smaller, but still pretty large.

I have noticed that there are needless conversions in this code 
that are slowing down both GDC generated and LDC generated code. 
This code is a bit faster:

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
     struct ComplexStruct
     {
         TReal r;
         TReal i;

         TReal squarePlusMag(ComplexStruct another)
         {
             TReal r1 = r*r - i*i + another.r;
             TReal i1 = cast(TReal)2.0*i*r + another.i;

             r = r1;
             i = i1;

             return (r1*r1 + i1*i1);
         }
     }

     int juliaFunction( int x, int y )
     {
         auto c = ComplexStruct(0.8, 0.156);
         auto a = ComplexStruct(x, y);

         foreach (i; 0 .. 200)
             if (a.squarePlusMag(c) > cast(TReal) 1000)
                 return 0;
         return 1;
     }

     void kernel()
     {
         foreach (x; 0 .. DIM) {
             foreach (y; 0 .. DIM) {
                 juliaValue = juliaFunction( x, y );
             }
         }
     }
}

void main()
{
     writeln("D code serial with dimension " ~ toStringNow!DIM ~ " 
...");
     StopWatch sw;
     foreach (Math; TypeTuple!(float, double, real))
     {
         sw.start();
         Julia!(Math).kernel();
         sw.stop();
         writefln("  using %ss Total time: %s [sec]",
                  Math.stringof, (sw.peek().msecs * 0.001));
         sw.reset();
     }
}

This gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 6.746 [sec]
   using doubles Total time: 6.872 [sec]
   using reals Total time: 5.226 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 2.36 [sec]
   using doubles Total time: 2.535 [sec]
   using reals Total time: 4.106 [sec]

At least part of the difference is due to the fact that 
juliaFunction still isn't getting inlined (but squarePlusMag is). 
Making juliaFunction a static method of ComplexStruct causes it 
to get inlined (again, I have no idea why). Moving juliaFunction 
inside ComplexStruct does not affect the performance of LDC 
generated code, but for GDC it gives me:

   using floats Total time: 4.262 [sec]
   using doubles Total time: 4.251 [sec]
   using reals Total time: 3.512 [sec]

There is still a large difference between LDC and GDC for floats 
and doubles and I can't explain it. But at least it is much 
smaller than it was initially.

I ran all the benchmarks on 64 bit linux, using core i5 2500k.
Feb 13 2013
prev sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Thanks a lot for your reply.
Feb 14 2013
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/12/2013 11:17 PM, FG wrote:
 Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast.
 Interesting how gdc -O3 gave no extra boost vs. -O2.
... try adding -frelease to the gdc call?
Feb 13 2013