
digitalmars.D.learn - Finding large difference b/w execution time of C++ and D codes for Julia sets

reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
I am writing a Julia set program in C++ and D, in as close to the
same way as possible. On executing them I find a large difference
in their execution times. Can you comment on what I am doing wrong,
or is this expected?


//===============C++ code, compiled with -O3 ==============
#include <sys/time.h>
#include <iostream>
using namespace std;
const int DIM = 4194304;

struct complexClass {
   float r;
   float i;
   complexClass( float a, float b )
   {
     r = a;
     i = b;
   }


   float squarePlusMag(complexClass another)
   {
     float r1 = r*r - i*i + another.r;
     float i1 = 2.0*i*r + another.i;

     r = r1;
     i = i1;

     return (r1*r1+ i1*i1);
   }
};


int juliaFunction( int x, int y )
{

   complexClass a(x, y);
   complexClass c(-0.8, 0.156);

   for (int i = 0; i < 200; i++) {
      if (a.squarePlusMag(c) > 1000)
         return 0;
   }

   return 1;
}


void kernel() {
   for (int x=0; x<DIM; x++) {
     for (int y=0; y<DIM; y++) {
       int offset = x + y * DIM;
       int juliaValue = juliaFunction( x, y );
	//juliaValue will be used by some function.
     }
   }
}


int main()
{

   struct timeval start, end;
   gettimeofday(&start, NULL);
   kernel();
   gettimeofday(&end, NULL);
   float delta = ((end.tv_sec  - start.tv_sec) * 1000000u +
          end.tv_usec - start.tv_usec) / 1.e6;


   cout<<" C++ code with dimension " << DIM <<" Total time: "<< 
delta << "[sec]\n";
}






//=====================D code, compiled with -O -release -inline=========


import std.stdio;
import std.datetime;
immutable int DIM = 4194304;


struct complexClass {
   float r;
   float i;

   float squarePlusMag(complexClass another)
   {
     float r1 = r*r - i*i + another.r;
     float i1 = 2.0*i*r + another.i;

     r = r1;
     i = i1;

     return (r1*r1+ i1*i1);
   }
}


int juliaFunction( int x, int y )
{

   complexClass c = complexClass(-0.8, 0.156);
   complexClass a = complexClass(x, y);


   for (int i=0; i<200; i++) {

     if( a.squarePlusMag(c) > 1000)
       return 0;
   }
   return 1;
}


void kernel() {
   for (int x=0; x<DIM; x++) {
     for (int y=0; y<DIM; y++) {
       int offset = x + y * DIM;
       int juliaValue = juliaFunction( x, y );
       //juliaValue will be used by some function.	
     }
   }
}


void main()
{
   StopWatch sw;
   sw.start();
   kernel();
   sw.stop();
   writeln(" D code serial with dimension ", DIM ," Total time: ", 
(sw.peek().msecs/1000), "[sec]");
}

//================
I would appreciate any help.
Feb 12 2013
parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
I am finding C++ code is much faster than D code.
Feb 12 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 12 February 2013 at 20:39:36 UTC, Sparsh Mittal wrote:
 I am finding C++ code is much faster than D code.
DMD (AFAIK) is known to be slower. Try LDC or GDC if speed is your major concern.
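
(For reference, invocations using the flags that appear later in this thread would look something like this, with "julia.d" standing in for the source file above:

gdc -O3 -frelease -finline-functions julia.d -o julia
ldc2 -O3 -release julia.d

Both compilers also ship dmd-style drivers, gdmd and ldmd2, which accept dmd's flags.)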
Feb 12 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Feb-2013 00:39, Sparsh Mittal writes:
 I am finding C++ code is much faster than D code.
Seems like DMD's floating point issue. The issue being that it always works with floats as full-width reals plus rounding. Basically, if nothing changed (and I doubt it changed), then DMD with floating point code is about two (or more) times slower than GDC/LDC.

The cure is using the GDC or LDC compilers, as they are pretty stable and up to date on the front-end side these days.

-- 
Dmitry Olshansky
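
(To illustrate the rounding being described - a minimal, self-contained example, not from the thread: every store to a float variable rounds the value down to 32 bits, no matter how wide the FPU's intermediate result was.

import std.stdio;

void main()
{
    float sum = 0.0f;
    foreach (i; 0 .. 10)
        sum += 0.1f;       // each store rounds the intermediate down to 32-bit float
    writefln("%.9f", sum); // prints 1.000000119, not 1.0
}

With DMD's x87 code generation the additions themselves happen at real precision; the cost is the repeated round-store-reload on every assignment to a float.)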
Feb 12 2013
next sibling parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Pardon me, can you please point me to a suitable reference, or 
just tell me the command here? Searching on Google, I could not 
find anything yet. Performance is my main concern.
Feb 12 2013
next sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
OK. I found it.
Feb 12 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Feb-2013 01:09, Sparsh Mittal writes:
 Pardon me, can you please point me to suitable reference or tell just
 command here. Searching on google, I could not find anything yet.
 Performance is my main concern.
GDC seems like it's mostly a "build from source" kind of thing. Moved to GitHub:

https://github.com/D-Programming-GDC

(See also the newsgroup digitalmars.d.D.gnu)

GDC binaries for the Windows TDM-GCC toolchain are still available here:

https://bitbucket.org/goshawk/gdc/downloads

AFAIK it needs the 4.6.1 version of the TDM toolset.

LDC(2), recent release with binaries:

https://github.com/downloads/ldc-developers/ldc/ldc-0.10.0-src.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86_64.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-linux-x86.tar.xz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.gz
https://github.com/downloads/ldc-developers/ldc/ldc2-0.10.0-osx-x86_64.tar.xz

(See also the announcement on the newsgroup digitalmars.d.D.ldc)

Both compilers ship a dmd-style compiler driver, called gdmd or ldmd2. Speed is mostly what you'd expect of GCC and LLVM respectively.

-- 
Dmitry Olshansky
Feb 12 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 13, 2013 at 12:56:01AM +0400, Dmitry Olshansky wrote:
 13-Feb-2013 00:39, Sparsh Mittal writes:
I am finding C++ code is much faster than D code.
Seems like DMD's floating point issue. The issue being that it always works with floats as full-width reals plus rounding. Basically, if nothing changed (and I doubt it changed), then DMD with floating point code is about two (or more) times slower than GDC/LDC. The cure is using the GDC or LDC compilers, as they are pretty stable and up to date on the front-end side these days.
[...]

I did a few benchmarks somewhat recently where I compared the performance of code produced by GDC with DMD. Code produced by GDC consistently outperforms code produced by DMD by about 20-30% or so. This is across the board, with floats, reals, and applications that don't do heavy arithmetic (just basic looping/recursion constructs).

I didn't investigate the cause of this difference in detail, but the last time I looked at the assembly code generated by both compilers, I noticed that GDC's optimizer is far more advanced than DMD's, esp. when it comes to loop unrolling, strength reduction, inlining, etc. For non-trivial code, GDC pretty much consistently produces superior code in general (not just in floating-point operations).

So if performance is a concern, I'd say definitely look into GDC or LDC instead of DMD.

T

-- 
Two wrongs don't make a right; but three rights do make a left...
Feb 12 2013
parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Thanks for your insights. It was very helpful.
Feb 12 2013
prev sibling parent reply FG <home fgda.pl> writes:
On 2013-02-12 21:39, Sparsh Mittal wrote:
 I am finding C++ code is much faster than D code.
I had a look, but first had to make juliaValue global, because g++ had optimized all the calculations away. :) Also changed DIM to 32 * 1024.

13.2s -- g++ -O3
16.0s -- g++ -O2
15.9s -- gdc -O3
15.9s -- gdc -O2
16.2s -- dmd -O -release -inline (v.2.060)

Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast. Interesting how gdc -O3 gave no extra boost vs. -O2.
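
(Concretely, the fix amounts to something like this in the D version - a minimal sketch, with DIM and juliaFunction as defined in the original program:

int juliaValue;  // module-scope global: writing the result somewhere visible
                 // keeps the compiler from discarding the computation as dead code

void kernel()
{
    foreach (x; 0 .. DIM)
        foreach (y; 0 .. DIM)
            juliaValue = juliaFunction(x, y);
}

The C++ fix is analogous: assign juliaValue to a global instead of a local.)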
Feb 12 2013
next sibling parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
 I had a look, but first had to make juliaValue global, because 
 g++ had optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I used. Thank you very, very much.
Feb 12 2013
parent reply FG <home fgda.pl> writes:
On 2013-02-13 00:06, Sparsh Mittal wrote:
 I had a look, but first had to make juliaValue global, because g++ had
 optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I used. Thank you very, very much.
LOL. For a while you thought that C++ could be that much faster than D? :D
Feb 12 2013
next sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
 LOL. For a while you thought that C++ could be that much faster 
 than D?  :D
I was stunned and shared it with others, who could not find the cause either. It was like a scientist discovering a phenomenon that goes against established laws. Good that I was wrong and the right person pointed it out.
Feb 12 2013
prev sibling parent reply "Rob T" <alanb ucora.com> writes:
On Tuesday, 12 February 2013 at 23:31:17 UTC, FG wrote:
 On 2013-02-13 00:06, Sparsh Mittal wrote:
 I had a look, but first had to make juliaValue global, 
 because g++ had
 optimized all the calculations away.
Brilliant! Yes, that is why the time was coming out to be zero, regardless of what value of DIM I used. Thank you very, very much.
LOL. For a while you thought that C++ could be that much faster than D? :D
Well, technically it was that much faster, because it did optimize away the useless calculations. But it's not that C++ is faster than D or vice versa; it's that the two compilers did different optimizations, and in this case one of the optimizations that g++ did (removing redundancies) had a large effect on the outcome. It's entirely possible that DMD can still beat g++ under different circumstances.

--rt
Feb 12 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
I like optimization challenges. This is an excellent test
program to check the effect of different floating point types
on intermediate values. Remember that when you store a value in
a float variable, the FPU actually has to round it down to
that precision, store it in a 32-bit memory location, then
load it back in and expand it - you _asked_ for that.

I compiled with LDC2 and these are the results:

D code serial with dimension 32768 ...
  using floats  Total time: 13.399 [sec]
  using doubles Total time:  9.429 [sec]
  using reals   Total time:  8.909 [sec] // <- !!!

You get both: 50% more speed and more precision! It is a
win-win situation. Also take a look at Phobos' std.math,
which returns real everywhere.

Modified code:
---8<-------------------------------

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
	struct ComplexStruct
	{
		float r;
		float i;
	
		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;
			
			r = r1;
			i = i1;
			
			return (r1*r1 + i1*i1);
		}
	}

	int juliaFunction( int x, int y )
	{
		auto c = ComplexStruct(0.8, 0.156);
		auto a = ComplexStruct(x, y);
	
		foreach (i; 0 .. 200)
			if (a.squarePlusMag(c) > 1000)
				return 0;
		return 1;
	}
	
	void kernel()
	{
		foreach (x; 0 .. DIM) {
			foreach (y; 0 .. DIM) {
				juliaValue = juliaFunction( x, y );
			}
		}
	}
}

void main()
{
	writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
	StopWatch sw;
	foreach (Math; TypeTuple!(float, double, real))
	{
		sw.start();
		Julia!(Math).kernel();
		sw.stop();
		writefln("  using %ss Total time: %s [sec]",
		         Math.stringof, (sw.peek().msecs * 0.001));
		sw.reset();
	}
}

------------------------------->8---

-- 
Marco
Feb 13 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-13 14:26, Marco Leise wrote:
 template Julia(TReal)
 {
 	struct ComplexStruct
 	{
 		float r;
 		float i;
 ... 	
Why aren't r and i of type TReal?
Feb 13 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 14:44:36 +0100, FG <home fgda.pl> wrote:

 On 2013-02-13 14:26, Marco Leise wrote:
 template Julia(TReal)
 {
 	struct ComplexStruct
 	{
 		float r;
 		float i;
 ... 	
Why aren't r and i of type TReal?
They are actual storage in memory, where every increase in size hurts. And they cannot be optimized away, like temporary reals, which can be kept on the FPU stack. -- Marco
Feb 13 2013
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced them with TReal, it sped things up for double.
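
(For reference, the variant tested here presumably looks like this - the fields following the template parameter instead of being fixed to float:

	struct ComplexStruct
	{
		TReal r;
		TReal i;

		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;

			r = r1;
			i = i1;

			return (r1*r1 + i1*i1);
		}
	}
)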
Feb 13 2013
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:

 On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced with TReal, it sped things up for double.
Give me that stuff your northbridge is on! But I still want to rule out the LLVM version, since GDC seems to produce code with similar runtimes on both our systems, while LDC2 diverges so much.

-- 
Marco
Feb 13 2013
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:

 On 02/13/2013 03:29 PM, Marco Leise wrote:
 They are actual storage in memory, where every increase in
 size hurts.
When I replaced with TReal, it sped things up for double.
Oh, this gets even better... I only added double as the last step to that code, so I didn't notice this effect. Looks like we've got:

- CPUs that are good at converting to double
- 64-bit, so the size of a double matches
- only 16 bytes of memory in total

With double struct fields the 'double' case gains 50% speed for me, making it the overall fastest now (on LDC). I'd still bet a dollar that with an array of values floats would outperform doubles when cache misses happen (e.g. more or less random memory access).

-- 
Marco
Feb 13 2013
parent FG <home fgda.pl> writes:
On 2013-02-13 16:26, Marco Leise wrote:
 I'd still bet a dollar that with an array of values floats would
 outperform doubles, when cache misses happen. (E.g. more or
 less random memory access.)
I'll play it safe and only bet my opDollar. :)
Feb 13 2013
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 02:26 PM, Marco Leise wrote:
 You get both, 50% more speed and more precision! It is a
 win-win situation. Also take a look at Phobos' std.math that
 returns real everywhere.
I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?
Feb 13 2013
parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 14:48:21 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:

 On 02/13/2013 02:26 PM, Marco Leise wrote:
 You get both, 50% more speed and more precision! It is a
 win-win situation. Also take a look at Phobos' std.math that
 returns real everywhere.
I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?
The target is Linux, AMD64 and the compiler arguments are:

ldc2 -O5 -check-printf-calls -fdata-sections -ffunction-sections -release -singleobj -strip-debug -wi -L=--gc-sections -L=-s

-- 
Marco
Feb 13 2013
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 02:26 PM, Marco Leise wrote:
 I compiled with LDC2 and these are the results:

 D code serial with dimension 32768 ...
    using floats  Total time: 13.399 [sec]
    using doubles Total time:  9.429 [sec]
    using reals   Total time:  8.909 [sec] // <- !!!

 You get both, 50% more speed and more precision!
Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub LDC, LLVM 3.2:

D code serial with dimension 32768 ...
  using floats  Total time: 4.751 [sec]
  using doubles Total time: 4.362 [sec]
  using reals   Total time: 5.95  [sec]

Using double is indeed marginally faster than float, but real is slower than both.

What's disturbing is that when compiled instead with gdmd -O -inline -release, the code is dramatically slower:

D code serial with dimension 32768 ...
  using floats  Total time: 22.108 [sec]
  using doubles Total time: 21.203 [sec]
  using reals   Total time: 23.717 [sec]

It's the first time I've encountered such a dramatic difference between GDC and LDC, and I'm wondering whether it's down to a bug or some change between D releases 2.060 and 2.061.
Feb 13 2013
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 15:00:21 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:

 Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub 
 LDC, LLVM 3.2:
 
    D code serial with dimension 32768 ...
      using floats Total time: 4.751 [sec]
      using doubles Total time: 4.362 [sec]
      using reals Total time: 5.95 [sec]
Ok, I get pretty much the same numbers as before with:

ldmd2 -O -inline -release

It's even a bit faster than my loooong command line.

Do these numbers tell us that there are such huge differences in the handling of floating point values between different AMD64 CPUs? I can't quite make a rhyme of it yet.

What version of LLVM are you using? Mine is 3.1; 3.0 is the minimum and 3.2 is recommended for LDC2.
 Using double is indeed marginally faster than float, but real is slower than
both.
 
 What's disturbing is that when compiled instead with gdmd -O -inline -release 
 the code is dramatically slower:
 
    D code serial with dimension 32768 ...
      using floats Total time: 22.108 [sec]
      using doubles Total time: 21.203 [sec]
      using reals Total time: 23.717 [sec]
 
 It's the first time I've encountered such a dramatic difference between GDC
and 
 LDC, and I'm wondering whether it's down to a bug or some change between D 
 releases 2.060 and 2.061.
_THAT_ I can reproduce with GDC!:

D code serial with dimension 32768 ...
  using floats  Total time: 24.415 [sec]
  using doubles Total time: 23.268 [sec]
  using reals   Total time: 25.168 [sec]

It's the exact same pattern.

-- 
Marco
Feb 13 2013
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 03:56 PM, Marco Leise wrote:
 Ok, I get pretty much the same numbers as before with:
    ldmd2 -O -inline -release
 It's even a bit faster than my loooong command line.
My experience has been that the higher -O values of ldc don't do much, but of course, that's going to vary depending on your code. I think above -O3 it's all link-time, no?
 Do these numbers tell us, that there are such huge differences
 in the handling of floating point value between different
 AMD64 CPUs? I can't quite make a rhyme of it yet.
AMD vs Intel might make a difference (my machine is an i7).
 What version of LLVM are you using, mine is 3.1. 3.0 is
 minimum and 3.2 is recommended for LDC2.
LLVM 3.2.
 _THAT_ I can reproduce with GDC! :

 D code serial with dimension 32768 ...
    using floats Total time: 24.415 [sec]
    using doubles Total time: 23.268 [sec]
    using reals Total time: 25.168 [sec]

 It's the exact same pattern.
I've never, EVER had LDC-compiled code run four times faster than GDC-compiled code. In fact, I don't think I've ever had LDC-compiled code run faster than GDC-compiled code at all, except where the choice of optimizations was different. That's what makes me concerned that there's some kind of bug in play here ...
Feb 13 2013
prev sibling parent reply FG <home fgda.pl> writes:
Good point about choosing the right type of floating point numbers.
Conclusion: when there's enough space, always pick double over float.
Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
I thought to myself: cool, I almost beat the 13.4s I got with C++, until I 
changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Feb 13 2013
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 04:17 PM, FG wrote:
 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
Feb 13 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 16:17:12 +0100, FG <home fgda.pl> wrote:

 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I 
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yeah, we are living in the 32-bit past ;) Still, be aware that we only write to 2 memory locations in that program! We have neither exceeded the L1 cache size with that nor have we put any strain on the prefetcher and memory bandwidth. With the modification below it is more clear why I said "use float for storage". The result with LDC2 for me is:

D code serial with dimension 8192 ...
  using floats  Total time: 4.235 [sec]
  using doubles Total time: 5.58  [sec]  // ~+32% over float
  using reals   Total time: 6.432 [sec]

So all the in-CPU performance gain from using doubles is more than lost, when you run out of bandwidth.

---8<-------------------------------

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;
import std.random;
import core.stdc.stdlib;

enum DIM = 8 * 1024;

int juliaValue;
size_t* randomAcc;

static this()
{
	randomAcc = cast(size_t*) malloc((DIM * DIM + 200) * size_t.sizeof);
	foreach (i; 0 .. DIM * DIM)
		randomAcc[i] = i;
	randomAcc[0 .. DIM * DIM].randomShuffle();
	randomAcc[DIM * DIM .. DIM * DIM + 200] = randomAcc[0 .. 200];
}

static ~this()
{
	free(randomAcc);
}

template Julia(TReal)
{
	TReal* squares;

	static this()
	{
		squares = cast(TReal*) malloc(DIM * DIM * TReal.sizeof);
	}

	static ~this()
	{
		free(squares);
	}

	struct ComplexStruct
	{
		TReal r;
		TReal i;

		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;

			r = r1;
			i = i1;

			return (r1*r1 + i1*i1);
		}
	}

	int juliaFunction( int x, int y )
	{
		auto c = ComplexStruct(0.8, 0.156);
		auto a = ComplexStruct(x, y);

		foreach (i; 0 .. 200)
		{
			size_t idx = randomAcc[DIM * x + y + i];
			squares[idx] = a.squarePlusMag(c);
			if (squares[idx] > 1000)
				return 0;
		}
		return 1;
	}

	void kernel()
	{
		foreach (x; 0 .. DIM) {
			foreach (y; 0 .. DIM) {
				juliaValue = juliaFunction( x, y );
			}
		}
	}
}

void main()
{
	writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
	StopWatch sw;
	foreach (Math; TypeTuple!(float, double, real))
	{
		sw.start();
		Julia!(Math).kernel();
		sw.stop();
		writefln("  using %ss Total time: %s [sec]",
		         Math.stringof, (sw.peek().msecs * 0.001));
		sw.reset();
	}
}

-- 
Marco
Feb 13 2013
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/13/2013 04:41 PM, Joseph Rushton Wakeling wrote:
 On 02/13/2013 04:17 PM, FG wrote:
 Good point about choosing the right type of floating point numbers.
 Conclusion: when there's enough space, always pick double over float.
 Tested with GDC in win64. floats: 16.0s / doubles: 14.1s / reals: 11.2s.
 I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
 changed the C++ code to also use doubles and... got a massive speedup: 7.1s!
Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).
Just to update on times. I was running another large job at the same time as doing all these tests, so there was some slowdown. Current results are:

-- with g++ -O3 and using double rather than float: about 4.3 s

-- with clang++ -O3 and using double rather than float: about 3.1 s

-- with gdmd -O -release -inline:

D code serial with dimension 32768 ...
  using floats  Total time: 17.179 [sec], Julia value: 0
  using doubles Total time: 10.298 [sec], Julia value: 0
  using reals   Total time: 17.126 [sec], Julia value: 0

-- with ldmd2 -O -release -inline:

D code serial with dimension 32768 ...
  using floats  Total time: 3.548 [sec], Julia value: 0
  using doubles Total time: 2.708 [sec], Julia value: 0
  using reals   Total time: 4.371 [sec], Julia value: 0

-- with dmd -O -release -inline:

D code serial with dimension 32768 ...
  using floats  Total time: 15.696 [sec], Julia value: 0
  using doubles Total time: 7.233 [sec], Julia value: 0
  using reals   Total time: 28.71 [sec], Julia value: 0

You'll note that I added a writeout of the global juliaValue in order to check that certain calculations weren't being optimized away.

It's striking that in this case GDC is slower not only than LDC but also DMD. Current GDC is based off 2.060 as far as I know, whereas current LDC has upgraded to 2.061, so are there some changes between D 2.060 and 2.061 that could explain this?

It's also interesting that clang++ produces a faster executable than g++, but it's not possible to make a direct LLVM vs GCC comparison here, as g++ is GCC 4.7.2 whereas GDC is based off a GCC snapshot. My guess would be that it's some combination of LLVM superiority in this particular case, together with some 2.060 --> 2.061 changes.

Are these results comparable to what other people are getting?

I can confirm that where code of mine is concerned, GDC still seems to have the edge in terms of executable speed ...
Feb 13 2013
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 13 Feb 2013 18:10:47 +0100, Joseph Rushton Wakeling <joseph.wakeling webdrake.net> wrote:

 Just to update on times. I was running another large job at the same time as doing all these tests, so there was some slowdown. Current results are:

 -- with g++ -O3 and using double rather than float: about 4.3 s
 -- with clang++ -O3 and using double rather than float: about 3.1 s

 -- with gdmd -O -release -inline:

   D code serial with dimension 32768 ...
     using floats  Total time: 17.179 [sec], Julia value: 0
     using doubles Total time: 10.298 [sec], Julia value: 0
     using reals   Total time: 17.126 [sec], Julia value: 0

 -- with ldmd2 -O -release -inline:

   D code serial with dimension 32768 ...
     using floats  Total time: 3.548 [sec], Julia value: 0
     using doubles Total time: 2.708 [sec], Julia value: 0
     using reals   Total time: 4.371 [sec], Julia value: 0

 -- with dmd -O -release -inline:

   D code serial with dimension 32768 ...
     using floats  Total time: 15.696 [sec], Julia value: 0
     using doubles Total time: 7.233 [sec], Julia value: 0
     using reals   Total time: 28.71 [sec], Julia value: 0

 You'll note that I added a writeout of the global juliaValue in order to check that certain calculations weren't being optimized away.

 It's striking that in this case GDC is slower not only than LDC but also DMD. Current GDC is based off 2.060 as far as I know, whereas current LDC has upgraded to 2.061, so are there some changes between D 2.060 and 2.061 that could explain this?

??? Anyway, I upgraded to LLVM 3.2 - no change. You have an i7, I have a Core2. It would be really interesting to know what LDC does there, since GDC's output seems rather CPU agnostic, while LDC's output is better in every case but also exhibits system-specific differences so harsh I would never have imagined them possible. Should Intel have changed their CPU design so radically?

 It's also interesting that clang++ produces a faster executable than g++, but it's not possible to make a direct LLVM vs GCC comparison here, as g++ is GCC 4.7.2 whereas GDC is based off a GCC snapshot.

I've compiled GDC based on the same source that the Gentoo package manager built G++ 4.7.2 from, and I get similar numbers.

 My guess would be that it's some combination of LLVM superiority in this particular case, together with some 2.060 --> 2.061 changes.

 Are these results comparable to what other people are getting?

 I can confirm that where code of mine is concerned, GDC still seems to have the edge in terms of executable speed ...

I've seen a tête-à-tête between LDC and GDC in some of my code.

-- 
Marco
Feb 13 2013
prev sibling parent "jerro" <a a.com> writes:
When you are comparing LDC and GDC, you should either use 
-mcpu=generic for ldc or -march=native for GDC, because their 
default targets are different. GDC will produce code that works 
on most x86_64 (if you are on a x86_64 system) CPUs by default, 
and LDC targets the host CPU. But this does not explain the 
difference in timings you are seeing here.
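
(Using the flags named in this post, a levelled comparison might look like this, with tmp.d standing in for the benchmark source:

ldc2 -O3 -release -mcpu=generic tmp.d
gdc -O3 -frelease -march=native tmp.d -o tmp

That is, either pin LDC down to a generic CPU or let GDC target the host.)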

One reason why the code generated by GDC is slower is that 
squarePlusMag isn't inlined. It seems that the fact that its 
parameter is const is somehow preventing it from being inlined - 
I have no idea why. Removing const and adding -march=native to 
gdc flags gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 8.283 [sec]
   using doubles Total time: 6.827 [sec]
   using reals Total time: 6.795 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 3.348 [sec]
   using doubles Total time: 3.08 [sec]
   using reals Total time: 4.174 [sec]

The difference is smaller, but still pretty large.

I have noticed that there are needless conversions in this code 
that are slowing down both GDC generated and LDC generated code. 
This code is a bit faster:

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
     struct ComplexStruct
     {
         TReal r;
         TReal i;

         TReal squarePlusMag(ComplexStruct another)
         {
             TReal r1 = r*r - i*i + another.r;
             TReal i1 = cast(TReal)2.0*i*r + another.i;

             r = r1;
             i = i1;

             return (r1*r1 + i1*i1);
         }
     }

     int juliaFunction( int x, int y )
     {
         auto c = ComplexStruct(0.8, 0.156);
         auto a = ComplexStruct(x, y);

         foreach (i; 0 .. 200)
             if (a.squarePlusMag(c) > cast(TReal) 1000)
                 return 0;
         return 1;
     }

     void kernel()
     {
         foreach (x; 0 .. DIM) {
             foreach (y; 0 .. DIM) {
                 juliaValue = juliaFunction( x, y );
             }
         }
     }
}

void main()
{
     writeln("D code serial with dimension " ~ toStringNow!DIM ~ " 
...");
     StopWatch sw;
     foreach (Math; TypeTuple!(float, double, real))
     {
         sw.start();
         Julia!(Math).kernel();
         sw.stop();
         writefln("  using %ss Total time: %s [sec]",
                  Math.stringof, (sw.peek().msecs * 0.001));
         sw.reset();
     }
}

This gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 6.746 [sec]
   using doubles Total time: 6.872 [sec]
   using reals Total time: 5.226 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 2.36 [sec]
   using doubles Total time: 2.535 [sec]
   using reals Total time: 4.106 [sec]

At least part of the difference is due to the fact that 
juliaFunction still isn't getting inlined (but squarePlusMag is). 
Making juliaFunction a static method of ComplexStruct causes it 
to get inlined (again, I have no idea why). Moving juliaFunction 
inside ComplexStruct does not affect the performance of LDC 
generated code, but for GDC it gives me:

   using floats Total time: 4.262 [sec]
   using doubles Total time: 4.251 [sec]
   using reals Total time: 3.512 [sec]
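
(For concreteness, a sketch of that restructuring - not code from the post itself; DIM and juliaValue are the globals from the listing above:

template Julia(TReal)
{
     struct ComplexStruct
     {
         TReal r;
         TReal i;

         TReal squarePlusMag(ComplexStruct another)
         {
             TReal r1 = r*r - i*i + another.r;
             TReal i1 = cast(TReal)2.0*i*r + another.i;

             r = r1;
             i = i1;

             return (r1*r1 + i1*i1);
         }

         // moved inside the struct as a static method, which is what
         // lets this GDC version inline it
         static int juliaFunction( int x, int y )
         {
             auto c = ComplexStruct(0.8, 0.156);
             auto a = ComplexStruct(x, y);

             foreach (i; 0 .. 200)
                 if (a.squarePlusMag(c) > cast(TReal) 1000)
                     return 0;
             return 1;
         }
     }

     void kernel()
     {
         foreach (x; 0 .. DIM) {
             foreach (y; 0 .. DIM) {
                 juliaValue = ComplexStruct.juliaFunction( x, y );
             }
         }
     }
}
)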

There is still a large difference between LDC and GDC for floats 
and doubles and I can't explain it. But at least it is much 
smaller than it was initially.

I ran all the benchmarks on 64-bit Linux, using a Core i5 2500K.
Feb 13 2013
prev sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Thanks a lot for your reply.
Feb 14 2013
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/12/2013 11:17 PM, FG wrote:
 Winblows and DMD 32-bit, the rest 64-bit, but still, dmd was quite fast.
 Interesting how gdc -O3 gave no extra boost vs. -O2.
... try adding -frelease to the gdc call?
Feb 13 2013