digitalmars.D - Performance

"Thomas" <t.leichner arcor.de> writes:

I made the following performance test, which adds 10^9 doubles,
on Linux with the latest dmd compiler in the Eclipse IDE and with
the GDC compiler, also on Linux. Then the same test was done with
C++ on Linux and with Scala in the Java ecosystem on Linux. All
the testing was done on the same PC.
The results for one addition are:

D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds


D-Source:

import std.stdio;
import std.datetime;
import std.string;
import core.time;


void main() {
   run!(plus)( 1000*1000*1000 );
}

class C {
}

string plus( int steps  )  {
   double sum = 1.346346;
   immutable double p0 = 0.0045;
   immutable double p1 = 1.00045452-p0;
   auto b = true;
   for( int i=0; i<steps; i++){
   	switch( b ){
	case true :
	  sum += p0;
	  break;
	default:
	  sum += p1;
	  break;
	}
	b = !b;	
   }
   return (format("%s  %f","plus\nLast: ", sum) );
//  return ("plus\nLast: ", sum );
}


void run( alias func )( int steps )
   if( is(typeof(func(steps)) == string)) {
   auto begin = Clock.currStdTime();
   string output = func( steps );
   auto end =  Clock.currStdTime();
   double nanotime = toNanos(end-begin)/steps;
   writeln( output );
   writeln( "Time per op: " , nanotime );
   writeln( );
}

double toNanos( long hns ) { return hns*100.0; }


Compiler settings for D:

dmd -c
-of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o
-release -inline -noboundscheck -O -w -version=Have_first
-Isource source/perf/testperf.d

gdc ./source/perf/testperf.d -frelease -o testperf

So what is the problem? Are the compiler switches wrong? Or is
D really this slow with these compilers? Can you help me?


Thomas

May 30 2014

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
   return (format("%s  %f","plus\nLast: ", sum) );

I haven't actually run this, but my guess is that the format
function is the slow part here. Did you create a new string
in the C++ version too?

 gdc ./source/perf/testperf.d -frelease -o testperf

The -O3 switch might help too, which turns on optimizations.
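
A minimal sketch of the point about format, written in C++ to match the C++ comparison in this thread. The names plus_loop and time_per_op_ns are hypothetical and not taken from the original test. The idea is only that the arithmetic loop is the sole work between the two clock reads, and that any string formatting happens after the second read:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Same alternating addition as the D test, but returning the raw double
// instead of a formatted string.
double plus_loop(std::uint32_t steps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    double sum = 1.346346;
    bool b = true;
    for (std::uint32_t i = 0; i < steps; ++i) {
        sum += b ? p0 : p1;
        b = !b;
    }
    return sum;
}

// Times only the call to plus_loop. The result is handed back through
// `result` so the caller can format and print it outside the timed region.
double time_per_op_ns(std::uint32_t steps, double& result) {
    assert(steps > 0);
    const auto begin = std::chrono::steady_clock::now();
    result = plus_loop(steps);
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = end - begin;
    return elapsed.count() / steps;
}
```

A caller would print `result` and the returned time per operation only after time_per_op_ns has returned, so the cost of building the output string never enters the measurement.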

May 30 2014

"anonymous" <anonymous example.com> writes:

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 I made the following performance test, which adds 10^9 doubles 
 on Linux with the latest dmd compiler in the Eclipse IDE and 
 with the Gdc-Compiler also on Linux. Then the same test was 
 done with C++ on Linux and with Scala in the Java ecosystem on 
 Linux. All the testing was done on the same PC.
 The results for one addition are:

 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds


 D-Source:

[...]
 Compiler settings for D:

[...]
 So what is the problem ? Are the compiler switches wrong ? Or 
 is D on the used compilers so slow ? Can you help me.

Sources and command lines for the other languages would be nice
for comparison.

May 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 I made the following performance test, which adds 10^9 doubles 
 on Linux with the latest dmd compiler in the Eclipse IDE and 
 with the Gdc-Compiler also on Linux. Then the same test was 
 done with C++ on Linux and with Scala in the Java ecosystem on 
 Linux. All the testing was done on the same PC.
 The results for one addition are:

 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds

Your code written in a more idiomatic way (I have commented out 
new language features):


import std.stdio, std.datetime;

double plus(in uint nSteps) pure nothrow @safe /* @nogc */ {
     enum double p0 = 0.0045;
     enum double p1 = 1.00045452-p0;

     double tot = 1.346346;
     auto b = true;

     foreach (immutable i; 0 .. nSteps) {
         final switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
}

void run(alias func, string funcName)(in uint nSteps) {
     StopWatch sw;
     sw.start;
     immutable result = func(nSteps);
     sw.stop;
     writeln(funcName);
     writefln("Last: %f", result);
     //writeln("Time per op: ", sw.peek.nsecs / real(nSteps));
     writeln("Time per op: ", sw.peek.nsecs / cast(real)nSteps);
}

void main() {
     run!(plus, "plus")(1_000_000_000U);
}

(But there is also a benchmark helper around).

ldmd2 -O -release -inline -noboundscheck test.d

Using LDC2 compiler, on my system the output is:

plus
Last: 500227252.496398
Time per op: 9.41424

Bye,
bearophile

May 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

 double plus(in uint nSteps) pure nothrow @safe /* @nogc */ {
     enum double p0 = 0.0045;
     enum double p1 = 1.00045452-p0;

     double tot = 1.346346;
     auto b = true;

     foreach (immutable i; 0 .. nSteps) {
         final switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
 }

And this is the 32 bit X86 asm generated by ldc2 for the plus 
function:

__D4test4plusFNaNbNfxkZd:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%esi
	andl	$-8, %esp
	subl	$24, %esp
	movsd	LCPI0_0, %xmm0
	testl	%eax, %eax
	je	LBB0_8
	xorl	%ecx, %ecx
	movb	$1, %dl
	movsd	LCPI0_1, %xmm1
	movsd	LCPI0_2, %xmm2
	.align	16, 0x90
LBB0_2:
	testb	$1, %dl
	jne	LBB0_3
	addsd	%xmm1, %xmm0
	jmp	LBB0_7
	.align	16, 0x90
LBB0_3:
	movzbl	%dl, %esi
	andl	$1, %esi
	je	LBB0_5
	addsd	%xmm2, %xmm0
LBB0_7:
	xorb	$1, %dl
	incl	%ecx
	cmpl	%eax, %ecx
	jb	LBB0_2
LBB0_8:
	movsd	%xmm0, 8(%esp)
	fldl	8(%esp)
	leal	-4(%ebp), %esp
	popl	%esi
	popl	%ebp
	ret
LBB0_5:
	movl	$11, 4(%esp)
	movl	$__D4test12__ModuleInfoZ, (%esp)
	calll	__d_switch_error

Bye,
bearophile

May 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

This C++ code:

double plus(const unsigned int nSteps) {
     const double p0 = 0.0045;
     const double p1 = 1.00045452-p0;

     double tot = 1.346346;
     bool b = true;

     for (unsigned int i = 0; i < nSteps; i++) {
         switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
}


G++ 4.8.0 gives the asm (using -Ofast, that implies unsafe FP 
optimizations):

__Z4plusj:
	movl	4(%esp), %ecx
	testl	%ecx, %ecx
	je	L7
	fldl	LC0
	xorl	%edx, %edx
	movl	$1, %eax
	fldl	LC2
	jmp	L6
	.p2align 4,,7
L11:
	fxch	%st(1)
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	faddl	LC1
	je	L12
	fxch	%st(1)
L6:
	cmpb	$1, %al
	je	L11
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	fadd	%st, %st(1)
	jne	L6
	fstp	%st(0)
	jmp	L10
	.p2align 4,,7
L12:
	fstp	%st(1)
L10:
	rep ret
L7:
	fldl	LC0
	ret

Bye,
bearophile

May 30 2014

Orvid King <blah38621 gmail.com> writes:

On 5/30/2014 9:30 AM, bearophile wrote:
 [...]
 And this is the 32 bit X86 asm generated by ldc2 for the plus function:
 [...]


Well, I'd argue that in fact neither the C++ nor the D compiler generated the
fastest possible code here, as this code will result in at least 3,
likely more, and potentially even every branch being mispredicted. After
checking the throughput numbers for fadd (on Haswell only), I would
argue that the fastest code here would actually compute both sides
of the branch and use a set of 4 cmovs (because it's x86 and
we're working with doubles) to select the one we need to
use going forward.
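
The "compute both sides, then select" idea can be sketched at source level as follows, in C++ for comparison with the C++ code in this thread. The function names are hypothetical. Whether a compiler really turns the selection into conditional moves rather than a branch depends on the compiler, the target, and the flags, so this only expresses the intent and is not a guaranteed way to get branch-free machine code:

```cpp
#include <cstdint>

// Reference version with an explicit branch on the flag.
double plus_branch(std::uint32_t steps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    bool b = true;
    for (std::uint32_t i = 0; i < steps; ++i) {
        if (b)
            tot += p0;
        else
            tot += p1;
        b = !b;
    }
    return tot;
}

// Computes both candidate sums on every iteration and then selects one,
// which gives the optimizer the option of emitting a select/cmov sequence.
double plus_select(std::uint32_t steps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    bool b = true;
    for (std::uint32_t i = 0; i < steps; ++i) {
        const double with_p0 = tot + p0;
        const double with_p1 = tot + p1;
        tot = b ? with_p0 : with_p1;
        b = !b;
    }
    return tot;
}
```

Both functions perform the same additions in the same order, so they should return identical results. Only the generated code, which would have to be checked in the assembler output, may differ.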

May 30 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Fri, 2014-05-30 at 13:35 +0000, Thomas via Digitalmars-d wrote:
 I made the following performance test, which adds 10^9 doubles 
 on Linux with the latest dmd compiler in the Eclipse IDE and with 
 the Gdc-Compiler also on Linux. Then the same test was done with 
 C++ on Linux and with Scala in the Java ecosystem on Linux. All 
 the testing was done on the same PC.
 The results for one addition are:
 
 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds

A priori I would believe there is a problem with these numbers: my
experience of CPU-bound D code is that it is generally as fast as C++.

[…]
 Compiler settings for D:
 
 dmd -c
 -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o
 -release -inline -noboundscheck -O -w -version=Have_first
 -Isource source/perf/testperf.d
 
 gdc ./source/perf/testperf.d -frelease -o testperf
 
 So what is the problem ? Are the compiler switches wrong ? Or is 
 D on the used compilers so slow ? Can you help me.

What is the C++ code you compare against?

What is the Scala code you compare against? Did you try Java and static
Groovy as well?

What command lines did you use to generate all the binaries?

Without the data to compare, it is hard to compare and help.

One obvious thing, though: the gdc command line has no optimization turned
on. You probably want -O3, or at least -O2, there.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

May 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Russel Winder:

 A priori I would believe there is a problem with these numbers: my
 experience of CPU-bound D code is that it is generally as fast 
 as C++.

The C++ code I've shown above if compiled with -Ofast seems 
faster than the D code compiled with ldc2.

Bye,
bearophile

May 30 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Fri, 2014-05-30 at 19:58 +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:
 
 A priori I would believe there is a problem with these numbers: my
 experience of CPU-bound D code is that it is generally as fast 
 as C++.

 
 The C++ code I've shown above if compiled with -Ofast seems 
 faster than the D code compiled with ldc2.

I am assuming you are comparing C++/clang with D/ldc2; it is only
reasonable to compare C++/g++ with D/gdc. I am not sure about other
compilers.

Of course there is then the question of whether C++/clang is
better/worse than C++/g++.

Lots of fun experimentation and data analysis to be had here, if only
there were microbenchmarking frameworks for C++ as well as D ;-) 

-- 
Russel.

May 30 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/30/2014 6:35 AM, Thomas wrote:
 So what is the problem ?

Usually, the problem will be obvious from looking at the generated assembler.

May 30 2014

"David Nadlinger" <code klickverbot.at> writes:

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 gdc ./source/perf/testperf.d -frelease -o testperf

This effectively compiles the program without optimizations. Try 
-O3 or -Ofast.

David

May 30 2014

Marco Leise <Marco.Leise gmx.de> writes:

Run this with: -O3 -frelease -fno-assert -fno-bounds-check -march=native
This way GCC and LLVM will recognize that you alternately add
p0 and p1 to the sum and partially unroll the loop, thereby
removing the condition. It takes 1.4xxxx nanoseconds per step
on my not so new 2.0 Ghz notebook, so I assume your PC will
easily reach parity with your original C++ version.



import std.stdio;
import core.time;

alias ℕ = size_t;

void main()
{
	run!plus(1_000_000_000);
}

double plus(ℕ steps)
{
	enum p0 = 0.0045;
	enum p1 = 1.00045452 - p0;

	double sum = 1.346346;
	foreach (i; 0 .. steps)
		sum += i%2 ? p1 : p0;
	return sum;
}

void run(alias func)(ℕ steps)
{
	auto t1 = TickDuration.currSystemTick;
	auto output = func(steps);
	auto t2 = TickDuration.currSystemTick;
	auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / TickDuration.ticksPerSec;
	writefln("Last: %s", output);
	writefln("Time per op: %s", nanotime);
	writeln();
}
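
What the partial unrolling described above amounts to at source level can be sketched like this, in C++ for comparison with the C++ code earlier in the thread. The names plus_reference and plus_unrolled are hypothetical, and the transformation a real compiler applies may look different. The sketch only shows why the per-iteration condition can disappear:

```cpp
#include <cstddef>

constexpr double p0 = 0.0045;
constexpr double p1 = 1.00045452 - p0;

// Loop as written in the D code above: the addend depends on the parity of i.
double plus_reference(std::size_t steps) {
    double sum = 1.346346;
    for (std::size_t i = 0; i < steps; ++i)
        sum += (i % 2) ? p1 : p0;
    return sum;
}

// Hand-unrolled by two: each pass handles one even and one odd step,
// so no parity test is needed inside the loop body.
double plus_unrolled(std::size_t steps) {
    double sum = 1.346346;
    std::size_t i = 0;
    for (; i + 1 < steps; i += 2) {
        sum += p0; // even i
        sum += p1; // odd i
    }
    if (i < steps)
        sum += p0; // one leftover even step when steps is odd
    return sum;
}
```

The additions happen in the same order in both versions, so the floating-point result is unchanged. Only the loop control differs.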

-- 
Marco

May 30 2014

"Thomas" <t.leichner arcor.de> writes:

On Saturday, 31 May 2014 at 05:12:54 UTC, Marco Leise wrote:
 Run this with: -O3 -frelease -fno-assert -fno-bounds-check 
 -march=native
 This way GCC and LLVM will recognize that you alternately add
 p0 and p1 to the sum and partially unroll the loop, thereby
 removing the condition. It takes 1.4xxxx nanoseconds per step
 on my not so new 2.0 Ghz notebook, so I assume your PC will
 easily reach parity with your original C++ version.



 import std.stdio;
 import core.time;

 alias ℕ = size_t;

 void main()
 {
 	run!plus(1_000_000_000);
 }

 double plus(ℕ steps)
 {
 	enum p0 = 0.0045;
 	enum p1 = 1.00045452 - p0;

 	double sum = 1.346346;
 	foreach (i; 0 .. steps)
 		sum += i%2 ? p1 : p0;
 	return sum;
 }

 void run(alias func)(ℕ steps)
 {
 	auto t1 = TickDuration.currSystemTick;
 	auto output = func(steps);
 	auto t2 = TickDuration.currSystemTick;
 	auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / 
 TickDuration.ticksPerSec;
 	writefln("Last: %s", output);
 	writefln("Time per op: %s", nanotime);
 	writeln();
 }


Thank you for the help. Which OS is running on your notebook?
I ask because I compiled your source code with your settings with the GCC
compiler, and the run took 3.1xxxx nanoseconds per step. With the DMD
compiler the run took 5.xxxx nanoseconds. So I think the problem
could be specific to the Linux versions of the GCC and the DMD
compilers.


Thomas

May 31 2014

Marco Leise <Marco.Leise gmx.de> writes:

On Sat, 31 May 2014 17:44:23 +0000,
"Thomas" <t.leichner arcor.de> wrote:

 Thank you for the help. Which OS is running on your notebook ? 
 For I compiled your source code with your settings with the GCC 
 compiler. The run took 3.1xxxx nanoseconds per step. For the DMD 
 compiler the run took 5.xxxx nanoseconds. So I think the problem 
 could be specific to the linux versions of the GCC and the DMD 
 compilers.
 
 
 Thomas

Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make
out a good reason why the runtime should depend on the OS so
much.
Are you sure you don't run on a PC from 2000 and did you use
the compiler flags I gave on top of my post? Did you disable
CPU power saving and was no other process running at the same
time?
By the way I get very similar results when using the LDC
compiler.

-- 
Marco

May 31 2014

"Thomas" <t.leichner arcor.de> writes:

On Sunday, 1 June 2014 at 03:33:36 UTC, Marco Leise wrote:
 On Sat, 31 May 2014 17:44:23 +0000,
 "Thomas" <t.leichner arcor.de> wrote:

 Thank you for the help. Which OS is running on your notebook ? 
 For I compiled your source code with your settings with the 
 GCC compiler. The run took 3.1xxxx nanoseconds per step. For 
 the DMD compiler the run took 5.xxxx nanoseconds. So I think 
 the problem could be specific to the linux versions of the GCC 
 and the DMD compilers.
 
 
 Thomas

 Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make
 out a good reason why the runtime should depend on the OS so
 much.
 Are you sure you don't run on a PC from 2000 and did you use
 the compiler flags I gave on top of my post? Did you disable
 CPU power saving and was no other process running at the same
 time?
 By the way I get very similar results when using the LDC
 compiler.

My PC is 5 years old. Of course I used your flags. Besides, I am
not an idiot: I have been programming for 20 years and have used 6
different programming languages. I didn't post this just for fun;
I am evaluating D as a language for numerical programming.

Thomas

Jun 02 2014

Marco Leise <Marco.Leise gmx.de> writes:

On Mon, 02 Jun 2014 10:57:24 +0000,
"Thomas" <t.leichner arcor.de> wrote:

 My PC is 5 years old. Of course I used your flags. Besides, I am
 not an idiot: I have been programming for 20 years and have used 6
 different programming languages. I didn't post this just for fun;
 I am evaluating D as a language for numerical programming.
 
 Thomas

You posted a benchmark comparing 3 languages, providing
only the source code for one, and didn't even run an optimized
compile. That had me thinking. :)
Back on topic: Any chance we can see the C++ code so we can
compare more directly? It's hard to compare the numbers only
for the D version when everyone has different system specs.
Also you say your PC is 5 years old. Is your system 32-bit
then? That would certainly affect the efficiency of loading and
storing 64-bit floating point values and might be a clue in
the right direction. I don't want to believe that the OS has
an effect on a loop that doesn't make any calls to the OS.

-- 
Marco

Jun 03 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Tuesday, 3 June 2014 at 11:25:31 UTC, Marco Leise wrote:
 I don't want to believe that the OS has
 an effect on a loop that doesn't make any calls to the OS.

There's always the scheduler, swap etc. Not that they should have
any effect on *this* benchmark of course.

Jun 03 2014

dennis luehring <dl.soluz gmx.net> writes:

faulty benchmark

-do not benchmark "format"

-use a dummy var - just add (overflow is not a problem) your plus()
results to it and return that from your main - preventing dead-code
elimination in any way

-introduce some sort of random value into your plus() code, for example
use a random generator or the int-casted pointer to the program args as
the start value

-do not benchmark anything without millions of loops - use the average
as the result

anything else does not make sense

May 30 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, 2014-05-31 at 07:32 +0200, dennis luehring via Digitalmars-d
wrote:
 faulty benchmark

Indeed.

 -do not benchmark "format"
 
 -use a dummy var - just add (overflow is not a problem) your plus()
 results to it and return that from your main - preventing dead-code
 elimination in any way
 
 -introduce some sort of random value into your plus() code, for example
 use a random generator or the int-casted pointer to the program args as
 the start value
 
 -do not benchmark anything without millions of loops - use the average
 as the result
 
 anything else does not make sense

As well as the average (mean), you must provide standard deviation and
degrees of freedom so that a proper error analysis and t-tests are
feasible. Or put it another way: even if you quote a mean without knowing
how many in the sample and what the spread is you cannot judge the error
and so cannot make deductions or inferences. 
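
As a rough sketch of the quantities named here, in C++ to match the comparison language of the thread: given a set of repeated timings, the mean, the sample standard deviation, and the degrees of freedom can be computed as below. The struct and function names are hypothetical, and the use of the n - 1 divisor assumes the timings are treated as a sample rather than a full population:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct SampleStats {
    double mean;
    double stddev;                  // sample standard deviation (n - 1 divisor)
    std::size_t degrees_of_freedom; // n - 1
};

// Summarizes repeated benchmark timings (for example in nanoseconds).
SampleStats summarize(const std::vector<double>& times_ns) {
    assert(times_ns.size() >= 2);
    const std::size_t n = times_ns.size();

    double total = 0.0;
    for (double t : times_ns)
        total += t;
    const double mean = total / n;

    double squared_dev = 0.0;
    for (double t : times_ns)
        squared_dev += (t - mean) * (t - mean);

    return {mean, std::sqrt(squared_dev / (n - 1)), n - 1};
}
```

Whether a t-test on these numbers is meaningful depends on the distribution of the timings, which is exactly what the replies below dispute.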

-- 
Russel.

May 30 2014

dennis luehring <dl.soluz gmx.net> writes:

On 31.05.2014 08:36, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible.

average means average of benchmarked times

and the dummy values are only for keeping the compiler from removing
anything it can reduce at compile time - that makes benchmarks
comparable, these values do not change the algorithm or result
quality in any way - it's more like an overflowing second output based on
the result of the original algorithm (but should be just a simple
addition or subtraction - ignoring overflow etc.)

that's the base of all types of non-stupid benchmarking - the next/pro step
is to look at the resulting assembler code

May 31 2014

dennis luehring <dl.soluz gmx.net> writes:

On 31.05.2014 13:25, dennis luehring wrote:
 [...]

so the anti-optimizer-overflowing-second-output aka AOOSO should be

initialized outside of the test function with a random value - i normally
use the pointer to the main args as int

the AOOSO should be incremented by the needed result of the benchmarked
algorithm - that could be an int-casted float/double value, the variable
size of a string or whatever varies enough and is needed enough to count as used

and then return the AOOSO as main return

so the original algorithm isn't changed but the compiler has absolutely
no way to eliminate the usage and the end output of this AOOSO dummy value

yes it ignores that the code size (cache problems) is changed by the
AOOSO incrementation - that's the reason for simple casting/overflowing
integer stuff here, but if the benchmarking goes that deep you should
better take a look at the assembler-level
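
A small sketch of the AOOSO recipe, in C++ to match the comparison language of the thread. The names aooso, plus_for_bench and run_benchmarks are hypothetical, and the sketch assumes main simply does `return run_benchmarks(argc, argv);`. It only keeps the result observable. It does not by itself stop a compiler from folding a computation whose inputs are all compile-time constants, which is what the random start value mentioned earlier in the thread is for:

```cpp
#include <cstdint>

// The benchmarked algorithm, unchanged by the dummy-value bookkeeping.
double plus_for_bench(std::uint32_t steps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    double sum = 1.346346;
    bool b = true;
    for (std::uint32_t i = 0; i < steps; ++i) {
        sum += b ? p0 : p1;
        b = !b;
    }
    return sum;
}

// Intended to be called as `return run_benchmarks(argc, argv);` from main.
int run_benchmarks(int argc, char** argv) {
    // Seed the dummy value from something only known at run time.
    std::uintptr_t aooso = reinterpret_cast<std::uintptr_t>(argv) +
                           static_cast<std::uintptr_t>(argc);

    const double result = plus_for_bench(1000000u);

    // Fold the result into the dummy value. Unsigned arithmetic wraps
    // around on overflow, so this is well defined.
    aooso += static_cast<std::uintptr_t>(result);

    // Keep the exit status in a small, portable range.
    return static_cast<int>(aooso & 0x7f);
}
```

Because the process exit status depends on the result, the computation feeding it cannot simply be discarded as dead code.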

May 31 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

No. Elapsed time in a benchmark does not follow a Student or Gaussian 
distribution. Use the mode or (better) the minimum. -- Andrei
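
A minimal sketch of "use the minimum", in C++ to match the comparison language of the thread: run the same work several times and keep the fastest observed time. The helper name min_time_ns is hypothetical, and the choice of std::chrono::steady_clock is an assumption, not something prescribed in this thread:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` `runs` times and reports the fastest observed wall time in ns.
template <typename Work>
double min_time_ns(Work work, int runs) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        const auto begin = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::nano> elapsed = end - begin;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

The work passed in should produce a result that is actually used afterwards, otherwise the optimizer may remove it and the measured minimum means nothing.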

May 31 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

 
 No. Elapsed time in a benchmark does not follow a Student or Gaussian 
 distribution. Use the mode or (better) the minimum. -- Andrei

We almost certainly need to unpack that more. I agree that behind my
comment was an implicit assumption of a normal distribution of results.
This is an easy assumption to make even if it is wrong. So is it
provably wrong? What is the distribution? If we know that then there is
knowledge of the parameters which then allow for statistical inference
and deduction.

-- 
Russel.

May 31 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/31/14, 7:10 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

 No. Elapsed time in a benchmark does not follow a Student or Gaussian
 distribution. Use the mode or (better) the minimum. -- Andrei

 We almost certainly need to unpack that more. I agree that behind my
 comment was an implicit assumption of a normal distribution of results.
 This is an easy assumption to make even if it is wrong. So is it
 provably wrong? What is the distribution? If we know that then there is
 knowledge of the parameters which then allow for statistical inference
 and deduction.

Well there's quantization noise which has uniform distribution. Then all 
other sources of noise are additive (no noise may make code run faster). 
So I speculate that the pdf is a half Gaussian mixed with a uniform 
distribution. Taking the mode (which is very close to the minimum in my 
measurements) would be the most accurate way to go. Taking the average 
would end up in some weird point on the half-Gaussian slope.
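
The argument can be illustrated with a small simulation, sketched here in C++ to match the comparison language of the thread. Everything about the noise model below is an assumption made for illustration (a half-Gaussian delay with sigma 0.5 ns plus uniform quantization noise up to 0.1 ns), not measured data, and the names are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

struct Summary {
    double mean;
    double minimum;
};

// Simulates `n` timings of an operation whose noise-free cost is `true_ns`.
// Every noise term is non-negative, so a timing is never below true_ns.
Summary simulate(double true_ns, int n, unsigned seed) {
    assert(n > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 0.5);       // assumed sigma
    std::uniform_real_distribution<double> quant(0.0, 0.1); // assumed tick

    double total = 0.0;
    double minimum = 0.0;
    for (int i = 0; i < n; ++i) {
        // Half-Gaussian delay (absolute value) plus quantization noise.
        const double t = true_ns + std::fabs(gauss(rng)) + quant(rng);
        total += t;
        minimum = (i == 0) ? t : std::min(minimum, t);
    }
    return {total / n, minimum};
}
```

Under this model the mean is pulled upward by the one-sided noise, while the minimum over many samples stays close to the noise-free cost.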

Andrei

May 31 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]
 
 Well there's quantization noise which has uniform distribution. Then all 
 other sources of noise are additive (no noise may make code run faster). 
 So I speculate that the pdf is a half Gaussian mixed with a uniform 
 distribution. Taking the mode (which is very close to the minimum in my 
 measurements) would be the most accurate way to go. Taking the average 
 would end up in some weird point on the half-Gaussian slope.

I sense you are taking the piss.

-- 
Russel.

May 31 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 […]
 Well there's quantization noise which has uniform distribution. Then all
 other sources of noise are additive (no noise may make code run faster).
 So I speculate that the pdf is a half Gaussian mixed with a uniform
 distribution. Taking the mode (which is very close to the minimum in my
 measurements) would be the most accurate way to go. Taking the average
 would end up in some weird point on the half-Gaussian slope.

 I sense you are taking the piss.

I don't know the idiom - what does it mean? Something nice I hope :o). 
-- Andrei

May 31 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/31/14, 2:42 PM, Andrei Alexandrescu wrote:
 On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 […]
 Well there's quantization noise which has uniform distribution. Then all
 other sources of noise are additive (no noise may make code run faster).
 So I speculate that the pdf is a half Gaussian mixed with a uniform
 distribution. Taking the mode (which is very close to the minimum in my
 measurements) would be the most accurate way to go. Taking the average
 would end up in some weird point on the half-Gaussian slope.

 I sense you are taking the piss.

 I don't know the idiom - what does it mean? Something nice I hope :o).
 -- Andrei

Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to 
take it in context; I am being serious, and basing myself on 
measurements taken while designing and implementing 
https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md.

Andrei

May 31 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, 2014-05-31 at 14:45 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]
 Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to 
 take it in context; I am being serious, and basing myself on 
 measurements taken while designing and implementing 
 https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md.

My apologies for being abrupt and ill-considered, and hence potentially
rude. Long story.

I'll cogitate on the ideas this morning and see what I can chip in
constructively to take things along.

I will also ask Aleksey Shipilëv what underpinnings he is using for JMH
to see if there is some useful cross-fertilization.
  
-- 
Russel.

May 31 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Saturday, 31 May 2014 at 14:01:52 UTC, Andrei Alexandrescu 
wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard 
 deviation and
 degrees of freedom so that a proper error analysis and t-tests 
 are
 feasible. Or put it another way: if you quote a mean without knowing
 how many are in the sample and what the spread is, you cannot judge
 the error and so cannot make deductions or inferences.

 No. Elapsed time in a benchmark does not follow a Student or 
 Gaussian distribution. Use the mode or (better) the minimum. -- 
 Andrei

Well... It depends on what you're looking to do with the result. 
As you say though, micro-benchmarks of code-quality should always 
be judged on the minimum of a large sample.
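
As a rough sketch of "the minimum of a large sample", a harness along the following lines could be used. It assumes a recent Phobos that provides `std.datetime.stopwatch`. The function `fastestNsecs`, the workload, and the trial count are hypothetical names and values chosen for illustration, not anything posted in this thread.

```d
import std.algorithm.comparison : min;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

/// Runs `fun` `trials` times and returns the fastest observed
/// wall-clock time in nanoseconds (hypothetical helper).
long fastestNsecs(scope void delegate() fun, size_t trials)
{
    long best = long.max;
    foreach (_; 0 .. trials)
    {
        auto sw = StopWatch(AutoStart.yes); // starts timing immediately
        fun();
        best = min(best, sw.peek.total!"nsecs");
    }
    return best;
}

void main()
{
    double acc = 0.0;

    // Hypothetical workload; the result is accumulated and printed so the
    // optimizer has less room to discard the loop.
    void work()
    {
        foreach (i; 0 .. 100_000)
            acc += i * 0.5;
    }

    immutable ns = fastestNsecs(&work, 1_000);
    writefln("fastest of 1000 trials: %s ns (checksum %s)", ns, acc);
}
```

Reporting the minimum rather than the mean follows the advice quoted above for micro-benchmarks; for workloads where the noise is part of what is being measured, such as networking, a different summary may be more appropriate.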

May 31 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/30/14, 10:32 PM, dennis luehring wrote:
 -do not benchmark anything without millions of loops - use the average
 as the result

Use the minimum unless networking is involved. -- Andrei

May 31 2014

"Narendra Modi" <namo namo.com> writes:

On Saturday, 31 May 2014 at 13:59:40 UTC, Andrei Alexandrescu
wrote:
 On 5/30/14, 10:32 PM, dennis luehring wrote:
 -do not benchmark anything without millions of loops - use the 
 average
 as the result

 Use the minimum unless networking is involved. -- Andrei

cache??

May 31 2014
