digitalmars.D - Performance

reply "Thomas" <t.leichner arcor.de> writes:
I ran the following performance test, which adds 10^9 doubles, 
on Linux with the latest dmd compiler in the Eclipse IDE and with 
the gdc compiler, also on Linux. The same test was then done with 
C++ on Linux and with Scala in the Java ecosystem on Linux. All 
the testing was done on the same PC.
The results for one addition are:

D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds


D-Source:

import std.stdio;
import std.datetime;
import std.string;
import core.time;


void main() {
   run!(plus)( 1000*1000*1000 );
}

class C {
}

string plus( int steps  )  {
   double sum = 1.346346;
   immutable double p0 = 0.0045;
   immutable double p1 = 1.00045452-p0;
   auto b = true;
   for( int i=0; i<steps; i++){
   	switch( b ){
	case true :
	  sum += p0;
	  break;
	default:
	  sum += p1;
	  break;
	}
	b = !b;	
   }
   return (format("%s  %f","plus\nLast: ", sum) );
//  return ("plus\nLast: ", sum );
}


void run( alias func )( int steps )
   if( is(typeof(func(steps)) == string)) {
   auto begin = Clock.currStdTime();
   string output = func( steps );
   auto end =  Clock.currStdTime();
   double nanotime = toNanos(end-begin)/steps;
   writeln( output );
   writeln( "Time per op: " , nanotime );
   writeln( );
}

double toNanos( long hns ) { return hns*100.0; }


Compiler settings for D:

dmd -c 
-of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o 
-release -inline -noboundscheck -O -w -version=Have_first 
-Isource source/perf/testperf.d

gdc ./source/perf/testperf.d -frelease -o testperf

So what is the problem? Are the compiler switches wrong? Or is 
D really this slow with these compilers? Can you help me?


Thomas
May 30 2014
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
   return (format("%s  %f","plus\nLast: ", sum) );

I haven't actually run this but my guess is that the format function is the slowish thing here. Did you create a new string in the C version too?
 gdc ./source/perf/testperf.d -frelease -o testperf

The -O3 switch might help too, which turns on optimizations.
May 30 2014
prev sibling next sibling parent "anonymous" <anonymous example.com> writes:
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 I made the following performance test, which adds 10^9 Double’s 
 on Linux with the latest dmd compiler in the Eclipse IDE and 
 with the Gdc-Compiler also on Linux. Then the same test was 
 done with C++ on Linux and with Scala in the Java ecosystem on 
 Linux. All the testing was done on the same PC.
 The results for one addition are:

 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds


 D-Source:

 Compiler settings for D:

 So what is the problem ? Are the compiler switches wrong ? Or 
 is D on the used compilers so slow ? Can you help me.

Sources and command lines for the other languages would be nice for comparison.
May 30 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 I made the following performance test, which adds 10^9 Double’s 
 on Linux with the latest dmd compiler in the Eclipse IDE and 
 with the Gdc-Compiler also on Linux. Then the same test was 
 done with C++ on Linux and with Scala in the Java ecosystem on 
 Linux. All the testing was done on the same PC.
 The results for one addition are:

 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds

Your code written in a more idiomatic way (I have commented out new language features):

import std.stdio, std.datetime;

double plus(in uint nSteps) pure nothrow @safe /* @nogc*/ {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452-p0;

    double tot = 1.346346;
    auto b = true;

    foreach (immutable i; 0 .. nSteps) {
        final switch (b) {
            case true:
                tot += p0;
                break;
            case false:
                tot += p1;
                break;
        }

        b = !b;
    }

    return tot;
}

void run(alias func, string funcName)(in uint nSteps) {
    StopWatch sw;
    sw.start;
    immutable result = func(nSteps);
    sw.stop;
    writeln(funcName);
    writefln("Last: %f", result);
    //writeln("Time per op: ", sw.peek.nsecs / real(nSteps));
    writeln("Time per op: ", sw.peek.nsecs / cast(real)nSteps);
}

void main() {
    run!(plus, "plus")(1_000_000_000U);
}

(But there is also a benchmark helper around).

ldmd2 -O -release -inline -noboundscheck test.d

Using LDC2 compiler, on my system the output is:

plus
Last: 500227252.496398
Time per op: 9.41424

Bye,
bearophile
May 30 2014
parent Orvid King <blah38621 gmail.com> writes:
On 5/30/2014 9:30 AM, bearophile wrote:
 double plus(in uint nSteps) pure nothrow @safe /* @nogc*/ {
     enum double p0 = 0.0045;
     enum double p1 = 1.00045452-p0;

     double tot = 1.346346;
     auto b = true;

     foreach (immutable i; 0 .. nSteps) {
         final switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
 }

 And this is the 32 bit X86 asm generated by ldc2 for the plus function:

 __D4test4plusFNaNbNfxkZd:
 	pushl	%ebp
 	movl	%esp, %ebp
 	pushl	%esi
 	andl	$-8, %esp
 	subl	$24, %esp
 	movsd	LCPI0_0, %xmm0
 	testl	%eax, %eax
 	je	LBB0_8
 	xorl	%ecx, %ecx
 	movb	$1, %dl
 	movsd	LCPI0_1, %xmm1
 	movsd	LCPI0_2, %xmm2
 	.align	16, 0x90
 LBB0_2:
 	testb	$1, %dl
 	jne	LBB0_3
 	addsd	%xmm1, %xmm0
 	jmp	LBB0_7
 	.align	16, 0x90
 LBB0_3:
 	movzbl	%dl, %esi
 	andl	$1, %esi
 	je	LBB0_5
 	addsd	%xmm2, %xmm0
 LBB0_7:
 	xorb	$1, %dl
 	incl	%ecx
 	cmpl	%eax, %ecx
 	jb	LBB0_2
 LBB0_8:
 	movsd	%xmm0, 8(%esp)
 	fldl	8(%esp)
 	leal	-4(%ebp), %esp
 	popl	%esi
 	popl	%ebp
 	ret
 LBB0_5:
 	movl	$11, 4(%esp)
 	movl	$__D4test12__ModuleInfoZ, (%esp)
 	calll	__d_switch_error

 Bye,
 bearophile

Well, I'd argue that neither the C++ nor the D compiler generated the fastest possible code here, as this code will result in at least 3, likely more, and potentially even every branch being mispredicted. After checking the throughput numbers for fadd (I only checked Haswell), I would argue that the fastest code here would actually compute both sides of the branch and use a set of 4 cmovs (because it's x86 and we're working with doubles) to select the one we need going forward.
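
As an illustration of the "compute both sides, then select" idea, here is a minimal sketch in D. It is not code from the thread: the names plusBranchy and plusSelect are hypothetical, and whether a given compiler really turns the ternary into conditional moves or an SSE blend is not guaranteed, so the generated assembly would still need to be checked.

```d
import std.stdio : writefln;

/// Reference version with a data-dependent branch, as discussed in the thread.
double plusBranchy(uint nSteps) pure nothrow @safe @nogc
{
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    bool b = true;
    foreach (i; 0 .. nSteps)
    {
        if (b)
            tot += p0;
        else
            tot += p1;
        b = !b;
    }
    return tot;
}

/// Hypothetical variant: evaluate both candidates, then select one of them.
double plusSelect(uint nSteps) pure nothrow @safe @nogc
{
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    bool b = true;
    foreach (i; 0 .. nSteps)
    {
        immutable withP0 = tot + p0; // result if b is true
        immutable withP1 = tot + p1; // result if b is false
        tot = b ? withP0 : withP1;   // candidate for a conditional move or blend
        b = !b;
    }
    return tot;
}

void main()
{
    // Both variants add the same values in the same order.
    writefln("branchy: %f", plusBranchy(1_000_001));
    writefln("select:  %f", plusSelect(1_000_001));
}
```

Note that the select form does two additions per step instead of one, so it can only win if the branch really is mispredicted often; for a strictly alternating flag that is an assumption to be measured, not a given.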
May 30 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 double plus(in uint nSteps) pure nothrow @safe /* @nogc*/ {
     enum double p0 = 0.0045;
     enum double p1 = 1.00045452-p0;

     double tot = 1.346346;
     auto b = true;

     foreach (immutable i; 0 .. nSteps) {
         final switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
 }

And this is the 32 bit X86 asm generated by ldc2 for the plus function:

__D4test4plusFNaNbNfxkZd:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%esi
	andl	$-8, %esp
	subl	$24, %esp
	movsd	LCPI0_0, %xmm0
	testl	%eax, %eax
	je	LBB0_8
	xorl	%ecx, %ecx
	movb	$1, %dl
	movsd	LCPI0_1, %xmm1
	movsd	LCPI0_2, %xmm2
	.align	16, 0x90
LBB0_2:
	testb	$1, %dl
	jne	LBB0_3
	addsd	%xmm1, %xmm0
	jmp	LBB0_7
	.align	16, 0x90
LBB0_3:
	movzbl	%dl, %esi
	andl	$1, %esi
	je	LBB0_5
	addsd	%xmm2, %xmm0
LBB0_7:
	xorb	$1, %dl
	incl	%ecx
	cmpl	%eax, %ecx
	jb	LBB0_2
LBB0_8:
	movsd	%xmm0, 8(%esp)
	fldl	8(%esp)
	leal	-4(%ebp), %esp
	popl	%esi
	popl	%ebp
	ret
LBB0_5:
	movl	$11, 4(%esp)
	movl	$__D4test12__ModuleInfoZ, (%esp)
	calll	__d_switch_error

Bye,
bearophile
May 30 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
This C++ code:

double plus(const unsigned int nSteps) {
     const double p0 = 0.0045;
     const double p1 = 1.00045452-p0;

     double tot = 1.346346;
     bool b = true;

     for (unsigned int i = 0; i < nSteps; i++) {
         switch (b) {
             case true:
                 tot += p0;
                 break;
             case false:
                 tot += p1;
                 break;
         }

         b = !b;
     }

     return tot;
}


G++ 4.8.0 gives the asm (using -Ofast, that implies unsafe FP 
optimizations):

__Z4plusj:
	movl	4(%esp), %ecx
	testl	%ecx, %ecx
	je	L7
	fldl	LC0
	xorl	%edx, %edx
	movl	$1, %eax
	fldl	LC2
	jmp	L6
	.p2align 4,,7
L11:
	fxch	%st(1)
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	faddl	LC1
	je	L12
	fxch	%st(1)
L6:
	cmpb	$1, %al
	je	L11
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	fadd	%st, %st(1)
	jne	L6
	fstp	%st(0)
	jmp	L10
	.p2align 4,,7
L12:
	fstp	%st(1)
L10:
	rep ret
L7:
	fldl	LC0
	ret

Bye,
bearophile
May 30 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Fri, 2014-05-30 at 13:35 +0000, Thomas via Digitalmars-d wrote:
 I made the following performance test, which adds 10^9 Double’s 
 on Linux with the latest dmd compiler in the Eclipse IDE and with 
 the Gdc-Compiler also on Linux. Then the same test was done with 
 C++ on Linux and with Scala in the Java ecosystem on Linux. All 
 the testing was done on the same PC.
 The results for one addition are:
 
 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds

A priori I would believe there is a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++. […]
 Compiler settings for D:
 
 dmd -c 
 -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o 
 -release -inline -noboundscheck -O -w -version=Have_first 
 -Isource source/perf/testperf.d
 
 gdc ./source/perf/testperf.d -frelease -o testperf
 
 So what is the problem ? Are the compiler switches wrong ? Or is 
 D on the used compilers so slow ? Can you help me.

What is the C++ code you compare against? What is the Scala code you compare against? Did you try Java and static Groovy as well? What command lines did you use to generate all the binaries? Without the data to compare, it is hard to compare and help.

One obvious thing, though: the gdc command line has no optimization turned on. You probably want -O3, or at least -O2, there.

-- Russel.
May 30 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Russel Winder:

 A priori I would believe there is a problem with these numbers: my
 experience of CPU-bound D code is that it is generally as fast 
 as C++.

The C++ code I've shown above if compiled with -Ofast seems faster than the D code compiled with ldc2. Bye, bearophile
May 30 2014
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2014 6:35 AM, Thomas wrote:
 So what is the problem ?

Usually, the problem will be obvious from looking at the generated assembler.
May 30 2014
prev sibling next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
 gdc ./source/perf/testperf.d -frelease -o testperf

This effectively compiles the program without optimizations. Try -O3 or -Ofast. David
May 30 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Run this with: -O3 -frelease -fno-assert -fno-bounds-check -march=native
This way GCC and LLVM will recognize that you alternately add
p0 and p1 to the sum and partially unroll the loop, thereby
removing the condition. It takes 1.4xxxx nanoseconds per step
on my not so new 2.0 Ghz notebook, so I assume your PC will
easily reach parity with your original C++ version.



import std.stdio;
import core.time;

alias ℕ = size_t;

void main()
{
	run!plus(1_000_000_000);
}

double plus(ℕ steps)
{
	enum p0 = 0.0045;
	enum p1 = 1.00045452 - p0;

	double sum = 1.346346;
	foreach (i; 0 .. steps)
		sum += i%2 ? p1 : p0;
	return sum;
}

void run(alias func)(ℕ steps)
{
	auto t1 = TickDuration.currSystemTick;
	auto output = func(steps);
	auto t2 = TickDuration.currSystemTick;
	auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / TickDuration.ticksPerSec;
	writefln("Last: %s", output);
	writefln("Time per op: %s", nanotime);
	writeln();
}
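
To picture what "partially unroll the loop, thereby removing the condition" can amount to, here is a hand-unrolled sketch in D. It only illustrates the transformation and is not what any particular compiler is known to emit; plusUnrolled is a hypothetical name, and plus repeats the loop from the post above for comparison.

```d
import std.stdio : writefln;

/// The loop from the post: the increment is chosen by the parity of i.
double plus(size_t steps) pure nothrow @safe @nogc
{
    enum p0 = 0.0045;
    enum p1 = 1.00045452 - p0;
    double sum = 1.346346;
    foreach (i; 0 .. steps)
        sum += (i % 2) ? p1 : p0;
    return sum;
}

/// Hand-unrolled by two: each pass adds p0, then p1, with no test inside the loop.
double plusUnrolled(size_t steps) pure nothrow @safe @nogc
{
    enum p0 = 0.0045;
    enum p1 = 1.00045452 - p0;
    double sum = 1.346346;
    foreach (_; 0 .. steps / 2)
    {
        sum += p0; // even step
        sum += p1; // odd step
    }
    if (steps % 2)
        sum += p0; // leftover even step when steps is odd
    return sum;
}

void main()
{
    // The additions happen in the same order in both versions.
    writefln("plus:         %f", plus(1_000_001));
    writefln("plusUnrolled: %f", plusUnrolled(1_000_001));
}
```

The additions still form one dependent chain through sum, so unrolling removes the branch but not the latency of the floating-point adds.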

-- 
Marco
May 30 2014
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
faulty benchmark

-do not benchmark "format"

-use a dummy variable - just add your plus() results to it (overflow is 
not a problem) and return it from main - this prevents dead-code 
elimination in any form

-introduce some sort of random value into your plus() code, for example
use a random generator or the int-cast pointer to the program args as 
the start value

-do not benchmark anything without millions of loops - use the average 
as the result

anything else does not make sense
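
A minimal sketch of the points above in D, as one possible reading rather than the poster's own code. The extra start parameter of plus, the use of args.length as the run-time value, and the reduced step count are all assumptions made for illustration; std.datetime.stopwatch is the current Phobos module and is newer than the 2014 APIs used elsewhere in this thread.

```d
import std.stdio : writefln;
import std.datetime.stopwatch : StopWatch, AutoStart;

/// The kernel only: no string formatting inside the timed code.
double plus(size_t steps, double start) pure nothrow @safe @nogc
{
    enum p0 = 0.0045;
    enum p1 = 1.00045452 - p0;
    double sum = start;
    foreach (i; 0 .. steps)
        sum += (i % 2) ? p1 : p0;
    return sum;
}

int main(string[] args)
{
    // Start value depends on something known only at run time,
    // so the compiler cannot fold the whole loop at compile time.
    immutable double start = 1.346346 + args.length;
    enum size_t steps = 100_000_000;

    auto sw = StopWatch(AutoStart.yes);
    immutable result = plus(steps, start);
    sw.stop();

    writefln("Time per op: %.3f ns", sw.peek.total!"nsecs" / cast(double) steps);

    // Dummy output: fold the result into the exit code so it is observably used.
    return cast(int)(cast(long) result & 0x7F);
}
```

Returning the dummy value from main means the process usually exits with a non-zero status, which a shell or build tool will read as failure; printing the value instead is an equally valid way to keep it alive.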

On 30.05.2014 15:35, Thomas wrote:
 I made the following performance test, which adds 10^9 Double’s
 on Linux with the latest dmd compiler in the Eclipse IDE and with
 the Gdc-Compiler also on Linux. Then the same test was done with
 C++ on Linux and with Scala in the Java ecosystem on Linux. All
 the testing was done on the same PC.
 The results for one addition are:

 D-DMD: 3.1 nanoseconds
 D-GDC: 3.8 nanoseconds
 C++: 1.0 nanoseconds
 Scala: 1.0 nanoseconds


 D-Source:

 import std.stdio;
 import std.datetime;
 import std.string;
 import core.time;


 void main() {
     run!(plus)( 1000*1000*1000 );
 }

 class C {
 }

 string plus( int steps  )  {
     double sum = 1.346346;
     immutable double p0 = 0.0045;
     immutable double p1 = 1.00045452-p0;
     auto b = true;
     for( int i=0; i<steps; i++){
     	switch( b ){
 	case true :
 	  sum += p0;
 	  break;
 	default:
 	  sum += p1;
 	  break;
 	}
 	b = !b;	
     }
     return (format("%s  %f","plus\nLast: ", sum) );
 //  return ("plus\nLast: ", sum );
 }


 void run( alias func )( int steps )
     if( is(typeof(func(steps)) == string)) {
     auto begin = Clock.currStdTime();
     string output = func( steps );
     auto end =  Clock.currStdTime();
     double nanotime = toNanos(end-begin)/steps;
     writeln( output );
     writeln( "Time per op: " , nanotime );
     writeln( );
 }

 double toNanos( long hns ) { return hns*100.0; }


 Compiler settings for D:

 dmd -c
 -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o
 -release -inline -noboundscheck -O -w -version=Have_first
 -Isource source/perf/testperf.d

 gdc ./source/perf/testperf.d -frelease -o testperf

 So what is the problem ? Are the compiler switches wrong ? Or is
 D on the used compilers so slow ? Can you help me.


 Thomas

May 30 2014
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 31.05.2014 08:36, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible.

average means the average of the benchmarked times, and the dummy values are only there to keep the compiler from removing anything it can reduce at compile time - that makes benchmarks comparable. these values do not change the algorithm or the result quality in any way - it's more like an overflowing second output based on the result of the original algorithm (but it should be just a simple addition or subtraction - ignoring overflow etc.)

that's the basis of all types of non-stupid benchmarking - the next/pro step is to look at the resulting assembler code
May 31 2014
parent dennis luehring <dl.soluz gmx.net> writes:
On 31.05.2014 13:25, dennis luehring wrote:
 On 31.05.2014 08:36, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible.

average means the average of the benchmarked times, and the dummy values are only there to keep the compiler from removing anything it can reduce at compile time - that makes benchmarks comparable. these values do not change the algorithm or the result quality in any way - it's more like an overflowing second output based on the result of the original algorithm (but it should be just a simple addition or subtraction - ignoring overflow etc.)

that's the basis of all types of non-stupid benchmarking - the next/pro step is to look at the resulting assembler code

so the anti-optimizer-overflowing-second-output aka AOOSO should be initialized outside of the test function with a random value - i normally use the pointer to the main args as int

the AOOSO should be incremented by the needed result of the benchmarked algorithm - that could be an int-cast float/double value, the variable size of a string or whatever is floaty and needed enough to be used

and then return the AOOSO as the return value of main

so the original algorithm isn't changed, but the compiler has absolutely no way to remove the usage and the final output of this AOOSO dummy value

yes, it ignores that the code size (cache problems) is changed by the AOOSO incrementation - that's the reason for the simple casting/overflowing integer stuff here, but if the benchmarking goes that deep you had better take a look at the assembler level
May 31 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/14, 10:32 PM, dennis luehring wrote:
 -do not benchmark anything without millions of loops - use the average
 as the result

Use the minimum unless networking is involved. -- Andrei
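
A minimal sketch of "use the minimum" in D: time the same kernel several times and keep the smallest per-operation time. The names runs and best, the reduced step count, and the choice of five repetitions are assumptions for illustration; std.datetime.stopwatch is the current Phobos module, newer than the 2014 APIs used elsewhere in this thread.

```d
import std.stdio : writefln;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.algorithm.comparison : min;

/// Kernel under test, taken from the loop discussed in the thread.
double plus(size_t steps) pure nothrow @safe @nogc
{
    enum p0 = 0.0045;
    enum p1 = 1.00045452 - p0;
    double sum = 1.346346;
    foreach (i; 0 .. steps)
        sum += (i % 2) ? p1 : p0;
    return sum;
}

void main()
{
    enum size_t steps = 100_000_000;
    enum runs = 5;

    double best = double.infinity; // smallest time per op seen so far
    double last = 0.0;             // keep the result so the work is used

    foreach (r; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        last = plus(steps);
        sw.stop();
        best = min(best, sw.peek.total!"nsecs" / cast(double) steps);
    }

    writefln("Last: %f", last);
    writefln("Best time per op over %s runs: %.3f ns", runs, best);
}
```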
May 31 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei
May 31 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/14, 7:10 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei

We almost certainly need to unpack that more. I agree that behind my comment was an implicit assumption of a normal distribution of results. This is an easy assumption to make even if it is wrong. So is it provably wrong? What is the distribution? If we know that then there is knowledge of the parameters which then allow for statistical inference and deduction.

Well there's quantization noise which has uniform distribution. Then all other sources of noise are additive (no noise may make code run faster). So I speculate that the pdf is a half Gaussian mixed with a uniform distribution. Taking the mode (which is very close to the minimum in my measurements) would be the most accurate way to go. Taking the average would end up in some weird point on the half-Gaussian slope. Andrei
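
The speculated noise model can be tried out with a small simulation in D. This is only a sketch under assumed parameters: trueCost, tick and sigma are made-up values, the half-Gaussian term is generated with a Box-Muller transform, and real timing noise may well look different.

```d
import std.stdio : writefln;
import std.random : Mt19937, uniform;
import std.math : sqrt, log, cos, abs, PI;
import std.algorithm.comparison : min;

/// One simulated timing: true cost + quantization noise + one-sided noise.
double sample(ref Mt19937 rng, double trueCost, double tick, double sigma)
{
    // Quantization noise: uniform in [0, tick).
    immutable q = uniform(0.0, tick, rng);
    // Half-Gaussian noise via Box-Muller; u1 lies in (0, 1], so log(u1) is finite.
    immutable u1 = 1.0 - uniform(0.0, 1.0, rng);
    immutable u2 = uniform(0.0, 1.0, rng);
    immutable g = sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    return trueCost + q + sigma * abs(g);
}

void main()
{
    auto rng = Mt19937(42); // fixed seed so runs are repeatable
    enum trueCost = 1.0;    // assumed "real" time per op
    enum tick = 0.1;        // assumed clock granularity
    enum sigma = 0.5;       // assumed scale of the additive noise
    enum n = 10_000;

    double lowest = double.infinity;
    double total = 0.0;
    foreach (_; 0 .. n)
    {
        immutable t = sample(rng, trueCost, tick, sigma);
        lowest = min(lowest, t);
        total += t;
    }

    writefln("true cost: %.3f", trueCost);
    writefln("minimum:   %.3f", lowest);
    writefln("mean:      %.3f", total / n);
}
```

Under this model every sample is at least trueCost, so the minimum approaches the true cost from above as the sample grows, while the mean settles near trueCost + tick/2 + sigma*sqrt(2/pi), which depends on the noise rather than on the code being measured.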
May 31 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 […]
 Well there's quantization noise which has uniform distribution. Then all
 other sources of noise are additive (no noise may make code run faster).
 So I speculate that the pdf is a half Gaussian mixed with a uniform
 distribution. Taking the mode (which is very close to the minimum in my
 measurements) would be the most accurate way to go. Taking the average
 would end up in some weird point on the half-Gaussian slope.

I sense you are taking the piss.

I don't know the idiom - what does it mean? Something nice I hope :o). -- Andrei
May 31 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/14, 2:42 PM, Andrei Alexandrescu wrote:
 On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:
 On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
 wrote:
 […]
 Well there's quantization noise which has uniform distribution. Then all
 other sources of noise are additive (no noise may make code run faster).
 So I speculate that the pdf is a half Gaussian mixed with a uniform
 distribution. Taking the mode (which is very close to the minimum in my
 measurements) would be the most accurate way to go. Taking the average
 would end up in some weird point on the half-Gaussian slope.

I sense you are taking the piss.

I don't know the idiom - what does it mean? Something nice I hope :o). -- Andrei

Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to take it in context; I am being serious, and basing myself on measurements taken while designing and implementing https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md. Andrei
May 31 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, 2014-05-31 at 07:32 +0200, dennis luehring via Digitalmars-d
wrote:
 faulty benchmark

Indeed.
 -do not benchmark "format"
 
 -use a dummy-var - just add(overflow is not a problem) your plus() 
 results to it and return that in your main - preventing dead code 
 optimization in any way
 
 -introduce some sort of random-value into your plus() code, for example
 use an random-generator or the int-casted pointer to program args as 
 startup value
 
 -do not benchmark anything without millions of loops - use the average 
 as the result
 
 anything else does not makes sense

As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean, without knowing how many are in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences. -- Russel.
May 30 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Fri, 2014-05-30 at 19:58 +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:
 
 A priori I would believe there is a problem with these numbers: my
 experience of CPU-bound D code is that it is generally as fast 
 as C++.

The C++ code I've shown above if compiled with -Ofast seems faster than the D code compiled with ldc2.

I am assuming you are comparing C++/clang with D/ldc2; it is only reasonable to compare C++/g++ with D/gdc. I am not sure about other compilers. Of course there is then the question of whether C++/clang is better/worse than C++/g++. Lots of fun experimentation and data analysis to be had here, if only there were microbenchmarking frameworks for C++ as well as D ;-) -- Russel.
May 30 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard deviation and
 degrees of freedom so that a proper error analysis and t-tests are
 feasible. Or put it another way: even if you quote a mean without knowing
 how many in the sample and what the spread is you cannot judge the error
 and so cannot make deductions or inferences.

No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei

We almost certainly need to unpack that more. I agree that behind my comment was an implicit assumption of a normal distribution of results. This is an easy assumption to make even if it is wrong. So is it provably wrong? What is the distribution? If we know that then there is knowledge of the parameters which then allow for statistical inference and deduction. -- Russel.
May 31 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 31 May 2014 at 14:01:52 UTC, Andrei Alexandrescu 
wrote:
 On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:
 As well as the average (mean), you must provide standard 
 deviation and
 degrees of freedom so that a proper error analysis and t-tests 
 are
 feasible. Or put it another way: even if you quote a mean without 
 knowing
 how many in the sample and what the spread is you cannot judge 
 the error
 and so cannot make deductions or inferences.

No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei

Well... It depends on what you're looking to do with the result. As you say though, micro-benchmarks of code-quality should always be judged on the minimum of a large sample.
May 31 2014
prev sibling next sibling parent "Thomas" <t.leichner arcor.de> writes:
On Saturday, 31 May 2014 at 05:12:54 UTC, Marco Leise wrote:
 Run this with: -O3 -frelease -fno-assert -fno-bounds-check 
 -march=native
 This way GCC and LLVM will recognize that you alternately add
 p0 and p1 to the sum and partially unroll the loop, thereby
 removing the condition. It takes 1.4xxxx nanoseconds per step
 on my not so new 2.0 Ghz notebook, so I assume your PC will
 easily reach parity with your original C++ version.



 import std.stdio;
 import core.time;

 alias ℕ = size_t;

 void main()
 {
 	run!plus(1_000_000_000);
 }

 double plus(ℕ steps)
 {
 	enum p0 = 0.0045;
 	enum p1 = 1.00045452 - p0;

 	double sum = 1.346346;
 	foreach (i; 0 .. steps)
 		sum += i%2 ? p1 : p0;
 	return sum;
 }

 void run(alias func)(ℕ steps)
 {
 	auto t1 = TickDuration.currSystemTick;
 	auto output = func(steps);
 	auto t2 = TickDuration.currSystemTick;
 	auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / 
 TickDuration.ticksPerSec;
 	writefln("Last: %s", output);
 	writefln("Time per op: %s", nanotime);
 	writeln();
 }

Thank you for the help. Which OS is running on your notebook? I compiled your source code with your settings using the GCC compiler, and the run took 3.1xxxx nanoseconds per step. With the DMD compiler the run took 5.xxxx nanoseconds. So I think the problem could be specific to the Linux versions of the GCC and DMD compilers.

Thomas
May 31 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]
 
 Well there's quantization noise which has uniform distribution. Then all 
 other sources of noise are additive (no noise may make code run faster). 
 So I speculate that the pdf is a half Gaussian mixed with a uniform 
 distribution. Taking the mode (which is very close to the minimum in my 
 measurements) would be the most accurate way to go. Taking the average 
 would end up in some weird point on the half-Gaussian slope.

I sense you are taking the piss. -- Russel.
May 31 2014
prev sibling next sibling parent "Narendra Modi" <namo namo.com> writes:
On Saturday, 31 May 2014 at 13:59:40 UTC, Andrei Alexandrescu
wrote:
 On 5/30/14, 10:32 PM, dennis luehring wrote:
 -do not benchmark anything without millions of loops - use the 
 average
 as the result

Use the minimum unless networking is involved. -- Andrei

cache??
May 31 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Sat, 31 May 2014 17:44:23 +0000,
"Thomas" <t.leichner arcor.de> wrote:

 Thank you for the help. Which OS is running on your notebook ? 
 For I compiled your source code with your settings with the GCC 
 compiler. The run took 3.1xxxx nanoseconds per step. For the DMD 
 compiler the run took 5.xxxx nanoseconds. So I think the problem 
 could be specific to the linux versions of the GCC and the DMD 
 compilers.
 
 
 Thomas

Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make out a good reason why the runtime should depend on the OS so much. Are you sure you don't run on a PC from 2000 and did you use the compiler flags I gave on top of my post? Did you disable CPU power saving and was no other process running at the same time? By the way I get very similar results when using the LDC compiler. -- Marco
May 31 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, 2014-05-31 at 14:45 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]
 Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to 
 take it in context; I am being serious, and basing myself on 
 measurements taken while designing and implementing 
 https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md.

My apologies for being abrupt and ill-considered, and hence potentially rude. Long story. I'll cogitate on the ideas this morning and see what I can chip in constructively to take things along. I will also ask Aleksey Shipilëv what underpinnings he is using for JMH to see if there is some useful cross-fertilization. -- Russel.
May 31 2014
prev sibling next sibling parent "Thomas" <t.leichner arcor.de> writes:
On Sunday, 1 June 2014 at 03:33:36 UTC, Marco Leise wrote:
 On Sat, 31 May 2014 17:44:23 +0000,
 "Thomas" <t.leichner arcor.de> wrote:

 Thank you for the help. Which OS is running on your notebook ? 
 For I compiled your source code with your settings with the 
 GCC compiler. The run took 3.1xxxx nanoseconds per step. For 
 the DMD compiler the run took 5.xxxx nanoseconds. So I think 
 the problem could be specific to the linux versions of the GCC 
 and the DMD compilers.
 
 
 Thomas

Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make out a good reason why the runtime should depend on the OS so much. Are you sure you don't run on a PC from 2000 and did you use the compiler flags I gave on top of my post? Did you disable CPU power saving and was no other process running at the same time? By the way I get very similar results when using the LDC compiler.

My PC is 5 years old. Of course I used your flags. Besides, I am not an idiot; I have been programming for 20 years and have used 6 different programming languages. I didn't post this just for fun: I am evaluating D as a language for numerical programming.

Thomas
Jun 02 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 02 Jun 2014 10:57:24 +0000,
"Thomas" <t.leichner arcor.de> wrote:

 My PC is 5 years old. Of course I used your flags. Besides, I am 
 not an idiot; I have been programming for 20 years and have used 6 
 different programming languages. I didn't post this just for fun: 
 I am evaluating D as a language for numerical programming.
 
 Thomas

You posted a benchmark comparing 3 languages, provided the source code for only one of them, and didn't even run an optimized compile. That had me thinking. :)

Back on topic: Any chance we can see the C++ code so we can compare more directly? It's hard to compare the numbers only for the D version when everyone has different system specs. Also, you say your PC is 5 years old. Is your system 32-bit then? That would certainly affect the efficiency of loading and storing 64-bit floating point values and might be a clue in the right direction. I don't want to believe that the OS has an effect on a loop that doesn't make any calls to the OS.

-- Marco
Jun 03 2014
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 3 June 2014 at 11:25:31 UTC, Marco Leise wrote:
 I don't want to believe that the OS has
 an effect on a loop that doesn't make any calls to the OS.

There's always the scheduler, swap etc. Not that they should have any effect on *this* benchmark of course.
Jun 03 2014