digitalmars.D.learn - Request for timings

bearophile (15/15) Nov 24 2011 This is the nbody benchmark of the Shootout site:

Andrea Fontana (13/35) Nov 25 2011 Hmm reading code i verified that i can gain a 10% writing:

bearophile (4/10) Nov 25 2011 My version performs a sqrt and one multiplication, while your version pe...

Andrea Fontana (21/34) Nov 25 2011 Really? Dmd version and system? (here: dmd from git, ubuntu 11.10 64bit)

bearophile (4/4) Nov 25 2011 Andrea Fontana Wrote:

Andrea Fontana (5/9) Nov 26 2011 Maybe Jerry could test my edits and check for timing...

bearophile (5/10) Nov 26 2011 Take a look at the produced assembly code :-)

Jerry (43/59) Nov 25 2011 All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built

bearophile (15/21) Nov 26 2011 This is an uncommon thing, especially on 32 bit systems.

bearophile (5/6) Nov 27 2011 The 32bit assembly produced by the Intel Fortran compiler on that code, ...

bearophile <bearophileHUGS lycos.com> writes:

This is the nbody benchmark of the Shootout site:
http://shootout.alioth.debian.org/u32/performance.php?test=nbody

The fastest version is the Fortran one, probably thanks to vector operations that
allow better SIMD vectorization.

This is the C++ version:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

C++ version compiled with:
g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
The input parameter is n = 3_000_000 (but use a larger number, like 10 million or
more, if the run time is too short).

First D2 version (serial):
http://codepad.org/AdRSm2wP

Second D2 version, three times slower because of its vector ops, but more similar
to the Fortran version:
http://codepad.org/7O3mz9en

Is someone willing to take two (or more) timings using the LDC2 compiler (I have
only LDC1, besides DMD)? I'd like to know how much time it takes to run the
first D version *compared* to the C++ version :-) If you time the second D2
version too, even better.

Bye and thank you,
bearophile

Nov 24 2011

Andrea Fontana <advmail katamail.com> writes:

Hmm, reading the code I found that I can gain about 10% by writing:

                    imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
                    imd mag = dt / (dSquared * dSquared^^2);

Around line 115. Give it a try...


Nov 25 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrea Fontana:

 Hmm, reading the code I found that I can gain about 10% by writing:
 
                     imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
                     imd mag = dt / (dSquared * dSquared^^2);
 
 Around line 115. Give it a try... 

My version performs a sqrt and one multiplication, while your version performs
one sqrt and two multiplications. On my PC my version is faster.

Bye,
bearophile

Nov 25 2011

Andrea Fontana <advmail katamail.com> writes:

Really? Which DMD version and system? (Here: DMD from git, Ubuntu 11.10 64-bit.)

./app 50000000

Your version:
real	0m14.674s

My version:
real	0m13.644s

Your version:
imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
imd mag = dt / (dSquared * sqrt(dSquared));

My version:
imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
imd mag = dt / (dSquared * dSquared^^2);  // That is: dt / (dSquared^^3);

Maybe variable evaluation works better in my example?

Btw:

imd dSquared = sqrt(dx*dx + dy*dy + dz*dz);
imd mag = dt / (dSquared*dSquared*dSquared);

real	0m13.574s
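
For anyone who wants to compare the two formulations outside the full benchmark, here is a minimal sketch that isolates them. The names magA, magB and run are hypothetical and not taken from the nbody source, the input values are arbitrary, and it assumes a recent compiler where StopWatch lives in std.datetime.stopwatch. A micro-loop like this may well behave differently from the complete program, so treat its numbers only as a rough hint.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math;
import std.stdio : writefln;

// Form from the original program: one sqrt, one multiplication.
double magA(double dx, double dy, double dz, double dt)
{
    immutable dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
    return dt / (dSquared * sqrt(dSquared));
}

// Proposed form: one sqrt, two multiplications.
double magB(double dx, double dy, double dz, double dt)
{
    immutable d = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
    return dt / (d * d ^^ 2);
}

// Hypothetical driver: sums the results so the loop cannot be optimized away.
double run(alias f)(size_t n)
{
    double acc = 0;
    foreach (i; 0 .. n)
        acc += f(1.0 + i * 1e-9, 2.0, 3.0, 0.01);
    return acc;
}

void main()
{
    enum n = 10_000_000; // arbitrary iteration count
    auto sw = StopWatch(AutoStart.yes);
    immutable a = run!magA(n);
    immutable tA = sw.peek;
    sw.reset();
    immutable b = run!magB(n);
    immutable tB = sw.peek;
    writefln("A: sum=%s time=%s", a, tA);
    writefln("B: sum=%s time=%s", b, tB);
}
```

Both functions compute dt / d^3, so the two sums should agree up to rounding; only the elapsed times are of interest.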



Nov 25 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrea Fontana Wrote:

Yes, really :-) Timings taken with DMD 2.056+, 32 bit Vista OS.

Bye,
bearophile

Nov 25 2011

Andrea Fontana <advmail katamail.com> writes:

Maybe Jerry could test my edits and check for timing...

About "^^": have you tried to benchmark

1/(x ^^ n) vs x ^^ -n ?  (with n>0)

On my setup the second version is very, very slow.



Nov 26 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrea Fontana:

 About "^^": have you tried to benchmark
 
 1/(x ^^ n) vs x ^^ -n ?  (with n>0)
 
 On my setup the second version is very, very slow.

Take a look at the produced assembly code :-)
Maybe x ^^ n is rewritten into simple inline code, while x ^^ -n calls the pow function.
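
A small sketch of the two spellings, in case someone wants to inspect the generated assembly or time them. The function names recipDiv and recipNeg are hypothetical, and the values are arbitrary. Whether either form gets inlined or ends up calling pow likely depends on the compiler version and on whether n is a compile-time constant, so this only shows that the two give the same result.

```d
import std.math;
import std.stdio : writefln;

// Reciprocal of a positive power, written as an explicit division.
double recipDiv(double x, int n)
{
    return 1.0 / (x ^^ n);
}

// The same value, written with a negative exponent.
double recipNeg(double x, int n)
{
    return x ^^ -n;
}

void main()
{
    immutable x = 1.5; // arbitrary test value
    foreach (n; 1 .. 6)
        writefln("n=%s  1/(x^^n)=%.17g  x^^-n=%.17g",
                 n, recipDiv(x, n), recipNeg(x, n));
}
```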

Bye,
bearophile

Nov 26 2011

Jerry <jlquinn optonline.net> writes:

bearophile <bearophileHUGS lycos.com> writes:

 This is the nbody benchmark of the Shootout site:
 http://shootout.alioth.debian.org/u32/performance.php?test=nbody

 The fastest version is the Fortran one, probably thanks to vector operations that
allow better SIMD vectorization.

 This is the C++ version:
 http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

 C++ version compiled with:
 g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
 The input parameter is n = 3_000_000 (but use a larger number, like 10 million or
 more, if the run time is too short).

All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built
with both D and C++ enabled so the back end would be the same.

jlquinn wyvern:~/d/tests$ ~/gcc/gdc/bin/g++ -O3 -fomit-frame-pointer
-march=native -lm -mfpmath=sse -msse3 --std=c++0x nbody.cc -o nbody_c++
jlquinn wyvern:~/d/tests$ time ./nbody_c++ 50000000
-0.169075164
-0.169059907

real	0m10.209s
user	0m10.180s
sys	0m0.010s


 First D2 version (serial):
 http://codepad.org/AdRSm2wP

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3
-frelease nbody.d -o nbody_d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.830s
user	0m9.820s
sys	0m0.000s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody.d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.828s
user	0m9.830s
sys	0m0.000s

 Second D2 version, three times slower because of its vector ops, but more similar to
the Fortran version:
 http://codepad.org/7O3mz9en

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3
-frelease nbody2.d -o nbody2_d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.805s
user	0m26.760s
sys	0m0.020s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody2.d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.777s
user	0m26.760s
sys	0m0.000s


 Is someone willing to take two (or more) timings using the LDC2 compiler (I have
only LDC1, besides DMD)? I'd like to know how much time it takes to run the
first D version *compared* to the C++ version :-) If you time the second D2
version too, even better.

 Bye and thank you,
 bearophile

So the upshot seems to be that DMD and GDC generate similar code for this
test. Both D compilers also generate slightly faster code than the C++
version (about 9.83 s versus 10.21 s), so either the D front end is doing a
slightly better optimization job, or your first version is slightly more
efficient code.

Jerry

Nov 25 2011

bearophile <bearophileHUGS lycos.com> writes:

Thank you for your surprising timings, Jerry.

 All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built
 with both D and C++ enabled so the back end would be the same.

Is your system a 64 bit one?


 So, the upshot seems like DMD and GDC generate similar code for this test.

This is an uncommon thing, especially on 32 bit systems.


 And both D compilers generate slightly faster code than the C++
 version, therefore the D front end is doing a slightly better
 optimization job, or your first version is slightly more efficient code.

D1 code is often a bit slower than similar C++ code, but in this case I think
D2 has allowed me to specify more semantics, which has produced a faster program.
The static foreach I have used in that D2 code not only looks cleaner than
those C++0x template tricks, but also yields better assembly output.

And the first D2 program is not even the fastest possible: the second D2
program is slow today, but it contains some more semantics that hopefully
someday will allow it to be faster than the first one, and about as fast as
that Fortran version.

This code, which currently doesn't compile (no vector ^^, no vector sqrt, no good
sum function):

immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));


is currently implemented like this:

double[NPAIR] distance = void;
foreach (i; Iota!(0, NPAIR))
    distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);


Here those square roots are parallelizable: the compiler is allowed to use the
SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions.
With the ymm registers of AVX the instruction VSQRTPD (intrinsic _mm256_sqrt_pd
in lesser languages) does 4 double square roots at a time. But maybe its starting
location needs to be aligned to 16 bytes (not currently supported syntax):

align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));
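
The currently implemented loop above can be sketched as a self-contained program. This is only an illustration under some assumptions: Iota! is a helper from the linked program and is not shown in this thread, so the sketch uses static foreach over a range instead (available in newer D compilers); NPAIR = 10 is assumed from the "10 sqrt(double)" mentioned above; and the names distances and sampleVectors, as well as the sample data, are hypothetical.

```d
import std.math;
import std.stdio : writeln;

enum NPAIR = 10; // assumption: 5 bodies give 5 * 4 / 2 = 10 pairs

// Euclidean length of each of the NPAIR three-component vectors in r.
double[NPAIR] distances(const double[3][NPAIR] r)
{
    double[NPAIR] distance = void;
    // Unrolled at compile time, like the Iota!-based foreach in the post.
    static foreach (i; 0 .. NPAIR)
        distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);
    return distance;
}

// Hypothetical sample data: scaled (3, 4, 0) vectors with lengths 5, 10, ...
double[3][NPAIR] sampleVectors()
{
    double[3][NPAIR] r;
    foreach (i, ref v; r)
        v = [3.0 * (i + 1), 4.0 * (i + 1), 0.0];
    return r;
}

void main()
{
    writeln(distances(sampleVectors()));
}
```

Each of the ten sqrt calls is independent of the others, which is what would let a compiler pair them up into packed instructions.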

Bye,
bearophile

Nov 26 2011

bearophile <bearophileHUGS lycos.com> writes:

 Here those square roots are parallelizable: the compiler is allowed to use the
SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions.
With the ymm registers of AVX the instruction VSQRTPD (intrinsic _mm256_sqrt_pd
in lesser languages) does 4 double square roots at a time. But maybe its starting
location needs to be aligned to 16 bytes (not currently supported syntax):

The 32-bit assembly produced by the Intel Fortran compiler for that code is
heavily optimized and fully inlined:
http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has
other causes that I don't know. This also probably means the Fortran version is
not the fastest version possible.

Bye,
bearophile

Nov 27 2011
