
digitalmars.D.learn - N-body bench

reply "bearophile" <bearophileHUGS lycos.com> writes:
If someone is willing to test LDC2 with a known benchmark, 
there's this one:

http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

A reformatted C++11 version, good as a starting point for a D 
translation:
http://codepad.org/4mOHW0fz

Bye,
bearophile
Jan 24 2014
next sibling parent Jerry <jlquinn optonline.net> writes:
"bearophile" <bearophileHUGS lycos.com> writes:

 If someone is willing to test LDC2 with a known benchmark, there's this one:

 http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

 A reformatted C++11 version, good as a starting point for a D translation:
 http://codepad.org/4mOHW0fz
Just playing with the C++ version in gcc 4.7.3, I see a significant speedup by using -funroll-loops. You might want to make sure that's enabled.

Jerry
Jan 28 2014
prev sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 24 January 2014 at 15:56:26 UTC, bearophile wrote:
 If someone is willing to test LDC2 with a known benchmark, 
 there's this one:

 http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

 A reformatted C++11 version, good as a starting point for a D 
 translation:
 http://codepad.org/4mOHW0fz

 Bye,
 bearophile
Hmm.. How would one use core.simd with LDC2? It doesn't seem to define D_SIMD. Or should I go for builtins?
Jan 29 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 Hmm.. How would one use core.simd with LDC2? It doesn't seem to 
 define D_SIMD.
 Or should I go for builtins?
I don't know if this is useful for you, but here I wrote a basic usage example of SIMD in ldc2 (second D entry):
http://rosettacode.org/wiki/Four_bits_adder#D

Bye,
bearophile
Jan 29 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 29 January 2014 at 16:43:35 UTC, bearophile wrote:
 Stanislav Blinov:

 Hmm.. How would one use core.simd with LDC2? It doesn't seem 
 to define D_SIMD.
 Or should I go for builtins?
 I don't know if this is useful for you, but here I wrote a basic usage example of SIMD in ldc2 (second D entry):
 http://rosettacode.org/wiki/Four_bits_adder#D

 Bye,
 bearophile
I meant: how do I make it compile with ldc2? I've translated the code; it compiles and works with dmd (although it segfaults in -release mode for some reason, probably a bug somewhere).

But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

Those are needed for that sqrt reciprocal call.
Jan 29 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 I meant how to make it compile with ldc2? I've translated the 
 code, it compiles and works with dmd (although segfaults in 
 -release mode for some reason, probably a bug somewhere).

 But with ldc2:

 nbody.d(68): Error: undefined identifier __simd
 nbody.d(68): Error: undefined identifier XMM

 those are needed for that sqrt reciprocal call.
Usually ldc2 works with SIMD for me. Perhaps you should show us the code, ask for help in the ldc newsgroup, or ask in the #ldc IRC channel.

Regarding dmd with -release, I suggest you minimize the code and put the problem in Bugzilla. Benchmarks are also useful to find and fix compiler bugs.

Bye,
bearophile
Jan 29 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 29 January 2014 at 16:54:54 UTC, bearophile wrote:
 Stanislav Blinov:

 I meant how to make it compile with ldc2? I've translated the 
 code, it compiles and works with dmd (although segfaults in 
 -release mode for some reason, probably a bug somewhere).

 But with ldc2:

 nbody.d(68): Error: undefined identifier __simd
 nbody.d(68): Error: undefined identifier XMM

 those are needed for that sqrt reciprocal call.
 Usually ldc2 works with SIMD for me. Perhaps you should show us the code, ask for help in the ldc newsgroup, or ask in the #ldc IRC channel.
It's a direct translation of that C++ code:
http://dpaste.dzfl.pl/89517fd0bf8fa

This line:

distance = __simd(XMM.CVTPS2PD, __simd(XMM.RSQRTPS, __simd(XMM.CVTPD2PS, dsquared)));

The XMM enum and the __simd functions are defined only when the D_SIMD version is set. ldc2 doesn't seem to set it, unless I'm missing some compiler switch.
 Regarding dmd with -release, I suggest you minimize the code 
 and put the problem in Bugzilla. Benchmarks are also useful to 
 find and fix compiler bugs.
I'm already onto it :)
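The version gating at issue can be sketched as follows. Only the __simd chain mirrors the line from the translated code; the rsqrtApprox name and the scalar fallback branch are hypothetical, standing in for whatever an ldc2 port would use:

```d
import core.simd;
import std.math : sqrt;

// Lane-wise approximate 1/sqrt(x) on a double2. dmd defines D_SIMD and
// exposes __simd/XMM; ldc2 (per the discussion above) does not, so a
// plain scalar fallback is compiled there instead.
double2 rsqrtApprox(double2 dsquared)
{
    version (D_SIMD)
    {
        // CVTPD2PS narrows to float, RSQRTPS computes the fast approximate
        // reciprocal square root, CVTPS2PD widens back to double.
        return __simd(XMM.CVTPS2PD,
                      __simd(XMM.RSQRTPS,
                             __simd(XMM.CVTPD2PS, dsquared)));
    }
    else
    {
        double2 r;
        r.array[0] = 1.0 / sqrt(dsquared.array[0]);
        r.array[1] = 1.0 / sqrt(dsquared.array[1]);
        return r;
    }
}
```

Note that RSQRTPS yields only a roughly 12-bit approximation, which is why such code typically refines the result with a Newton step afterwards.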
Jan 29 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Regarding dmd, it looks awfully similar to this:

http://d.puremagic.com/issues/show_bug.cgi?id=9449

I'd need to do some more runs though.
Jan 29 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 29 January 2014 at 18:05:41 UTC, Stanislav Blinov 
wrote:

Yep, doesn't seem to be simd-related:

struct S(T) { T v1, v2; }

void main() {
	alias T = double; // integrals and float are ok :\
	version	(workaround) {
		S!T[1] p = void;
	} else {
		S!T[1] p;
	}
}

Anyway, here's the revised (and bugfixed :o)) code, if anyone's 
interested:

http://dpaste.dzfl.pl/52d9e1fdc0fd

On my machine, dmd -release -O -inline -noboundscheck is only 6 
times slower than that C++ version :D

I'll try to get around to making it work with ldc on the weekend.
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Ok, didn't need to wait for the weekend :)

Looks like both dmd and ldc don't optimize slice operations yet, so I had to revert to loops (that shaved off ~1.5 seconds for ldc, ~9 seconds for dmd). Also, my local pull of ldc had some issues with to!int(string), so I reverted that to atoi :)

Here's the code:

http://dpaste.dzfl.pl/4b6df0771696

C++ version compiled with the provided flags.

dmd -release -O -inline -noboundscheck

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

Here are the results on my machine (i3 2100 @ 3.1GHz):

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:07.84 real, 7.82 user, 0.00 sys, 1324 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:23.35 real, 23.29 user, 0.00 sys, 1184 kb, 99% cpu
Jan 30 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 14:17:16 UTC, Stanislav Blinov 
wrote:

Forgot one slice assignment in toDouble2(). Now the results are 
more interesting:

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:05.94 real, 5.92 user, 0.00 sys, 1320 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:19.62 real, 19.57 user, 0.00 sys, 1188 kb, 99% cpu

:)
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 Forgot one slice assignment in toDouble2(). Now the results are 
 more interesting:
Is the latest link shown the last version? I need 0.13.0-alpha1 to compile the code. I am seeing a significant performance difference between C++ and D-ldc2.

Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 15:37:24 UTC, bearophile wrote:
 Stanislav Blinov:

 Forgot one slice assignment in toDouble2(). Now the results 
 are more interesting:
 Is the latest link shown the last version?
No. In toDouble2() on line 13, replace

result.array = args[0];

with

result.array[0] = args[0];
result.array[1] = args[0];
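A sketch of what the fixed helper presumably looks like (the real one lives in the linked paste; the bodies here are reconstructed from the description above):

```d
import core.simd;

// Build a double2 from two runtime scalars. dmd 2.064 only accepts
// constants as vector initializers, so values are stored lane by lane.
double2 toDouble2(double a, double b)
{
    double2 result;
    result.array[0] = a;
    result.array[1] = b;
    return result;
}

// Single-argument form broadcasting one value into both lanes; this is
// the case the fix above corrects.
double2 toDouble2(double a)
{
    return toDouble2(a, a);
}
```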
 I need the 0.13.0-alpha1 to compile the code.
 I am seeing a significant performance difference between C++ 
 and D-ldc2.
You mean with your current version of ldc?
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 You mean with your current version of ldc?
Yes. The older version of LDC2 doesn't even compile the code; I need to use 0.13.0-alpha1.

Your D code with small changes:
http://codepad.org/xqqScd42

Asm generated by G++ for the advance function (that is the one that uses most of the run time):
http://codepad.org/tApRNsVy

Asm generated by ldc2:
http://codepad.org/jKSJcOAZ

With N = 5_000_000 my timings on an old CPU are 2.23 seconds for ldc2 and 1.83 seconds for g++, so there's some performance difference. I have tried to unroll the loop in the D code manually, but I see worse performance. I'll try some more later.

Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 16:53:22 UTC, bearophile wrote:

 Yes. The older version of LDC2 doesn't even compile the code. I 
 need to use 0.13.0-alpha1.
Hmm.
 Your D code with small changes:
 http://codepad.org/xqqScd42
That won't compile with dmd (at least, with 2.064.2): it expects constants as initializers for vectors. :( That's why I rolled up that toDouble2() function.
 With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
 for ldc2 and 1.83 seconds for g++. So there's some performance 
 difference.
What about 50_000_000?
 I have tried to unroll manually the loop in the D code, but I 
 see worse performance. I'll try some more later.
I'm also fiddling :)
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 That won't compile with dmd (at least, with 2.064.2): it 
 expects constants as initializers for vectors. :( That's why I 
 rolled up that toDouble2() function.
I see. Then probably I will have to put it back...
 With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
 for ldc2 and 1.83 seconds for g++. So there's some performance 
 difference.
 What about 50_000_000?
First let me try to fiddle with the code some more :-) Once done, this should go somewhere (like the wiki) as a simple example of SIMD usage in D.

Bye,
bearophile
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 Stanislav Blinov:

 That won't compile with dmd (at least, with 2.064.2): it 
 expects constants as initializers for vectors. :( That's why I 
 rolled up that toDouble2() function.
A few more changes, but this version still lacks toDouble2:
http://codepad.org/SpMprWym

Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 18:29:42 UTC, bearophile wrote:

I see you're compiling with

ldmd2 -wi -O -release -inline -noboundscheck nbody.d

Try

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops
Jan 30 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 ldc2 -release -O3 -disable-boundscheck -vectorize 
 -vectorize-loops
None of my versions of ldc2 even accept -vectorize :-)

ldc2: Unknown command line argument '-vectorize'.  Try: 'ldc2 -help'
ldc2: Did you mean '-vectorize-slp'?

And -vectorize-loops should be active by default on recent versions of ldc2 (including V.0.12.1); indeed I see no performance difference when using it.

Bye,
bearophile
Jan 30 2014
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 Looks like both dmd and ldc don't optimize slice operations 
 yet, had to revert to loops
It's a very silly problem for a statically typed language. The D type system knows the static length of those arrays, but it doesn't use that information. (Similarly, several algorithms in Phobos throw away this very precious compile-time information by requiring dynamic arrays as input.)

I have just suggested a fix for ldc2:
http://forum.dlang.org/thread/qeytzeqnygxpocywyifp forum.dlang.org

I have had similar enhancement requests in Bugzilla for some time:
https://d.puremagic.com/issues/show_bug.cgi?id=10523
https://d.puremagic.com/issues/show_bug.cgi?id=10305

Bye,
bearophile
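The missed optimization can be shown in miniature (hypothetical helper names; both functions compute the same thing, but at the time the slice form was lowered to a generic runtime array op, while the explicit loop over a statically known bound unrolls and vectorizes):

```d
// Slice operation on fixed-size arrays: the length is known at compile
// time, yet dmd/ldc2 (as of early 2014) called into a generic runtime
// array-op routine here.
void updateSlice(ref double[3] v, ref const double[3] r, double mag)
{
    v[] -= r[] * mag;
}

// The hand-written equivalent the benchmark had to fall back to.
void updateLoop(ref double[3] v, ref const double[3] r, double mag)
{
    foreach (i; 0 .. v.length)
        v[i] -= r[i] * mag;
}
```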
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 18:43:02 UTC, bearophile wrote:

 It's a very silly problem for a statically typed language. The 
 D type system knows the static length of those arrays, but it 
 doesn't use such information.
I agree. Unrolling everything except the loop in energy() seems to have squeezed out the bits needed to outperform C++, at least on my machine :)

http://dpaste.dzfl.pl/45e98e476daf

(I'm sticking to atoi because my copy of ldc seems to have an issue in std.conv.)

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.15 real, 5.14 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.41 real, 4.40 user, 0.00 sys, 1308 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:15.39 real, 15.34 user, 0.00 sys, 1192 kb, 99% cpu
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 Unrolling everything except the loop in energy() seems to have 
 squeezed the bits neede to outperform c++, at least on my 
 machine :)
That should be impossible: as I remember from my old profiling, energy() uses only an irrelevant amount of the run time.
 http://dpaste.dzfl.pl/45e98e476daf
While benchmarking some variants of this program I am seeing a large variety of problems, limitations, bugs and regressions.

Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 compiles it. My older version of your D code runs with both compiler versions, but V.0.12.1 generates faster code. Plus, you can't make those double2 immutable, and you can't use vector ops (because of performance, and also because they aren't nothrow in V.0.12.1).

I was also experimenting with this (note the align):

align(16) struct Body {
    double[3] x, v;
    double mass;
}

struct NBodySystem {
private:
    __gshared static Body[5] bodies = [
        // Sun.
        Body([0., 0., 0.],
             [0., 0., 0.],
             solarMass),
        ...

But this improves the code for V.0.12.1 and worsens it for 0.13.0-alpha1. Also, I think the __gshared is ignored in V.0.12.1, but that bug could be fixed in more recent versions of ldc2.
 (I'm sticking to atoi because my copy of ldc seems to have an 
 issue in std.conv).
My version seems to use to!() correctly.

If the ldc2 developers are reading this thread, there is enough strange stuff here to give them one or two headaches :-)

Now I don't know which "final" version of this program I should keep :-)

Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 21:04:06 UTC, bearophile wrote:
 Stanislav Blinov:

 Unrolling everything except the loop in energy() seems to have 
 squeezed the bits neede to outperform c++, at least on my 
 machine :)
 That should be impossible: as I remember from my old profiling, energy() uses only an irrelevant amount of the run time.
I meant that if I unroll it, it's not irrelevant anymore :)
 While I benchmark some variants of this program I am seeing a 
 large variety of problems, limitations, bugs and regressions...
:)
 Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 
 compiles it.
:))
 My older version of your D code runs with both compiler 
 versions, but V.0.12.1 generates faster code.
:)))
 Plus you can't make those double2 immutable, you can't use 
 vector ops (because of performance, and also because they 
 aren't nothrow in V.0.12.1).
Well, not being able to make them immutable is not *that* big of a problem now, is it? What would actually be cool to have are those slice operations.
 I was also experimenting with (note the align):

 align(16) struct Body {
     double[3] x, v;
     double mass;
 }

 struct NBodySystem {
 private:
     __gshared static Body[5] bodies = [
         // Sun.
         Body([0., 0., 0.],
              [0., 0., 0.],
              solarMass),
Yeah... I've even thrown away that filler in the latest version :o)
 But this improves the code for V.0.12.1 and worsens it for 
 0.13.0-alpha1.
%|
 (I'm sticking to atoi because my copy of ldc seems to have an 
 issue in std.conv).
 My version seems to use to!() correctly.
I'm using the git head (704ab3, last commit Sun Jan 26 00:00:21). I haven't tried the release yet.
 If ldc2 developers are reading this thread there is enough 
 strange stuff here to give one or two headaches :-)
Indeed.
 Now I don't know what "final" version should I keep of this 
 program :-)
I was going to compare the asm listings, but C++ seems to have unrolled and inlined the outer loop right inside main(), and now I'm slightly lost in it :)
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 I meant that if I unroll it, it's not irrelevant anymore :)
If a function takes no time to run, and you tweak it, your program is not supposed to go faster.
 I was going to compare the asm listings, but C++ seems to have 
 unrolled and inlined the outer loop right inside main(), and 
 now I'm slightly lost in it :)
Try using -fkeep-inline-functions.

Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 21:33:38 UTC, bearophile wrote:

 If a function takes no time to run, and you tweak it, your 
 program is not supposed to go faster.
Right.
 I was going to compare the asm listings, but C++ seems to have 
 unrolled and inlined the outer loop right inside main(), and 
 now I'm slightly lost in it :)
 Try using -fkeep-inline-functions.
Thanks.

G++:
http://codepad.org/oOZQw1VQ

LDC:
http://codepad.org/5nHoZL1k

LDC basically generated something that I can only call "one straight *whoooosh*"... This reminds me of Andrei's talk at (last year's?) GoingNative ("more instructions is not always slower code").
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 G++:
 http://codepad.org/oOZQw1VQ

 LDC:
 http://codepad.org/5nHoZL1k
You seem to have a quite recent CPU, as the G++ code contains instructions like vmovsd. So you can try to do the same with ldc2, and use AVX or AVX2. These are the relevant switches:

  -march=<string>          - Architecture to generate code for
  -mattr=<a1,+a2,-a3,...>  - Target specific attributes (-mattr=help for details)
  -mcpu=<cpu-name>         - Target a specific cpu type (-mcpu=help for details)
 LDC basically generated something that I can only call "one 
 straight *whoooosh*"...
:-) Bye, bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 21:54:17 UTC, bearophile wrote:

 You seem to have a quite recent CPU,
An aging i3?
 as the G++ code contains instructions like vmovsd. So you can 
 try to do the same with ldc2, and use AVX or AVX2.
Hmm... This is getting a bit silly now. I must have some compile switches for g++ wrong:

g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer -march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 -fopenmp nbody.cpp -o nbody-cpp

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Stanislav Blinov:

 An aging i3?
My CPU is older; it doesn't support AVX or AVX2.
 This is getting a bit silly now. I must have some compile 
 switches for g++ wrong:

 g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer 
 -march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 
 -fopenmp nbody.cpp -o nbody-cpp

 time ./nbody-cpp 50000000:
 -0.169075164
 -0.169059907
 0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

 ldc2 -release -O3 -disable-boundscheck -vectorize 
 -vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d

 time ./nbody-ldc 50000000:
 -0.169075164
 -0.169059907
 0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu
Now the ldc2-compiled version runs in 4 seconds; this sounds correct. If you have paid for a CPU with AVX or AVX2, it's right to use it :-)

Bye,
bearophile
Jan 30 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Since my post, someone has added a Fortran version based on the 
algorithm used in the C++11 code. It's a little faster than the 
C++11 code and it's much nicer looking:
http://benchmarksgame.alioth.debian.org/u32/program.php?test=nbody&lang=ifc&id=5


pure subroutine advance(tstep, x, v, mass)
   real*8, intent(in) :: tstep
   real*8, dimension(4,nb), intent(inout) :: x, v
   real*8, dimension(nb), intent(in) :: mass
   real*8 :: r(4,N),mag(N)

   real*8 :: distance, d2
   integer :: i, j, m
   m = 1
   do i = 1, nb
      do j = i + 1, nb
         r(1,m) = x(1,i) - x(1,j)
         r(2,m) = x(2,i) - x(2,j)
         r(3,m) = x(3,i) - x(3,j)
         m = m + 1
      end do
   end do

   do m = 1, N
      d2 = r(1,m)**2 + r(2,m)**2 + r(3,m)**2
      distance = 1/sqrt(real(d2))
      distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
      !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
      mag(m) = tstep * distance**3
   end do

   m = 1
   do i = 1, nb
      do j = i + 1, nb
         v(1,i) = v(1,i) - r(1,m) * mass(j) * mag(m)
         v(2,i) = v(2,i) - r(2,m) * mass(j) * mag(m)
         v(3,i) = v(3,i) - r(3,m) * mass(j) * mag(m)

         v(1,j) = v(1,j) + r(1,m) * mass(i) * mag(m)
         v(2,j) = v(2,j) + r(2,m) * mass(i) * mag(m)
         v(3,j) = v(3,j) + r(3,m) * mass(i) * mag(m)

         m = m + 1
      end do
   end do

   do i = 1, nb
      x(1,i) = x(1,i) + tstep * v(1,i)
      x(2,i) = x(2,i) + tstep * v(2,i)
      x(3,i) = x(3,i) + tstep * v(3,i)
   end do
   end subroutine advance
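The heart of that middle loop is a single Newton-Raphson step refining the hardware's approximate reciprocal square root. As a scalar D sketch (hypothetical function name):

```d
// One Newton-Raphson refinement of an approximate x0 ~= 1/sqrt(d2):
// x1 = x0 * (1.5 - 0.5 * d2 * x0 * x0), exactly the update in the
// Fortran loop above. Each step roughly doubles the number of correct
// bits, which is why the code keeps a second (commented-out) iteration.
double refineRsqrt(double d2, double approx)
{
    return approx * (1.5 - 0.5 * d2 * approx * approx);
}
```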


Bye,
bearophile
Jan 30 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Thursday, 30 January 2014 at 22:45:45 UTC, bearophile wrote:
 Since my post someone has added a Fortran version based on the 
 algorithm used in the C++11 code. It's a little faster than the 
 C++11 code and it's much nicer looking:
Yup, I saw it. They're cheating: they almost don't have to explicitly handle any SSE business :o) I'm wondering how our little code would perform on that machine.

It looks nice too, by the way:
http://dpaste.dzfl.pl/a81a475bbcf6

I've rearranged some bits, brought back to!int (it turned out there weren't any issues; it's just that ldc generated errors regarding to! when there were other compiler errors %\), replaced the TypeTuples with your Iota... the works :)
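The Iota mentioned here can be sketched like so (a common hand-rolled compile-time sequence, not necessarily the exact version exchanged in the thread; std.meta.AliasSeq was std.typetuple.TypeTuple back in 2014):

```d
import std.meta : AliasSeq;

// Compile-time integer sequence: foreach over an AliasSeq is expanded
// during compilation, so the loop body is fully unrolled and each index
// is a compile-time constant.
template Iota(size_t start, size_t stop)
{
    static if (start >= stop)
        alias Iota = AliasSeq!();
    else
        alias Iota = AliasSeq!(start, Iota!(start + 1, stop));
}

// Example: an unrolled sum over a fixed-size array.
double sumUnrolled(ref const double[4] xs)
{
    double total = 0.0;
    foreach (i; Iota!(0, xs.length))  // four adds, no runtime loop
        total += xs[i];
    return total;
}
```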
Jan 30 2014
parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Gah! G'Kar moment...

http://dpaste.dzfl.pl/203d237d7413
Jan 30 2014