
digitalmars.D.learn - Vector operations optimization.

reply "Comrad" <comrad.karlovich googlemail.com> writes:
I'd like to try d in computational physics. One of the most 
appealing features of the d is implementation of arrays, but to 
be really usable this has to work FAST.
So here http://dlang.org/arrays.html it is stated, that:

"Implementation note: many of the more common vector operations are 
expected to take advantage of any vector math instructions available 
on the target computer."

What is the status at the moment? What compiler and with which 
compiler flags I should use to achieve maximum performance?
Mar 21 2012
parent reply Trass3r <un known.com> writes:
 What is the status at the moment? What compiler and with which compiler  
 flags I should use to achieve maximum performance?

In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization. On the other hand the so-called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.
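For reference, the slice-operation syntax under discussion looks like this (a minimal, self-contained sketch; the variable names are made up):

import std.stdio;

void main()
{
    double[] b = [1.0, 2.0, 3.0, 4.0];
    double[] c = [10.0, 20.0, 30.0, 40.0];
    auto a = new double[](b.length);

    // Whole-slice element-wise add: the compiler lowers this to a
    // druntime array-op routine rather than an explicit user loop.
    a[] = b[] + c[];

    writeln(a); // [11, 22, 33, 44]
}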
Mar 22 2012
next sibling parent reply "Comrad" <comrad.karlovich googlemail.com> writes:
On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
 What is the status at the moment? What compiler and with which 
 compiler flags I should use to achieve maximum performance?

In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization. On the other hand the so-called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.

I had such a snippet to test:

import std.stdio;
void main()
{
    double[2] a=[1.,0.];
    double[2] a1=[1.,0.];
    double[2] a2=[1.,0.];
    double[2] a3=[0.,0.];
    foreach(i;0..1000000000)
        a3[]+=a[]+a1[]*a2[];
    writeln(a3);
}

And I compared with the following D code:

import std.stdio;
void main()
{
    double[2] a=[1.,0.];
    double[2] a1=[1.,0.];
    double[2] a2=[1.,0.];
    double[2] a3=[0.,0.];
    foreach(i;0..1000000000)
    {
        a3[0]+=a[0]+a1[0]*a2[0];
        a3[1]+=a[1]+a1[1]*a2[1];
    }
    writeln(a3);
}

And with the following C code:

#include <stdio.h>
int main()
{
    double a[2]={1.,0.};
    double a1[2]={1.,0.};
    double a2[2]={1.,0.};
    double a3[2]={0.,0.};
    unsigned i;
    for(i=0;i<1000000000;++i)
    {
        a3[0]+=a[0]+a1[0]*a2[0];
        a3[1]+=a[1]+a1[1]*a2[1];
    }
    printf("%f %f\n",a3[0],a3[1]);
    return 0;
}

The last one I compiled with gcc, the two previous with dmd and ldc. The C code with -O2 was the fastest, and as fast as the D code without slicing compiled with ldc. The D code with slicing was 3 times slower (ldc compiler). I tried to compile with different optimization flags; that didn't help. Maybe I used the wrong ones. Can someone comment on this?
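The two D variants can be compared inside a single program, which rules out differences in compile flags between runs. A sketch (on current compilers StopWatch lives in std.datetime.stopwatch; on 2012-era ones it was in std.datetime; the iteration count is reduced from the post's 10^9 to keep the run short):

import std.datetime.stopwatch : StopWatch;
import std.stdio;

void main()
{
    double[2] a  = [1.0, 0.0];
    double[2] a1 = [1.0, 0.0];
    double[2] a2 = [1.0, 0.0];
    double[2] a3 = [0.0, 0.0];
    double[2] b3 = [0.0, 0.0];

    enum iters = 10_000_000;

    StopWatch sw;
    sw.start();
    foreach (i; 0 .. iters)
        a3[] += a[] + a1[] * a2[];   // slice-op variant
    sw.stop();
    writeln("slice ops: ", sw.peek.total!"msecs", " ms");

    sw.reset();
    sw.start();
    foreach (i; 0 .. iters)
    {
        b3[0] += a[0] + a1[0] * a2[0];  // manually unrolled variant
        b3[1] += a[1] + a1[1] * a2[1];
    }
    sw.stop();
    writeln("unrolled:  ", sw.peek.total!"msecs", " ms");

    assert(a3 == b3);  // both variants must compute the same result
}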
Mar 22 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 23.03.2012 9:57, Comrad wrote:
 On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
 What is the status at the moment? What compiler and with which
 compiler flags I should use to achieve maximum performance?

In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization. On the other hand the so-called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.

 I had such a snippet to test:

 import std.stdio;
 void main()
 {
   double[2] a=[1.,0.];
   double[2] a1=[1.,0.];
   double[2] a2=[1.,0.];
   double[2] a3=[0.,0.];

Here is the culprit: the array ops [] are tuned for arbitrarily long(!) arrays; they are not a plain single SIMD SSE op. They are handcrafted loops(!) over SSE ops, cool and fast for arrays in general, but not for fixed pairs/trios/etc. I believe this might change in the future, if the compiler is able to deduce that the size is fixed and use more optimal code for small sizes.
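One workaround along these lines is to route the fixed-size case through a helper templated on the array length, so the length is a compile-time constant and the optimizer can unroll the loop instead of calling the generic runtime array-op routine. A sketch (fusedUpdate is a made-up name, not a druntime function):

import std.stdio;

// Hypothetical helper: N is known at compile time, so the loop
// below can be fully unrolled by the optimizer.
void fusedUpdate(size_t N)(ref double[N] acc,
                           const ref double[N] x,
                           const ref double[N] y,
                           const ref double[N] z)
{
    foreach (i; 0 .. N)
        acc[i] += x[i] + y[i] * z[i];
}

void main()
{
    double[2] a  = [1.0, 0.0];
    double[2] a1 = [1.0, 0.0];
    double[2] a2 = [1.0, 0.0];
    double[2] a3 = [0.0, 0.0];

    fusedUpdate(a3, a, a1, a2);  // N deduced as 2 from the arguments
    writeln(a3); // [2, 0]
}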
   foreach(i;0..1000000000)
     a3[]+=a[]+a1[]*a2[];
   writeln(a3);
 }

 And I compared with the following d code:

 import std.stdio;
 void main()
 {
   double[2] a=[1.,0.];
   double[2] a1=[1.,0.];
   double[2] a2=[1.,0.];
   double[2] a3=[0.,0.];
   foreach(i;0..1000000000)
   {
     a3[0]+=a[0]+a1[0]*a2[0];
     a3[1]+=a[1]+a1[1]*a2[1];
   }
   writeln(a3);
 }

 And with the following c code:

 #include <stdio.h>
 int main()
 {
   double a[2]={1.,0.};
   double a1[2]={1.,0.};
   double a2[2]={1.,0.};
   double a3[2]={0.,0.};
   unsigned i;
   for(i=0;i<1000000000;++i)
   {
     a3[0]+=a[0]+a1[0]*a2[0];
     a3[1]+=a[1]+a1[1]*a2[1];
   }
   printf("%f %f\n",a3[0],a3[1]);
   return 0;
 }

 The last one I compiled with gcc, the two previous with dmd and ldc.
 The C code with -O2 was the fastest, and as fast as the D code without
 slicing compiled with ldc. The D code with slicing was 3 times slower
 (ldc compiler). I tried to compile with different optimization flags;
 that didn't help. Maybe I used the wrong ones. Can someone comment on
 this?

-- Dmitry Olshansky
Mar 23 2012
prev sibling next sibling parent James Miller <james aatch.net> writes:
On 23 March 2012 18:57, Comrad <comrad.karlovich googlemail.com> wrote:
 On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
 What is the status at the moment? What compiler and with which compiler
 flags I should use to achieve maximum performance?

 In general gdc or ldc. Not sure how good vectorization is though, esp.
 auto-vectorization. On the other hand the so-called vector operations
 like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in
 dmd.

 [the three test snippets quoted earlier in the thread snipped]

 The last one I compiled with gcc, the two previous with dmd and ldc.
 The C code with -O2 was the fastest, and as fast as the D code without
 slicing compiled with ldc. The D code with slicing was 3 times slower
 (ldc compiler). I tried to compile with different optimization flags;
 that didn't help. Maybe I used the wrong ones. Can someone comment on
 this?

The flags you want are -O -inline -release. If you don't have those, then that might explain some of the slowdown on slicing, since -release drops a ton of runtime checks. Otherwise, I'm not sure why it's so much slower; the druntime array ops are written using SIMD instructions where available, so it should be fast.

-- James Miller
Mar 22 2012
prev sibling next sibling parent Trass3r <un known.com> writes:
 The flags you want are -O -inline -release.

 If you don't have those, then that might explain some of the slow down
 on slicing, since -release drops a ton of runtime checks.

-noboundscheck option can also speed up things.
Mar 23 2012
prev sibling next sibling parent "Comrad" <comrad.karlovich gmail.com> writes:
On Friday, 23 March 2012 at 10:48:55 UTC, Dmitry Olshansky wrote:
 On 23.03.2012 9:57, Comrad wrote:
 On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
 What is the status at the moment? What compiler and with 
 which
 compiler flags I should use to achieve maximum performance?

In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization. On the other hand the so-called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.

I had such a snippet to test:

import std.stdio;
void main()
{
  double[2] a=[1.,0.];
  double[2] a1=[1.,0.];
  double[2] a2=[1.,0.];
  double[2] a3=[0.,0.];

Here is the culprit: the array ops [] are tuned for arbitrarily long(!) arrays; they are not a plain single SIMD SSE op. They are handcrafted loops(!) over SSE ops, cool and fast for arrays in general, but not for fixed pairs/trios/etc. I believe this might change in the future, if the compiler is able to deduce that the size is fixed and use more optimal code for small sizes.

So currently no such optimization exists in any D compiler?
Mar 23 2012
prev sibling parent "Comrad" <comrad.karlovich gmail.com> writes:
On Friday, 23 March 2012 at 11:20:59 UTC, Trass3r wrote:
 The flags you want are -O -inline -release.

 If you don't have those, then that might explain some of the 
 slow down
 on slicing, since -release drops a ton of runtime checks.

-noboundscheck option can also speed up things.

dmd was anyway very slow; ldc2 was better, but still not fast enough.
Mar 23 2012