digitalmars.D.learn - Fast multidimensional Arrays

Steinhagelvoll (47/47) Aug 29 2016 Hello,

kink (2/2) Aug 29 2016 At the very least, give the LDC command line a `-release`,
rikki cattermole (7/53) Aug 29 2016 By the looks you're not running the tests more then once.

Steinhagelvoll (12/12) Aug 29 2016 Ok I added release and implemented the benchmark for 500

Daniel Kozak via Digitalmars-d-learn (2/13) Aug 29 2016 It is possible, there is a lot of indirections
rikki cattermole (45/56) Aug 29 2016 double[1000][] A, B, C;

rikki cattermole (24/85) Aug 29 2016 Below change is slightly faster:
Steinhagelvoll (10/18) Aug 29 2016 It seems that the ini doesn't work properly. Every value seems to

rikki cattermole (43/51) Aug 29 2016 My bad, fixed:

rikki cattermole (6/6) Aug 29 2016 Okay looks like I've made a boo boo and ldc is compiling out that entire...

Daniel Kozak via Digitalmars-d-learn (36/42) Aug 29 2016 this is my version:

Daniel Kozak via Digitalmars-d-learn (10/52) Aug 29 2016 This will not work, you need to add some ref :).

David Nadlinger (5/9) Aug 29 2016 D program startup is on the order of milliseconds, so the

Daniel Kozak via Digitalmars-d-learn (13/59) Aug 29 2016 It is unfair to compare different backend:

Steinhagelvoll (12/27) Aug 29 2016 This seems to be it. I also implemented it in C++ (because

Daniel Kozak via Digitalmars-d-learn (3/35) Aug 29 2016 with gcc you can try enable some optimalizations: g++ -O3 -march=native
Seb (10/13) Aug 29 2016 You should definitely have a look at this benchmark for matrix

Steinhagelvoll (6/19) Aug 29 2016 It not really about multiplying matrices. I wanted to see how D

Ilya Yaroshenko (11/36) Aug 29 2016 ndslice is analog of numpy. It is more flexible comparing with

Stefan Koch (5/11) Aug 29 2016 Any chance you can post the generated asm ?

Daniel Kozak via Digitalmars-d-learn (2/14) Aug 29 2016 why i486, I belive it will select x86_64 by default on linux

Steinhagelvoll <matthias.redies posteo.de> writes:

Hello,

I'm trying to find a fast way to use multi dimensional arrays. 
For this I implemented a matrix multiplication and compared the 
times for different ways. As a reference I used a Fortran90 
implementation.

Fortran reference: http://pastebin.com/Hd5zTHVJ
ifort test.f90  -o testf && time ./testf
real    0m0.680s
user    0m0.672s
sys     0m0.008s

ifort -O3 test.f90 -o testf && time ./testf
real    0m0.235s
user    0m0.228s
sys     0m0.004s

ifort -check all test.f90  -o testf && time ./testf
         1000

real    0m34.993s
user    0m35.012s
sys     0m0.008s


For D it tried a number of different ways:

NDSlice: http://pastebin.com/nUbMnt8B
real	0m35.922s
user	0m35.888s
sys	0m0.008


1D Arrays: http://pastebin.com/R7CJFybK
dmd -boundscheck=off -O test.d && time ./test
real	0m4.415s
user	0m4.412s
sys	0m0.004s

ldc2 -O3 test.d && time ./test
real	0m4.261s
user	0m4.252s
sys	0m0.004s

2D Arrays: http://pastebin.com/4CuB4Y0c

dmd -boundscheck=off -O nd_test.d && time ./nd_test
real	0m3.565s
user	0m3.560s
sys	0m0.004s


ldc2 -O3 nd_test.d && time ./nd_test
real	0m3.568s
user	0m3.560s
sys	0m0.004s

None of them is even close to the Fortran implementation, only 
when I enable all check in Fortran it seems to be equal to 
Ndslice. Is there a speedy way to use multi-dimensional matrices?

Kind regards

Matthias

Aug 29 2016

kink <noone nowhere.com> writes:

At the very least, give the LDC command line a `-release`, 
otherwise you end up with all assertions enabled etc.

Aug 29 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 29/08/2016 9:53 PM, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For this
 I implemented a matrix multiplication and compared the times for
 different ways. As a reference I used a Fortran90 implementation.

 Fortran reference: http://pastebin.com/Hd5zTHVJ
 ifort test.f90  -o testf && time ./testf
 real    0m0.680s
 user    0m0.672s
 sys     0m0.008s

 ifort -O3 test.f90 -o testf && time ./testf
 real    0m0.235s
 user    0m0.228s
 sys     0m0.004s

 ifort -check all test.f90  -o testf && time ./testf
         1000

 real    0m34.993s
 user    0m35.012s
 sys     0m0.008s


 For D it tried a number of different ways:

 NDSlice: http://pastebin.com/nUbMnt8B
 real    0m35.922s
 user    0m35.888s
 sys    0m0.008


 1D Arrays: http://pastebin.com/R7CJFybK
 dmd -boundscheck=off -O test.d && time ./test
 real    0m4.415s
 user    0m4.412s
 sys    0m0.004s

 ldc2 -O3 test.d && time ./test
 real    0m4.261s
 user    0m4.252s
 sys    0m0.004s

 2D Arrays: http://pastebin.com/4CuB4Y0c

 dmd -boundscheck=off -O nd_test.d && time ./nd_test
 real    0m3.565s
 user    0m3.560s
 sys    0m0.004s


 ldc2 -O3 nd_test.d && time ./nd_test
 real    0m3.568s
 user    0m3.560s
 sys    0m0.004s

 None of them is even close to the Fortran implementation, only when I
 enable all check in Fortran it seems to be equal to Ndslice. Is there a
 speedy way to use multi-dimensional matrices?

 Kind regards

 Matthias

By the looks you're not running the tests more then once.
Druntime initialization could be effecting this.

Please execute each test (without memory allocation) 10000 times atleast 
and then report back what they are.

Something like https://dlang.org/phobos/std_datetime.html#.benchmark 
will be very helpful.

Aug 29 2016

Steinhagelvoll <matthias.redies posteo.de> writes:

Ok I added release and implemented the benchmark for 500 
iterations, 10000 are not reasonable. I build on the 2d array 
with LDC: http://pastebin.com/aXxzEdS4 (changes just in the 
beginning)

$ ldc2 -release -O3 nd_test.d
$ ./nd_test
12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

, which is 738 seconds. Compared to (also 500 iterations)

ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


This still seems like a big difference. Is it because I don't use 
a continous piece of memory, but rather a pointer to a pointer?

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 14:13 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 Ok I added release and implemented the benchmark for 500 iterations, 
 10000 are not reasonable. I build on the 2d array with LDC: 
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a 
 continous piece of memory, but rather a pointer to a pointer?

It is possible, there is a lot of indirections

Aug 29 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 Ok I added release and implemented the benchmark for 500 iterations,
 10000 are not reasonable. I build on the 2d array with LDC:
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a
 continous piece of memory, but rather a pointer to a pointer?

double[1000][] A, B, C;

void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

         import std.conv : to;
         import std.datetime;
         import std.stdio : writeln;

         ini(A);
         ini(B);
         ini(C);

         auto r = benchmark!run_test(10000);
         auto res = to!Duration(r[0]);
         writeln(res);
}

void run_test() {
         MatMul(A, B, C);
}

void ini(T)(T mtx) {
         foreach(v; mtx) {
                 v = 3.4;
         }

         foreach(i, v; mtx) {
                 foreach(j, vv; v) {
                         vv += (i * j) + (0.6 * j);
                 }
         }
}

void MatMul(T)(T A, T B, T C) {
         foreach(cv; C) {
                 cv = 0f;
         }

         foreach(i, cv; C) {
                 foreach(j, av; A[i]) {
                         foreach(k, cvv; cv) {
                                 cvv += av * B[j][k];
                         }
                 }
         }
}

$ ldc2 test.d -O5 -release -oftest.exe -m64
$ ./test
3 secs, 995 ms, 115 Î¼s, and 2 hnsecs

Please verify that it is still doing the same thing that you want.

Aug 29 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 30/08/2016 1:02 AM, rikki cattermole wrote:
 On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 Ok I added release and implemented the benchmark for 500 iterations,
 10000 are not reasonable. I build on the 2d array with LDC:
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a
 continous piece of memory, but rather a pointer to a pointer?

 double[1000][] A, B, C;

 void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

         import std.conv : to;
         import std.datetime;
         import std.stdio : writeln;

         ini(A);
         ini(B);
         ini(C);

         auto r = benchmark!run_test(10000);
         auto res = to!Duration(r[0]);
         writeln(res);
 }

 void run_test() {
         MatMul(A, B, C);
 }

 void ini(T)(T mtx) {
         foreach(v; mtx) {
                 v = 3.4;
         }

         foreach(i, v; mtx) {
                 foreach(j, vv; v) {
                         vv += (i * j) + (0.6 * j);
                 }
         }
 }

 void MatMul(T)(T A, T B, T C) {
         foreach(cv; C) {
                 cv = 0f;
         }

         foreach(i, cv; C) {
                 foreach(j, av; A[i]) {
                         foreach(k, cvv; cv) {
                                 cvv += av * B[j][k];
                         }
                 }
         }
 }

 $ ldc2 test.d -O5 -release -oftest.exe -m64
 $ ./test
 3 secs, 995 ms, 115 Î¼s, and 2 hnsecs

 Please verify that it is still doing the same thing that you want.

Below change is slightly faster:

         foreach(i, cv; C) { 
 
 

                 foreach(j, av; A[i]) { 
 
 

                         auto bv = B[j]; 
 
 

                         foreach(k, cvv; cv) { 
 
 

                                 cvv += av * bv[k]; 
 
 

                         } 
 
 
                                      } 
 
 

         }

Aug 29 2016

Steinhagelvoll <matthias.redies posteo.de> writes:

On Monday, 29 August 2016 at 13:02:43 UTC, rikki cattermole wrote:
 On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 [...]

 double[1000][] A, B, C;

 void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

 [...]

It seems that the ini doesn't work properly. Every value seems to 
be nan.

ini(A);
ini(B);
ini(C);
writeln(A[0][0]);
writeln(C[3][9]);

nan
nan

Aug 29 2016

rikki cattermole <rikki cattermole.co.nz> writes:

On 30/08/2016 1:50 AM, Steinhagelvoll wrote:
 It seems that the ini doesn't work properly. Every value seems to be nan.

 ini(A);
 ini(B);
 ini(C);
 writeln(A[0][0]);
 writeln(C[3][9]);

 nan
 nan

My bad, fixed:

double[1000][] A, B, C;

void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

         import std.conv : to;
         import std.datetime;
         import std.stdio : writeln;

         ini(A);
         ini(B);
         ini(C);

         auto r = benchmark!run_test(10000);
         auto res = to!Duration(r[0]);
         writeln(res);
}

void run_test() {
         MatMul(A, B, C);
}

void ini(T)(T mtx) {
         foreach(ref v; mtx) {
                 v = 3.4;
         }

         foreach(i, v; mtx) {
                 foreach(j, ref vv; v) {
                         vv += (i * j) + (0.6 * j);
                 }
         }
}

void MatMul(T)(T A, T B, T C) {
         foreach(cv; C) {
                 cv = 0f;
         }

         foreach(i, cv; C) {
                 foreach(j, av; A[i]) {
                         auto bv = B[j];
                         foreach(k, cvv; cv) {
                                 cvv += av * bv[k];
                         }
                 }
         }

}

Aug 29 2016

rikki cattermole <rikki cattermole.co.nz> writes:

Okay looks like I've made a boo boo and ldc is compiling out that entire 
multiplication loop out.

Its passing the array statically and since its never assigned back, its 
just never compiled in (unless you specify it via ref).

So, this is where I give up as it is 2am.

Perhaps try and make it parallel (std.parallemism can help hugely).

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 16:08 rikki cattermole via Digitalmars-d-learn napsal(a):

 Okay looks like I've made a boo boo and ldc is compiling out that 
 entire multiplication loop out.

 Its passing the array statically and since its never assigned back, 
 its just never compiled in (unless you specify it via ref).

 So, this is where I give up as it is 2am.

 Perhaps try and make it parallel (std.parallemism can help hugely).

this is my version:

import std.stdio;

immutable int n = 1000, l=1000, m=1000;

struct ZeroDouble
{
     double val = 0f;
     alias val this;
}

void main(string[] args)
{
     auto A = new double [1000][m];
     auto B = new double [1000][n];
     auto C = new ZeroDouble[1000][n];
     ini!(A);
     ini!(B);
     MatMul!(A,B,C);
     writeln(C[1][1]);
     writefln("%d   %d ", C.length, C[0].length);
}

void ini(alias mtx)(){
     foreach(i, ref mtxInner; mtx) {
         foreach(j, ref cell; mtxInner) {
             cell = i*j + 0.6*j +3.4;
         }
     }
}

void MatMul(alias A, alias B, alias C)() {
     foreach(i, ref cv; C) {
         foreach(j, av; A[i]) {
             foreach(k, ref cvv; cv) {
                 cvv += av * B[j][k];
             }
         }
     }
}

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 15:57 rikki cattermole via Digitalmars-d-learn napsal(a):

 My bad, fixed:

 double[1000][] A, B, C;

 void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

         import std.conv : to;
         import std.datetime;
         import std.stdio : writeln;

         ini(A);
         ini(B);
         ini(C);

         auto r = benchmark!run_test(10000);
         auto res = to!Duration(r[0]);
         writeln(res);
 }

 void run_test() {
         MatMul(A, B, C);
 }

 void ini(T)(T mtx) {
         foreach(ref v; mtx) {
                 v = 3.4;
         }

         foreach(i, v; mtx) {
                 foreach(j, ref vv; v) {
                         vv += (i * j) + (0.6 * j);
                 }
         }
 }

 void MatMul(T)(T A, T B, T C) {
         foreach(cv; C) {
                 cv = 0f;
         }

         foreach(i, cv; C) {
                 foreach(j, av; A[i]) {
                         auto bv = B[j];
                         foreach(k, cvv; cv) {
                                 cvv += av * bv[k];
                         }
                 }
         }

 }

This will not work, you need to add some ref :).

     foreach(i, ref cv; C) {
         foreach(j, av; A[i]) {
             auto bv = B[j];
             foreach(k, ref cvv; cv) {
                 cvv += av * bv[k];
             }
         }
     }

Aug 29 2016

David Nadlinger <code klickverbot.at> writes:

On Monday, 29 August 2016 at 10:20:56 UTC, rikki cattermole wrote:
 By the looks you're not running the tests more then once.
 Druntime initialization could be effecting this.

 Please execute each test (without memory allocation) 10000 
 times atleast and then report back what they are.

D program startup is on the order of milliseconds, so the 
difference is negligible for a benchmark that runs for more than 
a second vs. 200 ms.

  — David

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For 
 this I implemented a matrix multiplication and compared the times for 
 different ways. As a reference I used a Fortran90 implementation.

 Fortran reference: http://pastebin.com/Hd5zTHVJ
 ifort test.f90  -o testf && time ./testf
 real    0m0.680s
 user    0m0.672s
 sys     0m0.008s

 ifort -O3 test.f90 -o testf && time ./testf
 real    0m0.235s
 user    0m0.228s
 sys     0m0.004s

 ifort -check all test.f90  -o testf && time ./testf
         1000

 real    0m34.993s
 user    0m35.012s
 sys     0m0.008s


 For D it tried a number of different ways:

 NDSlice: http://pastebin.com/nUbMnt8B
 real    0m35.922s
 user    0m35.888s
 sys    0m0.008


 1D Arrays: http://pastebin.com/R7CJFybK
 dmd -boundscheck=off -O test.d && time ./test
 real    0m4.415s
 user    0m4.412s
 sys    0m0.004s

 ldc2 -O3 test.d && time ./test
 real    0m4.261s
 user    0m4.252s
 sys    0m0.004s

 2D Arrays: http://pastebin.com/4CuB4Y0c

 dmd -boundscheck=off -O nd_test.d && time ./nd_test
 real    0m3.565s
 user    0m3.560s
 sys    0m0.004s


 ldc2 -O3 nd_test.d && time ./nd_test
 real    0m3.568s
 user    0m3.560s
 sys    0m0.004s

 None of them is even close to the Fortran implementation, only when I 
 enable all check in Fortran it seems to be equal to Ndslice. Is there 
 a speedy way to use multi-dimensional matrices?

 Kind regards

 Matthias

It is unfair to compare different backend:

gfortran -O3 -o test test.f90
[kozak dajinka ~]$ time ./test

real    0m2.072s
user    0m2.053s
sys    0m0.013s

gdc -O3 -o test test.d
[kozak dajinka ~]$ time ./test

real    0m1.655s
user    0m1.640s
sys    0m0.010s

Obviously ifort can use some special instruction on your CPU

Aug 29 2016

Steinhagelvoll <matthias.redies posteo.de> writes:

On Monday, 29 August 2016 at 13:59:15 UTC, Daniel Kozak wrote:
 Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn 
 napsal(a):

 [...]

 It is unfair to compare different backend:

 gfortran -O3 -o test test.f90
 [kozak dajinka ~]$ time ./test

 real    0m2.072s
 user    0m2.053s
 sys    0m0.013s

 gdc -O3 -o test test.d
 [kozak dajinka ~]$ time ./test

 real    0m1.655s
 user    0m1.640s
 sys    0m0.010s

 Obviously ifort can use some special instruction on your CPU

This seems to be it. I also implemented it in C++ (because 
gfortran isn't the main focus of GNU) and this is the result:

$ ./cpp_test_clang
elapsed time 1.12785
$ ./cpp_test_gpp
elapsed time 1.24206
$ ./cpp_test_intel
elapsed time 0.298331

It is quite surprising that there is this much of a difference, 
even when all run sequential. I believe this might be specific to 
this small problem.

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 16:43 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 On Monday, 29 August 2016 at 13:59:15 UTC, Daniel Kozak wrote:
 Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 [...]

 It is unfair to compare different backend:

 gfortran -O3 -o test test.f90
 [kozak dajinka ~]$ time ./test

 real    0m2.072s
 user    0m2.053s
 sys    0m0.013s

 gdc -O3 -o test test.d
 [kozak dajinka ~]$ time ./test

 real    0m1.655s
 user    0m1.640s
 sys    0m0.010s

 Obviously ifort can use some special instruction on your CPU

 This seems to be it. I also implemented it in C++ (because gfortran 
 isn't the main focus of GNU) and this is the result:

 $ ./cpp_test_clang
 elapsed time 1.12785
 $ ./cpp_test_gpp
 elapsed time 1.24206
 $ ./cpp_test_intel
 elapsed time 0.298331

 It is quite surprising that there is this much of a difference, even 
 when all run sequential. I believe this might be specific to this 
 small problem.

with gcc you can try enable some optimalizations: g++ -O3 -march=native 
-o test test.cpp

Aug 29 2016

Seb <seb wilzba.ch> writes:

On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll wrote:
 It is quite surprising that there is this much of a difference, 
 even when all run sequential. I believe this might be specific 
 to this small problem.

You should definitely have a look at this benchmark for matrix 
multiplication across a many languages:

https://github.com/kostya/benchmarks#matmul

With the recent generic GLAS kernel in mir, matrix multiplication 
in D is the blazingly fast (it improved the existing results by 
at least 8x).
Please not that this requires the latest LDC beta with includes 
the fastMath pragma and GLAS is still under development at mir:

https://github.com/libmir/mir

Aug 29 2016

Steinhagelvoll <matthias.redies posteo.de> writes:

On Monday, 29 August 2016 at 14:55:50 UTC, Seb wrote:
 On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll wrote:
 It is quite surprising that there is this much of a 
 difference, even when all run sequential. I believe this might 
 be specific to this small problem.

 You should definitely have a look at this benchmark for matrix 
 multiplication across a many languages:

 https://github.com/kostya/benchmarks#matmul

 With the recent generic GLAS kernel in mir, matrix 
 multiplication in D is the blazingly fast (it improved the 
 existing results by at least 8x).
 Please not that this requires the latest LDC beta with includes 
 the fastMath pragma and GLAS is still under development at mir:

 https://github.com/libmir/mir

It not really about multiplying matrices. I wanted to see how D 
compares for different tasks. If I actually want to do matrix 
multiplication I will  use LAPACK or something of that nature.

In this task the difference was much bigger compared to e.g. 
prime testing, which was about even.

Aug 29 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Monday, 29 August 2016 at 15:46:26 UTC, Steinhagelvoll wrote:
 On Monday, 29 August 2016 at 14:55:50 UTC, Seb wrote:
 On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll 
 wrote:
 It is quite surprising that there is this much of a 
 difference, even when all run sequential. I believe this 
 might be specific to this small problem.

 You should definitely have a look at this benchmark for matrix 
 multiplication across a many languages:

 https://github.com/kostya/benchmarks#matmul

 With the recent generic GLAS kernel in mir, matrix 
 multiplication in D is the blazingly fast (it improved the 
 existing results by at least 8x).
 Please not that this requires the latest LDC beta with 
 includes the fastMath pragma and GLAS is still under 
 development at mir:

 https://github.com/libmir/mir

 It not really about multiplying matrices. I wanted to see how D 
 compares for different tasks. If I actually want to do matrix 
 multiplication I will  use LAPACK or something of that nature.

 In this task the difference was much bigger compared to e.g. 
 prime testing, which was about even.

ndslice is analog of numpy. It is more flexible comparing with 
Fortran arrays. In the same time, if you want fast iteration 
please use Mir, which includes upcoming ndslice.algorithm with 
 fasmath attribute and `vectorized` flag for `ndReduce`. Note, 
that in-memory representation is important for vectorization, 
e.g. for dot product both slices should have strides equal to 1. 
Add also -mcpu=native flag for LDC.

http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html#ndReduce

Best regards,
Ilya

Aug 29 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 29 August 2016 at 09:53:12 UTC, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. 
 For this I implemented a matrix multiplication and compared the 
 times for different ways. As a reference I used a Fortran90 
 implementation.

 [...]

Any chance you can post the generated asm ?
I have a suspicion:
you are not passing your cpu arch to ldc,
thus probably it generated i486 code.

Aug 29 2016

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 29.8.2016 v 16:21 Stefan Koch via Digitalmars-d-learn napsal(a):

 On Monday, 29 August 2016 at 09:53:12 UTC, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For 
 this I implemented a matrix multiplication and compared the times for 
 different ways. As a reference I used a Fortran90 implementation.

 [...]

 Any chance you can post the generated asm ?
 I have a suspicion:
 you are not passing your cpu arch to ldc,
 thus probably it generated i486 code.

why i486, I belive it will select x86_64 by default on linux

Aug 29 2016

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Fast multidimensional Arrays