www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Fast multidimensional Arrays

reply Steinhagelvoll <matthias.redies posteo.de> writes:
Hello,

I'm trying to find a fast way to use multi dimensional arrays. 
For this I implemented a matrix multiplication and compared the 
times for different ways. As a reference I used a Fortran90 
implementation.

Fortran reference: http://pastebin.com/Hd5zTHVJ
ifort test.f90  -o testf && time ./testf
real    0m0.680s
user    0m0.672s
sys     0m0.008s

ifort -O3 test.f90 -o testf && time ./testf
real    0m0.235s
user    0m0.228s
sys     0m0.004s

ifort -check all test.f90  -o testf && time ./testf
         1000

real    0m34.993s
user    0m35.012s
sys     0m0.008s


For D it tried a number of different ways:

NDSlice: http://pastebin.com/nUbMnt8B
real	0m35.922s
user	0m35.888s
sys	0m0.008


1D Arrays: http://pastebin.com/R7CJFybK
dmd -boundscheck=off -O test.d && time ./test
real	0m4.415s
user	0m4.412s
sys	0m0.004s

ldc2 -O3 test.d && time ./test
real	0m4.261s
user	0m4.252s
sys	0m0.004s

2D Arrays: http://pastebin.com/4CuB4Y0c

dmd -boundscheck=off -O nd_test.d && time ./nd_test
real	0m3.565s
user	0m3.560s
sys	0m0.004s


ldc2 -O3 nd_test.d && time ./nd_test
real	0m3.568s
user	0m3.560s
sys	0m0.004s

None of them is even close to the Fortran implementation, only 
when I enable all check in Fortran it seems to be equal to 
Ndslice. Is there a speedy way to use multi-dimensional matrices?

Kind regards

Matthias
Aug 29 2016
next sibling parent kink <noone nowhere.com> writes:
At the very least, give the LDC command line a `-release`, 
otherwise you end up with all assertions enabled etc.
Aug 29 2016
prev sibling next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 29/08/2016 9:53 PM, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For this
 I implemented a matrix multiplication and compared the times for
 different ways. As a reference I used a Fortran90 implementation.

 Fortran reference: http://pastebin.com/Hd5zTHVJ
 ifort test.f90  -o testf && time ./testf
 real    0m0.680s
 user    0m0.672s
 sys     0m0.008s

 ifort -O3 test.f90 -o testf && time ./testf
 real    0m0.235s
 user    0m0.228s
 sys     0m0.004s

 ifort -check all test.f90  -o testf && time ./testf
         1000

 real    0m34.993s
 user    0m35.012s
 sys     0m0.008s


 For D it tried a number of different ways:

 NDSlice: http://pastebin.com/nUbMnt8B
 real    0m35.922s
 user    0m35.888s
 sys    0m0.008


 1D Arrays: http://pastebin.com/R7CJFybK
 dmd -boundscheck=off -O test.d && time ./test
 real    0m4.415s
 user    0m4.412s
 sys    0m0.004s

 ldc2 -O3 test.d && time ./test
 real    0m4.261s
 user    0m4.252s
 sys    0m0.004s

 2D Arrays: http://pastebin.com/4CuB4Y0c

 dmd -boundscheck=off -O nd_test.d && time ./nd_test
 real    0m3.565s
 user    0m3.560s
 sys    0m0.004s


 ldc2 -O3 nd_test.d && time ./nd_test
 real    0m3.568s
 user    0m3.560s
 sys    0m0.004s

 None of them is even close to the Fortran implementation, only when I
 enable all check in Fortran it seems to be equal to Ndslice. Is there a
 speedy way to use multi-dimensional matrices?

 Kind regards

 Matthias
By the looks you're not running the tests more then once. Druntime initialization could be effecting this. Please execute each test (without memory allocation) 10000 times atleast and then report back what they are. Something like https://dlang.org/phobos/std_datetime.html#.benchmark will be very helpful.
Aug 29 2016
next sibling parent reply Steinhagelvoll <matthias.redies posteo.de> writes:
Ok I added release and implemented the benchmark for 500 
iterations, 10000 are not reasonable. I build on the 2d array 
with LDC: http://pastebin.com/aXxzEdS4 (changes just in the 
beginning)

$ ldc2 -release -O3 nd_test.d
$ ./nd_test
12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

, which is 738 seconds. Compared to (also 500 iterations)

ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


This still seems like a big difference. Is it because I don't use 
a continous piece of memory, but rather a pointer to a pointer?
Aug 29 2016
next sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 14:13 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 Ok I added release and implemented the benchmark for 500 iterations, 
 10000 are not reasonable. I build on the 2d array with LDC: 
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a 
 continous piece of memory, but rather a pointer to a pointer?
It is possible, there is a lot of indirections
Aug 29 2016
prev sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 Ok I added release and implemented the benchmark for 500 iterations,
 10000 are not reasonable. I build on the 2d array with LDC:
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a
 continous piece of memory, but rather a pointer to a pointer?
double[1000][] A, B, C; void main() { A = new double[1000][1000]; B = new double[1000][1000]; C = new double[1000][1000]; import std.conv : to; import std.datetime; import std.stdio : writeln; ini(A); ini(B); ini(C); auto r = benchmark!run_test(10000); auto res = to!Duration(r[0]); writeln(res); } void run_test() { MatMul(A, B, C); } void ini(T)(T mtx) { foreach(v; mtx) { v = 3.4; } foreach(i, v; mtx) { foreach(j, vv; v) { vv += (i * j) + (0.6 * j); } } } void MatMul(T)(T A, T B, T C) { foreach(cv; C) { cv = 0f; } foreach(i, cv; C) { foreach(j, av; A[i]) { foreach(k, cvv; cv) { cvv += av * B[j][k]; } } } } $ ldc2 test.d -O5 -release -oftest.exe -m64 $ ./test 3 secs, 995 ms, 115 μs, and 2 hnsecs Please verify that it is still doing the same thing that you want.
Aug 29 2016
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 30/08/2016 1:02 AM, rikki cattermole wrote:
 On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 Ok I added release and implemented the benchmark for 500 iterations,
 10000 are not reasonable. I build on the 2d array with LDC:
 http://pastebin.com/aXxzEdS4 (changes just in the beginning)

 $ ldc2 -release -O3 nd_test.d
 $ ./nd_test
 12 minutes, 18 secs, 21 ms, 858 μs, and 3 hnsecs

 , which is 738 seconds. Compared to (also 500 iterations)

 ifort -O3 -o fort_test test.f90 && ./fort_test
  time:    107.4640    seconds


 This still seems like a big difference. Is it because I don't use a
 continous piece of memory, but rather a pointer to a pointer?
double[1000][] A, B, C; void main() { A = new double[1000][1000]; B = new double[1000][1000]; C = new double[1000][1000]; import std.conv : to; import std.datetime; import std.stdio : writeln; ini(A); ini(B); ini(C); auto r = benchmark!run_test(10000); auto res = to!Duration(r[0]); writeln(res); } void run_test() { MatMul(A, B, C); } void ini(T)(T mtx) { foreach(v; mtx) { v = 3.4; } foreach(i, v; mtx) { foreach(j, vv; v) { vv += (i * j) + (0.6 * j); } } } void MatMul(T)(T A, T B, T C) { foreach(cv; C) { cv = 0f; } foreach(i, cv; C) { foreach(j, av; A[i]) { foreach(k, cvv; cv) { cvv += av * B[j][k]; } } } } $ ldc2 test.d -O5 -release -oftest.exe -m64 $ ./test 3 secs, 995 ms, 115 μs, and 2 hnsecs Please verify that it is still doing the same thing that you want.
Below change is slightly faster: foreach(i, cv; C) { foreach(j, av; A[i]) { auto bv = B[j]; foreach(k, cvv; cv) { cvv += av * bv[k]; } } }
Aug 29 2016
prev sibling parent reply Steinhagelvoll <matthias.redies posteo.de> writes:
On Monday, 29 August 2016 at 13:02:43 UTC, rikki cattermole wrote:
 On 30/08/2016 12:13 AM, Steinhagelvoll wrote:
 [...]
double[1000][] A, B, C; void main() { A = new double[1000][1000]; B = new double[1000][1000]; C = new double[1000][1000]; [...]
It seems that the ini doesn't work properly. Every value seems to be nan. ini(A); ini(B); ini(C); writeln(A[0][0]); writeln(C[3][9]); nan nan
Aug 29 2016
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 30/08/2016 1:50 AM, Steinhagelvoll wrote:
 It seems that the ini doesn't work properly. Every value seems to be nan.

 ini(A);
 ini(B);
 ini(C);
 writeln(A[0][0]);
 writeln(C[3][9]);

 nan
 nan
My bad, fixed: double[1000][] A, B, C; void main() { A = new double[1000][1000]; B = new double[1000][1000]; C = new double[1000][1000]; import std.conv : to; import std.datetime; import std.stdio : writeln; ini(A); ini(B); ini(C); auto r = benchmark!run_test(10000); auto res = to!Duration(r[0]); writeln(res); } void run_test() { MatMul(A, B, C); } void ini(T)(T mtx) { foreach(ref v; mtx) { v = 3.4; } foreach(i, v; mtx) { foreach(j, ref vv; v) { vv += (i * j) + (0.6 * j); } } } void MatMul(T)(T A, T B, T C) { foreach(cv; C) { cv = 0f; } foreach(i, cv; C) { foreach(j, av; A[i]) { auto bv = B[j]; foreach(k, cvv; cv) { cvv += av * bv[k]; } } } }
Aug 29 2016
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
Okay looks like I've made a boo boo and ldc is compiling out that entire 
multiplication loop out.

Its passing the array statically and since its never assigned back, its 
just never compiled in (unless you specify it via ref).

So, this is where I give up as it is 2am.

Perhaps try and make it parallel (std.parallemism can help hugely).
Aug 29 2016
parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 16:08 rikki cattermole via Digitalmars-d-learn napsal(a):

 Okay looks like I've made a boo boo and ldc is compiling out that 
 entire multiplication loop out.

 Its passing the array statically and since its never assigned back, 
 its just never compiled in (unless you specify it via ref).

 So, this is where I give up as it is 2am.

 Perhaps try and make it parallel (std.parallemism can help hugely).
this is my version: import std.stdio; immutable int n = 1000, l=1000, m=1000; struct ZeroDouble { double val = 0f; alias val this; } void main(string[] args) { auto A = new double [1000][m]; auto B = new double [1000][n]; auto C = new ZeroDouble[1000][n]; ini!(A); ini!(B); MatMul!(A,B,C); writeln(C[1][1]); writefln("%d %d ", C.length, C[0].length); } void ini(alias mtx)(){ foreach(i, ref mtxInner; mtx) { foreach(j, ref cell; mtxInner) { cell = i*j + 0.6*j +3.4; } } } void MatMul(alias A, alias B, alias C)() { foreach(i, ref cv; C) { foreach(j, av; A[i]) { foreach(k, ref cvv; cv) { cvv += av * B[j][k]; } } } }
Aug 29 2016
prev sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 15:57 rikki cattermole via Digitalmars-d-learn napsal(a):

 My bad, fixed:

 double[1000][] A, B, C;

 void main() {
         A = new double[1000][1000];
         B = new double[1000][1000];
         C = new double[1000][1000];

         import std.conv : to;
         import std.datetime;
         import std.stdio : writeln;

         ini(A);
         ini(B);
         ini(C);

         auto r = benchmark!run_test(10000);
         auto res = to!Duration(r[0]);
         writeln(res);
 }

 void run_test() {
         MatMul(A, B, C);
 }

 void ini(T)(T mtx) {
         foreach(ref v; mtx) {
                 v = 3.4;
         }

         foreach(i, v; mtx) {
                 foreach(j, ref vv; v) {
                         vv += (i * j) + (0.6 * j);
                 }
         }
 }

 void MatMul(T)(T A, T B, T C) {
         foreach(cv; C) {
                 cv = 0f;
         }

         foreach(i, cv; C) {
                 foreach(j, av; A[i]) {
                         auto bv = B[j];
                         foreach(k, cvv; cv) {
                                 cvv += av * bv[k];
                         }
                 }
         }

 }
This will not work, you need to add some ref :). foreach(i, ref cv; C) { foreach(j, av; A[i]) { auto bv = B[j]; foreach(k, ref cvv; cv) { cvv += av * bv[k]; } } }
Aug 29 2016
prev sibling parent David Nadlinger <code klickverbot.at> writes:
On Monday, 29 August 2016 at 10:20:56 UTC, rikki cattermole wrote:
 By the looks you're not running the tests more then once.
 Druntime initialization could be effecting this.

 Please execute each test (without memory allocation) 10000 
 times atleast and then report back what they are.
D program startup is on the order of milliseconds, so the difference is negligible for a benchmark that runs for more than a second vs. 200 ms. — David
Aug 29 2016
prev sibling next sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For 
 this I implemented a matrix multiplication and compared the times for 
 different ways. As a reference I used a Fortran90 implementation.

 Fortran reference: http://pastebin.com/Hd5zTHVJ
 ifort test.f90  -o testf && time ./testf
 real    0m0.680s
 user    0m0.672s
 sys     0m0.008s

 ifort -O3 test.f90 -o testf && time ./testf
 real    0m0.235s
 user    0m0.228s
 sys     0m0.004s

 ifort -check all test.f90  -o testf && time ./testf
         1000

 real    0m34.993s
 user    0m35.012s
 sys     0m0.008s


 For D it tried a number of different ways:

 NDSlice: http://pastebin.com/nUbMnt8B
 real    0m35.922s
 user    0m35.888s
 sys    0m0.008


 1D Arrays: http://pastebin.com/R7CJFybK
 dmd -boundscheck=off -O test.d && time ./test
 real    0m4.415s
 user    0m4.412s
 sys    0m0.004s

 ldc2 -O3 test.d && time ./test
 real    0m4.261s
 user    0m4.252s
 sys    0m0.004s

 2D Arrays: http://pastebin.com/4CuB4Y0c

 dmd -boundscheck=off -O nd_test.d && time ./nd_test
 real    0m3.565s
 user    0m3.560s
 sys    0m0.004s


 ldc2 -O3 nd_test.d && time ./nd_test
 real    0m3.568s
 user    0m3.560s
 sys    0m0.004s

 None of them is even close to the Fortran implementation, only when I 
 enable all check in Fortran it seems to be equal to Ndslice. Is there 
 a speedy way to use multi-dimensional matrices?

 Kind regards

 Matthias
It is unfair to compare different backend: gfortran -O3 -o test test.f90 [kozak dajinka ~]$ time ./test real 0m2.072s user 0m2.053s sys 0m0.013s gdc -O3 -o test test.d [kozak dajinka ~]$ time ./test real 0m1.655s user 0m1.640s sys 0m0.010s Obviously ifort can use some special instruction on your CPU
Aug 29 2016
parent reply Steinhagelvoll <matthias.redies posteo.de> writes:
On Monday, 29 August 2016 at 13:59:15 UTC, Daniel Kozak wrote:
 Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn 
 napsal(a):

 [...]
It is unfair to compare different backend: gfortran -O3 -o test test.f90 [kozak dajinka ~]$ time ./test real 0m2.072s user 0m2.053s sys 0m0.013s gdc -O3 -o test test.d [kozak dajinka ~]$ time ./test real 0m1.655s user 0m1.640s sys 0m0.010s Obviously ifort can use some special instruction on your CPU
This seems to be it. I also implemented it in C++ (because gfortran isn't the main focus of GNU) and this is the result: $ ./cpp_test_clang elapsed time 1.12785 $ ./cpp_test_gpp elapsed time 1.24206 $ ./cpp_test_intel elapsed time 0.298331 It is quite surprising that there is this much of a difference, even when all run sequential. I believe this might be specific to this small problem.
Aug 29 2016
next sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 16:43 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 On Monday, 29 August 2016 at 13:59:15 UTC, Daniel Kozak wrote:
 Dne 29.8.2016 v 11:53 Steinhagelvoll via Digitalmars-d-learn napsal(a):

 [...]
It is unfair to compare different backend: gfortran -O3 -o test test.f90 [kozak dajinka ~]$ time ./test real 0m2.072s user 0m2.053s sys 0m0.013s gdc -O3 -o test test.d [kozak dajinka ~]$ time ./test real 0m1.655s user 0m1.640s sys 0m0.010s Obviously ifort can use some special instruction on your CPU
This seems to be it. I also implemented it in C++ (because gfortran isn't the main focus of GNU) and this is the result: $ ./cpp_test_clang elapsed time 1.12785 $ ./cpp_test_gpp elapsed time 1.24206 $ ./cpp_test_intel elapsed time 0.298331 It is quite surprising that there is this much of a difference, even when all run sequential. I believe this might be specific to this small problem.
with gcc you can try enable some optimalizations: g++ -O3 -march=native -o test test.cpp
Aug 29 2016
prev sibling parent reply Seb <seb wilzba.ch> writes:
On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll wrote:
 It is quite surprising that there is this much of a difference, 
 even when all run sequential. I believe this might be specific 
 to this small problem.
You should definitely have a look at this benchmark for matrix multiplication across a many languages: https://github.com/kostya/benchmarks#matmul With the recent generic GLAS kernel in mir, matrix multiplication in D is the blazingly fast (it improved the existing results by at least 8x). Please not that this requires the latest LDC beta with includes the fastMath pragma and GLAS is still under development at mir: https://github.com/libmir/mir
Aug 29 2016
parent reply Steinhagelvoll <matthias.redies posteo.de> writes:
On Monday, 29 August 2016 at 14:55:50 UTC, Seb wrote:
 On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll wrote:
 It is quite surprising that there is this much of a 
 difference, even when all run sequential. I believe this might 
 be specific to this small problem.
You should definitely have a look at this benchmark for matrix multiplication across a many languages: https://github.com/kostya/benchmarks#matmul With the recent generic GLAS kernel in mir, matrix multiplication in D is the blazingly fast (it improved the existing results by at least 8x). Please not that this requires the latest LDC beta with includes the fastMath pragma and GLAS is still under development at mir: https://github.com/libmir/mir
It not really about multiplying matrices. I wanted to see how D compares for different tasks. If I actually want to do matrix multiplication I will use LAPACK or something of that nature. In this task the difference was much bigger compared to e.g. prime testing, which was about even.
Aug 29 2016
parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Monday, 29 August 2016 at 15:46:26 UTC, Steinhagelvoll wrote:
 On Monday, 29 August 2016 at 14:55:50 UTC, Seb wrote:
 On Monday, 29 August 2016 at 14:43:08 UTC, Steinhagelvoll 
 wrote:
 It is quite surprising that there is this much of a 
 difference, even when all run sequential. I believe this 
 might be specific to this small problem.
You should definitely have a look at this benchmark for matrix multiplication across a many languages: https://github.com/kostya/benchmarks#matmul With the recent generic GLAS kernel in mir, matrix multiplication in D is the blazingly fast (it improved the existing results by at least 8x). Please not that this requires the latest LDC beta with includes the fastMath pragma and GLAS is still under development at mir: https://github.com/libmir/mir
It not really about multiplying matrices. I wanted to see how D compares for different tasks. If I actually want to do matrix multiplication I will use LAPACK or something of that nature. In this task the difference was much bigger compared to e.g. prime testing, which was about even.
ndslice is analog of numpy. It is more flexible comparing with Fortran arrays. In the same time, if you want fast iteration please use Mir, which includes upcoming ndslice.algorithm with fasmath attribute and `vectorized` flag for `ndReduce`. Note, that in-memory representation is important for vectorization, e.g. for dot product both slices should have strides equal to 1. Add also -mcpu=native flag for LDC. http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html#ndReduce Best regards, Ilya
Aug 29 2016
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 29 August 2016 at 09:53:12 UTC, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. 
 For this I implemented a matrix multiplication and compared the 
 times for different ways. As a reference I used a Fortran90 
 implementation.

 [...]
Any chance you can post the generated asm ? I have a suspicion: you are not passing your cpu arch to ldc, thus probably it generated i486 code.
Aug 29 2016
parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 29.8.2016 v 16:21 Stefan Koch via Digitalmars-d-learn napsal(a):

 On Monday, 29 August 2016 at 09:53:12 UTC, Steinhagelvoll wrote:
 Hello,

 I'm trying to find a fast way to use multi dimensional arrays. For 
 this I implemented a matrix multiplication and compared the times for 
 different ways. As a reference I used a Fortran90 implementation.

 [...]
Any chance you can post the generated asm ? I have a suspicion: you are not passing your cpu arch to ldc, thus probably it generated i486 code.
why i486, I belive it will select x86_64 by default on linux
Aug 29 2016