www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Matrix mul

reply bearophile <bearophileHUGS lycos.com> writes:
While writing code that works on matrices I have found something curious, so I
have written the following little benchmark. As usual keep eyes open for
possible bugs and mistakes of mine:


import std.conv: toInt;
import std.c.stdlib: rand, malloc;
const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib

void main(string[] args) {
    int i, j, k;
    int N = args.length == 2 ? toInt(args[1]) : 400;

    // alloc A, B, C
    static if (1) {
        // Version #1
        auto A = new double[][](N, N);
        auto B = new double[][](N, N);
        auto C = new double[][](N, N);
    } else {
        // Version #2
        double** A = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            A[i] = cast(double*)malloc(N * (double).sizeof);

        double** B = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            B[i] = cast(double*)malloc(N * (double).sizeof);

        double** C = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            C[i] = cast(double*)malloc(N * (double).sizeof);
    }

    // init mats randomly
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = cast(double)rand() / RAND_MAX;
            B[i][j] = cast(double)rand() / RAND_MAX;
        }

    // C = A * B, naive way
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][k] += A[i][j] * B[j][k];
}


Timings, Core2, 2 GHz, DMD V.1.036:
  N = 100:
    Version 1:  0.17 s
    Version 2:  0.03 s
  N = 400:
    Version 1:  9.25 s
    Version 2:  0.26 s
  N = 512:
    Version 1: 19.3 s
    Version 2:  1.0 s

--------------------------

Note there are much faster ways to perform A * B:

// Alternative mul code
double[] bj = new double[N];
for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++)
        bj[k] = B[k][j];
    for (int i = 0; i < N; i++) {
        double[] ai = A[i];
        double s = 0;
        for (int k = 0; k < N; k++)
            s += ai[k] * bj[k];
        C[i][j] = s;
    }
}

--------------------------------

Inner loop asm, version #1:

L164:
        mov ECX,[EAX]
        fld qword ptr [ESI*8][ECX]
        mov ECX,0[EBP]
        fmul    qword ptr [EBX*8][ECX]
        mov ECX,[EDX]
        fadd    qword ptr [EBX*8][ECX]
        fstp    qword ptr [EBX*8][ECX]
        inc EBX
        cmp EBX,EDI
        jl  L164


Inner loop asm, version #2:

L12B:
        mov EDX,4[EBX]
        mov EAX,[EBX]
        fld qword ptr [EDI*8][EDX]
        mov ECX,010h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fmul    qword ptr [ESI*8][EDX]
        mov ECX,014h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fadd    qword ptr [ESI*8][EDX]
        fstp    qword ptr [ESI*8][EDX]
        inc ESI
        cmp ESI,EBP
        jl  L12B


Note: compiling the alternative mul code written in C with GCC the running
speed is about twice still compared to the Version2.

Bye,
bearophile
Nov 22 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 While writing code that works on matrices I have found something
 curious, so I have written the following little benchmark. As usual
 keep eyes open for possible bugs and mistakes of mine:

This is yet another proof that bounds checking can cost a lot. Although the looping code looks the same, in the case of using arrays there are plenty of bounds checks going on. My guess is that if you turn that off, the differences won't be as large (or even detectable for certain ranges of N).
 --------------------------
 Note there are much faster ways to perform A * B:

 // Alternative mul code
 double[] bj = new double[N];
 for (int j = 0; j < N; j++) {
     for (int k = 0; k < N; k++)
         bj[k] = B[k][j];
     for (int i = 0; i < N; i++) {
         double[] ai = A[i];
         double s = 0;
         for (int k = 0; k < N; k++)
             s += ai[k] * bj[k];
         C[i][j] = s;
     }
 }

Probably blocking will bring even more mileage (but again that depends on N). Andrei
Nov 22 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

The array bounds aren't controlled, the code is compiled with -O -release -inline. Do you see array bound controls in the asm code at the bottom of my post?
 Probably blocking will bring even more mileage (but again that depends 
 on N).

Yes, blocking may help. And using SSE instructions may help some more. The end result may be hundred or more times faster than the naive code in D :-) Bye, bearophile
Nov 22 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter Wrote:
 I haven't done any benchmarks, though. :-)

I see. But using something because it's supposed to be faster without performing actual performance comparisons looks a little strange to me :-)
 Might be interesting to try out my MingGW-compiled ATLAS BLAS matrix
 mult against the numbers you're getting there.

Can you or someone else run that little D code, so you can tell me if my timings are right? Bye, bearophile
Nov 22 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to bearophile,

 Bill Baxter Wrote:
 
 I haven't done any benchmarks, though. :-)
 

performing actual performance comparisons looks a little strange to me :-)

BLAS may well be the most tested, optimized and benchmarked code in the world. It is highly unlikely that anything will be faster on hardwhere anyone is likely to run across. Given that the vast bulk of super computer time is spent on matrix math, it should be fair to say that BLAS is probably the most executed code on the planet.
Nov 22 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter:
 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.

Performing many benchmarks teaches you that it's better not assume too much things. Nature and computers often find ways to surprise you :-) Bye, bearophile
Nov 23 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter:
 I do find all your benchmark postings interesting.

Most of them are buggy, in the beginning... The main problem is that they often do nothing real, so there's no practical result to test. In the future I'll add better tests to avoid the most silly errors of mine.
 Anyway I gave it
 a try, and it does multiply the 400x400 matrices about 6.7x faster
 than the fastest result from the naive mult implementation in your
 benchmark.

Thank you, that's a lot of difference. In my original post I have also given a block of code to multiply the matrices in a less naive way, you may test that too, if you have time and you want. Bye, bearophile
Nov 23 2008
prev sibling next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sat, 22 Nov 2008 20:14:37 -0500, bearophile wrote:

 Can you or someone else run that little D code, so you can tell me if
 my timings are right?

Tested your code using DMD 2.019. You are right. The #1 is about 30 times slower than #2: 10s against 0,3s on my laptop. I've also tried to replace CRT malloc with GC malloc in #2 and it ran just a bit slower--I would say around 4%--even though I didn't call hasNoPointers() anywhere. The really weird part is, if I comment out the "init mats randomly" loop in #1, it becomes twice as slow, i.e. 20s against the original 10s. I don't get it.
Nov 22 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sergey Gromov wrote:
 Sat, 22 Nov 2008 20:14:37 -0500, bearophile wrote:
 
 Can you or someone else run that little D code, so you can tell me if
 my timings are right?

Tested your code using DMD 2.019. You are right. The #1 is about 30 times slower than #2: 10s against 0,3s on my laptop. I've also tried to replace CRT malloc with GC malloc in #2 and it ran just a bit slower--I would say around 4%--even though I didn't call hasNoPointers() anywhere. The really weird part is, if I comment out the "init mats randomly" loop in #1, it becomes twice as slow, i.e. 20s against the original 10s. I don't get it.

I think what happens is that the speed of FP arithmetic depends on the values of the operands. Whenever a NaN is involved everything gets a lot slower (tested that). Also, I believe (without having tested it) that multiplying with zero is faster than multiplying nonzero numbers. Andrei
Nov 23 2008
parent Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 23 Nov 2008 07:33:16 -0600, Andrei Alexandrescu wrote:

 Sergey Gromov wrote:
 The really weird part is, if I comment out the "init mats randomly" loop
 in #1, it becomes twice as slow, i.e. 20s against the original 10s.  I
 don't get it.

I think what happens is that the speed of FP arithmetic depends on the values of the operands. Whenever a NaN is involved everything gets a lot slower (tested that).

I'm always forgetting that an uninitialized double is NaN... That should be the reason for at least this slowdown.
Nov 23 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Sun, Nov 23, 2008 at 12:25 PM, BCS <ao pathlink.com> wrote:
 Reply to bearophile,

 Bill Baxter Wrote:

 I haven't done any benchmarks, though. :-)

performing actual performance comparisons looks a little strange to me :-)

BLAS may well be the most tested, optimized and benchmarked code in the world. It is highly unlikely that anything will be faster on hardwhere anyone is likely to run across. Given that the vast bulk of super computer time is spent on matrix math, it should be fair to say that BLAS is probably the most executed code on the planet.

Exactly. That's why I haven't spent too much time benchmarking it. It would be quite surprising if something I wrote in D outperformed the ATLAS SSE3 optimized BLAS implementation. (Though It would not be so surprising if something Don wrote managed to outperform it. :-) ) --bb
Nov 23 2008
parent BCS <ao pathlink.com> writes:
Reply to Bill,


 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.  (Though It would not be
 so surprising if something Don wrote managed to outperform it. :-) )
 --bb
 

It wouldn't suprise me if Don wrote it!
Nov 23 2008
prev sibling parent "Bill Baxter" <wbaxter gmail.com> writes:
On Sun, Nov 23, 2008 at 6:15 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.

Performing many benchmarks teaches you that it's better not assume too much things. Nature and computers often find ways to surprise you :-)

I do find all your benchmark postings interesting. I'd heard the DMD fp codegen wasn't so great, but I'm quite convinced now. In this particular case I just haven't had any performance problems with my setup yet, so I haven't felt the need to investigate. I needed various LAPACK routines anyway, and LAPACK depends on BLAS, so BLAS is just sitting there. I might as well use it. Anyway I gave it a try, and it does multiply the 400x400 matrices about 6.7x faster than the fastest result from the naive mult implementation in your benchmark. However it should be noted that BLAS only accepts contiguous arrays, so I only tried it on your V2 allocation strategy that allocates one big array. --bb
Nov 23 2008
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

The array bounds aren't controlled, the code is compiled with -O -release -inline. Do you see array bound controls in the asm code at the bottom of my post?

Oh sorry for the confusion; I thought the assembler was hand-written, not generated. Andrei
Nov 23 2008
prev sibling parent "Bill Baxter" <wbaxter gmail.com> writes:
On Sun, Nov 23, 2008 at 8:29 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

The array bounds aren't controlled, the code is compiled with -O -release -inline. Do you see array bound controls in the asm code at the bottom of my post?
 Probably blocking will bring even more mileage (but again that depends
 on N).

Yes, blocking may help. And using SSE instructions may help some more. The end result may be hundred or more times faster than the naive code in D :-)

This is why I prefer to call on an optimized BLAS lib for all my large matrix multiplication needs. All that nonsense is already taken care of. And it's compiled with GCC which has better floating point optimization to begin with. I haven't done any benchmarks, though. :-) Might be interesting to try out my MingGW-compiled ATLAS BLAS matrix mult against the numbers you're getting there. --bb
Nov 22 2008
prev sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 While writing code that works on matrices I have found something curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.
Nov 22 2008
next sibling parent reply Tom S <h3r3tic remove.mat.uni.torun.pl> writes:
Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) Version #1 initializes C to NaN, Version #2 initializes it to 0. The 'init mats randomly' loop doesn't touch C at all, thus all the latter additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D -- Tomasz Stachowiak http://h3.team0xf.com/ h3/h3r3tic on #D freenode
Nov 22 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Tom S:
 Nah, it's about NaNs :)
 Version #1 initializes C to NaN, Version #2 initializes it to 0. The 
 'init mats randomly' loop doesn't touch C at all, thus all the latter 
 additions leave C at NaN, causing lots of FP exceptions.

You are right, and I'm a stupid. Most of my benchmarks have similar silly bugs at the beginning. Thank you. The new code: import std.conv: toInt; import std.c.stdlib: rand, malloc; const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib T[][] NewVoidCMatrix(T)(int nr, int nc) { // Part of the code by Marius Muja <mariusm cs.ubc.ca> assert(nr > 0, "NewVoidCMatrix: nr must be > 0."); assert(nc > 0, "NewVoidCMatrix: nc must be > 0."); void* mem = cast(void*)malloc(nr * (T[]).sizeof + nr * nc * T.sizeof); T[]* index = cast(T[]*)mem; T* mat = cast(T*)(mem + nr * (T[]).sizeof); for (int i = 0; i < nr; ++i) { index[i] = mat[0 .. nc]; mat += nc; } return index[0 .. nr]; } void main(string[] args) { int i, j, k; int N = args.length == 2 ? toInt(args[1]) : 400; // alloc A, B, C static if (0) { // Version #1 auto A = new double[][](N, N); auto B = new double[][](N, N); auto C = new double[][](N, N); } static if (0) { // Version #2 auto A = NewVoidCMatrix!(double)(N, N); auto B = NewVoidCMatrix!(double)(N, N); auto C = NewVoidCMatrix!(double)(N, N); } static if (1) { // Version #3 double** A = cast(double**)malloc(N * (double*).sizeof); for (i = 0; i < N; i++) A[i] = cast(double*)malloc(N * (double).sizeof); double** B = cast(double**)malloc(N * (double*).sizeof); for (i = 0; i < N; i++) B[i] = cast(double*)malloc(N * (double).sizeof); double** C = cast(double**)malloc(N * (double*).sizeof); for (i = 0; i < N; i++) C[i] = cast(double*)malloc(N * (double).sizeof); } static if (0) { // Version #4 double** A = cast(double**)malloc(N * (double*).sizeof); A[0] = cast(double*)malloc(N * N * (double).sizeof); for(i = 1; i < N; i++) A[i] = A[0] + i * N; double** B = cast(double**)malloc(N * (double*).sizeof); B[0] = cast(double*)malloc(N * N * (double).sizeof); for(i = 1; i < N; i++) B[i] = B[0] + i * N; double** C = cast(double**)malloc(N * (double*).sizeof); C[0] = cast(double*)malloc(N * N * (double).sizeof); for(i = 1; i < N; i++) C[i] = C[0] + i * N; } // init mats randomly for (i = 0; i < N; i++) for (j = 0; j < N; j++) { A[i][j] = cast(double)rand() / RAND_MAX; B[i][j] = cast(double)rand() / RAND_MAX; C[i][j] = 0.0; } // C = A * B, naive way for (i = 0; i < N; i++) for (j = 0; j < N; j++) for (k = 0; k < N; k++) C[i][k] += A[i][j] * B[j][k]; } I have added two more versions for Don, to show the difference of speed relative to spreading of memory caused by the malloc. And it's visible but not significant. Timings are, n=400: V1: 0.42 s V2: 0.41 s (repeatable difference of 0.01 s) V3: 0.25 s V4: 0.25 s Now the difference in speed has to come from something else. It may come from the DMD not optimizing dynamic arrays usage as well as C arrays. A test I haven't done yet is to change the representation of the dynamic arrays: instead of allocating an array of structs that contain len+ptr, I can allocate two "parallel arrays", one with only len and the other with just the pointers. I think this may increase the locality of reference of the pointers, maybe leading to performance similar to the C arrays. Bye, bearophile
Nov 23 2008
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) Version #1 initializes C to NaN, Version #2 initializes it to 0. The 'init mats randomly' loop doesn't touch C at all, thus all the latter additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D

I swear I wrote my previous message before reading this :o). Damn you're right. I hadn't seen the bug in initialization. But since version #2 uses malloc, isn't that memory garbage (possibly, but not necessarily, zero)? Andrei
Nov 23 2008
parent reply Tom S <h3r3tic remove.mat.uni.torun.pl> writes:
Andrei Alexandrescu wrote:
 Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) Version #1 initializes C to NaN, Version #2 initializes it to 0. The 'init mats randomly' loop doesn't touch C at all, thus all the latter additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D

I swear I wrote my previous message before reading this :o). Damn you're right. I hadn't seen the bug in initialization. But since version #2 uses malloc, isn't that memory garbage (possibly, but not necessarily, zero)?

Damn, you're right too ;) I always forget whether malloc or calloc initializes to zero. According to my tests, DMD/Tango's malloc returns mostly zeroes in bearophile's test anyway. -- Tomasz Stachowiak http://h3.team0xf.com/ h3/h3r3tic on #D freenode
Nov 23 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Tom S wrote:
 Andrei Alexandrescu wrote:
 Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) Version #1 initializes C to NaN, Version #2 initializes it to 0. The 'init mats randomly' loop doesn't touch C at all, thus all the latter additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D

I swear I wrote my previous message before reading this :o). Damn you're right. I hadn't seen the bug in initialization. But since version #2 uses malloc, isn't that memory garbage (possibly, but not necessarily, zero)?

Damn, you're right too ;) I always forget whether malloc or calloc initializes to zero. According to my tests, DMD/Tango's malloc returns mostly zeroes in bearophile's test anyway.

In my experience malloc returns zeroed memory around the start of an application (due to the OS applying simple security considerations). Later on, as memory is reused, there's no more zeroes - it's whatever you left in there. Andrei
Nov 23 2008
prev sibling parent "Bill Baxter" <wbaxter gmail.com> writes:
Doh!  The NaNs strike again.

--bb

On Sun, Nov 23, 2008 at 3:40 PM, Tom S <h3r3tic remove.mat.uni.torun.pl> wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something
 curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) Version #1 initializes C to NaN, Version #2 initializes it to 0. The 'init mats randomly' loop doesn't touch C at all, thus all the latter additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D -- Tomasz Stachowiak http://h3.team0xf.com/ h3/h3r3tic on #D freenode

Nov 23 2008