digitalmars.D - Matrix mul

bearophile (95/95) Nov 22 2008 While writing code that works on matrices I have found something curious...

Andrei Alexandrescu (10/28) Nov 22 2008 [snip]

bearophile (6/10) Nov 22 2008 The array bounds aren't controlled, the code is compiled with -O -releas...

Bill Baxter (9/17) Nov 22 2008 This is why I prefer to call on an optimized BLAS lib for all my large

bearophile (5/8) Nov 22 2008 Can you or someone else run that little D code, so you can tell me if my...

BCS (6/14) Nov 22 2008 BLAS may well be the most tested, optimized and benchmarked code in the ...

Bill Baxter (6/20) Nov 23 2008 Exactly. That's why I haven't spent too much time benchmarking it.

bearophile (4/7) Nov 23 2008 Performing many benchmarks teaches you that it's better not assume too m...

Bill Baxter (13/18) Nov 23 2008 I do find all your benchmark postings interesting. I'd heard the DMD

bearophile (6/11) Nov 23 2008 Thank you, that's a lot of difference.

BCS (2/8) Nov 23 2008 It wouldn't suprise me if Don wrote it!

Sergey Gromov (8/10) Nov 22 2008 Tested your code using DMD 2.019. You are right. The #1 is about 30

Andrei Alexandrescu (6/19) Nov 23 2008 I think what happens is that the speed of FP arithmetic depends on the

Sergey Gromov (3/11) Nov 23 2008 I'm always forgetting that an uninitialized double is NaN... That

Andrei Alexandrescu (4/10) Nov 23 2008 Oh sorry for the confusion; I thought the assembler was hand-written,

Don (9/10) Nov 22 2008 Here's what I think is going on. AFAIK, D hasn't got any special code

Tom S (11/24) Nov 22 2008 Nah, it's about NaNs :)

Bill Baxter (3/29) Nov 23 2008 Doh! The NaNs strike again.
bearophile (82/86) Nov 23 2008 You are right, and I'm a stupid. Most of my benchmarks have similar sill...
Andrei Alexandrescu (6/29) Nov 23 2008 I swear I wrote my previous message before reading this :o). Damn you're...

Tom S (8/37) Nov 23 2008 Damn, you're right too ;) I always forget whether malloc or calloc

Andrei Alexandrescu (6/40) Nov 23 2008 In my experience malloc returns zeroed memory around the start of an

bearophile <bearophileHUGS lycos.com> writes:

While writing code that works on matrices I have found something curious, so I
have written the following little benchmark. As usual keep eyes open for
possible bugs and mistakes of mine:


import std.conv: toInt;
import std.c.stdlib: rand, malloc;
const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib

void main(string[] args) {
    int i, j, k;
    int N = args.length == 2 ? toInt(args[1]) : 400;

    // alloc A, B, C
    static if (1) {

        auto A = new double[][](N, N);
        auto B = new double[][](N, N);
        auto C = new double[][](N, N);
    } else {

        double** A = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            A[i] = cast(double*)malloc(N * (double).sizeof);

        double** B = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            B[i] = cast(double*)malloc(N * (double).sizeof);

        double** C = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            C[i] = cast(double*)malloc(N * (double).sizeof);
    }

    // init mats randomly
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = cast(double)rand() / RAND_MAX;
            B[i][j] = cast(double)rand() / RAND_MAX;
        }

    // C = A * B, naive way
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][k] += A[i][j] * B[j][k];
}


Timings, Core2, 2 GHz, DMD V.1.036:
  N = 100:
    Version 1:  0.17 s
    Version 2:  0.03 s
  N = 400:
    Version 1:  9.25 s
    Version 2:  0.26 s
  N = 512:
    Version 1: 19.3 s
    Version 2:  1.0 s

--------------------------

Note there are much faster ways to perform A * B:

// Alternative mul code
double[] bj = new double[N];
for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++)
        bj[k] = B[k][j];
    for (int i = 0; i < N; i++) {
        double[] ai = A[i];
        double s = 0;
        for (int k = 0; k < N; k++)
            s += ai[k] * bj[k];
        C[i][j] = s;
    }
}

--------------------------------



L164:
        mov ECX,[EAX]
        fld qword ptr [ESI*8][ECX]
        mov ECX,0[EBP]
        fmul    qword ptr [EBX*8][ECX]
        mov ECX,[EDX]
        fadd    qword ptr [EBX*8][ECX]
        fstp    qword ptr [EBX*8][ECX]
        inc EBX
        cmp EBX,EDI
        jl  L164




L12B:
        mov EDX,4[EBX]
        mov EAX,[EBX]
        fld qword ptr [EDI*8][EDX]
        mov ECX,010h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fmul    qword ptr [ESI*8][EDX]
        mov ECX,014h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fadd    qword ptr [ESI*8][EDX]
        fstp    qword ptr [ESI*8][EDX]
        inc ESI
        cmp ESI,EBP
        jl  L12B


Note: compiling the alternative mul code written in C with GCC the running
speed is about twice still compared to the Version2.

Bye,
bearophile

Nov 22 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 While writing code that works on matrices I have found something
 curious, so I have written the following little benchmark. As usual
 keep eyes open for possible bugs and mistakes of mine:

[snip]

This is yet another proof that bounds checking can cost a lot. Although 
the looping code looks the same, in the case of using arrays there are 
plenty of bounds checks going on. My guess is that if you turn that off, 
the differences won't be as large (or even detectable for certain ranges 
of N).

 --------------------------
 Note there are much faster ways to perform A * B:

 // Alternative mul code
 double[] bj = new double[N];
 for (int j = 0; j < N; j++) {
     for (int k = 0; k < N; k++)
         bj[k] = B[k][j];
     for (int i = 0; i < N; i++) {
         double[] ai = A[i];
         double s = 0;
         for (int k = 0; k < N; k++)
             s += ai[k] * bj[k];
         C[i][j] = s;
     }
 }

Probably blocking will bring even more mileage (but again that depends 
on N).


Andrei

Nov 22 2008

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

The array bounds aren't controlled, the code is compiled with -O -release
-inline.
Do you see array bound controls in the asm code at the bottom of my post?


 Probably blocking will bring even more mileage (but again that depends 
 on N).

Yes, blocking may help. And using SSE instructions may help some more. The end
result may be hundred or more times faster than the naive code in D :-)

Bye,
bearophile

Nov 22 2008

"Bill Baxter" <wbaxter gmail.com> writes:

On Sun, Nov 23, 2008 at 8:29 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

 The array bounds aren't controlled, the code is compiled with -O -release
-inline.
 Do you see array bound controls in the asm code at the bottom of my post?


 Probably blocking will bring even more mileage (but again that depends
 on N).

 Yes, blocking may help. And using SSE instructions may help some more. The end
result may be hundred or more times faster than the naive code in D :-)

This is why I prefer to call on an optimized BLAS lib for all my large
matrix multiplication needs.  All that nonsense is already taken care
of.  And it's compiled with GCC which has better floating point
optimization to begin with.  I haven't done any benchmarks, though.
:-)
Might be interesting to try out my MingGW-compiled ATLAS BLAS matrix
mult against the numbers you're getting there.

--bb

Nov 22 2008

bearophile <bearophileHUGS lycos.com> writes:

Bill Baxter Wrote:
 I haven't done any benchmarks, though. :-)

I see. But using something because it's supposed to be faster without
performing actual performance comparisons looks a little strange to me :-)


 Might be interesting to try out my MingGW-compiled ATLAS BLAS matrix
 mult against the numbers you're getting there.

Can you or someone else run that little D code, so you can tell me if my
timings are right?

Bye,
bearophile

Nov 22 2008

BCS <ao pathlink.com> writes:

Reply to bearophile,

 Bill Baxter Wrote:
 
 I haven't done any benchmarks, though. :-)
 

 I see. But using something because it's supposed to be faster without
 performing actual performance comparisons looks a little strange to me
 :-)
 

BLAS may well be the most tested, optimized and benchmarked code in the world. 
It is highly unlikely that anything will be faster on hardwhere anyone is 
likely to run across. Given that the vast bulk of super computer time is 
spent on matrix math, it should be fair to say that BLAS is probably the 
most executed code on the planet.

Nov 22 2008

"Bill Baxter" <wbaxter gmail.com> writes:

On Sun, Nov 23, 2008 at 12:25 PM, BCS <ao pathlink.com> wrote:
 Reply to bearophile,

 Bill Baxter Wrote:

 I haven't done any benchmarks, though. :-)

 I see. But using something because it's supposed to be faster without
 performing actual performance comparisons looks a little strange to me
 :-)

 BLAS may well be the most tested, optimized and benchmarked code in the
 world. It is highly unlikely that anything will be faster on hardwhere
 anyone is likely to run across. Given that the vast bulk of super computer
 time is spent on matrix math, it should be fair to say that BLAS is probably
 the most executed code on the planet.

Exactly.  That's why I haven't spent too much time benchmarking it.
It would be quite surprising if something I wrote in D outperformed
the ATLAS SSE3 optimized BLAS implementation.  (Though It would not be
so surprising if something Don wrote managed to outperform it. :-) )

--bb

Nov 23 2008

bearophile <bearophileHUGS lycos.com> writes:

Bill Baxter:
 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.

Performing many benchmarks teaches you that it's better not assume too much
things. Nature and computers often find ways to surprise you :-)

Bye,
bearophile

Nov 23 2008

"Bill Baxter" <wbaxter gmail.com> writes:

On Sun, Nov 23, 2008 at 6:15 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.

 Performing many benchmarks teaches you that it's better not assume too much
things. Nature and computers often find ways to surprise you :-)

I do find all your benchmark postings interesting.  I'd heard the DMD
fp codegen wasn't so great, but I'm quite convinced now.

In this particular case I just haven't had any performance problems
with my setup yet, so I haven't felt the need to investigate.  I
needed various LAPACK routines anyway, and LAPACK depends on BLAS, so
BLAS is just sitting there.  I might as well use it. Anyway I gave it
a try, and it does multiply the 400x400 matrices about 6.7x faster
than the fastest result from the naive mult implementation in your
benchmark.  However it should be noted that BLAS only accepts
contiguous arrays, so I only tried it on your V2 allocation strategy
that allocates one big array.

--bb

Nov 23 2008

bearophile <bearophileHUGS lycos.com> writes:

Bill Baxter:
 I do find all your benchmark postings interesting.

Most of them are buggy, in the beginning... The main problem is that they often
do nothing real, so there's no practical result to test. In the future I'll add
better tests to avoid the most silly errors of mine.


 Anyway I gave it
 a try, and it does multiply the 400x400 matrices about 6.7x faster
 than the fastest result from the naive mult implementation in your
 benchmark.

Thank you, that's a lot of difference.
In my original post I have also given a block of code to multiply the matrices
in a less naive way, you may test that too, if you have time and you want.

Bye,
bearophile

Nov 23 2008

BCS <ao pathlink.com> writes:

Reply to Bill,


 Exactly.  That's why I haven't spent too much time benchmarking it.
 It would be quite surprising if something I wrote in D outperformed
 the ATLAS SSE3 optimized BLAS implementation.  (Though It would not be
 so surprising if something Don wrote managed to outperform it. :-) )
 --bb
 

It wouldn't suprise me if Don wrote it!

Nov 23 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Sat, 22 Nov 2008 20:14:37 -0500, bearophile wrote:

 Can you or someone else run that little D code, so you can tell me if
 my timings are right?




would say around 4%--even though I didn't call hasNoPointers() anywhere.

The really weird part is, if I comment out the "init mats randomly" loop

don't get it.

Nov 22 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Sat, 22 Nov 2008 20:14:37 -0500, bearophile wrote:
 
 Can you or someone else run that little D code, so you can tell me if
 my timings are right?

 



 would say around 4%--even though I didn't call hasNoPointers() anywhere.
 
 The really weird part is, if I comment out the "init mats randomly" loop

 don't get it.

I think what happens is that the speed of FP arithmetic depends on the 
values of the operands. Whenever a NaN is involved everything gets a lot 
slower (tested that). Also, I believe (without having tested it) that 
multiplying with zero is faster than multiplying nonzero numbers.

Andrei

Nov 23 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Sun, 23 Nov 2008 07:33:16 -0600, Andrei Alexandrescu wrote:

 Sergey Gromov wrote:
 The really weird part is, if I comment out the "init mats randomly" loop

 don't get it.

 
 I think what happens is that the speed of FP arithmetic depends on the 
 values of the operands. Whenever a NaN is involved everything gets a lot 
 slower (tested that).

I'm always forgetting that an uninitialized double is NaN...  That
should be the reason for at least this slowdown.

Nov 23 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 My guess is that if you turn that off, the differences won't be as large
 (or even detectable for certain ranges of N).

 
 The array bounds aren't controlled, the code is compiled with -O -release
-inline.
 Do you see array bound controls in the asm code at the bottom of my post?

Oh sorry for the confusion; I thought the assembler was hand-written, 
not generated.

Andrei

Nov 23 2008

Don <nospam nospam.com> writes:

bearophile wrote:
 While writing code that works on matrices I have found something curious...

Here's what I think is going on. AFAIK, D hasn't got any special code 
for initializing jagged arrays. So
         auto A = new double[][](N, N);
involves N+1 memory allocations.
As well as being slow, this is going to give horrible locality of 
reference in the memory allocation. You'll suffer cache misses on almost 
every array access!

I don't think this has anything to do with bounds checking.

Nov 22 2008

Tom S <h3r3tic remove.mat.uni.torun.pl> writes:

Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

 
 Here's what I think is going on. AFAIK, D hasn't got any special code 
 for initializing jagged arrays. So
         auto A = new double[][](N, N);
 involves N+1 memory allocations.
 As well as being slow, this is going to give horrible locality of 
 reference in the memory allocation. You'll suffer cache misses on almost 
 every array access!
 
 I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :)


'init mats randomly' loop doesn't touch C at all, thus all the latter 
additions leave C at NaN, causing lots of FP exceptions.

We actually had something like this happen in Deadlock a few times in 
the animation subsystem. Brought the framerate down to a single digit :D


-- 
Tomasz Stachowiak
http://h3.team0xf.com/
h3/h3r3tic on #D freenode

Nov 22 2008

"Bill Baxter" <wbaxter gmail.com> writes:

Doh!  The NaNs strike again.

--bb

On Sun, Nov 23, 2008 at 3:40 PM, Tom S <h3r3tic remove.mat.uni.torun.pl> wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something
 curious...

 Here's what I think is going on. AFAIK, D hasn't got any special code for
 initializing jagged arrays. So
        auto A = new double[][](N, N);
 involves N+1 memory allocations.
 As well as being slow, this is going to give horrible locality of
 reference in the memory allocation. You'll suffer cache misses on almost
 every array access!

 I don't think this has anything to do with bounds checking.

 Nah, it's about NaNs :)


 mats randomly' loop doesn't touch C at all, thus all the latter additions
 leave C at NaN, causing lots of FP exceptions.

 We actually had something like this happen in Deadlock a few times in the
 animation subsystem. Brought the framerate down to a single digit :D


 --
 Tomasz Stachowiak
 http://h3.team0xf.com/
 h3/h3r3tic on #D freenode

Nov 23 2008

bearophile <bearophileHUGS lycos.com> writes:

Tom S:
 Nah, it's about NaNs :)

 'init mats randomly' loop doesn't touch C at all, thus all the latter 
 additions leave C at NaN, causing lots of FP exceptions.

You are right, and I'm a stupid. Most of my benchmarks have similar silly bugs
at the beginning. Thank you.

The new code:

import std.conv: toInt;
import std.c.stdlib: rand, malloc;
const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib

T[][] NewVoidCMatrix(T)(int nr, int nc) {
    // Part of the code by Marius Muja <mariusm cs.ubc.ca>
    assert(nr > 0, "NewVoidCMatrix: nr must be > 0.");
    assert(nc > 0, "NewVoidCMatrix: nc must be > 0.");
    void* mem = cast(void*)malloc(nr * (T[]).sizeof + nr * nc * T.sizeof);
    T[]* index = cast(T[]*)mem;
    T* mat = cast(T*)(mem + nr * (T[]).sizeof);

    for (int i = 0; i < nr; ++i) {
        index[i] = mat[0 .. nc];
        mat += nc;
    }

    return index[0 .. nr];
}

void main(string[] args) {
    int i, j, k;
    int N = args.length == 2 ? toInt(args[1]) : 400;

    // alloc A, B, C
    static if (0) {

        auto A = new double[][](N, N);
        auto B = new double[][](N, N);
        auto C = new double[][](N, N);
    } static if (0) {

        auto A = NewVoidCMatrix!(double)(N, N);
        auto B = NewVoidCMatrix!(double)(N, N);
        auto C = NewVoidCMatrix!(double)(N, N);
    } static if (1) {

        double** A = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            A[i] = cast(double*)malloc(N * (double).sizeof);

        double** B = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            B[i] = cast(double*)malloc(N * (double).sizeof);

        double** C = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            C[i] = cast(double*)malloc(N * (double).sizeof);
    } static if (0) {

        double** A = cast(double**)malloc(N * (double*).sizeof);
        A[0] = cast(double*)malloc(N * N * (double).sizeof);
        for(i = 1; i < N; i++)
            A[i] = A[0] + i * N;

        double** B = cast(double**)malloc(N * (double*).sizeof);
        B[0] = cast(double*)malloc(N * N * (double).sizeof);
        for(i = 1; i < N; i++)
            B[i] = B[0] + i * N;

        double** C = cast(double**)malloc(N * (double*).sizeof);
        C[0] = cast(double*)malloc(N * N * (double).sizeof);
        for(i = 1; i < N; i++)
            C[i] = C[0] + i * N;
    }

    // init mats randomly
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = cast(double)rand() / RAND_MAX;
            B[i][j] = cast(double)rand() / RAND_MAX;
            C[i][j] = 0.0;
        }

    // C = A * B, naive way
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][k] += A[i][j] * B[j][k];
}

I have added two more versions for Don, to show the difference of speed
relative to spreading of memory caused by the malloc. And it's visible but not
significant.

Timings are, n=400:
  V1: 0.42 s
  V2: 0.41 s (repeatable difference of 0.01 s)
  V3: 0.25 s
  V4: 0.25 s

Now the difference in speed has to come from something else. It may come from
the DMD not optimizing dynamic arrays usage as well as C arrays.

A test I haven't done yet is to change the representation of the dynamic
arrays: instead of allocating an array of structs that contain len+ptr, I can
allocate two "parallel arrays", one with only len and the other with just the
pointers. I think this may increase the locality of reference of the pointers,
maybe leading to performance similar to the C arrays.

Bye,
bearophile

Nov 23 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

 Here's what I think is going on. AFAIK, D hasn't got any special code 
 for initializing jagged arrays. So
         auto A = new double[][](N, N);
 involves N+1 memory allocations.
 As well as being slow, this is going to give horrible locality of 
 reference in the memory allocation. You'll suffer cache misses on 
 almost every array access!

 I don't think this has anything to do with bounds checking.

 
 Nah, it's about NaNs :)
 

 'init mats randomly' loop doesn't touch C at all, thus all the latter 
 additions leave C at NaN, causing lots of FP exceptions.
 
 We actually had something like this happen in Deadlock a few times in 
 the animation subsystem. Brought the framerate down to a single digit :D

I swear I wrote my previous message before reading this :o). Damn you're 

uses malloc, isn't that memory garbage (possibly, but not necessarily, 
zero)?

Andrei

Nov 23 2008

Tom S <h3r3tic remove.mat.uni.torun.pl> writes:

Andrei Alexandrescu wrote:
 Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

 Here's what I think is going on. AFAIK, D hasn't got any special code 
 for initializing jagged arrays. So
         auto A = new double[][](N, N);
 involves N+1 memory allocations.
 As well as being slow, this is going to give horrible locality of 
 reference in the memory allocation. You'll suffer cache misses on 
 almost every array access!

 I don't think this has anything to do with bounds checking.

 Nah, it's about NaNs :)


 'init mats randomly' loop doesn't touch C at all, thus all the latter 
 additions leave C at NaN, causing lots of FP exceptions.

 We actually had something like this happen in Deadlock a few times in 
 the animation subsystem. Brought the framerate down to a single digit :D

 
 I swear I wrote my previous message before reading this :o). Damn you're 

 uses malloc, isn't that memory garbage (possibly, but not necessarily, 
 zero)?

Damn, you're right too ;) I always forget whether malloc or calloc 
initializes to zero. According to my tests, DMD/Tango's malloc returns 
mostly zeroes in bearophile's test anyway.


-- 
Tomasz Stachowiak
http://h3.team0xf.com/
h3/h3r3tic on #D freenode

Nov 23 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Tom S wrote:
 Andrei Alexandrescu wrote:
 Tom S wrote:
 Don wrote:
 bearophile wrote:
 While writing code that works on matrices I have found something 
 curious...

 Here's what I think is going on. AFAIK, D hasn't got any special 
 code for initializing jagged arrays. So
         auto A = new double[][](N, N);
 involves N+1 memory allocations.
 As well as being slow, this is going to give horrible locality of 
 reference in the memory allocation. You'll suffer cache misses on 
 almost every array access!

 I don't think this has anything to do with bounds checking.

 Nah, it's about NaNs :)


 'init mats randomly' loop doesn't touch C at all, thus all the latter 
 additions leave C at NaN, causing lots of FP exceptions.

 We actually had something like this happen in Deadlock a few times in 
 the animation subsystem. Brought the framerate down to a single digit :D

 I swear I wrote my previous message before reading this :o). Damn 
 you're right. I hadn't seen the bug in initialization. But since 

 necessarily, zero)?

 
 Damn, you're right too ;) I always forget whether malloc or calloc 
 initializes to zero. According to my tests, DMD/Tango's malloc returns 
 mostly zeroes in bearophile's test anyway.

In my experience malloc returns zeroed memory around the start of an 
application (due to the OS applying simple security considerations). 
Later on, as memory is reused, there's no more zeroes - it's whatever 
you left in there.

Andrei

Nov 23 2008

D Programming

C/C++ Programming

Other

digitalmars.D - Matrix mul