digitalmars.D - How to tune numerical D? (matrix multiplication is faster in g++ vs

J (186/186) Mar 03 2013 Dear D pros,

John Colvin (21/209) Mar 03 2013 First things first:

bearophile (4/9) Mar 03 2013 Nope, what matters is the total program runtime.

John Colvin (4/13) Mar 04 2013 The performance of the multiplication loops and the performance

bearophile (10/14) Mar 04 2013 If you want to improve the D compiler, druntime, etc, then I

John Colvin (17/29) Mar 04 2013 I disagree. Information about which parts of the code are running

Walter Bright (4/22) Mar 04 2013 I agree with John. Correctly interpreting benchmark results will enable ...

bearophile (4/4) Mar 03 2013 Your benchmark code updated to D2:

bearophile (7/9) Mar 03 2013 Sorry, this line:

bearophile (4/4) Mar 03 2013 So this should be better:

bearophile (5/5) Mar 03 2013 Generally for such matrix benchmarks if you chose the compilation
J (203/207) Mar 03 2013 @bearophile: Thank you! Unfortunately the

Manu (12/39) Mar 03 2013 Using dynamic arrays of dynamic arrays that way is pretty poor form

J (3/19) Mar 04 2013 That's a really good point. I wonder if there is a canonical

J (90/92) Mar 04 2013 I'm not sure if they are the recommended/best practice for matrix

Lars T. Kyllingstad (26/44) Mar 05 2013 I'm the author of SciD. It's great that you found it useful! :)

bearophile (11/28) Mar 04 2013 A bit better version:

John Colvin (3/10) Mar 04 2013 As far as I know it's just a way of automating the allocation, it
John Colvin (2/6) Mar 04 2013 http://dpaste.dzfl.pl/ is back online btw
jerro (18/22) Mar 04 2013 You can make it much faster even without really changing the

jerro (13/37) Mar 04 2013 forgot to set m3's elements to zero before adding to them:

J (38/41) Mar 04 2013 Thanks Jerro. You made me realize that help from the experts

Timon Gehr (10/30) Mar 05 2013 This is a little faster for 2000x2000 matrices and typical cache size:

Walter Bright (2/3) Mar 04 2013 Using obj2asm will get you much nicer looking assembler.

Rob T (14/14) Mar 03 2013 I suggest that you move this line
Andrei Alexandrescu (6/13) Mar 03 2013 You're measuring the speed of a couple of tight loops. The smallest

J (22/26) Mar 03 2013 Thanks Andrei!

Walter Bright (6/35) Mar 04 2013 One difference that jumps out at me is you have extra variables and ref ...

"J" <private private-dont-email-dont-spam.com> writes:

Dear D pros,

As a fan of D, I was hoping to be able to get similar results as 
this fellow on stack overflow, by noting his tuning steps;
http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

Sadly however, when I pull out a simple matrix multiplication 
benchmark from the old language shootout (back when it had D), it 
is disturbingly slower in D when pit against C++.

Details? I ran with very recent gdc (gcc 4.7.2, gdc on the 4.7.2 

b8f5c22b0e7afa7e68a287ed788597e783540063), and the exact same gcc 
c++ compiler.

How would I tune this to be more competitive?  I'm comparing gdc 
vs g++ both built using the exact same gcc-4.7.2 back end, so it 
has to be something in the front end.  I've disabled GC after the 
matrices are made in D, so that doesn't explain it.

What is going on?  I'm hoping I'm making a silly, naive, obvious 
beginner mistake, but could that be?  I'm not sure how to apply 
the 'in' argument advice given on stackoverflow; if that is the 
answer, could someone summarise the best practice for 'in' use?

Thank you!

- J

$ g++ --version #shows: g++ (GCC) 4.7.2
$ uname -a

UTC 2010 x86_64 GNU/Linux



$ g++  -O3 matrix.cpp -ocppmatrix
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944

real    1m31.941s
user    1m31.920s
sys 0m0.010s
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944

real    1m32.068s
user    1m32.010s
sys 0m0.050s




$ gdmd -O -inline -release -noboundscheck -m64 matrix.d -ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m10.677s
user    2m10.650s
sys 0m0.020s
$
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m12.664s
user    2m12.600s
sys 0m0.030s




exact same backend compiler.


slower--

$ gdmd -O -q,-O3 -inline -release -noboundscheck -m64 matrix.d 
-ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m17.107s
user    2m17.080s
sys 0m0.020s
jaten afarm:~/tmp$



shown; it's same source as all of these; the historical 
http://shootout.alioth.debian.org/ code from when D was in the 
shootout.)

$ time java matrix
-1015380632 859379360 -367726792 -1548829944

real    2m23.739s
user    2m23.650s
sys 0m0.130s
$


Slightly bigger matrix?

SIZE = 2500 results: 25% slower in D

$ time ./cpp.O3.matrix
-1506465222 -119774408 -1600478274 1285663906

real    3m1.340s
user    3m1.290s
sys 0m0.040s

$ time ./dmatrix
-1506465222 -119774408 -1600478274 1285663906

real    4m2.109s
user    4m2.050s
sys 0m0.050s


//////// D version

import core.memory;

import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

int main(string[] args)
{
     int i, n = args.length > 1 ? to!int(args[1]) : 1;

     int[][] m1 = mkmatrix(SIZE,SIZE);
     int[][] m2 = mkmatrix(SIZE,SIZE);
     int[][] mm = mkmatrix(SIZE,SIZE);

     GC.disable;

     for (i=0; i<n; i++) {
         mmult(m1, m2, mm);
     }

     writefln("%d %d %d %d",mm[0][0],mm[2][3],mm[3][2],mm[4][4]);

     return 0;
}

int[][] mkmatrix(int rows, int cols)
{
     int[][] m;
     int count = 1;

     m.length = rows;
     foreach(ref int[] mi; m)
     {
         mi.length = cols;
         foreach(ref int mij; mi)
         {
             mij = count++;
         }
     }

     return(m);
}

void mmult(int[][] m1, int[][] m2, int[][] m3)
{
     foreach(int i, int[] m1i; m1)
     {
         foreach(int j, ref int m3ij; m3[i])
         {
             int val;
             foreach(int k, int[] m2k; m2)
             {
                 val += m1i[k] * m2k[j];
             }
             m3ij = val;
         }
     }
}

////// C++ version

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE 2000

int **mkmatrix(int rows, int cols) {
     int i, j, count = 1;
     int **m = (int **) malloc(rows * sizeof(int *));
     for (i=0; i<rows; i++) {
     m[i] = (int *) malloc(cols * sizeof(int));
     for (j=0; j<cols; j++) {
         m[i][j] = count++;
     }
     }
     return(m);
}

void zeromatrix(int rows, int cols, int **m) {
     int i, j;
     for (i=0; i<rows; i++)
     for (j=0; j<cols; j++)
         m[i][j] = 0;
}

void freematrix(int rows, int **m) {
     while (--rows > -1) { free(m[rows]); }
     free(m);
}

int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
     int i, j, k, val;
     for (i=0; i<rows; i++) {
     for (j=0; j<cols; j++) {
         val = 0;
         for (k=0; k<cols; k++) {
         val += m1[i][k] * m2[k][j];
         }
         m3[i][j] = val;
     }
     }
     return(m3);
}

int main(int argc, char *argv[]) {
     int i, n = ((argc == 2) ? atoi(argv[1]) : 1);

     int **m1 = mkmatrix(SIZE, SIZE);
     int **m2 = mkmatrix(SIZE, SIZE);
     int **mm = mkmatrix(SIZE, SIZE);

     for (i=0; i<n; i++) {
     mm = mmult(SIZE, SIZE, m1, m2, mm);
     }
     printf("%d %d %d %d\n", mm[0][0], mm[2][3], mm[3][2], 
mm[4][4]);

     freematrix(SIZE, m1);
     freematrix(SIZE, m2);
     freematrix(SIZE, mm);
     return(0);
}

Mar 03 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 4 March 2013 at 03:48:45 UTC, J wrote:
 Dear D pros,

 As a fan of D, I was hoping to be able to get similar results 
 as this fellow on stack overflow, by noting his tuning steps;
 http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

 Sadly however, when I pull out a simple matrix multiplication 
 benchmark from the old language shootout (back when it had D), 
 it is disturbingly slower in D when pit against C++.

 Details? I ran with very recent gdc (gcc 4.7.2, gdc on the 

 b8f5c22b0e7afa7e68a287ed788597e783540063), and the exact same 
 gcc c++ compiler.

 How would I tune this to be more competitive?  I'm comparing 
 gdc vs g++ both built using the exact same gcc-4.7.2 back end, 
 so it has to be something in the front end.  I've disabled GC 
 after the matrices are made in D, so that doesn't explain it.

 What is going on?  I'm hoping I'm making a silly, naive, 
 obvious beginner mistake, but could that be?  I'm not sure how 
 to apply the 'in' argument advice given on stackoverflow; if 
 that is the answer, could someone summarise the best practice 
 for 'in' use?

 Thank you!

 - J

 $ g++ --version #shows: g++ (GCC) 4.7.2
 $ uname -a

 02:41:37 UTC 2010 x86_64 GNU/Linux



 $ g++  -O3 matrix.cpp -ocppmatrix
 $ time ./cppmatrix
 -1015380632 859379360 -367726792 -1548829944

 real    1m31.941s
 user    1m31.920s
 sys 0m0.010s
 $ time ./cppmatrix
 -1015380632 859379360 -367726792 -1548829944

 real    1m32.068s
 user    1m32.010s
 sys 0m0.050s




 $ gdmd -O -inline -release -noboundscheck -m64 matrix.d 
 -ofdmatrix
 $ time ./dmatrix
 -1015380632 859379360 -367726792 -1548829944

 real    2m10.677s
 user    2m10.650s
 sys 0m0.020s
 $
 $ time ./dmatrix
 -1015380632 859379360 -367726792 -1548829944

 real    2m12.664s
 user    2m12.600s
 sys 0m0.030s




 exact same backend compiler.


 goes slower--

 $ gdmd -O -q,-O3 -inline -release -noboundscheck -m64 matrix.d 
 -ofdmatrix
 $ time ./dmatrix
 -1015380632 859379360 -367726792 -1548829944

 real    2m17.107s
 user    2m17.080s
 sys 0m0.020s
 jaten afarm:~/tmp$



 shown; it's same source as all of these; the historical 
 http://shootout.alioth.debian.org/ code from when D was in the 
 shootout.)

 $ time java matrix
 -1015380632 859379360 -367726792 -1548829944

 real    2m23.739s
 user    2m23.650s
 sys 0m0.130s
 $


 Slightly bigger matrix?

 SIZE = 2500 results: 25% slower in D

 $ time ./cpp.O3.matrix
 -1506465222 -119774408 -1600478274 1285663906

 real    3m1.340s
 user    3m1.290s
 sys 0m0.040s

 $ time ./dmatrix
 -1506465222 -119774408 -1600478274 1285663906

 real    4m2.109s
 user    4m2.050s
 sys 0m0.050s


 //////// D version

 import core.memory;

 import std.stdio, std.string, std.array, std.conv;

 const int SIZE = 2000;

 int main(string[] args)
 {
     int i, n = args.length > 1 ? to!int(args[1]) : 1;

     int[][] m1 = mkmatrix(SIZE,SIZE);
     int[][] m2 = mkmatrix(SIZE,SIZE);
     int[][] mm = mkmatrix(SIZE,SIZE);

     GC.disable;

     for (i=0; i<n; i++) {
         mmult(m1, m2, mm);
     }

     writefln("%d %d %d %d",mm[0][0],mm[2][3],mm[3][2],mm[4][4]);

     return 0;
 }

 int[][] mkmatrix(int rows, int cols)
 {
     int[][] m;
     int count = 1;

     m.length = rows;
     foreach(ref int[] mi; m)
     {
         mi.length = cols;
         foreach(ref int mij; mi)
         {
             mij = count++;
         }
     }

     return(m);
 }

 void mmult(int[][] m1, int[][] m2, int[][] m3)
 {
     foreach(int i, int[] m1i; m1)
     {
         foreach(int j, ref int m3ij; m3[i])
         {
             int val;
             foreach(int k, int[] m2k; m2)
             {
                 val += m1i[k] * m2k[j];
             }
             m3ij = val;
         }
     }
 }

 ////// C++ version

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 #define SIZE 2000

 int **mkmatrix(int rows, int cols) {
     int i, j, count = 1;
     int **m = (int **) malloc(rows * sizeof(int *));
     for (i=0; i<rows; i++) {
     m[i] = (int *) malloc(cols * sizeof(int));
     for (j=0; j<cols; j++) {
         m[i][j] = count++;
     }
     }
     return(m);
 }

 void zeromatrix(int rows, int cols, int **m) {
     int i, j;
     for (i=0; i<rows; i++)
     for (j=0; j<cols; j++)
         m[i][j] = 0;
 }

 void freematrix(int rows, int **m) {
     while (--rows > -1) { free(m[rows]); }
     free(m);
 }

 int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
     int i, j, k, val;
     for (i=0; i<rows; i++) {
     for (j=0; j<cols; j++) {
         val = 0;
         for (k=0; k<cols; k++) {
         val += m1[i][k] * m2[k][j];
         }
         m3[i][j] = val;
     }
     }
     return(m3);
 }

 int main(int argc, char *argv[]) {
     int i, n = ((argc == 2) ? atoi(argv[1]) : 1);

     int **m1 = mkmatrix(SIZE, SIZE);
     int **m2 = mkmatrix(SIZE, SIZE);
     int **mm = mkmatrix(SIZE, SIZE);

     for (i=0; i<n; i++) {
     mm = mmult(SIZE, SIZE, m1, m2, mm);
     }
     printf("%d %d %d %d\n", mm[0][0], mm[2][3], mm[3][2], 
 mm[4][4]);

     freematrix(SIZE, m1);
     freematrix(SIZE, m2);
     freematrix(SIZE, mm);
     return(0);
 }

First things first:
You're not just timing the multiplication, you're timing the 
memory allocation as well. I suggest using 
http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
proper timings in D

Also, there is a semi-documented multi-dimensional array 
allocation syntax that is very neat, see here a simplified 
version of mkmatrix using it:

int[][] mkmatrix(size_t rows, size_t cols)
{
     int[][] m = new int[][](rows, cols);
     size_t count = 1;

     foreach(ref mi; m)
         foreach(ref mij; mi)
             mij = count++;

     return(m);
}


However, I have found myself that D is slower than C for these 
sort of intense numerical things. The assembly code should show 
why quite easily.

Mar 03 2013

"bearophile" <bearophileHUGS lycos.com> writes:

John Colvin:

 First things first:
 You're not just timing the multiplication, you're timing the 
 memory allocation as well. I suggest using 
 http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
 proper timings in D

Nope, what matters is the total program runtime.

Bye,
bearophile

Mar 03 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 4 March 2013 at 04:15:41 UTC, bearophile wrote:
 John Colvin:

 First things first:
 You're not just timing the multiplication, you're timing the 
 memory allocation as well. I suggest using 
 http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
 proper timings in D

 Nope, what matters is the total program runtime.

 Bye,
 bearophile

The performance of the multiplication loops and the performance 
of the allocation are separate issues and should be measured as 
such, especially if one wants to make meaningful optimisations.

Mar 04 2013

"bearophile" <bearophileHUGS lycos.com> writes:

John Colvin:

 The performance of the multiplication loops and the performance 
 of the allocation are separate issues and should be measured as 
 such, especially if one wants to make meaningful optimisations.

If you want to improve the D compiler, druntime, etc, then I 
agree you have to separate the variables and test them one at a 
time. But if you are comparing languages+runtimes+libraries then 
it's better to not cheat, and test the whole running (warmed) 
time.


 http://dpaste.dzfl.pl/ is back online btw

dpaste is too much javascript-heavy for a very quick pasting. 
codepad is much more light if I don't want to run D code.

Bye,
bearophile

Mar 04 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
 John Colvin:

 The performance of the multiplication loops and the 
 performance of the allocation are separate issues and should 
 be measured as such, especially if one wants to make 
 meaningful optimisations.

 If you want to improve the D compiler, druntime, etc, then I 
 agree you have to separate the variables and test them one at a 
 time. But if you are comparing languages+runtimes+libraries 
 then it's better to not cheat, and test the whole running 
 (warmed) time.

 Bye,
 bearophile

I disagree. Information about which parts of the code are running 
fast and which are running slow is critical to optimisation. If 
you don't know whether it's the D memory allocation that's slow 
or the D multiplication loops, you're trying to optimise 
blindfolded.

Even if all your doing is a comparison, it's a lot more useful to 
know *where* the slowdown is happening so that you can make a 
meaningful analysis of the results.

Enter a strange example:
I found that malloced multi-dim arrays were considerably faster 
to iterate over and assign to than D gc slices, even with the gc 
disable after allocation and bounds checks turned off.

If I hadn't bothered to do separate timings of the allocation and 
iteration, I would never have noticed this effect and instead 
written it off as purely "malloc is faster at allocating than the 
GC"

Mar 04 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 3/4/2013 9:00 AM, John Colvin wrote:
 On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
 John Colvin:

 The performance of the multiplication loops and the performance of the
 allocation are separate issues and should be measured as such, especially if
 one wants to make meaningful optimisations.

 If you want to improve the D compiler, druntime, etc, then I agree you have to
 separate the variables and test them one at a time. But if you are comparing
 languages+runtimes+libraries then it's better to not cheat, and test the whole
 running (warmed) time.

 Bye,
 bearophile

 I disagree. Information about which parts of the code are running fast and
which
 are running slow is critical to optimisation. If you don't know whether it's
the
 D memory allocation that's slow or the D multiplication loops, you're trying to
 optimise blindfolded.

I agree with John. Correctly interpreting benchmark results will enable you to 
tune your application to run much faster, even without any changes to the 
compiler or runtime library.

Mar 04 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Your benchmark code updated to D2:

http://codepad.org/WMgu6XQG

Bye,
bearophile

Mar 03 2013

"bearophile" <bearophileHUGS lycos.com> writes:

On Monday, 4 March 2013 at 04:12:18 UTC, bearophile wrote:
 Your benchmark code updated to D2:

 http://codepad.org/WMgu6XQG

Sorry, this line:

enum size_t SIZE = 200;

Should be:

enum size_t SIZE = 2_000;

Bye,
bearophile

Mar 03 2013

"bearophile" <bearophileHUGS lycos.com> writes:

So this should be better:

http://codepad.org/B5b4uyBM

Bye,
bearophile

Mar 03 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Generally for such matrix benchmarks if you chose the compilation 
flags really well (including link-time optimization!) I've seen 
that with LDC you get "good enough" timings.

Bye,
bearophile

Mar 03 2013

"J" <private private-dont-email-dont-spam.com> writes:

On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:
 So this should be better:

 http://codepad.org/B5b4uyBM

 Bye,
 bearophile

 bearophile: Thank you!  Unfortunately the 
http://codepad.org/B5b4uyBM code runs a bit *slower* than the 
original D code. Yikes!

$  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
$ time ./dbear
-1015380632 859379360 -367726792 -1548829944

real    2m36.971s
user    2m36.910s
sys 0m0.030s
$ time ./dbear
-1015380632 859379360 -367726792 -1548829944

real    2m34.425s
user    2m34.370s
sys 0m0.020s


 John Colvin: here is the disassembly of mmult() in both 
languages. Unfortunately I'm not literate in x86_64 assembly.  
Perhaps the problem is obvious to you?  All I can really tell is 
that the g++ version is shorter.

The memory allocation, when timed separately (comment out mmult), 
is less than 60 msec for either version, so I don't think its a 
memory issue, although it could be caching issue since the matrix 
layouts are different.



(gdb) disas /m  _D6matrix5mmultFAAiAAiAAiZv
Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:
56	void mmult(int[][] m1, int[][] m2, int[][] m3)
    0x00000000004352a0 <+0>:	push   %r15
    0x00000000004352a5 <+5>:	push   %r14
    0x00000000004352a7 <+7>:	push   %r13
    0x00000000004352a9 <+9>:	push   %r12
    0x00000000004352ab <+11>:	mov    %rdx,%r12
    0x00000000004352ae <+14>:	push   %rbp
    0x00000000004352af <+15>:	mov    %rsi,%rbp
    0x00000000004352b2 <+18>:	push   %rbx
    0x00000000004352b3 <+19>:	mov    %r9,-0x40(%rsp)
    0x00000000004352b8 <+24>:	mov    %rdi,-0x10(%rsp)
    0x00000000004352bd <+29>:	mov    %rsi,-0x8(%rsp)
    0x00000000004352c2 <+34>:	mov    %rdx,-0x20(%rsp)
    0x00000000004352c7 <+39>:	mov    %rcx,-0x18(%rsp)
    0x00000000004352cc <+44>:	mov    %r8,-0x30(%rsp)
    0x00000000004352d1 <+49>:	mov    %r9,-0x28(%rsp)
    0x00000000004352dc <+60>:	add    $0x1,%rdi
    0x00000000004352e0 <+64>:	lea    0x1(%rdx),%rdx
    0x00000000004352e4 <+68>:	mov    $0x1,%r15d
    0x00000000004352ea <+74>:	mov    %rdi,-0x38(%rsp)
    0x0000000000435315 <+117>:	add    $0x1,%r13
    0x0000000000435319 <+121>:	mov    $0x1,%r11d
    0x0000000000435330 <+144>:	mov    $0x1,%esi

57	{
58	    foreach(int i, int[] m1i; m1)
    0x00000000004352a2 <+2>:	test   %rdi,%rdi
    0x00000000004352d6 <+54>:	je     0x4353aa 
<_D6matrix5mmultFAAiAAiAAiZv+266>
    0x00000000004352ef <+79>:	xor    %esi,%esi
    0x00000000004352f1 <+81>:	mov    %rsi,%rax
    0x00000000004352f4 <+84>:	shl    $0x4,%rax
    0x00000000004352f8 <+88>:	mov    0x8(%rbp,%rax,1),%r10
    0x0000000000435398 <+248>:	cmp    -0x38(%rsp),%rax
    0x000000000043539d <+253>:	mov    %r15,%rsi
    0x00000000004353a0 <+256>:	je     0x4353aa 
<_D6matrix5mmultFAAiAAiAAiZv+266>
    0x00000000004353a2 <+258>:	mov    %rax,%r15
    0x00000000004353a5 <+261>:	jmpq   0x4352f1 
<_D6matrix5mmultFAAiAAiAAiZv+81>

59	    {
60	        foreach(int j, ref int m3ij; m3[i])
    0x00000000004352fd <+93>:	add    -0x40(%rsp),%rax
    0x0000000000435302 <+98>:	mov    (%rax),%r13
    0x0000000000435305 <+101>:	mov    0x8(%rax),%r14
    0x0000000000435309 <+105>:	test   %r13,%r13
    0x000000000043530c <+108>:	je     0x435394 
<_D6matrix5mmultFAAiAAiAAiZv+244>
    0x0000000000435312 <+114>:	xor    %r9d,%r9d
    0x000000000043531f <+127>:	shl    $0x2,%r9
    0x0000000000435326 <+134>:	lea    (%r14,%r9,1),%rbx
    0x000000000043536c <+204>:	mov    %r11,%r9
    0x000000000043536f <+207>:	cmp    %r13,%rax
    0x0000000000435372 <+210>:	je     0x435394 
<_D6matrix5mmultFAAiAAiAAiZv+244>
    0x0000000000435374 <+212>:	shl    $0x2,%r9
    0x000000000043537b <+219>:	mov    %rax,%r11
    0x000000000043537e <+222>:	lea    (%r14,%r9,1),%rbx
    0x000000000043538a <+234>:	mov    %r11,%r9
    0x000000000043538f <+239>:	cmp    %r13,%rax
    0x0000000000435392 <+242>:	jne    0x435374 
<_D6matrix5mmultFAAiAAiAAiZv+212>
    0x0000000000435394 <+244>:	lea    0x1(%r15),%rax

61	        {
62	            int val;
    0x0000000000435337 <+151>:	xor    %edi,%edi
    0x0000000000435339 <+153>:	jmp    0x435343 
<_D6matrix5mmultFAAiAAiAAiZv+163>
    0x000000000043533b <+155>:	nopl   0x0(%rax,%rax,1)
    0x0000000000435388 <+232>:	xor    %edi,%edi

63	            foreach(int k, int[] m2k; m2)
    0x0000000000435323 <+131>:	test   %r12,%r12
    0x000000000043532a <+138>:	je     0x435384 
<_D6matrix5mmultFAAiAAiAAiZv+228>
    0x000000000043532c <+140>:	nopl   0x0(%rax)
    0x0000000000435335 <+149>:	xor    %eax,%eax
    0x0000000000435340 <+160>:	mov    %r8,%rsi
    0x0000000000435343 <+163>:	mov    %rax,%r8
    0x000000000043534a <+170>:	shl    $0x4,%r8
    0x000000000043535e <+190>:	cmp    %rdx,%r8
    0x0000000000435361 <+193>:	mov    %rsi,%rax
    0x0000000000435364 <+196>:	jne    0x435340 
<_D6matrix5mmultFAAiAAiAAiZv+160>
    0x0000000000435366 <+198>:	lea    0x1(%r11),%rax
    0x0000000000435378 <+216>:	test   %r12,%r12
    0x0000000000435382 <+226>:	jne    0x435330 
<_D6matrix5mmultFAAiAAiAAiZv+144>
    0x0000000000435384 <+228>:	lea    0x1(%r11),%rax

64	            {
65	                val += m1i[k] * m2k[j];
    0x0000000000435346 <+166>:	mov    (%r10,%rax,4),%eax
    0x000000000043534e <+174>:	mov    0x8(%rcx,%r8,1),%r8
    0x0000000000435353 <+179>:	imul   (%r8,%r9,1),%eax
    0x0000000000435358 <+184>:	lea    0x1(%rsi),%r8
    0x000000000043535c <+188>:	add    %eax,%edi

66	            }
67	            m3ij = val;
    0x000000000043536a <+202>:	mov    %edi,(%rbx)
    0x000000000043538d <+237>:	mov    %edi,(%rbx)

68	        }
69	    }
70	}
    0x00000000004353aa <+266>:	pop    %rbx
    0x00000000004353ab <+267>:	pop    %rbp
    0x00000000004353ac <+268>:	pop    %r12
    0x00000000004353ae <+270>:	pop    %r13
    0x00000000004353b0 <+272>:	pop    %r14
    0x00000000004353b2 <+274>:	pop    %r15
    0x00000000004353b4 <+276>:	retq
    0x00000000004353b5:	data32 nopw %cs:0x0(%rax,%rax,1)

End of assembler dump.
(gdb)




(gdb) disas /m mmult
Dump of assembler code for function mmult(int, int, int**, int**, 
int**):
36	int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
    0x0000000000400a10 <+0>:	push   %r14
    0x0000000000400a14 <+4>:	push   %r13
    0x0000000000400a16 <+6>:	mov    %r8,%r13
    0x0000000000400a19 <+9>:	push   %r12
    0x0000000000400a1b <+11>:	push   %rbp
    0x0000000000400a1c <+12>:	push   %rbx
    0x0000000000400a1d <+13>:	mov    %edi,%ebx
    0x0000000000400a21 <+17>:	lea    -0x1(%rsi),%eax
    0x0000000000400a24 <+20>:	mov    %rdx,%r12
    0x0000000000400a27 <+23>:	xor    %ebp,%ebp
    0x0000000000400a29 <+25>:	lea    0x4(,%rax,4),%rdi
    0x0000000000400a40 <+48>:	xor    %r9d,%r9d
    0x0000000000400a43 <+51>:	xor    %r11d,%r11d
    0x0000000000400a46 <+54>:	nopw   %cs:0x0(%rax,%rax,1)

37	    int i, j, k, val;
38	    for (i=0; i<rows; i++) {
    0x0000000000400a12 <+2>:	test   %edi,%edi
    0x0000000000400a1f <+15>:	jle    0x400a7e <mmult(int, int, 
int**, int**, int**)+110>
    0x0000000000400a7a <+106>:	cmp    %ebp,%ebx
    0x0000000000400a7c <+108>:	jg     0x400a31 <mmult(int, int, 
int**, int**, int**)+33>

39		for (j=0; j<cols; j++) {
    0x0000000000400a31 <+33>:	test   %esi,%esi
    0x0000000000400a33 <+35>:	jle    0x400a76 <mmult(int, int, 
int**, int**, int**)+102>
    0x0000000000400a35 <+37>:	mov    (%r12,%rbp,8),%r8
    0x0000000000400a39 <+41>:	mov    0x0(%r13,%rbp,8),%rdx
    0x0000000000400a3e <+46>:	xor    %eax,%eax
    0x0000000000400a71 <+97>:	cmp    %rdi,%rax
    0x0000000000400a74 <+100>:	jne    0x400a40 <mmult(int, int, 
int**, int**, int**)+48>
    0x0000000000400a76 <+102>:	add    $0x1,%rbp

40		    val = 0;
41		    for (k=0; k<cols; k++) {
    0x0000000000400a64 <+84>:	cmp    %r9d,%esi
    0x0000000000400a67 <+87>:	jg     0x400a50 <mmult(int, int, 
int**, int**, int**)+64>

42			val += m1[i][k] * m2[k][j];
    0x0000000000400a50 <+64>:	mov    (%rcx,%r9,8),%r14
    0x0000000000400a54 <+68>:	mov    (%r8,%r9,4),%r10d
    0x0000000000400a58 <+72>:	add    $0x1,%r9
    0x0000000000400a5c <+76>:	imul   (%r14,%rax,1),%r10d
    0x0000000000400a61 <+81>:	add    %r10d,%r11d

43		    }
44		    m3[i][j] = val;
    0x0000000000400a69 <+89>:	mov    %r11d,(%rdx,%rax,1)
    0x0000000000400a6d <+93>:	add    $0x4,%rax

45		}
46	    }
47	    return(m3);
48	}
    0x0000000000400a7e <+110>:	pop    %rbx
    0x0000000000400a7f <+111>:	pop    %rbp
    0x0000000000400a80 <+112>:	pop    %r12
    0x0000000000400a82 <+114>:	mov    %r13,%rax
    0x0000000000400a85 <+117>:	pop    %r13
    0x0000000000400a87 <+119>:	pop    %r14
    0x0000000000400a89 <+121>:	retq

End of assembler dump.
(gdb)

Mar 03 2013

Manu <turkeyman gmail.com> writes:

On 4 March 2013 14:50, J <private private-dont-email-dont-spam.com> wrote:

 On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:

 So this should be better:

 http://codepad.org/B5b4uyBM

 Bye,
 bearophile

  bearophile: Thank you!  Unfortunately the http://codepad.org/B5b4uyBMcode
runs a bit *slower* than the original D code. Yikes!

 $  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m36.971s
 user    2m36.910s
 sys 0m0.030s
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m34.425s
 user    2m34.370s
 sys 0m0.020s


  John Colvin: here is the disassembly of mmult() in both languages.
 Unfortunately I'm not literate in x86_64 assembly.  Perhaps the problem is
 obvious to you?  All I can really tell is that the g++ version is shorter.

 The memory allocation, when timed separately (comment out mmult), is less
 than 60 msec for either version, so I don't think its a memory issue,
 although it could be caching issue since the matrix layouts are different.

Using dynamic arrays of dynamic arrays that way is pretty poor form
regardless of the language.

You should really use single dimensional array:
  int matrix[SIZE*SIZE];
And index via:
  matrix[y*SIZE+x]

(You can pretty this up in various ways, if this is too unsightly)


On 4 March 2013 14:02, John Colvin <john.loughran.colvin gmail.com> wrote:

     int[][] m = new int[][](rows, cols);

Does D support proper square array's this way? Or does it just automate
allocation of the inner arrays?
Does it allocate all the associated memory in one block?

Mar 03 2013

"J" <private private-dont-email-dont-spam.com> writes:

On Monday, 4 March 2013 at 05:07:10 UTC, Manu wrote:
 Using dynamic arrays of dynamic arrays that way is pretty poor 
 form regardless of the language.

 You should really use single dimensional array:
   int matrix[SIZE*SIZE];
 And index via:
   matrix[y*SIZE+x]

 (You can pretty this up in various ways, if this is too 
 unsightly)


 On 4 March 2013 14:02, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

     int[][] m = new int[][](rows, cols);

 Does D support proper square array's this way? Or does it just 
 automate
 allocation of the inner arrays?
 Does it allocate all the associated memory in one block?

That's a really good point. I wonder if there is a canonical 
matrix that would be preferred?

Mar 04 2013

"J" <private private-dont-email-dont-spam.com> writes:

On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
 That's a really good point. I wonder if there is a canonical 
 matrix that would be preferred?

I'm not sure if they are the recommended/best practice for matrix 
handling in D at the moment (please advise if they are not), but 
with a little searching, I found that the SciD library has nice 
matrices and MattrixViews (column-major storage, LAPACK 
compatible).

Now I like MatrixViews because they let me beat the original 
(clearly non-optimal) C matrix multiplication by a couple 
seconds, and the D code with operator overloading in place makes 
matrix multiplication look elegant.

One shootout benchmark down, eleven to go. :-)

- J

p.s. Am I right in concluding there are no multimethods (multiple 
dispatch) in D... it seemed a little awkward to have to wrap the 
MatrixView in a new struct solely in order to overload 
multiplication. Is there a better way that I've missed?

#beating the C/malloc/shootout based code that ran in 1m32sec:

$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m27.967s
user    1m27.930s
sys 0m0.030s
$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m28.259s
user    1m28.230s
sys 0m0.020s
$

module scid_matrix_test;

// compile: gdmd -O -release -inline -noboundscheck 
scid_matrix_test.d -L-L/home/you/pkg/scid/scid/ -L-lscid 
-L-L/usr/lib/x86_64-linux-gnu/ -L-lgfortran -L-lblas -L-llapack 
-L-L.

import scid.matrix;

import std.stdio, std.string, std.array, std.conv;
const int SIZE = 2000;

import std.stdio;


// Doesn't seem to have multiple dispatch / multimethods, so
// I guess we have to wrap MatrixView to implement 
opBinary!"*"(lhs,rhs)

struct Multipliable (T) {
   MatrixView!(T) _m;
   alias _m this;
   this(MatrixView!(T) src) {
     _m = src;
   }

   Multipliable!(T) opBinary(string op : "*")(ref Multipliable 
rhs) if (op == "*") {
     auto r = Multipliable!int(matrix!int(_m.rows,rhs.cols));
     return mmult(this, rhs, r);
   }

}


Multipliable!(int) mkmatrix(int rows, int cols)
{
     auto m = Multipliable!int();
     m._m = matrix!int(rows,cols);

     int count = 1;
     for(int i = 0; i < rows; ++i)
     {
         for(int j = 0; j < cols; ++j)
         {
             m[i,j] = count++;
         }
     }
     return(m);
}



Multipliable!(int) mmult(ref Multipliable!(int) m1, ref 
Multipliable!(int) m2, ref Multipliable!(int) m3) {
     int i, j, k, val;
     ulong rows = m1.rows;
     ulong cols = m1.cols;
     for (i=0; i<rows; i++) {
     for (j=0; j<cols; j++) {
         val = 0;
         for (k=0; k<cols; k++) {
         val += m1[i,k] * m2[k,j];
         }
         m3[i,j] = val;
     }
     }
     return(m3);
}


void main()
{
     auto m1 = mkmatrix(SIZE,SIZE);
     auto m2 = mkmatrix(SIZE,SIZE);
     auto mm = mkmatrix(SIZE,SIZE);

     mm = m1 * m2;

     writefln("%d %d %d %d",mm[0,0],mm[2,3],mm[3,2],mm[4,4]);

}

Mar 04 2013

"Lars T. Kyllingstad" <public kyllingen.net> writes:

On Monday, 4 March 2013 at 12:28:25 UTC, J wrote:
 On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
 That's a really good point. I wonder if there is a canonical 
 matrix that would be preferred?

 I'm not sure if they are the recommended/best practice for 
 matrix handling in D at the moment (please advise if they are 
 not), but with a little searching, I found that the SciD 
 library has nice matrices and MattrixViews (column-major 
 storage, LAPACK compatible).

 Now I like MatrixViews because they let me beat the original 
 (clearly non-optimal) C matrix multiplication by a couple 
 seconds, and the D code with operator overloading in place 
 makes matrix multiplication look elegant.

 One shootout benchmark down, eleven to go. :-)

 - J

 p.s. Am I right in concluding there are no multimethods 
 (multiple dispatch) in D... it seemed a little awkward to have 
 to wrap the MatrixView in a new struct solely in order to 
 overload multiplication. Is there a better way that I've missed?

I'm the author of SciD.  It's great that you found it useful! :)

When I wrote scid.matrix and scid.linalg, it was because I needed 
some linear algebra functionality for a project at work.  I knew 
I wouldn't have the time to write a full-featured linear algebra 
library, so I intentionally gave it a minimalistic design, and 
only included the stuff I needed.  That's also why I used the 
name "MatrixView" rather than "Matrix"; it was only supposed to 
be a convenient way to view an array as a matrix, and not a full 
matrix type.

Later, Cristi Cobzarenco forked my library for Google Summer of 
Code 2011, with David Simcha as his mentor.  They removed almost 
everything but the linear algebra modules, and redesigned those 
from scratch.  So basically, there is pretty much nothing left of 
my code in their library. ;)  I don't think they ever completed 
the project, but I believe parts of it are usable.  You'll find 
it here:

     https://github.com/cristicbz/scid

AFAIK, their goal was to use expression templates in order to 
transform D expressions like

     x*A*B + y*C

where A, B and C are matrices, and x and y are scalars, into as 
few optimised BLAS calls as possible -- e.g., a single GEMM call 
in the example above.  I don't know how far they got on this, 
though.

Lars

Mar 05 2013

"bearophile" <bearophileHUGS lycos.com> writes:

J wrote:

  bearophile: Thank you!  Unfortunately the 
 http://codepad.org/B5b4uyBM code runs a bit *slower* than the 
 original D code. Yikes!

 $  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m36.971s
 user    2m36.910s
 sys 0m0.030s
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m34.425s
 user    2m34.370s
 sys 0m0.020s

A bit better version:
http://codepad.org/jhbYxEgU

I think this code is good compared to the original (there are 
better algorithms). So maybe the introduced inefficiencies are 
small compiler problems worth finding and reporting.

------------------

Manu:

 Does D support proper square array's this way? Or does it just 
 automate allocation of the inner arrays?
 Does it allocate all the associated memory in one block?

Maybe you should take a look at druntime code.

Bye,
bearophile

Mar 04 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
 Manu:

 Does D support proper square array's this way? Or does it just 
 automate allocation of the inner arrays?
 Does it allocate all the associated memory in one block?

 Maybe you should take a look at druntime code.

 Bye,
 bearophile

As far as I know it's just a way of automating the allocation, it 
expands out to the necessary loops.

Mar 04 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU
 Bye,
 bearophile

http://dpaste.dzfl.pl/ is back online btw

Mar 04 2013

"jerro" <a a.com> writes:

 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are 
 better algorithms).

You can make it much faster even without really changing the 
algorithm. Just by reversing the order of inner two loops like 
this:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure 
nothrow {
     foreach (immutable i; 0 .. m1.length)
         foreach (immutable k; 0 .. m2[0].length)
             foreach (immutable j; 0 .. m3[0].length)
                 m3[i][j] += m1[i][k] * m2[k][j];
}

you can make the code much more cache friendly (because now you 
aren't iterating any matrix by column in the inner loop) and also 
allow the compiler to do auto vectorization. matrixMul2() takes 
2.6 seconds on my machine and matrixMul()takes 72 seconds (both 
compiled with  gdmd -O -inline -release -noboundscheck -mavx).

This isn't really relevant to the comparison with C++ in this 
thread, I just thought it may be useful for anyone writing matrix 
code.

Mar 04 2013

"jerro" <a a.com> writes:

On Monday, 4 March 2013 at 15:46:50 UTC, jerro wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are 
 better algorithms).

 You can make it much faster even without really changing the 
 algorithm. Just by reversing the order of inner two loops like 
 this:

 void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure 
 nothrow {
     foreach (immutable i; 0 .. m1.length)
         foreach (immutable k; 0 .. m2[0].length)
             foreach (immutable j; 0 .. m3[0].length)
                 m3[i][j] += m1[i][k] * m2[k][j];
 }

 you can make the code much more cache friendly (because now you 
 aren't iterating any matrix by column in the inner loop) and 
 also allow the compiler to do auto vectorization. matrixMul2() 
 takes 2.6 seconds on my machine and matrixMul()takes 72 seconds 
 (both compiled with  gdmd -O -inline -release -noboundscheck 
 -mavx).

 This isn't really relevant to the comparison with C++ in this 
 thread, I just thought it may be useful for anyone writing 
 matrix code.

forgot to set m3's elements to zero before adding to them:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure 
nothrow {
     foreach (immutable i; 0 .. m1.length)
     {
         m3[i][] = 0;

         foreach (immutable k; 0 .. m2[0].length)
             foreach (immutable j; 0 .. m3[0].length)
                 m3[i][j] += m1[i][k] * m2[k][j];
     }
}

This does not make the function noticeably slower.

Mar 04 2013

"J" <not_listed not.not.listed> writes:

On Monday, 4 March 2013 at 15:57:42 UTC, jerro wrote:
 matrixMul2() takes 2.6 seconds on my machine and 
 matrixMul()takes 72 seconds (both compiled with  gdmd -O 
 -inline -release -noboundscheck -mavx).


Thanks Jerro. You made me realize that help from the experts 
could be quite useful.  I plugged in a call to the BLAS matrix 
multiply routine, which SciD conveniently binds.

The result?  My 2000x2000 matrix multiply went from 98 seconds 
down to 1.8 seconds.  Its just hilariously faster to use 20 years 
of numerical experts optimized code than to try to write your own.


// screaming fast version - uses BLAS for 50x speedup over naive 
code.
//
Multipliable!(T) mmult2(T)(ref Multipliable!(T) m1,
                          ref Multipliable!(T) m2,
                          ref Multipliable!(T) m3) {
     m3.array[] = 0;

     assert(m1.cols == m2.rows);

     char ntran = 'N';
     double one = 1.0;
     double zero = 0.0;
     int nrow = cast(int)m1.rows;
     int ncol = cast(int)m1.cols;
     int mcol = cast(int)m2.cols;

     scid.bindings.blas.blas.dgemm_(&ntran, // transa
                                    &ntran, // transb
				   &nrow,  // m
				   &mcol,  // n
				   &ncol,  // k
				   &one,   // alpha
				   m1.array.ptr, // A
                                    &nrow,        // lda
                                    m2.array.ptr, // B
                                    &ncol,        // ldb
                                    &zero,        // beta
                                    m3.array.ptr, // C
                                    &nrow,        // ldc
                                    nrow,         // transa_len
                                    ncol);        // transb_len
     return m3;
}

Mar 04 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 03/04/2013 04:46 PM, jerro wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are better
 algorithms).

 You can make it much faster even without really changing the algorithm.
 Just by reversing the order of inner two loops like this:

 void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow {
      foreach (immutable i; 0 .. m1.length)
          foreach (immutable k; 0 .. m2[0].length)
              foreach (immutable j; 0 .. m3[0].length)
                  m3[i][j] += m1[i][k] * m2[k][j];
 }

 you can make the code much more cache friendly (because now you aren't
 iterating any matrix by column in the inner loop) and also allow the
 compiler to do auto vectorization. matrixMul2() takes 2.6 seconds on my
 machine and matrixMul()takes 72 seconds (both compiled with  gdmd -O
 -inline -release -noboundscheck -mavx).

 This isn't really relevant to the comparison with C++ in this thread, I
 just thought it may be useful for anyone writing matrix code.

This is a little faster for 2000x2000 matrices and typical cache size:

void matrixMult(in int[][] m1, in int[][] m2, int[][] m3){
     enum block=50; // optimal parameter depends on hardware
     foreach(ib;iota(0,m1.length,block))
     foreach(kb;iota(0,m2[0].length,block))
     foreach(i;iota(ib,min(m1.length,ib+block)))
     foreach(k;iota(kb,min(m2[0].length,kb+block)))
         m3[i][] += m1[i][k] * m2[k][];
}

Mar 05 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 3/3/2013 8:50 PM, J wrote:
 Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:

Using obj2asm will get you much nicer looking assembler.

Mar 04 2013

"Rob T" <alanb ucora.com> writes:

I suggest that you move this line

GC.disable;

to the first line.

I don't see how you are doing your timings so that part is a wild 
card.

Also note that when the GC is re-enabled it can add a significant 
amount of time to the tests. You are not explicitly re-enabling 
the GC, but I don't know if the GC kicks back as part of program 
termination, it probably shouldn't but we lack precise 
documentation and cannot be certain.

I would test timings immediately after the GC is disabled and 
prior to program termination to see what affects the GC may or 
may not have on the timings.

--rt

Mar 03 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/3/13 10:48 PM, J wrote:
 Dear D pros,

 As a fan of D, I was hoping to be able to get similar results as this
 fellow on stack overflow, by noting his tuning steps;
 http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

 Sadly however, when I pull out a simple matrix multiplication benchmark
 from the old language shootout (back when it had D), it is disturbingly
 slower in D when pit against C++.

You're measuring the speed of a couple of tight loops. The smallest 
differences in codegen between them will be on the radar. Use straight 
for loops or "foreach (i; 0 .. limit)" for those loops, the foreach 
forms you currently may introduce slight differences in codegen.

Andrei

Mar 03 2013

"J" <private private-dont-email-dont-spam.com> writes:

On Monday, 4 March 2013 at 04:49:20 UTC, Andrei Alexandrescu 
wrote:
 You're measuring the speed of a couple of tight loops. The 
 smallest differences in codegen between them will be on the 
 radar. Use straight for loops or "foreach (i; 0 .. limit)" for 
 those loops...

Thanks Andrei!

I validated your analysis by doing a straight port of the C code 
to D, even using the same memory layout (matching malloc calls). 
The port was trivial, which was very reassuring. As evidence, I 
had to tweak only 8 lines othe C to make compiled under gdc. The 
mmult() function in the D version remained identical to that in 
the C version.

Even more reassuring was that the performance of the resulting D 
matched the C to within 1% tolerance (about 200-500 msec seconds 
slower on the D side; presumably due to runtime init).

$time ./gdc_compiled_ported_cpp_version
-1015380632 859379360 -367726792 -1548829944

real    1m32.418s
user    1m32.370s
sys 0m0.020s
$

It's a great feeling, knowing the bare metal is available if need 
be.

Thanks guys!

-J

Mar 03 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 3/3/2013 7:48 PM, J wrote:
 void mmult(int[][] m1, int[][] m2, int[][] m3)
 {
      foreach(int i, int[] m1i; m1)
      {
          foreach(int j, ref int m3ij; m3[i])
          {
              int val;
              foreach(int k, int[] m2k; m2)
              {
                  val += m1i[k] * m2k[j];
              }
              m3ij = val;
          }
      }
 }

[...]
 ////// C++ version
 int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
      int i, j, k, val;
      for (i=0; i<rows; i++) {
      for (j=0; j<cols; j++) {
          val = 0;
          for (k=0; k<cols; k++) {
          val += m1[i][k] * m2[k][j];
          }
          m3[i][j] = val;
      }
      }
      return(m3);
 }

One difference that jumps out at me is you have extra variables and ref types
in 
the D version, and in the C++ version you have "cached" the row & column loop 
limits. (I.e. the C++ version assumes a rectangular matrix, while the D one has 
a (presumably) different length for each column.)

Mar 04 2013

D Programming

C/C++ Programming

Other

digitalmars.D - How to tune numerical D? (matrix multiplication is faster in g++ vs