
digitalmars.D.learn - Improving dot product for standard multidimensional D arrays

reply p.shkadzko <p.shkadzko gmail.com> writes:
Hello again,

Thanks to previous thread on multidimensional arrays, I managed 
to play around with pure D matrix representations and even 
benchmark a little against numpy:

+-------------------------------------------------------------+------------+-----------+
|                          benchmark                          | time (sec) | vs Numpy  |
+-------------------------------------------------------------+------------+-----------+
| Sum of two [5000, 6000] int array of arrays                 | ~0.28      | x4.5      |
| Multiplication of two [5000, 6000] double array of arrays   | ~0.3       | x2.6      |
| Sum of two [5000, 6000] int struct matrices                 | ~0.039     | x0.6      |
| Multiplication of two [5000, 6000] double struct matrices   | ~0.135     | x1.2      |
| L2 norm of [5000, 6000] double struct matrix                | ~0.015     | x15       |
| Sort of [5000, 6000] double struct matrix (axis=-1)         | ~2.435     | x1.9      |
| Dot product of [500, 600]&[600, 500] double struct matrices | ~0.172     | --        |
+-------------------------------------------------------------+------------+-----------+

However, there is one benchmark I am trying to make at least a
little comparable: the dot product of two struct matrices.
Concretely, an [A x B] · [B x A] = [A x A] operation.
There is a dotProduct function in std.numeric, but it works with
1D ranges only.
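
(For 1D ranges it is as simple as the following quick sketch:)

******************************************************************
import std.numeric : dotProduct;

void main()
{
    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];
    assert(dotProduct(a, b) == 32.0); // 1*4 + 2*5 + 3*6
}
******************************************************************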

After it became clear that arrays of arrays are not very good for
representing multidimensional data, I used a struct to represent a
multidimensional array like so:

******************************************************************
import std.array : array;
import std.range : chunks;

struct Matrix(T)
{
    T[] data; // keep our data as a 1D array and reshape to 2D when needed
    int rows;
    int cols;
    // allow Matrix[] instead of Matrix.data[]
    alias data this;

    this(int rows, int cols)
    {
        this.data = new T[rows * cols];
        this.rows = rows;
        this.cols = cols;
    }

    this(int rows, int cols, T[] data)
    {
        assert(data.length == rows * cols);
        this.data = data;
        this.rows = rows;
        this.cols = cols;
    }

    T[][] to2D()
    {
        return this.data.chunks(this.cols).array;
    }

    /// Allow 2D element indexing, e.g. Matrix[row, col]
    T opIndex(in int r, in int c)
    {
        return this.data[toIdx(this, r, c)];
    }
}

pragma(inline) static int toIdx(T)(Matrix!T m, in int i, in int j)
{
    return m.cols * i + j;
}
******************************************************************
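
For clarity, a quick usage sketch of this struct (made-up values):

******************************************************************
import std.stdio : writeln;

void main()
{
    auto m = Matrix!int(2, 3, [1, 2, 3, 4, 5, 6]);
    writeln(m[1, 2]);  // element at row 1, col 2 -> 6
    writeln(m.to2D()); // [[1, 2, 3], [4, 5, 6]]
}
******************************************************************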

And here is the dot product function:

******************************************************************
Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
in
{
     assert(m1.rows == m2.cols);
}
do
{
     Matrix!T m3 = Matrix!T(m1.rows, m2.cols);

     for (int i; i < m1.rows; ++i)
     {
         for (int j; j < m2.cols; ++j)
         {
             for (int k; k < m2.rows; ++k)
             {
                 m3.data[toIdx(m3, i, j)] += m1[i, k] * m2[k, j];
             }
         }
     }
     return m3;
}
******************************************************************

However, attempting to run matrixDotProduct on two 5000x6000 struct
Matrices took ~20 min, while 500x600 took only 0.172 sec. And I
wondered if there is something really wrong with the
matrixDotProduct function.

I can see that accessing the appropriate array member in
Matrix.data is costly due to the toIdx operation, but I can hardly
explain why it gets so much more costly at scale. Maybe there is a
better way to do it after all?
Mar 01 2020
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 [snip]
What compiler did you use and what flags?
Mar 02 2020
parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 11:33:25 UTC, jmh530 wrote:
 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 [snip]
What compiler did you use and what flags?
Ah yes, sorry. I used the latest ldc2 (1.20.0-x64) for Windows, with
dflags -mcpu=native and the "inline", "optimize", "releaseMode" build
options. Here is the dub.json of the project:

{
    "name": "app",
    "targetType": "executable",
    "dependencies": {
        "mir": "~>3.2.0"
    },
    "dflags-ldc": ["-mcpu=native"],
    "buildTypes": {
        "release": {
            "buildOptions": ["releaseMode", "inline", "optimize"],
            "dflags": ["-boundscheck=off"]
        },
        "tests": {
            "buildOptions": ["unittests"]
        }
    }
}
Mar 02 2020
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 13:35:15 UTC, p.shkadzko wrote:
 [snip]
Thanks. I don't have time right now to review this thoroughly. My
recollection is that the dot product of two matrices is actually
matrix multiplication, correct?

It generally makes sense to defer to other people's implementation of
this. I recommend trying lubeck's version against numpy. It uses a
blas/lapack implementation. mir-glas, I believe, also has a version.

Also, I'm not sure if the fastmath attribute would do anything here,
but it is something worth looking into.
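
For example, something along these lines might work (an untested
sketch; assumes the lubeck and mir-algorithm dub packages plus a
system BLAS/LAPACK are available):

******************************************************************
import lubeck : mtimes;
import mir.ndslice : slice;
import std.stdio : writeln;

void main()
{
    auto a = slice!double([500, 600], 1.0); // 500 x 600, filled with 1.0
    auto b = slice!double([600, 500], 2.0); // 600 x 500, filled with 2.0
    auto c = a.mtimes(b);                   // 500 x 500, computed by BLAS
    writeln(c.shape);
}
******************************************************************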
Mar 02 2020
parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 15:00:56 UTC, jmh530 wrote:
 On Monday, 2 March 2020 at 13:35:15 UTC, p.shkadzko wrote:
 [snip]
Yes, it is a sum of multiplications between elements of two matrices,
or a scalar product in the case of vectors. This is not the simple
element-wise multiplication that I did in the earlier benchmarks.

I tested fastmath and optmath for the toIdx function and that didn't
change anything.
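
(For reference, a minimal sketch of how the attribute can be attached
when compiling with ldc2 -- not necessarily the exact code I used:)

******************************************************************
import ldc.attributes : fastmath;

@fastmath pragma(inline, true)
static int toIdx(T)(Matrix!T m, in int i, in int j)
{
    return m.cols * i + j;
}
******************************************************************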
Mar 02 2020
parent jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 18:17:05 UTC, p.shkadzko wrote:
 [snip]
 I tested  fastmath and  optmath for toIdx function and that 
 didn't change anyting.
optmath is from mir, correct? I believe it implies fastmath. The latest code in mir doesn't have it doing anything else at least.
Mar 02 2020
prev sibling next sibling parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed 
 to play around with pure D matrix representations and even 
 benchmark a little against numpy:

 [...]
Interesting growth of processing time. Could it be GC?

+------------------+-------------+
| matrixDotProduct | time (sec.) |
+------------------+-------------+
| 2x[100 x 100]    |        0.01 |
| 2x[1000 x 1000]  |        2.21 |
| 2x[1500 x 1000]  |         5.6 |
| 2x[1500 x 1500]  |        9.28 |
| 2x[2000 x 2000]  |       44.59 |
| 2x[2100 x 2100]  |       55.13 |
+------------------+-------------+
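
One way I could check is to look at GC heap growth around the call (a
sketch; assumes core.memory.GC.stats is available in the druntime in
use):

******************************************************************
import core.memory : GC;
import std.stdio : writeln;

void main()
{
    const before = GC.stats();
    // ... run matrixDotProduct here ...
    const after = GC.stats();
    writeln("GC heap growth (bytes): ", after.usedSize - before.usedSize);
}
******************************************************************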
Mar 02 2020
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 20:22:55 UTC, p.shkadzko wrote:
 [snip]

 Interesting growth of processing time. Could it be GC?

 +------------------+-------------+
 | matrixDotProduct | time (sec.) |
 +------------------+-------------+
 | 2x[100 x 100]    |        0.01 |
 | 2x[1000 x 1000]  |        2.21 |
 | 2x[1500 x 1000]  |         5.6 |
 | 2x[1500 x 1500]  |        9.28 |
 | 2x[2000 x 2000]  |       44.59 |
 | 2x[2100 x 2100]  |       55.13 |
 +------------------+-------------+
Your matrixDotProduct creates a new Matrix and then returns it. When
you look at the Matrix struct, it is basically building upon D's
GC-backed slices. So yes, you are using the GC here.

You could try creating the output matrices outside of the
matrixDotProduct function and then pass them by pointer or reference
into the function if you want to profile just the calculation.
Mar 02 2020
parent p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 20:56:50 UTC, jmh530 wrote:
 On Monday, 2 March 2020 at 20:22:55 UTC, p.shkadzko wrote:
 [snip]

 Interesting growth of processing time. Could it be GC?

 +------------------+-------------+
 | matrixDotProduct | time (sec.) |
 +------------------+-------------+
 | 2x[100 x 100]    |        0.01 |
 | 2x[1000 x 1000]  |        2.21 |
 | 2x[1500 x 1000]  |         5.6 |
 | 2x[1500 x 1500]  |        9.28 |
 | 2x[2000 x 2000]  |       44.59 |
 | 2x[2100 x 2100]  |       55.13 |
 +------------------+-------------+
I tried using ref (pointer to struct) but it only made things slower
by 0.5 s. I am not passing the result matrix to "toIdx" anymore; this
is not necessary, we just need the column count. That didn't change
anything though. Here is how the code looks now.

*************************************************************************
pragma(inline) static int toIdx(int matrixCols, in int i, in int j)
{
    return matrixCols * i + j;
}

Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2, ref Matrix!T initM)
in
{
    assert(m1.rows == m2.cols);
}
do
{
    /// This implementation requires opIndex in Matrix struct.
    for (int i; i < m1.rows; ++i)
    {
        for (int j; j < m2.cols; ++j)
        {
            for (int k; k < m2.rows; ++k)
            {
                initM.data[toIdx(initM.cols, i, j)] += m1[i, k] * m2[k, j];
            }
        }
    }
    return initM;
}

void main()
{
    Matrix!double initMatrix = Matrix!double(m5.rows, m6.cols);
    auto e = matrixDotProduct!double(m5, m6, initMatrix).to2D;
}
*************************************************************************

I tried disabling the GC via GC.disable / GC.enable before and after
the three loops in matrixDotProduct to see what happens. But nothing
changed :(
Mar 02 2020
prev sibling next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed 
 to play around with pure D matrix representations and even 
 benchmark a little against numpy:

 [...]
Matrix multiplication is about cache-friendly blocking.
https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf

The `mir-blas` package can be used for matrix operations for ndslice.
`cblas` - if you want to work with your own matrix type.
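
To illustrate the idea, a minimal untested sketch of a blocked variant
of the matrixDotProduct above (reusing the Matrix!T type from the
original post; blockSize is a made-up tuning knob):

******************************************************************
import std.algorithm : min;

// Operate on blockSize x blockSize tiles so the working set of the
// inner loops stays in cache.
enum blockSize = 64; // tuning knob, depends on cache size

Matrix!T blockedDotProduct(T)(Matrix!T m1, Matrix!T m2)
{
    auto m3 = Matrix!T(m1.rows, m2.cols);
    m3.data[] = 0; // for floating-point T, new T[] starts as NaN

    for (int ii = 0; ii < m1.rows; ii += blockSize)
        for (int kk = 0; kk < m1.cols; kk += blockSize)
            for (int jj = 0; jj < m2.cols; jj += blockSize)
                for (int i = ii; i < min(ii + blockSize, m1.rows); ++i)
                    for (int k = kk; k < min(kk + blockSize, m1.cols); ++k)
                    {
                        const a = m1.data[i * m1.cols + k];
                        for (int j = jj; j < min(jj + blockSize, m2.cols); ++j)
                            m3.data[i * m2.cols + j] += a * m2.data[k * m2.cols + j];
                    }
    return m3;
}
******************************************************************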
Mar 02 2020
parent reply maarten van damme <maartenvd1994 gmail.com> writes:
It is difficult to write an efficient matrix-matrix multiplication in
any language. If you want a fair comparison, implement your naive
method in Python and compare those timings.

On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed
 to play around with pure D matrix representations and even
 benchmark a little against numpy:

 [...]
 Matrix multiplication is about cache-friendly blocking.
 https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
 `mir-blas` package can be used for matrix operations for ndslice.
 `cblas` - if you want to work with your own matrix type.
Mar 03 2020
next sibling parent p.shkadzko <p.shkadzko gmail.com> writes:
On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
 It is difficult to write an efficient matrix-matrix
 multiplication in any language. If you want a fair comparison,
 implement your naive method in Python and compare those timings.

 On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 [...]
 Matrix multiplication is about cache-friendly blocking.
 https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
 `mir-blas` package can be used for matrix operations for ndslice.
 `cblas` - if you want to work with your own matrix type.
Yeah, got it. After some reading, I understand that's not trivial once bigger matrices are involved.
Mar 03 2020
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
 It is difficult to write an efficient matrix-matrix
 multiplication in any language. If you want a fair comparison,
 implement your naive method in Python and compare those timings.
 [snip]
And of course there's going to be a big slowdown in using native
Python. Numpy basically calls BLAS in the background. A naive C
implementation might be another comparison.
Mar 03 2020
prev sibling next sibling parent kdevel <kdevel vogtner.de> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 pragma(inline) static int toIdx(T)(Matrix!T m, in int i, in int 
 j)
 {
     return m.cols * i + j;
 }
This is row-major order [1].

BTW: Why don't you make toIdx a member of Matrix? It saves one
parameter. You may also define opIndex as

    ref T opIndex(in int r, in int c)

Then the innermost summation becomes more readable:

    m3[i, j] += m1[i, k] * m2[k, j];

How about performing an in-place transposition of m2 before performing
the dot product? Then you can rewrite the innermost loop:

    m3[i, j] += m1[i, k] * m2[j, k]; // note: j and k swapped

This should avoid the costly jumping through memory. A good starting
point for a performance analysis would be looking over the assembler
code of the innermost loop.

[1] https://en.wikipedia.org/wiki/Row_major
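
In code, the suggestion might look like this (an untested sketch;
out-of-place transpose for brevity, using the Matrix!T type from the
original post):

******************************************************************
Matrix!T transposed(T)(Matrix!T m)
{
    auto t = Matrix!T(m.cols, m.rows);
    foreach (i; 0 .. m.rows)
        foreach (j; 0 .. m.cols)
            t.data[j * m.rows + i] = m.data[i * m.cols + j];
    return t;
}

Matrix!T dotTransposed(T)(Matrix!T m1, Matrix!T m2)
{
    auto m2t = transposed(m2);    // m2t.cols == m2.rows == m1.cols
    auto m3 = Matrix!T(m1.rows, m2.cols);
    foreach (i; 0 .. m1.rows)
        foreach (j; 0 .. m2.cols)
        {
            T sum = 0;
            foreach (k; 0 .. m1.cols) // both operands walked row-wise
                sum += m1.data[i * m1.cols + k] * m2t.data[j * m2t.cols + k];
            m3.data[i * m3.cols + j] = sum;
        }
    return m3;
}
******************************************************************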
Mar 03 2020
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01.03.20 21:58, p.shkadzko wrote:
 
 ******************************************************************
 Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
 in
 {
      assert(m1.rows == m2.cols);
This asserts that the result is a square matrix. I think you want `m1.cols==m2.rows` instead.
 }
 do
 {
      Matrix!T m3 = Matrix!T(m1.rows, m2.cols);
 
      for (int i; i < m1.rows; ++i)
      {
          for (int j; j < m2.cols; ++j)
          {
              for (int k; k < m2.rows; ++k)
              {
                  m3.data[toIdx(m3, i, j)] += m1[i, k] * m2[k, j];
              }
          }
      }
      return m3;
 }
 ******************************************************************
 ...
 I can see that accessing the appropriate array member in Matrix.data is
 costly due to the toIdx operation, but I can hardly explain why it gets
 so much more costly at scale. Maybe there is a better way to do it after all?
Changing the order of the second and third loop probably goes a pretty
long way in terms of cache efficiency:

Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
in
{
    assert(m1.cols == m2.rows);
}
do
{
    int m = m1.rows, n = m1.cols, p = m2.cols;
    Matrix!T m3 = Matrix!T(m, p);
    foreach (i; 0 .. m)
        foreach (j; 0 .. n)
            foreach (k; 0 .. p)
                m3.data[i*p + k] += m1.data[i*n + j] * m2.data[j*p + k];
    return m3;
}

(untested.)
Mar 04 2020