
digitalmars.D.learn - Improving dot product for standard multidimensional D arrays

reply p.shkadzko <p.shkadzko gmail.com> writes:
Hello again,

Thanks to previous thread on multidimensional arrays, I managed 
to play around with pure D matrix representations and even 
benchmark a little against numpy:

+-------------------------------------------------------------+------------+-----------+
|                          benchmark                          | time (sec) | vs Numpy  |
+-------------------------------------------------------------+------------+-----------+
| Sum of two [5000, 6000] int array of arrays                 | ~0.28      | x4.5      |
| Multiplication of two [5000, 6000] double array of arrays   | ~0.3       | x2.6      |
| Sum of two [5000, 6000] int struct matrices                 | ~0.039     | x0.6      |
| Multiplication of two [5000, 6000] double struct matrices   | ~0.135     | x1.2      |
| L2 norm of [5000, 6000] double struct matrix                | ~0.015     | x15       |
| Sort of [5000, 6000] double struct matrix (axis=-1)         | ~2.435     | x1.9      |
| Dot product of [500, 600]&[600, 500] double struct matrices | ~0.172     | --        |
+-------------------------------------------------------------+------------+-----------+

However, there is one benchmark I am trying to make at least a
little comparable: the dot product of two struct matrices.
Concretely, an [A x B] · [B x A] = [A x A] operation.
There is a dotProduct function in std.numeric, but it works with
1D ranges only.
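
(For 1D ranges it is as simple as the following quick sketch:)

******************************************************************
import std.numeric : dotProduct;

void main()
{
    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];
    assert(dotProduct(a, b) == 32.0); // 1*4 + 2*5 + 3*6
}
******************************************************************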

After it became clear that arrays of arrays are not very good for
representing multidimensional data, I used a struct to represent a
multidimensional array like so:

******************************************************************
import std.array : array;
import std.range : chunks;

struct Matrix(T)
{
    T[] data; // keep our data as a 1D array and reshape to 2D when needed
    int rows;
    int cols;
    // allow Matrix[] instead of Matrix.data[]
    alias data this;

    this(int rows, int cols)
    {
        this.data = new T[rows * cols];
        this.rows = rows;
        this.cols = cols;
    }

    this(int rows, int cols, T[] data)
    {
        assert(data.length == rows * cols);
        this.data = data;
        this.rows = rows;
        this.cols = cols;
    }

    T[][] to2D()
    {
        return this.data.chunks(this.cols).array;
    }

    /// Allow 2D element indexing, e.g. Matrix[row, col]
    T opIndex(in int r, in int c)
    {
        return this.data[toIdx(this, r, c)];
    }
}

pragma(inline) static int toIdx(T)(Matrix!T m, in int i, in int j)
{
    return m.cols * i + j;
}
******************************************************************
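
For clarity, a quick usage sketch of this struct (made-up values):

******************************************************************
import std.stdio : writeln;

void main()
{
    auto m = Matrix!int(2, 3, [1, 2, 3, 4, 5, 6]);
    writeln(m[1, 2]);  // element at row 1, col 2 -> 6
    writeln(m.to2D()); // [[1, 2, 3], [4, 5, 6]]
}
******************************************************************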

And here is the dot product function:

******************************************************************
Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
in
{
     assert(m1.rows == m2.cols);
}
do
{
     Matrix!T m3 = Matrix!T(m1.rows, m2.cols);

     for (int i; i < m1.rows; ++i)
     {
         for (int j; j < m2.cols; ++j)
         {
             for (int k; k < m2.rows; ++k)
             {
                 m3.data[toIdx(m3, i, j)] += m1[i, k] * m2[k, j];
             }
         }
     }
     return m3;
}
******************************************************************

However, attempting to run matrixDotProduct on two 5000x6000 struct
Matrices took ~20 min, while 500x600 took only 0.172 sec. And I
wondered if there is something really wrong with the
matrixDotProduct function.

I can see that accessing the appropriate array member in
Matrix.data is costly due to the toIdx operation, but I can hardly
explain why it gets so much more costly at scale. Maybe there is a
better way to do it after all?
Mar 01 2020
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 [snip]
What compiler did you use and what flags?
Mar 02 2020
parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 11:33:25 UTC, jmh530 wrote:
 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 [snip]
What compiler did you use and what flags?
Ah yes, sorry. I used the latest ldc2 (1.20.0-x64) for Windows, with
dflags -mcpu=native and the "inline", "optimize", "releaseMode" build
options. Here is the dub.json of the project:

{
    "name": "app",
    "targetType": "executable",
    "dependencies": {
        "mir": "~>3.2.0"
    },
    "dflags-ldc": ["-mcpu=native"],
    "buildTypes": {
        "release": {
            "buildOptions": ["releaseMode", "inline", "optimize"],
            "dflags": ["-boundscheck=off"]
        },
        "tests": {
            "buildOptions": ["unittests"]
        }
    }
}
Mar 02 2020
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 13:35:15 UTC, p.shkadzko wrote:
 [snip]
Thanks. I don't have time right now to review this thoroughly. My
recollection is that the dot product of two matrices is actually
matrix multiplication, correct?

It generally makes sense to defer to other people's implementation of
this. I recommend trying lubeck's version against numpy. It uses a
blas/lapack implementation. mir-glas, I believe, also has a version.

Also, I'm not sure if the fastmath attribute would do anything here,
but it is something worth looking into.
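
For example, something along these lines might work (an untested
sketch; assumes the lubeck and mir-algorithm dub packages plus a
system BLAS/LAPACK are available):

******************************************************************
import lubeck : mtimes;
import mir.ndslice : slice;
import std.stdio : writeln;

void main()
{
    auto a = slice!double([500, 600], 1.0); // 500 x 600, filled with 1.0
    auto b = slice!double([600, 500], 2.0); // 600 x 500, filled with 2.0
    auto c = a.mtimes(b);                   // 500 x 500, computed by BLAS
    writeln(c.shape);
}
******************************************************************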
Mar 02 2020
parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 15:00:56 UTC, jmh530 wrote:
 On Monday, 2 March 2020 at 13:35:15 UTC, p.shkadzko wrote:
 [snip]
Yes, it is a sum of multiplications between elements of two matrices,
or a scalar product in the case of vectors. This is not the simple
element-wise multiplication that I did in the earlier benchmarks.

I tested fastmath and optmath for the toIdx function and that didn't
change anything.
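
(For reference, a minimal sketch of how the attribute can be attached
when compiling with ldc2 -- not necessarily the exact code I used:)

******************************************************************
import ldc.attributes : fastmath;

@fastmath pragma(inline, true)
static int toIdx(T)(Matrix!T m, in int i, in int j)
{
    return m.cols * i + j;
}
******************************************************************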
Mar 02 2020
parent jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 18:17:05 UTC, p.shkadzko wrote:
 [snip]
 I tested  fastmath and  optmath for toIdx function and that 
 didn't change anyting.
optmath is from mir, correct? I believe it implies fastmath. The latest code in mir doesn't have it doing anything else at least.
Mar 02 2020
prev sibling next sibling parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed 
 to play around with pure D matrix representations and even 
 benchmark a little against numpy:

 [...]
Interesting growth of processing time. Could it be GC?

+------------------+-------------+
| matrixDotProduct | time (sec.) |
+------------------+-------------+
| 2x[100 x 100]    |        0.01 |
| 2x[1000 x 1000]  |        2.21 |
| 2x[1500 x 1000]  |         5.6 |
| 2x[1500 x 1500]  |        9.28 |
| 2x[2000 x 2000]  |       44.59 |
| 2x[2100 x 2100]  |       55.13 |
+------------------+-------------+
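
One way I could check is to look at GC heap growth around the call (a
sketch; assumes core.memory.GC.stats is available in the druntime in
use):

******************************************************************
import core.memory : GC;
import std.stdio : writeln;

void main()
{
    const before = GC.stats();
    // ... run matrixDotProduct here ...
    const after = GC.stats();
    writeln("GC heap growth (bytes): ", after.usedSize - before.usedSize);
}
******************************************************************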
Mar 02 2020
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 March 2020 at 20:22:55 UTC, p.shkadzko wrote:
 [snip]

 Interesting growth of processing time. Could it be GC?

 +------------------+-------------+
 | matrixDotProduct | time (sec.) |
 +------------------+-------------+
 | 2x[100 x 100]    |        0.01 |
 | 2x[1000 x 1000]  |        2.21 |
 | 2x[1500 x 1000]  |         5.6 |
 | 2x[1500 x 1500]  |        9.28 |
 | 2x[2000 x 2000]  |       44.59 |
 | 2x[2100 x 2100]  |       55.13 |
 +------------------+-------------+
Your matrixDotProduct creates a new Matrix and then returns it. When
you look at the Matrix struct, it is basically building upon D's
GC-backed slices. So yes, you are using the GC here.

You could try creating the output matrices outside of the
matrixDotProduct function and then pass them by pointer or reference
into the function if you want to profile just the calculation.
Mar 02 2020
parent p.shkadzko <p.shkadzko gmail.com> writes:
On Monday, 2 March 2020 at 20:56:50 UTC, jmh530 wrote:
 On Monday, 2 March 2020 at 20:22:55 UTC, p.shkadzko wrote:
 [snip]

 Interesting growth of processing time. Could it be GC?

 +------------------+-------------+
 | matrixDotProduct | time (sec.) |
 +------------------+-------------+
 | 2x[100 x 100]    |        0.01 |
 | 2x[1000 x 1000]  |        2.21 |
 | 2x[1500 x 1000]  |         5.6 |
 | 2x[1500 x 1500]  |        9.28 |
 | 2x[2000 x 2000]  |       44.59 |
 | 2x[2100 x 2100]  |       55.13 |
 +------------------+-------------+
I tried using ref (pointer to struct) but it only made things slower
by 0.5 s. I am not passing the result matrix to "toIdx" anymore; this
is not necessary, we just need the column count. That didn't change
anything though. Here is how the code looks now.

*************************************************************************
pragma(inline) static int toIdx(int matrixCols, in int i, in int j)
{
    return matrixCols * i + j;
}

Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2, ref Matrix!T initM)
in
{
    assert(m1.rows == m2.cols);
}
do
{
    /// This implementation requires opIndex in Matrix struct.
    for (int i; i < m1.rows; ++i)
    {
        for (int j; j < m2.cols; ++j)
        {
            for (int k; k < m2.rows; ++k)
            {
                initM.data[toIdx(initM.cols, i, j)] += m1[i, k] * m2[k, j];
            }
        }
    }
    return initM;
}

void main()
{
    Matrix!double initMatrix = Matrix!double(m5.rows, m6.cols);
    auto e = matrixDotProduct!double(m5, m6, initMatrix).to2D;
}
*************************************************************************

I tried disabling the GC via GC.disable / GC.enable before and after
the three loops in matrixDotProduct to see what happens. But nothing
changed :(
Mar 02 2020
prev sibling next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed 
 to play around with pure D matrix representations and even 
 benchmark a little against numpy:

 [...]
Matrix multiplication is about cache-friendly blocking.
https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf

The `mir-blas` package can be used for matrix operations for ndslice.
`cblas` - if you want to work with your own matrix type.
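
To illustrate the idea, a minimal untested sketch of a blocked variant
of the matrixDotProduct above (reusing the Matrix!T type from the
original post; blockSize is a made-up tuning knob):

******************************************************************
import std.algorithm : min;

// Operate on blockSize x blockSize tiles so the working set of the
// inner loops stays in cache.
enum blockSize = 64; // tuning knob, depends on cache size

Matrix!T blockedDotProduct(T)(Matrix!T m1, Matrix!T m2)
{
    auto m3 = Matrix!T(m1.rows, m2.cols);
    m3.data[] = 0; // for floating-point T, new T[] starts as NaN

    for (int ii = 0; ii < m1.rows; ii += blockSize)
        for (int kk = 0; kk < m1.cols; kk += blockSize)
            for (int jj = 0; jj < m2.cols; jj += blockSize)
                for (int i = ii; i < min(ii + blockSize, m1.rows); ++i)
                    for (int k = kk; k < min(kk + blockSize, m1.cols); ++k)
                    {
                        const a = m1.data[i * m1.cols + k];
                        for (int j = jj; j < min(jj + blockSize, m2.cols); ++j)
                            m3.data[i * m2.cols + j] += a * m2.data[k * m2.cols + j];
                    }
    return m3;
}
******************************************************************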
Mar 02 2020
parent reply maarten van damme <maartenvd1994 gmail.com> writes:
It is difficult to write an efficient matrix-matrix multiplication in
any language. If you want a fair comparison, implement your naive
method in Python and compare those timings.

On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 Hello again,

 Thanks to previous thread on multidimensional arrays, I managed
 to play around with pure D matrix representations and even
 benchmark a little against numpy:

 [...]
 Matrix multiplication is about cache-friendly blocking.
 https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
 `mir-blas` package can be used for matrix operations for ndslice.
 `cblas` - if you want to work with your own matrix type.
Mar 03 2020
next sibling parent p.shkadzko <p.shkadzko gmail.com> writes:
On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
 It is difficult to write an efficient matrix-matrix
 multiplication in any language. If you want a fair comparison,
 implement your naive method in Python and compare those timings.

 On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 [...]
 Matrix multiplication is about cache-friendly blocking.
 https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
 `mir-blas` package can be used for matrix operations for ndslice.
 `cblas` - if you want to work with your own matrix type.
Yeah, got it. After some reading, I understand that's not trivial once bigger matrices are involved.
Mar 03 2020
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
 It is difficult to write an efficient matrix-matrix
 multiplication in any language. If you want a fair comparison,
 implement your naive method in Python and compare those timings.
 [snip]
And of course there's going to be a big slowdown in using native
Python. Numpy basically calls BLAS in the background. A naive C
implementation might be another comparison.
Mar 03 2020
prev sibling next sibling parent kdevel <kdevel vogtner.de> writes:
On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
 pragma(inline) static int toIdx(T)(Matrix!T m, in int i, in int 
 j)
 {
     return m.cols * i + j;
 }
This is row-major order [1].

BTW: Why don't you make toIdx a member of Matrix? It saves one
parameter. You may also define opIndex as

    ref T opIndex(in int r, in int c)

Then the innermost summation becomes more readable:

    m3[i, j] += m1[i, k] * m2[k, j];

How about performing an in-place transposition of m2 before performing
the dot product? Then you can rewrite the innermost loop:

    m3[i, j] += m1[i, k] * m2[j, k]; // note: j and k swapped

This should avoid the costly jumping through memory. A good starting
point for a performance analysis would be looking over the assembler
code of the innermost loop.

[1] https://en.wikipedia.org/wiki/Row_major
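
In code, the suggestion might look like this (an untested sketch;
out-of-place transpose for brevity, using the Matrix!T type from the
original post):

******************************************************************
Matrix!T transposed(T)(Matrix!T m)
{
    auto t = Matrix!T(m.cols, m.rows);
    foreach (i; 0 .. m.rows)
        foreach (j; 0 .. m.cols)
            t.data[j * m.rows + i] = m.data[i * m.cols + j];
    return t;
}

Matrix!T dotTransposed(T)(Matrix!T m1, Matrix!T m2)
{
    auto m2t = transposed(m2);    // m2t.cols == m2.rows == m1.cols
    auto m3 = Matrix!T(m1.rows, m2.cols);
    foreach (i; 0 .. m1.rows)
        foreach (j; 0 .. m2.cols)
        {
            T sum = 0;
            foreach (k; 0 .. m1.cols) // both operands walked row-wise
                sum += m1.data[i * m1.cols + k] * m2t.data[j * m2t.cols + k];
            m3.data[i * m3.cols + j] = sum;
        }
    return m3;
}
******************************************************************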
Mar 03 2020
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01.03.20 21:58, p.shkadzko wrote:
 
 ******************************************************************
 Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
 in
 {
      assert(m1.rows == m2.cols);
This asserts that the result is a square matrix. I think you want `m1.cols==m2.rows` instead.
 }
 do
 {
      Matrix!T m3 = Matrix!T(m1.rows, m2.cols);
 
      for (int i; i < m1.rows; ++i)
      {
          for (int j; j < m2.cols; ++j)
          {
              for (int k; k < m2.rows; ++k)
              {
                  m3.data[toIdx(m3, i, j)] += m1[i, k] * m2[k, j];
              }
          }
      }
      return m3;
 }
 ******************************************************************
 ...
 I can see that accessing the appropriate array member in Matrix.data is
 costly due to the toIdx operation, but I can hardly explain why it gets
 so much more costly at scale. Maybe there is a better way to do it after all?
Changing the order of the second and third loop probably goes a pretty
long way in terms of cache efficiency:

Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
in
{
    assert(m1.cols == m2.rows);
}
do
{
    int m = m1.rows, n = m1.cols, p = m2.cols;
    Matrix!T m3 = Matrix!T(m, p);
    foreach (i; 0 .. m)
        foreach (j; 0 .. n)
            foreach (k; 0 .. p)
                m3.data[i*p + k] += m1.data[i*n + j] * m2.data[j*p + k];
    return m3;
}

(untested.)
Mar 04 2020