
digitalmars.D - [4Walter&Andrei] D is 40 times slower. We need a new language feature!

reply 9il <ilyayaroshenko gmail.com> writes:
Hello,

When users write math code, they expect [2, 3, 4] that code like 
the following

------
import mir.ndslice; //[1]

...

foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
    }
}
------

will be vectorized like in Fortran and other math languages.

There are different kinds of engineers, and a good math engineer 
may not have an engineering education. They may be economists, 
physicists, geologists, etc. They may not be programmers, or they 
may not be D programmers. They do not want to learn special 
loop-free programming idioms. They just want their code to be 
fast with ordinary loops.

In D, the example above cannot be vectorized.

The reason is that `matrixX[i, j]` is an opIndex call, and 
opIndex is a function. It can be inlined, but optimizers cannot 
split its body and move half of the opIndex computations out of 
the inner loop, which is required for vectorization.

Optimizers would need to transform this into

------------
foreach (i; 0..m)
{
    auto __tempV1 = matrix1[i];
    auto __tempV2 = matrix2[i];
    auto __tempV3 = matrix3[i];
    foreach (j; 0..n)
    {
        // use __tempV1[j], __tempV2[j], __tempV3[j]
    }
}
------------

As was said, optimizers cannot split the opIndex body because it 
is a function (inlined or not does not matter).

Walter, Andrei, and D community, please help make D simple for 
the math world!

I do not know what language changes we should add. I only know 
how it should look to the compiler:

------
import mir.ndslice;
...

foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // matrixX[i, j] should be transformed to

        // matrixX.ptr[matrixX.stride * i + j]

        // this is a simplified version; ndslice has a more complex API
    }
}
------

Looks like Rust's macro system can do something similar.

What can I do to make it happen?

Ilya

[1] https://github.com/libmir/mir-algorithm
[2] 
https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d
[3] 
https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18
[4] https://gitter.im/libmir/public
May 19
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 20/05/2017 4:24 AM, 9il wrote:
snip
 Looks like Rust macro system can do something similar.
Just to confirm, in Rust is it calling a function to assign the value? E.g.

```D
void opIndexAssign(T v, size_t j, size_t i);
```

Because if the compiler is seeing a function call, it probably won't attempt this optimization, but if it's seeing a direct-to-memory write, that's a different story.

Also, the compiler doesn't know if ``foo[i][j]`` is equal to ``foo[i, j]``. So your proposed solution isn't valid.
May 19
parent 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 03:50:31 UTC, rikki cattermole wrote:
 On 20/05/2017 4:24 AM, 9il wrote:
 snip
 Looks like Rust macro system can do something similar.
Just to confirm, in Rust is it calling a function to assign the value?
I mean that Rust has a macro system. I do not know if it can be used for indexing.
 E.g.
 ```D
 void opIndexAssign(T v, size_t j, size_t i);
 ```

 Because if the compiler is seeing a function call, it probably 
 won't attempt this optimization, but if it's seeing a 
 direct-to-memory write, that's a different story.

 Also the compiler doesn't know if ``foo[i][j]`` is equal to 
 ``foo[i, j]``. So your proposed solution isn't valid.
The proposed solution is similar to mixins. The compiler does not need to know anything. The compiler just needs to replace `foo[i, j]` with its body, as if it were a mixin. All loop optimization would then be done by the optimizer. See Clang loop optimization for more details.
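A rough illustration of the "expand like a mixin" idea with today's string mixins: the mixin splices the raw address computation directly into the call site, so the optimizer sees plain pointer arithmetic instead of a function call. All names here are invented for the sketch.

```d
// A hypothetical, much-simplified 2D view.
struct Mat
{
    double* ptr;
    size_t stride; // elements per row
}

// A manifest-constant template that builds the indexing expression
// as a string, to be mixed in at the call site.
enum at(string m, string i, string j) =
    m ~ ".ptr[" ~ m ~ ".stride * " ~ i ~ " + " ~ j ~ "]";

void main()
{
    auto data = new double[4];
    auto mat = Mat(data.ptr, 2); // 2 rows x 2 columns
    foreach (i; 0 .. 2)
        foreach (j; 0 .. 2)
            // expands to: mat.ptr[mat.stride * i + j] = i + j;
            mixin(at!("mat", "i", "j")) = i + j;
    assert(data == [0.0, 1, 1, 2]);
}
```

The point of the sketch is only that after mixin expansion there is no function call left for the optimizer to look through; the `[i, j]` syntax, of course, is lost.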
May 19
prev sibling next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the code 
 like

 [...]
I assume you compiled with LDC and used pragma(inline, true).

Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed.

I don't think we need a new language feature for this.
May 19
parent 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 03:53:19 UTC, Nicholas Wilson wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]
I assume you compiled with LDC and used pragma(inline, true). Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed. I don't think we need a new language feature for this.
No. It is exactly because it is a single function for both the column and row indexes.
May 19
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 What can I do to make it happen?
Sounds like you're asking for opIndex currying?

https://en.wikipedia.org/wiki/Currying

Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument?

You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.
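Something like this sketch (all names invented for illustration; a simplified 2D view, not the real ndslice API):

```d
// Lightweight row view: its opIndex does only the cheap part.
struct Row
{
    double* ptr;
    ref double opIndex(size_t j) { return ptr[j]; }
}

// A hypothetical, much-simplified 2D row-major view.
struct Matrix
{
    double* ptr;
    size_t stride; // elements per row

    // `stride * i` is evaluated once per row, so `auto row = m[i];`
    // outside the inner loop hoists it by construction.
    Row opIndex(size_t i) { return Row(ptr + stride * i); }
}

void main()
{
    auto data = new double[6];
    auto m = Matrix(data.ptr, 3); // 2 rows x 3 columns
    foreach (i; 0 .. 2)
    {
        auto row = m[i];         // row base address computed once
        foreach (j; 0 .. 3)
            row[j] = i * 10 + j; // inner loop is plain pointer indexing
    }
    assert(data == [0.0, 1, 2, 10, 11, 12]);
}
```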
 As was said optimizsers can not split opIndex body because it 
 is function (inlined or not  inlined does not matter).
Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined?

How about pragma(inline), does that help?
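I.e. roughly this shape (names hypothetical; a simplified sketch, not the real ndslice code):

```d
// A hypothetical, much-simplified 2D row-major view.
struct Matrix
{
    double* ptr;
    size_t stride, rows, cols;

    // The hot address computation lives in a tiny always-inlined helper.
    pragma(inline, true)
    private ref double unsafeAt(size_t i, size_t j)
    {
        return ptr[stride * i + j];
    }

    // opIndex keeps the bounds logic that need not be inlined.
    ref double opIndex(size_t i, size_t j)
    {
        assert(i < rows && j < cols, "index out of bounds");
        return unsafeAt(i, j);
    }
}

void main()
{
    auto data = new double[4];
    auto m = Matrix(data.ptr, 2, 2, 2); // 2 rows x 2 columns
    m[1, 1] = 42;
    assert(data[3] == 42); // stride * 1 + 1 == 3
}
```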
May 19
parent 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 03:53:42 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 What can I do to make it happen?
Sounds like you're asking for opIndex currying? https://en.wikipedia.org/wiki/Currying

Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument? You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.
Yes, matrix[i][j] allows vectorization. This is already implemented. At the same time, users prefer the [i, j] syntax, so it would have to be deprecated :-/
 As was said, optimizers cannot split the opIndex body because it 
 is a function (inlined or not does not matter).
Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined?
ditto
 How about pragma(inline), does that help?
No
May 19
prev sibling next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 The reason is that `matrixX[i, j]` is an opIndex call, and 
 opIndex is a function. It can be inlined, but optimizers cannot 
 split its body and move half of the opIndex computations out of 
 the inner loop, which is required for vectorization.
Hmm, looks like the new LLVM solves this issue. Need to do more benchmarks...
May 20
prev sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the code 
 like

 [...]
What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes.

Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
May 20
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]
What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes.

Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121 [1]. I have added a workaround in [2]; it is v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41
May 20
parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]
What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes. Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121 [1]. I have added a workaround in [2]; it is v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41
What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first, then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.
May 20
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 What's surprising about it? Thinking very simplistically (I 
 don't know how it actually works), if inlining happened first, 
 then surely the later optimisation stages wouldn't have a 
 problem detecting the necessary loop invariants and hoisting 
 them out.
Inlining is usually one of the first passes scheduled, so that should not be an issue. However, loop-invariant code motion is not straightforward.
May 20
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 [...]
I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121 [1]. I have added a workaround in [2]; it is v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41
What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first, then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.
It did not work before. I did similar benchmarks a year ago.
May 20
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, May 20, 2017 at 12:00:37PM +0000, 9il via Digitalmars-d wrote:
 On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 [...]
I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121 [1]. I have added a workaround in [2]; it is v0.6.2. [1] https://github.com/ldc-developers/ldc/issues/2121 [2] https://github.com/libmir/mir-algorithm/pull/41
What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.
It did not work before. I did similar benchmarks a year ago.
I don't think this warrants a language change. It's an implementation issue; specifically, an optimizer issue. Optimizers can always be refined and improved further.

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
May 20