digitalmars.D - [4Walter&Andrei] D is 40 times slower. We need a new language feature!


9il <ilyayaroshenko gmail.com> writes:

Hello,

When users write math code, they expect [2, 3, 4] that code 
like

------
import mir.ndslice; //[1]

...

foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
    }
}
------

will be vectorized like in Fortran and other math languages.

There are different kinds of engineers, and a good math engineer 
may not have an engineering education. They may be economists, 
physicists, geologists, etc. They may not be programmers, or they 
may not be D programmers. They do not want to learn special 
loop-free programming idioms. They just want their code to be fast 
with ordinary loops.

In D the example above cannot be vectorised.

The reason is that `matrixX[i, j]` is an opIndex call, and opIndex 
is a function. It can be inlined. But optimizers cannot split its 
body and move half of the opIndex computations out of the inner 
loop, which is required for vectorization.

Optimizers should do the following:

------------
foreach (i; 0..m)
{
    auto __tempV1 = matrix1[i];
    auto __tempV2 = matrix2[i];
    auto __tempV3 = matrix3[i];
    foreach (j; 0..n)
    {
        // use __tempV1[j], __tempV2[j], __tempV3[j]
    }
}
------------

As was said, optimizers cannot split the opIndex body because it 
is a function (inlined or not inlined does not matter).

Walter, Andrei, and the D community, please help to make D simple 
for the math world!

I do not know what language changes we should add. I only know 
how it should look to the compiler:

------
import mir.ndslice;
...

foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // matrixX[i, j] should be transformed to
        //   matrixX.ptr[matrixX.stride * i + j]
        // This is a simplified version; ndslice has a more complex API.
    }
}
------
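
A minimal sketch of the simplified form above, assuming a hypothetical `FlatMatrix` struct with `ptr` and `stride` fields (this is not the ndslice API, and `sumAll` is an illustrative name). It only shows the expression an optimizer would see once opIndex is inlined; whether the loop is then vectorized depends on the compiler and flags.

```d
// Hypothetical row-major matrix view; not the mir.ndslice API.
struct FlatMatrix
{
    double* ptr;   // first element
    size_t stride; // elements between the starts of consecutive rows

    // Ask the compiler to always inline the index computation.
    pragma(inline, true)
    ref double opIndex(size_t i, size_t j)
    {
        return ptr[stride * i + j];
    }
}

// Sums every element with ordinary nested loops.
double sumAll(FlatMatrix m, size_t rows, size_t cols)
{
    double total = 0;
    foreach (i; 0 .. rows)
        foreach (j; 0 .. cols)
            total += m[i, j]; // after inlining: m.ptr[m.stride * i + j]
    return total;
}
```

Here `stride * i` does not depend on `j`, so it is the part that would have to move out of the inner loop.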

It looks like Rust's macro system can do something similar.

What can I do to make it happen?

Ilya

[1] https://github.com/libmir/mir-algorithm
[2] https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d
[3] https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18
[4] https://gitter.im/libmir/public

May 19 2017

rikki cattermole <rikki cattermole.co.nz> writes:

On 20/05/2017 4:24 AM, 9il wrote:
snip
 Looks like Rust macro system can do something similar.

Just to confirm, in Rust is it calling a function to assign the value?

E.g.
```D
void opIndexAssign(T v, size_t j, size_t i);
```

Because if the compiler is seeing a function call, it probably won't 
attempt this optimization, but if it's seeing a direct-to-memory write, 
that's a different story.

Also the compiler doesn't know if ``foo[i][j]`` is equal to 
``foo[i, j]``. So your proposed solution isn't valid.
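
To illustrate that last point, here is a minimal sketch with a hypothetical `Odd` type (made up for this example, not taken from any library) in which the two spellings legitimately return different elements, so a compiler could not rewrite one into the other in general:

```d
// Hypothetical type where foo[i][j] and foo[i, j] disagree.
struct Odd
{
    int[] data;
    size_t cols;

    // foo[i] yields row i of a row-major layout (a slice, no copy).
    int[] opIndex(size_t i)
    {
        return data[i * cols .. (i + 1) * cols];
    }

    // foo[i, j] is free to mean something else: here, the transposed element.
    int opIndex(size_t i, size_t j)
    {
        return data[j * cols + i];
    }
}
```

With `auto foo = Odd([1, 2, 3, 4], 2);`, `foo[0][1]` reads element 1 of the first row, while `foo[0, 1]` reads `data[2]`.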

May 19 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 03:50:31 UTC, rikki cattermole wrote:
 On 20/05/2017 4:24 AM, 9il wrote:
 snip
 Looks like Rust macro system can do something similar.

 Just to confirm, in Rust is it calling a function to assign the 
 value?

I mean that Rust has a macro system. I do not know if it can be 
used for indexing.

 E.g.
 ```D
 void opIndexAssign(T v, size_t j, size_t i);
 ```

 Because if the compiler is seeing a function call, it probably 
 won't attempt this optimization, but if it's seeing a direct to 
 memory write, that's a different story.

 Also the compiler doesn't know if ``foo[i][j]`` is equal to 
 ``foo[i, j]``. So your proposed solution isn't valid.

The proposed solution is similar to mixins. The compiler does not 
need to know anything. It just needs to replace `foo[i, j]` with 
its body, as if it were a mixin. All loop optimization would then 
be done by the optimizer. See clang's loop optimisation for more details.

May 19 2017

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the code 
 like

 [...]

I assume you compiled with LDC and used pragma(inline, true). 
Have you had a chance to look at what 
`-fsave-optimization-record` gives you? (This was added to LDC 
master 3 days ago.) That may give some insight as to why it was 
not vectorised, e.g. bounds checks not removed.

I don't think we need a new language feature for this.

May 19 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 03:53:19 UTC, Nicholas Wilson wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]

 I assume you compiled with LDC and used pragma(inline, true). 
 Have you had a chance to look at what 
 `-fsave-optimization-record` gives you? (This was added to LDC 
 master 3 days ago.) That may give some insight as to why it was 
 not vectorised, e.g. bounds checks not removed.

 I don't think we need a new language feature for this.

No. It is exactly because it is a single function for both column 
and row indexes.

May 19 2017

Vladimir Panteleev <thecybershadow.lists gmail.com> writes:

On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 What can I do to make it happen?

Sounds like you're asking for opIndex currying?

https://en.wikipedia.org/wiki/Currying

Have you tried implementing opIndex as a function which takes a 
single argument, and returns an object which then also implements 
opIndex with a single argument? You would probably need to write 
matrix[2][4] instead of matrix[2, 4], but that doesn't look hard 
to fix as well.
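
A minimal sketch of what I mean, assuming hypothetical `Row` and `CurriedMatrix` types (these names and fields are made up for illustration and are not how ndslice is implemented). The one-argument opIndex does the row arithmetic once and returns a small proxy; the two-argument form just forwards to it, so matrix[2, 4] keeps working:

```d
// Hypothetical proxy for a single row.
struct Row
{
    double* ptr;   // start of the row
    size_t length; // number of columns

    ref double opIndex(size_t j)
    {
        assert(j < length);
        return ptr[j];
    }
}

// Hypothetical row-major matrix with a curried index operator.
struct CurriedMatrix
{
    double[] data;
    size_t rows, cols;

    // matrix[i]: compute the row offset once and return a proxy.
    Row opIndex(size_t i)
    {
        assert(i < rows);
        return Row(data.ptr + i * cols, cols);
    }

    // matrix[i, j]: forwards to the curried form.
    ref double opIndex(size_t i, size_t j)
    {
        return this[i][j];
    }
}
```

In a loop, `auto row = matrix[i];` can then be taken once per outer iteration and indexed with `row[j]` inside.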

 As was said optimizers can not split opIndex body because it 
 is function (inlined or not inlined does not matter).

Have you tried splitting the opIndex implementation into two 
functions, one with just the code that should always be inlined, 
and one with the rest of the code that doesn't necessarily have 
to be inlined?

How about pragma(inline), does that help?

May 19 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 03:53:42 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 What can I do to make it happen?

 Sounds like you're asking for opIndex currying?

 https://en.wikipedia.org/wiki/Currying

 Have you tried implementing opIndex as a function which takes a 
 single argument, and returns an object which then also 
 implements opIndex with a single argument? You would probably 
 need to write matrix[2][4] instead of matrix[2, 4], but that 
 doesn't look hard to fix as well.

Yes, matrix[i][j] allows vectorization. This is already 
implemented.
At the same time, users prefer the [i, j] syntax. So it should be 
deprecated :-/

 As was said optimizers can not split opIndex body because it 
 is function (inlined or not inlined does not matter).

 Have you tried splitting the opIndex implementation into two 
 functions, one with just the code that should always be 
 inlined, and one with the rest of the code that doesn't 
 necessarily have to be inlined?

ditto

 How about pragma(inline), does that help?

No

May 19 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 The reason is that `matrixX[i, j]` is opIndex call, opIndex is 
 a function. It can be inlined. But optimizers can not split its 
 body and move half of opIndex computations out of the inner 
 loop, which is required for vectorization.

Hmm, it looks like the new LLVM solves this issue. I need to do 
more benchmarks...

May 20 2017

John Colvin <john.loughran.colvin gmail.com> writes:

On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the code 
 like

 [...]

What you are saying is that there is a specific shortcoming that 
you are observing in optimisers, yes? Perhaps we should 
investigate how to fix the optimisers first before insisting on 
language additions / changes.

Have you talked to someone with experience writing optimisers 
about what stops the relevant optimisation being done after 
inlining?

May 20 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]

 What you are saying is that there is a specific shortcoming 
 that you are observing in optimisers, yes? Perhaps we should 
 investigate how to fix the optimisers first before insisting on 
 language additions / changes.

 Have you talked to someone with experience writing optimisers 
 about what stops the relevant optimisation being done after 
 inlining?

I just found that the new LLVM solves this issue (and was very 
surprised).
The reason that ndslice <=v0.6.1 was so slow is LDC Issue 2121 [1].
I have added a workaround in [2]; it is v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41

May 20 2017

John Colvin <john.loughran.colvin gmail.com> writes:

On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
 Hello,

 When users write math code, they expect [2, 3, 4] that the 
 code like

 [...]

 What you are saying is that there is a specific shortcoming 
 that you are observing in optimisers, yes? Perhaps we should 
 investigate how to fix the optimisers first before insisting 
 on language additions / changes.

 Have you talked to someone with experience writing optimisers 
 about what stops the relevant optimisation being done after 
 inlining?

 I just found that new LLVM solves this issue (and was very 
 surprised).
 The reason that ndslice <=v0.6.1 was so slow is LDC Issue 2121.
 I have added workaround in [2], it is v0.6.2.

 [1] https://github.com/ldc-developers/ldc/issues/2121
 [2] https://github.com/libmir/mir-algorithm/pull/41

What's surprising about it? Thinking very simplistically (I 
don't know how it actually works), if inlining happened first 
then surely the later optimisation stages wouldn't have a problem 
detecting the necessary loop invariants and hoisting them out.

May 20 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 What's surprising about it? Thinking very simplistically (I 
 don't know how it actually works), if inlining happened first 
 then surely the later optimisation stages wouldn't have a 
 problem detecting the necessary loop invariants and hoisting 
 them out.

Inlining is usually one of the first passes scheduled, 
so that should not be an issue.
However, loop-invariant code motion is not straightforward.
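
For reference, a small sketch of what loop-invariant code motion amounts to when done by hand. The function names and the flat `double[]` layout are assumptions made for this example; nothing here is taken from ndslice or from what LLVM actually emits.

```d
// Naive form: the row offset i * stride is recomputed for every j.
void scaleNaive(double[] data, size_t rows, size_t cols, size_t stride, double k)
{
    foreach (i; 0 .. rows)
        foreach (j; 0 .. cols)
            data[i * stride + j] *= k;
}

// Same loop with the j-invariant part hoisted out of the inner loop.
void scaleHoisted(double[] data, size_t rows, size_t cols, size_t stride, double k)
{
    foreach (i; 0 .. rows)
    {
        // Invariant in j: computed once per row.
        auto row = data[i * stride .. i * stride + cols];
        foreach (j; 0 .. cols)
            row[j] *= k;
    }
}
```

Both functions produce the same result; the second simply makes the contiguous inner access explicit, which is the shape a vectorizer wants to see.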

May 20 2017

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 [...]

 I just found that new LLVM solves this issue (and was very 
 surprised).
 The reason that ndslice <=v0.6.1 was so slow is LDC Issue 2121.
 I have added workaround in [2], it is v0.6.2.

 [1] https://github.com/ldc-developers/ldc/issues/2121
 [2] https://github.com/libmir/mir-algorithm/pull/41

 What's surprising about it? Thinking very simplistically (I 
 don't know how it actually works), if inlining happened first 
 then surely the later optimisation stages wouldn't have a 
 problem detecting the necessary loop invariants and hoisting 
 them out.

It did not work before. I did similar benchmarks a year ago.

May 20 2017

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Sat, May 20, 2017 at 12:00:37PM +0000, 9il via Digitalmars-d wrote:
 On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
 On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
 On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
 [...]

 
 I just found that new LLVM solves this issue (and was very
 surprised).
 The reason that ndslice <=v0.6.1 was so slow is LDC Issue 2121.
 I have added workaround in [2], it is v0.6.2.
 
 [1] https://github.com/ldc-developers/ldc/issues/2121
 [2] https://github.com/libmir/mir-algorithm/pull/41

 
 What's surprising about it? Thinking very simplistically (I don't
 know how it actually works), if inlining happened first then surely
 the later optimisation stages wouldn't have a problem detecting the
 necessary loop invariants and hoisting them out.

 
 It did not work before. I did similar benchmarks a year ago.

I don't think this warrants a language change.  It's an implementation
issue, specifically, an optimizer issue. Optimizers can always be
refined and improved further.


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi

May 20 2017
