digitalmars.D - [4Walter&Andrei] D is 40 times slower. We need a new language feature!
- 9il (67/67) May 19 2017 Hello,
- rikki cattermole (12/13) May 19 2017 Just to confirm, in Rust is it calling a function to assign the value?
- 9il (7/21) May 19 2017 I mean that Rust has macro system. I do not know if it can be
- Nicholas Wilson (7/11) May 19 2017 I assume you compiled with LDC and used pragma(inline, true).
- 9il (3/16) May 19 2017 No. It is exactly because it is single function for column and
- Vladimir Panteleev (13/16) May 19 2017 Sounds like you're asking for opIndex currying?
- 9il (8/24) May 19 2017 Yes, matrix[i][j] allows vectorization. This is already
- 9il (3/7) May 20 2017 Hmm, look like new LLVM solves this issue. Need to do more
- John Colvin (8/12) May 20 2017 What you are saying is that there is a specific shortcoming that
- 9il (7/21) May 20 2017 I just found that new LLVM solves this issue (and was very
- John Colvin (5/28) May 20 2017 What's surprising about it? Thinking very simplistically (I
- Stefan Koch (4/9) May 20 2017 Inlining is usually one of the first passes scheduled.
- 9il (2/18) May 20 2017 It did not work before. I did similar benchmarks a year ago.
- H. S. Teoh via Digitalmars-d (7/26) May 20 2017 I don't think this warrants a language change. It's an implementation
Hello,

When users write math code, they expect [2, 3, 4] that code like

------
import mir.ndslice; //[1]
...
foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
    }
}
------

will be vectorized like in Fortran and other math languages.

There are different kinds of engineers, and a good math engineer may not have an engineering education. They may be economists, physicists, geologists, etc. They may not be programmers, or they may not be D programmers. They do not want to learn special loop-free programming idioms. They just want their code to be fast with ordinary loops.

In D the example above cannot be vectorized. The reason is that `matrixX[i, j]` is an opIndex call, and opIndex is a function. It can be inlined, but optimizers cannot split its body and move part of the opIndex computation out of the inner loop, which is required for vectorization. Optimizers should produce

------------
foreach (i; 0..m)
{
    auto __tempV1 = matrix1[i];
    auto __tempV2 = matrix2[i];
    auto __tempV3 = matrix3[i];
    foreach (j; 0..n)
    {
        // use __tempV1[j], __tempV2[j], __tempV3[j]
    }
}
------------

As was said, optimizers cannot split the opIndex body because it is a function (inlined or not inlined does not matter).

Walter, Andrei, and D community, please help to make D simple for the math world! I do not know what language changes we should add. I only know how it should look to the compiler:

------
import mir.ndslice;
...
foreach (i; 0..m)
{
    foreach (j; 0..n)
    {
        // matrixX[i, j] should be transformed to
        // matrixX.ptr[matrixX.stride * i + j]
        // (a simplified version; ndslice has a more complex API)
    }
}
------

Looks like the Rust macro system can do something similar.

What can I do to make it happen?

Ilya

[1] https://github.com/libmir/mir-algorithm
[2] https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d
[3] https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18
[4] https://gitter.im/libmir/public
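To make the problem concrete, here is a minimal hypothetical sketch (plain D, not the real mir.ndslice API; the `Matrix` type and `sum` function are illustrative assumptions) of a strided matrix with a multi-dimensional opIndex, together with the manually hoisted loop that the optimizer is expected to produce on its own:

```d
// Hypothetical minimal strided matrix; the real ndslice API is more complex.
struct Matrix
{
    double* ptr;
    size_t stride;

    // Multi-dimensional indexing: matrix[i, j]
    double opIndex(size_t i, size_t j)
    {
        return ptr[stride * i + j];
    }
}

void sum(Matrix m1, Matrix m2, double[] result, size_t rows, size_t cols)
{
    foreach (i; 0 .. rows)
    {
        // Manually hoisted row bases: this is exactly the loop-invariant
        // code motion the optimizer should perform after inlining opIndex.
        auto row1 = m1.ptr + m1.stride * i;
        auto row2 = m2.ptr + m2.stride * i;
        foreach (j; 0 .. cols)
            result[i * cols + j] = row1[j] + row2[j]; // vectorizable inner loop
    }
}
```

With the row bases hoisted, the inner loop is a straight walk over contiguous memory, which the auto-vectorizer can handle.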
May 19 2017
On 20/05/2017 4:24 AM, 9il wrote:
> snip
> Looks like Rust macro system can do something similar.

Just to confirm, in Rust is it calling a function to assign the value? E.g.

```D
void opIndexAssign(T v, size_t j, size_t i);
```

Because if the compiler is seeing a function call, it probably won't attempt this optimization, but if it's seeing a direct-to-memory write, that's a different story.

Also, the compiler doesn't know that ``foo[i][j]`` is equal to ``foo[i, j]``. So your proposed solution isn't valid.
May 19 2017
On Saturday, 20 May 2017 at 03:50:31 UTC, rikki cattermole wrote:
> On 20/05/2017 4:24 AM, 9il wrote:
>> snip
>> Looks like Rust macro system can do something similar.
>
> Just to confirm, in Rust is it calling a function to assign the value?

I mean that Rust has a macro system. I do not know if it can be used for indexing.

> E.g.
>
> ```D
> void opIndexAssign(T v, size_t j, size_t i);
> ```
>
> Because if the compiler is seeing a function call, it probably won't attempt this optimization, but if it's seeing a direct-to-memory write, that's a different story. Also the compiler doesn't know that ``foo[i][j]`` is equal to ``foo[i, j]``. So your proposed solution isn't valid.

The proposed solution is similar to mixins. The compiler does not need to know anything; it just needs to replace `foo[i, j]` with its body, like a mixin. All loop optimization would be done by the optimizer. See clang loop optimization for more details.
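As an illustration of the mixin idea (a hypothetical sketch of the concept, not a proposal for actual syntax; the `Matrix` type and `indexExpr` helper are made up for this example), the indexing expression can already be injected textually today with a string mixin, so the optimizer sees raw pointer arithmetic instead of a call:

```d
// Hypothetical: expose the "body" of opIndex as a string mixin so the
// address computation appears textually at the use site.
struct Matrix
{
    double* ptr;
    size_t stride;
}

// Builds the indexing expression from the given variable names.
enum indexExpr(string m, string i, string j) =
    m ~ ".ptr[" ~ m ~ ".stride * " ~ i ~ " + " ~ j ~ "]";

double sumAll(Matrix m, size_t rows, size_t cols)
{
    double s = 0;
    foreach (i; 0 .. rows)
        foreach (j; 0 .. cols)
            // Expands to m.ptr[m.stride * i + j]; no function call remains.
            s += mixin(indexExpr!("m", "i", "j"));
    return s;
}
```

The drawback, of course, is that this gives up the `matrix[i, j]` syntax, which is the point of the thread.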
May 19 2017
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> Hello, When users write math code, they expect [2, 3, 4] that the code like [...]

I assume you compiled with LDC and used pragma(inline, true). Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed.

I don't think we need a new language feature for this.
May 19 2017
On Saturday, 20 May 2017 at 03:53:19 UTC, Nicholas Wilson wrote:
> I assume you compiled with LDC and used pragma(inline, true). Have you had a chance to look at what `-fsave-optimization-record` gives you? (This was added to LDC master 3 days ago.) That may give some insight as to why it was not vectorised, e.g. bounds checks not removed. I don't think we need a new language feature for this.

No. It is exactly because it is a single function for both the column and row indexes.
May 19 2017
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> What can I do to make it happen?

Sounds like you're asking for opIndex currying?

https://en.wikipedia.org/wiki/Currying

Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument? You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.

> As was said, optimizers can not split opIndex body because it is a function (inlined or not inlined does not matter).

Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined? How about pragma(inline), does that help?
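A minimal sketch of the currying suggestion (hypothetical `Matrix`/`Row` types, not the ndslice API): the outer opIndex returns a row view whose base pointer is computed once, so the inner loop indexes off a precomputed address:

```d
// Curried indexing: matrix[i] computes the row base once,
// then row[j] is plain contiguous indexing.
struct Row
{
    double* ptr; // row base, computed once per outer iteration

    double opIndex(size_t j) { return ptr[j]; }
}

struct Matrix
{
    double* ptr;
    size_t stride;

    // matrix[i] yields a row view; matrix[i][j] indexes within it.
    Row opIndex(size_t i) { return Row(ptr + stride * i); }
}
```

Hoisting `auto row = matrix[i];` out of the inner loop then makes the loop-invariant part explicit in user code, instead of relying on the optimizer to split a two-argument opIndex.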
May 19 2017
On Saturday, 20 May 2017 at 03:53:42 UTC, Vladimir Panteleev wrote:
> Sounds like you're asking for opIndex currying?
>
> https://en.wikipedia.org/wiki/Currying
>
> Have you tried implementing opIndex as a function which takes a single argument, and returns an object which then also implements opIndex with a single argument? You would probably need to write matrix[2][4] instead of matrix[2, 4], but that doesn't look hard to fix as well.

Yes, matrix[i][j] allows vectorization. This is already implemented. At the same time, users prefer the [i, j] syntax. So it should be deprecated :-/

> Have you tried splitting the opIndex implementation into two functions, one with just the code that should always be inlined, and one with the rest of the code that doesn't necessarily have to be inlined?

ditto

> How about pragma(inline), does that help?

No
May 19 2017
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> The reason is that `matrixX[i, j]` is an opIndex call, and opIndex is a function. It can be inlined, but optimizers cannot split its body and move part of the opIndex computation out of the inner loop, which is required for vectorization.

Hmm, looks like the new LLVM solves this issue. Need to do more benchmarks...
May 20 2017
On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
> Hello, When users write math code, they expect [2, 3, 4] that the code like [...]

What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first, before insisting on language additions / changes. Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
May 20 2017
On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
> On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
>> Hello, When users write math code, they expect [2, 3, 4] that the code like [...]
>
> What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes. Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?

I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121 [1]. I have added a workaround in [2]; it is v0.6.2.

[1] https://github.com/ldc-developers/ldc/issues/2121
[2] https://github.com/libmir/mir-algorithm/pull/41
May 20 2017
On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
> On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
>> On Saturday, 20 May 2017 at 03:24:41 UTC, 9il wrote:
>>> Hello, When users write math code, they expect [2, 3, 4] that the code like [...]
>>
>> What you are saying is that there is a specific shortcoming that you are observing in optimisers, yes? Perhaps we should investigate how to fix the optimisers first before insisting on language additions / changes. Have you talked to someone with experience writing optimisers about what stops the relevant optimisation being done after inlining?
>
> I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121. I have added a workaround in [2]; it is v0.6.2.
>
> [1] https://github.com/ldc-developers/ldc/issues/2121
> [2] https://github.com/libmir/mir-algorithm/pull/41

What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.
May 20 2017
On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
> What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.

Inlining is usually one of the first passes scheduled, so that should not be an issue. However, loop-invariant code motion is not straightforward.
May 20 2017
On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
> On Saturday, 20 May 2017 at 11:34:55 UTC, 9il wrote:
>> On Saturday, 20 May 2017 at 11:30:54 UTC, John Colvin wrote:
>>> [...]
>>
>> I just found that the new LLVM solves this issue (and was very surprised). The reason that ndslice <= v0.6.1 was so slow is LDC issue 2121. I have added a workaround in [2]; it is v0.6.2.
>>
>> [1] https://github.com/ldc-developers/ldc/issues/2121
>> [2] https://github.com/libmir/mir-algorithm/pull/41
>
> What's surprising about it? Thinking very simplistically (I don't know how it actually works), if inlining happened first then surely the later optimisation stages wouldn't have a problem detecting the necessary loop invariants and hoisting them out.

It did not work before. I did similar benchmarks a year ago.
May 20 2017
On Sat, May 20, 2017 at 12:00:37PM +0000, 9il via Digitalmars-d wrote:
> On Saturday, 20 May 2017 at 11:47:32 UTC, John Colvin wrote:
>> [...]
>
> It did not work before. I did similar benchmarks a year ago.

I don't think this warrants a language change. It's an implementation issue, specifically, an optimizer issue. Optimizers can always be refined and improved further.

T

--
In a world without fences, who needs Windows and Gates? -- Christian Surchi
May 20 2017