
digitalmars.D - A look at Chapel, D, and Julia using kernel matrix calculations

reply data pulverizer <data.pulverizer gmail.com> writes:
Hi,

this article grew out of a Dlang Learn thread 
(https://forum.dlang.org/thread/motdqixwsqmabzkdoslp forum.dlang.org). It looks
at Kernel Matrix Calculations in Chapel, D, and Julia and has a more general
discussion of all three languages. Comments welcome.

https://github.com/dataPulverizer/KernelMatrixBenchmark

Thanks
May 21
next sibling parent reply CraigDillabaugh <craig.dillabaugh gmail.com> writes:
On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Hi,

 this article grew out of a Dlang Learn thread 
 (https://forum.dlang.org/thread/motdqixwsqmabzkdoslp forum.dlang.org). It
looks at Kernel Matrix Calculations in Chapel, D, and Julia and has a more
general discussion of all three languages. Comments welcome.

 https://github.com/dataPulverizer/KernelMatrixBenchmark

 Thanks
Very well done, an interesting read. I liked the comment made about the lack of examples for how to link to C code. Mike Parker had some excellent tutorials on this, but I couldn't find them after a quick search.
May 21
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 22/05/2020 3:12 PM, CraigDillabaugh wrote:
 Very well done, an interesting read. I liked the comment made about the 
 lack of examples for how to link to C code. Mike Parker had some 
 excellent tutorials on this, but I couldn't find them after a quick search.
https://dlang.org/blog/category/d-and-c/
May 21
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Hi,

 this article grew out of a Dlang Learn thread 
 (https://forum.dlang.org/thread/motdqixwsqmabzkdoslp forum.dlang.org). It
looks at Kernel Matrix Calculations in Chapel, D, and Julia and has a more
general discussion of all three languages. Comments welcome.

 https://github.com/dataPulverizer/KernelMatrixBenchmark

 Thanks
Nice post. You said "adding SIMD support could easily put D ahead or on par with Julia at the larger data size". It's not clear precisely what you mean. Does this package help? https://code.dlang.org/packages/intel-intrinsics
May 22
parent data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 May 2020 at 13:46:21 UTC, bachmeier wrote:
 On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 https://github.com/dataPulverizer/KernelMatrixBenchmark
Nice post. You said "adding SIMD support could easily put D ahead or on par with Julia at the larger data size". It's not clear precisely what you mean. Does this package help? https://code.dlang.org/packages/intel-intrinsics
Sorry it wasn't clear; I have amended the statement. I meant that adding SIMD support to my matrix object could put D's performance on par with or ahead of Julia's at the largest data set. Julia edges D out on that data set and has SIMD support, whereas my matrix object does not, so I'm betting that SIMD is the "x-factor" in Julia's performance at that scale. I've removed "easily" because it's too strong a word - this is educated speculation. Probably something to look at next; I need to do some reading on SIMD. Thanks for the link, it's code that will get me started.
May 22
prev sibling next sibling parent reply kinke <noone nowhere.com> writes:
On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Comments welcome.
Thx for the article. - You mention the lack of multi-dim array support in Phobos; AFAIK, that's fully intentional, and the de-facto solution is http://docs.algorithm.dlang.io/latest/mir_ndslice.html. As you suspect SIMD potential being left on the table by LDC, you can firstly use -mcpu=native to enable advanced instructions supported by your CPU, and secondly use -fsave-optimization-record to inspect LLVM's optimization remarks (e.g., why a loop isn't auto-vectorized etc.). -O5 is identical to -O3, which is identical to -O.
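To make that concrete, here is a minimal sketch of how those flags might be combined on the command line. The source file name is hypothetical, and the `.opt.yaml` output name is an assumption based on how LDC's optimization-record output usually mirrors clang's:

```shell
# Enable aggressive optimization plus the full instruction set of the
# host CPU, and save LLVM's optimization remarks (including reasons a
# loop was NOT vectorized) to a YAML file alongside the output.
ldc2 -O3 -mcpu=native -fsave-optimization-record kernelmatrix.d

# The remarks file can then be searched for vectorization decisions:
grep -i vectoriz kernelmatrix.opt.yaml
```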
May 22
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Friday, 22 May 2020 at 14:13:50 UTC, kinke wrote:
 On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Comments welcome.
Thx for the article. - You mention the lack of multi-dim array support in Phobos; AFAIK, that's fully intentional, and the de-facto solution is http://docs.algorithm.dlang.io/latest/mir_ndslice.html. As you suspect SIMD potential being left on the table by LDC, you can firstly use -mcpu=native to enable advanced instructions supported by your CPU, and secondly use -fsave-optimization-record to inspect LLVM's optimization remarks (e.g., why a loop isn't auto-vectorized etc.). -O5 is identical to -O3, which is identical to -O.
docs.algorithm.dlang.io is outdated. The new docs home is http://mir-algorithm.libmir.org/mir_ndslice.html
May 23
parent data pulverizer <data.pulverizer gmail.com> writes:
On Saturday, 23 May 2020 at 14:18:54 UTC, 9il wrote:
 The new docs home is
 http://mir-algorithm.libmir.org/mir_ndslice.html
I'm hoping to keep writing articles like this, and I'd like to get round to doing one or more on Mir's modules. By now the library is probably quite mature.
May 23
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 May 2020 at 14:13:50 UTC, kinke wrote:
 On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Comments welcome.
Thx for the article. - You mention the lack of multi-dim array support in Phobos; AFAIK, that's fully intentional, and the de-facto solution is http://docs.algorithm.dlang.io/latest/mir_ndslice.html.
I've now updated the blog with this information (the new docs home: http://mir-algorithm.libmir.org/mir_ndslice.html).
 As you suspect SIMD potential being left on the table by LDC, 
 you can firstly use -mcpu=native to enable advanced 
 instructions supported by your CPU, ...
If special instructions are enabled does the compiler automatically take advantage of these or does the programmer need to do anything?
 ... and secondly use -fsave-optimization-record to inspect 
 LLVM's optimization remarks (e.g., why a loop isn't 
 auto-vectorized etc.).
By this are you saying that SIMD happens automatically with `-mcpu=native` flag?
 ... -O5 is identical to -O3, which is identical to -O.
Yes, I saw that when I was writing the code; I tried it and found it to be true, but there's something psychologically comforting about using -O5 rather than -O. I've updated the article to reflect your comments. I'm in the process of updating the D code and will change the flags once I'm done. Thanks
May 23
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
 On Friday, 22 May 2020 at 14:13:50 UTC, kinke wrote:
 ... and secondly use -fsave-optimization-record to inspect 
 LLVM's optimization remarks (e.g., why a loop isn't 
 auto-vectorized etc.).
By this are you saying that SIMD happens automatically with `-mcpu=native` flag?
I've just tried it and the times are faster from just adding the flag: the times for the largest data set are very close to Julia's, sometimes a little faster, sometimes a little slower. I have extended the number of kernel functions to 9, and the full benchmark in Julia is taking 2 hours (I may have to re-run it since I made some more code changes). I've added the new kernels (locally) to D and am updating the script now; I also need to do the same for Chapel. Once I run them all I'll update the article. From what I can see now, D wins for all but the largest data set, but with the new flag it's so close to Julia that it will be a "photo finish". I might have to run the largest data size 100 times; I'll just pick one kernel (probably dot product) for that, but it will take AGES! I'll have to look at maybe running it on a cloud instance rather than locally. Populating the arrays is what takes the longest time, so I'll probably do that using parallel threads. This is getting interesting!
May 24
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
 By this are you saying that SIMD happens automatically with 
 `-mcpu=native` flag?
By default the compiler is conservative and only emits instructions that all CPUs can run. For 32-bit executables you are almost never going to get SSE instructions, because the stack is not guaranteed to be 16-byte aligned and it's not easy to prove that memory accesses are aligned and that pointers do not alias. When you tell the compiler to generate code for a specific CPU architecture (-mcpu) it can apply optimizations specific to that CPU, and that includes SIMD instruction generation.
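As a sketch of the kind of loop this affects, here is a hypothetical dot-product reduction (plain D, not code from the benchmark). A contiguous indexed loop like this is exactly what LLVM can auto-vectorize when -O2/-O3 and -mcpu=native are passed:

```d
import std.stdio;

// A simple reduction over contiguous memory. With `ldc2 -O3 -mcpu=native`
// LLVM can turn this loop into packed (e.g. AVX2) multiply/add
// instructions; without -mcpu it falls back to the baseline instruction set.
double dot(const double[] a, const double[] b) @safe pure
{
    double total = 0.0;
    foreach (i; 0 .. a.length)
        total += a[i] * b[i];
    return total;
}

void main()
{
    writeln(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])); // 4 + 10 + 18 = 32
}
```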
May 24
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Sunday, 24 May 2020 at 12:12:09 UTC, welkam wrote:
 On Sunday, 24 May 2020 at 05:39:30 UTC, data pulverizer wrote:
 By this are you saying that SIMD happens automatically with 
 `-mcpu=native` flag?
When you tell compiler to generate code for specific CPU architecture (-mcpu) it can apply specific optimizations for that CPU and that includes SIMD instruction generation.
My CPU is Coffee Lake (Intel i9-8950HK), which is not listed under `--mcpu=help` but has AVX2 instructions. I tried `--mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2` and am getting the same improved performance as when using `--mcpu=native`. Am I correct in assuming that `core-avx2` is right for my CPU? Thanks
May 24
parent welkam <wwwelkam gmail.com> writes:
On Sunday, 24 May 2020 at 16:51:37 UTC, data pulverizer wrote:
 My CPU is coffee lake (Intel i9-8950HK CPU) which is not listed 
 under `--mcpu=help`
Just use --mcpu=native. The compiler will check your CPU and use the correct flag for you. If you want to manually specify the architecture then look here: https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support
 I tried `--mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2` and 
 getting the same improved performance as when using 
 `--mcpu=native` am I correct in assuming that `core-avx2` is 
 right for my CPU?
These flags are for fine-grained control. If you have to ask about them, that means you should not use them; I would have to google to answer your question. When you use --mcpu=native all the appropriate flags are set for you, so you don't have to worry about them.

For a data scientist, here is a list of flags that you should be using, in order of importance:

--O2 (turning on optimizations is good)
--mcpu=native (allows the compiler to use newer instructions and enable architecture-specific optimizations; just don't share the binaries, because they might crash on older CPUs)
--O3 (less important than --mcpu and sometimes doesn't provide any speed improvement, so measure, measure, measure)
--flto=thin (link-time optimizations; good when using libraries)
PGO (not a single flag, but profile-guided optimization can add a few % improvement on top of all the other flags) http://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html
--ffast-math (only useful for floating point (float, double); if you don't do math with those types then this flag does nothing)
--boundscheck=off (a D-specific flag; the majority of array bounds checks are removed by the compiler without this flag, but it's good to throw it in just to make sure. Don't use it during development, though, because bounds checking can catch bugs.)

When reading your message I get the impression that you assumed those newer instructions would improve performance. When it comes to performance, never assume anything; always profile before making judgments. Maybe your CPU is limited by memory bandwidth, if you only have one stick of RAM and you use all 6 cores. Anyway, I looked at the disassembly of one function and it's mostly SSE instructions with one AVX instruction. That function is arrays.Matrix!(float).Matrix kernelmatrix.calculateKernelMatrix!(kernelmatrix.DotProduct!(float).DotProduct, float).calculateKernelMatrix(kernelmatrix.DotProduct!(float).DotProduct, arrays.Matrix!(float).Matrix)

For SIMD work D has specific vector types. I believe the compiler guarantees that they are properly aligned, but it's not stated in the docs: https://dlang.org/spec/simd.html. I have no experience writing SIMD code, but from what I've heard over the years, if you want to get maximum performance from your CPU you have to write your kernels with SIMD intrinsics.
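To illustrate, here is a minimal sketch using those vector types from core.simd (a toy example for x86_64 targets, not a tuned kernel; `float4` is four floats packed into one SSE register):

```d
import core.simd;
import std.stdio;

// a * x + y on four lanes at once: the arithmetic below compiles to
// single packed vector instructions rather than four scalar operations.
float4 axpy(float4 a, float4 x, float4 y)
{
    return a * x + y; // element-wise multiply and add
}

void main()
{
    float4 a = 2.0f, y = 1.0f;          // scalars broadcast to all four lanes
    float4 x = [1.0f, 2.0f, 3.0f, 4.0f]; // vector from an array literal
    float4 r = axpy(a, x, y);
    writeln(r.array); // [3, 5, 7, 9]
}
```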
May 25
prev sibling next sibling parent welkam <wwwelkam gmail.com> writes:
Nice article. It shows that you put a lot of work into it. The 
one thing I want to point out is that --O5 is the same as --O3.

From ldc2 --help:

  Setting the optimization level:
      -O     - Equivalent to -O3
      --O0   - No optimizations (default)
      --O1   - Simple optimizations
      --O2   - Good optimizations
      --O3   - Aggressive optimizations
      --O4   - Equivalent to -O3
      --O5   - Equivalent to -O3
      --Os   - Like -O2 with extra optimizations for size
      --Oz   - Like -Os but reduces code size further
May 22
prev sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 22 May 2020 at 01:58:07 UTC, data pulverizer wrote:
 Hi,

 this article grew out of a Dlang Learn thread 
 (https://forum.dlang.org/thread/motdqixwsqmabzkdoslp forum.dlang.org). It
looks at Kernel Matrix Calculations in Chapel, D, and Julia and has a more
general discussion of all three languages. Comments welcome.

 https://github.com/dataPulverizer/KernelMatrixBenchmark

 Thanks
An update of the article is available. Thanks
May 28