
digitalmars.D - How about implementing SPMD on SIMD for D?

reply Random D user <no email.com> writes:
TL;DR
Would you want to run your code 8x - 32x faster? SPMD (Single 
Program Multiple Data) on SIMD (Single Instruction Multiple Data) 
might be the answer you're looking for.
It works by running multiple iterations/instances of your loop at 
once on SIMD hardware, and the compiler can do that automatically 
for you and your normal loop & array code.

---

I'm a bit late to the party, but I was recently reading this ( 
http://pharr.org/matt/blog/2018/04/30/ispc-all.html ), a highly 
interesting blog post series about how one guy did what the Intel 
compiler team wouldn't or couldn't do.
He wrote a C-like language and compiler on top of LLVM which 
transforms normal scalar code into "parallel" SIMD code.
That compiler is called ISPC ( 
https://ispc.github.io/perf.html ).

It basically works similarly to GPU shaders, but the code runs 
on the CPU's SIMD units.
You write your code for one thread/lane and the compiler then 
runs N instances of that code simultaneously in lockstep.
For example, looping 8x over (c.xyzw = a.xyzw + b.xyzw) would 
become 2x (x.cccc = x.aaaa + x.bbbb; y.cccc = y.aaaa + y.bbbb; 
z.cccc = z.aaaa + z.bbbb; w.cccc = w.aaaa + w.bbbb) (the notation 
here is a bit weird, but I was trying to keep it short).
Branches are done using masking, so the code runs both sides of 
the branch, but masks away the wrong results.
All of this is way better described in the paper they wrote about 
it ( http://pharr.org/matt/papers/ispc_inpar_2012.pdf ). I 
recommend reading it.

I was also looking at some videos from Unity (game 
engine/framework) about their new "Performance by default" 
initiative.
They compile a subset of C#: structs, functions, slices and 
annotations (no classes).
That reminded me of D :).
One thing they touched on was pointer aliasing and how slices and 
custom compiler tech (that knows about the other engine systems) 
allow them to avoid aliasing and produce more optimal code.
However, the interesting part was that their compiler does 
similar things to ISPC when specific annotations are given by the 
programmer.
Video about the tech/compiler is here ( 
https://www.youtube.com/watch?v=NF6kcNS6U80&feature=youtu.be?list=PLX2vGYjWbI0S8jCJKYT-mIZf7YCuF-Ka ).

It occurred to me that SPMD on SIMD would be a really nice 
addition to D's arsenal.
Especially since D doesn't even attempt any auto-vectorization 
(poor results and difficult to implement) and manual SIMD loops 
are quite tedious to write (even std.simd failed to materialize), 
SPMD would be a nice alternative.
D also has some existing vector syntax and specialization, so 
there's a precedent for vector programming. This could be 
considered an extension to that.
SPMD should be easy to implement (though I'm not a compiler 
expert) since it's only a code transformation and not an 
optimization.

Finally, I don't think any serious systems/performance oriented 
language can ignore that kind of performance-increase figures for 
too long.

I had something like this in mind:

@spmd  // or @simd  // NOTE: just removing @spmd would mean it's 
a normal loop, great for debugging
foreach( int i; 0 .. 100 )
{
     c[i] = a[i] + b[i];
}

or

void doSum( float4[] a, float4[] b, float4[] c ) @spmd  // or @simd
{
     c = a + b;  // NOTE: c[i] = a[i] + b[i], array index is 
implicit because of @spmd; it's just some index of 0 .. a.length
}

What do you think?
Jul 06 2018
next sibling parent reply solidstate1991 <laszloszeremi outlook.com> writes:
I think fixing the SIMD support would be also great.

1) Make it compatible with Intel intrinsics.

2) Update it to work with AVX-512.

3) Add some preliminary support for ARM Neon.

4) Optional: Make SIMD compile for all 32-bit targets, so I 
don't have to write long-ass assembly code for stuff. Once I even 
broke LDC with my code (see CPUblit).
Jul 06 2018
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 7 July 2018 at 02:00:14 UTC, solidstate1991 wrote:
 I think fixing the SIMD support would be also great.

 1) Make it compatible with Intel intrinsics.
Working on this: https://github.com/AuburnSounds/intel-intrinsics
With LDC I'm getting the best performance out of vector intrinsics (wrapped or not with that library). But this is a layer that will change less than the LDC intrinsics.
 2) Update it to work with AVX-512.
The nice thing about LLVM intrinsics is that _semantics are decorrelated from codegen_. You can generate AVX instructions (if you really think it helps you) while writing the more common SSE intrinsics.
 3) Add some preliminary support for ARM Neon.
I think it will already work if you use LDC intrinsics (or intel-intrinsics) and build for ARM.
 4) Optional: Make SIMD to compile for all 32 bit targets, so I 
 don't have to write long-ass assembly code for stuff. Once I 
 even broke LDC with my code (see CPUblit).
Oh yes, so much this. It would be very nice to have D_SIMD on DMD 32-bit.
Jul 07 2018
next sibling parent reply solidstate1991 <laszloszeremi outlook.com> writes:
On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat wrote:
 Oh yes, so much this. It would be very nice to have D_SIMD on 
 DMD 32-bit.
Does D_SIMD work on LDC?
Jul 07 2018
parent Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 7 July 2018 at 15:34:40 UTC, solidstate1991 wrote:
 On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat 
 wrote:
 Oh yes, so much this. It would be very nice to have D_SIMD on 
 DMD 32-bit.
Does D_SIMD work on LDC?
No, it has different capabilities (however, some of it is similar).
Jul 07 2018
prev sibling parent reply Ethan <gooberman gmail.com> writes:
On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat wrote:
 The nice thing about LLVM intrinsics is that _semantics is 
 deccorelated from codegen_. You can generate AVX instructions 
 (if you really think it helps you) while writing the more 
 common SSE intrinsics.
Word to the wise. Some platforms out there require you to generate AVX encodings, not SSE. Nudge nudge.
Jul 07 2018
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 7 July 2018 at 21:59:09 UTC, Ethan wrote:
 Word to the wise. Some platforms out there require you to 
 generate AVX encodings, not SSE. Nudge nudge.
Can you elaborate? What do you mean by AVX encoding: do you mean the new VEX encoding or AVX intrinsics?
Jul 08 2018
parent Ethan <gooberman gmail.com> writes:
On Sunday, 8 July 2018 at 11:47:20 UTC, Guillaume Piolat wrote:
 Can you elaborate? What do you mean by AVX encoding, you mean 
 the new VEX encoding or AVX intrinsics?
VEX encoding. AVX intrinsics in the Intel API are just intrinsic extensions like every SSE revision before them. It's purely up to the compiler how it transforms those intrinsics into instructions.
Jul 08 2018
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Friday, 6 July 2018 at 23:08:27 UTC, Random D user wrote:
 [SPMD] works by running multiple iterations/instances of your 
 loop at once on SIMD and the compiler could do that 
 automatically for you and your normal loop & array code.
 It occurred to me that SPMD on SIMD would be really nice 
 addition to D's arsenal.
 Especially, since D doesn't even attempt any auto-vectorization 
 (poor results and difficult to implement) and manual loops are 
 quite tedious to write (even std.simd failed to materialize), 
 so SPMD would be nice alternative.
I think you are mistaken; D code is often autovectorized when using LDC. Sometimes it's not, and it's hard to know why. A pragma we could have is the one in the Intel C++ Compiler that says "hey, this loop is safe to autovectorize".
 What do you think?
I think that ispc is like OpenCL on the CPU, but can't work on the GPU, FPGA or other OpenCL implementations. OpenCL is so fast because caching is explicit (several levels of memory are exposed).
Jul 07 2018
parent reply Random D user <no email.com> writes:
On Saturday, 7 July 2018 at 13:26:10 UTC, Guillaume Piolat wrote:
 On Friday, 6 July 2018 at 23:08:27 UTC, Random D user wrote:
 Especially, since D doesn't even attempt any 
 auto-vectorization (poor results and difficult to implement) 
 and manual loops are quite tedious to write (even std.simd 
 failed to materialize), so SPMD would be nice alternative.
I think you are mistaken, D code is autovectorized often when using LDC.
That is good to know. I haven't looked that much into LDC (or clang). I mostly use dmd for the fast edit-compile cycle, although the plan is to use LDC for "release"/optimized builds eventually.

Anyway, I would just want to code some non-trivial loops in SIMD, but I wouldn't want to fiddle with intrinsics, or write a higher-level wrapper for them.

In my experience, you can only get the real benefits out of SIMD if you carefully handcraft your hot loops to fully use it. Sprinkling some SIMD here and there with a SIMD vector type doesn't really seem to yield big benefits.
 Sometimes it's not and it's hard to know why.
Exactly. In my experience compilers (msvc) often don't.
 A pragma we could have is the one in the Intel C++ Compiler 
 that says "hey this loop is safe to autovectorize".

 What do you think?
I think that ispc is like OpenCL on the CPU, but can't work on the GPU, FPGA or other OpenCL implementation. OpenCL is so fast because caching is explicit (several levels of memory are exposed).
Yeah, it should be similar. The point is not to run it on the GPU; you can do CUDA, OpenCL, compute shaders etc. for that. CPU code is much easier to debug, and sometimes you're already doing things on the GPU but your CPU side has more room for computation. And you don't have to copy your data between the GPU and CPU or deal with latency.

Of course, OpenCL runs on CPUs too, but I think there's quite a bit of code required to set it up and use it.

I guess my point was that I would like to write CPU SIMD code easily without intrinsics (or manually trying to trick the compiler into vectorizing the code). SPMD stuff seems to solve these issues. It would also be a forward-looking step for D.

Ideally, just write your loop normally, debug it, and add an annotation to get it to run fast on SIMD. Done.
Jul 08 2018
parent Guillaume Piolat <first.last gmail.com> writes:
On Sunday, 8 July 2018 at 19:07:57 UTC, Random D user wrote:
 In my experience, you can only get the real benefits out of 
 SIMD if you carefully handcraft your hot loops to fully use it. 
 Sprinkling some SIMD here and there with a SIMD vector type, 
 doesn't really seem to yield big benefits.
I agree. That's why it's useful to have a (stable) syntax for instructions like PMADDWD that can't be described with just operator overloading.
Jul 09 2018