digitalmars.D.learn - SIMD under LDC

Igor <stojkovic.igor gmail.com> writes:
I found that I can't use the __simd function from core.simd under 
LDC. It has ldc.simd instead, but I couldn't find how to implement 
the equivalent of this with it:

ubyte16* masks = ...;
foreach (ref c; pixels) {
	c = __simd(XMM.PSHUFB, c, *masks);
}

I see it has a shufflevector function, but it only accepts constant 
masks and I am using a variable one. Is this possible under LDC?

BTW, shuffling channels within pixels using DMD SIMD is about 5 
times faster than with normal code on my machine :)
Sep 04
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Monday, 4 September 2017 at 20:39:11 UTC, Igor wrote:
 I found that I can't use __simd function from core.simd under 
 LDC
Correct, LDC does not support the core.simd interface.
 and that it has ldc.simd but I couldn't find how to implement 
 equivalent to this with it:

 ubyte16* masks = ...;
 foreach (ref c; pixels) {
 	c = __simd(XMM.PSHUFB, c, *masks);
 }

 I see it has shufflevector function but it only accepts 
 constant masks and I am using a variable one. Is this possible 
 under LDC?
You have several options:

* write a regular for loop and let LDC's optimiser take care of the rest.
* declare an LLVM intrinsic yourself and call it, something like:

alias mask_t = ReturnType!(equalMask!ubyte16);

pragma(LDC_intrinsic, "llvm.masked.load.v16i8.p0v16i8")
    ubyte16 llvm_masked_load(ubyte16* val, int alignment, mask_t mask, ubyte16 fallthru);

ubyte16* masks = ...;
foreach (ref c; pixels) {
	auto mask = equalMask!ubyte16(*masks, [-1,-1,-1, ...]);
	c = llvm_masked_load(&c, 16, mask, [0,0,0,0 ... ]);
}

The second one might not work, because of type differences in LLVM, but it should serve as a guide to hacking the `cmpMask` IR code in ldc.simd to do what you want it to.
 BTW. Shuffling channels within pixels using DMD simd is about 5 
 times faster than with normal code on my machine :)
Don't underestimate ldc's optimiser ;)
Sep 04
12345swordy <alexanderheistermann gmail.com> writes:
On Monday, 4 September 2017 at 23:06:27 UTC, Nicholas Wilson 
wrote:
 Don't underestimate ldc's optimiser ;)
I've seen cases where the compiler fails to optimize for SIMD.
Sep 04
Igor <stojkovic.igor gmail.com> writes:
On Tuesday, 5 September 2017 at 01:11:29 UTC, 12345swordy wrote:
 On Monday, 4 September 2017 at 23:06:27 UTC, Nicholas Wilson 
 wrote:
 Don't underestimate ldc's optimiser ;)
 I've seen cases where the compiler fails to optimize for SIMD.
I tried it, and the LDC optimized build did generate SIMD instructions from regular code, but it used several of them to do the job, so it is about 1.4 times slower than the manual SIMD version with DMD. That is probably good enough for me.
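
For readers wondering what the "regular code" variant could look like, here is a minimal sketch. It is not Igor's actual loop: the function name `shuffleChannels`, the 4-bytes-per-pixel layout and the `order` parameter are assumptions made for illustration. Whether and how well the optimiser vectorises it depends on the compiler version and flags.

```d
// Hypothetical helper (not from the thread): reorder the four channel bytes
// of every pixel according to `order`. For example, [2, 1, 0, 3] swaps the
// first and third channel and leaves the other two in place.
void shuffleChannels(ubyte[] pixels, const ubyte[4] order) pure nothrow @nogc @safe
{
    assert(pixels.length % 4 == 0, "expects whole 4-byte pixels");
    for (size_t i = 0; i < pixels.length; i += 4)
    {
        // Copy the pixel first so the reads below are not affected by
        // the writes into the same four bytes.
        ubyte[4] src;
        src[] = pixels[i .. i + 4];
        foreach (ch; 0 .. 4)
            pixels[i + ch] = src[order[ch] & 3];
    }
}
```

Calling `shuffleChannels(px, [2, 1, 0, 3])` on a buffer of 4-byte pixels reorders every pixel in place; the scalar loop leaves instruction selection entirely to the optimiser.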
Sep 05
Johan Engelen <j j.nl> writes:
On Monday, 4 September 2017 at 20:39:11 UTC, Igor wrote:
 I found that I can't use __simd function from core.simd under 
 LDC and that it has ldc.simd but I couldn't find how to 
 implement equivalent to this with it:

 ubyte16* masks = ...;
 foreach (ref c; pixels) {
 	c = __simd(XMM.PSHUFB, c, *masks);
 }

 I see it has shufflevector function but it only accepts 
 constant masks and I am using a variable one. Is this possible 
 under LDC?
You can use the module ldc.gccbuiltins_x86 (gccbuiltins_x86.di): __builtin_ia32_pshufb128 and __builtin_ia32_pshufb256 (also see https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/X86-Built_002din-Functions.html).

Please file a feature request about shufflevector with a variable mask in our (LDC) issue tracker on GitHub, with some code that you'd expect to work. Thanks.

- Johan
Sep 05
Igor <stojkovic.igor gmail.com> writes:
On Tuesday, 5 September 2017 at 18:50:34 UTC, Johan Engelen wrote:
 You can use the module ldc.gccbuiltins_x86 (gccbuiltins_x86.di): 
 __builtin_ia32_pshufb128 and __builtin_ia32_pshufb256 (also see 
 https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/X86-Built_002din-Functions.html). 
 Please file a feature request about shufflevector with a variable 
 mask in our (LDC) issue tracker on GitHub, with some code that 
 you'd expect to work. Thanks. - Johan
I'll try that this evening. Thanks! I'll also open an issue, but are you sure such a feature request is valid, since the LLVM shufflevector instruction, as far as I can see, only supports constant masks as well?
Sep 06
Igor <stojkovic.igor gmail.com> writes:
On Wednesday, 6 September 2017 at 09:01:18 UTC, Igor wrote:
 I'll try that this evening. Thanks! I'll also open an issue, but 
 are you sure such a feature request is valid, since the LLVM 
 shufflevector instruction, as far as I can see, only supports 
 constant masks as well?
I opened a feature request on GitHub. I also tried using the gccbuiltins, but I got this error:

LLVM ERROR: Cannot select: 0x2199c96fd70: v16i8 = X86ISD::PSHUFB 0x2199c74e9a8, 0x2199c74d6c0
  0x2199c74e9a8: v16i8,ch = CopyFromReg 0x21994bcfd90, Register:v16i8 %vreg384
    0x2199c96fb00: v16i8 = Register %vreg384
  0x2199c74d6c0: v16i8,ch = CopyFromReg 0x21994bcfd90, Register:v16i8 %vreg385
    0x2199c74ed50: v16i8 = Register %vreg385
In function: _D7assetdb12loadBmpImageFAxaZf
Building x64\LDCDebug\DNgin.exe failed!

You can see the code I used here: https://github.com/igor84/dngin/blob/3c171330843af71170a6ee4ae164a76bf58c35f6/source/assetdb.d#L123

Note that if you want to try it, you will need a test.bmp in a specific format where header.compression == 3, like this one: https://drive.google.com/file/d/0B9l8IgnRaPwCU0hIWEtHUElhTTg/view?usp=sharing
Sep 06
Johan Engelen <j j.nl> writes:
On Wednesday, 6 September 2017 at 20:43:01 UTC, Igor wrote:
 I opened a feature request on github. I also tried using the 
 gccbuiltins but I got this error:

 LLVM ERROR: Cannot select: 0x2199c96fd70: v16i8 = 
 X86ISD::PSHUFB 0x2199c74e9a8, 0x2199c74d6c0
That's because SSSE3 instructions are not enabled by default, so the compiler isn't allowed to generate the PSHUFB instruction. Some options you have:

1. Set a CPU that has SSSE3, e.g. compile with `-mcpu=native`
2. Enable SSSE3: compile with `-mattr=+ssse3`
3. Perhaps best for your case, enable SSSE3 for that function only, by importing the ldc.attributes module and using the @target("ssse3") UDA on that function.

-Johan
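
A minimal sketch of option 3, assuming LDC on an x86-64 target. The function name `shufflePixels` is made up for illustration, and the casts assume the builtin is declared with `byte16` parameters and return type in ldc.gccbuiltins_x86; check the declaration shipped with your LDC version before relying on it.

```d
import core.simd : byte16, ubyte16;

version (LDC) version (X86_64)
{
    import ldc.attributes : target;
    import ldc.gccbuiltins_x86 : __builtin_ia32_pshufb128;

    // SSSE3 is enabled for this function only, so PSHUFB can be selected
    // here without changing the CPU features used by the rest of the program.
    @target("ssse3")
    void shufflePixels(ubyte16[] pixels, ubyte16 mask)
    {
        foreach (ref c; pixels)
        {
            // Assumption: the builtin takes and returns byte16, hence the casts.
            c = cast(ubyte16) __builtin_ia32_pshufb128(cast(byte16) c, cast(byte16) mask);
        }
    }
}
```

The function still has to run on a CPU that actually supports SSSE3; the attribute only tells the compiler that it may emit those instructions inside this function.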
Sep 07
Igor <stojkovic.igor gmail.com> writes:
On Thursday, 7 September 2017 at 15:24:13 UTC, Johan Engelen 
wrote:
 That's because SSSE3 instructions are not enabled by default, so 
 the compiler isn't allowed to generate the PSHUFB instruction. 
 Some options you have:
 1. Set a CPU that has SSSE3, e.g. compile with `-mcpu=native`
 2. Enable SSSE3: compile with `-mattr=+ssse3`
 3. Perhaps best for your case, enable SSSE3 for that function 
    only, by importing the ldc.attributes module and using the 
    @target("ssse3") UDA on that function.
 -Johan
Thanks Johan. I tried this and now it does compile, but it crashes with an Access Violation in the debug build. In the optimized build it seems to be working, though.
Sep 07
Igor <stojkovic.igor gmail.com> writes:
On Thursday, 7 September 2017 at 16:45:40 UTC, Igor wrote:
 Thanks Johan. I tried this and now it does compile, but it 
 crashes with an Access Violation in the debug build. In the 
 optimized build it seems to be working, though.
I will try to reproduce this in a minimal project and open an LDC bug if successful. In the meantime, can anyone tell me how to add an attribute to a function only if something is defined, since this doesn't work:

version(USE_SIMD_WITH_LDC) {
  import ldc.attributes;
  @target("ssse3")
} void funcThatUsesSIMD() {
  ...
  version(LDC) {
    import ldc.gccbuiltins_x86;
    c = __builtin_ia32_pshufb128(c, *simdMasks);
  } else {
    c = __simd(XMM.PSHUFB, c, *simdMasks);
  }
  ...
}
Sep 11
Igor <stojkovic.igor gmail.com> writes:
On Monday, 11 September 2017 at 11:55:45 UTC, Igor wrote:
 In the meantime can anyone tell me how to add an attribute to a 
 function only if something is defined, since this doesn't work:

 version(USE_SIMD_WITH_LDC) {
   import ldc.attributes;
   @target("ssse3")
 } void funcThatUsesSIMD() {
   ...
   version(LDC) {
     import ldc.gccbuiltins_x86;
     c = __builtin_ia32_pshufb128(c, *simdMasks);
   } else {
     c = __simd(XMM.PSHUFB, c, *simdMasks);
   }
   ...
 }
Regarding the crash in debug mode, the problem was that my masks variable wasn't properly aligned. As for the attribute, I guess the best I can do is this:

version(LDC) import ldc.attributes;
else private struct target { string specifier; }

@target("ssse3")
void funcThatUsesSIMD() {...}
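
On the alignment point: the thread does not show how the masks were declared, so the following is only a sketch of one way to avoid the problem, with hypothetical helper names `isAligned16` and `loadMask`. It copies the mask bytes into a `ubyte16` value, which the compiler keeps 16-byte aligned, instead of casting a pointer to arbitrary bytes to `ubyte16*`.

```d
import core.simd : ubyte16;

// Hypothetical helper: true when `p` may safely be dereferenced as a
// 16-byte vector.
bool isAligned16(const(void)* p) pure nothrow @nogc
{
    return (cast(size_t) p & 15) == 0;
}

// Hypothetical helper: copy 16 mask bytes from arbitrary (possibly
// unaligned) memory into a ubyte16 value. The vector type carries its own
// 16-byte alignment, so the result is safe to use as a SIMD operand.
ubyte16 loadMask(const(ubyte)[] bytes) pure nothrow @nogc
{
    assert(bytes.length >= 16, "need at least 16 mask bytes");
    ubyte16 mask;
    foreach (i; 0 .. 16)
        mask.array[i] = bytes[i];
    return mask;
}
```

Checking `isAligned16(ptr)` in an assert before dereferencing a `ubyte16*` obtained by a cast is a cheap way to catch this kind of crash in debug builds.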
Sep 11