www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.ldc - ldc/dcompute nvptx intrinsics

reply Bruce Carneal <bcarneal gmail.com> writes:
LLVM appears to "know" about most if not all of the CUDA 
intrinsics as can be seen here: 
https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXIntrinsics.td

I've successfully mimicked dcompute's pragma(LDC_intrinsic 
"llvm.nvvm...") idiom for some things but have not succeeded in 
implementing "ballot" or other important CUDA intrinsics that are 
commented out in dcompute.std.warp.d.  Perhaps I'm just not 
uttering the correct incantations...

I'd love to bring up some additional functionality but I'm new to 
LDC/LLVM so it's slow going at this point of the learning curve.  
Does anyone have additional LDC/dcompute CUDA intrinsics working 
or have some pointers for bringing up more CUDA/warp intrinsics 
generally?

I'm considering trying to merge a hand written PTX file in at 
build time and falling back to C++/CUDA if all else fails.  
Anything that will help me avoid a trip to the dark side would be 
much appreciated.
Feb 19 2021
parent reply Johan Engelen <j j.nl> writes:
On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
 I'd love to bring up some additional functionality but I'm new 
 to LDC/LLVM so it's slow going at this point of the learning 
 curve.  Does anyone have additional LDC/dcompute CUDA 
 intrinsics working or have some pointers for bringing up more 
 CUDA/warp intrinsics generally?
Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 20 2021
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen 
wrote:
 On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal 
 wrote:
 I'd love to bring up some additional functionality but I'm new 
 to LDC/LLVM so it's slow going at this point of the learning 
 curve.  Does anyone have additional LDC/dcompute CUDA 
 intrinsics working or have some pointers for bringing up more 
 CUDA/warp intrinsics generally?
Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
This looks like a very good way to enable intrinsics. I'll give it a try and report back. If it's clean enough I'll also submit a dcompute PR. Thanks Johan!
Feb 20 2021
prev sibling next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen 
wrote:
 On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal 
 wrote:
 I'd love to bring up some additional functionality but I'm new 
 to LDC/LLVM so it's slow going at this point of the learning 
 curve.  Does anyone have additional LDC/dcompute CUDA 
 intrinsics working or have some pointers for bringing up more 
 CUDA/warp intrinsics generally?
Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic): "can only call functions from other compute modules in compute code" I'll keep digging.
Feb 20 2021
parent reply Johan <j j.nl> writes:
On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:
 On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen 
 wrote:
 On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal 
 wrote:
 I'd love to bring up some additional functionality but I'm 
 new to LDC/LLVM so it's slow going at this point of the 
 learning curve.  Does anyone have additional LDC/dcompute 
 CUDA intrinsics working or have some pointers for bringing up 
 more CUDA/warp intrinsics generally?
Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic): "can only call functions from other compute modules in compute code" I'll keep digging.
Hi Bruce, I played around a bit and have a full working example for you: ``` compute(CompileFor.deviceOnly) module dcompute; import ldc.dcompute; // Copied from ldc.llvmasm, because indeed it needs to be defined in compute code. pragma(LDC_inline_ir) R __irEx(string prefix, string code, string suffix, R, P...)(P) trusted nothrow nogc; alias someFunction = __irEx!("declare i32 llvm.nvvm.barrier0.popc(i32)", "%i = call i32 llvm.nvvm.barrier0.popc(i32 %0)\nret i32 %i", "", int, int); void foo(GlobalPointer!float x_in) { auto i = someFunction(2); } ``` Compile with `bin/ldc2 -c -mdcompute-targets=cuda-350 -mdcompute-file-prefix=dcompute_kernel dcompute.d -O3` gives PTX output: ``` ❯ cat dcompute_kernel_cuda350_64.ptx // // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_35 .address_size 64 // .globl _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv .visible .func _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv( .param .b64 _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv_param_0 ) { .reg .b32 %r<3>; mov.u32 %r1, 2; { .reg .pred %p1; setp.ne.u32 %p1, %r1, 0; bar.red.popc.u32 %r2, 0, %p1; } ret; } ``` If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github! cheers, Johan
Feb 23 2021
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:
 On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal 
 wrote:
 On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen 
 wrote:
 On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal 
 wrote:
 I'd love to bring up some additional functionality but I'm 
 new to LDC/LLVM so it's slow going at this point of the 
 learning curve.  Does anyone have additional LDC/dcompute 
 CUDA intrinsics working or have some pointers for bringing 
 up more CUDA/warp intrinsics generally?
 Hi Bruce,
   I played around a bit and have a full working example for you:

 ```
  compute(CompileFor.deviceOnly) module dcompute;

 import ldc.dcompute;

 [... working example ...]

 If this indeed will fit your usecase, you have a good argument 
 for including `__irEx` into ldc.dcompute. Please file 
 bugs/features on github!

 cheers,
   Johan
Success! After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file. That generated the hoped for single instruction function body: clz.b32 %r2, %r1 This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice. I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...). Assuming that goes well I'll file with ldc and dcompute. Thank you Johan.
Feb 23 2021
parent Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 23 February 2021 at 23:36:53 UTC, Bruce Carneal wrote:
 On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:
 On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal 
 wrote:
[...]
 Hi Bruce,
   I played around a bit and have a full working example for 
 you:

 ```
  compute(CompileFor.deviceOnly) module dcompute;

 import ldc.dcompute;

 [... working example ...]

 If this indeed will fit your usecase, you have a good argument 
 for including `__irEx` into ldc.dcompute. Please file 
 bugs/features on github!

 cheers,
   Johan
Success! After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file. That generated the hoped for single instruction function body: clz.b32 %r2, %r1 This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice. I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...). Assuming that goes well I'll file with ldc and dcompute. Thank you Johan.
Turns out that almost all intrinsics are available after a tiny bit of .di file massaging (see the "ldc nvvm GPU intrinsics good news" thread). The irEx capability Johan pointed out seems to cover the rest. The irEx accessible set is defined at the top of the intrinsics .td file and includes the popc intrinsic, the clz initrinsic, and about a dozen others, most very useful. The .di file instrinsics number over 500, some of which appear useful.
Mar 04 2021
prev sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen 
wrote:
 On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal 
 wrote:
 I'd love to bring up some additional functionality but I'm new 
 to LDC/LLVM so it's slow going at this point of the learning 
 curve.  Does anyone have additional LDC/dcompute CUDA 
 intrinsics working or have some pointers for bringing up more 
 CUDA/warp intrinsics generally?
Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Wat, didn't even know about that
Feb 23 2021