www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.ldc - ldc nvvm GPU intrinsics good news

reply Bruce Carneal <bcarneal gmail.com> writes:
After updating the first line to 
' compute(CompileFor.hostAndDevice) module ...' and adding an 
'import ldc.dcompute;' line, the 
runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
build apparently gives access to all manner of GPU intrinsics.

I've only tried it out on __syncthreads and __nvvm_shfl_down_i32 
but both "invocations" of those LDC_intrinsics resulted in the 
expected single ptx instruction in the dcompute .ptx output.

There are over 600 pragma(LDC_intrinsic, "llvm.nvvm.xxxxxx") 
builtins in the gcc_builtins_nvvm.di file so, of course, I've not 
hand tested them all but it looks very promising.

If you're working with dcompute on an OpenCL device, I'd love to 
hear if something similar works for your use cases of if you've 
found another way forward.
Mar 04
next sibling parent reply kinke <noone nowhere.com> writes:
On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.
Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord(). [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
Mar 04
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.
Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord(). [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
So what is the right/clean way to get compute access to these builtins? Should practitioners do a manual copy edit or is there a better way?
Mar 04
parent reply kinke <noone nowhere.com> writes:
On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
 So what is the right/clean way to get  compute access to these 
 builtins?  Should practitioners do a manual copy edit or is 
 there a better way?
AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.
Mar 04
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 00:48:26 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
 So what is the right/clean way to get  compute access to these 
 builtins?  Should practitioners do a manual copy edit or is 
 there a better way?
AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.
To be clear, I'm not asking for how a new builtin might be added cleanly, I'm asking if there is a better way to access all the builtins present in the current .di file. As noted above, I had to copy and edit the .di file to avoid this error from ldc: ...: Error: can only call functions from other compute modules in compute code Once I did that, everything I have tried from the edited .di file appears to work as intended. Amazing! I just thought that copying/editing could be avoided.
Mar 04
parent reply kinke <noone nowhere.com> writes:
On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:
 To be clear, I'm not asking for how a new builtin might be 
 added cleanly, I'm asking if there is a better way to access 
 all the builtins present in the current .di file.

 As noted above, I had to copy and edit the .di file to avoid 
 this error from ldc:
 ...: Error: can only call functions from other  compute modules 
 in  compute code

 Once I did that, everything I have tried from the edited .di 
 file appears to work as intended.  Amazing!  I just thought 
 that copying/editing could be avoided.
The dcompute parts in the compiler could definitely be improved (but it's nowhere on my priorities list). What I tried to get at by linking my comment [1] is that you can generally target NVPTX/AMDGCN directly, without any compute stuff, and in that case importing and using intrinsics from that .di works without any issues. [1] https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Mar 05
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 11:28:31 UTC, kinke wrote:
 On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:
[...]
 The dcompute parts in the compiler could definitely be improved 
 (but it's nowhere on my priorities list). What I tried to get 
 at by linking my comment [1] is that you can generally target 
 NVPTX/AMDGCN directly, without any  compute stuff, and in that 
 case importing and using intrinsics from that .di works without 
 any issues.

 [1] 
 https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Okay. I'll do the gccbuiltins_nvvm.di copy/edit for use in dcompute as part of my build script. A copy/edit at build time will give access to many of the nvvm intrinsics and Johan's irEx suggestion can be used for access to others. That leaves the intrinsics in the .td file that have llvm_i1_ty parameters and gcc builtin names (like vote/ballot). IIUC these would all be handled by the gen_gccbuiltins.cpp program if it knew how to deal with llvm_i1_ty parameters. Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms that I've tried for llvm_i1_ty gccbuiltin intrinsics have not worked, at least not for the vote/ballot test cases. I'll keep investigating. Clang and nvcc are both based on LLVM and deal with these somehow so I'll start looking there. I expect that you're very busy but if you already know or suspect what the llvm_i1_ty problem/fix is, a brief sketch here would be much appreciated. Regardless, thanks again for your efforts. A working LDC/dcompute is *much* nicer than C++/CUDA.
Mar 05
parent reply kinke <noone nowhere.com> writes:
On Friday, 5 March 2021 at 13:30:52 UTC, Bruce Carneal wrote:
 Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms 
 that I've tried for llvm_i1_ty gccbuiltin intrinsics have not 
 worked, at least not for the vote/ballot test cases.  I'll keep 
 investigating.
For __nvvm_vote_ballot (no idea what it does), this seems to suffice (bool for i1): pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int __nvvm_vote_ballot(bool); void foo(bool a) { __nvvm_vote_ballot(a); } https://run.dlang.io/is/z0k6l9
Mar 05
parent reply kinke <noone nowhere.com> writes:
On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:
 https://run.dlang.io/is/z0k6l9
Small correction (IR generation was fine, compilation wasn't): that intrinsic requires a newer target shader model (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the asm can be generated and inspected too.
Mar 05
parent Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 16:41:39 UTC, kinke wrote:
 On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:
 https://run.dlang.io/is/z0k6l9
Small correction (IR generation was fine, compilation wasn't): that intrinsic requires a newer target shader model (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the asm can be generated and inspected too.
Yep. The above works for me as well. cuda_620 (sm_62 IIUC) is the highest currently allowed via the --mdcompute-targets interface. I'll work on expanding that. I'm not sure if the "i1" type can just plug in to gen_gccbuiltins.cpp as previously discussed but I'll look at that as well. Thanks.
Mar 05
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 ... These files are auto-generated from the LLVM .td file using 
 a tiny program [1]. It's currently limited to intrinsics with a 
 GCCBuiltinName, so if you'd like to make more of these 
 available in D, see processRecord().

 [1] 
 https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
Good to know. __nvvm_vote_ballot and __nvvm_vote_ballot_sync both show up in the .td file with GCCBuiltin names but do not show up in the .di file. I've found a workaround for those but I'd like to understand what kept them from being included in the .di file. I'm guessing it is the dtype(...) call from within processRecord(...) that did not know what to do with an i1 type? If it's as simple as that I'd greatly appreciate your augmenting the if/else ladder in dtype() to include the branch for i1. The ballot/vote instruction may be better than my workaround and has other uses. I tried to work around the .di file issue by declaring my own pragma: pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int myBallot(bool); with no luck: LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.vote.ballot If you know how I can manually define the ballot pragmas correctly before any dtype(...) upgrade you might release, please let me know. Finally and again, many thanks for your efforts maintaining LDC.
Mar 04
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.
Hi Bruce, Why not submit a PR that modifies `gen_gccbuiltins.cpp` such that it adds the ` compute` attribute for the relevant intrinsics files? I think it's OK if `gen_gccbuiltins` contains some hacks like that . Please add a small compile test case, so we verify that it won't bitrot in the future. Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, because the intrinsics will not be available on normal CPUs anyway? I hope all your work will land in either LDC or dcompute's repositories, such that others can easily benefit from it. cheers, Johan
Apr 25
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.
Hi Bruce, Why not submit a PR that modifies `gen_gccbuiltins.cpp` such that it adds the ` compute` attribute for the relevant intrinsics files? I think it's OK if `gen_gccbuiltins` contains some hacks like that . Please add a small compile test case, so we verify that it won't bitrot in the future. Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, because the intrinsics will not be available on normal CPUs anyway? I hope all your work will land in either LDC or dcompute's repositories, such that others can easily benefit from it. cheers, Johan
Yes, I'll help when the current push is over here, but I think I dont understand enough quite yet. I'm still bumping in to limitations/awkwardness in dcompute that should admit simple solutions. At least it feels that way. One idea from my experience to date is that we can and probably should create a simpler (from a programmer perspective) and finer granularity way to handle multiple targets. Intrinsic selection is part of that as is library selection. Also on my mind is how we should handle deployment. For the ultimate in speed we can do AOT per-target specialized compiles and "fat" binaries but using SPIR-V + Vulkan compute could significantly improve penetration and reduce bloat. I read a relatively recent thread in an LLVM forum indicating that the Intel guys are pushing a "real" SPIR-V IR effort now so maybe we can help out there. Also, I dont know how MLIR should fit in to our plans. I'll be in touch when I get my head above water here. Thanks to you and the rest of the LDC crew for the help so far. Looking forward to advancing dlang on GPUs in the future. It really can be much much better than C++ in that arena. Bruce
Apr 26
parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Monday, 26 April 2021 at 13:20:11 UTC, Bruce Carneal wrote:
 On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
   [...]
Yes, I'll help when the current push is over here, but I think I dont understand enough quite yet. I'm still bumping in to limitations/awkwardness in dcompute that should admit simple solutions. At least it feels that way. [...]
Nice work, thanks for wanting to improve dcompute! I think D has real potential there
Apr 27