digitalmars.D.ldc - ldc nvvm GPU intrinsics good news

Bruce Carneal (14/14) Mar 04 2021 After updating the first line to

kinke (9/14) Mar 04 2021 Sorry, I assumed you had already seen it (plus

Bruce Carneal (4/18) Mar 04 2021 So what is the right/clean way to get @compute access to these

kinke (7/10) Mar 04 2021 AFAICT, the only challenge is to find a suited signature in D for

Bruce Carneal (11/21) Mar 04 2021 To be clear, I'm not asking for how a new builtin might be added

kinke (9/19) Mar 05 2021 The dcompute parts in the compiler could definitely be improved

Bruce Carneal (20/29) Mar 05 2021 Okay. I'll do the gccbuiltins_nvvm.di copy/edit for use in

kinke (7/11) Mar 05 2021 For __nvvm_vote_ballot (no idea what it does), this seems to

kinke (5/6) Mar 05 2021 Small correction (IR generation was fine, compilation wasn't):

Bruce Carneal (8/14) Mar 05 2021 Yep. The above works for me as well. cuda_620 (sm_62 IIUC) is

Bruce Carneal (21/28) Mar 04 2021 Good to know. __nvvm_vote_ballot and __nvvm_vote_ballot_sync

Johan Engelen (15/20) Apr 25 2021 Hi Bruce,

Bruce Carneal (22/42) Apr 26 2021 Yes, I'll help when the current push is over here, but I think I

Imperatorn (3/10) Apr 27 2021 Nice work, thanks for wanting to improve dcompute! I think D has

Bruce Carneal <bcarneal gmail.com> writes:

After updating the first line to 
' compute(CompileFor.hostAndDevice) module ...' and adding an 
'import ldc.dcompute;' line, the 
runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
build apparently gives access to all manner of GPU intrinsics.

I've only tried it out on __syncthreads and __nvvm_shfl_down_i32 
but both "invocations" of those LDC_intrinsics resulted in the 
expected single ptx instruction in the dcompute .ptx output.

There are over 600 pragma(LDC_intrinsic, "llvm.nvvm.xxxxxx") 
builtins in the gcc_builtins_nvvm.di file so, of course, I've not 
hand tested them all but it looks very promising.

If you're working with dcompute on an OpenCL device, I'd love to 
hear if something similar works for your use cases of if you've 
found another way forward.

Mar 04 2021

kinke <noone nowhere.com> writes:

On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.

Sorry, I assumed you had already seen it (plus 
ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC 
v1.22. These files are auto-generated from the LLVM .td file 
using a tiny program [1]. It's currently limited to intrinsics 
with a GCCBuiltinName, so if you'd like to make more of these 
available in D, see processRecord().

[1] 
https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp

Mar 04 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.

 Sorry, I assumed you had already seen it (plus 
 ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC 
 v1.22. These files are auto-generated from the LLVM .td file 
 using a tiny program [1]. It's currently limited to intrinsics 
 with a GCCBuiltinName, so if you'd like to make more of these 
 available in D, see processRecord().

 [1] 
 https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp

So what is the right/clean way to get  compute access to these 
builtins?  Should practitioners do a manual copy edit or is there 
a better way?

Mar 04 2021

kinke <noone nowhere.com> writes:

On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
 So what is the right/clean way to get  compute access to these 
 builtins?  Should practitioners do a manual copy edit or is 
 there a better way?

AFAICT, the only challenge is to find a suited signature in D for 
the LLVM intrinsic, incl. some unique name if there's no gcc 
builtin name - as a GitHub PR for that tool, so that all future 
.di files contain these extra intrinsics. To get there, I'd 
suggest adding some temp logging to inspect the records from the 
parsed .td file.

Mar 04 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Friday, 5 March 2021 at 00:48:26 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
 So what is the right/clean way to get  compute access to these 
 builtins?  Should practitioners do a manual copy edit or is 
 there a better way?

 AFAICT, the only challenge is to find a suited signature in D 
 for the LLVM intrinsic, incl. some unique name if there's no 
 gcc builtin name - as a GitHub PR for that tool, so that all 
 future .di files contain these extra intrinsics. To get there, 
 I'd suggest adding some temp logging to inspect the records 
 from the parsed .td file.

To be clear, I'm not asking for how a new builtin might be added 
cleanly, I'm asking if there is a better way to access all the 
builtins present in the current .di file.

As noted above, I had to copy and edit the .di file to avoid this 
error from ldc:
...: Error: can only call functions from other  compute modules 
in  compute code

Once I did that, everything I have tried from the edited .di file 
appears to work as intended.  Amazing!  I just thought that 
copying/editing could be avoided.

Mar 04 2021

kinke <noone nowhere.com> writes:

On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:
 To be clear, I'm not asking for how a new builtin might be 
 added cleanly, I'm asking if there is a better way to access 
 all the builtins present in the current .di file.

 As noted above, I had to copy and edit the .di file to avoid 
 this error from ldc:
 ...: Error: can only call functions from other  compute modules 
 in  compute code

 Once I did that, everything I have tried from the edited .di 
 file appears to work as intended.  Amazing!  I just thought 
 that copying/editing could be avoided.

The dcompute parts in the compiler could definitely be improved 
(but it's nowhere on my priorities list). What I tried to get at 
by linking my comment [1] is that you can generally target 
NVPTX/AMDGCN directly, without any  compute stuff, and in that 
case importing and using intrinsics from that .di works without 
any issues.

[1] 
https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886

Mar 05 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Friday, 5 March 2021 at 11:28:31 UTC, kinke wrote:
 On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:

[...]
 The dcompute parts in the compiler could definitely be improved 
 (but it's nowhere on my priorities list). What I tried to get 
 at by linking my comment [1] is that you can generally target 
 NVPTX/AMDGCN directly, without any  compute stuff, and in that 
 case importing and using intrinsics from that .di works without 
 any issues.

 [1] 
 https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886

Okay.  I'll do the gccbuiltins_nvvm.di copy/edit for use in 
dcompute as part of my build script.

A copy/edit at build time will give access to many of the nvvm 
intrinsics and Johan's irEx suggestion can be used for access to 
others.

That leaves the intrinsics in the .td file that have llvm_i1_ty 
parameters and gcc builtin names (like vote/ballot).  IIUC these 
would all be handled by the gen_gccbuiltins.cpp program if it 
knew how to deal with llvm_i1_ty parameters.

Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms 
that I've tried for llvm_i1_ty gccbuiltin intrinsics have not 
worked, at least not for the vote/ballot test cases.  I'll keep 
investigating.  Clang and nvcc are both based on LLVM and deal 
with these somehow so I'll start looking there.

I expect that you're very busy but if you already know or suspect 
what the llvm_i1_ty problem/fix is, a brief sketch here would be 
much appreciated.  Regardless, thanks again for your efforts.  A 
working LDC/dcompute is *much* nicer than C++/CUDA.

Mar 05 2021

kinke <noone nowhere.com> writes:

On Friday, 5 March 2021 at 13:30:52 UTC, Bruce Carneal wrote:
 Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms 
 that I've tried for llvm_i1_ty gccbuiltin intrinsics have not 
 worked, at least not for the vote/ballot test cases.  I'll keep 
 investigating.

For __nvvm_vote_ballot (no idea what it does), this seems to 
suffice (bool for i1):

pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot")
int __nvvm_vote_ballot(bool);

void foo(bool a) { __nvvm_vote_ballot(a); }

https://run.dlang.io/is/z0k6l9

Mar 05 2021

kinke <noone nowhere.com> writes:

On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:
 https://run.dlang.io/is/z0k6l9

Small correction (IR generation was fine, compilation wasn't): 
that intrinsic requires a newer target shader model (-mcpu=sm_70) 
[and -betterC to avoid ModuleInfo etc.]. Then the asm can be 
generated and inspected too.

Mar 05 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Friday, 5 March 2021 at 16:41:39 UTC, kinke wrote:
 On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:
 https://run.dlang.io/is/z0k6l9

 Small correction (IR generation was fine, compilation wasn't): 
 that intrinsic requires a newer target shader model 
 (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the 
 asm can be generated and inspected too.

Yep.  The above works for me as well.  cuda_620 (sm_62 IIUC) is 
the highest currently allowed via the --mdcompute-targets 
interface.  I'll work on expanding that.

I'm not sure if the "i1" type can just plug in to 
gen_gccbuiltins.cpp as previously discussed but I'll look at that 
as well.

Thanks.

Mar 05 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 ... These files are auto-generated from the LLVM .td file using 
 a tiny program [1]. It's currently limited to intrinsics with a 
 GCCBuiltinName, so if you'd like to make more of these 
 available in D, see processRecord().

 [1] 
 https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp

Good to know.  __nvvm_vote_ballot and __nvvm_vote_ballot_sync 
both show up in the .td file with GCCBuiltin names but do not 
show up in the .di file.

I've found a workaround for those but I'd like to understand what 
kept them from being included in the .di file.  I'm guessing it 
is the dtype(...) call from within processRecord(...) that did 
not know what to do with an i1 type?

If it's as simple as that I'd greatly appreciate your augmenting 
the if/else ladder in dtype() to include the branch for i1.  The 
ballot/vote instruction may be better than my workaround and has 
other uses.

I tried to work around the .di file issue by declaring my own 
pragma:
pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int myBallot(bool);
with no luck:
  LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.vote.ballot

If you know how I can manually define the ballot pragmas 
correctly before any dtype(...) upgrade you might release, please 
let me know.

Finally and again, many thanks for your efforts maintaining LDC.

Mar 04 2021

Johan Engelen <j j.nl> writes:

On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.

Hi Bruce,
   Why not submit a PR that modifies `gen_gccbuiltins.cpp` such 
that it adds the ` compute` attribute for the relevant intrinsics 
files?
I think it's OK if `gen_gccbuiltins` contains some hacks like 
that . Please add a small compile test case, so we verify that it 
won't bitrot in the future.

Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, 
because the intrinsics will not be available on normal CPUs 
anyway?

I hope all your work will land in either LDC or dcompute's 
repositories, such that others can easily benefit from it.

cheers,
   Johan

Apr 25 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
 On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
 After updating the first line to 
 ' compute(CompileFor.hostAndDevice) module ...' and adding an 
 'import ldc.dcompute;' line, the 
 runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC 
 build apparently gives access to all manner of GPU intrinsics.

 Hi Bruce,
   Why not submit a PR that modifies `gen_gccbuiltins.cpp` such 
 that it adds the ` compute` attribute for the relevant 
 intrinsics files?
 I think it's OK if `gen_gccbuiltins` contains some hacks like 
 that . Please add a small compile test case, so we verify that 
 it won't bitrot in the future.

 Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, 
 because the intrinsics will not be available on normal CPUs 
 anyway?

 I hope all your work will land in either LDC or dcompute's 
 repositories, such that others can easily benefit from it.

 cheers,
   Johan

Yes, I'll help when the current push is over here, but I think I 
dont understand enough quite yet.  I'm still bumping in to 
limitations/awkwardness in dcompute that should admit simple 
solutions. At least it feels that way.

One idea from my experience to date is that we can and probably 
should create a simpler (from a programmer perspective) and finer 
granularity way to handle multiple targets.  Intrinsic selection 
is part of that as is library selection.

Also on my mind is how we should handle deployment.  For the 
ultimate in speed we can do AOT per-target specialized compiles 
and "fat" binaries but using SPIR-V + Vulkan compute could 
significantly improve penetration and reduce bloat.

I read a relatively recent thread in an LLVM forum indicating 
that the Intel guys are pushing a "real" SPIR-V IR effort now so 
maybe we can help out there.

Also, I dont know how MLIR should fit in to our plans.

I'll be in touch when I get my head above water here.  Thanks to 
you and the rest of the LDC crew for the help so far.  Looking 
forward to advancing dlang on GPUs in the future.  It really can 
be much much better than C++ in that arena.

Bruce

Apr 26 2021

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Monday, 26 April 2021 at 13:20:11 UTC, Bruce Carneal wrote:
 On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
   [...]

 Yes, I'll help when the current push is over here, but I think 
 I dont understand enough quite yet.  I'm still bumping in to 
 limitations/awkwardness in dcompute that should admit simple 
 solutions. At least it feels that way.

 [...]

Nice work, thanks for wanting to improve dcompute! I think D has 
real potential there

Apr 27 2021

D Programming

C/C++ Programming

Other

digitalmars.D.ldc - ldc nvvm GPU intrinsics good news