
digitalmars.D.ldc - ldc/dcompute atomics for nvptx?

Bruce Carneal <bcarneal gmail.com> writes:
I'd like to use atomic (rmw) operations from within ldc while 
targeting nvptx (via dcompute).

The first place to check is dcompute.std.atomic.  That's a nice 
placeholder, but only a placeholder, so I started poking around 
in ldc and clang.  After a modest amount of poking I'm still not 
sure how to proceed.

If you know of a simple way to bring atomics online for 
dcompute/nvptx, I'd like to hear from you.  Alternatively, if you 
know why nvptx atomics will be hard to bring online, I'd also 
like to hear from you.

On a positive note, I've had some success using dcompute/D's 
metaprogramming facilities to rework areal/stencil compute 
kernels so that they operate out of "arrays of registers".  You 
meta-unroll until you wrap around the stencil, avoiding moves, and 
you can use intra-warp shuffles to/from lateral neighbors to 
minimize load on the memory subsystem when rolling on to the next 
row.
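
A minimal sketch of one reading of the "meta-unroll until you wrap 
around" idea, written as plain host D rather than a dcompute kernel. 
The name rollingSum, the 1-D window, and the requirement that the 
input length be a multiple of N are all assumptions made for 
illustration; warp shuffles are not shown. Because every index into 
win is a compile-time constant, the window can live in registers, and 
the newest sample overwrites the oldest slot instead of shifting the 
others.

```d
// Hypothetical example: running sum over the last N inputs.
// Assumes input.length is a multiple of N; any tail is ignored.
void rollingSum(size_t N)(const(float)[] input, float[] output)
{
    assert(output.length >= input.length);

    // Fixed-size window; all indexing below uses compile-time constants.
    float[N] win = 0.0f;

    for (size_t base = 0; base + N <= input.length; base += N)
    {
        // Unrolled N times, so after N steps the slot index has
        // wrapped back to 0 and the loop body can simply repeat.
        static foreach (k; 0 .. N)
        {{
            win[k] = input[base + k]; // overwrite oldest slot, no moves

            float s = 0.0f;
            static foreach (j; 0 .. N)
                s += win[j];
            output[base + k] = s;
        }}
    }
}
```

The double braces give each unrolled step its own scope, since 
static foreach does not introduce one.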

Another D advantage over CUDA/C++ that can be exploited is nested 
functions.  You can declare variables at the outer function level 
where they'll pretty much all be mapped to registers (you've got 
at least 64 per SIMT "lane" to work with, and it's easy to check 
for spills). You can then access those enregistered variables 
directly from within the nested functions.  Sometimes it's nice 
not having to pass everything through an argument list.
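
A small sketch of the nested-function point, again in plain host D. 
The function name blur3 and its weights are made up for illustration, 
and the dcompute attributes a real kernel would need are left out. 
The nested function updates the outer local directly, so the shared 
state never goes through an argument list.

```d
// Hypothetical 3-tap weighted sum; names and weights are illustrative.
float blur3(float left, float mid, float right)
{
    // Local of the outer function; on a GPU target this is the kind
    // of variable one would expect to end up in a register.
    float acc = 0.0f;

    // Nested function: reads and writes acc in the enclosing scope.
    void accumulate(float v, float w)
    {
        acc += v * w;
    }

    accumulate(left, 0.25f);
    accumulate(mid, 0.5f);
    accumulate(right, 0.25f);
    return acc;
}
```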

Thanks again to the ldc/dcompute team for providing the tooling 
that makes the above possible.  And thanks in advance for any 
guidance on getting atomics up for nvptx.
Apr 05 2021
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 6 April 2021 at 02:23:59 UTC, Bruce Carneal wrote:
 I'd like to use atomic (rmw) operations from within ldc while 
 targeting nvptx (via dcompute).

 [...]
Sorry for the late reply. These should all be doable with 
pragma(LDC_intrinsic, "llvm.nvvm.atomic.*") where * is any of 
"add", "load.add", etc. I'll try to get a full list. There is no 
real difference between this and, say, std.cuda.index. I never 
implemented them because I didn't need them and my card didn't 
support them.
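
An untested sketch of what such declarations might look like. The 
first binds a PTX special-register read, which is the kind of 
declaration an index module is built from. The second applies the 
same pattern to an atomic. The D names threadIdxX and atomicIncWrap 
are hypothetical, and the exact intrinsic string for the atomic 
(including its pointer-type suffix) is an assumption that depends on 
the LLVM version LDC was built against, so check it against that 
LLVM's NVVM intrinsic list before relying on it. Address-space 
qualified pointers are also not handled here.

```d
// Bind an LLVM NVPTX intrinsic to a D function declaration.
// Reads the x component of the thread index.
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.x")
    uint threadIdxX();

// ASSUMPTION: intrinsic name and suffix may differ per LLVM version.
// Intended semantics (PTX atom.inc): returns the old value at *addr
// and stores (old >= limit) ? 0 : old + 1.
pragma(LDC_intrinsic, "llvm.nvvm.atomic.load.inc.32.p0i32")
    uint atomicIncWrap(uint* addr, uint limit);
```

These declarations only make sense when compiling for an nvptx 
target; they are not expected to link or run on the host.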
Apr 24 2021