
digitalmars.D.ldc - ldc/dcompute and shared (programmer managed cache) access

reply Bruce Carneal <bcarneal gmail.com> writes:
ldc doesn't handle CUDA/Nvidia/PTX shared memory declarations, but 
shared memory can be very useful when tuning block (subgroup) 
cooperative algorithms.

Turns out that you can manually (or programmatically) inject a 
.shared declaration into the .ptx output file and, thereafter, 
obtain the shared memory pointer with a three-instruction 
sequence.

As a slightly cleaner alternative I'll next look at using 
ldc/dcompute as a .o and .ptx generator while punting the 
fatbin/linking stuff to clang or nvcc in a build script.  The 
current simplicity of the single-ptx-file output is very nice, 
but forgoing the shared memory performance boosts is not, so ... 
I'm pretty sure that we'll need to move beyond the single ptx 
file model if we want to embrace shared cleanly.

I left a comment on the open dcompute GitHub "issue" regarding 
shared but saw no response there.  If you have guidance to give 
on this topic, please speak up.

Finally, after locking down the shared workaround, I'll look at 
incorporating all the CUDA intrinsics that clang supports.  
Turns out that the very clean irEx hack Johan provided, which 
works for clz, apparently only works on a smallish subset of the 
intrinsics (that, or my ignorance is showing again).  More on 
that later.
Mar 03 2021
parent reply kinke <noone nowhere.com> writes:
On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
 As a slightly cleaner alternative I'll next look at using 
 ldc/dcompute as a .o and .ptx generator while punting the 
 fatbin/linking stuff to clang or nvcc in a build script.  The 
 current simplicity of the single-ptx-file output is very nice, 
 but forgoing the shared memory performance boosts is not, so ... 
 I'm pretty sure that we'll need to move beyond the single ptx 
 file model if we want to embrace shared cleanly.
Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Mar 04 2021
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
 On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
 As a slightly cleaner alternative I'll next look at using 
 ldc/dcompute as a .o and .ptx generator while punting the 
 fatbin/linking stuff to clang or nvcc in a build script.  The 
 current simplicity of the single-ptx-file output is very nice, 
 but forgoing the shared memory performance boosts is not, so ... 
 I'm pretty sure that we'll need to move beyond the single ptx 
 file model if we want to embrace shared cleanly.
Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Per the other thread that you've seen, and helped with, the 
intrinsics are mostly good to go.  Thanks.  Unfortunately CUDA 
__shared__ is not, AFAICT, addressed by the pull request that 
you cited.  We need some way to get a __shared__ declaration 
emitted to the PTX file.

The code generation through standalone (uninitialized) 
__shared__ pointers, which we *can* declare, already emits the 
correct PTX instructions (ld.shared, st.shared), but we have no 
way, at least none that I've found, to declare __shared__ 
address space storage that we can point to.  (clang cuda can do 
this, nvcc can obviously do this, and we can do this currently 
only if we manually modify the .ptx file)

The newly available shuffle intrinsics provide for efficient 
cooperative processing at the warp (32 lane) level, which helps 
a lot.  The __shared__ address space capability would allow for 
more efficient cooperation across up to 1024 lanes (the current 
Nvidia "block" upper limit, IIUC).

As I understand it, the L1 caches on Nvidia cards are now 
write-back, so with the very relaxed memory model in play on 
GPUs an explicit programmer-managed scratchpad (__shared__) is 
not as important as it used to be for many apps, but in some 
situations it's still important.  It would be great if we could 
find a way to provide the capability somewhere down the road.
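
To make the gap concrete, here's a rough sketch of the 
workaround in D (assuming ldc.dcompute's shared-address-space 
Pointer alias, written as SharedPointer below, and the dcompute 
package's index helpers; the kernel and names are illustrative, 
not a working recipe):

```d
// Sketch only: SharedPointer is assumed to be ldc.dcompute's alias for a
// shared-address-space Pointer; GlobalIndex comes from the dcompute package.
@compute(CompileFor.deviceOnly) module shared_sketch;
import ldc.dcompute;        // @kernel, GlobalPointer, Pointer aliases
import dcompute.std.index;  // GlobalIndex

@kernel void stageThroughShared(GlobalPointer!float input,
                                GlobalPointer!float output)
{
    auto i = GlobalIndex.x;

    // A standalone (uninitialized) shared-address-space pointer can be
    // declared, and accesses through it already lower to ld.shared /
    // st.shared in the generated PTX ...
    SharedPointer!float scratch = void;
    scratch[i % 256] = input[i];      // st.shared
    output[i] = scratch[i % 256];     // ld.shared

    // ... but there is no D-level way to declare the backing __shared__
    // storage itself, i.e. nothing that makes LDC emit a line like this
    // into the .ptx (today it has to be spliced in by hand):
    //
    //     .shared .align 4 .b8 scratch_mem[1024];
}
```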
Mar 04 2021
next sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
 On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
 [...]
Per the other thread that you've seen, and helped with, the intrinsics are mostly good to go. Thanks. [...]
(Just wanted to say it's nice to see someone showing interest in dcompute; I hope it gets some love.)
Mar 06 2021
prev sibling parent Johan Engelen <j j.nl> writes:
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
 The code generation through standalone (uninitialized) 
 __shared__ pointers, which we *can* declare, already emits the 
 correct PTX instructions (ld.shared, st.shared), but we have no 
 way, at least none that I've found, to declare __shared__ 
 address space storage that we can point to.  (clang cuda can do 
 this, nvcc can obviously do this, and we can do this currently 
 only if we manually modify the .ptx file)
I'm not very familiar with DCompute, but it sounds to me that 
all that's needed is a proposed syntax for what you want.  The 
compiler support is probably pretty easy to add.

-Johan
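
For illustration, one hypothetical shape such a proposed syntax 
could take (nothing below exists in dcompute or LDC today; the 
names are invented):

```d
// Purely hypothetical sketch of a possible syntax -- not implemented anywhere.
@compute(CompileFor.deviceOnly) module shared_syntax_sketch;
import ldc.dcompute;

@kernel void blockReduce(GlobalPointer!float input,
                         GlobalPointer!float blockSums)
{
    // Option A: a compiler-recognized attribute on a fixed-size local,
    // lowered to a `.shared` declaration in the emitted PTX:
    //
    //     @gpuShared float[256] scratch;
    //
    // Option B: a magic library template handing back a shared-address-space
    // pointer whose storage LDC materializes in the .shared space:
    //
    //     auto scratch = sharedStatic!(float, 256)();
    //
    // Either way, accesses to scratch would lower to ld.shared/st.shared and
    // the declaration itself to something like:
    //
    //     .shared .align 4 .b8 scratch[1024];
}
```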
Mar 06 2021