
digitalmars.D.ldc - ldc/dcompute and shared (programmer managed cache) access

reply Bruce Carneal <bcarneal gmail.com> writes:
ldc doesn't handle CUDA/Nvidia/PTX shared memory declarations, but 
shared memory can be very useful when tuning block (subgroup) 
cooperative algorithms.

Turns out that you can manually (or programmatically) inject a 
.shared declaration into the .ptx output file and, thereafter, 
obtain the shared memory pointer with a three-instruction 
sequence.

As a slightly cleaner alternative I'll next look at using 
ldc/dcompute as a .o and .ptx generator while punting the 
fatbin/linking stuff to clang or nvcc in a build script.  The 
current simplicity of the single-ptx-file output is very nice, 
but forgoing the shared memory performance boosts is not, so ... 
I'm pretty sure that we'll need to move beyond the single ptx 
file model if we want to embrace shared cleanly.

I left a comment on the open dcompute GitHub "issue" regarding 
shared but saw no response there.  If you have guidance to give 
on this topic, please speak up.

Finally, after locking down the shared workaround, I'll look at 
incorporating all the CUDA intrinsics that clang supports.  
Turns out that the very clean irEx hack Johan provided, which 
works for clz, apparently only works on a smallish subset of the 
intrinsics (that, or my ignorance is showing again).  More on 
that later.
Mar 03 2021
parent reply kinke <noone nowhere.com> writes:
On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
 As a slightly cleaner alternative I'll next look at using 
 ldc/dcompute as a .o and .ptx generator while punting the 
 fatbin/linking stuff to clang or nvcc in a build script.  The 
 current simplicity of the single-ptx-file output is very nice, 
 but forgoing the shared memory performance boosts is not, so ... 
 I'm pretty sure that we'll need to move beyond the single ptx 
 file model if we want to embrace shared cleanly.
Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Mar 04 2021
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
 On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
 As a slightly cleaner alternative I'll next look at using 
 ldc/dcompute as a .o and .ptx generator while punting the 
 fatbin/linking stuff to clang or nvcc in a build script.  The 
 current simplicity of the single-ptx-file output is very nice, 
 but forgoing the shared memory performance boosts is not, so ... 
 I'm pretty sure that we'll need to move beyond the single ptx 
 file model if we want to embrace shared cleanly.
Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Per the other thread that you've seen, and helped with, the 
intrinsics are mostly good to go.  Thanks.  Unfortunately CUDA 
__shared__ is not, AFAICT, addressed by the pull request that 
you cited.  We need some way to get a __shared__ declaration 
emitted to the PTX file.

The code generation through standalone (uninitialized) 
__shared__ pointers, which we *can* declare, already emits the 
correct PTX instructions (ld.shared, st.shared), but we have no 
way, at least none that I've found, to declare __shared__ 
address space storage that we can point to.  (clang cuda can do 
this, nvcc can obviously do this, and we can do this currently 
only if we manually modify the .ptx file)

The newly available shuffle intrinsics provide for efficient 
cooperative processing at the warp (32 lane) level, which helps 
a lot.  The __shared__ address space capability would allow for 
more efficient cooperation across up to 1024 lanes (the current 
Nvidia "block" upper limit, IIUC).

As I understand it, the L1 caches on Nvidia cards are now 
write-back, so with the very relaxed memory model in play on 
GPUs an explicit programmer-managed scratchpad (__shared__) is 
not as important as it used to be for many apps, but in some 
situations it's still important.  It would be great if we could 
find a way to provide the capability somewhere down the road.
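
To make the gap concrete, here's a rough sketch of the 
workaround in D (assuming ldc.dcompute's shared-address-space 
Pointer alias, written as SharedPointer below, and the dcompute 
package's index helpers; the kernel and names are illustrative, 
not a working recipe):

```d
// Sketch only: SharedPointer is assumed to be ldc.dcompute's alias for a
// shared-address-space Pointer; GlobalIndex comes from the dcompute package.
@compute(CompileFor.deviceOnly) module shared_sketch;
import ldc.dcompute;        // @kernel, GlobalPointer, Pointer aliases
import dcompute.std.index;  // GlobalIndex

@kernel void stageThroughShared(GlobalPointer!float input,
                                GlobalPointer!float output)
{
    auto i = GlobalIndex.x;

    // A standalone (uninitialized) shared-address-space pointer can be
    // declared, and accesses through it already lower to ld.shared /
    // st.shared in the generated PTX ...
    SharedPointer!float scratch = void;
    scratch[i % 256] = input[i];      // st.shared
    output[i] = scratch[i % 256];     // ld.shared

    // ... but there is no D-level way to declare the backing __shared__
    // storage itself, i.e. nothing that makes LDC emit a line like this
    // into the .ptx (today it has to be spliced in by hand):
    //
    //     .shared .align 4 .b8 scratch_mem[1024];
}
```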
Mar 04 2021
next sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
 On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
 [...]
Per the other thread that you've seen, and helped with, the intrinsics are mostly good to go. Thanks. [...]
(Just wanted to say it's nice to see someone showing interest in dcompute; I hope it gets some love.)
Mar 06 2021
prev sibling parent Johan Engelen <j j.nl> writes:
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
 The code generation through standalone (uninitialized) 
 __shared__ pointers, which we *can* declare, already emits the 
 correct PTX instructions (ld.shared, st.shared), but we have no 
 way, at least none that I've found, to declare __shared__ 
 address space storage that we can point to.  (clang cuda can do 
 this, nvcc can obviously do this, and we can do this currently 
 only if we manually modify the .ptx file)
I'm not very familiar with DCompute, but it sounds to me that 
all that's needed is a proposed syntax for what you want.  The 
compiler support is probably pretty easy to add.

-Johan
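
For illustration, one hypothetical shape such a proposed syntax 
could take (nothing below exists in dcompute or LDC today; the 
names are invented):

```d
// Purely hypothetical sketch of a possible syntax -- not implemented anywhere.
@compute(CompileFor.deviceOnly) module shared_syntax_sketch;
import ldc.dcompute;

@kernel void blockReduce(GlobalPointer!float input,
                         GlobalPointer!float blockSums)
{
    // Option A: a compiler-recognized attribute on a fixed-size local,
    // lowered to a `.shared` declaration in the emitted PTX:
    //
    //     @gpuShared float[256] scratch;
    //
    // Option B: a magic library template handing back a shared-address-space
    // pointer whose storage LDC materializes in the .shared space:
    //
    //     auto scratch = sharedStatic!(float, 256)();
    //
    // Either way, accesses to scratch would lower to ld.shared/st.shared and
    // the declaration itself to something like:
    //
    //     .shared .align 4 .b8 scratch[1024];
}
```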
Mar 06 2021