www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - GPGPU progess

reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
I just caught up on the dconf material. It was a really interesting year,
and I'm super bummed I missed it!

So I'm just starting out on some GPGPU work, and I love the direction we're
going here.

Given this scenario;
I want to write CUDA kernels in D, and execute them in an existing C++ CUDA
app.
Can anyone point me at the getting started material, how do I achieve this
seemingly simple goal?
Has anyone started on this content? Where are the toolchains, how to I
build a CUDA kernel, how do I link/load it in my C++ app?

If this doesn't exist, there needs to be a wiki with getting started guide
covering various usage environments. I don't have a lot of time to spare on
detail, so success will depend on my ability to hit the ground running, and
that's really just a matter of simple clear reference material.

How far are we from integration into LDC without using forked compilers?
May 17
next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
 I just caught up on the dconf material. It was a really 
 interesting year, and I'm super bummed I missed it!

 So I'm just starting out on some GPGPU work, and I love the 
 direction we're going here.

 Given this scenario;
 I want to write CUDA kernels in D, and execute them in an 
 existing C++ CUDA
 app.
Can do already although not all the intrinsics are there yet, but they are very simple to add. See https://github.com/libmir/dcompute/blob/master/source/dcompute/std/cuda/index.d for how easy it is (PRs welcome!)
 Can anyone point me at the getting started material, how do I 
 achieve this
 seemingly simple goal?
If you're only interested in CUDA I think any LLVM 3.9.1 or 4.0 should work, If you want SPIRV get my LLVM from https://github.com/thewilsonator/llvm/tree/compute Get the dcompute branch of LDC ( https://github.com/ldc-developers/ldc/tree/dcompute) and its submodules (i.e. druntime, I don't think it points to a branch anymore because Johan deleted that when I merged that into the regular ldc druntime but it will still work.) The dcompute branch of LDC is in the process of being merged into master so expect some breakage when updating to that.
 Has anyone started on this content? Where are the toolchains, 
 how to I
 build a CUDA kernel, how do I link/load it in my C++ app?
There is some info on the dcompute wiki (https://github.com/libmir/dcompute/wiki) and I intend make this into "production quality" documentation (again PRs welcome or I can give you write access to the wiki), but I'm very busy at the moment. When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx is the CUDA compute capability specified on the command line and yy is 32 or 64 for 32 or 64bit) which should fit somewhere into your existing C++ pipeline.
 If this doesn't exist, there needs to be a wiki with getting 
 started guide covering various usage environments. I don't have 
 a lot of time to spare on detail, so success will depend on my 
 ability to hit the ground running, and that's really just a 
 matter of simple clear reference material.

 How far are we from integration into LDC without using forked 
 compilers?
Soon™, it the first thing I'm going to do (if I don't get it done before) after I hand in my honours thesis at the start of July. I'm stupidly busy this semester so don't count on it done beforehand. If you have any questions about anything probably the fastest way is the libmir or ldc gitter channels. (https://gitter.im/ldc-developers/main , https://gitter.im/libmir/public). Good luck! I look forward to your experience report. Nic
May 18
parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 09:07:38 UTC, Nicholas Wilson wrote:
 When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx 
 is the CUDA compute capability specified on the command line 
 and yy is 32 or 64 for 32 or 64bit) which should fit somewhere 
 into your existing C++ pipeline.
Whoops, that assumes you have a CUDA driver API pipeline in your C++ code, which if you're asking I'm not sure that you have. If you're using the `kernel<<<...>>>(args)` form to launch you kernels then you are going to have a lot more work to do in D because you'll need to use the driver API (http://docs.nvidia.com/cuda/cuda-driver-api/#axzz4hQLA0Zdm) You'll need to: *get a device *create a context from it *get a stream on that context *load the ptx module (possibly linking it with other modules, to resolve missing symbols). *compile it for the device *then launch a kernel from that module on that device, by name passing the arguments in a void*[]. The sad thing is that its still nice than OpenCL because in OpenCL you have to pass the runtime args (with sizes) one by one to a function. Hence why I want to automate as much of that shit as is possible. I hope to have that done ASAP, but I don't have hardware set up to test CUDA at the moment (I have one on my windows box but I don't have dev set up there) and I'll be working on OpenCL at the same time (and theres only so much horrible API I can take in a day). I'll be working on dcompute part-part-time next semester though so I should be able to get a fair bit done and quite a few others are interested so that'll speed thing up a bit.
May 18
prev sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
 How far are we from integration into LDC without using forked 
 compilers?
The future is now! https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487 is the final commit from the stuff I've done prior to dconf.
May 30
parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:

 How far are we from integration into LDC without using forked compilers?
The future is now! https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1 4c101e2c029c4757c4073f487 is the final commit from the stuff I've done prior to dconf.
Awesome stuff! That was fast :) You're right, I'm using kernel<<<...>>>, and it's very convenient. I looked briefly and realised that I had a lot of work to get running (as you describe), so I stuck with my current setup for the moment :( Is a <<<...>>> equivalent going to be possible in D, with kernel object fragments built into the binary together with the CPU code? I'm definitely looking forward to action in this space, and the wiki to come online :)
May 30
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 30 May 2017 at 08:14:16 UTC, Manu wrote:
 On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d < 
 digitalmars-d puremagic.com> wrote:

 On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:

 How far are we from integration into LDC without using forked 
 compilers?
The future is now! https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1 4c101e2c029c4757c4073f487 is the final commit from the stuff I've done prior to dconf.
Awesome stuff! That was fast :) You're right, I'm using kernel<<<...>>>, and it's very convenient.
Yep, thats (one of the reasons) why CUDA is more successful than OpenCL and therefore one of the more powerful draws for those poor sods using OpenCL.
 I looked briefly and realised that I had a lot of work to get 
 running (as
 you describe), so I stuck with my current setup for the moment 
 :(
I figured, I'll get you using it eventually.
 Is a <<<...>>> equivalent going to be possible in D, with 
 kernel object fragments built into the binary together with the 
 CPU code?
As I explained in my dconf presentation: the idea is to have Queue q = ... ; // the equivalent of a CUDA stream q.enqueue!kernel(sizes)(kernel_arguments); where q.enqueue returns a callable that you then call with the arguments. It was modelled directly after CUDAs <<<...>>> as for embedding in the binary a post build step that does ubyte[] ptx_code = import("kernels_cuda620_64.ptx"); should be doable as should invoking ptxas and doing the same. Then proving a consistent naming convention is used the code can do its magic. Or the files could just be read from disk.
 I'm definitely looking forward to action in this space, and the 
 wiki to come online :)
Yeah once my thesis is done thing should start moving. Any input with your expertise with CUDA will be much appriciated.
May 30
parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 30 May 2017 at 19:54, Nicholas Wilson via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 as for embedding in the binary a post build step that does

 ubyte[] ptx_code = import("kernels_cuda620_64.ptx");

 should be doable as should invoking ptxas and doing the same.
 Then proving a consistent naming convention is used the code can do its
 magic.
 Or the files could just be read from disk.
Is it possible to convince the compiler to emit code built for the backend target directly into the same object file as the host code? I feel like this should be possible, along the lines of __attribute__((target(...)) to convince the compiler to generate code for a few functions with different targets than the module? Using solutions as you suggest above introduces dependent build sequencing into the build script. Different build systems might prove to be more or less difficult to integrate cleanly, and many people use build-script generators which might need to learn a few new tricks. Any input with your expertise with CUDA will be much appriciated.

'Expertise' is possibly not the word I'd suggest ;)
But I'll have some established software by that time that I'd love to
attempt to port, we can work through rough edges together when you're
available. No rush.
Jun 03
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 3 June 2017 at 11:58:14 UTC, Manu wrote:
 Is it possible to convince the compiler to emit code built for 
 the backend target directly into the same object file as the 
 host code?
 I feel like this should be possible, along the lines of
 __attribute__((target(...)) to convince the compiler to 
 generate code for a few functions with different targets than 
 the module?
Alas no. __attribute__((target(...)) works because it targeting different targets of the _same_ backend, this targets different backends.
 Using solutions as you suggest above introduces dependent build 
 sequencing into the build script. Different build systems might 
 prove to be more or less difficult to integrate cleanly, and 
 many people use build-script generators which might need to 
 learn a few new tricks.
Hence why I made reading from disk (with a predictable output name) a viable option. Any build script trickery would simply just feed in slightly later down the pipeline of. read file from disk -> in memory -> compute API -> binary -> execute Also thats how using other kernels not written in D would work, just plug somewhere into the pipeline.
 'Expertise' is possibly not the word I'd suggest ;)
 But I'll have some established software by that time that I'd 
 love to attempt to port, we can work through rough edges 
 together when you're available. No rush.
Great. Heh, still more experience than I have. Looking forward to it, earliest would probably be August.
Jun 03
parent reply David Nadlinger <code klickverbot.at> writes:
On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
 Alas no. __attribute__((target(...)) works because it targeting 
 different targets of the _same_ backend, this targets different 
 backends.
But surely you could just take the buffer with the object code emitted by the compute target backend and emit it as constant global data into the host llvm::Module? — David
Jun 03
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 3 June 2017 at 14:12:28 UTC, David Nadlinger wrote:
 On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
 Alas no. __attribute__((target(...)) works because it 
 targeting different targets of the _same_ backend, this 
 targets different backends.
But surely you could just take the buffer with the object code emitted by the compute target backend and emit it as constant global data into the host llvm::Module? — David
Good idea, name of global needs to be known in advance though.
Jun 03
parent David Nadlinger <code klickverbot.at> writes:
On Saturday, 3 June 2017 at 23:13:09 UTC, Nicholas Wilson wrote:
 Good idea, name of global needs to be known in advance though.
Why, and how would that be a non-trivial problem? — David
Jun 03