
digitalmars.D - GPGPU progress

Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
I just caught up on the dconf material. It was a really interesting year,
and I'm super bummed I missed it!

So I'm just starting out on some GPGPU work, and I love the direction we're
going here.

Given this scenario:
I want to write CUDA kernels in D, and execute them in an existing C++ CUDA
app.
Can anyone point me at the getting-started material? How do I achieve this
seemingly simple goal?
Has anyone started on this content? Where are the toolchains, how do I
build a CUDA kernel, and how do I link/load it in my C++ app?

If this doesn't exist, there needs to be a wiki with a getting-started
guide covering the various usage environments. I don't have a lot of time
to spare on detail, so success will depend on my ability to hit the ground
running, and that's really just a matter of simple, clear reference material.

How far are we from integration into LDC without using forked compilers?
May 17 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
 I just caught up on the dconf material. It was a really 
 interesting year, and I'm super bummed I missed it!

 So I'm just starting out on some GPGPU work, and I love the 
 direction we're going here.

 Given this scenario:
 I want to write CUDA kernels in D, and execute them in an 
 existing C++ CUDA
 app.
You can do this already, although not all of the intrinsics are there yet;
they are very simple to add, though. See
https://github.com/libmir/dcompute/blob/master/source/dcompute/std/cuda/index.d
for how easy it is (PRs welcome!).
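To give a flavour, adding one of the missing special-register intrinsics is
roughly a two-liner using LDC's intrinsic pragma. This is a sketch only:
the wrapper name tidX is made up, and the file linked above shows the real
pattern dcompute uses.

// Lives in a device-code (@compute) module.
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.x")
uint nvvm_read_ptx_sreg_tid_x();

// Hypothetical convenience wrapper in the style of the module linked above.
@property uint tidX() { return nvvm_read_ptx_sreg_tid_x(); }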
 Can anyone point me at the getting started material, how do I 
 achieve this
 seemingly simple goal?
If you're only interested in CUDA, I think any LLVM 3.9.1 or 4.0 should
work. If you want SPIR-V, get my LLVM from
https://github.com/thewilsonator/llvm/tree/compute

Get the dcompute branch of LDC
(https://github.com/ldc-developers/ldc/tree/dcompute) and its submodules
(i.e. druntime; I don't think it points to a branch any more because Johan
deleted it when I merged that into the regular ldc druntime, but it will
still work). The dcompute branch of LDC is in the process of being merged
into master, so expect some breakage when updating to that.
 Has anyone started on this content? Where are the toolchains, 
 how do I
 build a CUDA kernel, how do I link/load it in my C++ app?
There is some info on the dcompute wiki
(https://github.com/libmir/dcompute/wiki), and I intend to make this into
"production quality" documentation (again, PRs welcome, or I can give you
write access to the wiki), but I'm very busy at the moment.

When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx is the CUDA
compute capability specified on the command line and yy is 32 or 64 for
32- or 64-bit), which should fit somewhere into your existing C++ pipeline.
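For reference, a device-side module that produces such a .ptx looks roughly
like this. A sketch only: the kernel itself and the exact
-mdcompute-targets value are illustrative, not taken from your project.

@compute(CompileFor.deviceOnly) module kernels;

import ldc.dcompute;        // @kernel, @compute, GlobalPointer
import dcompute.std.index;  // GlobalIndex

// Built with something like: ldc2 -mdcompute-targets=cuda-350 -m64 kernels.d
// which, per the naming scheme above, emits kernels_cuda350_64.ptx.
@kernel void saxpy(GlobalPointer!float res,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   float alpha, size_t n)
{
    auto i = GlobalIndex.x;
    if (i >= n) return;
    res[i] = alpha * x[i] + y[i];
}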
 If this doesn't exist, there needs to be a wiki with getting 
 started guide covering various usage environments. I don't have 
 a lot of time to spare on detail, so success will depend on my 
 ability to hit the ground running, and that's really just a 
 matter of simple clear reference material.

 How far are we from integration into LDC without using forked 
 compilers?
Soon™; it's the first thing I'm going to do (if I don't get it done before
then) after I hand in my honours thesis at the start of July. I'm stupidly
busy this semester, so don't count on it being done beforehand.

If you have any questions about anything, probably the fastest way is the
libmir or ldc gitter channels (https://gitter.im/ldc-developers/main ,
https://gitter.im/libmir/public).

Good luck! I look forward to your experience report.

Nic
May 18 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 09:07:38 UTC, Nicholas Wilson wrote:
 When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx 
 is the CUDA compute capability specified on the command line 
 and yy is 32 or 64 for 32 or 64bit) which should fit somewhere 
 into your existing C++ pipeline.
Whoops, that assumes you have a CUDA driver API pipeline in your C++ code,
which, given that you're asking, I'm not sure you have. If you're using the
`kernel<<<...>>>(args)` form to launch your kernels then you are going to
have a lot more work to do in D, because you'll need to use the driver API
(http://docs.nvidia.com/cuda/cuda-driver-api/#axzz4hQLA0Zdm).

You'll need to:
* get a device
* create a context from it
* get a stream on that context
* load the ptx module (possibly linking it with other modules, to resolve
  missing symbols)
* compile it for the device
* then launch a kernel from that module on that device, by name, passing
  the arguments in a void*[].

The sad thing is that it's still nicer than OpenCL, because in OpenCL you
have to pass the runtime args (with sizes) one by one to a function. Hence
why I want to automate as much of that shit as possible.

I hope to have that done ASAP, but I don't have hardware set up to test
CUDA at the moment (I have one on my Windows box, but I don't have dev set
up there), and I'll be working on OpenCL at the same time (and there's only
so much horrible API I can take in a day). I'll be working on dcompute
part-part-time next semester though, so I should be able to get a fair bit
done, and quite a few others are interested, so that'll speed things up a bit.
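For concreteness, here is a rough D sketch of those steps against the
driver API. The extern(C) prototypes are a hand-written subset (the real
headers alias some of these to *_v2 symbols, which a maintained binding
takes care of), the kernel name and launch geometry are placeholders, and
every CUresult should really be checked.

import std.file : read;

alias CUresult = int;
alias CUdevice = int;
struct CUctx_st;    alias CUcontext  = CUctx_st*;
struct CUmod_st;    alias CUmodule   = CUmod_st*;
struct CUfunc_st;   alias CUfunction = CUfunc_st*;
struct CUstream_st; alias CUstream   = CUstream_st*;

extern(C) nothrow @nogc
{
    CUresult cuInit(uint flags);
    CUresult cuDeviceGet(CUdevice* dev, int ordinal);
    CUresult cuCtxCreate(CUcontext* ctx, uint flags, CUdevice dev);
    CUresult cuStreamCreate(CUstream* stream, uint flags);
    CUresult cuModuleLoadData(CUmodule* mod, const(void)* image);
    CUresult cuModuleGetFunction(CUfunction* fn, CUmodule mod, const(char)* name);
    CUresult cuLaunchKernel(CUfunction fn,
                            uint gridX, uint gridY, uint gridZ,
                            uint blockX, uint blockY, uint blockZ,
                            uint sharedBytes, CUstream stream,
                            void** kernelParams, void** extra);
}

void launchFromPtx(string ptxPath, float* devData, int n)
{
    CUdevice dev; CUcontext ctx; CUstream stream;
    CUmodule mod; CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);            // get a device
    cuCtxCreate(&ctx, 0, dev);       // create a context from it
    cuStreamCreate(&stream, 0);      // get a stream on that context

    // load the ptx module; the compile for the device happens in here
    auto ptx = cast(ubyte[]) read(ptxPath) ~ cast(ubyte) 0; // image must be NUL-terminated
    cuModuleLoadData(&mod, ptx.ptr);
    cuModuleGetFunction(&fn, mod, "my_kernel"); // look the kernel up by name

    // launch 128 blocks of 256 threads, arguments passed as a void*[]
    void*[2] args = [cast(void*) &devData, cast(void*) &n];
    cuLaunchKernel(fn, 128, 1, 1, 256, 1, 1, 0, stream, args.ptr, null);
}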
May 18 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
 How far are we from integration into LDC without using forked 
 compilers?
The future is now!

https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org

https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487
is the final commit from the stuff I've done prior to dconf.
May 30 2017
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:

 How far are we from integration into LDC without using forked compilers?
 The future is now!

 https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org

 https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487
 is the final commit from the stuff I've done prior to dconf.
Awesome stuff! That was fast :)

You're right, I'm using kernel<<<...>>>, and it's very convenient. I looked
briefly and realised that I had a lot of work to do to get running (as you
describe), so I stuck with my current setup for the moment :(

Is a <<<...>>> equivalent going to be possible in D, with kernel object
fragments built into the binary together with the CPU code?

I'm definitely looking forward to action in this space, and to the wiki
coming online :)
May 30 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 30 May 2017 at 08:14:16 UTC, Manu wrote:
 On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d < 
 digitalmars-d puremagic.com> wrote:

 On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:

 How far are we from integration into LDC without using forked 
 compilers?
 The future is now!
 https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh forum.dlang.org
 https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487
 is the final commit from the stuff I've done prior to dconf.

 Awesome stuff! That was fast :) You're right, I'm using
 kernel<<<...>>>, and it's very convenient.
Yep, that's (one of the reasons) why CUDA is more successful than OpenCL,
and therefore one of the more powerful draws for those poor sods using
OpenCL.
 I looked briefly and realised that I had a lot of work to get 
 running (as
 you describe), so I stuck with my current setup for the moment 
 :(
I figured, I'll get you using it eventually.
 Is a <<<...>>> equivalent going to be possible in D, with 
 kernel object fragments built into the binary together with the 
 CPU code?
As I explained in my dconf presentation, the idea is to have

Queue q = ... ; // the equivalent of a CUDA stream
q.enqueue!kernel(sizes)(kernel_arguments);

where q.enqueue returns a callable that you then call with the arguments.
It was modelled directly after CUDA's <<<...>>>.

As for embedding in the binary, a post-build step that does

ubyte[] ptx_code = import("kernels_cuda620_64.ptx");

should be doable, as should invoking ptxas and doing the same. Then,
provided a consistent naming convention is used, the code can do its magic.
Or the files could just be read from disk.
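A hedged sketch of that embedding option: D's string import pulls the
generated PTX into the host binary at compile time, provided the directory
containing the .ptx is passed with -J (the file name is the hypothetical
one from above).

// Embedded at compile time; requires -J<dir containing the .ptx> on the host build.
immutable ptxBlob = cast(immutable(ubyte)[]) import("kernels_cuda620_64.ptx");

// A NUL-terminated mutable copy of ptxBlob can then be handed to
// cuModuleLoadData, exactly as in the read-from-disk variant.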
 I'm definitely looking forward to action in this space, and the 
 wiki to come online :)
Yeah, once my thesis is done things should start moving. Any input from
your expertise with CUDA will be much appreciated.
May 30 2017
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 30 May 2017 at 19:54, Nicholas Wilson via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 as for embedding in the binary a post build step that does

 ubyte[] ptx_code = import("kernels_cuda620_64.ptx");

 should be doable as should invoking ptxas and doing the same.
 Then, provided a consistent naming convention is used, the code can do its
 magic.
 Or the files could just be read from disk.
Is it possible to convince the compiler to emit code built for the backend
target directly into the same object file as the host code?
I feel like this should be possible, along the lines of
__attribute__((target(...))), to convince the compiler to generate code for
a few functions with different targets than the rest of the module?

Using solutions as you suggest above introduces dependent build sequencing
into the build script. Different build systems might prove to be more or
less difficult to integrate cleanly, and many people use build-script
generators which might need to learn a few new tricks.

 Any input from your expertise with CUDA will be much appreciated.

'Expertise' is possibly not the word I'd suggest ;)
But I'll have some established software by that time that I'd love to
attempt to port, we can work through rough edges together when you're
available. No rush.
Jun 03 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 3 June 2017 at 11:58:14 UTC, Manu wrote:
 Is it possible to convince the compiler to emit code built for 
 the backend target directly into the same object file as the 
 host code?
 I feel like this should be possible, along the lines of
 __attribute__((target(...))), to convince the compiler to
 generate code for a few functions with different targets than 
 the module?
Alas, no. __attribute__((target(...))) works because it is targeting
different targets of the _same_ backend, whereas this targets different
backends.
 Using solutions as you suggest above introduces dependent build 
 sequencing into the build script. Different build systems might 
 prove to be more or less difficult to integrate cleanly, and 
 many people use build-script generators which might need to 
 learn a few new tricks.
Hence why I made reading from disk (with a predictable output name) a
viable option. Any build-script trickery would simply feed in slightly
later down the pipeline of:

read file from disk -> in memory -> compute API -> binary -> execute

Also, that's how using other kernels not written in D would work: just plug
in somewhere into the pipeline.
 'Expertise' is possibly not the word I'd suggest ;)
 But I'll have some established software by that time that I'd 
 love to attempt to port, we can work through rough edges 
 together when you're available. No rush.
Great. Heh, still more experience than I have. Looking forward to it; the
earliest would probably be August.
Jun 03 2017
David Nadlinger <code klickverbot.at> writes:
On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
 Alas, no. __attribute__((target(...))) works because it is targeting
 different targets of the _same_ backend, whereas this targets different
 backends.
But surely you could just take the buffer with the object code emitted by the compute target backend and emit it as constant global data into the host llvm::Module? — David
Jun 03 2017
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 3 June 2017 at 14:12:28 UTC, David Nadlinger wrote:
 On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
 Alas, no. __attribute__((target(...))) works because it is
 targeting different targets of the _same_ backend, whereas this
 targets different backends.
 But surely you could just take the buffer with the object code emitted by
 the compute target backend and emit it as constant global data into the
 host llvm::Module? — David
Good idea; the name of the global needs to be known in advance, though.
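For example (sketch only; both symbol names are assumptions), with an
agreed naming scheme the host code would reference the embedded blob like
this, which is why the name has to be fixed up front:

// Symbols the backend would have to emit under known names.
extern(C) extern immutable ubyte  kernels_cuda620_64_ptx_start;
extern(C) extern immutable size_t kernels_cuda620_64_ptx_len;

// Host-side view of the embedded PTX image.
immutable(ubyte)[] embeddedPtx()
{
    return (&kernels_cuda620_64_ptx_start)[0 .. kernels_cuda620_64_ptx_len];
}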
Jun 03 2017
David Nadlinger <code klickverbot.at> writes:
On Saturday, 3 June 2017 at 23:13:09 UTC, Nicholas Wilson wrote:
 Good idea, name of global needs to be known in advance though.
Why, and how would that be a non-trivial problem? — David
Jun 03 2017