digitalmars.D - GPGPUs

Russel Winder (41/41) Aug 13 2013 The era of GPGPUs for Bitcoin mining is now over; the miners have moved to ASICs.

John Colvin (10/63) Aug 13 2013 I'm interested. There may be a significant need for gpu work for
eles (5/8) Aug 13 2013 You mean an alternative to OpenCL language?

Dejan Lekic (3/11) Aug 18 2013 Tra55er did it long ago - look at the cl4d wrapper. I think it is

John Colvin (3/18) Aug 18 2013 I had no idea that existed. Thanks :)

Russel Winder (19/38) Aug 18 2013 I had missed that as well. Bad Google and GitHub skills on my part

Trass3r (7/15) Jan 18 2014 Interesting. I discovered this thread via the just introduced

luminousone (5/58) Aug 13 2013 I suggest looking into HSA.

Nick B (8/14) Aug 13 2013 What a very interesting concept redesigned from the ground up.

luminousone (24/41) Aug 13 2013 And its not just AMD, HSA is supported by the HSA Foundation,

bearophile (8/10) Aug 15 2013 Probably I can't help you, but I think today some of the old
Atash (11/64) Aug 15 2013 Clarifying question:

Russel Winder (34/45) Aug 16 2013 Assembly language and other intermediate languages is far from scary but

Paul Jurczak (4/21) Aug 16 2013 It seems to me that you are describing something similar to C++

Russel Winder (21/24) Aug 16 2013 C++ AMP may be an open specification but it only targets DirectX. But

Atash (26/37) Aug 16 2013 I'm iffy on the assumption that the future holds unified memory

deadalnix (13/38) Aug 17 2013 Many laptop and mobile devices (they have it in the hardware, but

Atash (30/37) Aug 17 2013 I'm not saying 'ignore it', I'm saying that it's not the least

luminousone (9/46) Aug 17 2013 There are two major things, unified memory, and unified virtual

luminousone (47/47) Aug 17 2013 We basically have to follow these rules,

Atash (44/91) Aug 17 2013 Unified virtual address-space I can accept, fine. Ignoring that

luminousone (2/102) Aug 17 2013 sorry typo, meant known.
luminousone (48/148) Aug 17 2013 You do have limited Atomics, but you don't really have any sort

Atash (54/102) Aug 17 2013 I said 'point 11', not 'point 10'. You also dodged points 1 and

luminousone (68/182) Aug 17 2013 The Xeon Phi is interesting in so far as taking generic

Atash (20/89) Aug 18 2013 I can't argue with that either.

luminousone (9/113) Aug 18 2013 I am still learning and additional links to go over never hurt!,

luminousone (21/21) Aug 18 2013 I chose the term aggregate, because it is the term used in the

Atash (29/50) Aug 18 2013 I'm not sure if 'problem space' is the industry standard term (in

luminousone (38/54) Aug 16 2013 CUDA, works as a preprocessor pass that generates c files from

John Colvin (3/60) Aug 16 2013 We have a[] = b[] * c[] - 5; etc. which could work very neatly

luminousone (17/19) Aug 16 2013 While this in fact could work, given the nature of GPGPU it would

John Colvin (36/59) Aug 16 2013 I didn't literally mean automatically inserting GPU code.

luminousone (48/114) Aug 16 2013 You can't mix cpu and gpu code, they must be separate.

Atash (84/106) Aug 16 2013 H'okay, let's be clear here. When you say 'mix CPU and GPU code',
Atash (86/108) Aug 16 2013 H'okay, let's be clear here. When you say 'mix CPU and GPU code',

luminousone (33/146) Aug 17 2013 Often when first introducing programmers to gpu programming, they

Atash (6/63) Aug 16 2013 Regarding functionality, @microthreaded is sounding a lot like

luminousone (3/8) Aug 16 2013 Yes, And that is just a word I pull out the air, if another term

ponce (4/13) Jan 18 2014 You can write everything in OpenCL and dispatch to both a CPU or

"Ola Fosheim Grøstad" (18/23) Jan 22 2014 Compiler support with futures could be useful, e.g. write D

bearophile (5/9) Jan 22 2014 Could be of interest, to ease the porting of C++ code to Cuda:

"Ola Fosheim Grøstad" (10/12) Jan 22 2014 Yeah, gpu programming is going to develop faster in the coming

"Ola Fosheim Grøstad" (12/14) Jan 22 2014 You might want to generate code for coprocessors too, like

Paulo Pinto (5/19) Jan 23 2014 Why not just generate SPIR, HSAIL or PTX code instead ?

Ben Cumming (25/28) Jan 23 2014 We advertised an internship at my work to look at using D for

Ben Cumming (2/4) Jan 23 2014 I mean OpenCL, not OpenGL.
Paulo Pinto (9/36) Jan 23 2014 I did an internship at CERN during 2003-2004. Lots of interesting C++

Ben Cumming (4/7) Jan 23 2014 I enjoy C++ metaprogramming too, but it is often a chore. And I
Atila Neves (12/19) Jan 23 2014 Small world, we were at CERN at the same time. There's still a

Paulo Pinto (10/31) Jan 23 2014 Who knows, we might even have been seating close to each other!

Joseph Rushton Wakeling (5/7) Aug 16 2013 Yes, I'd be interested, particularly if it's possible to produce a GPGPU

Gambler (5/17) Aug 17 2013 There is some interesting work in that regards in .NET:

Russel Winder <russel winder.org.uk> writes:

The era of GPGPUs for Bitcoin mining is now over; the miners have moved to ASICs.
The new market for GPGPUs is likely the banks, and other "Big Data"
folk. True many of the banks are already doing some GPGPU usage, but it
is not big as yet. But it is coming.

Most of the banks are either reinforcing their JVM commitment, via
Scala, or are re-architecting to C++ and Python. True there is some
[...] and it is
diminishing (despite what you might hear from .NET oriented training
companies).

Currently GPGPU tooling means C. OpenCL and CUDA (if you have to) are C
APIs for C coding. There are some C++ bindings. There are interesting
moves afoot with the JVM to enable access to GPGPU from Java, Scala,
Groovy, etc. but this is years away, which is a longer timescale than
the opportunity.

Python's offerings, PyOpenCL and PyCUDA are basically ways of managing C
coded kernels which rather misses the point. I may get involved in
trying to write an expression language in Python to go with PyOpenCL so
that kernels can be written in Python – a more ambitious version aimed
at Groovy is also mooted.
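
A minimal sketch of what such an expression language could look like, under stated assumptions: the names Expr, var and emit_kernel are hypothetical and are not part of PyOpenCL, all operands are assumed to be float arrays, and the sketch only generates OpenCL C source text without touching a device. The resulting string would still have to be handed to an OpenCL binding to be compiled and run.

```python
class Expr:
    """A node in a tiny arithmetic expression tree that carries OpenCL C text."""

    def __init__(self, code):
        self.code = code

    @staticmethod
    def _lift(value):
        # Wrap plain Python numbers as OpenCL float literals, e.g. 5 -> 5.0f.
        if isinstance(value, Expr):
            return value
        return Expr(f"{float(value)!r}f")

    def _binary(self, op, other, swap=False):
        left, right = self, Expr._lift(other)
        if swap:
            left, right = right, left
        return Expr(f"({left.code} {op} {right.code})")

    def __add__(self, other):
        return self._binary("+", other)

    def __radd__(self, other):
        return self._binary("+", other, swap=True)

    def __sub__(self, other):
        return self._binary("-", other)

    def __rsub__(self, other):
        return self._binary("-", other, swap=True)

    def __mul__(self, other):
        return self._binary("*", other)

    def __rmul__(self, other):
        return self._binary("*", other, swap=True)


def var(name):
    """One element of an input array, indexed by the work-item id i."""
    return Expr(f"{name}[i]")


def emit_kernel(name, inputs, expr):
    """Render an element-wise OpenCL C kernel that writes expr into out[i]."""
    params = ", ".join(f"__global const float *{p}" for p in inputs)
    return (
        f"__kernel void {name}({params}, __global float *out) {{\n"
        f"    size_t i = get_global_id(0);\n"
        f"    out[i] = {expr.code};\n"
        f"}}\n"
    )


b, c = var("b"), var("c")
source = emit_kernel("fused", ["b", "c"], b * c - 5)
print(source)
```

The point of the design is that the kernel body is built from ordinary Python operators rather than written as a C string by hand; anything beyond element-wise arithmetic (control flow, local memory, reductions) is outside what this sketch covers.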

However, D has the opportunity of gaining a bridgehead if a combination
of D, PyD, QtD and C++ gets to be seen as a viable solid platform for
development.  The analogue here is the way Java is giving way to Scala
and Groovy, but in an evolutionary way as things all interwork. The
opportunity is for D to be seen as the analogue of Scala on the JVM for
the native code world: a language that interworks well with all the
other players on the platform but provides more.

The entry point would be if D had a way of creating GPGPU kernels that
is better than the current C/C++ + tooling.

This email is not a direct proposal to do work, just really an enquiry
to see if there is any interest in this area.
--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Aug 13 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 [...]
 This email is not a direct proposal to do work, just really an
 enquiry to see if there is any interest in this area.

I'm interested. There may be a significant need for gpu work for 
my PhD, seeing as the amount of data needing to be crunched is a 
bit daunting (dozens of sensors with MHz sampling, with very 
intensive image analysis / computer vision work).
I could farm the whole thing out to cpu nodes, but using the gpu 
nodes would be more fun.


However, I'm insanely busy atm and have next to no experience 
with gpu programming, so I'm probably not gonna be that useful 
for a while!

Aug 13 2013

"eles" <eles eles.com> writes:

On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 The entry point would be if D had a way of creating GPGPU 
 kernels that
 is better than the current C/C++ + tooling.

You mean an alternative to OpenCL language?

Because, I imagine, a library (libopencl) would be easy enough to 
write/bind.

Who's gonna standardize this language?

Aug 13 2013

"Dejan Lekic" <dejan.lekic gmail.com> writes:

On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
 On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 The entry point would be if D had a way of creating GPGPU 
 kernels that
 is better than the current C/C++ + tooling.

 You mean an alternative to OpenCL language?

 Because, I imagine, a library (libopencl) would be easy enough 
 to write/bind.

 Who's gonna standardize this language?

Tra55er did it long ago - look at the cl4d wrapper. I think it is 
on GitHub.

Aug 18 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Sunday, 18 August 2013 at 18:19:06 UTC, Dejan Lekic wrote:
 On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
 On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder 
 wrote:
 The entry point would be if D had a way of creating GPGPU 
 kernels that
 is better than the current C/C++ + tooling.

 You mean an alternative to OpenCL language?

 Because, I imagine, a library (libopencl) would be easy enough 
 to write/bind.

 Who's gonna standardize this language?

 Tra55er did it long ago - look at the cl4d wrapper. I think it 
 is on GitHub.

I had no idea that existed. Thanks :) 
https://github.com/Trass3r/cl4d

Aug 18 2013

Russel Winder <russel winder.org.uk> writes:

On Sun, 2013-08-18 at 20:27 +0200, John Colvin wrote:
 On Sunday, 18 August 2013 at 18:19:06 UTC, Dejan Lekic wrote:
 On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
 On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder
 wrote:
 The entry point would be if D had a way of creating GPGPU
 kernels that
 is better than the current C/C++ + tooling.

 You mean an alternative to OpenCL language?

 Because, I imagine, a library (libopencl) would be easy enough
 to write/bind.

 Who's gonna standardize this language?

 Tra55er did it long ago - look at the cl4d wrapper. I think it
 is on GitHub.


Thanks for pointing this out, I had completely missed it.

 I had no idea that existed. Thanks :)
 https://github.com/Trass3r/cl4d

I had missed that as well. Bad Google and GitHub skills on my part
clearly.

I think the path is now obvious: ask whether the owner will turn this
repository over to a group so that it can become the focus of future
work via the repository's wiki and issue tracker.

I will fork this repository as is and begin to analyse the status quo
wrt the discussion recently on the email list.

--
Russel.

Aug 18 2013

"Trass3r" <un known.com> writes:

On Sunday, 18 August 2013 at 18:35:45 UTC, Russel Winder wrote:
 https://github.com/Trass3r/cl4d

 I had missed that as well. Bad Google and GitHub skills on my 
 part clearly.

 I think the path is now obvious: ask whether the owner will turn
 this repository over to a group so that it can become the focus of
 future work via the repository's wiki and issue tracker.

 I will fork this repository as is and begin to analyse the 
 status quo wrt the discussion recently on the email list.

Interesting. I discovered this thread via the just introduced 
traffic analytics over at Github ^^

Haven't touched the code for a long time but there have been some 
active forks.
I've been thinking to offer push rights to them for quite a while 
but never got around to it.

Jan 18 2014

"luminousone" <rd.hunt gmail.com> writes:

On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 [...]
 This email is not a direct proposal to do work, just really an
 enquiry to see if there is any interest in this area.

I suggest looking into HSA.

http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf

This will be available on AMD APUs in December, and will trickle
out to ARM and other platforms over time.

Aug 13 2013

"Nick B" <nick.barbalich gmail.com> writes:

On Tuesday, 13 August 2013 at 18:35:28 UTC, luminousone wrote:
 On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 The era of GPGPUs for Bitcoin mining is now over; the miners have
 moved to ASICs.


 http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf

 This will be available on AMD APUs in December, and will
 trickle out to ARM and other platforms over time.

What a very interesting concept, redesigned from the ground up.

How about this for the future:

D2 > LLVM Compiler > HSAIL Finaliser > Architected Queueing
Language


Here is another useful link:

http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/

Nick

Aug 13 2013

"luminousone" <rd.hunt gmail.com> writes:

On Tuesday, 13 August 2013 at 22:24:18 UTC, Nick B wrote:
 [...]
 What a very interesting concept redesigned from the ground up.
 [...]

And it's not just AMD: HSA is supported by the HSA Foundation,
with wide industry support.

http://hsafoundation.com/

It is platform independent, at least in so far as it doesn't need
x86 to operate; within a few years most ARM devices will support
HSA.

And HSAIL bytecode sits inside the ELF executable file next to
the platform's native code, so compiling it in does not break any
platform that does not support HSA.

Memory isn't simply a shared architecture, it's fully cache
coherent and allows the GPU to use the same virtual address space
as the CPU. On an APU platform you never need to copy data to and
from GPU memory; simply pass a pointer and you're done.

HSAIL bytecode is also expressive enough for C++ (including
virtual functions) to compile directly to it, so there is no
reason D can't.

Some additions to D to handle micro threading wouldn't go amiss,
however.

The HSA finalizer AMD will be providing is supposed to use LLVM
(at least that is my understanding).

So the changes required to support HSA would likely be minimal,
and most of these changes would also be friendly to building in
support for OpenCL or the like on unsupported platforms.

Aug 13 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Russel Winder:

 This email is not a direct proposal to do work, just really an 
 enquiry to see if there is any interest in this area.

Probably I can't help you, but I think today some of the old 
purposes of a system language, that is to write code for heavy 
computations, are now done on GPUs (or even ASICs as you say). So 
using only the CPU from D is not enough. So what you are 
discussing here is important.

Bye,
bearophile

Aug 15 2013

"Atash" <nope nope.nope> writes:

On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
 [...]
 This email is not a direct proposal to do work, just really an
 enquiry to see if there is any interest in this area.

Clarifying question:

At what level is this interest pointed at? Is it at the level of 
assembly/IL and other scary stuff, or is it at creating bindings 
that are cleaner and providing more convenient tools?

'Cuz I'm seeing a lot of potential in D for a library-based 
solution, handling kernel code similar to how CUDA does (I think 
the word is 'conveniently'), 'cept with OpenCL or whatever 
lower-level-standard-that-exists-on-more-than-just-one-company's-hardware 
and abuse of the import() expression coupled with a heavy dose of 
metaprogramming magic.

Aug 15 2013

Russel Winder <russel winder.org.uk> writes:

On Fri, 2013-08-16 at 04:21 +0200, Atash wrote:
[…]
 Clarifying question:

 At what level is this interest pointed at? Is it at the level of
 assembly/IL and other scary stuff, or is it at creating bindings
 that are cleaner and providing more convenient tools?

Assembly language and other intermediate languages are far from scary,
but they are not really what is proposed, which was intended to be
"standards compliant", with the current standard being OpenCL. There is
a place for assembly-language-level work, but there are others who do
that to realize the standards, giving us a C linkage API.

 'Cuz I'm seeing a lot of potential in D for a library-based
 solution, handling kernel code similar to how CUDA does (I think
 the word is 'conveniently'), 'cept with OpenCL or whatever
 lower-level-standard-that-exists-on-more-than-just-one-company's-hardware
 and abuse of the import() expression coupled with a heavy dose of
 metaprogramming magic.

CUDA has a huge tool chain designed as much to ensure lock in to C, C++
and NVIDIA as to make computations easier for end users.

OpenCL is (was?) Apple's vision for API access to the mixed CPU and
GPGPU hardware it envisaged, and which is now rapidly being delivered.
Intel chips from now on will always have one or more GPGPU on the CPU
chips. They label this the "many core" approach.

The core (!) point here is that processor chips are rapidly becoming a
collection of heterogeneous cores. Any programming language that assumes
a single CPU or a collection of homogeneous CPUs has built-in
obsolescence.

So the question I am interested in is whether D is the language that can
allow me to express in a single codebase a program in which parts will
be executed on one or more GPGPUs and parts on multiple CPUs. D has
support for the latter, std.parallelism and std.concurrency.

I guess my question is whether people are interested in std.gpgpu (or
some more sane name).

--
Russel.

Aug 16 2013

"Paul Jurczak" <pauljurczak yahoo.com> writes:

On Friday, 16 August 2013 at 10:04:22 UTC, Russel Winder wrote:
 [...]
 The core (!) point here is that processor chips are rapidly 
 becoming a
 collection of heterogeneous cores. Any programming language 
 that assumes
 a single CPU or a collection of homogeneous CPUs has built-in
 obsolescence.

 So the question I am interested in is whether D is the language 
 that can
 allow me to express in a single codebase a program in which 
 parts will
 be executed on one or more GPGPUs and parts on multiple CPUs. D 
 has
 support for the latter, std.parallelism and std.concurrency.

 I guess my question is whether people are interested in 
 std.gpgpu (or
 some more sane name).

It seems to me that you are describing something similar to C++
AMP, which is a high-level, language-specific solution to the
GPGPU problem.

Aug 16 2013

Russel Winder <russel winder.org.uk> writes:

On Fri, 2013-08-16 at 12:41 +0200, Paul Jurczak wrote:
[…]
 It seems to me that you are describing something similar to C++
 AMP, which is a high-level, language-specific solution to the
 GPGPU problem.

C++ AMP may be an open specification, but it only targets DirectX. The
ideas behind it are very sensible, though: use closures and internal
iteration with library support to drive the compiler to construct the
required kernels.
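
The shape of "closures plus internal iteration" can be sketched as follows. This is only an illustration under stated assumptions: parallel_for_each here is a hypothetical CPU stand-in built on a thread pool, not the C++ AMP function of the same name, and no kernel is constructed. What it shows is that the caller supplies a closure over the data and the library, not the caller, owns the loop.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for_each(extent, body, workers=4):
    """Call body(i) for every index in range(extent); the library owns the loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any exception from body.
        list(pool.map(body, range(extent)))


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)


def scaled_sum(i):
    # The closure captures a, b and out; each index writes only its own slot.
    out[i] = 2.0 * a[i] + b[i]


parallel_for_each(len(out), scaled_sum)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because the iteration is internal, a different implementation of parallel_for_each could in principle hand the same closure to a compiler or runtime that targets an accelerator, without the calling code changing.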

Today you have to download the kernel to the attached GPGPU over the
bus. In the near future the GPGPU will exist in a single memory address
space shared with all the CPUs. At this point separately downloadable
kernels become a thing of the past, it becomes a compiler/loader issue
to get things right.

--
Russel.

Aug 16 2013

"Atash" <nope nope.nope> writes:

On Friday, 16 August 2013 at 12:18:49 UTC, Russel Winder wrote:
 On Fri, 2013-08-16 at 12:41 +0200, Paul Jurczak wrote:
 […]
 Today you have to download the kernel to the attached GPGPU 
 over the
 bus. In the near future the GPGPU will exist in a single memory 
 address
 space shared with all the CPUs. At this point separately 
 downloadable
 kernels become a thing of the past, it becomes a 
 compiler/loader issue
 to get things right.

I'm iffy on the assumption that the future holds unified memory 
for heterogeneous devices. Even relatively recent products such 
as the Intel Xeon Phi have totally separate memory. I'm not aware 
of any general-computation-oriented products that don't have 
separate memory.

I'm also of the opinion that as long as people want to have 
devices that can scale in size, there will be modular devices. 
Because they're modular, there's some sort of a spacing between 
them and the machine, ex. PCIe (and, somewhat importantly, a 
physical distance between the added device and the CPU-stuff). 
Because of that, they're likely to have their own memory. 
Therefore, I'm personally not willing to bank on anything short 
of targeting the least common denominator here (non-uniform 
access memory) specifically because it looks like a necessity for 
scaling a physical collection of heterogeneous devices up in 
size, which in turn I *think* is a necessity for people trying to 
deal with growing data sets in the real world.

Annnnnnnnnnnndddddd because heterogeneous compute devices aren't 
*just* GPUs (ex. Intel Xeon Phi), I'd strongly suggest picking a 
more general name, like 'accelerators' or 'apu' (except AMD 
totally ran away with that acronym in marketing and I sort of 
hate them for it) or 
'<something-I-can't-think-of-because-words-are-hard>'.

That said, I'm no expert, so go ahead and rip 'mah opinions 
apart. :-D

Aug 16 2013

"deadalnix" <deadalnix gmail.com> writes:

On Friday, 16 August 2013 at 19:53:44 UTC, Atash wrote:
 I'm iffy on the assumption that the future holds unified memory 
 for heterogeneous devices. Even relatively recent products such 
 as the Intel Xeon Phi have totally separate memory. I'm not 
 aware of any general-computation-oriented products that don't 
 have separate memory.

Many laptops and mobile devices (they have it in the hardware, but
the APIs don't permit making use of it most of the time), and the
next-gen consoles (PS4, Xbox One).

nVidia is pushing against it, AMD/ATI is pushing for it, and they 
are right on that one.

 I'm also of the opinion that as long as people want to have 
 devices that can scale in size, there will be modular devices. 
 Because they're modular, there's some sort of a spacing between 
 them and the machine, ex. PCIe (and, somewhat importantly, a 
 physical distance between the added device and the CPU-stuff). 
 Because of that, they're likely to have their own memory. 
 Therefore, I'm personally not willing to bank on anything short 
 of targeting the least common denominator here (non-uniform 
 access memory) specifically because it looks like a necessity 
 for scaling a physical collection of heterogeneous devices up 
 in size, which in turn I *think* is a necessity for people 
 trying to deal with growing data sets in the real world.

You can have 2 sockets on the motherboard.

 Annnnnnnnnnnndddddd because heterogeneous compute devices 
 aren't *just* GPUs (ex. Intel Xeon Phi), I'd strongly suggest 
 picking a more general name, like 'accelerators' or 'apu' 
 (except AMD totally ran away with that acronym in marketing and 
 I sort of hate them for it) or 
 '<something-I-can't-think-of-because-words-are-hard>'.

 That said, I'm no expert, so go ahead and rip 'mah opinions 
 apart. :-D

Unified memory has too many benefits to be ignored. The
open questions are about cache coherency, direct communication
between chips, identical performance across the address space,
and so on. But unified memory will remain. Even when memory
is physically separated, you'll see a unified memory model
emerge, with disparate performance depending on address space.

Aug 17 2013

"Atash" <nope nope.nope> writes:

On Saturday, 17 August 2013 at 15:37:58 UTC, deadalnix wrote:
 Unified memory has too many benefits to be ignored. The
 open questions are about cache coherency, direct communication
 between chips, identical performance across the address
 space, and so on. But unified memory will remain. Even
 when memory is physically separated, you'll see a unified
 memory model emerge, with disparate performance depending on
 address space.

I'm not saying 'ignore it', I'm saying that it's not the least 
common denominator among popular devices, and that in all 
likelihood it won't be the least common denominator among compute 
devices ever. AMD/ATi being 'right' doesn't mean that they'll 
dominate the whole market. Having two slots on your mobo is more 
limiting than having the ability to just chuck more computers in 
a line hidden behind some thin wrapper around some code built to 
deal with non-uniform memory access.

Additionally, in another post, I tried to demonstrate a way for 
it to target the least common denominator, and in my (obviously 
biased) opinion, it didn't look half bad.

Unlike uniform memory access, non-uniform memory access will 
*always* be a thing. Uniform memory access is cool n'all, but it 
isn't popular enough to be here now, and it isn't like 
non-uniform memory access which has a long history of being here 
and looks like it has a permanent stay in computing.

Pragmatism dictates to me here that any tool we want to be 
'awesome', eliciting 'wowzers' from all the folk of the land, 
should target the widest variety of devices while still being 
pleasant to work with. *That* tells me that it is paramount to 
*not* brush off non-uniform access, and that because non-uniform 
access is the least common denominator, that should be what is 
targeted.

On the other hand, if we want to start up some sort of thing 
where one lib handles the paradigm of uniform memory access in as 
convenient a way as possible, and another lib handles non-uniform 
memory access, that's fine too. Except that the first lib really 
would just be a specialization of the second alongside some more 
'convenience'-functions.

Aug 17 2013

"luminousone" <rd.hunt gmail.com> writes:

On Saturday, 17 August 2013 at 20:17:17 UTC, Atash wrote:
 On Saturday, 17 August 2013 at 15:37:58 UTC, deadalnix wrote:
 Unified memory have too much benefice for it to be ignored. 
 The open questions are about cache coherency, direct 
 communication between chips, identical performances through 
 the address space, and so on. But the unified memory will 
 remains. Even when memory is physically separated, you'll see 
 an unified memory model emerge, with disparate performances 
 depending on address space.

 I'm not saying 'ignore it', I'm saying that it's not the least 
 common denominator among popular devices, and that in all 
 likelihood it won't be the least common denominator among 
 compute devices ever. AMD/ATi being 'right' doesn't mean that 
 they'll dominate the whole market. Having two slots on your 
 mobo is more limiting than having the ability to just chuck 
 more computers in a line hidden behind some thin wrapper around 
 some code built to deal with non-uniform memory access.

 Additionally, in another post, I tried to demonstrate a way for 
 it to target the least common denominator, and in my (obviously 
 biased) opinion, it didn't look half bad.

 Unlike uniform memory access, non-uniform memory access will 
 *always* be a thing. Uniform memory access is cool n'all, but 
 it isn't popular enough to be here now, and it isn't like 
 non-uniform memory access which has a long history of being 
 here and looks like it has a permanent stay in computing.

 Pragmatism dictates to me here that any tool we want to be 
 'awesome', eliciting 'wowzers' from all the folk of the land, 
 should target the widest variety of devices while still being 
 pleasant to work with. *That* tells me that it is paramount to 
 *not* brush off non-uniform access, and that because 
 non-uniform access is the least common denominator, that should 
 be what is targeted.

 On the other hand, if we want to start up some sort of thing 
 where one lib handles the paradigm of uniform memory access in 
 as convenient a way as possible, and another lib handles 
 non-uniform memory access, that's fine too. Except that the 
 first lib really would just be a specialization of the second 
 alongside some more 'convenience'-functions.

There are two major things, unified memory, and unified virtual 
address space.

Unified virtual addressing will be universal within a few years, 
and the user-space application will no longer manage the cpu/gpu 
copies; this will instead be handled by the gpu system 
library, hMMU, and operating system.

Even non-uniform memory will be uniform from the perspective of 
the application writer.

Aug 17 2013

"luminousone" <rd.hunt gmail.com> writes:

We basically have to follow these rules,

1. The range must be none prior to execution of a gpu code block
2. The range can not be changed during execution of a gpu code 
block
3. Code blocks can only receive a single range, it can however be 
multidimensional
4. index keys used in a code block are immutable
5. Code blocks can only use a single key (the gpu executes many 
instances in parallel, each with its own unique key)
6. indices are always an unsigned integer type
7. OpenCL and CUDA have no access to global state
8. gpu code blocks cannot allocate memory
9. gpu code blocks cannot call cpu functions
10. atomics, though available on the gpu, are many times slower 
than on the cpu
11. separate running instances of the same code block on the gpu 
cannot have any interdependency on each other.

Now if we are talking about HSA, or other similar setup, then a 
few of those rules don't apply or become fuzzy.

HSA does have limited access to global state, HSA can call cpu 
functions that are pure, and of course, because in HSA the cpu and 
gpu share the same virtual address space, most of memory is open 
for access.

HSA also manages memory via the hMMU, and there is no need for 
gpu memory management functions, as that is managed by the 
operating system and video card drivers.

Basically, D would either need to opt out of legacy APIs such as 
OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and 
generally have ugly-as-sin syntax), or D would have to go the 
route of a full and safe gpu subset of features.

I don't think such a setup can be implemented as simply a 
library, as the GPU needs compiled source.

If D were to implement gpgpu features, I would actually suggest 
starting by simply adding a microthreading function syntax, for 
example...

void example( aggregate in float a[] ; key , in float b[], out 
float c[]) {
	c[key] = a[key] + b[key];
}

By adding an aggregate keyword to the function, we can assume the 
range simply using the length of a[] without adding an extra set 
of brackets or something similar.

This would make access to the gpu more generic and, more 
importantly, because LLVM will support HSA, it removes the need 
for writing more complex support into dmd as OpenCL and CUDA 
would require; a few hints for the LLVM backend would be enough 
to generate the dual-bytecode ELF executables.
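
For comparison, the per-key semantics described above can be 
approximated in present-day D as a plain library function. This is 
only a sketch under assumptions: std.parallelism on the cpu stands 
in for a gpu launch, and the function body is a hypothetical 
translation of the proposed `aggregate` syntax, not an existing 
language feature. The range is taken from a.length, and each key is 
processed independently of the others.

```d
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for the proposed `aggregate` function:
// one independent "instance" per index key, range fixed up front.
void example(const(float)[] a, const(float)[] b, float[] c)
{
    assert(b.length == a.length && c.length == a.length);
    foreach (key; parallel(iota(a.length)))
    {
        // No instance reads or writes another instance's element.
        c[key] = a[key] + b[key];
    }
}

void main()
{
    auto a = [1.0f, 2.0f, 3.0f];
    auto b = [10.0f, 20.0f, 30.0f];
    auto c = new float[a.length];
    example(a, b, c);
    assert(c == [11.0f, 22.0f, 33.0f]);
}
```

The call site is ordinary function-call syntax; what a compiler 
feature would add over this is the ability to emit device code for 
the loop body.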

Aug 17 2013

"Atash" <nope nope.nope> writes:

Unified virtual address-space I can accept, fine. Ignoring that 
it is, in fact, in a totally different address-space where memory 
latency is *entirely different*, I'm far *far* more iffy about.

 We basically have to follow these rules,

 1. The range must be none prior to execution of a gpu code block
 2. The range can not be changed during execution of a gpu code 
 block
 3. Code blocks can only receive a single range, it can however 
 be multidimensional
 4. index keys used in a code block are immutable
 5. Code blocks can only use a single key(the gpu executes many 
 instances in parallel each with their own unique key)
 6. index's are always an unsigned integer type
 7. openCL,CUDA have no access to global state
 8. gpu code blocks can not allocate memory
 9. gpu code blocks can not call cpu functions
 10. atomics tho available on the gpu are many times slower then 
 on the cpu
 11. separate running instances of the same code block on the 
 gpu can not have any interdependency on each other.

Please explain point 1 (specifically the use of the word 'none'), 
and why you added in point 3?

Additionally, point 11 doesn't make any sense to me. There is 
research out there showing how to use cooperative warp-scans, for 
example, to have multiple work-items cooperate over some local 
block of memory and perform sorting in blocks. There are even 
tutorials out there for OpenCL and CUDA that show how to do 
this, specifically to create better-performing code. This 
statement is in direct contradiction with what exists.

 Now if we are talking about HSA, or other similar setup, then a 
 few of those rules don't apply or become fuzzy.

 HSA, does have limited access to global state, HSA can call cpu 
 functions that are pure, and of course because in HSA the cpu 
 and gpu share the same virtual address space most of memory is 
 open for access.

 HSA also manages memory, via the hMMU, and their is no need for 
 gpu memory management functions, as that is managed by the 
 operating system and video card drivers.

Good for HSA. Now why are we latching onto this particular 
construction that, as far as I can tell, is missing the support 
of at least two highly relevant giants (Intel and NVidia)?

 Basically, D would either need to opt out of legacy api's such 
 as openCL, CUDA, etc, these are mostly tied to c/c++ anyway, 
 and generally have ugly as sin syntax; or D would have go the 
 route of a full and safe gpu subset of features.

Wrappers do a lot to change the appearance of a program. Raw 
OpenCL may look ugly, but so do BLAS and LAPACK routines. The use 
of wrappers and expression templates does a lot to clean up code 
(ex. look at the way Eigen 3 or any other linear algebra library 
does expression templates in C++; something D can do even better).

 I don't think such a setup can be implemented as simply a 
 library, as the GPU needs compiled source.

This doesn't make sense. Your claim is contingent on opting out 
of OpenCL or any other mechanism that provides for the 
application to carry abstract instructions which are then 
compiled on the fly. If you're okay with creating kernel code on 
the fly, this can be implemented as a library, beyond any 
reasonable doubt.
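
To make the 'kernel code on the fly' point concrete, here is a 
minimal sketch of the library side in D. It only builds OpenCL C 
source as a string; handing that string to the OpenCL runtime for 
compilation is deliberately not shown. The helper name 
elementwiseKernel and its parameters are made up for illustration.

```d
import std.algorithm.searching : canFind;

// Hypothetical helper: returns OpenCL C source for an element-wise
// binary kernel. `op` is spliced in verbatim, e.g. "+" or "*".
string elementwiseKernel(string name, string op)
{
    return "__kernel void " ~ name ~ "(__global const float* a,\n"
         ~ "    __global const float* b,\n"
         ~ "    __global float* c)\n"
         ~ "{\n"
         ~ "    size_t key = get_global_id(0);\n"
         ~ "    c[key] = a[key] " ~ op ~ " b[key];\n"
         ~ "}\n";
}

// Plain string concatenation is usable at compile time, so the
// kernel source can also be baked into the executable.
enum addSource = elementwiseKernel("example", "+");

void main()
{
    import std.stdio : write;
    assert(addSource.canFind("c[key] = a[key] + b[key];"));
    write(addSource);
}
```

An expression-template wrapper would generate strings like this 
from ordinary-looking D expressions instead of taking the operator 
as text.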

 If D where to implement gpgpu features, I would actually 
 suggest starting by simply adding a microthreading function 
 syntax, for example...

 void example( aggregate in float a[] ; key , in float b[], out 
 float c[]) {
 	c[key] = a[key] + b[key];
 }

 By adding an aggregate keyword to the function, we can assume 
 the range simply using the length of a[] without adding an 
 extra set of brackets or something similar.

 This would make access to the gpu more generic, and more 
 importantly, because llvm will support HSA, removes the needs 
 for writing more complex support into dmd as openCL and CUDA 
 would require, a few hints for the llvm backend would be enough 
 to generate the dual bytecode ELF executables.

1) If you wanted to have that 'key' nonsense in there, I'm 
thinking you'd need to add several additional parameters: global 
size, group size, group count, and maybe group-local memory 
access (requires allowing multiple aggregates?). I mean, I get 
the gist of what you're saying, this isn't me pointing out a 
problem, just trying to get a clarification on it (maybe give 
'key' some additional structure, or something).

2) ... I kind of like this idea. I disagree with how you led up 
to it, but I like the idea.

3) How do you envision *calling* microthreaded code? Just the 
usual syntax?

4) How would this handle working on subranges?

ex. Let's say I'm coding up a radix sort using something like 
this:

https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

What's the high-level program organization with this syntax if we 
can only use one range at a time? How many work-items get fired 
off? What's the gpu-code launch procedure?

Aug 17 2013

"luminousone" <rd.hunt gmail.com> writes:

On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
 Unified virtual address-space I can accept, fine. Ignoring that 
 it is, in fact, in a totally different address-space where 
 memory latency is *entirely different*, I'm far *far* more iffy 
 about.

 We basically have to follow these rules,

 1. The range must be none prior to execution of a gpu code 
 block
 2. The range can not be changed during execution of a gpu code 
 block
 3. Code blocks can only receive a single range, it can however 
 be multidimensional
 4. index keys used in a code block are immutable
 5. Code blocks can only use a single key(the gpu executes many 
 instances in parallel each with their own unique key)
 6. index's are always an unsigned integer type
 7. openCL,CUDA have no access to global state
 8. gpu code blocks can not allocate memory
 9. gpu code blocks can not call cpu functions
 10. atomics tho available on the gpu are many times slower 
 then on the cpu
 11. separate running instances of the same code block on the 
 gpu can not have any interdependency on each other.

 Please explain point 1 (specifically the use of the word 
 'none'), and why you added in point 3?

 Additionally, point 11 doesn't make any sense to me. There is 
 research out there showing how to use cooperative warp-scans, 
 for example, to have multiple work-items cooperate over some 
 local block of memory and perform sorting in blocks. There are 
 even tutorials out there for OpenCL and CUDA that shows how to 
 do this, specifically to create better performing code. This 
 statement is in direct contradiction with what exists.

 Now if we are talking about HSA, or other similar setup, then 
 a few of those rules don't apply or become fuzzy.

 HSA, does have limited access to global state, HSA can call 
 cpu functions that are pure, and of course because in HSA the 
 cpu and gpu share the same virtual address space most of 
 memory is open for access.

 HSA also manages memory, via the hMMU, and their is no need 
 for gpu memory management functions, as that is managed by the 
 operating system and video card drivers.

 Good for HSA. Now why are we latching onto this particular 
 construction that, as far as I can tell, is missing the support 
 of at least two highly relevant giants (Intel and NVidia)?

 Basically, D would either need to opt out of legacy api's such 
 as openCL, CUDA, etc, these are mostly tied to c/c++ anyway, 
 and generally have ugly as sin syntax; or D would have go the 
 route of a full and safe gpu subset of features.

 Wrappers do a lot to change the appearance of a program. Raw 
 OpenCL may look ugly, but so do BLAS and LAPACK routines. The 
 use of wrappers and expression templates does a lot to clean up 
 code (ex. look at the way Eigen 3 or any other linear algebra 
 library does expression templates in C++; something D can do 
 even better).

 I don't think such a setup can be implemented as simply a 
 library, as the GPU needs compiled source.

 This doesn't make sense. Your claim is contingent on opting out 
 of OpenCL or any other mechanism that provides for the 
 application to carry abstract instructions which are then 
 compiled on the fly. If you're okay with creating kernel code 
 on the fly, this can be implemented as a library, beyond any 
 reasonable doubt.

 If D where to implement gpgpu features, I would actually 
 suggest starting by simply adding a microthreading function 
 syntax, for example...

 void example( aggregate in float a[] ; key , in float b[], out 
 float c[]) {
 	c[key] = a[key] + b[key];
 }

 By adding an aggregate keyword to the function, we can assume 
 the range simply using the length of a[] without adding an 
 extra set of brackets or something similar.

 This would make access to the gpu more generic, and more 
 importantly, because llvm will support HSA, removes the needs 
 for writing more complex support into dmd as openCL and CUDA 
 would require, a few hints for the llvm backend would be 
 enough to generate the dual bytecode ELF executables.

 1) If you wanted to have that 'key' nonsense in there, I'm 
 thinking you'd need to add several additional parameters: 
 global size, group size, group count, and maybe group-local 
 memory access (requires allowing multiple aggregates?). I mean, 
 I get the gist of what you're saying, this isn't me pointing 
 out a problem, just trying to get a clarification on it (maybe 
 give 'key' some additional structure, or something).

 2) ... I kind of like this idea. I disagree with how you led up 
 to it, but I like the idea.

 3) How do you envision *calling* microthreaded code? Just the 
 usual syntax?

 4) How would this handle working on subranges?

 ex. Let's say I'm coding up a radix sort using something like 
 this:

 https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

 What's the high-level program organization with this syntax if 
 we can only use one range at a time? How many work-items get 
 fired off? What's the gpu-code launch procedure?

Sorry, typo in point 1: I meant 'known', not 'none'.

Aug 17 2013

"luminousone" <rd.hunt gmail.com> writes:

On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
 Unified virtual address-space I can accept, fine. Ignoring that 
 it is, in fact, in a totally different address-space where 
 memory latency is *entirely different*, I'm far *far* more iffy 
 about.

 We basically have to follow these rules,

 1. The range must be none prior to execution of a gpu code 
 block
 2. The range can not be changed during execution of a gpu code 
 block
 3. Code blocks can only receive a single range, it can however 
 be multidimensional
 4. index keys used in a code block are immutable
 5. Code blocks can only use a single key(the gpu executes many 
 instances in parallel each with their own unique key)
 6. index's are always an unsigned integer type
 7. openCL,CUDA have no access to global state
 8. gpu code blocks can not allocate memory
 9. gpu code blocks can not call cpu functions
 10. atomics tho available on the gpu are many times slower 
 then on the cpu
 11. separate running instances of the same code block on the 
 gpu can not have any interdependency on each other.

 Please explain point 1 (specifically the use of the word 
 'none'), and why you added in point 3?

 Additionally, point 11 doesn't make any sense to me. There is 
 research out there showing how to use cooperative warp-scans, 
 for example, to have multiple work-items cooperate over some 
 local block of memory and perform sorting in blocks. There are 
 even tutorials out there for OpenCL and CUDA that shows how to 
 do this, specifically to create better performing code. This 
 statement is in direct contradiction with what exists.

You do have limited Atomics, but you don't really have any sort 
of complex messages, or anything like that.

 Now if we are talking about HSA, or other similar setup, then 
 a few of those rules don't apply or become fuzzy.

 HSA, does have limited access to global state, HSA can call 
 cpu functions that are pure, and of course because in HSA the 
 cpu and gpu share the same virtual address space most of 
 memory is open for access.

 HSA also manages memory, via the hMMU, and their is no need 
 for gpu memory management functions, as that is managed by the 
 operating system and video card drivers.

 Good for HSA. Now why are we latching onto this particular 
 construction that, as far as I can tell, is missing the support 
 of at least two highly relevant giants (Intel and NVidia)?

Intel doesn't have a dog in this race, so there is no way to know 
what they plan on doing, if anything at all.

The reason to point out HSA is that it is really easy to add 
support for; it is not a giant task like OpenCL would be. A few 
changes to the front-end compiler are all that is needed; LLVM's 
backend does the rest.

 Basically, D would either need to opt out of legacy api's such 
 as openCL, CUDA, etc, these are mostly tied to c/c++ anyway, 
 and generally have ugly as sin syntax; or D would have go the 
 route of a full and safe gpu subset of features.

 Wrappers do a lot to change the appearance of a program. Raw 
 OpenCL may look ugly, but so do BLAS and LAPACK routines. The 
 use of wrappers and expression templates does a lot to clean up 
 code (ex. look at the way Eigen 3 or any other linear algebra 
 library does expression templates in C++; something D can do 
 even better).

 I don't think such a setup can be implemented as simply a 
 library, as the GPU needs compiled source.

 This doesn't make sense. Your claim is contingent on opting out 
 of OpenCL or any other mechanism that provides for the 
 application to carry abstract instructions which are then 
 compiled on the fly. If you're okay with creating kernel code 
 on the fly, this can be implemented as a library, beyond any 
 reasonable doubt.

OpenCL isn't just a library; it is a language extension that is 
run through a preprocessor that compiles the embedded __KERNEL 
and __DEVICE functions into usable code, and then outputs 
.c/.cpp files for the C compiler to deal with.

 If D where to implement gpgpu features, I would actually 
 suggest starting by simply adding a microthreading function 
 syntax, for example...

 void example( aggregate in float a[] ; key , in float b[], out 
 float c[]) {
 	c[key] = a[key] + b[key];
 }

 By adding an aggregate keyword to the function, we can assume 
 the range simply using the length of a[] without adding an 
 extra set of brackets or something similar.

 This would make access to the gpu more generic, and more 
 importantly, because llvm will support HSA, removes the needs 
 for writing more complex support into dmd as openCL and CUDA 
 would require, a few hints for the llvm backend would be 
 enough to generate the dual bytecode ELF executables.

 1) If you wanted to have that 'key' nonsense in there, I'm 
 thinking you'd need to add several additional parameters: 
 global size, group size, group count, and maybe group-local 
 memory access (requires allowing multiple aggregates?). I mean, 
 I get the gist of what you're saying, this isn't me pointing 
 out a problem, just trying to get a clarification on it (maybe 
 give 'key' some additional structure, or something).

Those are all platform-specific; they change on the whim and 
fancy of NVIDIA and AMD with each and every new chip released: 
the size and configuration of CUDA clusters, or compute 
clusters, or EUs, or whatever the hell chip maker X feels like 
using at the moment.

Long term this will all be managed by the underlying support 
software in the video drivers, and operating system kernel. 
Putting any effort into this is a waste of time.

 2) ... I kind of like this idea. I disagree with how you led up 
 to it, but I like the idea.

 3) How do you envision *calling* microthreaded code? Just the 
 usual syntax?

void example( aggregate in float a[] ; key , in float b[], out
    float c[]) {
	c[key] = a[key] + b[key];
}

example(a,b,c);

In the function declaration, you can think of the aggregate as 
basically having the reverse order of the items in a foreach 
statement.

int a[100] = [ ... ];
int b[100];
foreach( k, v ; a ) { b[k] = a[k]; }

int a[100] = [ ... ];
int b[100];

void example2( aggregate in float A[] ; k, out float B[] ) { B[k] 
= A[k]; }

example2(a,b);

 4) How would this handle working on subranges?

 ex. Let's say I'm coding up a radix sort using something like 
 this:

 https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

 What's the high-level program organization with this syntax if 
 we can only use one range at a time? How many work-items get 
 fired off? What's the gpu-code launch procedure?

I am pretty sure they are simply multiplying the index value by 
the unit size they desire to work on

int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float b[]){
    b[k]   = a[k];
    b[k+1] = a[k+1];
}

example3( 0 .. 50 , a,b);

Then likely they are simply executing multiple __KERNEL functions 
in sequence, would be my guess.
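
As a hedged illustration of the 'multiply the index by the unit 
size' idea only (not of the paper's algorithm), here is a minimal 
sketch in present-day D. std.parallelism on the cpu stands in for a 
gpu launch, and the names copyInUnits and unit are hypothetical.

```d
import std.parallelism : parallel;
import std.range : iota;

enum size_t unit = 2; // elements handled per work-item (assumed)

// One work-item per unit: k runs over 0 .. a.length / unit, and
// each work-item touches only its own unit-sized slice.
void copyInUnits(const(int)[] a, int[] b)
{
    assert(a.length == b.length && a.length % unit == 0);
    foreach (k; parallel(iota(a.length / unit)))
    {
        immutable base = k * unit; // index times the unit size
        b[base .. base + unit] = a[base .. base + unit];
    }
}

void main()
{
    auto a = new int[100];
    foreach (i, ref x; a)
        x = cast(int) i;
    auto b = new int[100];
    copyInUnits(a, b);
    assert(a == b);
}
```

With 100 elements and a unit of 2, the range is 0 .. 50 and the 
slices do not overlap.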

Aug 17 2013

"Atash" <nope nope.nope> writes:

On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
 You do have limited Atomics, but you don't really have any sort 
 of complex messages, or anything like that.

I said 'point 11', not 'point 10'. You also dodged points 1 and 
3...

 Intel doesn't have a dog in this race, so their is no way to 
 know what they plan on doing if anything at all.

http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this 
race'. The dog happens to be running with MPI and OpenCL across a 
bridge made of PCIe.

 The reason to point out HSA, is because it is really easy add 
 support for, it is not a giant task like opencl would be. A few 
 changes to the front end compiler is all that is needed, LLVM's 
 backend does the rest.

H'okay. I can accept that.

 OpenCL isn't just a library, it is a language extension, that 
 is ran through a preprocessor that compiles the embedded 
 __KERNEL and __DEVICE functions, into usable code, and then 
 outputs .c/.cpp files for the c compiler to deal with.

But all those extra bits are part of the computing *environment*. 
Is there something wrong with requiring the proper environment 
for an executable?

A more objective question: which devices are you trying to target 
here?

 Those are all platform specific, they change based on the whim 
 and fancy of NVIDIA and AMD with each and every new chip 
 released, The size and configuration of CUDA clusters, or 
 compute clusters, or EU's, or whatever the hell x chip maker 
 feels like using at the moment.

 Long term this will all be managed by the underlying support 
 software in the video drivers, and operating system kernel. 
 Putting any effort into this is a waste of time.

Yes. And the only way to optimize around them is to *know them*, 
otherwise you're pinning the developer down the same way OpenMP 
does. Actually, even worse than the way OpenMP does - at least 
OpenMP lets you set some hints about how many threads you want.

 void example( aggregate in float a[] ; key , in float b[], out
    float c[]) {
 	c[key] = a[key] + b[key];
 }

 example(a,b,c);

 in the function declaration you can think of the aggregate 
 basically having the reserve order of the items in a foreach 
 statement.

 int a[100] = [ ... ];
 int b[100];
 foreach( v, k ; a ) { b = a[k]; }

 int a[100] = [ ... ];
 int b[100];

 void example2( aggregate in float A[] ; k, out float B[] ) { 
 B[k] = A[k]; }

 example2(a,b);

Contextually solid. Read my response to the next bit.

 I am pretty sure they are simply multiplying the index value by 
 the unit size they desire to work on

 int a[100] = [ ... ];
 int b[100];
 void example3( aggregate in range r ; k, in float a[], float 
 b[]){
    b[k]   = a[k];
    b[k+1] = a[k+1];
 }

 example3( 0 .. 50 , a,b);

 Then likely they are simply executing multiple __KERNEL 
 functions in sequence, would be my guess.

I've implemented this algorithm before in OpenCL already, and 
what you're saying so far doesn't rhyme with what's needed.

There are at least two ranges, one keeping track of partial 
summations, the other holding the partial sorts. Three separate 
kernels are run in cycles to reduce over and scatter the data. 
The way early exit is implemented isn't mentioned as part of the 
implementation details, but my implementation of the strategy 
requires a third range to act as a flag array to be reduced over 
and read in between kernel invocations.

It isn't just unit size multiplication - there's communication 
between work-items and *exquisitely* arranged local-group 
reductions and scans (so-called 'warpscans') that take advantage 
of the widely accepted concept of a local group of work-items (a 
parameter you explicitly disregarded) and their shared memory 
pool. The entire point of the paper is that it's possible to come 
up with a general algorithm that can be parameterized to fit 
individual GPU configurations if desired. This kind of algorithm 
provides opportunities for tuning... which seem to be lost, 
unnecessarily, in what I've read so far in your descriptions.
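
To show what 'parameterized to fit individual configurations' 
means in the simplest possible terms, here is a cpu-side sketch in 
D of a two-pass reduction with a tunable group size. It is an 
analogy only: it is not the warpscan from the paper, there is no 
shared local memory involved, and blockedSum and groupSize are 
names invented for this example.

```d
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import std.parallelism : parallel;
import std.range : iota;

// Pass 1: each "group" reduces its own block to a partial sum.
// Pass 2: the partial sums are reduced to the final result.
// groupSize is the knob a caller would tune per device.
float blockedSum(const(float)[] data, size_t groupSize)
{
    assert(groupSize > 0);
    immutable groups = (data.length + groupSize - 1) / groupSize;
    auto partial = new float[groups];

    foreach (g; parallel(iota(groups)))
    {
        immutable lo = g * groupSize;
        immutable hi = min(lo + groupSize, data.length);
        partial[g] = sum(data[lo .. hi]);
    }
    return sum(partial);
}

void main()
{
    auto data = [1f, 2f, 3f, 4f, 5f, 6f, 7f];
    // Different group sizes give the same answer here.
    assert(blockedSum(data, 2) == 28f);
    assert(blockedSum(data, 3) == 28f);
    assert(blockedSum(data, 16) == 28f);
}
```

An interface that hides the group size entirely leaves no place to 
pass a parameter like groupSize.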

My point being, I don't like where this is going by treating 
coprocessors, which have so far been very *very* different from 
one another, as the same batch of *whatever*. I also don't like 
that it's ignoring NVidia, and ignoring Intel's push for 
general-purpose accelerators such as their Xeon Phi.

But, meh, if HSA is so easy, then it's low-hanging fruit, so 
whatever, go ahead and push for it.

=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to 
target here?"

=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do 
you imagine having this support, in terms of support for doing 
low-level things? Is this something like OpenMP, where threading 
and such are done at a really (really really really...) high 
level, or what?

Aug 17 2013

"luminousone" <rd.hunt gmail.com> writes:

On Sunday, 18 August 2013 at 05:05:48 UTC, Atash wrote:
 On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
 You do have limited Atomics, but you don't really have any 
 sort of complex messages, or anything like that.

 I said 'point 11', not 'point 10'. You also dodged points 1 and 
 3...

 Intel doesn't have a dog in this race, so their is no way to 
 know what they plan on doing if anything at all.

 http://software.intel.com/en-us/vcsource/tools/opencl-sdk
 http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

 Just based on those, I'm pretty certain they 'have a dog in 
 this race'. The dog happens to be running with MPI and OpenCL 
 across a bridge made of PCIe.

The Xeon Phi is interesting in so far as taking generic 
programming to a more parallel environment. However it has some 
serious limitations that will heavily damage its potential 
performance.

AVX2 is completely the wrong way to go about improving 
performance in parallel computing. The SIMD nature of this 
instruction set means that scalar operations, or even just not 
being able to fill the giant 256/512-bit registers, waste huge 
chunks of this thing's peak theoretical performance, and if any 
rules apply to instruction pairing on this multi-issue pipeline, 
you have yet more potential for wasted cycles.

I haven't seen anything about Intel's micro-thread scheduler, or 
how these chips handle the mass context switching natural to 
micro-threaded environments. These two items make a huge 
difference in performance. Comparing Radeon VLIW5/4 to Radeon GCN 
is a good example: most of the performance benefit of GCN comes 
from the ease of scheduling scalar pipelines over more complex 
pipes with instruction pairing rules etc.

Frankly, Intel has some cool stuff, but they have been caught 
with their pants down; they depended on their large fab 
advantage to carry them and got lazy.

We likely are watching AMD64 all over again.

 The reason to point out HSA, is because it is really easy add 
 support for, it is not a giant task like opencl would be. A 
 few changes to the front end compiler is all that is needed, 
 LLVM's backend does the rest.

 H'okay. I can accept that.

 OpenCL isn't just a library, it is a language extension, that 
 is ran through a preprocessor that compiles the embedded 
 __KERNEL and __DEVICE functions, into usable code, and then 
 outputs .c/.cpp files for the c compiler to deal with.

 But all those extra bits are part of the computing 
 *environment*. Is there something wrong with requiring the 
 proper environment for an executable?

 A more objective question: which devices are you trying to 
 target here?

At first, simply a different way of approaching 
std.parallelism-like functionality, with an eye toward gpgpu in 
the future when easy integration solutions (such as HSA) pop up.

 Those are all platform specific, they change based on the whim 
 and fancy of NVIDIA and AMD with each and every new chip 
 released, The size and configuration of CUDA clusters, or 
 compute clusters, or EU's, or whatever the hell x chip maker 
 feels like using at the moment.

 Long term this will all be managed by the underlying support 
 software in the video drivers, and operating system kernel. 
 Putting any effort into this is a waste of time.

 Yes. And the only way to optimize around them is to *know 
 them*, otherwise you're pinning the developer down the same way 
 OpenMP does. Actually, even worse than the way OpenMP does - at 
 least OpenMP lets you set some hints about how many threads you 
 want.

It would be best to wait for a more generic software platform, to 
find out how this is handled by the next generation of micro 
threading tools.

The way OpenCL/CUDA work reminds me too much of someone setting 
up tomcat to have java code generate php that runs on their apache 
server, just because they can. I would rather have tighter 
integration with the core language than a language within a 
language.

 void example( aggregate in float a[] ; key , in float b[], out
   float c[]) {
 	c[key] = a[key] + b[key];
 }

 example(a,b,c);

 in the function declaration you can think of the aggregate 
 basically having the reserve order of the items in a foreach 
 statement.

 int a[100] = [ ... ];
 int b[100];
 foreach( v, k ; a ) { b = a[k]; }

 int a[100] = [ ... ];
 int b[100];

 void example2( aggregate in float A[] ; k, out float B[] ) { 
 B[k] = A[k]; }

 example2(a,b);

 Contextually solid. Read my response to the next bit.

 I am pretty sure they are simply multiplying the index value 
 by the unit size they desire to work on

 int a[100] = [ ... ];
 int b[100];
 void example3( aggregate in range r ; k, in float a[], float 
 b[]){
   b[k]   = a[k];
   b[k+1] = a[k+1];
 }

 example3( 0 .. 50 , a,b);

 Then likely they are simply executing multiple __KERNEL 
 functions in sequence, would be my guess.

 I've implemented this algorithm before in OpenCL already, and 
 what you're saying so far doesn't rhyme with what's needed.

 There are at least two ranges, one keeping track of partial 
 summations, the other holding the partial sorts. Three separate 
 kernels are ran in cycles to reduce over and scatter the data. 
 The way early exit is implemented isn't mentioned as part of 
 the implementation details, but my implementation of the 
 strategy requires a third range to act as a flag array to be 
 reduced over and read in between kernel invocations.

 It isn't just unit size multiplication - there's communication 
 between work-items and *exquisitely* arranged local-group 
 reductions and scans (so-called 'warpscans') that take 
 advantage of the widely accepted concept of a local group of 
 work-items (a parameter you explicitly disregarded) and their 
 shared memory pool. The entire point of the paper is that it's 
 possible to come up with a general algorithm that can be 
 parameterized to fit individual GPU configurations if desired. 
 This kind of algorithm provides opportunities for tuning... 
 which seem to be lost, unnecessarily, in what I've read so far 
 in your descriptions.

 My point being, I don't like where this is going by treating 
 coprocessors, which have so far been very *very* different from 
 one another, as the same batch of *whatever*. I also don't like 
 that it's ignoring NVidia, and ignoring Intel's push for 
 general-purpose accelerators such as their Xeon Phi.

 But, meh, if HSA is so easy, then it's low-hanging fruit, so 
 whatever, go ahead and push for it.

 === REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

 "A more objective question: which devices are you trying to 
 target here?"

 === AND SOMETHING ELSE:

 I feel like we're just on different wavelengths. At what level 
 do you imagine having this support, in terms of support for 
 doing low-level things? Is this something like OpenMP, where 
 threading and such are done at a really (really really 
 really...) high level, or what?

Low-level optimization is a wonderful thing, but I almost wonder 
if this will always be something where, in order to do the 
low-level optimization, you will be using the vendor's provided 
platform, as no generic tool will be able to match the custom 
one.

Most of my interaction with the GPU is via shader programs for 
OpenGL; I have only lightly used CUDA for some image processing 
software, so I am certainly not the one to give in-depth detail 
on optimization strategies.

Sorry, point 1 was a typo. I meant:

1. The range must be known prior to execution of a GPU code block.

As for

3. Code blocks can only receive a single range; it can, however, 
be multidimensional.

int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float
   b[]){
   b[k]   = a[k];
}
example3( 0 .. 100 , a,b);

This function would be executed 100 times.

int a[10_000] = [ ... ];
int b[10_000];
void example3( aggregate in range r ; kx,aggregate in range r2 ; 
ky, in float a[], float
   b[]){
   b[kx+(ky*100)]   = a[kx+(ky*100)];
}
example3( 0 .. 100 , 0 .. 100 , a,b);

This function would be executed 10,000 times, with the two 
aggregate ranges being treated as a single two-dimensional range.

Maybe a better description of the rule would be that multiple 
ranges are multiplicative, and functionally operate as a single 
range.
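
To make the "multiplicative" rule concrete, here is a minimal sketch in plain D, assuming the two ranges are simply flattened into one index space and handed to std.parallelism. The name copy2D and its width/height parameters are made up for illustration. This is not the proposed aggregate syntax, and it runs on CPU threads only.

```d
import std.parallelism : parallel;
import std.range : iota;

/// Copies a into b by visiting every (kx, ky) pair of a width x height
/// index space exactly once, using one flattened parallel range.
/// (Hypothetical helper, not part of any proposal or library.)
void copy2D(const(float)[] a, float[] b, size_t width, size_t height)
{
    assert(a.length == width * height && b.length == a.length);
    foreach (flat; parallel(iota(width * height)))
    {
        immutable kx = flat % width;  // position in the first range
        immutable ky = flat / width;  // position in the second range
        b[kx + ky * width] = a[kx + ky * width];
    }
}
```

Calling copy2D(a, b, 100, 100) runs the loop body 100 * 100 = 10,000 times, which matches the count given above for two 0 .. 100 ranges.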

Aug 17 2013

"Atash" <nope nope.nope> writes:

On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
 The Xeon Phi is interesting in so far as taking generic 
 programming to a more parallel environment. However it has some 
 serious limitations that will heavily damage its potential 
 performance.

 AVX2 is completely the wrong path to go about improving 
 performance in parallel computing, The SIMD nature of this 
 instruction set means that scalar operations, or even just not 
 being able to fill the giant 256/512bit register wastes huge 
 chunks of this things peak theoretical performance, and if any 
 rules apply to instruction pairing on this multi issue pipeline 
 you have yet more potential for wasted cycles.

 I haven't seen anything about intels, micro thread scheduler, 
 or how these chips handle mass context switching natural of 
 micro threaded environments, These two items make a huge 
 difference in performance, comparing radeon VLIW5/4 to radeon 
 GCN is a good example, most of the performance benefit of GCN 
 is from easy of scheduling scalar pipelines over more complex 
 pipes with instruction pairing rules etc.

 Frankly Intel, has some cool stuff, but they have been caught 
 with their pants down, they have depended on their large fab 
 advantage to carry them over and got lazy.

 We likely are watching AMD64 all over again.

Well, I can't argue that one.

 A first, simply a different way of approaching std.parallel 
 like functionality, with an eye gpgpu in the future when easy 
 integration solutions popup(such as HSA).

I can't argue with that either.

 It would be best to wait for a more generic software platform, 
 to find out how this is handled by the next generation of micro 
 threading tools.

 The way openCL/CUDA work reminds me to much of someone setting 
 up tomcat to have java code generate php that runs on their 
 apache server, just because they can. I would rather tighter 
 integration with the core language, then having a language in 
 language.

Fair point. I have my own share of idyllic wants, so I can't 
argue with those.

 Low level optimization is a wonderful thing, But I almost 
 wonder if this will always be something where in order todo the 
 low level optimization you will be using the vendors provided 
 platform for doing it, as no generic tool will be able to match 
 the custom one.

But OpenCL is by no means a 'custom tool'. CUDA, maybe, but 
OpenCL just doesn't fit the bill in my opinion. I can see it 
being possible in the future that it'd be considered 'low-level', 
but it's a fairly generic solution. A little hackneyed under your 
earlier metaphors, but still a generic, standard solution.

 Most of my interaction with the gpu is via shader programs for 
 Opengl, I have only lightly used CUDA for some image processing 
 software, So I am certainly not the one to give in depth detail 
 to optimization strategies.

There was a *lot* of stuff that opened up when vendors dumped 
GPGPU out of Pandora's box. If you want to get a feel for some 
optimization strategies and what they require, check this site 
out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm 
not insulting your intelligence here, if I am, I truly apologize).

 sorry on point 1, that was a typo, I meant

 1. The range must be known prior to execution of a gpu code 
 block.

 as for

 3. Code blocks can only receive a single range, it can however 
 be multidimensional

 int a[100] = [ ... ];
 int b[100];
 void example3( aggregate in range r ; k, in float a[], float
   b[]){
   b[k]   = a[k];
 }
 example3( 0 .. 100 , a,b);

 This function would be executed 100 times.

 int a[10_000] = [ ... ];
 int b[10_000];
 void example3( aggregate in range r ; kx,aggregate in range r2 
 ; ky, in float a[], float
   b[]){
   b[kx+(ky*100)]   = a[kx+(ky*100)];
 }
 example3( 0 .. 100 , 0 .. 100 , a,b);

 this function would be executed 10,000 times. the two aggregate 
 ranges being treated as a single 2 dimensional range.

 Maybe a better description of the rule would be that multiple 
 ranges are multiplicative, and functionally operate as a single 
 range.

OH.

I think I was totally misunderstanding you earlier. The 
'aggregate' is the range over the *problem space*, not the values 
being punched into the problem. Is this true or false?

(if true I'm about to feel incredibly sheepish)

Aug 18 2013

"luminousone" <rd.hunt gmail.com> writes:

On Sunday, 18 August 2013 at 07:28:02 UTC, Atash wrote:
 On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
 The Xeon Phi is interesting in so far as taking generic 
 programming to a more parallel environment. However it has 
 some serious limitations that will heavily damage its 
 potential performance.

 AVX2 is completely the wrong path to go about improving 
 performance in parallel computing, The SIMD nature of this 
 instruction set means that scalar operations, or even just not 
 being able to fill the giant 256/512bit register wastes huge 
 chunks of this things peak theoretical performance, and if any 
 rules apply to instruction pairing on this multi issue 
 pipeline you have yet more potential for wasted cycles.

 I haven't seen anything about intels, micro thread scheduler, 
 or how these chips handle mass context switching natural of 
 micro threaded environments, These two items make a huge 
 difference in performance, comparing radeon VLIW5/4 to radeon 
 GCN is a good example, most of the performance benefit of GCN 
 is from easy of scheduling scalar pipelines over more complex 
 pipes with instruction pairing rules etc.

 Frankly Intel, has some cool stuff, but they have been caught 
 with their pants down, they have depended on their large fab 
 advantage to carry them over and got lazy.

 We likely are watching AMD64 all over again.

 Well, I can't argue that one.

 A first, simply a different way of approaching std.parallel 
 like functionality, with an eye gpgpu in the future when easy 
 integration solutions popup(such as HSA).

 I can't argue with that either.

 It would be best to wait for a more generic software platform, 
 to find out how this is handled by the next generation of 
 micro threading tools.

 The way openCL/CUDA work reminds me to much of someone setting 
 up tomcat to have java code generate php that runs on their 
 apache server, just because they can. I would rather tighter 
 integration with the core language, then having a language in 
 language.

 Fair point. I have my own share of idyllic wants, so I can't 
 argue with those.

 Low level optimization is a wonderful thing, But I almost 
 wonder if this will always be something where in order todo 
 the low level optimization you will be using the vendors 
 provided platform for doing it, as no generic tool will be 
 able to match the custom one.

 But OpenCL is by no means a 'custom tool'. CUDA, maybe, but 
 OpenCL just doesn't fit the bill in my opinion. I can see it 
 being possible in the future that it'd be considered 
 'low-level', but it's a fairly generic solution. A little 
 hackneyed under your earlier metaphors, but still a generic, 
 standard solution.

I can agree with that.

 Most of my interaction with the gpu is via shader programs for 
 Opengl, I have only lightly used CUDA for some image 
 processing software, So I am certainly not the one to give in 
 depth detail to optimization strategies.

 There was a *lot* of stuff that opened up when vendors dumped 
 GPGPU out of Pandora's box. If you want to get a feel for some 
 optimization strategies and what they require, check this site 
 out: http://www.bealto.com/gpu-sorting_intro.html (and I hope 
 I'm not insulting your intelligence here, if I am, I truly 
 apologize).

I am still learning, and additional links to go over never hurt! 
I am of the opinion that a good programmer has never finished 
learning new stuff.

 sorry on point 1, that was a typo, I meant

 1. The range must be known prior to execution of a gpu code 
 block.

 as for

 3. Code blocks can only receive a single range, it can however 
 be multidimensional

 int a[100] = [ ... ];
 int b[100];
 void example3( aggregate in range r ; k, in float a[], float
  b[]){
  b[k]   = a[k];
 }
 example3( 0 .. 100 , a,b);

 This function would be executed 100 times.

 int a[10_000] = [ ... ];
 int b[10_000];
 void example3( aggregate in range r ; kx,aggregate in range r2 
 ; ky, in float a[], float
  b[]){
  b[kx+(ky*100)]   = a[kx+(ky*100)];
 }
 example3( 0 .. 100 , 0 .. 100 , a,b);

 this function would be executed 10,000 times. the two 
 aggregate ranges being treated as a single 2 dimensional range.

 Maybe a better description of the rule would be that multiple 
 ranges are multiplicative, and functionally operate as a 
 single range.

 OH.

 I think I was totally misunderstanding you earlier. The 
 'aggregate' is the range over the *problem space*, not the 
 values being punched into the problem. Is this true or false?

 (if true I'm about to feel incredibly sheepish)

Is "problem space" the correct industry term? I am self-taught on 
much of this, so on occasion I miss out on what the correct 
terminology for something is.

But yes, that is what I meant.

Aug 18 2013

"luminousone" <rd.hunt gmail.com> writes:

I chose the term aggregate, because it is the term used in the 
description of the foreach syntax.

foreach( value, key ; aggregate )

With the aggregate being an array or a range, the term seems to 
fit: even when the aggregate is an array, you still implicitly 
have a range of "0 .. array.length", and the foreach creates a key 
or index position in addition to the value.

A wrapped function could very easily behave like the intended 
initial outcome:

void example( ref float a[], float b[], float c[] ) {

    foreach( v, k ; a ) {
       a[k] = b[k] + c[k];
    }
}

is functionally the same as

void example( aggregate ref float a[] ; k, float b[], float c[] ) 
{
    a[k] = b[k] + c[k];
}

Maybe : would make more sense than ; but I am not sure of the 
best way to represent that index value.
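
For comparison, the same equivalence can be sketched in D as it exists today. This is only an approximation under stated assumptions: launch and addAt are hypothetical names, the "kernel" receives its index as an ordinary first parameter, and everything runs on CPU threads through std.parallelism. Note also that in current D the index variable is listed before the value in a foreach.

```d
import std.parallelism : parallel;
import std.range : iota;

/// Per-index body: roughly what an `aggregate` function would contain.
void addAt(size_t k, float[] a, const(float)[] b, const(float)[] c)
{
    a[k] = b[k] + c[k];
}

/// Hypothetical launcher: runs `kernel` once for every index in 0 .. n.
void launch(alias kernel, Args...)(size_t n, Args args)
{
    foreach (k; parallel(iota(n)))
        kernel(k, args);
}

/// The wrapped form: an ordinary loop over the same index range.
void addLoop(float[] a, const(float)[] b, const(float)[] c)
{
    foreach (k, ref v; a)   // index first, then the element
        v = b[k] + c[k];
}
```

With these definitions, launch!addAt(a.length, a, b, c) and addLoop(a, b, c) should leave the same values in a.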

Aug 18 2013

"Atash" <nope nope.nope> writes:

I'm not sure if 'problem space' is the industry standard term (in 
fact I doubt it), but it's certainly a term I've used over the 
years by taking a leaf out of math books and whatever my 
professors had touted. :-D I wish I knew what the standard term 
was, but for now I'm latching onto that because it seems to 
describe at a high implementation-agnostic level what's up, and 
in my personal experience most people seem to 'get it' when I use 
the term - it has empirically had an accurate connotation.

That all said, I'd like to know what the actual term is, too. -.-'

On Sunday, 18 August 2013 at 08:21:18 UTC, luminousone wrote:
 I chose the term aggregate, because it is the term used in the 
 description of the foreach syntax.

 foreach( value, key ; aggregate )

 aggregate being an array or range, it seems to fit as even when 
 the aggregate is an array, as you still implicitly have a range 
 being "0 .. array.length", and will have a key or index 
 position created by the foreach in addition to the value.

 A wrapped function could very easily be similar to the intended 
 initial outcome

 void example( ref float a[], float b[], float c[] ) {

    foreach( v, k ; a ) {
       a[k] = b[k] + c[k];
    }
 }

 is functionally the same as

 void example( aggregate ref float a[] ; k, float b[], float c[] 
 ) {
    a[k] = b[k] + c[k];
 }

 maybe : would make more sense then ; but I am not sure as to 
 the best way to represent that index value.

Aye, that makes awesome sense, but I'm left wishing that there 
was something in that syntax to support access to local/shared 
memory between work-items. Or, better yet, some way of hinting at 
desired amounts of memory in the various levels of the non-global 
memory hierarchy and a way of accessing those requested 
allocations.
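
As a rough picture of what such access buys you, here is a CPU-only sketch in D of a per-group reduction. Everything in it is an assumption made for illustration: groupSize is an arbitrary work-group size, the fixed-size scratch array merely stands in for local memory, and groupSums is a made-up name rather than anything from OpenCL or a D library.

```d
import std.algorithm.comparison : min;
import std.parallelism : parallel;
import std.range : iota;

enum size_t groupSize = 64; // assumed work-group size (power of two)

/// Sums `data` group by group. Each group stages its slice in a small
/// scratch buffer standing in for local memory, then writes one partial sum.
float[] groupSums(const(float)[] data)
{
    immutable groups = (data.length + groupSize - 1) / groupSize;
    auto partial = new float[](groups);
    foreach (g; parallel(iota(groups)))
    {
        float[groupSize] local = 0;   // "local memory" for this group
        immutable start = g * groupSize;
        immutable end = min(start + groupSize, data.length);
        local[0 .. end - start] = data[start .. end];
        // Tree reduction inside the group: halve the active range each step.
        for (size_t stride = groupSize / 2; stride > 0; stride /= 2)
            foreach (i; 0 .. stride)
                local[i] += local[i + stride];
        partial[g] = local[0];
    }
    return partial;
}
```

The point of the sketch is that the size of the scratch buffer is a tuning parameter the kernel author wants to control, which is exactly what a bare index-only syntax has no place for.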

I mean, I haven't *seen* anyone do anything device-wise with more 
hierarchical levels than just global-shared-private, but it's 
always bothered me that in OpenCL we could only specify memory 
allocations on those three levels. What if someone dropped in 
another hierarchical level? Suddenly it'd open another door to 
optimization of code, and there'd be no way for OpenCL to access 
it. Or what if someone scrapped local memory altogether, for 
whatever reason? The industry may never add/remove such memory 
levels, but, still, it just feels... kinda wrong that OpenCL 
doesn't have an immediate way of saying, "A'ight, it's cool that 
you have this, Mr. XYZ-compute-device, I can deal with it," 
before proceeding to put on sunglasses and driving away in a 
Ferrari. Or something like that.

Aug 18 2013

"luminousone" <rd.hunt gmail.com> writes:

 The core (!) point here is that processor chips are rapidly 
 becoming a
 collection of heterogeneous cores. Any programming language 
 that assumes
 a single CPU or a collection of homogeneous CPUs has built-in
 obsolescence.

 So the question I am interested in is whether D is the language 
 that can
 allow me to express in a single codebase a program in which 
 parts will
 be executed on one or more GPGPUs and parts on multiple CPUs. D 
 has
 support for the latter, std.parallelism and std.concurrency.

 I guess my question is whether people are interested in 
 std.gpgpu (or
 some more sane name).

CUDA works as a preprocessor pass that generates C files from 
.cu extension files.

In effect, to create a sensible environment for microthreaded 
programming, they extend the language.

A basic CUDA function looks something like...

__global__ void add( float * a, float * b, float * c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

add<<< 1, 10 >>>( ptrA, ptrB, ptrC );

There are built-in variables to handle the index location, 
threadIdx.x in the above example; this is something generated by 
the thread scheduler in the video card/APU device.

Generally, calls to this setup have a very high latency, so using 
it for a small handful of items as in the above example makes 
no sense. The above example would end up using a single 
execution cluster, and leave you prey to the latency of the PCIe 
bus, execution time, and latency costs of the video memory.

It doesn't become effective until you are working with large data 
sets that can take advantage of a massive number of threads, 
where the latency problems are secondary to the sheer volume of 
calculations done.

As far as D goes, we really only have one built-in 
microthreading-capable language construct: foreach.

However, I don't think a library extension similar to 
std.parallelism would work for GPU-based microthreading.

foreach would need something to tell the compiler to 
generate GPU bytecode for the code block it uses, and would need 
instructions on when to use said code block based on dataset size.
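
A minimal sketch of that size-based decision, written in today's D with CPU threads standing in for the GPU path. The cut-off offloadThreshold is purely hypothetical (a real value would have to be measured per device), and the function name add is chosen only to echo the CUDA example.

```d
import std.parallelism : parallel;
import std.range : iota;

/// Hypothetical cut-off; a real value would have to be measured per device.
enum size_t offloadThreshold = 1 << 16;

/// Adds b and c into a, picking the execution path from the data set size.
/// Returns true if the parallel path (a stand-in for a GPU launch) was used.
bool add(float[] a, const(float)[] b, const(float)[] c)
{
    if (a.length < offloadThreshold)
    {
        a[] = b[] + c[];          // small input: stay on the calling thread
        return false;
    }
    foreach (k; parallel(iota(a.length)))
        a[k] = b[k] + c[k];       // large input: worth the launch overhead
    return true;
}
```

Whether such a threshold belongs in the compiler, a library, or the caller's hands is left open here.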

While it is completely possible to change functions very little 
(just add a new property, @microthreaded, and the built-in 
variables for the index position/s), the calling syntax would need 
changes to support a work range or multidimensional range of some 
sort.

perhaps looking something like....

add$(1 .. 10)(ptrA,ptrB,ptrC);

a templated function looking similar

add!(float)$(1 .. 10)(ptrA,ptrB,ptrC);

Aug 16 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
 The core (!) point here is that processor chips are rapidly 
 becoming a
 collection of heterogeneous cores. Any programming language 
 that assumes
 a single CPU or a collection of homogeneous CPUs has built-in
 obsolescence.

 So the question I am interested in is whether D is the 
 language that can
 allow me to express in a single codebase a program in which 
 parts will
 be executed on one or more GPGPUs and parts on multiple CPUs. 
 D has
 support for the latter, std.parallelism and std.concurrency.

 I guess my question is whether people are interested in 
 std.gpgpu (or
 some more sane name).

 CUDA, works as a preprocessor pass that generates c files from 
 .cu extension files.

 In effect, to create a sensible environment for microthreaded 
 programming, they extend the language.

 a basic CUDA function looking something like...

 __global__ void add( float * a, float * b, float * c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
 }

 add<<< 1, 10 >>>( ptrA, ptrB, ptrC );

 Their is the buildin variables to handle the index location 
 threadIdx.x in the above example, this is something generated 
 by the thread scheduler in the video card/apu device.

 Generally calls to this setup has a very high latency, so using 
 this for a small handful of items as in the above example makes 
 no sense. In the above example that would end up using a single 
 execution cluster, and leave you prey to the latency of the 
 pcie bus, execution time, and latency costs of the video memory.

 it doesn't get effective until you are working with large data 
 sets, that can take advantage of a massive number of threads 
 where the latency problems would be secondary to the sheer 
 calculations done.

 as far as D goes, we really only have one build in 
 microthreading capable language construct, foreach.

 However I don't think a library extension similar to 
 std.parallelism would work gpu based microthreading.

 foreach would need to have something to tell the compiler to 
 generate gpu bytecode for the code block it uses, and would 
 need instructions on when to use said code block based on 
 dataset size.

 while it is completely possible to have very little change with 
 function just add new property  microthreaded and the build in 
 variables for the index position/s, the calling syntax would 
 need changes to support a work range or multidimensional range 
 of some sort.

 perhaps looking something like....

 add$(1 .. 10)(ptrA,ptrB,ptrC);

 a templated function looking similar

 add!(float)$(1 .. 10)(ptrA,ptrB,ptrC);

We have a[] = b[] * c[] - 5; etc. which could work very neatly 
perhaps?
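
For reference, this is how that array operation looks as a small, compilable function in current D. The wrapper name scaleAndShift is made up; the a[] = b[] * c[] - 5 statement itself is standard D, and nothing here targets a GPU.

```d
/// Element-wise a = b * c - 5 using D's built-in array operations.
float[] scaleAndShift(const(float)[] b, const(float)[] c)
{
    assert(b.length == c.length);
    auto a = new float[](b.length);
    a[] = b[] * c[] - 5.0f;   // one statement, applied to every element
    return a;
}
```

The appeal is that the statement carries no explicit index at all, so the range to run over is implied by the operands.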

Aug 16 2013

"luminousone" <rd.hunt gmail.com> writes:

On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
 We have a[] = b[] * c[] - 5; etc. which could work very neatly 
 perhaps?

While this could in fact work, given the nature of GPGPU it would
not be very effective.

In a setup without shared memory or cache coherency, all three
arrays have to be copied into GPU memory in their entirety, the
statement run as GPU bytecode, and the result copied back to
complete the operation.

GPGPU doesn't make sense for small code blocks, judged both by
instruction count and by how memory-bound a particular statement
would be.

The compiler needs to either be explicitly told what can/should
be run as a GPU function, or have some intelligence about what to
run or not run as a GPU function.

This will get better in the future: APUs using the full HSA
implementation will drastically reduce the "buy-in" latency/cycle
cost of using a GPGPU function, and make them more practical for
smaller (in instruction count/memory-boundedness) operations.

Aug 16 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 16 August 2013 at 22:11:41 UTC, luminousone wrote:
 On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
 We have a[] = b[] * c[] - 5; etc. which could work very neatly 
 perhaps?

 While this in fact could work, given the nature of GPGPU it 
 would
 not be very effective.

 In a non shared memory and non cache coherent setup, the 
 entirety
 of all 3 arrays have to be copied into GPU memory, had that
 statement ran as gpu bytecode, and then copied back to complete
 the operation.

 GPGPU doesn't make sense on small code blocks, both in
 instruction count, and by how memory bound a particular 
 statement
 would be.

 The compiler needs to either be explicitly told what can/should
 be ran as a GPU function, or have some intelligence about what 
 to
 or not to run as a GPU function.

 This will get better in the future, APU's using the full HSA
 implementation will drastically reduce the "buyin" latency/cycle
 cost of using a GPGPU function, and make them more practical for
 smaller(in instruction count/memory boundess) operations.

I didn't literally mean automatically inserting GPU code.

I was more imagining this:

void foo(T)(T[] arr)
{
     useArray(arr);
}

auto a = someLongArray;
auto b = someOtherLongArray;

gpu
{
     auto aGPU = toGPUMem(a);
     auto bGPU = toGPUMem(b);

     auto c = GPUArr(a.length);

     c[] = a[] * b[];

     auto cCPU = toCPUMem(c);
     c.foo();

     dot(c, iota(c.length).array().toGPUMem())
         .foo();
}

gpu T dot(T)(T[] a, T[] b)
{
     //gpu dot product
}


with cpu arrays and gpu arrays identified separately in the type 
system. Automatic conversions could be possible, but of course 
that would allow carelessness.
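
A rough sketch of what "identified separately in the type system" could mean in plain D. All names are placeholders taken from or modelled on the snippet above (GPUArr, toGPUMem, toCPUMem, plus a made-up multiply), and the "device" storage is just a host-side copy so the code runs without any GPU.

```d
/// Hypothetical wrapper marking data that lives in device memory.
/// Here it only wraps a host copy so the sketch runs without a GPU.
struct GPUArr(T)
{
    private T[] storage;
    size_t length() const { return storage.length; }
}

/// Explicit upload: the only intended way to obtain a GPUArr.
GPUArr!T toGPUMem(T)(const(T)[] host)
{
    return GPUArr!T(host.dup);
}

/// Explicit download back into an ordinary array.
T[] toCPUMem(T)(GPUArr!T dev)
{
    return dev.storage.dup;
}

/// Accepts device arrays only; passing a plain T[] does not compile.
GPUArr!T multiply(T)(GPUArr!T a, GPUArr!T b)
{
    assert(a.length == b.length);
    auto result = GPUArr!T(new T[](a.length));
    result.storage[] = a.storage[] * b.storage[];
    return result;
}
```

Because there is no implicit conversion between T[] and GPUArr!T, every transfer shows up as a visible toGPUMem or toCPUMem call, which is the carelessness guard mentioned above.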

Obviously there is some cpu code mixed in with the gpu code 
there, which should be executed asynchronously if possible. You 
could also have
onlyGPU
{
     //only code that can all be executed on the GPU.
}



Just ideas off the top of my head. Definitely full of holes and I 
haven't really considered the detail :)

Aug 16 2013

"luminousone" <rd.hunt gmail.com> writes:

On Friday, 16 August 2013 at 23:30:12 UTC, John Colvin wrote:
 On Friday, 16 August 2013 at 22:11:41 UTC, luminousone wrote:
 On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
 We have a[] = b[] * c[] - 5; etc. which could work very 
 neatly perhaps?

 While this in fact could work, given the nature of GPGPU it 
 would
 not be very effective.

 In a non shared memory and non cache coherent setup, the 
 entirety
 of all 3 arrays have to be copied into GPU memory, had that
 statement ran as gpu bytecode, and then copied back to complete
 the operation.

 GPGPU doesn't make sense on small code blocks, both in
 instruction count, and by how memory bound a particular 
 statement
 would be.

 The compiler needs to either be explicitly told what can/should
 be ran as a GPU function, or have some intelligence about what 
 to
 or not to run as a GPU function.

 This will get better in the future, APU's using the full HSA
 implementation will drastically reduce the "buyin" 
 latency/cycle
 cost of using a GPGPU function, and make them more practical 
 for
 smaller(in instruction count/memory boundess) operations.

 I didn't literally mean automatically inserting GPU code.

 I was more imagining this:

 void foo(T)(T[] arr)
 {
     useArray(arr);
 }

 auto a = someLongArray;
 auto b = someOtherLongArray;

 gpu
 {
     auto aGPU = toGPUMem(a);
     auto bGPU = toGPUMem(b);

     auto c = GPUArr(a.length);

     c[] = a[] * b[];

     auto cCPU = toCPUMem(c);
     c.foo();

     dot(c, iota(c.length).array().toGPUMem())
         .foo();
 }

 gpu T dot(T)(T[] a, T[] b)
 {
     //gpu dot product
 }


 with cpu arrays and gpu arrays identified separately in the 
 type system. Automatic conversions could be possible, but of 
 course that would allow carelessness.

 Obviously there is some cpu code mixed in with the gpu code 
 there, which should be executed asynchronously if possible. You 
 could also have
 onlyGPU
 {
     //only code that can all be executed on the GPU.
 }



 Just ideas off the top of my head. Definitely full of holes and 
 I haven't really considered the detail :)


You can't mix cpu and gpu code, they must be separate.

auto a = someLongArray;
auto b = someOtherLongArray;

auto aGPU = toGPUMem(a);
auto bGPU = toGPUMem(b);

auto c = GPUArr(a.length);

gpu
{
     // this block is one gpu shader program
     c[] = a[] * b[];
}

auto cCPU = toCPUMem(c);
cCPU.foo();
auto cGPU = toGPUMem(cCPU);
auto dGPU = iota(c.length).array().toGPUMem();

gpu{
     // this block is another wholly separate shader program
     auto resultGPU = dot(cGPU, dGPU);
}

auto resultCPU = toCPUMem(resultGPU);
resultCPU.foo();

gpu T dot(T)(T[] a, T[] b)
{
     //gpu dot product
}


Your example rewritten to fit the gpu.

However, this still has the problem that CPU code has to be 
generated from the contents of gpu{} code blocks, as the GPU is 
unable to allocate memory. So, for example,

gpu{
     auto resultGPU = dot(c, cGPU);
}

likely either won't work, or generates an array allocation in CPU 
code before the gpu block is otherwise run.

Also, how does that dot product function know the correct index 
range to run on? Are we assuming it knows based on the length of 
a? While the syntax

c[] = a[] * b[];

is safe for this sort of call, a function is less safe to do this 
with. With function calls the range needs to be passed to the 
function, and you would call this function without the gpu{} 
block, as the function itself is marked.

auto resultGPU = dot$(0 .. 
returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);



Remember, with GPUs you don't send instructions, you send whole 
programs, and the whole program must finish before you can move 
on to the next CPU instruction.

Aug 16 2013

"Atash" <nope nope.nope> writes:

On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
 You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same. I agree 
with that. That said, this doesn't preclude having CUDA-like 
behavior where small functions could be written that don't 
violate the constraints of GPU code and simultaneously have 
semantics that could be executed on the CPU, and where such small 
functions are then allowed to be called from both CPU and GPU 
code.

 However this still has problems of the cpu having to generate 
 CPU code from the contents of gpu{} code blocks, as the GPU is 
 unable to allocate memory, so for example ,

 gpu{
     auto resultGPU = dot(c, cGPU);
 }

 likely either won't work, or generates an array allocation in 
 cpu code before the gpu block is otherwise ran.

I wouldn't be so negative with the 'won't work' bit, 'cuz frankly 
the 'or' you wrote there is semantically like what OpenCL and 
CUDA do anyway.

 Also how does that dot product function know the correct index 
 range to run on?, are we assuming it knows based on the length 
 of a?, while the syntax,

 c[] = a[] * b[];

 is safe for this sort of call, a function is less safe todo 
 this with, with function calls the range needs to be told to 
 the function, and you would call this function without the 
 gpu{} block as the function itself is marked.

 auto resultGPU = dot$(0 .. 
 returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

I think it was mentioned earlier that there should be, much like 
in OpenCL or CUDA, builtins or otherwise available symbols for 
getting the global identifier of each work-item, the work-group 
size, global size, etc.
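
A small sketch of how such identifiers could be surfaced without new keywords, assuming they are bundled into a struct and passed to the kernel. WorkItem, launch, and tagGroup are hypothetical names; the field names follow OpenCL's terminology loosely, and the launch is simulated with std.parallelism on the CPU.

```d
import std.parallelism : parallel;
import std.range : iota;

/// Hypothetical per-invocation context, mirroring the identifiers that
/// OpenCL and CUDA expose as built-ins.
struct WorkItem
{
    size_t globalId;   // index within the whole launch
    size_t localId;    // index within the work-group
    size_t groupId;    // which work-group this item belongs to
    size_t groupSize;  // items per work-group
    size_t globalSize; // total number of items
}

/// Runs `kernel` once per work-item, filling in the context on the CPU.
void launch(alias kernel, Args...)(size_t globalSize, size_t groupSize, Args args)
{
    foreach (id; parallel(iota(globalSize)))
    {
        auto item = WorkItem(id, id % groupSize, id / groupSize,
                             groupSize, globalSize);
        kernel(item, args);
    }
}

/// Example kernel: records which group processed each element.
void tagGroup(WorkItem item, size_t[] output)
{
    output[item.globalId] = item.groupId;
}
```

With this shape the function itself never needs to be told a range; launch!tagGroup(8, 4, output) supplies the global and group sizes at the call site.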

 Remember with gpu's you don't send instructions, you send whole 
 programs, and the whole program must finish before you can move 
 onto the next cpu instruction.

I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Perhaps by default the behavior could 
be helpful for sequencing global memory in the GPU with CPU 
operations, but it's not a necessary behavior (see OpenCL and 
its, in my opinion, really nice queuing mechanism).

=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.

*This* (and some small number of helpers), rather than 
straight-up dumping a new keyword and block type into the 
language. I really don't think D *needs* to have this any lower 
level than a library based solution, because it already has the 
tools to make it ridiculously more convenient than C/C++ (not 
necessarily as much as CUDA's totally separate program nvcc does, 
but a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
   // brings in the kernel keywords and whatnot depending on __FUNCTION__
   // (because mixins eval where they're mixed in)
   mixin KernelDefs;
   // ^ and that's just about all the syntactic noise, the rest uses mixed-in
   //   keywords and the glbmem object to define several expressions that
   //   effectively record the operations to be performed into the return type

   // assignment into global memory recovers the expression type in the glbmem.
   glbmem[glbid] += 4;

   // This assigns the *expression* glbmem[glbid] to val.
   auto val = glbmem[glbid];

   // Ignoring that this has a data race, this exemplifies recapturing the
   // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
   glbmem[glbid+1] = val;

   return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's future (or something to that extent). 
For convenience, let's say that it blocks on any read other than 
some special poll/checking mechanism.

The constraints imposed on the kernel functions are generalizable 
to even execute the code on the CPU, as the launching call ( 
Gpu.call!myFun(buffer) ) can, instead of using an 
expression-buffer, just pass a normal array in and have the 
proper result pop out given some interaction between the 
identifiers mixed in by KernelDefs and the launching caller (ex. 
using a loop).

With CTFE, this method *I think* can also generate the code at 
compile time given the proper kind of 
expression-type-recording-BufferT.

Again, though, this requires a significant amount of 
metaprogramming, heavy abuse of auto, and... did I mention a 
significant amount of metaprogramming? It's roughly the same 
method I used to embed OpenCL code in a C++ project of mine 
without writing a single line of OpenCL code, however, so I 
*know* it's doable, likely even moreso, in D.

Aug 16 2013

"Atash" <nope nope.nope> writes:

On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
 You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same. I agree 
with that.

That said, this doesn't preclude having CUDA-like behavior where 
small functions could be written that don't violate the 
constraints of GPU code and simultaneously has semantics that 
could be executed on the CPU, and where such small functions are 
then allowed to be called from both CPU and GPU code.

 However this still has problems of the cpu having to generate 
 CPU code from the contents of gpu{} code blocks, as the GPU is 
 unable to allocate memory, so for example ,

 gpu{
     auto resultGPU = dot(c, cGPU);
 }

 likely either won't work, or generates an array allocation in 
 cpu code before the gpu block is otherwise ran.

I'm fine with an array allocation. I'd 'prolly have to do it 
anyway.

 Also how does that dot product function know the correct index 
 range to run on?, are we assuming it knows based on the length 
 of a?, while the syntax,

 c[] = a[] * b[];

 is safe for this sort of call, a function is less safe todo 
 this with, with function calls the range needs to be told to 
 the function, and you would call this function without the 
 gpu{} block as the function itself is marked.

 auto resultGPU = dot$(0 .. 
 returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

'Dat's a point.

 Remember with gpu's you don't send instructions, you send whole 
 programs, and the whole program must finish before you can move 
 onto the next cpu instruction.

I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Perhaps by default the behavior could 
be helpful for sequencing global memory in the GPU with CPU 
operations, but it's not a *necessary* behavior.

Well, I disagree with the assumption assuming said assumption is 
being made and I'm not just misreading that bit. :-P

=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.

I'd prefer this (and some small number of helpers) rather than 
straight-up dumping a new keyword and block type into the 
language. I really don't think D *needs* to have this any lower 
level than a library based solution, because it already has the 
tools to make it ridiculously more convenient than C/C++ (not 
necessarily as much as CUDA's totally separate program nvcc does, 
but a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
   // brings in the kernel keywords and whatnot depending on __FUNCTION__
   // (because mixins eval where they're mixed in)
   mixin KernelDefs;
   // ^ and that's just about all the syntactic noise, the rest uses mixed-in
   //   keywords and the glbmem object to define several expressions that
   //   effectively record the operations to be performed into the return type

   // assignment into global memory recovers the expression type in the glbmem.
   glbmem[glbid] += 4;

   // This assigns the *expression* glbmem[glbid] to val.
   auto val = glbmem[glbid];

   // Ignoring that this has a data race, this exemplifies recapturing the
   // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
   glbmem[glbid+1] = val;

   return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's future (or something to that extent). 
For convenience, let's say that it blocks on any read other than 
some special poll/checking mechanism.

The constraints imposed on the kernel functions are generalizable 
to even execute the code on the CPU, as the launching call ( 
Gpu.call!myFun(buffer) ) can, instead of using an 
expression-buffer, just pass a normal array in and have the 
proper result pop out given some interaction between the 
identifiers mixed in by KernelDefs and the launching caller (ex. 
using a loop).
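
To make the CPU path concrete, here is a minimal sketch. It assumes the simplest possible arrangement, where the identifier glbid is just a parameter handed in by the launcher rather than something mixed in by KernelDefs. The names addFour and cpuCall are hypothetical and exist only for this illustration.

```d
import std.stdio : writeln;

// Hypothetical kernel body: handles exactly one work-item, identified by glbid.
void addFour(T)(T[] glbmem, size_t glbid)
{
    glbmem[glbid] += 4;
}

// Hypothetical CPU launcher: stands in for the GPU's work-item scheduler by
// looping over the global range and passing each index to the kernel.
void cpuCall(alias kernel, T)(T[] buffer)
{
    foreach (glbid; 0 .. buffer.length)
        kernel(buffer, glbid);
}

void main()
{
    auto buffer = [1, 2, 3];
    cpuCall!addFour(buffer);
    writeln(buffer); // [5, 6, 7]
}
```

The same addFour could, in principle, be handed to a GPU launcher that supplies glbid from the device instead of from a loop.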

Alternatively to returning the captured expressions, the argument 
glbmem could have been passed ref, and the same sort of 
expression capturing could occur. Heck, more arguments could've 
been passed, too, this doesn't require there to be one single 
argument representing global memory.

With CTFE, this method *I think* can also generate the code at 
compile time given the proper kind of 
expression-type-recording-BufferT.
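
A rough sketch of what such a recording type might look like, and of the CTFE part, follows. Everything here (GlbId, RecordingBuffer, recordMyFun, the emitted variable name gid) is made up for the illustration; it only handles the three statements from the example above and is nowhere near a general solution.

```d
import std.conv : to;
import std.stdio : write;

// Hypothetical symbolic index: stands for get_global_id(0) plus an offset.
struct GlbId
{
    long offset;

    GlbId opBinary(string op : "+")(long n) const
    {
        return GlbId(offset + n);
    }

    string toCode() const
    {
        return offset == 0 ? "gid" : "gid + " ~ offset.to!string;
    }
}

// Hypothetical recording buffer: indexing and assignment append OpenCL C
// statements to `code` instead of touching any memory.
struct RecordingBuffer
{
    string name;
    string code;

    // glbmem[i] read as a value yields the expression text.
    string opIndex(GlbId i) const
    {
        return name ~ "[" ~ i.toCode ~ "]";
    }

    // glbmem[i] op= n
    void opIndexOpAssign(string op)(long n, GlbId i)
    {
        code ~= "    " ~ this[i] ~ " " ~ op ~ "= " ~ n.to!string ~ ";\n";
    }

    // glbmem[i] = expr
    void opIndexAssign(string expr, GlbId i)
    {
        code ~= "    " ~ this[i] ~ " = " ~ expr ~ ";\n";
    }
}

// Runs the kernel body against the recording types and wraps the result.
string recordMyFun()
{
    auto glbmem = RecordingBuffer("glbmem");
    auto glbid = GlbId(0);

    glbmem[glbid] += 4;
    auto val = glbmem[glbid];
    glbmem[glbid + 1] = val;

    return "__kernel void myFun(__global int* glbmem)\n{\n"
        ~ "    size_t gid = get_global_id(0);\n"
        ~ glbmem.code ~ "}\n";
}

// `enum` forces the recording to happen at compile time.
enum myFunSource = recordMyFun();

void main()
{
    write(myFunSource);
}
```

Because recordMyFun is an ordinary function, the same code can also be called at runtime if the kernel depends on values not known until then.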

Again, though, all this requires a significant amount of 
metaprogramming, heavy abuse of auto, and... did I mention a 
significant amount of metaprogramming? It's roughly the same 
method I used to embed OpenCL code in a C++ project of mine 
without writing a single line of OpenCL code, however, so I 
*know* it's doable, likely even moreso, in D.

Aug 16 2013

"luminousone" <rd.hunt gmail.com> writes:

On Saturday, 17 August 2013 at 06:09:53 UTC, Atash wrote:
 On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
 You can't mix cpu and gpu code, they must be separate.

 H'okay, let's be clear here. When you say 'mix CPU and GPU 
 code', you mean you can't mix them physically in the compiled 
 executable for all currently extant cases. They aren't the 
 same. I agree with that.

 That said, this doesn't preclude having CUDA-like behavior 
 where small functions could be written that don't violate the 
 constraints of GPU code and simultaneously has semantics that 
 could be executed on the CPU, and where such small functions 
 are then allowed to be called from both CPU and GPU code.

 However this still has problems of the cpu having to generate 
 CPU code from the contents of gpu{} code blocks, as the GPU is 
 unable to allocate memory, so for example ,

 gpu{
    auto resultGPU = dot(c, cGPU);
 }

 likely either won't work, or generates an array allocation in 
 cpu code before the gpu block is otherwise ran.

 I'm fine with an array allocation. I'd 'prolly have to do it 
 anyway.

 Also how does that dot product function know the correct index 
 range to run on?, are we assuming it knows based on the length 
 of a?, while the syntax,

 c[] = a[] * b[];

 is safe for this sort of call, a function is less safe todo 
 this with, with function calls the range needs to be told to 
 the function, and you would call this function without the 
 gpu{} block as the function itself is marked.

 auto resultGPU = dot$(0 .. 
 returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

 'Dat's a point.

 Remember with gpu's you don't send instructions, you send 
 whole programs, and the whole program must finish before you 
 can move onto the next cpu instruction.

 I disagree with the assumption that the CPU must wait for the 
 GPU while the GPU is executing. Perhaps by default the behavior 
 could be helpful for sequencing global memory in the GPU with 
 CPU operations, but it's not a *necessary* behavior.

 Well, I disagree with the assumption assuming said assumption 
 is being made and I'm not just misreading that bit. :-P

 === Another thing...

 I'm with luminousone's suggestion for some manner of function 
 attribute, to the tune of several metric tonnes of chimes. Wind 
 chimes. I'm supporting this suggestion with at least a metric 
 tonne of wind chimes.

 I'd prefer this (and some small number of helpers) rather than 
 straight-up dumping a new keyword and block type into the 
 language. I really don't think D *needs* to have this any lower 
 level than a library based solution, because it already has the 
 tools to make it ridiculously more convenient than C/C++ (not 
 necessarily as much as CUDA's totally separate program nvcc 
 does, but a huge amount).

 ex.


  kernel auto myFun(BufferT)(BufferT glbmem)
 {
   // brings in the kernel keywords and whatnot depending 
 __FUNCTION__
   // (because mixins eval where they're mixed in)
   mixin KernelDefs;
   // ^ and that's just about all the syntactic noise, the rest 
 uses mixed-in
   //   keywords and the glbmem object to define several 
 expressions that
   //   effectively record the operations to be performed into 
 the return type

   // assignment into global memory recovers the expression type 
 in the glbmem.
   glbmem[glbid] += 4;

   // This assigns the *expression* glbmem[glbid] to val.
   auto val = glbmem[glbid];

   // Ignoring that this has a data race, this exemplifies 
 recapturing the
   // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
   glbmem[glbid+1] = val;

   return glbmem; ///< I lied about the syntactic noise. This is 
 the last bit.
 }


 Now if you want to, you can at runtime create an OpenCL-code 
 string (for example) by passing a heavily metaprogrammed type 
 in as BufferT. The call ends up looking like this:


 auto promisedFutureResult = Gpu.call!myFun(buffer);


 The kernel compilation (assuming OpenCL) is memoized, and the 
 promisedFutureResult is some asynchronous object that 
 implements concurrent programming's future (or something to 
 that extent). For convenience, let's say that it blocks on any 
 read other than some special poll/checking mechanism.

 The constraints imposed on the kernel functions is 
 generalizable to even execute the code on the CPU, as the 
 launching call ( Gpu.call!myFun(buffer) ) can, instead of using 
 an expression-buffer, just pass a normal array in and have the 
 proper result pop out given some interaction between the 
 identifiers mixed in by KernelDefs and the launching caller 
 (ex. using a loop).

 Alternatively to returning the captured expressions, the 
 argument glbmem could have been passed ref, and the same sort 
 of expression capturing could occur. Heck, more arguments 
 could've been passed, too, this doesn't require there to be one 
 single argument representing global memory.

 With CTFE, this method *I think* can also generate the code at 
 compile time given the proper kind of 
 expression-type-recording-BufferT.

 Again, though, all this requires a significant amount of 
 metaprogramming, heavy abuse of auto, and... did I mention a 
 significant amount of metaprogramming? It's roughly the same 
 method I used to embed OpenCL code in a C++ project of mine 
 without writing a single line of OpenCL code, however, so I 
 *know* it's doable, likely even moreso, in D.

Often when first introducing programmers to gpu programming, they 
imagine gpu instructions as being part of the instruction stream 
the cpu receives, completely missing the point of what makes the 
entire scheme so useful.

The gpu might better be imagined as a wholly separate computer, 
that happens to be networked via the system bus. Every 
interaction between the cpu and the gpu has to travel across this 
expensive, comparatively high-latency divide, so the goal is a 
design that makes it easy to avoid interaction between the two 
separate entities as much as possible while still getting the 
maximum control and performance from them.

OpenCL may have picked the term __kernel based on the idea that 
the gpu program in fact represents the device's operating system 
for the duration of that function call.

Single-statement code operations on the GPU, in this vein, 
represent a horridly bad idea. So ...

gpu{
    c[] = a[] * b[];
}

seems like very bad design to me.

In fact, being able to have random gpu {} code blocks seems like a 
bad idea in this vein. Each line in such a block would very likely 
end up being a separate gpu __kernel function, creating 
excessive amounts of cpu/gpu interaction, as each line may have 
a different range.

The foreach loop actually fits the model of 
microthreading very nicely. It has a clearly defined range, there 
is no dependency on the order in which the code operates 
on any arrays used in the loop, you have an implicit index that 
is unique for each value in the range, and you can't change the size 
of the range mid-execution (at least I haven't seen anyone do it 
so far).
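
For comparison, this is roughly what the CPU-side equivalent looks like today with std.parallelism. It is only an analogy for the microthreading model, not GPU code; the function name elementwiseMul is made up for the example.

```d
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Each index i is visited exactly once, in no guaranteed order, and no
// iteration reads what another iteration writes.
float[] elementwiseMul(const float[] a, const float[] b)
{
    assert(a.length == b.length);
    auto c = new float[a.length];
    foreach (i; parallel(iota(a.length)))
        c[i] = a[i] * b[i];
    return c;
}

void main()
{
    auto a = [1.0f, 2.0f, 3.0f, 4.0f];
    auto b = [10.0f, 20.0f, 30.0f, 40.0f];
    writeln(elementwiseMul(a, b)); // [10, 40, 90, 160]
}
```

The range is fixed before the loop starts and the index is implicit, which are the two properties a GPU launch needs as well.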

Aug 17 2013

"Atash" <nope nope.nope> writes:

On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
 The core (!) point here is that processor chips are rapidly 
 becoming a
 collection of heterogeneous cores. Any programming language 
 that assumes
 a single CPU or a collection of homogeneous CPUs has built-in
 obsolescence.

 So the question I am interested in is whether D is the 
 language that can
 allow me to express in a single codebase a program in which 
 parts will
 be executed on one or more GPGPUs and parts on multiple CPUs. 
 D has
 support for the latter, std.parallelism and std.concurrency.

 I guess my question is whether people are interested in 
 std.gpgpu (or
 some more sane name).

 CUDA, works as a preprocessor pass that generates c files from 
 .cu extension files.

 In effect, to create a sensible environment for microthreaded 
 programming, they extend the language.

 a basic CUDA function looking something like...

 __global__ void add( float * a, float * b, float * c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
 }

 add<<< 1, 10 >>>( ptrA, ptrB, ptrC );

 Their is the buildin variables to handle the index location 
 threadIdx.x in the above example, this is something generated 
 by the thread scheduler in the video card/apu device.

 Generally calls to this setup has a very high latency, so using 
 this for a small handful of items as in the above example makes 
 no sense. In the above example that would end up using a single 
 execution cluster, and leave you prey to the latency of the 
 pcie bus, execution time, and latency costs of the video memory.

 it doesn't get effective until you are working with large data 
 sets, that can take advantage of a massive number of threads 
 where the latency problems would be secondary to the sheer 
 calculations done.

 as far as D goes, we really only have one build in 
 microthreading capable language construct, foreach.

 However I don't think a library extension similar to 
 std.parallelism would work gpu based microthreading.

 foreach would need to have something to tell the compiler to 
 generate gpu bytecode for the code block it uses, and would 
 need instructions on when to use said code block based on 
 dataset size.

 while it is completely possible to have very little change with 
 function just add new property  microthreaded and the build in 
 variables for the index position/s, the calling syntax would 
 need changes to support a work range or multidimensional range 
 of some sort.

 perhaps looking something like....

 add$(1 .. 10)(ptrA,ptrB,ptrC);

 a templated function looking similar

 add!(float)$(1 .. 10)(ptrA,ptrB,ptrC);

Regarding functionality, @microthreaded is sounding a lot like 
the __kernel or __global__ keywords in OpenCL and CUDA. Is this 
intentional?

The more metaphors that can be drawn between extant tools and 
whatever we come up with, the better, methinks.

Aug 16 2013

"luminousone" <rd.hunt gmail.com> writes:

On Friday, 16 August 2013 at 21:14:12 UTC, Atash wrote:

 Regarding functionality,  microthreaded is sounding a lot like 
 the __kernel or __global__ keywords in OpenCL and CUDA. Is this 
 intentional?

 The more metaphors that can be drawn between extant tools and 
 whatever is come up with the better, methinks.

Yes, and that is just a word I pulled out of the air; if another 
term makes more sense then I am not against it.

Aug 16 2013

"ponce" <contact gam3sfrommars.fr> writes:

On Friday, 16 August 2013 at 10:04:22 UTC, Russel Winder wrote:
 So the question I am interested in is whether D is the language 
 that can
 allow me to express in a single codebase a program in which 
 parts will
 be executed on one or more GPGPUs and parts on multiple CPUs. D 
 has
 support for the latter, std.parallelism and std.concurrency.

You can write everything in OpenCL and dispatch to either a CPU or 
a GPU device, managing the submit queues yourself.

 I guess my question is whether people are interested in 
 std.gpgpu (or some more sane name).

What would be the purpose? To be on top of both CUDA and OpenCL?

Jan 18 2014

"Ola Fosheim Grøstad" writes:

On Saturday, 18 January 2014 at 19:34:49 UTC, ponce wrote:
 You can write everything in OpenCL and dispatch to both a CPU 
 or GPU device, managing the submit queues yourself.

 I guess my question is whether people are interested in 
 std.gpgpu (or some more sane name).

 What would be the purpose? To be on top of both CUDA and OpenCL?

Compiler support with futures could be useful, e.g. write D 
futures and let the compiler generate CUDA and OpenCL, while 
having a fall-back branch for the CPU in case the GPU is
unavailable/slow.

e.g.:

GPUStore store;
store[123]=somearray;
store[53]=someotherarray;

FutureCalcX futurecalcx = new....(store)...

futurecalcx.compute(store(123),store(53),1.34,299)
...
if(futurecalcx.ready){
    y = futurecalcx.result
}

or future with callback…

futurecalcx.thenCall(somecallback)
futurecalcx.compute(....)
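
The CPU half of this already exists in Phobos, so here is a small sketch of the future pattern using std.parallelism.task. The dot function is just a stand-in for work that a compiler or library might ship to a GPU; nothing here touches a GPU.

```d
import std.parallelism : task, taskPool;
import std.stdio : writeln;

// Stand-in for a computation that could be offloaded.
double dot(const double[] x, const double[] y)
{
    double acc = 0;
    foreach (i; 0 .. x.length)
        acc += x[i] * y[i];
    return acc;
}

void main()
{
    auto a = [1.0, 2.0, 3.0];
    auto b = [4.0, 5.0, 6.0];

    auto future = task!dot(a, b);
    taskPool.put(future); // starts running on a worker thread

    // ... the calling thread is free to do other work here ...

    // future.done polls without blocking; yieldForce blocks until the
    // result is available.
    writeln(future.yieldForce); // 32
}
```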

Jan 22 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Ola Fosheim Grøstad:

 Compiler support with futures could be useful, e.g. write D 
 futures and let the compiler generate CUDA and OpenCL, while 
 having a fall-back branch for the CPU in case the GPU is
 unavailable/slow.

Could be of interest, to ease the porting of C++ code to Cuda:
http://www.alexstjohn.com/WP/2014/01/16/porting-cuda-6-0/

Bye,
bearophile

Jan 22 2014

"Ola Fosheim Grøstad" writes:

On Wednesday, 22 January 2014 at 13:56:22 UTC, bearophile wrote:
 Could be of interest, to ease the porting of C++ code to Cuda:
 http://www.alexstjohn.com/WP/2014/01/16/porting-cuda-6-0/

Yeah, gpu programming is going to develop faster in the coming 
years than dmd can keep track, probably.

I was more thinking about the simplicity:

Decorate a function call with some pragma and obtain a 
pregenerated CUDA or OpenCL string as a property on that 
function. That way the compiler only needs to generate source-code 
and the runtime can do the rest. But hide it so well that it 
makes sense to write generic DMD code this way.

*shrug*

Jan 22 2014

"Ola Fosheim Grøstad" writes:

On Wednesday, 22 January 2014 at 15:21:37 UTC, Ola Fosheim 
Grøstad wrote:
 source-code and the runtime can do the rest. But hide it so 
 well that it makes sense to write generic DMD code this way.

You might want to generate code for coprocessors too, like

http://www.parallella.org/

Or FPGAs...
Or send it over to a small cluster on Amazon...

Basically being able to write sensible DMD code for the CPU and 
then later configure it to ship off isolated computations to 
whatever computational resources you have available (on- or 
off-site) would be more interesting than pure GPGPU which 
probably is going to be out of date real soon due to the shifts 
in technology.

Jan 22 2014

"Paulo Pinto" <pjmlp progtools.org> writes:

On Wednesday, 22 January 2014 at 15:26:26 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 22 January 2014 at 15:21:37 UTC, Ola Fosheim 
 Grøstad wrote:
 source-code and the runtime can do the rest. But hide it so 
 well that it makes sense to write generic DMD code this way.

 You might want to generate code for coprocessors too, like

 http://www.parallella.org/

 Or FPGUs...
 Or send it over to a small cluster on Amazon...

 Basically being able to write sensible DMD code for the CPU and 
 then later configure it to ship off isolated computations to 
 whatever computational resources you have available (on- or 
 off-site) would be more interesting than pure GPGPU which 
 probably is going to be out of date real soon due to the shifts 
 in technology.

Why not just generate SPIR, HSAIL or PTX code instead ?

--
Paulo

Jan 23 2014

"Ben Cumming" <bcumming cscs.ch> writes:

On Thursday, 23 January 2014 at 11:50:19 UTC, Paulo Pinto wrote:
 Why not just generate SPIR, HSAIL or PTX code instead ?

 --
 Paulo

We advertised an internship at my work to look at using D for 
GPUs in HPC (I work at the Swiss National Supercomputing Centre, 
which recently acquired a rather large GPU-based system). We do 
a lot of C++ meta-programming to generate portable code that 
works on both CPU and GPUs. D looks like it could make this much 
less of a pain (because C++ meta programming gets very tired 
after a short while). From what I can see, it should be possible 
to use CTFE and string mixins to generate full OpenCL kernels 
from straight D code.
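
As a very small illustration of the idea (a sketch only, with made-up names elementwiseKernel, addSource and applyOp), a plain D function can build OpenCL C source at compile time, and the same operator string can be mixed in as D code for a CPU fallback:

```d
import std.stdio : write, writeln;

// Hypothetical generator: builds OpenCL C source for an element-wise binary
// operation. It is an ordinary function, so it can run under CTFE.
string elementwiseKernel(string name, string type, string op)
{
    return "__kernel void " ~ name ~ "(__global const " ~ type ~ "* a,\n"
         ~ "    __global const " ~ type ~ "* b,\n"
         ~ "    __global " ~ type ~ "* c)\n"
         ~ "{\n"
         ~ "    size_t i = get_global_id(0);\n"
         ~ "    c[i] = a[i] " ~ op ~ " b[i];\n"
         ~ "}\n";
}

// `enum` forces compile-time evaluation of the generator.
enum addSource = elementwiseKernel("add", "float", "+");
static assert(addSource.length > 0);

// The same operator string, mixed in as a D expression for the CPU path.
float applyOp(string op)(float x, float y)
{
    return mixin("x " ~ op ~ " y");
}

void main()
{
    // A real program would pass this string to clCreateProgramWithSource.
    write(addSource);
    writeln(applyOp!"+"(1.5f, 2.0f)); // 3.5
}
```

Generating the kernel from actual D function bodies rather than from an operator string is the hard part, and this sketch does not attempt it.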

One of the main issues we also have with C++ is that our users 
are intimidated by it, and exposure to the nasty side effects of 
libraries written with metaprogramming does little to convince 
them otherwise (like the error messages, and the propensity for 
even the best-designed C++ library to leak excessive amounts of 
boilerplate and templates into user code). Unfortunately this is a 
field where Fortran is still the dominant language.

The LLVM backend supports PTX generation, and Clang has full 
support for OpenGL. With those tools and some tinkering with the 
compiler, it might be possible to do some really neat things in 
D. And increase programmer productivity at the same time. Fortran 
sets the bar pretty low there!

If anybody knows undergrad or masters students in Europe who 
would be interested in a fully paid internship to work with D on 
big computers, get them in touch with us!

Jan 23 2014

"Ben Cumming" <bcumming cscs.ch> writes:

On Thursday, 23 January 2014 at 19:34:06 UTC, Ben Cumming wrote:
 The LLVM backend supports PTX generation, and Clang has full 
 support for OpenGL.

I mean OpenCL, not OpenGL.

Jan 23 2014

Paulo Pinto <pjmlp progtools.org> writes:

Am 23.01.2014 20:34, schrieb Ben Cumming:
 On Thursday, 23 January 2014 at 11:50:19 UTC, Paulo Pinto wrote:
 Why not just generate SPIR, HSAIL or PTX code instead ?

 --
 Paulo

 We advertised an internship at my work to look at using D for GPUs in
 HPC (I work at the Swiss National Supercomputing Centre, which recently
 acquired are rather large GPU-based system). We do a lot of C++
 meta-programming to generate portable code that works on both CPU and
 GPUs. D looks like it could make this much less of a pain (because C++
 meta programming gets very tired after a short while). From what I can
 see, it should be possible to use CTFE and string mixins to generate
 full OpenCL kernels from straight D code.

I did an internship at CERN during 2003-2004. Lots of interesting C++ 
being used there as well.

 One of the main issues we also have with C++ is that our users are
 intimidated by it, and exposure to the nasty side effects libraries
 written with meta programming do little to convince them (like the error
 messages, and the propensity for even the best-designed C++ library to
 leak excessive amounts of boiler plate and templates into user code).

I still like C++, but with C++14 and whatever might come in C++17 it 
might just be too much for any sane developer. :\

 Unfortunately this is a field where Fortran is still the dominant language.

Yep, ATLAS still had lots of it.

 The LLVM backend supports PTX generation, and Clang has full support for
 OpenGL. With those tools and some tinkering with the compiler, it might
 be possible to do some really neat things in D. And increase programmer
 productivity at the same time. Fortran sets the bar pretty low there!

 If anybody knows undergrad or masters students in Europe who would be
 interested in a fully paid internship to work with D on big computers,
 get them in touch with us!

I'll pass the info around.

--
Paulo

Jan 23 2014

"Ben Cumming" <bcumming cscs.ch> writes:

On Thursday, 23 January 2014 at 20:05:28 UTC, Paulo Pinto wrote:
 I still like C++, but with C++14 and whatever might come in 
 C++17 it might just be too much for any sane developer. :\

I enjoy C++ metaprogramming too, but it is often a chore. And I 
don't want to wait until 2017!

 I'll pass the info around.

Thanks!

Jan 23 2014

"Atila Neves" <atila.neves gmail.com> writes:

 I did an internship at CERN during 2003-2004. Lots of 
 interesting C++ being used there as well.

Small world, we were at CERN at the same time. There's still a
lot of "interesting" C++ going on there. I haven't worked there
in 5 years but my girlfriend still does. What's worse is how
"interesting" their Python is as well... screwing up C++ is easy.
Doing that to Python takes effort.

 I still like C++, but with C++14 and whatever might come in 
 C++17 it might just be too much for any sane developer. :\

I know what you mean. C++11 is a lot better than C++2003 but I
don't think anyone could argue with a straight face that it's not
more complicated. Fortunately for me, I started learning D pretty
much as soon as I realised I was still shooting myself in the
foot with C++11.

 Unfortunately this is a field where Fortran is still the 
 dominant language.

 Yep, ATLAS still had lots of it.

A lot of it is being / has been converted to C++, or so I hear.

Atila

Jan 23 2014

Paulo Pinto <pjmlp progtools.org> writes:

Am 23.01.2014 22:20, schrieb Atila Neves:
 I did an internship at CERN during 2003-2004. Lots of interesting C++
 being used there as well.

 Small world, we were at CERN at the same time. There's still a
 lot of "interesting" C++ going on there. I haven't worked there
 in 5 years but my girlfriend still does. What's worse is how
 "interesting" their Python is as well... screwing up C++ is easy.
 Doing that to Python takes effort.

Who knows, we might even have been sitting close to each other!

I was there under the Portuguese program (ADI) at ATLAS HLT,
from January 2003 to December 2004.

 I still like C++, but with C++14 and whatever might come in C++17 it
 might just be too much for any sane developer. :\

 I know what you mean. C++11 is a lot better than C++2003 but I
 don't think anyone could argue with a straight face that it's not
 more complicated. Fortunately for me, I started learning D pretty
 much as soon as I realised I was still shooting myself in the
 foot with C++11.

I do use it from time to time in hobby projects, now less so thanks to D 
and friends.

At work, the enterprise world is all about JVM and .NET languages. I 
only touch C++ when we replace C++-based systems with the former.


 Unfortunately this is a field where Fortran is still the dominant
 language.

 Yep, ATLAS still had lots of it.

 A lot of it is being / has been converted to C++, or so I hear.

 Atila

Thanks for the info,

Paulo

Jan 23 2014

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On 08/16/2013 12:04 PM, Russel Winder wrote:
 I guess my question is whether people are interested in std.gpgpu (or
 some more sane name).

Yes, I'd be interested, particularly if it's possible to produce a GPGPU
solution that is much more user-friendly than the current C/C++ options.

I think this could have a good deal of importance for scientific simulation and
other similarly demanding computational tasks.

Aug 16 2013

Gambler <fake feather.org.ru> writes:

On 8/13/2013 12:27 PM, Russel Winder wrote:
 The era of GPGPUs for Bitcoin mining are now over, they moved to ASICs.
 The new market for GPGPUs is likely the banks, and other "Big Data"
 folk. True many of the banks are already doing some GPGPU usage, but it
 is not big as yet. But it is coming.
 
 Most of the banks are either reinforcing their JVM commitment, via
 Scala, or are re-architecting to C++ and Python. True there is some

 diminishing (despite what you might hear from .NET oriented training
 companies).
 
 Currently GPGPU tooling means C. 

There is some interesting work in that regard in .NET:

http://research.microsoft.com/en-us/projects/Accelerator/

Obviously, it uses DirectX, but what's nice about it that it is normal

Aug 17 2013
