
digitalmars.D - Scientific computing and parallel computing C++23/C++26

reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
I found the CppCon 2021 presentation
[C++ Standard 
Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
Bryce Adelstein Lelbach very interesting, unusually clear and 
filled with content. I like this man. No nonsense.

It provides a view into what is coming for relatively high level 
and hardware agnostic parallel programming in C++23 or C++26. 
Basically a portable "high level" high performance solution.

He also mentions the Nvidia C++ compiler *nvc++* which will make 
it possible to compile C++ to Nvidia GPUs in a somewhat 
transparent manner. (Maybe it already does, I have never tried to 
use it.)

My gut feeling is that it will be very difficult for other 
languages to stand up to C++, Python and Julia in parallel 
computing. I get a feeling that the distance will only increase 
as time goes on.

What do you think?
Jan 12 2022
next sibling parent reply forkit <forkit gmail.com> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 What do you think?
For the general programmer/developer, parallelism needs to be deeply integrated into the language and its std library, so that it can be 'inferred' (by the compiler/optimiser). Perhaps a language like D could adopt @parallelNO to instruct the compiler/optimiser to never infer parallelism in the code that follows. The O/S should also have a very important role in inferring parallelism.

parallelism has been promoted as the new thing..for a very..very...long time now.

I've had 8 cores available on my pc for well over 10 years now. I don't think anything running on my pc has the slightest clue that they even exist ;-) (except the o/s).

I expect 'explicitly' coding parallelism will continue to be relegated to a niche subset of programmers/developers, due to the very considerable knowledge/skillset needed to design/develop/test/debug parallel code.
Jan 12 2022
next sibling parent IGotD- <nise nise.com> writes:
On Thursday, 13 January 2022 at 00:41:25 UTC, forkit wrote:
 parallelism has been promoted as the new thing..for a 
 very..very...long time now.

 I've had 8 cores available on my pc for well over 10 years now. 
 I don't think anything running on my pc has the slighest clue 
 that they even exist ;-)  (except the o/s).

 I expect 'explicitly' coding parallelism will continue to be 
 relegated to a niche subset of programmers/developers, due to 
 the very considerable knowledge/skillset needed, to 
 design/develop/test/debug parallel code.
Yes, parallelism is a dead end for many applications, as you need a workload that can take advantage of it. Forcing parallel execution can often reduce performance instead. In order to exploit parallelism you need to understand your program and how it can take advantage of it. Languages that try to make things parallel under the hood, without the programmer's knowledge, have been a fantasy for decades and still are.

I'm not saying that the additions in C++ aren't useful; people will probably find good use for them. The presentation just reminds me how C++ gets uglier with every iteration, and I'm happy I jumped off that horror train.
Jan 12 2022
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 12:41:25AM +0000, forkit via Digitalmars-d wrote:
[...]
 I've had 8 cores available on my pc for well over 10 years now. I
 don't think anything running on my pc has the slighest clue that they
 even exist ;-)  (except the o/s).
Recently, I wanted to use POVRay to render frames for a short video clip. It was taking far too long because it was running on a single core at a time, so I wrote this:

	import std.parallelism, std.process;
	foreach (frame; frames.parallel) {
		execute([ "povray" ] ~ povrayOpts ~ [
			"+I", frame.infile,
			"+O", frame.outfile ]);
	}

Instant 8x render speedup. (Well, almost 8x... there's of course a little bit of overhead. But you get the point.)
 I expect 'explicitly' coding parallelism will continue to be relegated
 to a niche subset of programmers/developers, due to the very
 considerable knowledge/skillset needed, to design/develop/test/debug
 parallel code.
For simple cases, the above example serves as a counterexample. ;-) Of course, for more complex situations things may not be quite so simple. But still, it doesn't have to be as complex as languages like C++ make it seem. In the above example I literally just added ".parallel" to the code and it Just Worked(tm).

T

-- 
The best way to destroy a cause is to defend it poorly.
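A self-contained sketch of the same pattern, for anyone who wants to try it without POVRay installed; the frame list and the `echo` command are just placeholders standing in for the real frames/povrayOpts:

```d
import std.conv : to;
import std.parallelism : parallel;
import std.process : execute;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    // stand-in for the real frame list; each element becomes one external
    // process, and std.parallelism spreads the iterations over the cores
    auto frames = iota(0, 16);

    foreach (frame; frames.parallel)
    {
        // placeholder command (POSIX `echo`); the original post ran povray
        // here with per-frame input/output options
        auto r = execute(["echo", "frame", frame.to!string]);
        writefln("frame %s -> %s", frame, r.output);
    }
}
```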
Jan 12 2022
next sibling parent forkit <forkit gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 Recently, I wanted to use POVRay to render frames for a short 
 video clip. It was taking far too long because it was running 
 on a single core at a time, so I wrote this:

 	import std.parallelism, std.process;
 	foreach (frame; frames.parallel) {
 		execute([ "povray" ] ~ povrayOpts ~ [
 			"+I", frame.infile,
 			"+O", frame.outfile ]);
 	}

 Instant 8x render speedup. (Well, almost 8x... there's of 
 course a little bit of overhead. But you get the point.)
I'd like to see D simplify this even further:

	parallel foreach (frame; frames) { .. }

that's it. just annotate it. that's all I have to do. Let the language tools do the rest.
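For comparison, the closest working spelling today puts the "annotation" on the range rather than on the loop itself; the work-unit size below is just an illustrative tuning knob:

```d
import std.parallelism : parallel, taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    auto results = new double[](100);

    // the "annotation" lives on the range, not on the foreach
    foreach (i; iota(results.length).parallel)
    {
        results[i] = i * i; // stand-in for real per-frame work
    }

    // the same loop with an explicit work-unit size, via the task pool
    foreach (i; taskPool.parallel(iota(results.length), 8))
    {
        results[i] += 1;
    }

    writeln(results[0 .. 5]);
}
```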
Jan 12 2022
prev sibling next sibling parent reply forkit <forkit gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 ..... But still, it doesn't have to be as complex as languages 
 like C++ make it seem.  In the above example I literally just 
 added ".parallel" to the code and it Just Worked(tm).


 T
I wish below would "just work"

// ----
module test;

import std;

@safe
void main()
{
    //int[5] arr = [1, 2, 3, 4, 5]; // nope. won't work with .parallel
    int[] arr = [1, 2, 3, 4, 5]; // has to be dynamic to work with .parallel ??

    int x = 0;
    foreach(n; arr.parallel) // Nope - .parallel is a @system function and cannot be called in @safe
    {
        x += n;
    }
    writeln(x);
}
// -----
Jan 13 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 08:07:51PM +0000, forkit via Digitalmars-d wrote:
[...]
 // ----
 module test;
 
 import std;
 
  @safe
 void main()
 {
     //int[5] arr = [1, 2, 3, 4, 5]; // nope. won't work with .parallel
     int[] arr = [1, 2, 3, 4, 5]; // has to be dynamic to work with .parallel ??
Just write instead:

	int[5] arr = [1, 2, 3, 4, 5];
	foreach (n; arr[].parallel) ...

In general, whenever something rejects static arrays, inserting `[]` usually fixes it. :-D

I'm not 100% sure why .parallel is @system, but I suspect it's because of potential issues with race conditions, since it does not prevent you from writing to the same local variable from multiple threads. If pointers are updated this way, it could lead to memory corruption problems.

T

-- 
Long, long ago, the ancient Chinese invented a device that lets them see through walls. It was called the "window".
Jan 13 2022
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
 [snip]

 I'm not 100% sure why .parallel is  system, but I suspect it's 
 because of potential issues with race conditions, since it does 
 not prevent you from writing to the same local variable from 
 multiple threads. If pointers are updated this way, it could 
 lead to memory corruption problems.


 T
Could it be made @safe when used with const/immutable variables?
Jan 13 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 09:13:11PM +0000, jmh530 via Digitalmars-d wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
[...]
 I'm not 100% sure why .parallel is  system, but I suspect it's
 because of potential issues with race conditions, since it does not
 prevent you from writing to the same local variable from multiple
 threads. If pointers are updated this way, it could lead to memory
 corruption problems.
[...]
 Could it be made  safe when used with const/immutable variables?
Apparently not, as Petar already pointed out. But even besides access to non-shared local variables, there's also the long-standing issue that a function that receives a delegate cannot have stricter attributes than the delegate itself, i.e.:

	// NG: @safe function fun cannot call @system delegate dg.
	void fun(scope void delegate() @system dg) @safe { dg(); }

	// You have to do this instead (i.e., delegate must be
	// restricted to be @safe):
	void fun(scope void delegate() @safe dg) @safe { dg(); }

There's currently no way to express that the safety of fun depends solely on the safety of dg, such that if you pass in a @safe delegate, then fun should be regarded as @safe and allowed to be called from @safe code.

This is a problem because .parallel is implemented using .opApply, which takes a delegate argument. It accepts an unqualified delegate in order to be usable with both @system and @safe delegates. But this unfortunately means it must be @system, and therefore uncallable from @safe code.

Various proposals to fix this have been brought up before, but Walter either doesn't fully understand the issue, or else has some reasons he's not happy with the proposed solutions. In fact he has proposed something that goes the *opposite* way to what should be done in order to address this problem. Since both were shot down in the forum discussions, we're stuck at the current stalemate. :-(

T

-- 
Once bitten, twice cry...
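For completeness, the usual workaround for ordinary higher-order functions is to make them templates so that attribute inference does the work; a minimal sketch with a hypothetical `each` (this doesn't help `.parallel`, because foreach needs `opApply` to have a fixed, non-templated signature in order to infer the loop variable types):

```d
import std.stdio : writeln;

// Because `each` is a template, the compiler infers its attributes per
// instantiation: a @safe delegate yields a @safe instance, a @system
// delegate a @system one.
void each(DG)(int[] xs, scope DG dg)
    if (is(DG : void delegate(int)))
{
    foreach (x; xs)
        dg(x);
}

void main() @safe
{
    int sum;
    // the delegate literal is inferred @safe, so this instantiation of
    // `each` is callable from @safe code
    each([1, 2, 3], (int x) { sum += x; });
    writeln(sum); // 6
}
```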
Jan 13 2022
next sibling parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 21:44:13 UTC, H. S. Teoh wrote:
 On Thu, Jan 13, 2022 at 09:13:11PM +0000, jmh530 via 
 Digitalmars-d wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
[...]
 [...]
[...]
 Could it be made  safe when used with const/immutable 
 variables?
[...]
There are two DIPs that aim to address the attribute propagation problem:

* [Argument dependent attributes (ADAs)][1] by Geod24: https://github.com/Geod24/DIPs/blob/adas/DIPs/DIP4242.md
* [Attributes for Higher-Order Functions][2] by Bolpat: https://github.com/Bolpat/DIPs/blob/AttributesHOF/DIPs/DIP-1NN4-QFS.md

[1]: https://github.com/dlang/DIPs/pull/198
[2]: https://github.com/dlang/DIPs/pull/199
Jan 13 2022
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 21:44:13 UTC, H. S. Teoh wrote:
 [snip]
 Various proposals to fix this has been brought up before, but 
 Walter either doesn't fully understand the issue, or else has 
 some reasons he's not happy with the proposed solutions.  In 
 fact he has proposed something that goes the *opposite* way to 
 what should be done in order to address this problem.  Since 
 both were shot down in the forum discussions, we're stuck at 
 the current stalemate. :-(


 T
Thanks for the detailed explanation. Maybe the new DIPs can make a better effort at the beginning to communicate the issue (such as this example).
Jan 13 2022
prev sibling parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 21:13:11 UTC, jmh530 wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
 [snip]

 I'm not 100% sure why .parallel is  system, but I suspect it's 
 because of potential issues with race conditions, since it 
 does not prevent you from writing to the same local variable 
 from multiple threads. If pointers are updated this way, it 
 could lead to memory corruption problems.


 T
Could it be made safe when used with const/immutable variables?
For some data to be @safe-ly accessible across threads it must have no "unshared aliasing", meaning that `shared(const(T))` and `immutable(T)` are ok, but simply `T` and `const(T)` are not.

The reason why the `.parallel` example above was not safe is that the body of the foreach was passed as a delegate to `ParallelForeach.opApply`, and the problem is that delegates can access unshared mutable data through their closure. If the @safe-ty holes regarding delegates are closed, presumably we could add a `ParallelForeach.opApply` overload that took a `@safe` delegate and then the whole `main` function could be marked as `@safe`.

I think back when the module was under active development, the authors did carefully consider the @safe-ty aspects, as they have written code that conditionally enables some function overloads to be `@trusted`, depending on the parameters they receive. But in the end it was the best they could do given the state of the language at the time. Most likely the situation has improved sufficiently that more of the API could be made (at least conditionally) safe.

You can check the various comments explaining the situation:

* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L32-L34
* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L3382-L3395
* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L254-L261
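To make the "no unshared aliasing" point concrete, here is a small sketch (not from std.parallelism's docs) where the parallel loop only reads immutable data and each iteration writes to its own slot, so no synchronization is needed:

```d
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // immutable data has no unshared *mutable* aliasing, so concurrent
    // reads from the loop body are race-free
    immutable double[] table = [1.0, 2.0, 3.0, 4.0, 5.0];

    auto results = new double[](table.length);

    foreach (i; iota(table.length).parallel)
    {
        // each iteration writes only to its own slot, so `results`
        // needs no synchronization either
        results[i] = table[i] * table[i];
    }

    writeln(results); // [1, 4, 9, 16, 25]
}
```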
Jan 13 2022
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 21:51:10 UTC, Petar Kirov 
[ZombineDev] wrote:
 [snip]
Thanks for the detailed explanation.
Jan 13 2022
prev sibling parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 20:07:51 UTC, forkit wrote:
 On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 ..... But still, it doesn't have to be as complex as languages 
 like C++ make it seem.  In the above example I literally just 
 added ".parallel" to the code and it Just Worked(tm).


 T
I wish below would "just work" [...]
```d
import core.atomic : atomicOp;
import std.parallelism : parallel;
import std.stdio : writeln;

// Not @safe, since `parallel` still allows access to non-shared-qualified
// data. See:
// https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L32-L34
void main()
{
    int[5] arr = [1, 2, 3, 4, 5]; // Yes, static arrays work just fine.

    // `shared` is necessary to safely access data from multiple threads
    shared int x = 0;

    // Most functions in Phobos work with ranges, not containers (by design).
    // To get a range from a static array, simply slice it:
    foreach(n; arr[].parallel)
    {
        // Use atomic ops (or higher-level synchronization primitives) to work
        // with shared data, without data-races:
        x.atomicOp!`+=`(n);
    }

    writeln(x);
}
```
Jan 13 2022
prev sibling parent Leoarndo Palozzi <lpalozzi gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 In the above example I literally just added ".parallel" to the 
 code and it Just Worked(tm).
So did I for my weekend raytracer (I am new to D and was pleasantly surprised how easy it was).

	foreach (i, ref pixel; parallel(image.pixels)) {...}
Jan 16 2022
prev sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Thursday, 13 January 2022 at 00:41:25 UTC, forkit wrote:
 For the general programmers/developer, parallelism needs to be 
 deeply integrated into the language and it's std library, so 
 that it can be 'inferred' (by the compiler/optimizer).

 Perhaps a language like D, could adopt  parallelNO to instruct 
 the compiler/optimizer to never infer parallelism in the code 
 that follows.

 The O/S should also has a very important role in inferring 
 parallelism.

 I've had 8 cores available on my pc for well over 10 years now. 
 I don't think anything running on my pc has the slightest clue 
 that they even exist ;-)  (except the o/s).
Number of cores is fine, but if you could take advantage of the GPU/CUDA cores on, say, a graphics card as well; **THAT** would be really cool. Imagine the huge speedup of say 7zip or other tools where simple processes, pattern matching or encoding/processing could speed up if you could make use of those **AS WELL AS** the number of cores you have.

For a while I've been making scripts where I *find* files and split the work via xargs; this converts any single-threaded program to run on lots of cores/processes (*by running lots of copies with different input files*), though on Windows it may result in 5 processes for every 1 you want to run.

**Example:**

    find -iname "*.jpg" -print0 | xargs -0 -P $NUMBER_OF_PROCESSORS -n 1 jpegoptim --all-progressive
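Roughly the same fan-out can be written directly in D with std.parallelism plus std.process; a sketch, assuming `jpegoptim` is on the PATH:

```d
import std.algorithm : filter;
import std.array : array;
import std.file : dirEntries, SpanMode;
import std.parallelism : parallel;
import std.path : extension;
import std.process : execute;
import std.uni : toLower;

void main()
{
    // collect the .jpg files first, then let std.parallelism spread the
    // external processes over the available cores (work-unit size 1,
    // since each file is a substantial chunk of work)
    auto jpegs = dirEntries(".", SpanMode.depth)
        .filter!(e => e.isFile && e.name.extension.toLower == ".jpg")
        .array;

    foreach (entry; jpegs.parallel(1))
    {
        execute(["jpegoptim", "--all-progressive", entry.name]);
    }
}
```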
Jan 16 2022
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Monday, 17 January 2022 at 06:13:03 UTC, Era Scarecrow wrote:
 would be really cool. Imagine the huge speedup of say 7zip or 
 other where simple processes, pattern matching or 
 encoding/processing could speed up if you could make use of 
 those **AS WELL AS** the number of cores you have.
Yes, but compression/decompression is too complex. You need to be careful with data dependencies so that the computations can be run in parallel on a massive scale. I don't know what the future holds, but today you also need a core to feed the GPU. Maybe in the future the GPU will be able to "feed itself" like an independent actor? Hard to tell. For FPGAs that ought to be a possibility, but they are only available in specialty setups. Maybe we need an open source computer platform with more interesting hardware (using commodity chips)?
Jan 17 2022
parent reply bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Monday, 17 January 2022 at 18:17:10 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 17 January 2022 at 06:13:03 UTC, Era Scarecrow wrote:
 would be really cool. Imagine the huge speedup of say 7zip or 
 other where simple processes, pattern matching or 
 encoding/processing could speed up if you could make use of 
 those **AS WELL AS** the number of cores you have.
Yes, but compression/decompression is too complex. You need to be careful with data dependencies so that the computations can be run in parallel on a massive scale. I don't know what the future holds, but today you also need a core to feed the GPU. Maybe in the future the GPU will be able to "feed itself" like an independent actor? Hard to tell. For FPGAs that ought to be a possibility, but they are only available in specialty setups. Maybe we need an open source computer platform with more interesting hardware (using commodity chips)?
Some years ago we got a chance to provide an efficient way to perform such computations with libraries such as https://wiki.dlang.org/LDC_CUDA_and_SPIRV

And recently I posted my feedback on where D could provide some killer features in this area: https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org

Unfortunately, this will not be possible in the near future, so other languages will keep the market
Jan 17 2022
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via Digitalmars-d
wrote:
[...]
 And recently I put my feedback where D could provide some killer
 feature in this area:
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, so others
 language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T -- It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would be still be stuck with wooden horse-cart wheels.
Jan 17 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Monday, 17 January 2022 at 21:31:25 UTC, H. S. Teoh wrote:
 On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via 
 Digitalmars-d wrote: [...]
 And recently I put my feedback where D could provide some 
 killer feature in this area: 
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, 
 so others language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T
Take one item on the list: developing an equivalent of numpy and scipy. What do you take to be the "near future"? One year away? Two years? There is no language feature holding these items back.

Addressing ecosystem issues is a massive undertaking. In order for a D clone of even just numpy to be successful, it needs to have a significant user base feeding input back into the development cycle so that it can go beyond simply being churned out by a few overeager developers and actually stabilized so that it becomes useful and robust.

You must also consider that the items that bioinfornatics listed are all somewhat contingent on each other. In isolation they aren't nearly as useful. You might have a numpy/scipy clone, but if you don't also have a matplotlib clone (or some other means of doing data visualization from D) their utility is a bit limited.

His wishlist is a tall order.
Jan 18 2022
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 [snip]

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.

 His wishlist is a tall order.
Have you tried ggplotd [1]? [1] https://code.dlang.org/packages/ggplotd
Jan 18 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 17:24:03 UTC, jmh530 wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 [snip]

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.

 His wishlist is a tall order.
Have you tried ggplotd [1]? [1] https://code.dlang.org/packages/ggplotd
I haven't tried it. I also hadn't heard of it before. Judging from the small number of GitHub issues it appears that it is used by basically no one and that it's missing crucial standard features supported by all mature plotting libraries. In one of the issues the main developer indicated he is no longer adding new features. This library seems to be: 1) quite young, 2) not actively developed, 3) not actively maintained.
Jan 18 2022
parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 18:32:41 UTC, sfp wrote:
 [snip]

 I haven't tried it. I also hadn't heard of it before.

 Judging from the small number of GitHub issues it appears that 
 it is used by basically no one and that it's missing crucial 
 standard features supported by all mature plotting libraries. 
 In one of the issues the main developer indicated he is no 
 longer adding new features. This library seems to be: 1) quite 
 young, 2) not actively developed, 3) not actively maintained.
Hadn't realized that. Shame.
Jan 18 2022
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 On Monday, 17 January 2022 at 21:31:25 UTC, H. S. Teoh wrote:
 On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via 
 Digitalmars-d wrote: [...]
 And recently I put my feedback where D could provide some 
 killer feature in this area: 
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, 
 so others language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T
[...]
Yes. Better to concentrate on things D *can* enable, like a great performance programming experience.

D appeals to me primarily because it lets me write simpler performant code. It regularly opens the door to better perf/complexity ratios than C++ for example. This is particularly important in markets where even small performance gains bring large economic benefits, where novel code is indicated.

OTOH, if your value add is more about quickly assembling/rearranging existing components that are sufficiently performant in themselves and in combination, well, by all means, carry on!
Jan 18 2022
prev sibling parent reply bachmeier <no spam.net> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.
To my knowledge pyd still works. There's not much to be gained from rewriting a plotting library from scratch. It's not common that you're plotting 100 million times for each run of your program.

I see too much NIH syndrome here. If you can call another language, all you need to do is write convenience wrappers on top of the many thousands of hours of work done in that language. You can replace the pieces where it makes sense to do so. The goal of the D program is whatever analysis you're doing on top of those libraries, not the libraries themselves.

We call C libraries all the time. Nobody thinks that's a problem. A bunch of effort has gone into calling C++ libraries and there's tons of support for that effort. When it comes to calling any other language, even for things that don't require performance, there's no interest. The ability to interoperate with other languages is the number one reason I started using D and the main reason I still use it.
Jan 18 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 18, 2022 at 08:28:52PM +0000, bachmeier via Digitalmars-d wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 
 You must also consider that the items that bioinfornatics listed are
 all somewhat contingent on each other. In isolation they aren't
 nearly as useful. You might have a numpy/scipy clone, but if you
 don't also have a matplotlib clone (or some other means of doing
 data visualization from D) their utility is a bit limited.
[...]

+1. Why do we need to reinvent numpy/scipy? One of the advantages conferred by D's metaprogramming capabilities is easier integration with other languages. Adam Ruppe's jni.d is one prime example of how metaprogramming can abstract away the nasty amounts of boilerplate you're otherwise forced to write when interfacing with Java via JNI.

D's C ABI compatibility also means you can leverage the tons of C libraries out there right now, instead of waiting for somebody to reinvent the same libraries in D years down the road. D's capabilities make it very amenable to being a "glue" language for interfacing with other languages.

T

-- 
Caffeine underflow. Brain dumped.
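As a concrete illustration of the C-ABI point, this is all it takes to call a C library from D; it assumes a CBLAS implementation (e.g. OpenBLAS) is installed and linked (e.g. `-L-lopenblas`):

```d
// Declaration of one C function from CBLAS; linking against an installed
// implementation is assumed, no binding generator or wrapper needed.
extern(C) double cblas_ddot(int n, const(double)* x, int incx,
                            const(double)* y, int incy) nothrow @nogc;

void main()
{
    import std.stdio : writeln;

    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];

    // call straight into the C library; .ptr/.length map directly onto
    // the pointer-plus-count convention that C APIs use
    writeln(cblas_ddot(cast(int) a.length, a.ptr, 1, b.ptr, 1)); // 32
}
```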
Jan 18 2022
next sibling parent reply bachmeier <no spam.net> writes:
On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 On Tue, Jan 18, 2022 at 08:28:52PM +0000, bachmeier via 
 Digitalmars-d wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 
 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In 
 isolation they aren't nearly as useful. You might have a 
 numpy/scipy clone, but if you don't also have a matplotlib 
 clone (or some other means of doing data visualization from 
 D) their utility is a bit limited.
[...]
The next release of my embedr library (which I've been able to do now that my work life is finally returning to normal) will make it trivial to call D functions from R. What I mean by that is that you write a file of D functions and by the magic of metaprogramming, you don't need to write any boilerplate at all. Example:

```
import mir.random;
import mir.random.variable;

RVector rngexample(int n) {
    auto gen = Random(unpredictableSeed);
    auto rv = uniformVar(-10, 10); // [-10, 10]
    auto result = RVector(n);
    foreach(ii; 0..n) {
        result[ii] = rv(gen);
    }
    return result;
}
mixin(createRFunction!rngexample);
```

The only way you can do better is if someone else writes the program for you. But then it doesn't make much difference which language is used.
Jan 18 2022
parent sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 22:00:42 UTC, bachmeier wrote:
 On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 [...]
[...]
This is all news to me. It's a shame these libraries and their capabilities aren't advertised more prominently.

How hard would it be to automatically wrap a D library and expose it to Python, MATLAB, and Julia simultaneously? Say the library even has a simple C-style API, or a very simple single-inheritance OO hierarchy with no templates.
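One low-tech sketch of that (not pyd/embedr): export a plain C ABI from D and build a shared library, which Python (ctypes/cffi), Julia (ccall) and MATLAB (loadlibrary) can all consume. The function name and build flags below are illustrative:

```d
// mysum.d -- a hypothetical exported function; compile into a shared
// library with something like `ldc2 -shared -of=libmysum.so mysum.d`
// (flags are illustrative). The function avoids the GC, so no D runtime
// initialization is needed on the caller's side.
export extern(C) double mysum(const(double)* p, size_t n) nothrow @nogc
{
    double s = 0;
    foreach (i; 0 .. n)
        s += p[i];
    return s;
}
```

From Python this would be loaded with ctypes.CDLL (after setting argtypes/restype); Julia's ccall and MATLAB's loadlibrary speak the same C ABI. The hard part is anything beyond a C-style API, which is where generators like pyd or embedr come in.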
Jan 18 2022
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 [snip]

 +1.  Why do we need to reinvent numpy/scipy? One of the 
 advantages conferred by D's metaprogramming capabilities is 
 easier integration with other languages.  Adam Ruppe's jni.d is 
 one prime example of how metaprogramming can abstract away the 
 nasty amounts of boilerplate you're otherwise forced to write 
 when interfacing with Java via JNI. D's C ABI compatibility 
 also means you can leverage the tons of C libraries out there 
 right now, instead of waiting for somebody to reinvent the same 
 libraries in D years down the road.  D's capabilities makes it 
 very amenable to being a "glue" language for interfacing with 
 other languages.


 T
I'm all for leveraging C libraries in D, but if you have code that needs to be performant then you may run into limitations with python. If you're building one chart with Matplotlib, then it's probably fine. If you have some D code that takes longer to run (e.g. a simulation that deals with a lot of data and many threads), then you might be a little more careful about what python code to incorporate and how. I don't know the technical details needed to get the best performance in that situation (are there benchmarks?), but I saw some work done on using the python buffer protocol when calling D functions from python.

In addition, the python code might itself be calling the same C libraries that D can (e.g. LAPACK) (though potentially with different defaults, trading off performance vs. accuracy, resulting in python being faster in some cases than D). In that case, python is also a glue language. Taking the same approach in D can simplify your code base a little bit and you don't need to worry about any additional overhead or limitations from the GIL that might get introduced. Again, not something you need to worry about when performance is not a big issue.
Jan 18 2022
prev sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 18 January 2022 at 20:28:52 UTC, bachmeier wrote:
 from rewriting a plotting library from scratch. It's not common 
 that you're plotting 100 million times for each run of your 
 program.
It is not uncommon to interact with plots that are too big for matplotlib to handle well. The python visualization solutions are very primitive. Having something better than numpy+matplotlib is obviously an advantage, a selling point for other offerings. Having the exact same thing? Not so much.
 You can replace the pieces where it makes sense to do so. The 
 goal of the D program is whatever analysis you're doing on top 
 of those libraries, not the libraries themselves.
You don't get a unified API with good usability by collecting a hodge podge of libraries. You also don't get any performance or quality advantage over other solutions. Borrowing is ok, replicating APIs? Probably not. What is then the argument for not using the original language directly?

The reason for moving to a new language (like Julia or Python) is that you get something that better fits what you want to do and that transitioning provides a smoother work flow in the end. If everything you achieve by switching is replacing one set of trade offs with another set of trade offs, then you are generally better off using the more mainstream, supported and well documented alternative.

So where do you start? With a niche, e.g. signal processing or some other "mainstream" niche.
 We call C libraries all the time. Nobody thinks that's a 
 problem. A bunch of effort has gone into calling C++ libraries 
 and there's tons of support for that effort.
So, libraries are often written in C in order to support other languages, and they are structured in a very basic way as far as C code goes. C-only libraries are sometimes not as easy to interface with as they rely heavily on macros, dedicated runtimes or specifics of the underlying platform.

I also think the C++ interop D offers is a bit clunky. It is more suitable for people who write C-like C++ than for people who try to write idiomatic C++. D has to align itself more with C++ semantics for this to be a good selling point. I am somewhat impressed that Python has many solutions for binding to C++ though, even when Python is semantically a very poor fit for C++… (e.g. [Binder](https://github.com/RosettaCommons/binder)).

D's potential strength here is not so much in being able to bind to C++ in a limited fashion (like Python), but being able to port C++ to D and improve on it. To get there you need feature parity, which is what this thread is about.

We now know that C++ will eventually get more powerful parallel computing abilities built into the language, supported by the hardware manufacturer Nvidia for their hardware (nvc++). That said, Apple has shown little interest in making their version of C++ work well with parallel computing, and the C++ standard lib is not very good for numeric operations. Like, the simd code I wrote for inner product (using generic llvm SIMD) turned out to be 3 times faster than the generic C++ standard library solution. Yet, we see *"change is coming"* written on the horizon, I think.

So either D has to move in a different direction than competing head-to-head with C++, or one has to be more strategic in how the development process is structured. Or well, just more strategic in general.
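For the curious, here is a rough sketch of the kind of hand-rolled SIMD inner product being alluded to (this is not the poster's code); it assumes a target where core.simd provides double2, 16-byte-aligned slices and an even length; a real version needs unaligned loads and a scalar tail loop:

```d
import core.simd;
import std.stdio : writeln;

// Two lanes at a time with core.simd vectors; aligned loads and an even
// length are assumed for brevity.
double dot(const(double)[] a, const(double)[] b)
{
    assert(a.length == b.length && a.length % 2 == 0);
    double2 acc = 0;
    foreach (i; 0 .. a.length / 2)
    {
        // multiply element-wise and accumulate, two doubles per step
        double2 va = *cast(const(double2)*) (a.ptr + 2 * i);
        double2 vb = *cast(const(double2)*) (b.ptr + 2 * i);
        acc += va * vb;
    }
    // horizontal sum of the two lanes
    return acc.array[0] + acc.array[1];
}

void main()
{
    double[] a = [1, 2, 3, 4];
    double[] b = [5, 6, 7, 8];
    writeln(dot(a, b)); // 70
}
```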
Jan 18 2022
next sibling parent sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 18 January 2022 at 20:28:52 UTC, bachmeier wrote:
 from rewriting a plotting library from scratch. It's not 
 common that you're plotting 100 million times for each run of 
 your program.
It is not uncommon to interact with plots that are too big for matplotlib to handle well. The python visualization solutions are very primitive. Having something better than numpy+matplotlib is obviously an advantage, a selling point for other offerings.
To add to this: matplotlib has *many* pain points. It has an inconsistent API, it is very slow, its 3D plotting is hacked together (and very slow). Making animations isn't straightforward (and very slow). Making just several hundred plots typically takes several minutes (at least). It should take <1s. That said, matplotlib is very powerful and handles essentially all important use cases. There is definitely room for improvement. If someone with NIH syndrome came along and wrote a plotting library which actually improves on matplotlib significantly, it would be to D's benefit, especially since it would be trivial to consume from other languages which would be interested in using it.
Jan 18 2022
prev sibling next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:

 It is not uncommon to interact with plots that are too big for 
 matplotlib to handle well. The python visualization solutions 
 are very primitive. Having something better than 
 numpy+matplotlib is obviously an advantage, a selling point for 
 other offerings.
Wow, this is the first time I've read that matplotlib is inadequate. Can you please give an example of a visualisation library(any language) which you consider good?
Jan 18 2022
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 03:21:38 UTC, Tejas wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:

 It is not uncommon to interact with plots that are too big for 
 matplotlib to handle well. The python visualization solutions 
 are very primitive. Having something better than 
 numpy+matplotlib is obviously an advantage, a selling point 
 for other offerings.
Wow, this is the first time I've read that matplotlib is inadequate. Can you please give an example of a visualisation library(any language) which you consider good?
There are commercial products for visualizing large datasets. I don't use them as I either create my own or use a sound editor. But yes, matplotlib feels more like a homegrown solution than a solid product. It also has layout issues with labeling. You can make it work, but it is clunky.
Jan 18 2022
prev sibling parent reply forkit <forkit gmail.com> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
Jan 18 2022
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
Jan 18 2022
next sibling parent reply M.M. <matus email.cz> writes:
On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but being 
 able to port C++ to D and improve on it. To get there you 
 need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article is telling us: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.)

But yes, Julia's focus on a small niche, and its popularity in that niche, makes it attractive for contributors.
Jan 18 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 07:24:09 UTC, M.M. wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but being 
 able to port C++ to D and improve on it. To get there you 
 need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article tells: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.) But yes, Julia's focus on small niche, and its popularity in that niche makes it attractive for contributors.
You might call it self-written PR articles, or educate yourself about who is using it.

https://juliacomputing.com/case-studies

versus

https://dlang.org/orgs-using-d.html

Also I did mention C++, which you glossed over in your eagerness to devalue Julia's market domain versus D among HPC communities.

As someone that spent two years at ATLAS TDAQ HLT, I know which languages those folks would be adopting, but hey, it is a piece of self-written PR.
Jan 18 2022
parent M.M. <matus email.cz> writes:
On Wednesday, 19 January 2022 at 07:29:23 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 07:24:09 UTC, M.M. wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but 
 being able to port C++ to D and improve on it. To get there 
 you need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article tells: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.) But yes, Julia's focus on small niche, and its popularity in that niche makes it attractive for contributors.
You might call it self-written PR articles, or educate yourself who is using it. https://juliacomputing.com/case-studies versus https://dlang.org/orgs-using-d.html Also I did mention C++, which you glossed over on your eagerness to devalue Julia's market domain versus D among HPC communities. As someone that spent two years at ATLAS TDAQ HLT, I know which languages those folks would be adopting, but hey it is a piece of self-written PR.
I am sorry that you took my post as an attack:

- the article itself is written by Julia people (the bottom of the article says "Source: Julia Computing"). Using this fact to tell me to "educate myself on a non-relevant topic, i.e., on who uses Julia" seems quite irrelevant to my note on who wrote the text. (Being sarcastic now: I am sure that whatever education I will do from now on till the end of my life will not change who wrote the article.)

- I also acknowledged that Julia is popular in scientific computing. I do not understand where in my text I devalue Julia as a language/tool. (Again, I do not like that self-written articles are used in arguments. But I did not say anything about Julia "being not good".)

What I did not write, but think, is that Julia is a very nice project, and I am a fan of its development.
Jan 19 2022
prev sibling parent reply forkit <forkit gmail.com> writes:
On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto wrote:
 That paper is from 2008, meanwhile in 2021,

 https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club//

 This is what D has to compete against, not only C++ with the 
 existing SYSCL/CUDA tooling and their ongoing integration into 
 ISO C++.
Oh. So dismissive of it because it's from 2008?

Its focus is on methods for compiler optimisation for one of the most important data structures in scientific computing -> arrays.

As such, the more D can do to generate even more efficient parallel array computations, the more chance it has of attracting 'some' from the scientific community.
Jan 19 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 11:43:25 UTC, forkit wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 That paper is from 2008, meanwhile in 2021,

 https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club//

 This is what D has to compete against, not only C++ with the 
 existing SYSCL/CUDA tooling and their ongoing integration into 
 ISO C++.
Oh. so dissmisive of it because its from 2008? It's focus is on methods for compiler optimisation, for one of the most important data structures in scientific computing -> arrays. As such, the more D can do to generate even more efficient parallel array computations, the more chance it has of attracting 'some' from the scientific community.
Yes, because in 2008 CUDA and SYSCL were of little importance in the HPC universe, almost everyone was focused on OpenMP, many still thought OpenCL would suffice with its C-only API, and OpenACC was yet to show up.

Unless D comes with this in the package and those logos adopt it, just being a better language isn't enough.

https://developer.nvidia.com/hpc-sdk

https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.mbnkph

https://www.amd.com/en/technologies/open-compute

It also needs to plug into the libraries, IDEs and GPGPU debuggers available to the community.
Jan 19 2022
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC; it is also about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Jan 19 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manage to get out of their their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously.

Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs; that is why SYSCL is collaborating with the ISO C++ efforts.

As for HPC, that is where the money for these kinds of efforts comes from.
Jan 19 2022
next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
 Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manage to get out of their their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously. Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs, that is why SYSCL is collaborating with ISO C++ efforts. As for HPC, that is where the money for these kind of efforts comes from.
Is Rust utterly irrelevant in this space? It feels weird not seeing it at all in this discussion. With all the talk about just how flexible its type system is and the emphasis on the functional paradigm (things like the typestate pattern), I thought it would matter quite a bit in this context as well, since functional programming languages are said to model hardware more fluidly (naturally?) than imperative languages like C++ (yes, it's multi-paradigm as well, but come on).
Jan 19 2022
parent IGotD- <nise nise.com> writes:
On Wednesday, 19 January 2022 at 15:25:31 UTC, Tejas wrote:
 I thought it would matter quite a bit in this context as well, 
 since functional programming languages are found to model 
 hardware more fluidly(naturally?) than imperative languages 
 like C++(yes, it's multi paradigm as well but come on)
I haven't experienced that at all. Functional programming is nothing like an HDL language (VHDL, Verilog), and those languages function completely differently from functional programming languages. They are somewhat parallel in nature, at least, but not like functional programming. I've found that imperative languages model a CPU better (a sequence of instructions) than functional programming languages, which seem to work at a higher conceptual level.
Jan 19 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
 Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously.
For those wishing to deploy today, I agree, but it should be considered for future deployments. That said, it's just one way for dcompute to tie in. My current dcompute work comes in, for example, via PTX-jit courtesy of an Nvidia driver.
 Yes, the end goal of the industry efforts is that C++ will be 
 the lingua franca of GPGPUs and FPGAs, that is why SYCL is 
 collaborating with ISO C++ efforts.
Yes, apparently there's a huge amount of time/money being spent on SYCL. We can co-opt much of that work underneath (the upcoming LLVM SPIR-V backend, debuggers, profilers, some libs) and provide a much better language on top. C++/SYCL is, to put it charitably, cumbersome.
 As for HPC, that is where the money for these kind of efforts 
 comes from.
Perhaps, but I suspect other market segments will be (already are?) more important going forward. Gaming generally and ML on SoCs come to mind.
Jan 19 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too:
Yes, that is the issue I wanted to discuss in the OP. If hardware vendors create closed-source C++ compilers that use internal knowledge of how their GPUs work, then it might be difficult for other languages to compete. You'd have to compile to Metal/Vulkan and fine-tune it for each GPU. Or just compile to C++… I don't know. I guess we will find out in the years to come.
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 19 January 2022 at 09:34:38 UTC, Ola Fosheim 
Grøstad wrote:
 If hardware vendors create closed-source C++ compilers that use 
 internal knowledge of how their GPUs work, then it might be 
 difficult for other languages to compete. You'd have to compile 
 to Metal/Vulkan and fine-tune it for each GPU.
Arguably that already describes Nvidia. Luckily for us, it has an intermediate layer in PTX that LLVM can target, and that's exactly what dcompute does. Unlike C++, D can much more easily statically condition on aspects of the hardware, making it faster to navigate the tuning parameter configuration space.
Jan 19 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson 
wrote:
 Arguably that already describes Nvidia. Luckily for us, it has 
 an intermediate layer in PTX that LLVM can target, and that's 
 exactly what dcompute does.
For desktop applications one has to support Intel, AMD, Nvidia, Apple. So, does that mean that one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…
 Unlike C++, D can much more easily statically condition on 
 aspects of the hardware, making the tuning process faster to 
 navigate the parameter configuration space.
Not sure what you meant here?
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 19 January 2022 at 10:17:45 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson 
 wrote:
 Arguably that already describes Nvidia. Luckily for us, it has 
 an intermediate layer in PTX that LLVM can target, and that's 
 exactly what dcompute does.
For desktop applications one has to support Intel, AMD, Nvidia, Apple. So, does that mean that one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…
That was a comment mostly about the market share and "business practices" of Nvidia. Intel is well supported by OpenCL/SPIR-V. There are some murmurings that AMD is getting SPIR-V support for ROCm, though if that is insufficient, I don't think it would be too difficult to hook the AMDGPU backend up to LDC+DCompute (the runtime libraries would be a bit tedious, given the lack of familiarity and volume of code), but I have no hardware to run ROCm on at the moment. Metal should also not be too difficult to hook LDC up to (the kernel argument format is different, which is annoying); the main thing lacking is Objective-C support to bind the runtime libraries for DCompute (which would also need to be written). LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).
 Unlike C++, D can much more easily statically condition on 
 aspects of the hardware, making the tuning process faster to 
 navigate the parameter configuration space.
Not sure what you meant here?
I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware. You can much more easily create multiple copies for different sized cache (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.
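To make that concrete, here is a small sketch in plain D (not dcompute device code; the function and the "register budget" below are made up) of stamping out unroll variants with static foreach and gating the choice with static if:

```d
// Not device code, just the mechanism: one specialised loop body per unroll
// factor, with the choice resolved entirely at compile time.
import std.stdio;

// A windowed dot product with the inner loop fully unrolled `taps` times.
void window(int taps)(const float[] input, const float[taps] coeffs, float[] output)
{
    foreach (i; 0 .. output.length)
    {
        float acc = 0;
        static foreach (k; 0 .. taps)        // unrolled at compile time
            acc += input[i + k] * coeffs[k];
        output[i] = acc;
    }
}

void main()
{
    // Pretend this came from a per-target hardware description.
    enum registerBudget = 32;

    auto input  = new float[](16);
    input[] = 1;
    auto output = new float[](input.length - 4);

    // Pick the widest variant the (made-up) budget allows.
    static if (registerBudget >= 32)
        window!4(input, [0.25f, 0.25f, 0.25f, 0.25f], output);
    else
        window!2(input, [0.5f, 0.5f], output);

    writeln(output);   // all ones in, unity-gain filter: all ones out
}
```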
Jan 19 2022
next sibling parent reply Araq <rumpf_a web.de> writes:
On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
wrote:

 I mean there are parametric attributes of the hardware, say for 
 example cache size (or available registers for GPUs), that have 
 a direct effect on how many times you can unroll the inner 
 loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
And you can do that even more easily with an AST macro system. Which Julia has...
Jan 19 2022
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:
 And you can do that even more easily with an AST macro system. 
 Which Julia has...
I think these approaches are somewhat pointless for desktop applications, although a JIT does help. If time-consuming compile-time adaptation to the hardware is needed, then this should happen at installation. A better approach is to ship code in a high-level IR and then bundle a compiler with the installer.
Jan 20 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:
 On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
 wrote:

 I mean there are parametric attributes of the hardware, say 
 for example cache size (or available registers for GPUs), that 
 have a direct effect on how many times you can unroll the 
 inner loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
And you can do that even more easily with an AST macro system. Which Julia has...
Given this endorsement I started reading up on Julia/GPU... Here are a few things that I found:

A gentle tutorial: https://nextjournal.com/sdanisch/julia-gpu-programming

Another, more concise: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

For those who are video oriented, here's a recent workshop: https://www.youtube.com/watch?v=Hz9IMJuW5hU

While I admit to just skimming that, very long, video I was impressed by the tooling on display and the friendly presentation. In short, I found a lot to like about Julia from the above and other writings, but the material on Julia AST macros specifically was ... underwhelming. AST macros look like an inferior tool in this low-level setting. They are slightly less readable to me than the dcompute alternatives without offering any compensating gain in performance.
Jan 20 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
wrote:
 I mean there are parametric attributes of the hardware, say for 
 example cache size (or available registers for GPUs), that have 
 a direct effect on how many times you can unroll the inner 
 loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
Hmm, I don't understand, the unrolling should happen at runtime so that you can target all GPUs with one executable? If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 20 January 2022 at 06:57:28 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
 wrote:
 I mean there are parametric attributes of the hardware, say 
 for example cache size (or available registers for GPUs), that 
 have a direct effect on how many times you can unroll the 
 inner loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
Hmm, I don't understand, the unrolling should happen at runtime so that you can target all GPUs with one executable?
Now you've confused me. You can select which implementation to use at runtime with e.g. CPUID or more sophisticated methods. LDC targeting DCompute can produce multiple objects with the same compiler invocation, i.e. you can get CUDA for any set of SM versions, or OpenCL-compatible SPIR-V; per GPU you can inspect its hardware characteristics and then select which of your kernels to run.
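As a rough host-side illustration (plain D, not the dcompute API; the kernel names and their bodies are made up), the CPU case looks like this, and the GPU case has the same shape: enumerate devices, inspect them, then pick among the precompiled kernels.

```d
// Host-side illustration only (not the dcompute API): query the machine at
// runtime and pick one of several precompiled kernel variants.
import core.cpuid : avx2, sse2;
import std.stdio;

alias Kernel = void function(float[] data);

// Hypothetical variants; imagine each one tuned for the ISA in its name.
void kernelAvx2(float[] data)   { data[] *= 2; }
void kernelSse2(float[] data)   { data[] *= 2; }
void kernelScalar(float[] data) { foreach (ref x; data) x *= 2; }

Kernel selectKernel()
{
    if (avx2) return &kernelAvx2;
    if (sse2) return &kernelSse2;
    return &kernelScalar;
}

void main()
{
    auto data = [1.0f, 2.0f, 3.0f];
    selectKernel()(data);
    writeln(data);   // [2, 4, 6] whichever variant was chosen
}
```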
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and, with a bit of work, whatever you'd use HLSL for; it can also be your compute kernel language, substituting for OpenCL and CUDA. Same caveats apply for Metal (should be pretty easy to do: need Objective-C support in LDC, need Metal bindings).
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson 
wrote:
  Now you've confused me. You can select which implementation to 
 use at runtime with e.g. CPUID or more sophisticated methods. 
 LDC targeting DCompute can produce multiple objects with the 
 same compiler invocation, i.e. you can get CUDA for any set of 
 SM version, OpenCL compatible SPIR-V which you can get per GPU, 
 inspect its hardware characteristics and then select which of 
 your kernels to run.
Yes, so why do you need compile time features? My understanding is that the goal of nvc++ is to compile to CPU or GPU based on what pays off more for the actual code. So it will not need any annotations (it is up to the compiler to choose between CPU/GPU?). Bryce suggested that it currently only targets one specific GPU, but that it will target multiple GPUs for the same executable in the future. The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said? My viewpoint is that if one is going to take a performance hit by not writing the shaders manually, one needs to get maximum convenience as a payoff. It should be an alternative for programmers that cannot afford to put in the extra time to support GPU compute manually.
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and with a bit of work whatever you'd use HLSL for, it can also be your compute kernel language substituting for OpenCL and CUDA.
I still don't understand why you would need static if/static for-loops? Seems to me that this is too hardwired, you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.
 Same caveats apply for metal (should be pretty easy to do: need 
 Objective-C support in LDC, need Metal bindings).
Use clang to compile the Objective-C code to object files and link with it?
Jan 20 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson 
 wrote:
  Now you've confused me. You can select which implementation 
 to use at runtime with e.g. CPUID or more sophisticated 
 methods. LDC targeting DCompute can produce multiple objects 
 with the same compiler invocation, i.e. you can get CUDA for 
 any set of SM version, OpenCL compatible SPIR-V which you can 
 get per GPU, inspect its hardware characteristics and then 
 select which of your kernels to run.
Yes, so why do you need compile time features?
Because compilers are not sufficiently advanced to extract all the performance that is available on their own. A good example of where the automated/simple approach was not good enough is CUB (CUDA unbound), a high performance CUDA library found here https://github.com/NVIDIA/cub/tree/main/cub I'd recommend taking a look at the specializations that occur in CUB in the name of performance. D compile time features can help reduce this kind of mess, both in extreme performance libraries and extreme performance code.
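For a flavour of what I mean, here is a hedged sketch (my own made-up policy numbers and names, not CUB's) of how a static if ladder keyed on a target descriptor can stand in for a family of hand-written per-architecture tuning policies:

```d
// My own made-up numbers, not CUB's: the point is that one static if ladder
// replaces a family of hand-maintained per-architecture specialisations.
import std.stdio;

struct Policy(int blockThreads, int itemsPerThread)
{
    enum BLOCK_THREADS    = blockThreads;
    enum ITEMS_PER_THREAD = itemsPerThread;
    enum TILE_ITEMS       = blockThreads * itemsPerThread;
}

// Compile-time selection of a tuning policy keyed on a target descriptor.
template TunedPolicy(int smVersion)
{
    static if (smVersion >= 80)      alias TunedPolicy = Policy!(256, 16);
    else static if (smVersion >= 60) alias TunedPolicy = Policy!(128, 12);
    else                             alias TunedPolicy = Policy!(64, 8);
}

void describe(int smVersion)()
{
    alias P = TunedPolicy!smVersion;
    writefln("sm_%s: %s threads x %s items = %s-element tiles",
             smVersion, P.BLOCK_THREADS, P.ITEMS_PER_THREAD, P.TILE_ITEMS);
}

void main()
{
    // One line per supported target instead of one hand-written file each.
    static foreach (sm; [52, 61, 86])
        describe!sm();
}
```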
 My understanding is that the goal of nvc++ is to compile to CPU 
 or GPU based on what pays of more for the actual code. So it 
 will not need any annotations (it is up to the compiler to 
 choose between CPU/GPU?). Bryce suggested that it currently 
 only targets one specific GPU, but that it will target multiple 
 GPUs for the same executable in the future.

 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
I think that that is an entirely reasonable goal but such transparency may cost performance and any such cost will be unacceptable to some.
 My viewpoint is that if one are going to take a performance hit 
 by not writing the shaders manually one need to get maximum 
 convenience as a payoff.

 It should be an alternative for programmers that cannot afford 
 to put in the extra time to support GPU compute manually.
Yes. Always good to have alternatives. Fully automated is one option, hinted is a second alternative, meta-programming assisted manual is a third.
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and with a bit of work whatever you'd use HLSL for, it can also be your compute kernel language substituting for OpenCL and CUDA.
I still don't understand why you would need static if/static for-loops? Seems to me that this is too hardwired, you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.
If you can achieve your performance objectives with automated or hinted solutions, great!  But what if you can't?  Most people will not have to go as hardcore as the CUB authors did to get the performance they need, but I find myself wanting more than the compiler can easily give me quite a bit of the time.  I'm very happy to have the metaprogramming tools to factor/reduce these "manual" programming tasks.
 Same caveats apply for metal (should be pretty easy to do: 
 need Objective-C support in LDC, need Metal bindings).
Use clang to compile the objective-c code to object files and link with it?
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:
 Because compilers are not sufficiently advanced to extract all 
 the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU combinations either, so you don't know whether SIMD would perform better than the GPU. Something automated has to be present, at least on install, otherwise you risk performance degradation compared to a pure SIMD implementation. And then it is better (and cheaper) to just avoid the GPU altogether.
 A good example of where the automated/simple approach was not 
 good enough is CUB (CUDA unbound), a high performance CUDA 
 library found here https://github.com/NVIDIA/cub/tree/main/cub

 I'd recommend taking a look at the specializations that occur 
 in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when I browsed through the repo?
 If you can achieve your performance objectives with automated 
 or hinted solutions, great!  But what if you can't?
Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly? But I have no experience with that as it is quite time consuming to go that route. Right now basic SIMD is time consuming enough… (but OK)
Jan 20 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 13:29:26 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal 
 wrote:
 Because compilers are not sufficiently advanced to extract all 
 the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU combinations either so then you don't know if SIMD would perform better than GPU.
It can be very expensive to write and test all the permutations, yes, but often you'll understand the bottlenecks of your algorithms sufficiently to be able to correctly filter out the work up front. Restating here, these are a few of the traditional ways to look at it: Throughput or latency limited? Operand/memory or arithmetic limited? Power (watts) preferred or other performance? It's possible, for instance, that you can *know*, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.
 Something automated has to be present, at least on install, 
 otherwise you risk performance degradation compared to a pure 
 SIMD implementation. And then it is better (and cheaper) to 
 just avoid GPU altogether.
Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness testing and portability baseline regardless of the accelerator possibilities.
 A good example of where the automated/simple approach was not 
 good enough is CUB (CUDA unbound), a high performance CUDA 
 library found here https://github.com/NVIDIA/cub/tree/main/cub

 I'd recommend taking a look at the specializations that occur 
 in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when I browsed through the repo?
The key thing to note is how much effort the authors put into specialization wrt the HW x SW cross product. There are entire subdirectories devoted to specialization. At least some of this complexity, this programming burden, can be factored out with better language support.
 If you can achieve your performance objectives with automated 
 or hinted solutions, great!  But what if you can't?
Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly?
That's what seems reasonable, yes, but fortunately I don't think it's correct. By analogy, you *can* get maximum performance from assembly level programming, if you have all the compiler back-end knowledge in your head, but if your language allows you to communicate all relevant information (mainly dependencies and operand localities but also "intrinsics") then the compiler can do at least as well as the assembly level programmer. Add language support for inline and factored specialization and the lower level alternatives become even less attractive.
 But I have no experience with that as it is quite time 
 consuming to go that route. Right now basic SIMD is time 
 consuming enough… (but OK)
Indeed. I'm currently working on the SIMD variant of something I partially prototyped earlier on a 2080 and it has been slow going compared to either that GPU implementation or the scalar/serial variant. There are some very nice assists from D for SIMD programming: the __vector typing, __vector arithmetic, unaligned vector loads/stores via static array operations, static foreach to enable portable expression of single-instruction SIMD functions like min, max, select, various shuffles, masks, ... but, yes, SIMD programming is definitely a slog compared to either scalar or SIMT GPU programming.
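A tiny sketch of those assists (assumes an x86_64 target where core.simd defines float4; the helper names are mine, and an optimising backend is only expected, not guaranteed, to collapse the per-lane loops into single instructions):

```d
// Small CPU-side sketch of the D SIMD assists mentioned above.
import core.simd;
import std.stdio;

// An unaligned load expressed through the vector's static-array view.
float4 load4(const(float)[] s)
{
    float4 v;
    static foreach (i; 0 .. 4)
        v.array[i] = s[i];
    return v;
}

// Portable per-lane min; a good backend can usually turn this static foreach
// into a single SIMD min instruction.
float4 vmin(float4 a, float4 b)
{
    float4 r;
    static foreach (i; 0 .. 4)
        r.array[i] = a.array[i] < b.array[i] ? a.array[i] : b.array[i];
    return r;
}

void main()
{
    float[] xs = [4.0f, 1.0f, 7.0f, 2.0f];
    float[] ys = [3.0f, 5.0f, 6.0f, 8.0f];

    float4 a = load4(xs);
    float4 b = load4(ys);
    float4 sum = a + b;     // __vector arithmetic
    float4 low = vmin(a, b);

    writeln(sum.array);     // [7, 6, 13, 10]
    writeln(low.array);     // [3, 1, 6, 2]
}
```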
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal wrote:
 It's possible, for instance, that you can *know*, from first 
 principles, that you'll never meet objective X if forced to use 
 platform Y.  In general, though, you'll just have a sense of 
 the order in which things should be evaluated.
This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.
 Yes, SIMD can be the better performance choice sometimes.  I 
 think that many people will choose to do a SIMD implementation 
 as a performance, correctness testing and portability baseline 
 regardless of the accelerator possibilities.
My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option. I think that the point of having parallel support built into the language is not to get the absolute maximum performance, but to make writing more performant code more accessible and cheaper. If you end up having to handwrite SIMD to get decent performance then that pretty much makes parallel support a fringe feature. E.g. it won't be of much use outside HPC with expensive equipment. So in my mind this feature does require hardware vendors to focus on CPU/GPU integration, and it also requires a rather "intelligent" compiler and runtime setup in order to pay for the debts of the "abstraction overhead". I don't think just translating a language AST to an existing shared backend will be sufficient. If that was sufficient Nvidia wouldn't need to invest in nvc++? But, it remains to be seen who will pull this off, besides Nvidia.
Jan 20 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 19:57:54 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal 
 wrote:
 It's possible, for instance, that you can *know*, from first 
 principles, that you'll never meet objective X if forced to 
 use platform Y.  In general, though, you'll just have a sense 
 of the order in which things should be evaluated.
This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.
Never meant to say that it did. Just pointed out that you can factor some of the work.
 Yes, SIMD can be the better performance choice sometimes.  I 
 think that many people will choose to do a SIMD implementation 
 as a performance, correctness testing and portability baseline 
 regardless of the accelerator possibilities.
My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option.
The dream, for decades, has been that "the compiler" will just "do the right thing" when provided dead simple code, that it will achieve near-or-better-than-human-tuned levels of performance in all scenarios that matter. It is a dream worth pursuing.
 I think that the point of having parallel support built into 
 the language is not to get the absolute maximum performance, 
 but to make writing more performant code more accessible and 
 cheaper.
If accessibility requires less performance then you, as a language designer, have a choice. I think it's a false choice but if forced to choose my choice would bias toward performance, "system language" and all that. Others, if forced to choose, would pick accessibility.
 If you end up having to handwrite SIMD to get decent 
 performance then that pretty much makes parallel support a 
 fringe feature. E.g. it won't be of much use outside HPC with 
 expensive equipment.
I disagree but can't see how pursuing it further would be useful. We can just leave it to the market.
 So in my mind this feature does require hardware vendors to 
 focus on CPU/GPU integration, and it also requires a rather 
 "intelligent" compiler and runtime setup in order to pay for 
 the debts of the "abstraction overhead".
I put more faith in efforts that cleanly reveal low level capabilities to the community, that are composable, than I do in future hardware vendor efforts.
 I don't think just translating a language AST to an existing 
 shared backend will be sufficient. If that was sufficient 
 Nvidia wouldn't need to invest in nvc++?
Well, at least for current dcompute users, it already is sufficient. The Julia efforts in this area also appear to be successful. Sean Baxter's "circle" offshoot of C++ is another. I imagine there are or will be other instances where relatively small manpower inputs successfully co-opt backends to provide nice access and great performance for their respective language communities.
 But, it remains to be seen who will pull this off, besides 
 Nvidia.
I don't think there is much that remains to be seen here. The rate and scope of adoption are still interesting questions but the "can we provide something very useful to our language community?" question has been answered in the affirmative. People choose dcompute, circle, Julia-GPU over or in addition to CUDA/OpenCL today. Others await more progress from the C++/SycL movement. Meaningful choice is good.
Jan 20 2022
prev sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad 
wrote:
 Yes, so why do you need compile time features?

 My understanding is that the goal of nvc++ is to compile to CPU 
 or GPU based on what pays of more for the actual code. So it 
 will not need any annotations (it is up to the compiler to 
 choose between CPU/GPU?). Bryce suggested that it currently 
 only targets one specific GPU, but that it will target multiple 
 GPUs for the same executable in the future.
There are two major advantages to compile time features, for the host and for the device (e.g. GPU). On the host side, D metaprogramming allows DCompute to do what CUDA does with its <<<>>> kernel launch syntax, in terms of type safety and convenience, with regular D code. This is the feature that makes CUDA nice to use; OpenCL's lack of such a feature makes it quite horrible to use, and turns a change of kernel signature into a refactoring unto itself. On the device side, I'm sure Bruce can give you some concrete examples.
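Back on the host side, here is a hypothetical sketch (this is not dcompute's actual API, just an illustration of the mechanism) of how D templates can tie a launch call's argument list to the kernel's signature:

```d
// Hypothetical sketch, not dcompute's real API: the argument tuple is literally
// the kernel's parameter list, so the launch is type-checked at compile time.
import std.traits : Parameters;
import std.stdio;

// Stand-in for a device kernel; under dcompute this would be compiled for the GPU.
void saxpy(float a, float[] x, float[] y)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

void launch(alias kernel)(size_t gridSize, Parameters!kernel args)
{
    // A real backend would marshal `args` and enqueue on a device queue here;
    // this sketch just runs the kernel on the host.
    foreach (_; 0 .. gridSize)
        kernel(args);
}

void main()
{
    auto x = [1.0f, 2.0f, 3.0f];
    auto y = [0.0f, 0.0f, 0.0f];

    launch!saxpy(1, 2.0f, x, y);      // checked against saxpy's parameters
    // launch!saxpy(1, "oops", x, y); // would be rejected at compile time
    writeln(y);                        // [2, 4, 6]
}
```

Change the kernel's signature and every call site fails at build time, rather than failing at enqueue time the way a raw OpenCL setup tends to.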
 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
You want it to be transparent, not invisible.
 Same caveats apply for metal (should be pretty easy to do: 
 need Objective-C support in LDC, need Metal bindings).
Use clang to compile the objective-c code to object files and link with it?
Won't work, D needs to be able to call the Objective-C. I mean you could use a C or C++ shim, but that would be pretty ugly.
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson wrote:
 There are two major advantages for compile time features, for 
 the host and for the device (e.g. GPU).
Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime? I guess there might be instances where you might want to consider changing the entire data layout to fit the hardware, but then you are, to some extent, outside of what most D programmers would be willing to do.
 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
You want it to be transparent, not invisible.
The goal is to make it look like a regular C++ library, no extra syntax.
 Wont work, D needs to be able to call the objective-c.
 I mean you could use a C or C++ shim, but that would be pretty 
 ugly.
Just write the whole runtime in Objective-C++. Why would it be ugly?
Jan 21 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 21 January 2022 at 08:56:22 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson 
 wrote:
 There are two major advantages for compile time features, for 
 the host and for the device (e.g. GPU).
Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime?
Before. But with SPIR-V there is an additional compilation/optimisation step where it is converted into whatever format the hardware uses; you could also set specialisation constants at that point, if I ever get around to supporting those. I think something similar probably also happens with PTX (which is an assembly-like format) on the way to whatever the binary format is.
 I guess there might be instances where you might want to 
 consider to change the entire data layout to fit the hardware, 
 but then you to some extent outside of what most D programmers 
 would be willing to do.
Indeed.
 You want it to be transparent, not invisible.
The goal is to make it look like a regular C++ library, no extra syntax.
There is an important difference between it looking like regular C++ (i.e. function calls, not <<<>>>) and the compiler doing auto-GPU-isation. I'm not sure which one you're referring to here. I'm all for the former, that's what DCompute does; the latter falls too far into "sufficiently advanced compiler" territory and would necessarily have to determine what to send to the GPU and when, which could seriously impact performance.
 Just write the whole runtime in Objective-C++. Why would it be 
 ugly?
_Just_. I mean it would be doable, but I'd rather not spend my time doing that.
Jan 21 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 21 January 2022 at 09:45:32 UTC, Nicholas Wilson wrote:
 _Just_. I mean it would be doable, but I rather not spend my 
 time doing that.
:-D This is where you need more than one person for the project… I might do it, if I found a use case for it. I am sure some contributor other than yourself could do it if Metal support were in.
Jan 21 2022
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 I found the CppCon 2021 presentation
 [C++ Standard 
 Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
 Bryce Adelstein Lelbach very interesting, unusually clear and 
 filled with content. I like this man. No nonsense.

 It provides a view into what is coming for relatively high 
 level and hardware agnostic parallel programming in C++23 or 
 C++26. Basically a portable "high level" high performance 
 solution.

 He also mentions the Nvidia C++ compiler *nvc++* which will 
 make it possible to compile C++ to Nvidia GPUs in a somewhat 
 transparent manner. (Maybe it already does, I have never tried 
 to use it.)

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
Given the emergence of ML in the commercial space and the prevalence of accelerator HW on SoCs and elsewhere, this is a timely topic Ola. We have at least two options: 1) try to mimic or sit atop the, often byzantine, interfaces that creak out of the C++ community or 2) go direct to the evolving metal with D meta-programming shouldering most of the load. I favor the second of course. For reference, CUDA/C++ was my primary programming language for 5+ years prior to taking up D and, even in its admittedly less-than-newbie-friendly state, I prefer dcompute to CUDA. With some additional work dcompute could become a broadly accessible path to world beating performance/watt libraries and apps. Code that you can actually understand at a glance when you pick it up down the road. Kudos to the dcompute contributors, especially Nicholas.
Jan 12 2022
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
Jan 12 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++. There *are* improvements to be made but, by my lights, dcompute is already better than CUDA in many ways. If we improve usability, make dcompute accessible to "mere mortals", make it a "no big deal" choice instead of a "here be dragons" choice, we'd really have something. By contrast, I just don't see the C++ crowd getting to sanity/simplicity any time soon... not unless ideas from the circle compiler or similar make their way to mainstream.
Jan 12 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal wrote:
 I disagree.  D/dcompute can be used as a better general purpose 
 GPU kernel language now (superior meta programming, sane nested 
 functions, ...).
Is *dcompute* being actively developed or is it in a "frozen" state? Longevity is important for adoption, I think.
 There *are* improvements to be made but, by my lights, dcompute 
 is already better than CUDA in many ways.  If we improve 
 usability, make dcompute accessible to "mere mortals", make it 
 a "no big deal" choice instead of a "here be dragons" choice, 
 we'd really have something.
Maybe it would be possible to do something with a more limited scope, but more low level? Like something targeting Metal and Vulkan directly? Something like this might be possible to do well if D would change the focus and build a high-level IR. I think one of Bryce's main points is that there is more long-term stability in C++ than in the other APIs for parallel computing, so for long-term development it would be better to express parallel code in terms of a C++ standard library construct than in terms of other compute APIs. That argument makes sense to me, I don't want to deal with CUDA or OpenCL as dependencies. I'd rather have something sit directly on top of the lower level APIs.
 By contrast, I just don't see the C++ crowd getting to 
 sanity/simplicity any time soon... not unless ideas from the 
 circle compiler or similar make their way to mainstream.
It does look a bit complex, but what I find promising for C++ is that Nvidia is pushing their hardware by creating backends for C++ parallel libraries that target multiple GPUs. That in turn might push Apple to do the same for Metal and so on. If C++20 had what Bryce presented then I would've considered using it for signal processing. Right now it would make more sense to target Metal/Vulkan directly, but that is time-consuming, so I probably won't.
Jan 13 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 09:10:48 UTC, Ola Fosheim Grøstad 
wrote:
 Is *dcompute* being actively developed or is it in a "frozen" 
 state? longevity is important for adoption, I think.
not actively per se, but I have been adding features recently...
 Maybe it would be possible to do something with a more limited 
 scope, but more low level? Like something targeting Metal and 
 Vulkan directly? Something like this might be possible to do 
 well if D would change the focus and build a high level IR.
... one of which was compiler support for Vulkan compute shaders (no runtime yet — Ethan didn't need that, graphics APIs are large, and I'm not sure if there are any good bindings). Metal is annoyingly different in its kernel signatures, which could be handled fairly easily, but:

* LDC lacks Objective-C support, so even if the compiler side of Metal support worked, the runtime side would not. (N.B. adding Objective-C support shouldn't be too difficult, but I don't have a particular need for it.)

* kernels written for Metal would not be compatible with the OpenCL and CUDA ones (not that I suppose that would be a particular problem if all you care about is Metal).

LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).
 I think one of Bryce's main points is that there is more long 
 term stability in C++ than in the other APIs for parallel 
 computing, so for long term development it would be better to 
 express parallel code in terms of a C++ standard library 
 construct than other compute-APIs.

 That argument makes sense for me, I don't want to deal with 
 CUDA or OpenCL as dependencies. I'd rather have something sit 
 directly on top of the lower level APIs.
Dcompute essentially sits as a thin layer over both, but importantly automates the crap out of the really tedious and error prone usage of the APIs. It would be entirely possible to create a thicker API agnostic layer over the top of both of them.
 By contrast, I just don't see the C++ crowd getting to 
 sanity/simplicity any time soon... not unless ideas from the 
 circle compiler or similar make their way to mainstream.
It does look a bit complex, but what I find promising for C++ is that Nvidia is pushing their hardware by creating backends for C++ parallel libraries that target multiple GPUs. That in turn might push Apple to do the same for Metal and so on. If C++20 had what Bryce presented then I would've considered using it for signal processing. Right now it would make more sense to target Metal/Vulkan directly, but that is time-consuming, so I probably won't.
If there is sufficient interest for it, I might have a go at adding Metal compute support to ldc.
Jan 13 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 09:42:04 UTC, Nicholas Wilson 
wrote:
 If there is sufficient interest for it, I might have a go at 
 adding Metal compute support to ldc.
I don't know if there is enough interest for it today. Right now, maybe easy visualization is more important. But when GUI/visualization is in place then I think a compute solution that supports lower level GPU APIs would be valuable for desktop application development. Not sure if it is a good idea to do a compute-only runtime as I would think that the application developer would want to balance resources used for compute and visualization in some way?
Jan 13 2022
prev sibling parent reply bachmeier <no spam.net> writes:
On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal wrote:
 On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++.
I was referring to libraries like numpy for Python or the numerical capabilities built into Julia. D just isn't in a state where a researcher is going to say "let's write a D program for that simulation". You can call some things in Mir and cobble together an interface to some C libraries or whatever. That's not the same as Julia, where you write the code you need for the task at hand. That's the starting point to make it into scientific computing. On the embedding, yes, that is the strength of D. If you write code in Python, it's realistically only for the Python world. Probably the same for Julia.
Jan 13 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 If you write code in Python, it's realistically only for the 
 Python world. Probably the same for Julia.
Does scipy provide the functionality you would need? Could it in some sense be considered a baseline for scientific computing APIs?
Jan 13 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Thursday, 13 January 2022 at 15:09:13 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 If you write code in Python, it's realistically only for the 
 Python world. Probably the same for Julia.
Does scipy provide the functionality you would need? Could it in some sense be considered a baseline for scientific computing APIs?
SciPy is fairly useful but it is only one amongst a constellation of Python scientific computing libraries. It emulates a fair amount of what is provided by MATLAB, and it sits on top of numpy. Using SciPy, numpy, and matplotlib in tandem gives a user access to roughly the same functionality as a vanilla installation of MATLAB. SciPy and numpy are built on top of a substrate of old and stable packages written in Fortran and C (BLAS, LAPACK, fftw, etc.). Python, MATLAB, and Julia are basically targeted at scientists and engineers writing "application code". These languages aren't appropriate for "low-level" scientific computing along the lines of the libraries mentioned above. Julia does make a claim to the contrary: it is feasible to write fast low-level kernels in it, but (last time I checked) it is not so straightforward to export them to other languages, since Julia likes to do things at runtime. Fortran and C remain good choices for low-level kernel development because they are easily consumed by Python et al. And as far as parallelism goes, OpenMP is the most common since it is straightforward conceptually. C++ is also fairly popular but since consuming something like a highly templatized header-only C++ library using e.g. Python's FFI is a pain, it is a less natural choice. (It's easier using pybind11, but the compile times will make you weep.) Fortran, C, and C++ are also all standardized. This is valuable. The people developing these libraries are---more often than not---academics, who aren't able to devote much of their time to software development. Having some confidence that their programming language isn't going to change underneath gives them some assurance that they aren't going to be forced to spend an inordinate amount of time keeping their code in compliance for it to remain usable. Either that, or they write a library in Python and abandon it later. As an aside, people lament the use of MATLAB, but one of its stated goals is backwards compatibility. Consequently, there's rather a lot of old MATLAB code floating around still in use. "High-level" D is currently not that interesting for high-level scientific application code. There is a long list of "everyday" scientific computing tasks I could think of which I'd like to be able to execute in a small number of lines, but this is currently impossible using any flavor of D. See https://www.numerical-tours.com for some ideas. "BetterC" D could be useful for developing numerical kernels. An interesting idea would to use D's introspection capabilities to automatically generate wrappers and documentation for each commonly used scientific programming language (Python, MATLAB, Julia). But D not being standardized makes it less attractive than C or Fortran. It is also unclear how stable D is as an open source project. The community surrounding it is rather small and doesn't seem to have much momentum. There also do not appear to be any scientific computing success stories with D. My personal view is that people in science are generally more interested in actually doing science than in playing around with programming trivia. Having to spend time to understand something like C++'s argument dependent lookup is generally viewed as undesirable and a waste of time.
Jan 13 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 16:10:39 UTC, sfp wrote:
 My personal view is that people in science are generally more 
 interested in actually doing science than in playing around 
 with programming trivia.
Yes, this is probably true. My impression is that physics departments tend to be in favour of Python, C++ and I guess Fortran. In signal processing it is Matlab, with Python as an upcoming alternative. Maybe GPU-compute support is more relevant for desktop application development than scientific computing, in the context of D.
Jan 13 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal 
 wrote:
 On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++.
I was referring to libraries like numpy for Python or the numerical capabilities built into Julia. D just isn't in a state where a researcher is going to say "let's write a D program for that simulation". You can call some things in Mir and cobble together an interface to some C libraries or whatever. That's not the same as Julia, where you write the code you need for the task at hand. That's the starting point to make it into scientific computing.
I agree. If the heavy lifting for a new project is accomplished by libraries that you can't easily co-opt then better to employ D as the GPU language or not at all. More broadly, I don't think we should set ourselves a task of displacing language X in community Y. Better to focus on making accelerator programming "no big deal" in general so that people opt-in more often (first as accelerator language sub-component, then maybe more). While my present day use of dcompute is in real time video, where it works a treat, I'm most excited about the possibilities dcompute would afford on SoCs. World class perf/watt from dead simple code deployable to billions of units? Yes, please.
 ...
Jan 13 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 platform isn't going to matter. There are lots of pieces here 
 and there in our community, but it's going to take some effort 
 to (a) make it easy to use the different parts together, (b) 
 document everything, and (c) write the missing pieces.
What C++ seems to do for (a) is to add a library construct for fully configurable multidimensional non-owning slices (`mdspan`).
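To make the comparison concrete, a minimal sketch of that idea in D terms (a toy type of my own; Mir's ndslice is the full-featured D equivalent):

```d
// A minimal sketch of the mdspan idea: a non-owning 2-D view over existing memory.
import std.stdio;

struct View2D(T)
{
    T* data;          // borrowed, not owned
    size_t rows, cols;

    ref T opIndex(size_t r, size_t c)
    {
        assert(r < rows && c < cols);
        return data[r * cols + c];   // row-major layout
    }
}

void main()
{
    auto storage = new double[](6);   // any contiguous buffer will do
    foreach (i, ref v; storage) v = i;

    auto m = View2D!double(storage.ptr, 2, 3);
    writeln(m[1, 2]);   // 5: element at row 1, column 2
    m[0, 0] = 42;
    writeln(storage);   // the view writes through to the original buffer
}
```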
Jan 13 2022
prev sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 I found the CppCon 2021 presentation
 [C++ Standard 
 Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
 Bryce Adelstein Lelbach very interesting, unusually clear and 
 filled with content. I like this man. No nonsense.

 It provides a view into what is coming for relatively high 
 level and hardware agnostic parallel programming in C++23 or 
 C++26. Basically a portable "high level" high performance 
 solution.

 He also mentions the Nvidia C++ compiler *nvc++* which will 
 make it possible to compile C++ to Nvidia GPUs in a somewhat 
 transparent manner. (Maybe it already does, I have never tried 
 to use it.)

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
I think the ship has already sailed, given the industry standards of SYCL and C++ for OpenCL, and their integration into clang (check the CppCon talks on the same) and FPGA generation. D can have a go at it, but only by plugging into the LLVM ecosystem, where C++ is the name of the game, and given that it is approaching Linux levels of industry contributors it isn't going anywhere. There was a time to try to overthrow C++; that was 10 years ago, when LLVM was hardly relevant and GPGPU computing still wasn't mainstream.
Jan 12 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 07:46:32 UTC, Paulo Pinto wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:
 ...
 What do you think?
... D can have a go at it, but only by plugging into the LLVM ecosystem where C++ is the name of the game, and given it is approaching Linux level of industry contributors it isn't going anywhere.
Yes.  The language-independent work in LLVM in the accelerator area is hugely important for dcompute, essential.  Gotta surf that wave as we don't have the manpower to go independent.  I don't think *anybody* has that amount of manpower, hence the collaboration/consolidation around LLVM as a back-end for accelerators.
 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Jan 13 2022
parent reply Tejas <notrealemail gmail.com> writes:
On Thursday, 13 January 2022 at 14:24:59 UTC, Bruce Carneal wrote:

 Yes.  The language independent work in LLVM in the accelerator 
 area is hugely important for dcompute, essential.
Sorry if this sounds ignorant, but does SPIR-V count for nothing?
  Gotta surf that wave as we don't have the manpower to go 
 independent.  I dont think *anybody* has that amount of 
 manpower, hence the collaboration/consolidation around LLVM as 
 a back-end for accelerators.

 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Overthrowing may be hopeless, but I feel we should at least be really competitive with them. Because it doesn't matter whether we're competing with C++ or not, people will compare us with it, since that's the other choice when people want to write extremely performant GPU code (if they care about ease of setup and productivity and _not_ performance-at-any-cost, Julia and Python have beaten us to it :-( )
Jan 13 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 16:31:11 UTC, Tejas wrote:
 On Thursday, 13 January 2022 at 14:24:59 UTC, Bruce Carneal 
 wrote:

 Yes.  The language independent work in LLVM in the accelerator 
 area is hugely important for dcompute, essential.
Sorry if this sounds ignorant, but does SPIR-V count for nothing?
SPIR-V is *very* useful. It is the catalyst and focal point of some of the most important ongoing LLVM accelerator work. Nicholas and I both believe that that work could provide a much more robust intermediate target for dcompute once it hits release status.
  Gotta surf that wave as we don't have the manpower to go 
 independent.  I dont think *anybody* has that amount of 
 manpower, hence the collaboration/consolidation around LLVM as 
 a back-end for accelerators.

 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Overthrowing may be hopeless, but I feel we should at least be a really competitive with them.
Sure. We need to offer something that is actually better, we just don't need to be perceived as better by everyone in all scenarios. An example: if management is deathly afraid of anything but microscopic incremental development or, more charitably, management weighs the risks of new development very very heavily, then D is unlikely to be given a chance.
 Because it doesn't matter whether we're competing with C++ or 
 not, people will compare us with it since that's the other 
 choice when people will want to write extremely performant GPU 
 code(if they care about ease of setup and productivity and 
 _not_ performance-at-any-cost, Julia and Python have beat us to 
 it :-(
 )
Yes. We should evaluate our efforts by comparing (competing) with alternatives where available. D/dcompute is already, for my GPU work at least, much better than CUDA/C++. Concretely: I can achieve equivalent or higher performance more quickly with more readable code than I could formerly with CUDA/C++. There are some things that are trivial in D kernels (like live-in-register/mem-bandwidth-minimized stencil processing) that would require "heroic" effort in CUDA/C++. That said, there are definitely things that we could improve in the dcompute/accelerator area, particularly wrt the on-ramp for those new to accelerator programming. But, as you note, D is unlikely to be adopted by the "performance is good enough with existing solutions" crowd in any case. That's fine.
Jan 13 2022
parent reply bachmeier <no spam.net> writes:
On Thursday, 13 January 2022 at 18:41:54 UTC, Bruce Carneal wrote:

 Yes.  We should evaluate our efforts by comparing (competing) 
 with alternatives where available.  D/dcompute is already, for 
 my GPU work at least, much better than CUDA/C++.  Concretely: I 
 can achieve equivalent or higher performance more quickly with 
 more readable code than I could formerly with CUDA/C++.  There 
 are some things that are trivial in D kernels (like 
 live-in-register/mem-bandwidth-minimized stencil processing) 
 that would require "heroic" effort in CUDA/C++.
Does anyone else know anything about this? Burying it deep in a mailing list post isn't exactly the best way to publicize it. Ironically, I might add, in a discussion about lack of uptake.
Jan 13 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 19:35:28 UTC, bachmeier wrote:
 On Thursday, 13 January 2022 at 18:41:54 UTC, Bruce Carneal 
 wrote:

 Yes.  We should evaluate our efforts by comparing (competing) 
 with alternatives where available.  D/dcompute is already, for 
 my GPU work at least, much better than CUDA/C++.  Concretely: 
 I can achieve equivalent or higher performance more quickly 
 with more readable code than I could formerly with CUDA/C++.  
 There are some things that are trivial in D kernels (like 
 live-in-register/mem-bandwidth-minimized stencil processing) 
 that would require "heroic" effort in CUDA/C++.
Does anyone else know anything about this? Burying it deep in a mailing list post isn't exactly the best way to publicize it. Ironically, I might add, in a discussion about lack of uptake.
I know, right? Ridiculously big opportunity/effort ratio for dlang and near zero awareness... I usually talk a bit about dcompute at the beerconfs but to date I've only corresponded on the topic with Nicholas, Ethan, and Max (a little).

Ethan might have a sufficiently compelling economic case for promoting dcompute to his company in the relatively near future. Nicholas recently addressed their need for access to the texture hardware and fitting within their work flow, but there may be other requirements... An adoption by a world class game studio would, of course, be very good news but I think Ethan is slammed (perpetually, and in a mostly good way, I think) so it might be a while.

Before promoting dcompute broadly I believe we should work through the installation/build/deployment procedures and some examples for the "new to accelerators" crowd. It's no big deal as it sits for old hands but first impressions are important and even veteran programmers will appreciate an "it just works" on-ramp.

If you're interested I suggest we continue the conversation on dcompute at the next beerconf where we can plot its path to world domination... :-)
Jan 13 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec? I honestly assumed that it was unsupported and close to dead as I had not heard much about it for a long time.
Jan 13 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 21:06:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec?
There are two reasons that I have not promoted dcompute to the general community up to now:

1) Any resultant increase in support load would fall on one volunteer (that is not me) and

2) IMO, a better on-ramp, particularly for those new to accelerators, is needed: additional examples, docs, and "it just works" install/build/deploy vetting would go a long way to reducing the support load and increasing happy uptake.

Additionally, Nicholas has a list of "TODOs" that probably should be worked through before additional promotion occurs. None of them impact my work but they might hit others. Nicholas' opinion on the matter is much more important than mine as he already has a non-D "day job" and would bear the brunt of a, possibly premature, promotion of dcompute.
Jan 13 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 21:39:07 UTC, Bruce Carneal wrote:
 1) Any resultant increase in support load would fall on one 
 volunteer (that is not me) and
Yes, that is not a good situation… The caveat is that if fewer people use dcompute, then fewer people will help out with it, and then it will take more time to reach a state where it is "ready"…

Showing how/when dcompute improves performance on standard desktop computers might make more people interested in participating. Are there some performance benchmarks on modest hardware? (e.g. a standard macbook, imac or mac mini) Benchmarks that compare dcompute to CPU with auto-vectorization (SIMD)?
Jan 13 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim Grøstad 
wrote:
 Are there some performance benchmarks on modest hardware? (e.g. 
 a standard macbook, imac or mac mini) Benchmarks that compares 
 dcompute to CPU with auto-vectorization (SIMD)?
Part of the difficulty with that is that it is an apples to oranges comparison. Also I no longer have hardware that can run dcompute, as my old windows box (with intel x86 and OpenCL 2.1 with an nvidia GPU) died some time ago.

Unfortunately Macs and dcompute don't work very well: CUDA requires nvidia, and OpenCL needs the ability to run SPIR-V (the clCreateProgramWithIL call), which requires OpenCL 2.x, which Apple do not support. Hence supporting Metal was of some interest. You might in theory be able to use PoCL or intel based OpenCL runtimes, but I don't have an intel mac anymore and I haven't tried PoCL.
Jan 13 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson wrote:
 On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim 
 Grøstad wrote:
 Are there some performance benchmarks on modest hardware? 
 (e.g. a standard macbook, imac or mac mini) Benchmarks that 
 compares dcompute to CPU with auto-vectorization (SIMD)?
Part of the difficulty with that, is that it is an apples to oranges comparison. Also I no longer have hardware that can run dcompute, as my old windows box (with intel x86 and OpenCL 2.1 with an nvidia GPU) died some time ago. Unfortunately Macs and dcompute don't work very well. CUDA requires nvidia, and OpenCL needs the ability to run SPIR-V (clCreateProgramWithIL call) which requires OpenCL 2.x which Apple do not support. Hence why supporting Metal was of some interest. You might in theory be able to use PoCL or intel based OpenCL runtimes but I don't have an intel mac anymore and I haven't tried PoCL.
*nods* For a long time we could expect "home computers" to be Intel/AMD, but then the computing environment changed and maybe Apple tries to make its own platform stand out as faster than it is by forcing developers to special case their code for Metal rather than going through a generic API.

I guess FPGAs will be available in entry level machines at some point as well. So, I understand that it will be a challenge to get *dcompute* to a "ready for the public" stage when there is no multi-person team behind it.

But I am not so sure about the apples and oranges aspect of it. The presentation by Bryce was quite explicitly focusing on making GPU computation available at the same level as CPU computations (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner, and languages that plan for this might be perceived as being much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.

It was also interesting to me that ISO C23 will provide custom bit width integers and that this would make it easier to efficiently compile C-code to tighter FPGA logic. I remember that LLVM used to have that in their IR, but I think it was taken out and limited to more conventional bit sizes?

It just shows that being a system-level programming language requires a lot of adaptability over time and frameworks like *dcompute* cannot ever be considered truly finished.
Jan 14 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson 
 wrote:
 On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim 
 Grøstad wrote:
... The presentation by Bryce was quite explicitly focusing on making GPU computation available at the same level as CPU computations (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner and languages that plan for this might be perceived as being much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.
Yes. Homogeneous memory accelerators, as found today in game consoles and SoCs, open up some nice possibilities. Scheduling could still be problematic with a centralized resource (unlike per-core SIMD). Distinct instruction formats (GPU vs CPU) also present a challenge to achieving an it-just-works "sans function pointers" level of integration. Surmountable, but a little work to do there. I'm hopeful that SoCs, with their relatively friendlier accelerator configurations, will be the economic enabler for widespread uptake of dcompute. World beating perf/watt from very readable code deployable on billions of units? I'm up for that!
Jan 14 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:
 I'm hopeful that SoCs, with their relatively friendlier 
 accelerator configurations, will be the economic enabler for 
 widespread uptake of dcompute.
It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chrome-books for non-geek users, something like Steam Deck/Machine for gamers and some other SoC with builtin FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.
Jan 14 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 14 January 2022 at 17:38:36 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:
 I'm hopeful that SoCs, with their relatively friendlier 
 accelerator configurations, will be the economic enabler for 
 widespread uptake of dcompute.
It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chrome-books for non-geek users, something like Steam Deck/Machine for gamers and some other SoC with builtin FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.
Yes, I think the rollout of SoCs that you describe could very well occur. I hadn't even considered those! I was thinking of the accelerators in phone SoCs. Googling just now I saw an estimate of the number of "smart phones" world wide of over 6 billion. That seems a little high to me but the number of accelerator equipped phone SoCs is certainly in the billions with the number trending to saturation in line with the world's population. Anybody can hook into an accelerator library, and that will be fine for many apps, but with dcompute you'll have the ability to quickly go beyond the canned solutions when those are deficient. Lots of ways to win with dcompute.
Jan 14 2022
prev sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad 
wrote:
 **\*nods**\* For a long time we could expect "home computers" 
 to be Intel/AMD, but then the computing environment changed and 
 maybe Apple tries to make its own platform stand out as faster 
 than it is by forcing developers to special case their code for 
 Metal rather than going through a generic API.

 I guess FPGAs will be available in entry level machines at some 
 point as well. So, I understand that it will be a challenge to 
 get *dcompute* to a "ready for the public" stage when there is 
 no multi-person team behind it.
Maybe, though I suspect not for a while; then again, that could be wildly wrong. Anyway, I don't think they will be too difficult to support, provided the vendor in question provides an OpenCL implementation. The only thing to do is support `pipe`s.

As for manpower, the reason is that I don't have any particular personal need for dcompute these days. I am happy to do features for people that need something in particular (e.g. Vulkan compute shaders, textures), and PRs are welcome. Though if Bruce makes millions and gives me a job then that will obviously change ;)
 But I am not so sure about the apples and oranges aspect of it.
The apples to oranges comment was about doing benchmarks of CPU vs. GPU: there are so many factors that make performance comparisons (more) difficult. Is the GPU discrete? How important is latency vs. throughput? How "powerful" is the GPU compared to the CPU? How well suited to the task is the GPU? The list goes on. It's hard enough to do CPU benchmarks in an unbiased way.

If the intention is to say, "look at the speedup you can get for $TASK using $COMMON_HARDWARE", then yeah, that would be possible. It would certainly be possible to do a benchmark of, say, "ease of implementation with comparable performance" of dcompute vs CUDA, e.g. LoC, verbosity, brittleness etc., since the main advantage of D/dcompute (vs CUDA) is enumeration of kernel designs for performance. That would give a nice measurable goal to improve usability.
 The presentation by Bryce was quite explicitly focusing on 
 making GPU computation available at the same level as CPU 
 computations (sans function pointers). This should be possible 
 for homogeneous memory systems (GPU and CPU sharing the same 
 memory bus) in a rather transparent manner and languages that 
 plan for this might be perceived as being much more productive 
 and performant if/when this becomes reality. And C++23 isn't 
 far away, if they make the deadline.
Definitely. Homogenous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU without worrying about memory transfer across the PCI-e, something which CUDA can't take advantage of on account of nvidia GPUs being only discrete. I've no idea how caching works in a system like that though.
 It was also interesting to me that ISO C23 will provide custom 
 bit width integers and that this would make it easier to 
 efficiently compile C-code to tighter FPGA logic. I remember 
 that LLVM used to have that in their IR, but I think it was 
 taken out and limited to more conventional bit sizes?
Arbitrary precision integers are still a part of LLVM, and I presume LLVM IR. The problem with that is, like with address-spaced pointers, D has no way to declare such types. I seem to remember Luís Marqeus doing something crazy like that (maybe in a dconf presentation?), compiling D to Verilog.
 It just  shows that being a system-level programming language 
 requires a lot of adaptability over time and frameworks like 
 *dcompute* cannot ever be considered truly finished.
Of course.
Jan 14 2022
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability to 
 make GPUs do the things GPUs are good at and leave the rest to 
 the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
Jan 15 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
 On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
 wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the rest 
 to the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
There is still a PCI-e in between. Fundamentally the memory must exist in either the CPU's RAM or the GPU's (V)RAM; from what I understand, unified memory allows the GPU to access the host RAM with the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e.
Jan 15 2022
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
wrote:
 from what I understand unified memory allows the GPU to access 
 the host RAM with the same pointer. This reduces the total 
 memory consumed by the program, but to get to the GPU the data 
 must still cross the PCI-e.
Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.
Jan 15 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 10:35:29 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
 wrote:
 from what I understand unified memory allows the GPU to access 
 the host RAM with the same pointer. This reduces the total 
 memory consumed by the program, but to get to the GPU the data 
 must still cross the PCI-e.
Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.
Exactly++. Pinned buffers + async HW copies always won out for me. I imagine there could be scenarios where programmatic peeking/poking from either side wins but I've not seen them, probably because if your data flows are small enough for that to win you'd just fire up SIMD and call it a day.
Jan 15 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
wrote:
 On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
 On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
 wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete. I've no idea how 
 cacheing work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
there still a PCI-e in-between. Fundamentally the memory must exist in either the CPUs RAM or the GPUs (V)RAM, from what I understand unified memory allows the GPU to access the host RAM with the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e.
Yes. You also gain some simplification, from unified memory, if your data structures are pointer heavy. I've tried to gain advantage from GPU-side pulls across the bus in the past but could never win out over explicit async copying utilizing dedicated copy circuitry. Others, particularly those with high compute-to-load/store ratios, may have had better luck. For reference, I've only been able to get a little over 80% of the advertised PCI-e peak bandwidth out of the dedicated Nvidia copy HW.
Jan 15 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
wrote:
 As for manpower, the reason is I don't have any personal 
 particular need for dcompute these days. I am happy to do 
 features for people that need something in particular, e.g. 
 Vulkan compute shader, textures, and PR are welcome. Though if 
 Bruce makes millions and gives me a job then that will 
 obviously change ;)
He can put me on the application list as well… This sounds like lots of fun!!!
 important is latency vs. throughput? How "powerful" is the GPU 
 compared to the CPU?How well suited to the task is the GPU? The 
 list goes on. Its hard enough to do CPU benchmarks in an 
 unbiased way.
I don't think people would expect benchmarks to be unbiased. It could be 3-4 short benchmarks, some showcasing where it is beneficial, some showcasing where data dependencies (or other challenges) make it less suitable. E.g.:

1. compute autocorrelation over many different lags
2. multiply and take the square root of two long arrays
3. compute a simple IIR filter (I assume a recursive filter would be a worst case?)
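To make number 2 concrete, here is a minimal sketch (my own illustration, nothing dcompute-specific, names are just for show) of what the CPU baseline could look like in plain D with std.parallelism; the inner operation is simple enough for the auto-vectorizer, and the dcompute version would then be measured against this:

```d
import std.math : sqrt;
import std.parallelism : parallel;

// Benchmark 2 baseline: elementwise multiply + sqrt over two long arrays.
// Multi-core via std.parallelism; the per-element work is simple enough
// that the compiler can auto-vectorize the inner loop.
void mulSqrt(const(float)[] a, const(float)[] b, float[] res)
{
    assert(a.length == b.length && b.length == res.length);
    foreach (i, ref r; parallel(res))
        r = sqrt(a[i] * b[i]);
}
```

Whether the GPU version includes transfer time or not should be stated explicitly, since that alone can flip the result.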
 If the intention is to say, "look at the speedup you can for 
 for $TASK using $COMMON_HARDWARE" then yeah, that would be 
 possible. It would certainly be possible to do a benchmark of, 
 say, "ease of implementation with comparable performance" of 
 dcopmute vs CUDA, e.g. LoC, verbosity, brittleness etc., since 
 the main advantage of D/dcompute (vs CUDA) is enumeration of 
 kernel designs for performance. That would give a nice 
 measurable goal to improve usability.
Yes, but I think of it as an inspiration with a tutorial of how to get the benchmarks to run. For instance, like you, I have no need for this at the moment and my current computer isn't really a good showcase of GPU computation either, but I have one long term hobby project where I might use GPU-computations eventually.

I suspect many think of GPU computations as something requiring a significant amount of time to get into. Even though they may be interested, that threshold alone is enough to put it in the "interesting, but I'll look at it later" box. If you can tease people into playing with it for fun, then I think there is a larger chance of them using it at a later stage (or even thinking about the possibility of using it) when they see a need in some heavy computational problem they are working on. There is a lower threshold to get started with something new if you already have a tiny toy-project you can cut and paste from that you have written yourself.

Also, updated benchmarks could generate new interest on the announce forum thread. Lurking forum readers probably only read them on occasion, so you have to make several posts to make people aware of it.
 Definitely. Homogenous memory is interesting for the ability to 
 make GPUs do the things GPUs are good at and leave the rest to 
 the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
I don't know, but Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip, at least that is what I've read. Maybe there will be more technical info available on how that works at the hardware level later, or maybe it is already on AMD's website? If someone reading this thread has more info on this, it would be nice if they would share what they have found out! :-)
Jan 15 2022
next sibling parent Paulo Pinto <pjmlp progtools.org> writes:
On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad 
wrote:
 ...

 I don't know, but Steam Deck, which appears to come out next 
 month, seems to run under Linux and has an "AMD APU" with a 
 modern GPU and CPU integrated on the same chip, at least that 
 is what I've read. Maybe there will be more technical info 
 available on how that works at the hardware level later, or 
 maybe it is already on AMDs website?

 If someone reading this thread has more info on this, it would 
 be nice if they would share what they have found out! :-)
According to the public documentation you can expect it to be similar to AMD Ryzen 7 3750H, with Radeon RX Vega 10 Graphics (16 GB). https://partner.steamgames.com/doc/steamdeck/testing#5
Jan 15 2022
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad 
wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the rest 
 to the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
Jan 15 2022
next sibling parent max haughton <maxhaton gmail.com> writes:
On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim 
 Grøstad wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
Well, console memory systems are basically built around this idea. On the assumption that you mean a consumer chip with integrated graphics, any gain you see from sharing memory is going to be contrasted against the chip being intended for people who were going to actually use integrated graphics. For compute especially it seems like this is very dependent on what access patterns you actually want to use on the memory.

The new Apple chips have a unified memory architecture, and a really fast one too. I don't know what GPGPU is like on it but it's one of the reasons why it absolutely flies on normal code.
Jan 15 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim 
 Grøstad wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
The link below on the vkpolybench software includes graphs for integrated GPUs, among others, and shows significant (more than SIMD width) speedups wrt a single CPU core for many of the benchmarks, but also break-even or worse on a few. Reports on real world experiences with the integrated accelerators would be better.

https://github.com/ElsevierSoftwareX/SOFTX_2020_86

On paper, at least, it looks like SoC GPU performance will be severely impacted by the working set size, but who isn't? Currently it also looks like the dcompute/SoC-GPU version will beat out my SIMD variant, but it'll be at least a few months before I have hard data to share. Anyone out there have real world data now?
Jan 15 2022
prev sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 21:06:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec? I honestly assumed that it was unsupported and close to dead as I had not heard much about it for a long time.
I suppose that's my fault for not marketing more. The code generation is tested in LDC's CI pipelines so that is unlikely to break, and the library is built on slow moving APIs that are also unlikely to break. Just because it doesn't get a lot of commits doesn't mean it's going to stop working.

As for specification, I think that would be a wasted effort and too constraining. On the compiler side, it is mostly using the existing LDC infrastructure with (more than) a few hacks to get everything to stick together, and it is too heavily dependent on LDC and LLVM internals to be part of the D spec. On the runtime side of it, I fear specification would either be too constraining or end up out of sync with the implementation.
Jan 13 2022
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal wrote:
 Ethan might have a sufficiently compelling economic case for 
 promoting dcompute to his company in the relatively near 
 future. Nicholas recently addressed their need for access to 
 the texture hardware and fitting within their work flow, but 
 there may be other requirements...  An adoption by a world 
 class game studio would, of course, be very good news but I 
 think Ethan is slammed (perpetually, and in a mostly good way, 
 I think) so it might be a while.
As a former GPGPU guy: can you explain in what ways dcompute improves life over using CUDA and OpenCL through DerelictCL/DerelictCUDA (I used to maintain them and I think nobody ever used them)? Using the API directly seems to offer the most control to me, with no special compiler support needed.
Jan 13 2022
next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 As a former GPGPU guy: can you explain in what ways dcompute 
 improves life over using CUDA and OpenCL through 
 DerelictCL/DerelictCUDA (I used to maintain them and I think 
 nobody ever used them). Using the API directly seems to offer 
 the most control to me, and no special compiler support.
It is entirely possible to use dcompute as simply a wrapper over OpenCL/CUDA and benefit from the enhanced usability that it offers (e.g. querying OpenCL API objects for their properties is _faaaar_ simpler and less error prone with dcompute), because it exposes the underlying API objects 1:1, and you can always get the raw pointer and do things manually if you need to. Also dcompute uses DerelictCL/DerelictCUDA underneath anyway (thanks for them!).

If you're thinking of "special compiler support" as what CUDA does with its <<<>>>, then no: dcompute does all of that, but not with special help from the compiler, only with the meta programming and reflection available to any other D program. It's D all the way down to the API calls. Obviously there is special compiler support to turn D code into compute kernels.

The main benefit of dcompute is turning kernel launches into type safe one-liners, as opposed to brittle, type unsafe paragraphs of code.
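Roughly, and from memory (so treat the exact module paths and names here as approximate rather than gospel), a kernel and its launch look something like this:

```d
// Device side: an ordinary D module compiled for the GPU by LDC.
// (Names as I remember them from the dcompute examples; check the repo
// for the current spelling.)
@compute(CompileFor.deviceOnly) module kernels;

import ldc.dcompute;
import dcompute.std.index;

@kernel void saxpy(GlobalPointer!float res,
                   float alpha,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   size_t n)
{
    auto i = GlobalIndex.x;
    if (i >= n) return;
    res[i] = alpha * x[i] + y[i];
}

// Host side: the launch is a single type-checked line, so a changed kernel
// signature is a compile error at every call site instead of a runtime surprise:
//
//     q.enqueue!saxpy([n])(bufRes, alpha, bufX, bufY, n);
```

Contrast that with keeping a kernel string, clSetKernelArg calls and argument indices in sync by hand.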
Jan 13 2022
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Friday, 14 January 2022 at 00:56:32 UTC, Nicholas Wilson wrote:
 If you're thinking of "special compiler support" as what CUDA 
 does with its <<<>>>, then no, dcompute does all of that, but 
 not with special help from the compiler, only with what meta 
 programming and reflection is available to any other D program.
 It's D all the way down to the API calls. Obviously there is 
 special compiler support to turn D code into compute kernels.

 The main benefit of dcompute is turning kernel launches into 
 type safe one-liners, as opposed to brittle, type unsafe, 
 paragraphs of code.
Sounds indeed less brittle than a separate language. In my time with CUDA I never got to use <<<>>>. In OpenCL you'd have to templatize the string kernels quite quickly, and with CUDA you'd have to also make lots of entry points. Plus all the import problems, so I can see how it's better with LDC intrinsics.
Jan 14 2022
parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 09:39:58 UTC, Guillaume Piolat 
wrote:
 The main benefit of dcompute is turning kernel launches into 
 type safe one-liners, as opposed to brittle, type unsafe, 
 paragraphs of code.
Sound indeed less brittle than separate langage. In my time in CUDA I never got to use <<<>>>.
Pity, the <<<>>> is actually quite nice, and not at all brittle, but it is CUDA C/C++ (and maybe Fortran?) only, AMD's attempts at HIP notwithstanding. The main thing that makes it brittle is that if you change the signature of the kernel then you need to remember to change wherever it is invoked, and the compiler will not tell you that you forgot something.
 In OpenCL you'd have to templatize the string kernels quite 
 quickly, and with CUDA you'd have to also make lots of entry 
 points. Plus all the import problems, so I can see how it's 
 better with LDC intrinsics.
I'm not quite sure what you mean here.
Jan 14 2022
prev sibling next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 Ethan might have a sufficiently compelling economic case for 
 promoting dcompute to his company in the relatively near 
 future. Nicholas recently addressed their need for access to 
 the texture hardware and fitting within their work flow, but 
 there may be other requirements...  An adoption by a world 
 class game studio would, of course, be very good news but I 
 think Ethan is slammed (perpetually, and in a mostly good way, 
 I think) so it might be a while.
As a former GPGPU guy: can you explain in what ways dcompute improves life over using CUDA and OpenCL through DerelictCL/DerelictCUDA (I used to maintain them and I think nobody ever used them). Using the API directly seems to offer the most control to me, and no special compiler support.
For me there were several things, including:

1) the dcompute kernel invocation was simpler, made more sense, letting me easily create invocation abstractions to my liking (partially memoized futures in my case, but easy enough to do other stuff)

2) the kernel meta programming was much friendlier generally, of course.

3) the D nested function capability, in conjunction with better meta programming, enabled great decomposition, intra kernel. You could get the compiler to keep everything within the maximum-dispatch register limit (64) with ease, with readable code.

4) using the above I found it easy to reduce/minimize memory traffic, an important consideration in that much of my current work is memory bound. Trivial example: use static foreach to logically unroll a window neighborhood algorithm, eliminating both unnecessary loads and all extraneous reg-to-reg moves as you naturally mod around (a rough sketch of the idea follows below).

It's not that you can't do such things in CUDA/C++, eventually, sometimes, after quite a bit of discomfort, once you acquire your level-bazillion C++ meta programming merit badge, it's that it's all so much *easier* to do in dcompute. You get to save the heroics for something else. I'm sure that new idioms/benefits will emerge with additional use (this was my first dcompute project) but, as you will have noticed :-), I'm already hooked.

WRT OpenCL I don't have much to say. From what I gather people consider OpenCL to be even less hospitable than CUDA, preferring OpenCL mostly (only?) for its non-proprietary status. I'd be interested to hear from OpenCL gurus on this topic.

Finally, if any of the above doesn't make sense, or you'd like to discuss it further, I suggest we meet up at beerconf. I'd also love to talk about data parallel latency sensitive coding strategies, about how we should deal with HW capability variation, about how we can introduce data parallelism to many more in the dlang community, ...
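As a concrete, if toy, illustration of the static foreach point: ordinary host-side D, entirely my own sketch and not an actual dcompute kernel (device indexing and pointer types differ), that keeps a 3-wide window in locals, loads each input element exactly once, and rotates the window with a compile-time unrolled shift:

```d
// Sliding 3-tap box filter with an explicitly register-resident window.
// static foreach unrolls the rotate at compile time, so there are no
// extra loads and no leftover reg-to-reg loop for the optimizer to clean up.
void box3(const(float)[] src, float[] dst)
{
    assert(src.length >= 2 && dst.length == src.length);
    enum W = 3;
    float[W] win = [src[0], src[0], src[1]];     // clamped left edge
    foreach (i; 0 .. src.length)
    {
        dst[i] = (win[0] + win[1] + win[2]) / W;
        static foreach (k; 0 .. W - 1)           // unrolled window rotate
            win[k] = win[k + 1];
        win[W - 1] = (i + 2 < src.length) ? src[i + 2] : src[$ - 1];
    }
}
```

Inside a real kernel the same shape applies per work-item along its stride; the win is that the whole shuffle is visible to the compiler, so it has every chance of staying in registers.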
Jan 13 2022
parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 01:37:29 UTC, Bruce Carneal wrote:
 WRT OpenCL I don't have much to say.  From what I gather people 
 consider OpenCL to be even less hospitable than CUDA, 
 preferring OpenCL mostly (only?) for its non-proprietary 
 status.  I'd be interested to hear from OpenCL gurus on this 
 topic.
Not that I'm an OpenCL guru by any stretch of the imagination, but yes, OpenCL as a base API is much less nice than even the CUDA driver APIs. The foundation is solid, though, and you can abstract and prettify it with D to a level of usability that is at least on par with (and imo exceeds) CUDA's runtime API (the one with the <<<>>>'s) with D kernels. That is to say, the selling point for dcompute vs. OpenCL is that you get an API that is just as easy as CUDA's (w.r.t. type safety and tedium) and you get to write your kernels in D, whereas for dcompute vs. CUDA it is _just_ that you get to write your kernels in D (and the API is not any worse).
Jan 13 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 As a former GPGPU guy: can you explain in what ways dcompute 
 improves life over using CUDA and OpenCL through 
 DerelictCL/DerelictCUDA (I used to maintain them and I think 
 nobody ever used them). Using the API directly seems to offer 
 the most control to me, and no special compiler support.
Forgot to respond to this. This probably does not apply to *dcompute*, but Bryce pointed out in his presentation how you could step through your "GPU code" on the CPU using a regular debugger since the parallel code was regular C++. Not exactly sure how that works, but I would imagine that they provide functions that match the GPU? That sounds like a massive productivity advantage to me if you want to write complicated "shaders".
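Something in that spirit is easy to picture (purely my own sketch, not how nvc++ actually does it): keep the per-element body as an ordinary function, debug it from a plain CPU loop, and only then hand the same body to the parallel/GPU path:

```d
import std.math : sqrt;

// The per-element "kernel body" as an ordinary function: trivially
// steppable on the CPU with any debugger.
float kernelBody(float a, float b)
{
    return sqrt(a * b);
}

// Debug path: a plain loop, so breakpoints and watch windows work as usual.
void runOnCpu(const(float)[] a, const(float)[] b, float[] res)
{
    foreach (i; 0 .. res.length)
        res[i] = kernelBody(a[i], b[i]);
}

// The accelerated path would call the same body from a parallel loop or a
// GPU kernel wrapper; only the driver code changes, not the logic.
```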
Jan 15 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 07:46:32 UTC, Paulo Pinto wrote:
 I think the ship has already sailed, given the industry 
 standards of SYSCL and C++ for OpenCL, and their integration 
 into clang (check the CppCon talks on the same) and FPGA 
 generation.
The SYSCL/FPGA presentation was interesting, but he said it should be considered a research project at this point? I am a bit wary of all the solutions that are coming from Khronos. It is difficult to say what becomes prevalent across many platforms. Both Microsoft and Apple have undermined open standards such as OpenGL in their desire to lock developers into their own "monopolistic ecosystem" … So, a focused language solution of limited scope might actually be better for developers than (big) open standards.
 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Overthrowing C++ isn't possible, but D could focus more on desktop application development and provide a framework for it. Then you need to have a set of features/modules/libraries in place in a way that fits well together. GPU-compute would be one of those I think.
Jan 13 2022