
digitalmars.D - Scientific computing and parallel computing C++23/C++26

reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
I found the CppCon 2021 presentation
[C++ Standard 
Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
Bryce Adelstein Lelbach very interesting, unusually clear and 
filled with content. I like this man. No nonsense.

It provides a view into what is coming for relatively high level 
and hardware agnostic parallel programming in C++23 or C++26. 
Basically a portable "high level" high performance solution.

He also mentions the Nvidia C++ compiler *nvc++* which will make 
it possible to compile C++ to Nvidia GPUs in a somewhat 
transparent manner. (Maybe it already does, I have never tried to 
use it.)

My gut feeling is that it will be very difficult for other 
languages to stand up to C++, Python and Julia in parallel 
computing. I get a feeling that the distance will only increase 
as time goes on.

What do you think?
Jan 12 2022
next sibling parent reply forkit <forkit gmail.com> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 What do you think?
For the general programmer/developer, parallelism needs to be deeply integrated into the language and its std library, so that it can be 'inferred' (by the compiler/optimiser). Perhaps a language like D could adopt @parallelNO to instruct the compiler/optimiser to never infer parallelism in the code that follows. The O/S should also have a very important role in inferring parallelism.

parallelism has been promoted as the new thing..for a very..very...long time now.

I've had 8 cores available on my pc for well over 10 years now. I don't think anything running on my pc has the slightest clue that they even exist ;-) (except the o/s).

I expect 'explicitly' coding parallelism will continue to be relegated to a niche subset of programmers/developers, due to the very considerable knowledge/skillset needed to design/develop/test/debug parallel code.
Jan 12 2022
next sibling parent IGotD- <nise nise.com> writes:
On Thursday, 13 January 2022 at 00:41:25 UTC, forkit wrote:
 parallelism has been promoted as the new thing..for a 
 very..very...long time now.

 I've had 8 cores available on my pc for well over 10 years now. 
 I don't think anything running on my pc has the slighest clue 
 that they even exist ;-)  (except the o/s).

 I expect 'explicitly' coding parallelism will continue to be 
 relegated to a niche subset of programmers/developers, due to 
 the very considerable knowledge/skillset needed, to 
 design/develop/test/debug parallel code.
Yes, parallelism is a dead end for many applications, as you need a workload that can take advantage of it. Forcing parallel execution can often reduce performance instead. In order to exploit parallelism you need to understand your program and how it can take advantage of it. Languages that try to make things parallel under the hood, without the programmer's knowledge, have been a fantasy for decades and still are.

I'm not saying that the additions in C++ aren't useful; people will probably find good use for them. The presentation just reminds me how C++ gets uglier with every iteration, and I'm happy I jumped off that horror train.
Jan 12 2022
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 12:41:25AM +0000, forkit via Digitalmars-d wrote:
[...]
 I've had 8 cores available on my pc for well over 10 years now. I
 don't think anything running on my pc has the slighest clue that they
 even exist ;-)  (except the o/s).
Recently, I wanted to use POVRay to render frames for a short video clip. It was taking far too long because it was running on a single core at a time, so I wrote this:

	import std.parallelism, std.process;
	foreach (frame; frames.parallel) {
		execute([ "povray" ] ~ povrayOpts ~ [
			"+I", frame.infile,
			"+O", frame.outfile ]);
	}

Instant 8x render speedup. (Well, almost 8x... there's of course a little bit of overhead. But you get the point.)
 I expect 'explicitly' coding parallelism will continue to be relegated
 to a niche subset of programmers/developers, due to the very
 considerable knowledge/skillset needed, to design/develop/test/debug
 parallel code.
For simple cases, the above example serves as a counterexample. ;-) Of course, for more complex situations things may not be quite so simple. But still, it doesn't have to be as complex as languages like C++ make it seem. In the above example I literally just added ".parallel" to the code and it Just Worked(tm).

T

-- 
The best way to destroy a cause is to defend it poorly.
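A self-contained sketch of the same pattern, for anyone who wants to try it without POVRay installed; the frame list and the `echo` command are just placeholders standing in for the real frames/povrayOpts:

```d
import std.conv : to;
import std.parallelism : parallel;
import std.process : execute;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    // stand-in for the real frame list; each element becomes one external
    // process, and std.parallelism spreads the iterations over the cores
    auto frames = iota(0, 16);

    foreach (frame; frames.parallel)
    {
        // placeholder command (POSIX `echo`); the original post ran povray
        // here with per-frame input/output options
        auto r = execute(["echo", "frame", frame.to!string]);
        writefln("frame %s -> %s", frame, r.output);
    }
}
```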
Jan 12 2022
next sibling parent forkit <forkit gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 Recently, I wanted to use POVRay to render frames for a short 
 video clip. It was taking far too long because it was running 
 on a single core at a time, so I wrote this:

 	import std.parallelism, std.process;
 	foreach (frame; frames.parallel) {
 		execute([ "povray" ] ~ povrayOpts ~ [
 			"+I", frame.infile,
 			"+O", frame.outfile ]);
 	}

 Instant 8x render speedup. (Well, almost 8x... there's of 
 course a little bit of overhead. But you get the point.)
I'd like to see D simplify this even further:

	parallel foreach (frame; frames) { .. }

that's it. just annotate it. that's all I have to do. Let the language tools do the rest.
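For comparison, the closest working spelling today puts the "annotation" on the range rather than on the loop itself; the work-unit size below is just an illustrative tuning knob:

```d
import std.parallelism : parallel, taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    auto results = new double[](100);

    // the "annotation" lives on the range, not on the foreach
    foreach (i; iota(results.length).parallel)
    {
        results[i] = i * i; // stand-in for real per-frame work
    }

    // the same loop with an explicit work-unit size, via the task pool
    foreach (i; taskPool.parallel(iota(results.length), 8))
    {
        results[i] += 1;
    }

    writeln(results[0 .. 5]);
}
```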
Jan 12 2022
prev sibling next sibling parent reply forkit <forkit gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 ..... But still, it doesn't have to be as complex as languages 
 like C++ make it seem.  In the above example I literally just 
 added ".parallel" to the code and it Just Worked(tm).


 T
I wish below would "just work"

// ----
module test;

import std;

@safe
void main()
{
    //int[5] arr = [1, 2, 3, 4, 5]; // nope. won't work with .parallel
    int[] arr = [1, 2, 3, 4, 5]; // has to be dynamic to work with .parallel ??

    int x = 0;
    foreach(n; arr.parallel) // Nope - .parallel is a @system function and cannot be called in @safe
    {
        x += n;
    }
    writeln(x);
}
// -----
Jan 13 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 08:07:51PM +0000, forkit via Digitalmars-d wrote:
[...]
 // ----
 module test;
 
 import std;
 
  @safe
 void main()
 {
     //int[5] arr = [1, 2, 3, 4, 5]; // nope. won't work with .parallel
     int[] arr = [1, 2, 3, 4, 5]; // has to be dynamic to work with .parallel ??
Just write instead:

	int[5] arr = [1, 2, 3, 4, 5];
	foreach (n; arr[].parallel) ...

In general, whenever something rejects static arrays, inserting `[]` usually fixes it. :-D

I'm not 100% sure why .parallel is @system, but I suspect it's because of potential issues with race conditions, since it does not prevent you from writing to the same local variable from multiple threads. If pointers are updated this way, it could lead to memory corruption problems.

T

-- 
Long, long ago, the ancient Chinese invented a device that lets them see through walls. It was called the "window".
Jan 13 2022
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
 [snip]

 I'm not 100% sure why .parallel is  system, but I suspect it's 
 because of potential issues with race conditions, since it does 
 not prevent you from writing to the same local variable from 
 multiple threads. If pointers are updated this way, it could 
 lead to memory corruption problems.


 T
Could it be made @safe when used with const/immutable variables?
Jan 13 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 13, 2022 at 09:13:11PM +0000, jmh530 via Digitalmars-d wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
[...]
 I'm not 100% sure why .parallel is  system, but I suspect it's
 because of potential issues with race conditions, since it does not
 prevent you from writing to the same local variable from multiple
 threads. If pointers are updated this way, it could lead to memory
 corruption problems.
[...]
 Could it be made  safe when used with const/immutable variables?
Apparently not, as Petar already pointed out. But even besides access to non-shared local variables, there's also the long-standing issue that a function that receives a delegate cannot have stricter attributes than the delegate itself, i.e.:

	// NG: @safe function fun cannot call @system delegate dg.
	void fun(scope void delegate() @system dg) @safe { dg(); }

	// You have to do this instead (i.e., delegate must be
	// restricted to be @safe):
	void fun(scope void delegate() @safe dg) @safe { dg(); }

There's currently no way to express that the safety of fun depends solely on the safety of dg, such that if you pass in a @safe delegate, then fun should be regarded as @safe and allowed to be called from @safe code.

This is a problem because .parallel is implemented using .opApply, which takes a delegate argument. It accepts an unqualified delegate in order to be usable with both @system and @safe delegates. But this unfortunately means it must be @system, and therefore uncallable from @safe code.

Various proposals to fix this have been brought up before, but Walter either doesn't fully understand the issue, or else has some reasons he's not happy with the proposed solutions. In fact he has proposed something that goes the *opposite* way to what should be done in order to address this problem. Since both were shot down in the forum discussions, we're stuck at the current stalemate. :-(

T

-- 
Once bitten, twice cry...
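For completeness, the usual workaround for ordinary higher-order functions is to make them templates so that attribute inference does the work; a minimal sketch with a hypothetical `each` (this doesn't help `.parallel`, because foreach needs `opApply` to have a fixed, non-templated signature in order to infer the loop variable types):

```d
import std.stdio : writeln;

// Because `each` is a template, the compiler infers its attributes per
// instantiation: a @safe delegate yields a @safe instance, a @system
// delegate a @system one.
void each(DG)(int[] xs, scope DG dg)
    if (is(DG : void delegate(int)))
{
    foreach (x; xs)
        dg(x);
}

void main() @safe
{
    int sum;
    // the delegate literal is inferred @safe, so this instantiation of
    // `each` is callable from @safe code
    each([1, 2, 3], (int x) { sum += x; });
    writeln(sum); // 6
}
```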
Jan 13 2022
next sibling parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 21:44:13 UTC, H. S. Teoh wrote:
 On Thu, Jan 13, 2022 at 09:13:11PM +0000, jmh530 via 
 Digitalmars-d wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
[...]
 [...]
[...]
 Could it be made  safe when used with const/immutable 
 variables?
[...]
There are two DIPs that aim to address the attribute propagation problem:

* [Argument dependent attributes (ADAs)][1] by Geod24: https://github.com/Geod24/DIPs/blob/adas/DIPs/DIP4242.md
* [Attributes for Higher-Order Functions][2] by Bolpat: https://github.com/Bolpat/DIPs/blob/AttributesHOF/DIPs/DIP-1NN4-QFS.md

[1]: https://github.com/dlang/DIPs/pull/198
[2]: https://github.com/dlang/DIPs/pull/199
Jan 13 2022
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 21:44:13 UTC, H. S. Teoh wrote:
 [snip]
 Various proposals to fix this has been brought up before, but 
 Walter either doesn't fully understand the issue, or else has 
 some reasons he's not happy with the proposed solutions.  In 
 fact he has proposed something that goes the *opposite* way to 
 what should be done in order to address this problem.  Since 
 both were shot down in the forum discussions, we're stuck at 
 the current stalemate. :-(


 T
Thanks for the detailed explanation. Maybe the new DIPs can make a better effort at the beginning to communicate the issue (such as this example).
Jan 13 2022
prev sibling parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 21:13:11 UTC, jmh530 wrote:
 On Thursday, 13 January 2022 at 20:58:25 UTC, H. S. Teoh wrote:
 [snip]

 I'm not 100% sure why .parallel is  system, but I suspect it's 
 because of potential issues with race conditions, since it 
 does not prevent you from writing to the same local variable 
 from multiple threads. If pointers are updated this way, it 
 could lead to memory corruption problems.


 T
Could it be made safe when used with const/immutable variables?
For some data to be @safe-ly accessible across threads it must have no "unshared aliasing", meaning that `shared(const(T))` and `immutable(T)` are ok, but simply `T` and `const(T)` are not.

The reason why the `.parallel` example above was not safe is that the body of the foreach was passed as a delegate to `ParallelForeach.opApply`, and the problem is that delegates can access unshared mutable data through their closure. If the @safe-ty holes regarding delegates are closed, presumably we could add a `ParallelForeach.opApply` overload that took a `@safe` delegate and then the whole `main` function could be marked as `@safe`.

I think back when the module was under active development, the authors did carefully consider the @safe-ty aspects, as they have written code that conditionally enables some function overloads to be `@trusted`, depending on the parameters they receive. But in the end it was the best they could do given the state of the language at the time. Most likely the situation has improved sufficiently that more of the API could be made (at least conditionally) safe.

You can check the various comments explaining the situation:

* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L32-L34
* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L3382-L3395
* https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L254-L261
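To make the "no unshared aliasing" point concrete, here is a small sketch (not from std.parallelism's docs) where the parallel loop only reads immutable data and each iteration writes to its own slot, so no synchronization is needed:

```d
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // immutable data has no unshared *mutable* aliasing, so concurrent
    // reads from the loop body are race-free
    immutable double[] table = [1.0, 2.0, 3.0, 4.0, 5.0];

    auto results = new double[](table.length);

    foreach (i; iota(table.length).parallel)
    {
        // each iteration writes only to its own slot, so `results`
        // needs no synchronization either
        results[i] = table[i] * table[i];
    }

    writeln(results); // [1, 4, 9, 16, 25]
}
```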
Jan 13 2022
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 13 January 2022 at 21:51:10 UTC, Petar Kirov 
[ZombineDev] wrote:
 [snip]
Thanks for the detailed explanation.
Jan 13 2022
prev sibling parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Thursday, 13 January 2022 at 20:07:51 UTC, forkit wrote:
 On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 ..... But still, it doesn't have to be as complex as languages 
 like C++ make it seem.  In the above example I literally just 
 added ".parallel" to the code and it Just Worked(tm).


 T
I wish below would "just work" [...]
```d
import core.atomic : atomicOp;
import std.parallelism : parallel;
import std.stdio : writeln;

// Not @safe, since `parallel` still allows access to non-shared-qualified
// data. See:
// https://github.com/dlang/phobos/blob/v2.098.1/std/parallelism.d#L32-L34
void main()
{
    int[5] arr = [1, 2, 3, 4, 5]; // Yes, static arrays work just fine.

    // `shared` is necessary to safely access data from multiple threads
    shared int x = 0;

    // Most functions in Phobos work with ranges, not containers (by design).
    // To get a range from a static array, simply slice it:
    foreach(n; arr[].parallel)
    {
        // Use atomic ops (or higher-level synchronization primitives) to work
        // with shared data, without data-races:
        x.atomicOp!`+=`(n);
    }

    writeln(x);
}
```
Jan 13 2022
prev sibling parent Leoarndo Palozzi <lpalozzi gmail.com> writes:
On Thursday, 13 January 2022 at 01:19:07 UTC, H. S. Teoh wrote:
 In the above example I literally just added ".parallel" to the 
 code and it Just Worked(tm).
So did I for my weekend raytracer (I am new to D and was pleasantly surprised how easy it was).

	foreach (i, ref pixel; parallel(image.pixels)) {...}
Jan 16 2022
prev sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Thursday, 13 January 2022 at 00:41:25 UTC, forkit wrote:
 For the general programmers/developer, parallelism needs to be 
 deeply integrated into the language and it's std library, so 
 that it can be 'inferred' (by the compiler/optimizer).

 Perhaps a language like D, could adopt  parallelNO to instruct 
 the compiler/optimizer to never infer parallelism in the code 
 that follows.

 The O/S should also has a very important role in inferring 
 parallelism.

 I've had 8 cores available on my pc for well over 10 years now. 
 I don't think anything running on my pc has the slightest clue 
 that they even exist ;-)  (except the o/s).
Number of cores is fine, but if you could take advantage of the GPU/CUDA cores on, say, a graphics card as well; **THAT** would be really cool. Imagine the huge speedup of say 7zip or other tools where simple processes, pattern matching or encoding/processing could speed up if you could make use of those **AS WELL AS** the number of cores you have.

For a while I've been making scripts where I *find* files and split the work via xargs; this converts any single-threaded program to run on lots of cores/processes (*by running lots of copies with different input files*), though on Windows it may result in 5 processes for every 1 you want to run.

**Example:**

    find -iname "*.jpg" -print0 | xargs -0 -P $NUMBER_OF_PROCESSORS -n 1 jpegoptim --all-progressive
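Roughly the same fan-out can be written directly in D with std.parallelism plus std.process; a sketch, assuming `jpegoptim` is on the PATH:

```d
import std.algorithm : filter;
import std.array : array;
import std.file : dirEntries, SpanMode;
import std.parallelism : parallel;
import std.path : extension;
import std.process : execute;
import std.uni : toLower;

void main()
{
    // collect the .jpg files first, then let std.parallelism spread the
    // external processes over the available cores (work-unit size 1,
    // since each file is a substantial chunk of work)
    auto jpegs = dirEntries(".", SpanMode.depth)
        .filter!(e => e.isFile && e.name.extension.toLower == ".jpg")
        .array;

    foreach (entry; jpegs.parallel(1))
    {
        execute(["jpegoptim", "--all-progressive", entry.name]);
    }
}
```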
Jan 16 2022
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Monday, 17 January 2022 at 06:13:03 UTC, Era Scarecrow wrote:
 would be really cool. Imagine the huge speedup of say 7zip or 
 other where simple processes, pattern matching or 
 encoding/processing could speed up if you could make use of 
 those **AS WELL AS** the number of cores you have.
Yes, but compression/decompression is too complex. You need to be careful with data dependencies so that the computations can be run in parallel on a massive scale. I don't know what the future holds, but today you also need a core to feed the GPU. Maybe in the future the GPU will be able to "feed itself" like an independent actor? Hard to tell. For FPGAs that ought to be a possibility, but they are only available in specialty setups. Maybe we need an open source computer platform with more interesting hardware (using commodity chips)?
Jan 17 2022
parent reply bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Monday, 17 January 2022 at 18:17:10 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 17 January 2022 at 06:13:03 UTC, Era Scarecrow wrote:
 would be really cool. Imagine the huge speedup of say 7zip or 
 other where simple processes, pattern matching or 
 encoding/processing could speed up if you could make use of 
 those **AS WELL AS** the number of cores you have.
Yes, but compression/decompression is too complex. You need to be careful with data dependencies so that the computations can be run in parallel on a massive scale. I don't know what the future holds, but today you also need a core to feed the GPU. Maybe in the future the GPU will be able to "feed itself" like an independent actor? Hard to tell. For FPGAs that ought to be a possibility, but they are only available in specialty setups. Maybe we need an open source computer platform with more interesting hardware (using commodity chips)?
Some years ago we got a chance to provide an efficient way to perform such computations with libraries such as https://wiki.dlang.org/LDC_CUDA_and_SPIRV

And recently I posted my feedback on where D could provide some killer features in this area: https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org

Unfortunately, this will not be possible in the near future, so other languages will keep the market
Jan 17 2022
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via Digitalmars-d
wrote:
[...]
 And recently I put my feedback where D could provide some killer
 feature in this area:
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, so others
 language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T -- It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would be still be stuck with wooden horse-cart wheels.
Jan 17 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Monday, 17 January 2022 at 21:31:25 UTC, H. S. Teoh wrote:
 On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via 
 Digitalmars-d wrote: [...]
 And recently I put my feedback where D could provide some 
 killer feature in this area: 
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, 
 so others language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T
Take one item on the list: developing an equivalent of numpy and scipy. What do you take to be the "near future"? One year away? Two years? There is no language feature holding these items back.

Addressing ecosystem issues is a massive undertaking. In order for a D clone of even just numpy to be successful, it needs to have a significant user base feeding input back into the development cycle so that it can go beyond simply being churned out by a few overeager developers and actually stabilized so that it becomes useful and robust.

You must also consider that the items that bioinfornatics listed are all somewhat contingent on each other. In isolation they aren't nearly as useful. You might have a numpy/scipy clone, but if you don't also have a matplotlib clone (or some other means of doing data visualization from D) their utility is a bit limited.

His wishlist is a tall order.
Jan 18 2022
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 [snip]

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.

 His wishlist is a tall order.
Have you tried ggplotd [1]? [1] https://code.dlang.org/packages/ggplotd
Jan 18 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 17:24:03 UTC, jmh530 wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 [snip]

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.

 His wishlist is a tall order.
Have you tried ggplotd [1]? [1] https://code.dlang.org/packages/ggplotd
I haven't tried it. I also hadn't heard of it before. Judging from the small number of GitHub issues it appears that it is used by basically no one and that it's missing crucial standard features supported by all mature plotting libraries. In one of the issues the main developer indicated he is no longer adding new features. This library seems to be: 1) quite young, 2) not actively developed, 3) not actively maintained.
Jan 18 2022
parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 18:32:41 UTC, sfp wrote:
 [snip]

 I haven't tried it. I also hadn't heard of it before.

 Judging from the small number of GitHub issues it appears that 
 it is used by basically no one and that it's missing crucial 
 standard features supported by all mature plotting libraries. 
 In one of the issues the main developer indicated he is no 
 longer adding new features. This library seems to be: 1) quite 
 young, 2) not actively developed, 3) not actively maintained.
Hadn't realized that. Shame.
Jan 18 2022
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 On Monday, 17 January 2022 at 21:31:25 UTC, H. S. Teoh wrote:
 On Mon, Jan 17, 2022 at 09:12:52PM +0000, bioinfornatics via 
 Digitalmars-d wrote: [...]
 And recently I put my feedback where D could provide some 
 killer feature in this area: 
 https://forum.dlang.org/thread/fuzvsdlqtklhmxsnzgye forum.dlang.org
 
 Unfortunately, this will not be possible in the near future, 
 so others language will keep the market
Why would it not be possible in the near future? None of the items you listed seem to be specific to the language itself, it seems to be more of an ecosystem issue. T
[...]
Yes. Better to concentrate on things D *can* enable, like a great performance programming experience.

D appeals to me primarily because it lets me write simpler performant code. It regularly opens the door to better perf/complexity ratios than C++ for example. This is particularly important in markets where even small performance gains bring large economic benefits, where novel code is indicated.

OTOH, if your value add is more about quickly assembling/rearranging existing components that are sufficiently performant in themselves and in combination, well, by all means, carry on!
Jan 18 2022
prev sibling parent reply bachmeier <no spam.net> writes:
On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:

 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In isolation 
 they aren't nearly as useful. You might have a numpy/scipy 
 clone, but if you don't also have a matplotlib clone (or some 
 other means of doing data visualization from D) their utility 
 is a bit limited.
To my knowledge pyd still works. There's not much to be gained from rewriting a plotting library from scratch. It's not common that you're plotting 100 million times for each run of your program.

I see too much NIH syndrome here. If you can call another language, all you need to do is write convenience wrappers on top of the many thousands of hours of work done in that language. You can replace the pieces where it makes sense to do so. The goal of the D program is whatever analysis you're doing on top of those libraries, not the libraries themselves.

We call C libraries all the time. Nobody thinks that's a problem. A bunch of effort has gone into calling C++ libraries and there's tons of support for that effort. When it comes to calling any other language, even for things that don't require performance, there's no interest. The ability to interoperate with other languages is the number one reason I started using D and the main reason I still use it.
Jan 18 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 18, 2022 at 08:28:52PM +0000, bachmeier via Digitalmars-d wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 
 You must also consider that the items that bioinfornatics listed are
 all somewhat contingent on each other. In isolation they aren't
 nearly as useful. You might have a numpy/scipy clone, but if you
 don't also have a matplotlib clone (or some other means of doing
 data visualization from D) their utility is a bit limited.
[...]

+1. Why do we need to reinvent numpy/scipy? One of the advantages conferred by D's metaprogramming capabilities is easier integration with other languages. Adam Ruppe's jni.d is one prime example of how metaprogramming can abstract away the nasty amounts of boilerplate you're otherwise forced to write when interfacing with Java via JNI.

D's C ABI compatibility also means you can leverage the tons of C libraries out there right now, instead of waiting for somebody to reinvent the same libraries in D years down the road. D's capabilities make it very amenable to being a "glue" language for interfacing with other languages.

T

-- 
Caffeine underflow. Brain dumped.
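As a concrete illustration of the C-ABI point, this is all it takes to call a C library from D; it assumes a CBLAS implementation (e.g. OpenBLAS) is installed and linked (e.g. `-L-lopenblas`):

```d
// Declaration of one C function from CBLAS; linking against an installed
// implementation is assumed, no binding generator or wrapper needed.
extern(C) double cblas_ddot(int n, const(double)* x, int incx,
                            const(double)* y, int incy) nothrow @nogc;

void main()
{
    import std.stdio : writeln;

    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];

    // call straight into the C library; .ptr/.length map directly onto
    // the pointer-plus-count convention that C APIs use
    writeln(cblas_ddot(cast(int) a.length, a.ptr, 1, b.ptr, 1)); // 32
}
```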
Jan 18 2022
next sibling parent reply bachmeier <no spam.net> writes:
On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 On Tue, Jan 18, 2022 at 08:28:52PM +0000, bachmeier via 
 Digitalmars-d wrote:
 On Tuesday, 18 January 2022 at 17:03:33 UTC, sfp wrote:
 
 You must also consider that the items that bioinfornatics 
 listed are all somewhat contingent on each other. In 
 isolation they aren't nearly as useful. You might have a 
 numpy/scipy clone, but if you don't also have a matplotlib 
 clone (or some other means of doing data visualization from 
 D) their utility is a bit limited.
[...]
The next release of my embedr library (which I've been able to do now that my work life is finally returning to normal) will make it trivial to call D functions from R. What I mean by that is that you write a file of D functions and by the magic of metaprogramming, you don't need to write any boilerplate at all. Example:

```
import mir.random;
import mir.random.variable;

RVector rngexample(int n) {
    auto gen = Random(unpredictableSeed);
    auto rv = uniformVar(-10, 10); // [-10, 10]
    auto result = RVector(n);
    foreach(ii; 0..n) {
        result[ii] = rv(gen);
    }
    return result;
}
mixin(createRFunction!rngexample);
```

The only way you can do better is if someone else writes the program for you. But then it doesn't make much difference which language is used.
Jan 18 2022
parent sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 22:00:42 UTC, bachmeier wrote:
 On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 [...]
[...]
This is all news to me. It's a shame these libraries and their capabilities aren't advertised more prominently.

How hard would it be to automatically wrap a D library and expose it to Python, MATLAB, and Julia simultaneously? Say the library even has a simple C-style API, or a very simple single-inheritance OO hierarchy with no templates.
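One low-tech sketch of that (not pyd/embedr): export a plain C ABI from D and build a shared library, which Python (ctypes/cffi), Julia (ccall) and MATLAB (loadlibrary) can all consume. The function name and build flags below are illustrative:

```d
// mysum.d -- a hypothetical exported function; compile into a shared
// library with something like `ldc2 -shared -of=libmysum.so mysum.d`
// (flags are illustrative). The function avoids the GC, so no D runtime
// initialization is needed on the caller's side.
export extern(C) double mysum(const(double)* p, size_t n) nothrow @nogc
{
    double s = 0;
    foreach (i; 0 .. n)
        s += p[i];
    return s;
}
```

From Python this would be loaded with ctypes.CDLL (after setting argtypes/restype); Julia's ccall and MATLAB's loadlibrary speak the same C ABI. The hard part is anything beyond a C-style API, which is where generators like pyd or embedr come in.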
Jan 18 2022
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 18 January 2022 at 21:22:11 UTC, H. S. Teoh wrote:
 [snip]

 +1.  Why do we need to reinvent numpy/scipy? One of the 
 advantages conferred by D's metaprogramming capabilities is 
 easier integration with other languages.  Adam Ruppe's jni.d is 
 one prime example of how metaprogramming can abstract away the 
 nasty amounts of boilerplate you're otherwise forced to write 
 when interfacing with Java via JNI. D's C ABI compatibility 
 also means you can leverage the tons of C libraries out there 
 right now, instead of waiting for somebody to reinvent the same 
 libraries in D years down the road.  D's capabilities makes it 
 very amenable to being a "glue" language for interfacing with 
 other languages.


 T
I'm all for leveraging C libraries in D, but if you have code that needs to be performant then you may run into limitations with python. If you're building one chart with Matplotlib, then it's probably fine. If you have some D code that takes longer to run (e.g. a simulation that deals with a lot of data and many threads), then you might be a little more careful about what python code to incorporate and how. I don't know the technical details needed to get the best performance in that situation (are there benchmarks?), but I saw some work done on using the python buffer protocol when calling D functions from python.

In addition, the python code might itself be calling the same C libraries that D can (e.g. LAPACK) (though potentially with different defaults, trading off performance vs. accuracy, resulting in python being faster in some cases than D). In that case, python is also a glue language. Taking the same approach in D can simplify your code base a little bit and you don't need to worry about any additional overhead or limitations from the GIL that might get introduced. Again, not something you need to worry about when performance is not a big issue.
Jan 18 2022
prev sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 18 January 2022 at 20:28:52 UTC, bachmeier wrote:
 from rewriting a plotting library from scratch. It's not common 
 that you're plotting 100 million times for each run of your 
 program.
It is not uncommon to interact with plots that are too big for matplotlib to handle well. The python visualization solutions are very primitive. Having something better than numpy+matplotlib is obviously an advantage, a selling point for other offerings. Having the exact same thing? Not so much.
 You can replace the pieces where it makes sense to do so. The 
 goal of the D program is whatever analysis you're doing on top 
 of those libraries, not the libraries themselves.
You don't get a unified API with good usability by collecting a hodge podge of libraries. You also don't get any performance or quality advantage over other solutions. Borrowing is ok, replicating APIs? Probably not. What is then the argument for not using the original language directly?

The reason for moving to a new language (like Julia or Python) is that you get something that better fits what you want to do and that transitioning provides a smoother work flow in the end. If everything you achieve by switching is replacing one set of trade offs with another set of trade offs, then you are generally better off using the more mainstream, supported and well documented alternative.

So where do you start? With a niche, e.g. signal processing or some other "mainstream" niche.
 We call C libraries all the time. Nobody thinks that's a 
 problem. A bunch of effort has gone into calling C++ libraries 
 and there's tons of support for that effort.
So, libraries are often written in C in order to support other languages, and they are structured in a very basic way as far as C code goes. C-only libraries are sometimes not as easy to interface with as they rely heavily on macros, dedicated runtimes or specifics of the underlying platform.

I also think the C++ interop D offers is a bit clunky. It is more suitable for people who write C-like C++ than for people who try to write idiomatic C++. D has to align itself more with C++ semantics for this to be a good selling point. I am somewhat impressed that Python has many solutions for binding to C++ though, even when Python is semantically a very poor fit for C++… (e.g. [Binder](https://github.com/RosettaCommons/binder)).

D's potential strength here is not so much in being able to bind to C++ in a limited fashion (like Python), but being able to port C++ to D and improve on it. To get there you need feature parity, which is what this thread is about.

We now know that C++ will eventually get more powerful parallel computing abilities built into the language, supported by the hardware manufacturer Nvidia for their hardware (nvc++). That said, Apple has shown little interest in making their version of C++ work well with parallel computing, and the C++ standard lib is not very good for numeric operations. Like, the simd code I wrote for inner product (using generic llvm SIMD) turned out to be 3 times faster than the generic C++ standard library solution. Yet, we see *"change is coming"* written on the horizon, I think.

So either D has to move in a different direction than competing head-to-head with C++, or one has to be more strategic in how the development process is structured. Or well, just more strategic in general.
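For the curious, here is a rough sketch of the kind of hand-rolled SIMD inner product being alluded to (this is not the poster's code); it assumes a target where core.simd provides double2, 16-byte-aligned slices and an even length; a real version needs unaligned loads and a scalar tail loop:

```d
import core.simd;
import std.stdio : writeln;

// Two lanes at a time with core.simd vectors; aligned loads and an even
// length are assumed for brevity.
double dot(const(double)[] a, const(double)[] b)
{
    assert(a.length == b.length && a.length % 2 == 0);
    double2 acc = 0;
    foreach (i; 0 .. a.length / 2)
    {
        // multiply element-wise and accumulate, two doubles per step
        double2 va = *cast(const(double2)*) (a.ptr + 2 * i);
        double2 vb = *cast(const(double2)*) (b.ptr + 2 * i);
        acc += va * vb;
    }
    // horizontal sum of the two lanes
    return acc.array[0] + acc.array[1];
}

void main()
{
    double[] a = [1, 2, 3, 4];
    double[] b = [5, 6, 7, 8];
    writeln(dot(a, b)); // 70
}
```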
Jan 18 2022
next sibling parent sfp <sfp cims.nyu.edu> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 18 January 2022 at 20:28:52 UTC, bachmeier wrote:
 from rewriting a plotting library from scratch. It's not 
 common that you're plotting 100 million times for each run of 
 your program.
It is not uncommon to interact with plots that are too big for matplotlib to handle well. The python visualization solutions are very primitive. Having something better than numpy+matplotlib is obviously an advantage, a selling point for other offerings.
To add to this: matplotlib has *many* pain points. It has an inconsistent API, it is very slow, its 3D plotting is hacked together (and very slow). Making animations isn't straightforward (and very slow). Making just several hundred plots typically takes several minutes (at least). It should take <1s. That said, matplotlib is very powerful and handles essentially all important use cases. There is definitely room for improvement. If someone with NIH syndrome came along and wrote a plotting library which actually improves on matplotlib significantly, it would be to D's benefit, especially since it would be trivial to consume from other languages which would be interested in using it.
Jan 18 2022
prev sibling next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:

 It is not uncommon to interact with plots that are too big for 
 matplotlib to handle well. The python visualization solutions 
 are very primitive. Having something better than 
 numpy+matplotlib is obviously an advantage, a selling point for 
 other offerings.
Wow, this is the first time I've read that matplotlib is inadequate. Can you please give an example of a visualisation library(any language) which you consider good?
Jan 18 2022
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 03:21:38 UTC, Tejas wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:

 It is not uncommon to interact with plots that are too big for 
 matplotlib to handle well. The python visualization solutions 
 are very primitive. Having something better than 
 numpy+matplotlib is obviously an advantage, a selling point 
 for other offerings.
Wow, this is the first time I've read that matplotlib is inadequate. Can you please give an example of a visualisation library(any language) which you consider good?
There are commercial products for visualizing large datasets. I don't use them as I either create my own or use a sound editor. But yes, matplotlib feels more like a homegrown solution than a solid product. It also has layout issues with labeling. You can make it work, but it is clunky.
Jan 18 2022
prev sibling parent reply forkit <forkit gmail.com> writes:
On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim Grøstad 
wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
Jan 18 2022
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
Jan 18 2022
next sibling parent reply M.M. <matus email.cz> writes:
On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but being 
 able to port C++ to D and improve on it. To get there you 
 need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article is telling us: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.)

But yes, Julia's focus on a small niche, and its popularity in that niche, makes it attractive for contributors.
Jan 18 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 07:24:09 UTC, M.M. wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but being 
 able to port C++ to D and improve on it. To get there you 
 need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article tells: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.) But yes, Julia's focus on small niche, and its popularity in that niche makes it attractive for contributors.
You might call it self-written PR articles, or educate yourself about who is using it.

https://juliacomputing.com/case-studies

versus

https://dlang.org/orgs-using-d.html

Also I did mention C++, which you glossed over in your eagerness to devalue Julia's market domain versus D among HPC communities.

As someone that spent two years at ATLAS TDAQ HLT, I know which languages those folks would be adopting, but hey, it is a piece of self-written PR.
Jan 18 2022
parent M.M. <matus email.cz> writes:
On Wednesday, 19 January 2022 at 07:29:23 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 07:24:09 UTC, M.M. wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able 
 to bind to C++ in a limited fashion (like Python), but 
 being able to port C++ to D and improve on it. To get there 
 you need feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too: "Broad adoption of high-level languages by the scientific community is unlikely without compiler optimizations to mitigate the performance penalties these languages abstractions impose." - https://www.cs.rice.edu/~vs3/PDF/Joyner-MainThesis.pdf
That paper is from 2008, meanwhile in 2021, https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club// This is what D has to compete against, not only C++ with the existing SYSCL/CUDA tooling and their ongoing integration into ISO C++.
I am not sure what the article tells: that Julia is now popular and people use it? Or that D (and other languages) need to compete against self-written PR articles? (Many system-programming languages can achieve the same performance as what the article describes, when several research institutes combine forces on just that.) But yes, Julia's focus on small niche, and its popularity in that niche makes it attractive for contributors.
You might call it self-written PR articles, or educate yourself who is using it. https://juliacomputing.com/case-studies versus https://dlang.org/orgs-using-d.html Also I did mention C++, which you glossed over on your eagerness to devalue Julia's market domain versus D among HPC communities. As someone that spent two years at ATLAS TDAQ HLT, I know which languages those folks would be adopting, but hey it is a piece of self-written PR.
I am sorry that you took my post as an attack:

- the article itself is written by Julia people (the bottom of the article says "Source: Julia Computing"). Using this fact to tell me to "educate myself on a non-relevant topic, i.e., on who uses Julia" seems quite irrelevant to my note on who wrote the text. (Being sarcastic now: I am sure that whatever education I will do from now on till the end of my life will not change who wrote the article.)

- I also acknowledged that Julia is popular in scientific computing. I do not understand where in my text I devalue Julia as a language/tool. (Again, I do not like that self-written articles are used in arguments. But I did not say anything about Julia "being not good".)

What I did not write, but think, is that Julia is a very nice project, and I am a fan of its development.
Jan 19 2022
prev sibling parent reply forkit <forkit gmail.com> writes:
On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto wrote:
 That paper is from 2008, meanwhile in 2021,

 https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club//

 This is what D has to compete against, not only C++ with the 
 existing SYSCL/CUDA tooling and their ongoing integration into 
 ISO C++.
Oh. So dismissive of it because it's from 2008?

Its focus is on methods for compiler optimisation for one of the most important data structures in scientific computing -> arrays.

As such, the more D can do to generate even more efficient parallel array computations, the more chance it has of attracting 'some' from the scientific community.
Jan 19 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 11:43:25 UTC, forkit wrote:
 On Wednesday, 19 January 2022 at 06:58:55 UTC, Paulo Pinto 
 wrote:
 That paper is from 2008, meanwhile in 2021,

 https://www.hpcwire.com/off-the-wire/julia-joins-petaflop-club//

 This is what D has to compete against, not only C++ with the 
 existing SYSCL/CUDA tooling and their ongoing integration into 
 ISO C++.
Oh. so dissmisive of it because its from 2008? It's focus is on methods for compiler optimisation, for one of the most important data structures in scientific computing -> arrays. As such, the more D can do to generate even more efficient parallel array computations, the more chance it has of attracting 'some' from the scientific community.
Yes, because in 2008 CUDA and SYSCL were of little importance in the HPC universe, almost everyone was focused on OpenMP, many still thought OpenCL would suffice with its C-only API, and OpenACC was yet to show up.

Unless D comes with this in the package and those logos adopt it, just being a better language isn't enough.

https://developer.nvidia.com/hpc-sdk

https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.mbnkph

https://www.amd.com/en/technologies/open-compute

It also needs to plug into the libraries, IDEs and GPGPU debuggers available to the community.
Jan 19 2022
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC; it is also about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Jan 19 2022
parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manage to get out of their their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously.

Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs; that is why SYSCL is collaborating with the ISO C++ efforts.

As for HPC, that is where the money for these kinds of efforts comes from.
Jan 19 2022
next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
 Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manage to get out of their their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously. Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs, that is why SYSCL is collaborating with ISO C++ efforts. As for HPC, that is where the money for these kind of efforts comes from.
Is Rust utterly irrelevant in this space? It feels weird not seeing it at all in this discussion. With all the talk about just how flexible its type system is and the emphasis on the functional paradigm (things like the typestate pattern), I thought it would matter quite a bit in this context as well, since functional programming languages are said to model hardware more fluidly (naturally?) than imperative languages like C++ (yes, it's multi-paradigm as well, but come on).
Jan 19 2022
parent IGotD- <nise nise.com> writes:
On Wednesday, 19 January 2022 at 15:25:31 UTC, Tejas wrote:
 I thought it would matter quite a bit in this context as well, 
 since functional programming languages are found to model 
 hardware more fluidly(naturally?) than imperative languages 
 like C++(yes, it's multi paradigm as well but come on)
I haven't experienced that at all. Functional programming is nothing like an HDL language (VHDL, Verilog), and those languages function completely differently from functional programming languages. They are somewhat parallel in nature, at least, but not like functional programming. I've found that imperative languages model a CPU better (a sequence of instructions) than functional programming languages, which seem to work at a higher conceptual level.
Jan 19 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:
 On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim 
 Grøstad wrote:
 On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto 
 wrote:
 It also needs to plug into the libraries, IDEs and GPGPU 
 debuggers available to the community.
But the presentation is not only about HPC, but about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU. I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
Currently Vulkan Compute is not to be taken seriously.
For those wishing to deploy today, I agree, but it should be considered for future deployments. That said, it's just one way for dcompute to tie in. My current dcompute work comes in, for example, via PTX-jit courtesy of an Nvidia driver.
 Yes, the end goal of the industry efforts is that C++ will be 
 the lingua franca of GPGPUs and FPGAs, that is why SYCL is 
 collaborating with ISO C++ efforts.
Yes, apparently there's a huge amount of time/money being spent on SYCL. We can co-opt much of that work underneath (the upcoming LLVM SPIR-V backend, debuggers, profilers, some libs) and provide a much better language on top. C++/SYCL is, to put it charitably, cumbersome.
 As for HPC, that is where the money for these kind of efforts 
 comes from.
Perhaps, but I suspect other market segments will be (already are?) more important going forward. Gaming generally and ML on SoCs come to mind.
Jan 19 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 04:45:20 UTC, forkit wrote:
 On Tuesday, 18 January 2022 at 22:21:40 UTC, Ola Fosheim 
 Grøstad wrote:
 ...D's potential strength here is not so much in being able to 
 bind to C++ in a limited fashion (like Python), but being able 
 to port C++ to D and improve on it. To get there you need 
 feature parity, which is what this thread is about.
Not just 'feature' parity, but 'performance' parity too:
Yes, that is the issue I wanted to discuss in the OP. If hardware vendors create closed-source C++ compilers that use internal knowledge of how their GPUs work, then it might be difficult for other languages to compete. You'd have to compile to Metal/Vulkan and fine-tune it for each GPU. Or just compile to C++… I don't know. I guess we will find out in the years to come.
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 19 January 2022 at 09:34:38 UTC, Ola Fosheim 
Grøstad wrote:
 If hardware vendors create closed-source C++ compilers that use 
 internal knowledge of how their GPUs work, then it might be 
 difficult for other languages to compete. You'd have to compile 
 to Metal/Vulkan and fine-tune it for each GPU.
Arguably that already describes Nvidia. Luckily for us, it has an intermediate layer in PTX that LLVM can target, and that's exactly what dcompute does. Unlike C++, D can much more easily statically condition on aspects of the hardware, making it faster to navigate the tuning parameter configuration space.
Jan 19 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson 
wrote:
 Arguably that already describes Nvidia. Luckily for us, it has 
 an intermediate layer in PTX that LLVM can target, and that's 
 exactly what dcompute does.
For desktop applications one has to support Intel, AMD, Nvidia, Apple. So, does that mean that one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…
 Unlike C++, D can much more easily statically condition on 
 aspects of the hardware, making the tuning process faster to 
 navigate the parameter configuration space.
Not sure what you meant here?
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 19 January 2022 at 10:17:45 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson 
 wrote:
 Arguably that already describes Nvidia. Luckily for us, it has 
 an intermediate layer in PTX that LLVM can target, and that's 
 exactly what dcompute does.
For desktop applications one has to support Intel, AMD, Nvidia, Apple. So, does that mean that one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…
That was a comment mostly about the market share and "business practices" of Nvidia. Intel is well supported by OpenCL/SPIR-V. There are some murmurings that AMD is getting SPIR-V support for ROCm, though if that is insufficient, I don't think it would be too difficult to hook the AMDGPU backend up to LDC+DCompute (the runtime libraries would be a bit tedious, given the lack of familiarity and volume of code), but I have no hardware to run ROCm on at the moment. Metal should also not be too difficult to hook LDC up to (the kernel argument format is different, which is annoying); the main thing lacking is Objective-C support to bind the runtime libraries for DCompute (which would also need to be written). LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).
 Unlike C++, D can much more easily statically condition on 
 aspects of the hardware, making the tuning process faster to 
 navigate the parameter configuration space.
Not sure what you meant here?
I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware. You can much more easily create multiple copies for different sized cache (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.
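To make that concrete, here is a small sketch in plain D (not dcompute device code; the function and the "register budget" below are made up) of stamping out unroll variants with static foreach and gating the choice with static if:

```d
// Not device code, just the mechanism: one specialised loop body per unroll
// factor, with the choice resolved entirely at compile time.
import std.stdio;

// A windowed dot product with the inner loop fully unrolled `taps` times.
void window(int taps)(const float[] input, const float[taps] coeffs, float[] output)
{
    foreach (i; 0 .. output.length)
    {
        float acc = 0;
        static foreach (k; 0 .. taps)        // unrolled at compile time
            acc += input[i + k] * coeffs[k];
        output[i] = acc;
    }
}

void main()
{
    // Pretend this came from a per-target hardware description.
    enum registerBudget = 32;

    auto input  = new float[](16);
    input[] = 1;
    auto output = new float[](input.length - 4);

    // Pick the widest variant the (made-up) budget allows.
    static if (registerBudget >= 32)
        window!4(input, [0.25f, 0.25f, 0.25f, 0.25f], output);
    else
        window!2(input, [0.5f, 0.5f], output);

    writeln(output);   // all ones in, unity-gain filter: all ones out
}
```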
Jan 19 2022
next sibling parent reply Araq <rumpf_a web.de> writes:
On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
wrote:

 I mean there are parametric attributes of the hardware, say for 
 example cache size (or available registers for GPUs), that have 
 a direct effect on how many times you can unroll the inner 
 loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
And you can do that even more easily with an AST macro system. Which Julia has...
Jan 19 2022
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:
 And you can do that even more easily with an AST macro system. 
 Which Julia has...
I think these approaches are somewhat pointless for desktop applications, although a JIT does help. If time-consuming compile-time adaptation to the hardware is needed, then this should happen at installation. A better approach is to ship code in a high-level IR and then bundle a compiler with the installer.
Jan 20 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:
 On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
 wrote:

 I mean there are parametric attributes of the hardware, say 
 for example cache size (or available registers for GPUs), that 
 have a direct effect on how many times you can unroll the 
 inner loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
And you can do that even more easily with an AST macro system. Which Julia has...
Given this endorsement I started reading up on Julia/GPU... Here are a few things that I found:

A gentle tutorial: https://nextjournal.com/sdanisch/julia-gpu-programming

Another, more concise: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

For those who are video oriented, here's a recent workshop: https://www.youtube.com/watch?v=Hz9IMJuW5hU

While I admit to just skimming that, very long, video I was impressed by the tooling on display and the friendly presentation. In short, I found a lot to like about Julia from the above and other writings, but the material on Julia AST macros specifically was ... underwhelming. AST macros look like an inferior tool in this low-level setting. They are slightly less readable to me than the dcompute alternatives without offering any compensating gain in performance.
Jan 20 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
wrote:
 I mean there are parametric attributes of the hardware, say for 
 example cache size (or available registers for GPUs), that have 
 a direct effect on how many times you can unroll the inner 
 loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
Hmm, I don't understand, the unrolling should happen at runtime so that you can target all GPUs with one executable? If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...
Jan 19 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 20 January 2022 at 06:57:28 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson 
 wrote:
 I mean there are parametric attributes of the hardware, say 
 for example cache size (or available registers for GPUs), that 
 have a direct effect on how many times you can unroll the 
 inner loop, say for a windowing function, and you want to ship 
 optimised  code for multiple configurations of hardware.

 You can much more easily create multiple copies for different 
 sized cache (or register availability) in D than you can in 
 C++, because static foreach and static if >>> if constexpr.
Hmm, I don't understand, the unrolling should happen at runtime so that you can target all GPUs with one executable?
Now you've confused me. You can select which implementation to use at runtime with e.g. CPUID or more sophisticated methods. LDC targeting DCompute can produce multiple objects with the same compiler invocation, i.e. you can get CUDA for any set of SM versions, or OpenCL-compatible SPIR-V; per GPU you can inspect its hardware characteristics and then select which of your kernels to run.
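As a rough host-side illustration (plain D, not the dcompute API; the kernel names and their bodies are made up), the CPU case looks like this, and the GPU case has the same shape: enumerate devices, inspect them, then pick among the precompiled kernels.

```d
// Host-side illustration only (not the dcompute API): query the machine at
// runtime and pick one of several precompiled kernel variants.
import core.cpuid : avx2, sse2;
import std.stdio;

alias Kernel = void function(float[] data);

// Hypothetical variants; imagine each one tuned for the ISA in its name.
void kernelAvx2(float[] data)   { data[] *= 2; }
void kernelSse2(float[] data)   { data[] *= 2; }
void kernelScalar(float[] data) { foreach (ref x; data) x *= 2; }

Kernel selectKernel()
{
    if (avx2) return &kernelAvx2;
    if (sse2) return &kernelSse2;
    return &kernelScalar;
}

void main()
{
    auto data = [1.0f, 2.0f, 3.0f];
    selectKernel()(data);
    writeln(data);   // [2, 4, 6] whichever variant was chosen
}
```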
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and, with a bit of work, whatever you'd use HLSL for; it can also be your compute kernel language, substituting for OpenCL and CUDA. Same caveats apply for Metal (should be pretty easy to do: need Objective-C support in LDC, need Metal bindings).
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson 
wrote:
  Now you've confused me. You can select which implementation to 
 use at runtime with e.g. CPUID or more sophisticated methods. 
 LDC targeting DCompute can produce multiple objects with the 
 same compiler invocation, i.e. you can get CUDA for any set of 
 SM version, OpenCL compatible SPIR-V which you can get per GPU, 
 inspect its hardware characteristics and then select which of 
 your kernels to run.
Yes, so why do you need compile time features? My understanding is that the goal of nvc++ is to compile to CPU or GPU based on what pays off more for the actual code. So it will not need any annotations (it is up to the compiler to choose between CPU/GPU?). Bryce suggested that it currently only targets one specific GPU, but that it will target multiple GPUs for the same executable in the future. The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said? My viewpoint is that if one is going to take a performance hit by not writing the shaders manually, one needs to get maximum convenience as a payoff. It should be an alternative for programmers that cannot afford to put in the extra time to support GPU compute manually.
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and with a bit of work whatever you'd use HLSL for, it can also be your compute kernel language substituting for OpenCL and CUDA.
I still don't understand why you would need static if/static for-loops? Seems to me that this is too hardwired, you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.
 Same caveats apply for metal (should be pretty easy to do: need 
 Objective-C support in LDC, need Metal bindings).
Use clang to compile the Objective-C code to object files and link with it?
Jan 20 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson 
 wrote:
  Now you've confused me. You can select which implementation 
 to use at runtime with e.g. CPUID or more sophisticated 
 methods. LDC targeting DCompute can produce multiple objects 
 with the same compiler invocation, i.e. you can get CUDA for 
 any set of SM version, OpenCL compatible SPIR-V which you can 
 get per GPU, inspect its hardware characteristics and then 
 select which of your kernels to run.
Yes, so why do you need compile time features?
Because compilers are not sufficiently advanced to extract all the performance that is available on their own. A good example of where the automated/simple approach was not good enough is CUB (CUDA unbound), a high performance CUDA library found here https://github.com/NVIDIA/cub/tree/main/cub I'd recommend taking a look at the specializations that occur in CUB in the name of performance. D compile time features can help reduce this kind of mess, both in extreme performance libraries and extreme performance code.
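For a flavour of what I mean, here is a hedged sketch (my own made-up policy numbers and names, not CUB's) of how a static if ladder keyed on a target descriptor can stand in for a family of hand-written per-architecture tuning policies:

```d
// My own made-up numbers, not CUB's: the point is that one static if ladder
// replaces a family of hand-maintained per-architecture specialisations.
import std.stdio;

struct Policy(int blockThreads, int itemsPerThread)
{
    enum BLOCK_THREADS    = blockThreads;
    enum ITEMS_PER_THREAD = itemsPerThread;
    enum TILE_ITEMS       = blockThreads * itemsPerThread;
}

// Compile-time selection of a tuning policy keyed on a target descriptor.
template TunedPolicy(int smVersion)
{
    static if (smVersion >= 80)      alias TunedPolicy = Policy!(256, 16);
    else static if (smVersion >= 60) alias TunedPolicy = Policy!(128, 12);
    else                             alias TunedPolicy = Policy!(64, 8);
}

void describe(int smVersion)()
{
    alias P = TunedPolicy!smVersion;
    writefln("sm_%s: %s threads x %s items = %s-element tiles",
             smVersion, P.BLOCK_THREADS, P.ITEMS_PER_THREAD, P.TILE_ITEMS);
}

void main()
{
    // One line per supported target instead of one hand-written file each.
    static foreach (sm; [52, 61, 86])
        describe!sm();
}
```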
 My understanding is that the goal of nvc++ is to compile to CPU 
 or GPU based on what pays of more for the actual code. So it 
 will not need any annotations (it is up to the compiler to 
 choose between CPU/GPU?). Bryce suggested that it currently 
 only targets one specific GPU, but that it will target multiple 
 GPUs for the same executable in the future.

 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
I think that that is an entirely reasonable goal but such transparency may cost performance and any such cost will be unacceptable to some.
 My viewpoint is that if one are going to take a performance hit 
 by not writing the shaders manually one need to get maximum 
 convenience as a payoff.

 It should be an alternative for programmers that cannot afford 
 to put in the extra time to support GPU compute manually.
Yes. Always good to have alternatives. Fully automated is one option, hinted is a second alternative, meta-programming assisted manual is a third.
 If you have to do the unrolling in D, then a lot of the 
 advantage is lost and I might just as well write in a shader 
 language...
D can be your compute shading language for Vulkan and with a bit of work whatever you'd use HLSL for, it can also be your compute kernel language substituting for OpenCL and CUDA.
I still don't understand why you would need static if/static for-loops? Seems to me that this is too hardwired, you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.
If you can achieve your performance objectives with automated or hinted solutions, great!  But what if you can't?  Most people will not have to go as hardcore as the CUB authors did to get the performance they need, but I find myself wanting more than the compiler can easily give me quite a bit of the time.  I'm very happy to have the metaprogramming tools to factor/reduce these "manual" programming tasks.
 Same caveats apply for metal (should be pretty easy to do: 
 need Objective-C support in LDC, need Metal bindings).
Use clang to compile the objective-c code to object files and link with it?
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:
 Because compilers are not sufficiently advanced to extract all 
 the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU combinations either, so you don't know whether SIMD would perform better than the GPU. Something automated has to be present, at least on install, otherwise you risk performance degradation compared to a pure SIMD implementation. And then it is better (and cheaper) to just avoid the GPU altogether.
 A good example of where the automated/simple approach was not 
 good enough is CUB (CUDA unbound), a high performance CUDA 
 library found here https://github.com/NVIDIA/cub/tree/main/cub

 I'd recommend taking a look at the specializations that occur 
 in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when I browsed through the repo?
 If you can achieve your performance objectives with automated 
 or hinted solutions, great!  But what if you can't?
Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly? But I have no experience with that as it is quite time consuming to go that route. Right now basic SIMD is time consuming enough… (but OK)
Jan 20 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 13:29:26 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal 
 wrote:
 Because compilers are not sufficiently advanced to extract all 
 the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU combinations either so then you don't know if SIMD would perform better than GPU.
It can be very expensive to write and test all the permutations, yes, but often you'll understand the bottlenecks of your algorithms sufficiently to be able to correctly filter out the work up front. Restating here, these are a few of the traditional ways to look at it: Throughput or latency limited? Operand/memory or arithmetic limited? Power (watts) preferred or other performance? It's possible, for instance, that you can *know*, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.
 Something automated has to be present, at least on install, 
 otherwise you risk performance degradation compared to a pure 
 SIMD implementation. And then it is better (and cheaper) to 
 just avoid GPU altogether.
Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness testing and portability baseline regardless of the accelerator possibilities.
 A good example of where the automated/simple approach was not 
 good enough is CUB (CUDA unbound), a high performance CUDA 
 library found here https://github.com/NVIDIA/cub/tree/main/cub

 I'd recommend taking a look at the specializations that occur 
 in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when I browsed through the repo?
The key thing to note is how much effort the authors put into specialization wrt the HW x SW cross product. There are entire subdirectories devoted to specialization. At least some of this complexity, this programming burden, can be factored out with better language support.
 If you can achieve your performance objectives with automated 
 or hinted solutions, great!  But what if you can't?
Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly?
That's what seems reasonable, yes, but fortunately I don't think it's correct. By analogy, you *can* get maximum performance from assembly level programming, if you have all the compiler back-end knowledge in your head, but if your language allows you to communicate all relevant information (mainly dependencies and operand localities but also "intrinsics") then the compiler can do at least as well as the assembly level programmer. Add language support for inline and factored specialization and the lower level alternatives become even less attractive.
 But I have no experience with that as it is quite time 
 consuming to go that route. Right now basic SIMD is time 
 consuming enough… (but OK)
Indeed. I'm currently working on the SIMD variant of something I partially prototyped earlier on a 2080 and it has been slow going compared to either that GPU implementation or the scalar/serial variant. There are some very nice assists from D for SIMD programming: the __vector typing, __vector arithmetic, unaligned vector loads/stores via static array operations, static foreach to enable portable expression of single-instruction SIMD functions like min, max, select, various shuffles, masks, ... but, yes, SIMD programming is definitely a slog compared to either scalar or SIMT GPU programming.
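A tiny sketch of those assists (assumes an x86_64 target where core.simd defines float4; the helper names are mine, and an optimising backend is only expected, not guaranteed, to collapse the per-lane loops into single instructions):

```d
// Small CPU-side sketch of the D SIMD assists mentioned above.
import core.simd;
import std.stdio;

// An unaligned load expressed through the vector's static-array view.
float4 load4(const(float)[] s)
{
    float4 v;
    static foreach (i; 0 .. 4)
        v.array[i] = s[i];
    return v;
}

// Portable per-lane min; a good backend can usually turn this static foreach
// into a single SIMD min instruction.
float4 vmin(float4 a, float4 b)
{
    float4 r;
    static foreach (i; 0 .. 4)
        r.array[i] = a.array[i] < b.array[i] ? a.array[i] : b.array[i];
    return r;
}

void main()
{
    float[] xs = [4.0f, 1.0f, 7.0f, 2.0f];
    float[] ys = [3.0f, 5.0f, 6.0f, 8.0f];

    float4 a = load4(xs);
    float4 b = load4(ys);
    float4 sum = a + b;     // __vector arithmetic
    float4 low = vmin(a, b);

    writeln(sum.array);     // [7, 6, 13, 10]
    writeln(low.array);     // [3, 1, 6, 2]
}
```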
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal wrote:
 It's possible, for instance, that you can *know*, from first 
 principles, that you'll never meet objective X if forced to use 
 platform Y.  In general, though, you'll just have a sense of 
 the order in which things should be evaluated.
This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.
 Yes, SIMD can be the better performance choice sometimes.  I 
 think that many people will choose to do a SIMD implementation 
 as a performance, correctness testing and portability baseline 
 regardless of the accelerator possibilities.
My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option. I think that the point of having parallel support built into the language is not to get the absolute maximum performance, but to make writing more performant code more accessible and cheaper. If you end up having to handwrite SIMD to get decent performance then that pretty much makes parallel support a fringe feature. E.g. it won't be of much use outside HPC with expensive equipment. So in my mind this feature does require hardware vendors to focus on CPU/GPU integration, and it also requires a rather "intelligent" compiler and runtime setup in order to pay for the debts of the "abstraction overhead". I don't think just translating a language AST to an existing shared backend will be sufficient. If that was sufficient Nvidia wouldn't need to invest in nvc++? But, it remains to be seen who will pull this off, besides Nvidia.
Jan 20 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 20 January 2022 at 19:57:54 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal 
 wrote:
 It's possible, for instance, that you can *know*, from first 
 principles, that you'll never meet objective X if forced to 
 use platform Y.  In general, though, you'll just have a sense 
 of the order in which things should be evaluated.
This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.
Never meant to say that it did. Just pointed out that you can factor some of the work.
 Yes, SIMD can be the better performance choice sometimes.  I 
 think that many people will choose to do a SIMD implementation 
 as a performance, correctness testing and portability baseline 
 regardless of the accelerator possibilities.
My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option.
The dream, for decades, has been that "the compiler" will just "do the right thing" when provided dead simple code, that it will achieve near-or-better-than-human-tuned levels of performance in all scenarios that matter. It is a dream worth pursuing.
 I think that the point of having parallel support built into 
 the language is not to get the absolute maximum performance, 
 but to make writing more performant code more accessible and 
 cheaper.
If accessibility requires less performance then you, as a language designer, have a choice. I think it's a false choice but if forced to choose my choice would bias toward performance, "system language" and all that. Others, if forced to choose, would pick accessibility.
 If you end up having to handwrite SIMD to get decent 
 performance then that pretty much makes parallel support a 
 fringe feature. E.g. it won't be of much use outside HPC with 
 expensive equipment.
I disagree but can't see how pursuing it further would be useful. We can just leave it to the market.
 So in my mind this feature does require hardware vendors to 
 focus on CPU/GPU integration, and it also requires a rather 
 "intelligent" compiler and runtime setup in order to pay for 
 the debts of the "abstraction overhead".
I put more faith in efforts that cleanly reveal low level capabilities to the community, that are composable, than I do in future hardware vendor efforts.
 I don't think just translating a language AST to an existing 
 shared backend will be sufficient. If that was sufficient 
 Nvidia wouldn't need to invest in nvc++?
Well, at least for current dcompute users, it already is sufficient. The Julia efforts in this area also appear to be successful. Sean Baxter's "circle" offshoot of C++ is another. I imagine there are or will be other instances where relatively small manpower inputs successfully co-opt backends to provide nice access and great performance for their respective language communities.
 But, it remains to be seen who will pull this off, besides 
 Nvidia.
I don't think there is much that remains to be seen here. The rate and scope of adoption are still interesting questions but the "can we provide something very useful to our language community?" question has been answered in the affirmative. People choose dcompute, circle, Julia-GPU over or in addition to CUDA/OpenCL today. Others await more progress from the C++/SycL movement. Meaningful choice is good.
Jan 20 2022
prev sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad 
wrote:
 Yes, so why do you need compile time features?

 My understanding is that the goal of nvc++ is to compile to CPU 
 or GPU based on what pays of more for the actual code. So it 
 will not need any annotations (it is up to the compiler to 
 choose between CPU/GPU?). Bryce suggested that it currently 
 only targets one specific GPU, but that it will target multiple 
 GPUs for the same executable in the future.
There are two major advantages to compile time features, for the host and for the device (e.g. GPU). On the host side, D metaprogramming allows DCompute to do what CUDA does with its <<<>>> kernel launch syntax, in terms of type safety and convenience, with regular D code. This is the feature that makes CUDA nice to use; OpenCL's lack of such a feature makes it quite horrible to use, and turns a change of kernel signature into a refactoring unto itself. On the device side, I'm sure Bruce can give you some concrete examples.
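Back on the host side, here is a hypothetical sketch (this is not dcompute's actual API, just an illustration of the mechanism) of how D templates can tie a launch call's argument list to the kernel's signature:

```d
// Hypothetical sketch, not dcompute's real API: the argument tuple is literally
// the kernel's parameter list, so the launch is type-checked at compile time.
import std.traits : Parameters;
import std.stdio;

// Stand-in for a device kernel; under dcompute this would be compiled for the GPU.
void saxpy(float a, float[] x, float[] y)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

void launch(alias kernel)(size_t gridSize, Parameters!kernel args)
{
    // A real backend would marshal `args` and enqueue on a device queue here;
    // this sketch just runs the kernel on the host.
    foreach (_; 0 .. gridSize)
        kernel(args);
}

void main()
{
    auto x = [1.0f, 2.0f, 3.0f];
    auto y = [0.0f, 0.0f, 0.0f];

    launch!saxpy(1, 2.0f, x, y);      // checked against saxpy's parameters
    // launch!saxpy(1, "oops", x, y); // would be rejected at compile time
    writeln(y);                        // [2, 4, 6]
}
```

Change the kernel's signature and every call site fails at build time, rather than failing at enqueue time the way a raw OpenCL setup tends to.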
 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
You want it to be transparent, not invisible.
 Same caveats apply for metal (should be pretty easy to do: 
 need Objective-C support in LDC, need Metal bindings).
Use clang to compile the objective-c code to object files and link with it?
Won't work, D needs to be able to call the Objective-C. I mean you could use a C or C++ shim, but that would be pretty ugly.
Jan 20 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson wrote:
 There are two major advantages for compile time features, for 
 the host and for the device (e.g. GPU).
Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime? I guess there might be instances where you might want to consider changing the entire data layout to fit the hardware, but then you are, to some extent, outside of what most D programmers would be willing to do.
 The goal for C++ parallelism is to make it fairly transparent 
 to the programmer. Or did I misunderstand what he said?
You want it to be transparent, not invisible.
The goal is to make it look like a regular C++ library, no extra syntax.
 Wont work, D needs to be able to call the objective-c.
 I mean you could use a C or C++ shim, but that would be pretty 
 ugly.
Just write the whole runtime in Objective-C++. Why would it be ugly?
Jan 21 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 21 January 2022 at 08:56:22 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson 
 wrote:
 There are two major advantages for compile time features, for 
 the host and for the device (e.g. GPU).
Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime?
Before. But with SPIR-V there is an additional compilation/optimisation step where it is converted into whatever format the hardware uses; you could also set specialisation constants at that point, if I ever get around to supporting those. I think something similar probably also happens with PTX (which is an assembly-like format) on the way to whatever the binary format is.
 I guess there might be instances where you might want to 
 consider to change the entire data layout to fit the hardware, 
 but then you to some extent outside of what most D programmers 
 would be willing to do.
Indeed.
 You want it to be transparent, not invisible.
The goal is to make it look like a regular C++ library, no extra syntax.
There is an important difference between it looking like regular C++ (i.e. function calls, not <<<>>>) and the compiler doing auto-GPU-isation. I'm not sure which one you're referring to here. I'm all for the former, that's what DCompute does; the latter falls too far into "sufficiently advanced compiler" territory and would necessarily have to determine what to send to the GPU and when, which could seriously impact performance.
 Just write the whole runtime in Objective-C++. Why would it be 
 ugly?
_Just_. I mean it would be doable, but I'd rather not spend my time doing that.
Jan 21 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 21 January 2022 at 09:45:32 UTC, Nicholas Wilson wrote:
 _Just_. I mean it would be doable, but I rather not spend my 
 time doing that.
:-D This is where you need more than one person for the project… I might do it, if I found a use case for it. I am sure some contributor other than yourself could do it if Metal support were in.
Jan 21 2022
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 I found the CppCon 2021 presentation
 [C++ Standard 
 Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
 Bryce Adelstein Lelbach very interesting, unusually clear and 
 filled with content. I like this man. No nonsense.

 It provides a view into what is coming for relatively high 
 level and hardware agnostic parallel programming in C++23 or 
 C++26. Basically a portable "high level" high performance 
 solution.

 He also mentions the Nvidia C++ compiler *nvc++* which will 
 make it possible to compile C++ to Nvidia GPUs in a somewhat 
 transparent manner. (Maybe it already does, I have never tried 
 to use it.)

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
Given the emergence of ML in the commercial space and the prevalence of accelerator HW on SoCs and elsewhere, this is a timely topic Ola. We have at least two options: 1) try to mimic or sit atop the, often byzantine, interfaces that creak out of the C++ community or 2) go direct to the evolving metal with D meta-programming shouldering most of the load. I favor the second of course. For reference, CUDA/C++ was my primary programming language for 5+ years prior to taking up D and, even in its admittedly less-than-newbie-friendly state, I prefer dcompute to CUDA. With some additional work dcompute could become a broadly accessible path to world beating performance/watt libraries and apps. Code that you can actually understand at a glance when you pick it up down the road. Kudos to the dcompute contributors, especially Nicholas.
Jan 12 2022
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
Jan 12 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++. There *are* improvements to be made but, by my lights, dcompute is already better than CUDA in many ways. If we improve usability, make dcompute accessible to "mere mortals", make it a "no big deal" choice instead of a "here be dragons" choice, we'd really have something. By contrast, I just don't see the C++ crowd getting to sanity/simplicity any time soon... not unless ideas from the circle compiler or similar make their way to mainstream.
Jan 12 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal wrote:
 I disagree.  D/dcompute can be used as a better general purpose 
 GPU kernel language now (superior meta programming, sane nested 
 functions, ...).
Is *dcompute* being actively developed or is it in a "frozen" state? Longevity is important for adoption, I think.
 There *are* improvements to be made but, by my lights, dcompute 
 is already better than CUDA in many ways.  If we improve 
 usability, make dcompute accessible to "mere mortals", make it 
 a "no big deal" choice instead of a "here be dragons" choice, 
 we'd really have something.
Maybe it would be possible to do something with a more limited scope, but more low level? Like something targeting Metal and Vulkan directly? Something like this might be possible to do well if D would change the focus and build a high-level IR. I think one of Bryce's main points is that there is more long-term stability in C++ than in the other APIs for parallel computing, so for long-term development it would be better to express parallel code in terms of a C++ standard library construct than in terms of other compute APIs. That argument makes sense to me, I don't want to deal with CUDA or OpenCL as dependencies. I'd rather have something sit directly on top of the lower level APIs.
 By contrast, I just don't see the C++ crowd getting to 
 sanity/simplicity any time soon... not unless ideas from the 
 circle compiler or similar make their way to mainstream.
It does look a bit complex, but what I find promising for C++ is that Nvidia is pushing their hardware by creating backends for C++ parallel libraries that target multiple GPUs. That in turn might push Apple to do the same for Metal and so on. If C++20 had what Bryce presented then I would've considered using it for signal processing. Right now it would make more sense to target Metal/Vulkan directly, but that is time-consuming, so I probably won't.
Jan 13 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 09:10:48 UTC, Ola Fosheim Grøstad 
wrote:
 Is *dcompute* being actively developed or is it in a "frozen" 
 state? longevity is important for adoption, I think.
not actively per se, but I have been adding features recently...
 Maybe it would be possible to do something with a more limited 
 scope, but more low level? Like something targeting Metal and 
 Vulkan directly? Something like this might be possible to do 
 well if D would change the focus and build a high level IR.
... one of which was compiler support for Vulkan compute shaders (no runtime yet — Ethan didn't need that, graphics APIs are large, and I'm not sure if there are any good bindings). Metal is annoyingly different in its kernel signatures, which could be handled fairly easily, but:

* LDC lacks Objective-C support, so even if the compiler side of Metal support worked, the runtime side would not. (N.B. adding Objective-C support shouldn't be too difficult, but I don't have a particular need for it.)

* kernels written for Metal would not be compatible with the OpenCL and CUDA ones (not that I suppose that would be a particular problem if all you care about is Metal).

LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).
 I think one of Bryce's main points is that there is more long 
 term stability in C++ than in the other APIs for parallel 
 computing, so for long term development it would be better to 
 express parallel code in terms of a C++ standard library 
 construct than other compute-APIs.

 That argument makes sense for me, I don't want to deal with 
 CUDA or OpenCL as dependencies. I'd rather have something sit 
 directly on top of the lower level APIs.
Dcompute essentially sits as a thin layer over both, but importantly automates the crap out of the really tedious and error prone usage of the APIs. It would be entirely possible to create a thicker API agnostic layer over the top of both of them.
 By contrast, I just don't see the C++ crowd getting to 
 sanity/simplicity any time soon... not unless ideas from the 
 circle compiler or similar make their way to mainstream.
It does look a bit complex, but what I find promising for C++ is that Nvidia is pushing their hardware by creating backends for C++ parallel libraries that target multiple GPUs. That in turn might push Apple to do the same for Metal and so on. If C++20 had what Bryce presented then I would've considered using it for signal processing. Right now it would make more sense to target Metal/Vulkan directly, but that is time-consuming, so I probably won't.
If there is sufficient interest for it, I might have a go at adding Metal compute support to ldc.
Jan 13 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 09:42:04 UTC, Nicholas Wilson 
wrote:
 If there is sufficient interest for it, I might have a go at 
 adding Metal compute support to ldc.
I don't know if there is enough interest for it today. Right now, maybe easy visualization is more important. But when GUI/visualization is in place then I think a compute solution that supports lower level GPU APIs would be valuable for desktop application development. Not sure if it is a good idea to do a compute-only runtime as I would think that the application developer would want to balance resources used for compute and visualization in some way?
Jan 13 2022
prev sibling parent reply bachmeier <no spam.net> writes:
On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal wrote:
 On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++.
I was referring to libraries like numpy for Python or the numerical capabilities built into Julia. D just isn't in a state where a researcher is going to say "let's write a D program for that simulation". You can call some things in Mir and cobble together an interface to some C libraries or whatever. That's not the same as Julia, where you write the code you need for the task at hand. That's the starting point to make it into scientific computing. On the embedding, yes, that is the strength of D. If you write code in Python, it's realistically only for the Python world. Probably the same for Julia.
Jan 13 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 If you write code in Python, it's realistically only for the 
 Python world. Probably the same for Julia.
Does scipy provide the functionality you would need? Could it in some sense be considered a baseline for scientific computing APIs?
Jan 13 2022
parent reply sfp <sfp cims.nyu.edu> writes:
On Thursday, 13 January 2022 at 15:09:13 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 If you write code in Python, it's realistically only for the 
 Python world. Probably the same for Julia.
Does scipy provide the functionality you would need? Could it in some sense be considered a baseline for scientific computing APIs?
SciPy is fairly useful but it is only one amongst a constellation of Python scientific computing libraries. It emulates a fair amount of what is provided by MATLAB, and it sits on top of numpy. Using SciPy, numpy, and matplotlib in tandem gives a user access to roughly the same functionality as a vanilla installation of MATLAB. SciPy and numpy are built on top of a substrate of old and stable packages written in Fortran and C (BLAS, LAPACK, fftw, etc.). Python, MATLAB, and Julia are basically targeted at scientists and engineers writing "application code". These languages aren't appropriate for "low-level" scientific computing along the lines of the libraries mentioned above. Julia does make a claim to the contrary: it is feasible to write fast low-level kernels in it, but (last time I checked) it is not so straightforward to export them to other languages, since Julia likes to do things at runtime. Fortran and C remain good choices for low-level kernel development because they are easily consumed by Python et al. And as far as parallelism goes, OpenMP is the most common since it is straightforward conceptually. C++ is also fairly popular but since consuming something like a highly templatized header-only C++ library using e.g. Python's FFI is a pain, it is a less natural choice. (It's easier using pybind11, but the compile times will make you weep.) Fortran, C, and C++ are also all standardized. This is valuable. The people developing these libraries are---more often than not---academics, who aren't able to devote much of their time to software development. Having some confidence that their programming language isn't going to change underneath gives them some assurance that they aren't going to be forced to spend an inordinate amount of time keeping their code in compliance for it to remain usable. Either that, or they write a library in Python and abandon it later. As an aside, people lament the use of MATLAB, but one of its stated goals is backwards compatibility. Consequently, there's rather a lot of old MATLAB code floating around still in use. "High-level" D is currently not that interesting for high-level scientific application code. There is a long list of "everyday" scientific computing tasks I could think of which I'd like to be able to execute in a small number of lines, but this is currently impossible using any flavor of D. See https://www.numerical-tours.com for some ideas. "BetterC" D could be useful for developing numerical kernels. An interesting idea would to use D's introspection capabilities to automatically generate wrappers and documentation for each commonly used scientific programming language (Python, MATLAB, Julia). But D not being standardized makes it less attractive than C or Fortran. It is also unclear how stable D is as an open source project. The community surrounding it is rather small and doesn't seem to have much momentum. There also do not appear to be any scientific computing success stories with D. My personal view is that people in science are generally more interested in actually doing science than in playing around with programming trivia. Having to spend time to understand something like C++'s argument dependent lookup is generally viewed as undesirable and a waste of time.
Jan 13 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 16:10:39 UTC, sfp wrote:
 My personal view is that people in science are generally more 
 interested in actually doing science than in playing around 
 with programming trivia.
Yes, this is probably true. My impression is that physics departments tend to be in favour of Python, C++ and I guess Fortran. In signal processing it is Matlab, with Python as an upcoming alternative. Maybe GPU-compute support is more relevant for desktop application development than scientific computing, in the context of D.
Jan 13 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 14:50:59 UTC, bachmeier wrote:
 On Thursday, 13 January 2022 at 07:23:40 UTC, Bruce Carneal 
 wrote:
 On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only 
 increase as time goes on.

 What do you think?
It doesn't matter all that much for D TBH. Without the basic infrastructure for scientific computing like you get out of the box with those three languages, the ability to target another platform isn't going to matter. There are lots of pieces here and there in our community, but it's going to take some effort to (a) make it easy to use the different parts together, (b) document everything, and (c) write the missing pieces.
I disagree. D/dcompute can be used as a better general purpose GPU kernel language now (superior meta programming, sane nested functions, ...). If you are concerned about "infrastructure" you embed in C++.
I was referring to libraries like numpy for Python or the numerical capabilities built into Julia. D just isn't in a state where a researcher is going to say "let's write a D program for that simulation". You can call some things in Mir and cobble together an interface to some C libraries or whatever. That's not the same as Julia, where you write the code you need for the task at hand. That's the starting point to make it into scientific computing.
I agree. If the heavy lifting for a new project is accomplished by libraries that you can't easily co-opt then better to employ D as the GPU language or not at all. More broadly, I don't think we should set ourselves a task of displacing language X in community Y. Better to focus on making accelerator programming "no big deal" in general so that people opt-in more often (first as accelerator language sub-component, then maybe more). While my present day use of dcompute is in real time video, where it works a treat, I'm most excited about the possibilities dcompute would afford on SoCs. World class perf/watt from dead simple code deployable to billions of units? Yes, please.
 ...
Jan 13 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 03:56:00 UTC, bachmeier wrote:
 platform isn't going to matter. There are lots of pieces here 
 and there in our community, but it's going to take some effort 
 to (a) make it easy to use the different parts together, (b) 
 document everything, and (c) write the missing pieces.
What C++ seems to do for (a) is to add a library construct for fully configurable multidimensional non-owning slices (`mdspan`).
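To make the comparison concrete, a minimal sketch of that idea in D terms (a toy type of my own; Mir's ndslice is the full-featured D equivalent):

```d
// A minimal sketch of the mdspan idea: a non-owning 2-D view over existing memory.
import std.stdio;

struct View2D(T)
{
    T* data;          // borrowed, not owned
    size_t rows, cols;

    ref T opIndex(size_t r, size_t c)
    {
        assert(r < rows && c < cols);
        return data[r * cols + c];   // row-major layout
    }
}

void main()
{
    auto storage = new double[](6);   // any contiguous buffer will do
    foreach (i, ref v; storage) v = i;

    auto m = View2D!double(storage.ptr, 2, 3);
    writeln(m[1, 2]);   // 5: element at row 1, column 2
    m[0, 0] = 42;
    writeln(storage);   // the view writes through to the original buffer
}
```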
Jan 13 2022
prev sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
Grøstad wrote:
 I found the CppCon 2021 presentation
 [C++ Standard 
 Parallelism](https://www.youtube.com/watch?v=LW_T2RGXego) by 
 Bryce Adelstein Lelbach very interesting, unusually clear and 
 filled with content. I like this man. No nonsense.

 It provides a view into what is coming for relatively high 
 level and hardware agnostic parallel programming in C++23 or 
 C++26. Basically a portable "high level" high performance 
 solution.

 He also mentions the Nvidia C++ compiler *nvc++* which will 
 make it possible to compile C++ to Nvidia GPUs in a somewhat 
 transparent manner. (Maybe it already does, I have never tried 
 to use it.)

 My gut feeling is that it will be very difficult for other 
 languages to stand up to C++, Python and Julia in parallel 
 computing. I get a feeling that the distance will only increase 
 as time goes on.

 What do you think?
I think the ship has already sailed, given the industry standards of SYCL and C++ for OpenCL, and their integration into clang (check the CppCon talks on the same) and FPGA generation. D can have a go at it, but only by plugging into the LLVM ecosystem, where C++ is the name of the game, and given that it is approaching Linux levels of industry contributors it isn't going anywhere. There was a time to try to overthrow C++; that was 10 years ago, when LLVM was hardly relevant and GPGPU computing still wasn't mainstream.
Jan 12 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 07:46:32 UTC, Paulo Pinto wrote:
 On Wednesday, 12 January 2022 at 22:50:38 UTC, Ola Fosheim 
 Grøstad wrote:
 ...
 What do you think?
... D can have a go at it, but only by plugging into the LLVM ecosystem where C++ is the name of the game, and given it is approaching Linux level of industry contributors it isn't going anywhere.
Yes.  The language-independent work in LLVM in the accelerator area is hugely important for dcompute, essential.  Gotta surf that wave as we don't have the manpower to go independent.  I don't think *anybody* has that amount of manpower, hence the collaboration/consolidation around LLVM as a back-end for accelerators.
 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Jan 13 2022
parent reply Tejas <notrealemail gmail.com> writes:
On Thursday, 13 January 2022 at 14:24:59 UTC, Bruce Carneal wrote:

 Yes.  The language independent work in LLVM in the accelerator 
 area is hugely important for dcompute, essential.
Sorry if this sounds ignorant, but does SPIR-V count for nothing?
  Gotta surf that wave as we don't have the manpower to go 
 independent.  I dont think *anybody* has that amount of 
 manpower, hence the collaboration/consolidation around LLVM as 
 a back-end for accelerators.

 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Overthrowing may be hopeless, but I feel we should at least be really competitive with them. Because it doesn't matter whether we're competing with C++ or not, people will compare us with it, since that's the other choice when people want to write extremely performant GPU code (if they care about ease of setup and productivity and _not_ performance-at-any-cost, Julia and Python have beaten us to it :-( )
Jan 13 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 16:31:11 UTC, Tejas wrote:
 On Thursday, 13 January 2022 at 14:24:59 UTC, Bruce Carneal 
 wrote:

 Yes.  The language independent work in LLVM in the accelerator 
 area is hugely important for dcompute, essential.
Sorry if this sounds ignorant, but does SPIR-V count for nothing?
SPIR-V is *very* useful. It is the catalyst and focal point of some of the most important ongoing LLVM accelerator work. Nicholas and I both believe that that work could provide a much more robust intermediate target for dcompute once it hits release status.
  Gotta surf that wave as we don't have the manpower to go 
 independent.  I dont think *anybody* has that amount of 
 manpower, hence the collaboration/consolidation around LLVM as 
 a back-end for accelerators.

 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Yes. The "overthrow" of C++ should be a non-goal, IMO, starting yesterday.
Overthrowing may be hopeless, but I feel we should at least be a really competitive with them.
Sure. We need to offer something that is actually better, we just don't need to be perceived as better by everyone in all scenarios. An example: if management is deathly afraid of anything but microscopic incremental development or, more charitably, management weighs the risks of new development very very heavily, then D is unlikely to be given a chance.
 Because it doesn't matter whether we're competing with C++ or 
 not, people will compare us with it since that's the other 
 choice when people will want to write extremely performant GPU 
 code(if they care about ease of setup and productivity and 
 _not_ performance-at-any-cost, Julia and Python have beat us to 
 it :-(
 )
Yes. We should evaluate our efforts by comparing (competing) with alternatives where available. D/dcompute is already, for my GPU work at least, much better than CUDA/C++. Concretely: I can achieve equivalent or higher performance more quickly with more readable code than I could formerly with CUDA/C++. There are some things that are trivial in D kernels (like live-in-register/mem-bandwidth-minimized stencil processing) that would require "heroic" effort in CUDA/C++. That said, there are definitely things that we could improve in the dcompute/accelerator area, particularly wrt the on-ramp for those new to accelerator programming. But, as you note, D is unlikely to be adopted by the "performance is good enough with existing solutions" crowd in any case. That's fine.
Jan 13 2022
parent reply bachmeier <no spam.net> writes:
On Thursday, 13 January 2022 at 18:41:54 UTC, Bruce Carneal wrote:

 Yes.  We should evaluate our efforts by comparing (competing) 
 with alternatives where available.  D/dcompute is already, for 
 my GPU work at least, much better than CUDA/C++.  Concretely: I 
 can achieve equivalent or higher performance more quickly with 
 more readable code than I could formerly with CUDA/C++.  There 
 are some things that are trivial in D kernels (like 
 live-in-register/mem-bandwidth-minimized stencil processing) 
 that would require "heroic" effort in CUDA/C++.
Does anyone else know anything about this? Burying it deep in a mailing list post isn't exactly the best way to publicize it. Ironically, I might add, in a discussion about lack of uptake.
Jan 13 2022
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 19:35:28 UTC, bachmeier wrote:
 On Thursday, 13 January 2022 at 18:41:54 UTC, Bruce Carneal 
 wrote:

 Yes.  We should evaluate our efforts by comparing (competing) 
 with alternatives where available.  D/dcompute is already, for 
 my GPU work at least, much better than CUDA/C++.  Concretely: 
 I can achieve equivalent or higher performance more quickly 
 with more readable code than I could formerly with CUDA/C++.  
 There are some things that are trivial in D kernels (like 
 live-in-register/mem-bandwidth-minimized stencil processing) 
 that would require "heroic" effort in CUDA/C++.
Does anyone else know anything about this? Burying it deep in a mailing list post isn't exactly the best way to publicize it. Ironically, I might add, in a discussion about lack of uptake.
I know, right? Ridiculously big opportunity/effort ratio for dlang and near zero awareness... I usually talk a bit about dcompute at the beerconfs but to date I've only corresponded on the topic with Nicholas, Ethan, and Max (a little).

Ethan might have a sufficiently compelling economic case for promoting dcompute to his company in the relatively near future. Nicholas recently addressed their need for access to the texture hardware and fitting within their work flow, but there may be other requirements... An adoption by a world class game studio would, of course, be very good news but I think Ethan is slammed (perpetually, and in a mostly good way, I think) so it might be a while.

Before promoting dcompute broadly I believe we should work through the installation/build/deployment procedures and some examples for the "new to accelerators" crowd. It's no big deal as it sits for old hands but first impressions are important and even veteran programmers will appreciate an "it just works" on-ramp.

If you're interested I suggest we continue the conversation on dcompute at the next beerconf where we can plot its path to world domination... :-)
Jan 13 2022
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec? I honestly assumed that it was unsupported and close to dead as I had not heard much about it for a long time.
Jan 13 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 21:06:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec?
There are two reasons that I have not promoted dcompute to the general community up to now:

1) Any resultant increase in support load would fall on one volunteer (that is not me) and

2) IMO, a better on-ramp, particularly for those new to accelerators, is needed: additional examples, docs, and "it just works" install/build/deploy vetting would go a long way to reducing the support load and increasing happy uptake.

Additionally, Nicholas has a list of "TODOs" that probably should be worked through before additional promotion occurs. None of them impact my work but they might hit others. Nicholas' opinion on the matter is much more important than mine as he already has a non-D "day job" and would bear the brunt of a, possibly premature, promotion of dcompute.
Jan 13 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 21:39:07 UTC, Bruce Carneal wrote:
 1) Any resultant increase in support load would fall on one 
 volunteer (that is not me) and
Yes, that is not a good situation… The caveat is that if fewer people use dcompute, then fewer people will help out with it, and then it will take more time to reach a state where it is "ready"…

Showing how/when dcompute improves performance on standard desktop computers might make more people interested in participating. Are there some performance benchmarks on modest hardware? (e.g. a standard macbook, imac or mac mini) Benchmarks that compare dcompute to CPU with auto-vectorization (SIMD)?
Jan 13 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim Grøstad 
wrote:
 Are there some performance benchmarks on modest hardware? (e.g. 
 a standard macbook, imac or mac mini) Benchmarks that compares 
 dcompute to CPU with auto-vectorization (SIMD)?
Part of the difficulty with that is that it is an apples to oranges comparison. Also I no longer have hardware that can run dcompute, as my old windows box (with intel x86 and OpenCL 2.1 with an nvidia GPU) died some time ago.

Unfortunately Macs and dcompute don't work very well: CUDA requires nvidia, and OpenCL needs the ability to run SPIR-V (the clCreateProgramWithIL call), which requires OpenCL 2.x, which Apple do not support. Hence supporting Metal was of some interest. You might in theory be able to use PoCL or intel based OpenCL runtimes, but I don't have an intel mac anymore and I haven't tried PoCL.
Jan 13 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson wrote:
 On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim 
 Grøstad wrote:
 Are there some performance benchmarks on modest hardware? 
 (e.g. a standard macbook, imac or mac mini) Benchmarks that 
 compares dcompute to CPU with auto-vectorization (SIMD)?
Part of the difficulty with that, is that it is an apples to oranges comparison. Also I no longer have hardware that can run dcompute, as my old windows box (with intel x86 and OpenCL 2.1 with an nvidia GPU) died some time ago. Unfortunately Macs and dcompute don't work very well. CUDA requires nvidia, and OpenCL needs the ability to run SPIR-V (clCreateProgramWithIL call) which requires OpenCL 2.x which Apple do not support. Hence why supporting Metal was of some interest. You might in theory be able to use PoCL or intel based OpenCL runtimes but I don't have an intel mac anymore and I haven't tried PoCL.
*nods* For a long time we could expect "home computers" to be Intel/AMD, but then the computing environment changed and maybe Apple tries to make its own platform stand out as faster than it is by forcing developers to special case their code for Metal rather than going through a generic API.

I guess FPGAs will be available in entry level machines at some point as well. So, I understand that it will be a challenge to get *dcompute* to a "ready for the public" stage when there is no multi-person team behind it.

But I am not so sure about the apples and oranges aspect of it. The presentation by Bryce was quite explicitly focusing on making GPU computation available at the same level as CPU computations (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner, and languages that plan for this might be perceived as being much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.

It was also interesting to me that ISO C23 will provide custom bit width integers and that this would make it easier to efficiently compile C-code to tighter FPGA logic. I remember that LLVM used to have that in their IR, but I think it was taken out and limited to more conventional bit sizes?

It just shows that being a system-level programming language requires a lot of adaptability over time and frameworks like *dcompute* cannot ever be considered truly finished.
Jan 14 2022
next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson 
 wrote:
 On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim 
 Grøstad wrote:
... The presentation by Bryce was quite explicitly focusing on making GPU computation available at the same level as CPU computations (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner and languages that plan for this might be perceived as being much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.
Yes. Homogeneous memory accelerators, as found today in game consoles and SoCs, open up some nice possibilities. Scheduling could still be problematic with a centralized resource (unlike per-core SIMD). Distinct instruction formats (GPU vs CPU) also present a challenge to achieving an it-just-works "sans function pointers" level of integration. Surmountable, but a little work to do there. I'm hopeful that SoCs, with their relatively friendlier accelerator configurations, will be the economic enabler for widespread uptake of dcompute. World beating perf/watt from very readable code deployable on billions of units? I'm up for that!
Jan 14 2022
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:
 I'm hopeful that SoCs, with their relatively friendlier 
 accelerator configurations, will be the economic enabler for 
 widespread uptake of dcompute.
It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chrome-books for non-geek users, something like Steam Deck/Machine for gamers and some other SoC with builtin FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.
Jan 14 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 14 January 2022 at 17:38:36 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:
 I'm hopeful that SoCs, with their relatively friendlier 
 accelerator configurations, will be the economic enabler for 
 widespread uptake of dcompute.
It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chrome-books for non-geek users, something like Steam Deck/Machine for gamers and some other SoC with builtin FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.
Yes, I think the rollout of SoCs that you describe could very well occur. I hadn't even considered those! I was thinking of the accelerators in phone SoCs. Googling just now I saw an estimate of the number of "smart phones" world wide of over 6 billion. That seems a little high to me but the number of accelerator equipped phone SoCs is certainly in the billions with the number trending to saturation in line with the world's population. Anybody can hook into an accelerator library, and that will be fine for many apps, but with dcompute you'll have the ability to quickly go beyond the canned solutions when those are deficient. Lots of ways to win with dcompute.
Jan 14 2022
prev sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad 
wrote:
 **\*nods**\* For a long time we could expect "home computers" 
 to be Intel/AMD, but then the computing environment changed and 
 maybe Apple tries to make its own platform stand out as faster 
 than it is by forcing developers to special case their code for 
 Metal rather than going through a generic API.

 I guess FPGAs will be available in entry level machines at some 
 point as well. So, I understand that it will be a challenge to 
 get *dcompute* to a "ready for the public" stage when there is 
 no multi-person team behind it.
Maybe, though I suspect not for a while; then again, that could be wildly wrong. Anyway, I don't think they will be too difficult to support, provided the vendor in question provides an OpenCL implementation. The only thing to do is support `pipe`s.

As for manpower, the reason is that I don't have any particular personal need for dcompute these days. I am happy to do features for people that need something in particular (e.g. Vulkan compute shaders, textures), and PRs are welcome. Though if Bruce makes millions and gives me a job then that will obviously change ;)
 But I am not so sure about the apples and oranges aspect of it.
The apples to oranges comment was about doing benchmarks of CPU vs. GPU: there are so many factors that make performance comparisons (more) difficult. Is the GPU discrete? How important is latency vs. throughput? How "powerful" is the GPU compared to the CPU? How well suited to the task is the GPU? The list goes on. It's hard enough to do CPU benchmarks in an unbiased way.

If the intention is to say, "look at the speedup you can get for $TASK using $COMMON_HARDWARE", then yeah, that would be possible. It would certainly be possible to do a benchmark of, say, "ease of implementation with comparable performance" of dcompute vs CUDA, e.g. LoC, verbosity, brittleness etc., since the main advantage of D/dcompute (vs CUDA) is enumeration of kernel designs for performance. That would give a nice measurable goal to improve usability.
 The presentation by Bryce was quite explicitly focusing on 
 making GPU computation available at the same level as CPU 
 computations (sans function pointers). This should be possible 
 for homogeneous memory systems (GPU and CPU sharing the same 
 memory bus) in a rather transparent manner and languages that 
 plan for this might be perceived as being much more productive 
 and performant if/when this becomes reality. And C++23 isn't 
 far away, if they make the deadline.
Definitely. Homogenous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU without worrying about memory transfer across the PCI-e, something which CUDA can't take advantage of on account of nvidia GPUs being only discrete. I've no idea how caching works in a system like that though.
 It was also interesting to me that ISO C23 will provide custom 
 bit width integers and that this would make it easier to 
 efficiently compile C-code to tighter FPGA logic. I remember 
 that LLVM used to have that in their IR, but I think it was 
 taken out and limited to more conventional bit sizes?
Arbitrary precision integers are still a part of LLVM, and I presume LLVM IR. The problem with that is, like with address-spaced pointers, D has no way to declare such types. I seem to remember Luís Marqeus doing something crazy like that (maybe in a dconf presentation?), compiling D to Verilog.
 It just  shows that being a system-level programming language 
 requires a lot of adaptability over time and frameworks like 
 *dcompute* cannot ever be considered truly finished.
Of course.
Jan 14 2022
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability to 
 make GPUs do the things GPUs are good at and leave the rest to 
 the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
Jan 15 2022
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
 On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
 wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the rest 
 to the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
There is still a PCI-e in between. Fundamentally the memory must exist in either the CPU's RAM or the GPU's (V)RAM; from what I understand, unified memory allows the GPU to access the host RAM with the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e.
Jan 15 2022
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
wrote:
 from what I understand unified memory allows the GPU to access 
 the host RAM with the same pointer. This reduces the total 
 memory consumed by the program, but to get to the GPU the data 
 must still cross the PCI-e.
Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.
Jan 15 2022
parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 10:35:29 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
 wrote:
 from what I understand unified memory allows the GPU to access 
 the host RAM with the same pointer. This reduces the total 
 memory consumed by the program, but to get to the GPU the data 
 must still cross the PCI-e.
Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.
Exactly++. Pinned buffers + async HW copies always won out for me. I imagine there could be scenarios where programmatic peeking/poking from either side wins but I've not seen them, probably because if your data flows are small enough for that to win you'd just fire up SIMD and call it a day.
Jan 15 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
wrote:
 On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
 On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
 wrote:
 ....

 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete. I've no idea how 
 cacheing work in a system like that though.
 ...
How is this different from unified memory? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
there still a PCI-e in-between. Fundamentally the memory must exist in either the CPUs RAM or the GPUs (V)RAM, from what I understand unified memory allows the GPU to access the host RAM with the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e.
Yes. You also gain some simplification, from unified memory, if your data structures are pointer heavy. I've tried to gain advantage from GPU-side pulls across the bus in the past but could never win out over explicit async copying utilizing dedicated copy circuitry. Others, particularly those with high compute-to-load/store ratios, may have had better luck. For reference, I've only been able to get a little over 80% of the advertised PCI-e peak bandwidth out of the dedicated Nvidia copy HW.
Jan 15 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
wrote:
 As for manpower, the reason is I don't have any personal 
 particular need for dcompute these days. I am happy to do 
 features for people that need something in particular, e.g. 
 Vulkan compute shader, textures, and PR are welcome. Though if 
 Bruce makes millions and gives me a job then that will 
 obviously change ;)
He can put me on the application list as well… This sounds like lots of fun!!!
 important is latency vs. throughput? How "powerful" is the GPU 
 compared to the CPU?How well suited to the task is the GPU? The 
 list goes on. Its hard enough to do CPU benchmarks in an 
 unbiased way.
I don't think people would expect benchmarks to be unbiased. It could be 3-4 short benchmarks, some showcasing where it is beneficial, some showcasing where data dependencies (or other challenges) make it less suitable. E.g.:

1. compute autocorrelation over many different lags
2. multiply and take the square root of two long arrays
3. compute a simple IIR filter (I assume a recursive filter would be a worst case?)
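To make number 2 concrete, here is a minimal sketch (my own illustration, nothing dcompute-specific, names are just for show) of what the CPU baseline could look like in plain D with std.parallelism; the inner operation is simple enough for the auto-vectorizer, and the dcompute version would then be measured against this:

```d
import std.math : sqrt;
import std.parallelism : parallel;

// Benchmark 2 baseline: elementwise multiply + sqrt over two long arrays.
// Multi-core via std.parallelism; the per-element work is simple enough
// that the compiler can auto-vectorize the inner loop.
void mulSqrt(const(float)[] a, const(float)[] b, float[] res)
{
    assert(a.length == b.length && b.length == res.length);
    foreach (i, ref r; parallel(res))
        r = sqrt(a[i] * b[i]);
}
```

Whether the GPU version includes transfer time or not should be stated explicitly, since that alone can flip the result.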
 If the intention is to say, "look at the speedup you can for 
 for $TASK using $COMMON_HARDWARE" then yeah, that would be 
 possible. It would certainly be possible to do a benchmark of, 
 say, "ease of implementation with comparable performance" of 
 dcopmute vs CUDA, e.g. LoC, verbosity, brittleness etc., since 
 the main advantage of D/dcompute (vs CUDA) is enumeration of 
 kernel designs for performance. That would give a nice 
 measurable goal to improve usability.
Yes, but I think of it as an inspiration with a tutorial of how to get the benchmarks to run. For instance, like you, I have no need for this at the moment and my current computer isn't really a good showcase of GPU computation either, but I have one long term hobby project where I might use GPU-computations eventually.

I suspect many think of GPU computations as something requiring a significant amount of time to get into. Even though they may be interested, that threshold alone is enough to put it in the "interesting, but I'll look at it later" box. If you can tease people into playing with it for fun, then I think there is a larger chance of them using it at a later stage (or even thinking about the possibility of using it) when they see a need in some heavy computational problem they are working on. There is a lower threshold to get started with something new if you already have a tiny toy-project you can cut and paste from that you have written yourself.

Also, updated benchmarks could generate new interest on the announce forum thread. Lurking forum readers probably only read them on occasion, so you have to make several posts to make people aware of it.
 Definitely. Homogenous memory is interesting for the ability to 
 make GPUs do the things GPUs are good at and leave the rest to 
 the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete. I've no idea how cacheing 
 work in a system like that though.
I don't know, but Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip, at least that is what I've read. Maybe there will be more technical info available on how that works at the hardware level later, or maybe it is already on AMD's website? If someone reading this thread has more info on this, it would be nice if they would share what they have found out! :-)
Jan 15 2022
next sibling parent Paulo Pinto <pjmlp progtools.org> writes:
On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad 
wrote:
 ...

 I don't know, but Steam Deck, which appears to come out next 
 month, seems to run under Linux and has an "AMD APU" with a 
 modern GPU and CPU integrated on the same chip, at least that 
 is what I've read. Maybe there will be more technical info 
 available on how that works at the hardware level later, or 
 maybe it is already on AMDs website?

 If someone reading this thread has more info on this, it would 
 be nice if they would share what they have found out! :-)
According to the public documentation you can expect it to be similar to AMD Ryzen 7 3750H, with Radeon RX Vega 10 Graphics (16 GB). https://partner.steamgames.com/doc/steamdeck/testing#5
Jan 15 2022
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad 
wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the rest 
 to the CPU without worrying about memory transfer across the 
 PCI-e. Something which CUDA can't take advantage of on account 
 of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
Jan 15 2022
next sibling parent max haughton <maxhaton gmail.com> writes:
On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim 
 Grøstad wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
Well, console memory systems are basically built around this idea. On the assumption that you mean a consumer chip with integrated graphics, any gain you see from sharing memory is going to be contrasted against the chip being intended for people who were going to actually use integrated graphics. For compute especially it seems like this is very dependent on what access patterns you actually want to use on the memory.

The new Apple chips have a unified memory architecture, and a really fast one too. I don't know what GPGPU is like on it but it's one of the reasons why it absolutely flies on normal code.
Jan 15 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat 
wrote:
 On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim 
 Grøstad wrote:
 Definitely. Homogenous memory is interesting for the ability 
 to make GPUs do the things GPUs are good at and leave the 
 rest to the CPU without worrying about memory transfer across 
 the PCI-e. Something which CUDA can't take advantage of on 
 account of nvidia GPUs being only discrete.
Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip
Related: has anyone here seen an actual measured performance gain from co-located CPU and GPU on the same chip? I used to test with OpenCL + Intel SoC and again, it was underwhelming and not faster. I'd be happy to know about other experiences.
The link below on the vkpolybench software includes graphs for integrated GPUs, among others, and shows significant (more than SIMD width) speedups wrt a single CPU core for many of the benchmarks, but also break-even or worse on a few. Reports on real world experiences with the integrated accelerators would be better.

https://github.com/ElsevierSoftwareX/SOFTX_2020_86

On paper, at least, it looks like SoC GPU performance will be severely impacted by the working set size, but who isn't? Currently it also looks like the dcompute/SoC-GPU version will beat out my SIMD variant, but it'll be at least a few months before I have hard data to share. Anyone out there have real world data now?
Jan 15 2022
prev sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 21:06:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 I know, right?  Ridiculously big opportunity/effort ratio for 
 dlang and near zero awareness...
If dcompute is here to stay, why not put it in the official documentation for D as an "optional" part of the spec? I honestly assumed that it was unsupported and close to dead as I had not heard much about it for a long time.
I suppose that's my fault for not marketing more. The code generation is tested in LDC's CI pipelines so that is unlikely to break, and the library is built on slow moving APIs that are also unlikely to break. Just because it doesn't get a lot of commits doesn't mean it's going to stop working.

As for specification, I think that would be a wasted effort and too constraining. On the compiler side, it is mostly using the existing LDC infrastructure with (more than) a few hacks to get everything to stick together, and it is too heavily dependent on LDC and LLVM internals to be part of the D spec. On the runtime side of it, I fear specification would either be too constraining or end up out of sync with the implementation.
Jan 13 2022
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal wrote:
 Ethan might have a sufficiently compelling economic case for 
 promoting dcompute to his company in the relatively near 
 future. Nicholas recently addressed their need for access to 
 the texture hardware and fitting within their work flow, but 
 there may be other requirements...  An adoption by a world 
 class game studio would, of course, be very good news but I 
 think Ethan is slammed (perpetually, and in a mostly good way, 
 I think) so it might be a while.
As a former GPGPU guy: can you explain in what ways dcompute improves life over using CUDA and OpenCL through DerelictCL/DerelictCUDA (I used to maintain them and I think nobody ever used them)? Using the API directly seems to offer the most control to me, with no special compiler support needed.
Jan 13 2022
next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 As a former GPGPU guy: can you explain in what ways dcompute 
 improves life over using CUDA and OpenCL through 
 DerelictCL/DerelictCUDA (I used to maintain them and I think 
 nobody ever used them). Using the API directly seems to offer 
 the most control to me, and no special compiler support.
It is entirely possible to use dcompute as simply a wrapper over OpenCL/CUDA and benefit from the enhanced usability that it offers (e.g. querying OpenCL API objects for their properties is _faaaar_ simpler and less error prone with dcompute), because it exposes the underlying API objects 1:1, and you can always get the raw pointer and do things manually if you need to. Also dcompute uses DerelictCL/DerelictCUDA underneath anyway (thanks for them!).

If you're thinking of "special compiler support" as what CUDA does with its <<<>>>, then no: dcompute does all of that, but not with special help from the compiler, only with the meta programming and reflection available to any other D program. It's D all the way down to the API calls. Obviously there is special compiler support to turn D code into compute kernels.

The main benefit of dcompute is turning kernel launches into type safe one-liners, as opposed to brittle, type unsafe paragraphs of code.
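Roughly, and from memory (so treat the exact module paths and names here as approximate rather than gospel), a kernel and its launch look something like this:

```d
// Device side: an ordinary D module compiled for the GPU by LDC.
// (Names as I remember them from the dcompute examples; check the repo
// for the current spelling.)
@compute(CompileFor.deviceOnly) module kernels;

import ldc.dcompute;
import dcompute.std.index;

@kernel void saxpy(GlobalPointer!float res,
                   float alpha,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   size_t n)
{
    auto i = GlobalIndex.x;
    if (i >= n) return;
    res[i] = alpha * x[i] + y[i];
}

// Host side: the launch is a single type-checked line, so a changed kernel
// signature is a compile error at every call site instead of a runtime surprise:
//
//     q.enqueue!saxpy([n])(bufRes, alpha, bufX, bufY, n);
```

Contrast that with keeping a kernel string, clSetKernelArg calls and argument indices in sync by hand.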
Jan 13 2022
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Friday, 14 January 2022 at 00:56:32 UTC, Nicholas Wilson wrote:
 If you're thinking of "special compiler support" as what CUDA 
 does with its <<<>>>, then no, dcompute does all of that, but 
 not with special help from the compiler, only with what meta 
 programming and reflection is available to any other D program.
 It's D all the way down to the API calls. Obviously there is 
 special compiler support to turn D code into compute kernels.

 The main benefit of dcompute is turning kernel launches into 
 type safe one-liners, as opposed to brittle, type unsafe, 
 paragraphs of code.
Sounds indeed less brittle than a separate language. In my time with CUDA I never got to use <<<>>>. In OpenCL you'd have to templatize the string kernels quite quickly, and with CUDA you'd have to also make lots of entry points. Plus all the import problems, so I can see how it's better with LDC intrinsics.
Jan 14 2022
parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 09:39:58 UTC, Guillaume Piolat 
wrote:
 The main benefit of dcompute is turning kernel launches into 
 type safe one-liners, as opposed to brittle, type unsafe, 
 paragraphs of code.
Sound indeed less brittle than separate langage. In my time in CUDA I never got to use <<<>>>.
Pity, the <<<>>> is actually quite nice, and not at all brittle, but it is CUDA C/C++ (and maybe Fortran?) only, AMD's attempts at HIP notwithstanding. The main thing that makes it brittle is that if you change the signature of the kernel then you need to remember to change wherever it is invoked, and the compiler will not tell you that you forgot something.
 In OpenCL you'd have to templatize the string kernels quite 
 quickly, and with CUDA you'd have to also make lots of entry 
 points. Plus all the import problems, so I can see how it's 
 better with LDC intrinsics.
I'm not quite sure what you mean here.
Jan 14 2022
prev sibling next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 On Thursday, 13 January 2022 at 20:38:19 UTC, Bruce Carneal 
 wrote:
 Ethan might have a sufficiently compelling economic case for 
 promoting dcompute to his company in the relatively near 
 future. Nicholas recently addressed their need for access to 
 the texture hardware and fitting within their work flow, but 
 there may be other requirements...  An adoption by a world 
 class game studio would, of course, be very good news but I 
 think Ethan is slammed (perpetually, and in a mostly good way, 
 I think) so it might be a while.
As a former GPGPU guy: can you explain in what ways dcompute improves life over using CUDA and OpenCL through DerelictCL/DerelictCUDA (I used to maintain them and I think nobody ever used them). Using the API directly seems to offer the most control to me, and no special compiler support.
For me there were several things, including:

1) the dcompute kernel invocation was simpler, made more sense, letting me easily create invocation abstractions to my liking (partially memoized futures in my case, but easy enough to do other stuff)

2) the kernel meta programming was much friendlier generally, of course.

3) the D nested function capability, in conjunction with better meta programming, enabled great decomposition, intra kernel. You could get the compiler to keep everything within the maximum-dispatch register limit (64) with ease, with readable code.

4) using the above I found it easy to reduce/minimize memory traffic, an important consideration in that much of my current work is memory bound. Trivial example: use static foreach to logically unroll a window neighborhood algorithm, eliminating both unnecessary loads and all extraneous reg-to-reg moves as you naturally mod around (a rough sketch of the idea follows below).

It's not that you can't do such things in CUDA/C++, eventually, sometimes, after quite a bit of discomfort, once you acquire your level-bazillion C++ meta programming merit badge, it's that it's all so much *easier* to do in dcompute. You get to save the heroics for something else. I'm sure that new idioms/benefits will emerge with additional use (this was my first dcompute project) but, as you will have noticed :-), I'm already hooked.

WRT OpenCL I don't have much to say. From what I gather people consider OpenCL to be even less hospitable than CUDA, preferring OpenCL mostly (only?) for its non-proprietary status. I'd be interested to hear from OpenCL gurus on this topic.

Finally, if any of the above doesn't make sense, or you'd like to discuss it further, I suggest we meet up at beerconf. I'd also love to talk about data parallel latency sensitive coding strategies, about how we should deal with HW capability variation, about how we can introduce data parallelism to many more in the dlang community, ...
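As a concrete, if toy, illustration of the static foreach point: ordinary host-side D, entirely my own sketch and not an actual dcompute kernel (device indexing and pointer types differ), that keeps a 3-wide window in locals, loads each input element exactly once, and rotates the window with a compile-time unrolled shift:

```d
// Sliding 3-tap box filter with an explicitly register-resident window.
// static foreach unrolls the rotate at compile time, so there are no
// extra loads and no leftover reg-to-reg loop for the optimizer to clean up.
void box3(const(float)[] src, float[] dst)
{
    assert(src.length >= 2 && dst.length == src.length);
    enum W = 3;
    float[W] win = [src[0], src[0], src[1]];     // clamped left edge
    foreach (i; 0 .. src.length)
    {
        dst[i] = (win[0] + win[1] + win[2]) / W;
        static foreach (k; 0 .. W - 1)           // unrolled window rotate
            win[k] = win[k + 1];
        win[W - 1] = (i + 2 < src.length) ? src[i + 2] : src[$ - 1];
    }
}
```

Inside a real kernel the same shape applies per work-item along its stride; the win is that the whole shuffle is visible to the compiler, so it has every chance of staying in registers.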
Jan 13 2022
parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 14 January 2022 at 01:37:29 UTC, Bruce Carneal wrote:
 WRT OpenCL I don't have much to say.  From what I gather people 
 consider OpenCL to be even less hospitable than CUDA, 
 preferring OpenCL mostly (only?) for its non-proprietary 
 status.  I'd be interested to hear from OpenCL gurus on this 
 topic.
Not that I'm an OpenCL guru by any stretch of the imagination, but yes, OpenCL as a base API is much less nice than even the CUDA driver APIs. The foundation is solid, though, and you can abstract and prettify it with D to a level of usability that is at least on par with (and imo exceeds) CUDA's runtime API (the one with the <<<>>>'s) with D kernels. That is to say, the selling point for dcompute vs. OpenCL is that you get an API that is just as easy as CUDA's (w.r.t. type safety and tedium) and you get to write your kernels in D, whereas for dcompute vs. CUDA it is _just_ that you get to write your kernels in D (and the API is not any worse).
Jan 13 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat 
wrote:
 As a former GPGPU guy: can you explain in what ways dcompute 
 improves life over using CUDA and OpenCL through 
 DerelictCL/DerelictCUDA (I used to maintain them and I think 
 nobody ever used them). Using the API directly seems to offer 
 the most control to me, and no special compiler support.
Forgot to respond to this. This probably does not apply to *dcompute*, but Bryce pointed out in his presentation how you could step through your "GPU code" on the CPU using a regular debugger since the parallel code was regular C++. Not exactly sure how that works, but I would imagine that they provide functions that match the GPU? That sounds like a massive productivity advantage to me if you want to write complicated "shaders".
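Something in that spirit is easy to picture (purely my own sketch, not how nvc++ actually does it): keep the per-element body as an ordinary function, debug it from a plain CPU loop, and only then hand the same body to the parallel/GPU path:

```d
import std.math : sqrt;

// The per-element "kernel body" as an ordinary function: trivially
// steppable on the CPU with any debugger.
float kernelBody(float a, float b)
{
    return sqrt(a * b);
}

// Debug path: a plain loop, so breakpoints and watch windows work as usual.
void runOnCpu(const(float)[] a, const(float)[] b, float[] res)
{
    foreach (i; 0 .. res.length)
        res[i] = kernelBody(a[i], b[i]);
}

// The accelerated path would call the same body from a parallel loop or a
// GPU kernel wrapper; only the driver code changes, not the logic.
```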
Jan 15 2022
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 January 2022 at 07:46:32 UTC, Paulo Pinto wrote:
 I think the ship has already sailed, given the industry 
 standards of SYSCL and C++ for OpenCL, and their integration 
 into clang (check the CppCon talks on the same) and FPGA 
 generation.
The SYSCL/FPGA presentation was interesting, but he said it should be considered a research project at this point? I am a bit wary of all the solutions that are coming from Khronos. It is difficult to say what becomes prevalent across many platforms. Both Microsoft and Apple have undermined open standards such as OpenGL in their desire to lock developers into their own "monopolistic ecosystem" … So, a focused language solution of limited scope might actually be better for developers than (big) open standards.
 There was a time to try overthrow C++, that was 10 years ago, 
 LLVM was hardly relevant and GPGPU computing still wasn't 
 mainstream.
Overthrowing C++ isn't possible, but D could focus more on desktop application development and provide a framework for it. Then you need to have a set of features/modules/libraries in place in a way that fits well together. GPU-compute would be one of those I think.
Jan 13 2022