
digitalmars.D - std.parallel_algorithm

reply dsimcha <dsimcha yahoo.com> writes:
Here's some very early work in progress on std.parallel_algorithm:

https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d

I haven't compiled the docs yet because it's too much of a WIP and
doesn't even work without a few modifications I made to std.parallelism.
So far I've got parallel merge, sort, and dot product.

One issue is that most of these algorithms require pretty fine grained 
parallelism and the break even point depends on a lot of things, like 
the details of the hardware.  How should we go about documenting/solving 
this?  Is it reasonable to always just punt the issue of finding the 
break even point and whether one is above or below it to the user of the 
library?  If not, what is a reasonable solution?
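To make the punting option concrete, here is a rough sketch of the shape such an API could take: fall back to the serial std.numeric.dotProduct below a caller-supplied cutoff, and chunk the work across the task pool above it. The wrapper name and the 50,000-element default are made up for illustration; they are not from parallel_algorithm.d.

```d
import std.algorithm.comparison : max, min;
import std.algorithm.iteration : map;
import std.numeric : dotProduct;
import std.parallelism : taskPool;
import std.range : iota;

/// Hypothetical wrapper: the break-even decision is punted to the caller
/// via `cutoff`; below it we just call the serial dotProduct.
double maybeParallelDot(const(double)[] a, const(double)[] b,
                        size_t cutoff = 50_000)
{
    assert(a.length == b.length);
    if (a.length < cutoff)
        return dotProduct(a, b);   // below break-even: serial wins

    // Roughly one chunk per worker thread; reduce the partial dot
    // products of the chunks in parallel.
    immutable chunk = max(size_t(1),
                          (a.length + taskPool.size) / (taskPool.size + 1));
    auto partials = iota(size_t(0), a.length, chunk).map!(
        i => dotProduct(a[i .. min(i + chunk, a.length)],
                        b[i .. min(i + chunk, a.length)]));
    return taskPool.reduce!"a + b"(0.0, partials);
}
```

A documented default like this at least makes the hardware dependence explicit: the user who cares can benchmark and override it, and everyone else gets a conservative serial fallback.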
May 22 2011
next sibling parent reply Russel Winder <russel russel.org.uk> writes:
David,

On Mon, 2011-05-23 at 00:00 -0400, dsimcha wrote:
 Here's some very early work in progress on std.parallel_algorithm:

 https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d

 I haven't compiled the docs yet because it's too much of a WIP and
 doesn't even work without a few modifications I made to std.parallelism.
 So far I've got parallel merge, sort and dot product.

 One issue is that most of these algorithms require pretty fine grained
 parallelism and the break even point depends on a lot of things, like
 the details of the hardware.  How should we go about documenting/solving
 this?  Is it reasonable to always just punt the issue of finding the
 break even point and whether one is above or below it to the user of the
 library?  If not, what is a reasonable solution?
Does this require a version of std.parallelism that is not in the
standard Phobos?

+ dmd -m32 -lib std/parallel_algorithm.d
std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member
defaultWorkUnitSize is not accessible
+ dmd -m64 -lib std/parallel_algorithm.d
std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member
defaultWorkUnitSize is not accessible

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
May 22 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 5/23/2011 1:23 AM, Russel Winder wrote:
 David,

 On Mon, 2011-05-23 at 00:00 -0400, dsimcha wrote:
 Here's some very early work in progress on std.parallel_algorithm:

 https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d

 I haven't compiled the docs yet because it's too much of a WIP and
 doesn't even work without a few modifications I made to std.parallelism.
    So far I've got parallel merge, sort and dot product.

 One issue is that most of these algorithms require pretty fine grained
 parallelism and the break even point depends on a lot of things, like
 the details of the hardware.  How should we go about documenting/solving
 this?  Is it reasonable to always just punt the issue of finding the
 break even point and whether one is above or below it to the user of the
 library?  If not, what is a reasonable solution?
 Does this require a version of std.parallelism that is not in the
 standard Phobos?

 + dmd -m32 -lib std/parallel_algorithm.d
 std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member
 defaultWorkUnitSize is not accessible
 + dmd -m64 -lib std/parallel_algorithm.d
 std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member
 defaultWorkUnitSize is not accessible
Yeah, I meant to mention that when I was talking about how much of a work
in progress it is.  I didn't expect anyone to actually want to test it
this soon.  You need the slightly modified one at:

https://github.com/dsimcha/phobos/blob/master/std/parallelism.d
May 23 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-22 21:00, dsimcha wrote:
 Here's some very early work in progress on std.parallel_algorithm:
 
 https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d
 
 I haven't compiled the docs yet because it's too much of a WIP and
 doesn't even work without a few modifications I made to std.parallelism.
   So far I've got parallel merge, sort and dot product.
 
 One issue is that most of these algorithms require pretty fine grained
 parallelism and the break even point depends on a lot of things, like
 the details of the hardware.  How should we go about documenting/solving
 this?  Is it reasonable to always just punt the issue of finding the
 break even point and whether one is above or below it to the user of the
 library?  If not, what is a reasonable solution?
Well, unless you intend to try to have functions that choose whether
they're parallel or not at runtime based on what they're given, I'd fully
expect the choice between the parallel and non-parallel versions to be up
to the programmer using them, and in that case it's completely up to them
to determine which is more efficient given their particular use case.

At best, you could give some sort of information on what you expect the
break-even point to be, and if it's highly system dependent (as it sounds
like it is), then I don't see how you could do even that.  So, really,
I'd expect it to be completely up to the programmer.

- Jonathan M Davis
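Measuring the break-even point on a given machine can be as simple as timing a serial and a parallel version of the same reduction side by side and reading off where the parallel one starts to win. A rough sketch (the function name, workload, and sizes are illustrative, not from the library):

```d
import std.algorithm.iteration : map;
import std.datetime.stopwatch : benchmark;
import std.math : sqrt;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;
import std.typecons : tuple;

/// Time a serial and a parallel sum-of-square-roots over the first n
/// integers; returns (serial result, parallel result, serial time,
/// parallel time) so the caller can compare them.
auto compareAt(int n)
{
    double serialSum = 0, parallelSum = 0;
    auto times = benchmark!(
        () { serialSum = 0;
             foreach (i; 0 .. n) serialSum += sqrt(double(i)); },
        () { parallelSum = taskPool.reduce!"a + b"(
                 0.0, map!(i => sqrt(double(i)))(iota(n))); }
    )(5);
    return tuple(serialSum, parallelSum, times[0], times[1]);
}

void main()
{
    // On a small n the serial version should win; on a large n the
    // parallel one should.  Where they cross is this machine's
    // break-even point for this particular workload.
    foreach (n; [1_000, 100_000, 10_000_000])
    {
        auto r = compareAt(n);
        writefln("n=%10s  serial: %s  parallel: %s", n, r[2], r[3]);
    }
}
```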
May 22 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
David, do you have some automated way of adding new modules to the
makefile? It's a bit of a drag to manually do this. ;)

Also, make sucks. I made a typo at line 429 and it reported an error
at line 542. Gah!
May 22 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Russell, get parallelism.d from here:
https://github.com/dsimcha/phobos/tree/master/std
May 22 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Anyway here's some results on a quad-core AMD:

https://gist.github.com/986322

Was too lazy to calculate an average out of N runs. :)
May 22 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:
On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
 Russell, get parallelism.d from here:
 https://github.com/dsimcha/phobos/tree/master/std
Too many ls in there but I'll assume it's me ;-)

So the implication is that std.parallelism has changed since the 2.053
release and that std.parallel_algorithm relies on this change?

--
Russel.
May 23 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-23 02:07, Russel Winder wrote:
 On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
 Russell, get parallelism.d from here:
 https://github.com/dsimcha/phobos/tree/master/std
 Too many ls in there but I'll assume it's me ;-)

 So the implication is that std.parallelism has changed since the 2.053
 release and that std.parallel_algorithm relies on this change?
That's what David said in his initial post. He had to make additional
changes to std.parallelism for std.parallel_algorithm to work.

- Jonathan M Davis
May 23 2011
prev sibling parent reply Russel Winder <russel russel.org.uk> writes:
Jonathan,

On Mon, 2011-05-23 at 02:15 -0700, Jonathan M Davis wrote:
 On 2011-05-23 02:07, Russel Winder wrote:
 On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
 Russell, get parallelism.d from here:
 https://github.com/dsimcha/phobos/tree/master/std
 Too many ls in there but I'll assume it's me ;-)

 So the implication is that std.parallelism has changed since the 2.053
 release and that std.parallel_algorithm relies on this change?
 That's what David said in his initial post. He had to make additional
 changes to std.parallelism for std.parallel_algorithm to work.
Ah, thanks for the reminder.  That'll teach me to skim read, get the
URL, act, and fail to actually note important information.

--
Russel.
May 23 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 5/23/2011 6:10 AM, Russel Winder wrote:
 Jonathan,

 On Mon, 2011-05-23 at 02:15 -0700, Jonathan M Davis wrote:
 On 2011-05-23 02:07, Russel Winder wrote:
 On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
 Russell, get parallelism.d from here:
 https://github.com/dsimcha/phobos/tree/master/std
 Too many ls in there but I'll assume it's me ;-)

 So the implication is that std.parallelism has changed since the 2.053
 release and that std.parallel_algorithm relies on this change?
 That's what David said in his initial post. He had to make additional
 changes to std.parallelism for std.parallel_algorithm to work.
 Ah, thanks for the reminder.  That'll teach me to skim read, get the
 URL, act, and fail to actually note important information.
Interesting.  Thanks.  Unfortunately it doesn't look like merge and
dotProduct scale to a quad core very well, because I tuned these to be
very close to the break-even point on a dual core.  This is part of my
concern:  In the worst case, stuff that's very close to break-even on a
dual or quad core can get **slower** when run on a 6 or 8 core.  This is
just a fact of life.

Eventually, I'd like to push the break-even point for std.parallelism a
little lower, and I know roughly how to do it (mainly by implementing
work stealing instead of a simple FIFO queue).  However, there are tons
of nagging little implementation issues, and there would still always be
a break-even point, even if it's lower.

Right now the break-even point seems to be on the order of a few
microseconds per task, which on modern hardware equates to ~10,000 clock
cycles.  This is good enough for a lot of practical purposes, but
obviously not nano-parallelism.  I've made improving this a fairly low
priority.  Given the implementation difficulties, the fact that it's
fairly easy to take full advantage of multicore hardware without getting
down to this granularity (usually, if you need more performance, you're
executing some expensive algorithm in a loop, and you can just
parallelize the outer loop and be done with it), and the fact that
there's plenty of other useful hacking jobs in the D ecosystem, the
benefits-to-effort ratio doesn't seem that good.
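The parallelize-the-outer-loop pattern looks something like this (a sketch; the function name and the sqrt workload are made up as a stand-in for real per-row work):

```d
import std.math : sqrt;
import std.parallelism : taskPool;

/// Each outer iteration does `cols` units of work, so with a large
/// enough inner loop every task sits far above the ~10,000-cycle
/// break-even point and scales cleanly with core count.
double[] expensiveRows(size_t rows, size_t cols)
{
    auto result = new double[](rows);
    // Parallel foreach over the outer loop; the inner loop stays serial.
    foreach (i, ref r; taskPool.parallel(result))
    {
        double sum = 0;
        foreach (j; 0 .. cols)   // stand-in for expensive per-row work
            sum += sqrt(double(i * cols + j));
        r = sum;
    }
    return result;
}
```

Because each element is written by exactly one task, no synchronization is needed beyond the implicit join at the end of the parallel foreach.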
May 23 2011