digitalmars.D - review of std.parallelism

Andrei Alexandrescu (91/91) Mar 18 2011 0. Overview and vote

dsimcha (43/62) Mar 18 2011 Thanks for the advice. You mentioned in the past that the documentation...

Simen kjaeraas (5/15) Mar 19 2011 Please add an example showing that, too. Sure, the documentation says

dsimcha (4/18) Mar 19 2011 I don't understand how this can be demonstrated in an example. It's an

Simen kjaeraas (5/24) Mar 19 2011 Yeah, scratch that. I for some reason thought the "array of size

Andrei Alexandrescu (16/63) Mar 19 2011 Got it.

dsimcha (26/55) Mar 19 2011 I've already made it private. I can't see what purpose having it public...

Andrei Alexandrescu (5/13) Mar 19 2011 A great way to handle bunches of almost-identical overloads is to group

dsimcha (5/96) Mar 19 2011 Ok, thanks again for clarifying **how** the docs could be improved.

Andrei Alexandrescu (129/133) Mar 19 2011 * When using "parallelism" as a common noun, prefix is with a '_' so

Michel Fortin (26/34) Mar 19 2011 Actually, I think this is a bad description of what it subverts. What

dsimcha (13/43) Mar 19 2011 If you want pedal-to-metal parallelism without insane levels of

Michel Fortin (36/66) Mar 19 2011 I'm not sure where my proposal asks for more verbosity or less

Michel Fortin (12/31) Mar 19 2011 ... or maybe I just totally misunderstood asyncBuf. Rereading the

dsimcha (3/32) Mar 19 2011 Right. And this is pretty obviously a race. The other example (without...

dsimcha (44/77) Mar 20 2011 Ok, I've had some time to think about this. The following example is

Michel Fortin (36/127) Mar 20 2011 I don't see a problem with the above. The array elements you modify are

dsimcha (46/105) Mar 20 2011 Ok, I completely agree in principle, though I question whether it's

Michel Fortin (46/149) Mar 21 2011 Well, it'll work irrespective of whether shared delegates are used or

dsimcha (36/113) Mar 21 2011 On second thought, no, but for practical, not theoretical reasons: One,...

Michel Fortin (41/61) Mar 21 2011 Indeed a problem. Either we fix the compiler to support that, or we

dsimcha (24/68) Mar 21 2011 This problem is **much** less severe than you are suggesting. x86 can a...

dsimcha (4/42) Mar 21 2011 Actually, just remembered that word tearing is also an issue with unalig...

dsimcha (13/55) Mar 21 2011 s/is/may be . I'm trying to read up on word tearing and post questions

bearophile (4/7) Mar 19 2011 Please, don't give up. It's a lot of time from the first time I've asked...

Robert Jacques (4/34) Mar 20 2011 Technically, you'd only need a shared delegate or a const delegate to

dsimcha (93/172) Mar 19 2011 I don't see what I could put here that isn't totally redundant with the

Michel Fortin (7/26) Mar 19 2011 I think an improvement over undefined behaviour would be to throw an exc...

dsimcha (12/35) Mar 19 2011 The only problem is that there's no easy, well-documented way to tell

dsimcha (6/12) Mar 19 2011 ...Though now that you make me think of it I need to support exception
Michel Fortin (36/75) Mar 19 2011 It's not that silly.

dsimcha (14/40) Mar 19 2011 Using continue is well defined behavior and does exactly what I think

Daniel Gibson (7/10) Mar 19 2011 What about some kind of parallel search? You just want to know if someth...

dsimcha (7/17) Mar 19 2011 It's an interesting suggestion, and I'll give some thought to it, but on...

Andrei Alexandrescu (120/281) Mar 19 2011 I'm thinking along the lines of:

dsimcha (30/87) Mar 19 2011 Fair enough. Now that I think of it most of my frustration is that

Andrei Alexandrescu (11/37) Mar 19 2011 Exactly. I understand. Well here's a suggestion. How about we "git

dsimcha (2/27) Mar 19 2011 Sounds like a plan.
Lars T. Kyllingstad (3/31) Mar 20 2011 Sounds good. I'll make the announcement.

dsimcha (9/15) Mar 19 2011 In addition the the deadline issues already mentioned and resolved, I

Jonathan M Davis (5/21) Mar 19 2011 Both std.datetime and std.unittests underwent a fair number of changes o...
Jonathan M Davis (5/21) Mar 19 2011 Both std.datetime and std.unittests underwent a fair number of changes o...

dsimcha (3/24) Mar 19 2011 Please check your newsreader settings. You've been double-posting a lot...

dsimcha (96/215) Mar 19 2011 This is definitely feasible in principle. I'd like to implement it, but...

dsimcha (10/30) Mar 20 2011 I tried putting a few benchmarks in for the performance of parallel

dsimcha (4/11) Mar 19 2011 I did this for consistency with all the other functions, where work unit...
Jonathan M Davis (3/12) Mar 19 2011 Oooh. That's a neat trick. I didn't know that you could do that.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

0. Overview and vote

I think the library delivers the high-level goods (parallel foreach, 
map, reduce) but is a bit fuzzy in the lower level details.

Documentation hurts understanding of the capabilities of the library and 
essentially is of inadequate quality. Entity documentation and examples 
do little more than give the impression of dispensing with an unpleasant 
chore.

My vote in favor of acceptance is contingent upon a radical improvement 
in the quality of documentation and examples. Most if not all artifacts 
should be motivated by simple, compelling examples. The introductory 
section must contain a brief and attractive synopsis of the flagship 
capabilities. All terms must be defined before being used and introduced 
in a carefully-chosen order. The relationship between various entities 
should be clarified.

I've seen the argument that simple and strong examples are difficult to 
find. Though I agree such examples are not easy to come up with, I also 
believe the author of library is in the best position to produce them.

================

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still 
usable but buffers lazily to an array..." Wouldn't strided processing 
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on 
1, 5, 9, ... and so on.

* Example with squares would be better if double replaced uint, and if a 
more complicated operation (sqrt, log...) were involved.

* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious 
method should be better: just use one circular buffer. The presence of 
two dependent parameters makes this abstraction difficult to operate with.

* Same question about asyncBuf. What is wrong with a circular buffer 
filled on one side by threads and on the consumed from the other by the 
client? I can think of a couple of answers but it would be great if they 
were part of the documentation.

* Why not make workerIndex a ulong and be done with it?

* Is stop() really trusted or just unsafe? If it's forcibly killing 
threads then its unsafe.

* uint size() should be size_t for conformity.

* defaultPoolThreads - should it be a  property?

* I don't understand what Task is supposed to do. It is circularly 
defined: "A struct that encapsulates the information about a task, 
including its current status, what pool it was submitted to, and its 
arguments." OK, but what's a task? Could a task be used outside a task 
pool, and if so to what effect(s)?

* If LazyMap is only necessary as the result of lazyMap, could that 
become a local type inside lazyMap?


2. Documentation:

* Documentation unacceptable. It lacks precision and uses terms before 
defining them. For example: "This class encapsulates a task queue and a 
set of worker threads" comes before the notions of "task queue" and 
"worker thread" are defined. Clearly there is an intuition of what those 
are supposed to mean, but in practice each library lays down some fairly 
detailed definition of such.

* "Note: Initializing a pool with zero threads (as would happen in the 
case of a single-core CPU) is well-tested and does work." The absence of 
a bug should not be advertised in so many words. Simpler: "Note: 
Single-CPU machines will operate transparently with zero-sized pools."

* "Allows for custom pool size." I have no idea what pool size means.

* "// Do something interesting." Worst example ever. You are the creator 
of the library. If _you_ can't find a brief compelling example, who could?

* "Immediately after the range argument, an optional block size argument 
may be provided. If none is provided, the default block size is used. An 
optional buffer for returining the results may be provided as the last 
argument. If one is not provided, one will be automatically allocated. 
If one is provided, it must be the same length as the range." An example 
should be inserted after each of these sentences.

* "// Do something expensive with line." Better you do something 
expensive with line.

* "This is not the same thing as commutativity." I think this is obvious 
enough to be left out. The example of matrices (associative but not 
commutative) is nice though.

* "immutable n = 1000000000;" -> use underscores

* Would be great if the documentation examples included some rough 
experimental results, e.g. "3.8 times faster on a 4-core machine than 
the equivalent serial loop".

* No example for workerIndex and why it's useful.

* Is LocalWorker better than WorkerLocal? No, because it's not the 
worker that's local, it's the storage - which is missing from the name! 
WorkerLocalStorage? Also replace "create" with "make" or drop it 
entirely. The example doesn't tell me how I can use bufs. I suspect 
workerIndex has something to do with it but the example fails to divulge 
that relationship.

* No example for join(), which is quite a central primitive.

* No example for put(), which would be interesting to see.

* No example for task().

* No example for run().

* The example for Task uses myFuture without defining it. In case Task 
defines the promise/future idiom, the documentation misses a terrific 
opportunity of clarifying that.

* What is 'run' in the definition of safe task()?

* Why is AsyncBuf's doc far away from the function returning it? Same 
about WorkerLocal.


Andrei

Mar 18 2011

dsimcha <dsimcha yahoo.com> writes:

Thanks for the advice.  You mentioned in the past that the documentation 
was inadequate but didn't give enough specifics as to how until now.  As 
the author of the library, things seem obvious to me that don't seem 
obvious to anyone else, so I don't feel that I'm in a good position to 
judge the quality of the documentation and where it needs improvement. 
I plan to fix most of the issues you raised, but I've left comments for 
the few that I can't/won't fix or believe are based on misunderstandings 
below.

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1. 
Setting it to a larger size just causes more elements to be buffered, 
which may be more efficient in some cases.

 * I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
 method should be better: just use one circular buffer. The presence of
 two dependent parameters makes this abstraction difficult to operate with.

 * Same question about asyncBuf. What is wrong with a circular buffer
 filled on one side by threads and on the consumed from the other by the
 client? I can think of a couple of answers but it would be great if they
 were part of the documentation.

Are you really suggesting I give detailed rationales for implementation 
decisions in the documentation?  Anyhow, the two reasons for this choice 
are to avoid needing synchronization/atomic ops/etc. on every write to 
the buffer (which we would need since it can be read and written 
concurrently and we need to track whether we have space to write to) and 
because parallel map works best when it operates on relatively large 
buffers, resulting in minimal synchronization overhead per element. 
(Under the hood, the buffer is filled and then eager parallel map is 
called.)

 * Why not make workerIndex a ulong and be done with it?

I doubt anyone's really going to create anywhere near 4 billion TaskPool 
threads over the lifetime of a program.  Part of the point of TaskPool 
is recycling threads rather than paying the overhead of creating and 
destroying them.  Using a ulong on a 32-bit architecture would make 
worker-local storage substantially slower.  workerIndex is how 
worker-local storage works under the hood, so it needs to be fast.

 * No example for workerIndex and why it's useful.

It should just be private.  The fact that it's public is an artifact of 
when I was designing worker-local storage and didn't know how it was 
going to work yet.  I never thought to revisit this until now.  It 
really isn't useful to client code.

 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.

It's not forcibly killing threads.  As the documentation states, it has 
no effect on jobs already executing, only ones in the queue. 
Furthermore, it's needed unless makeDaemon is called.  Speaking of 
which, makeDaemon and makeAngel should probably be trusted, too.

 * defaultPoolThreads - should it be a  property?

Yes.  In spirit it's a global variable.  It requires some extra 
machinations, though, to be threadsafe, which is why it's not 
implemented as a simple global variable.

 * No example for task().

???? Yes there is, for both flavors, though these could admittedly be 
improved.  Only the safe version doesn't have an example, and this is 
just a more restricted version of the function pointer case, so it seems 
silly to make a separate example for it.

 * What is 'run' in the definition of safe task()?

It's just the run() adapter function.  Isn't that obvious?

Mar 18 2011

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:

 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

 You can have this if you want, by setting the work unit size to 1.  
 Setting it to a larger size just causes more elements to be buffered,  
 which may be more efficient in some cases.

Please add an example showing that, too. Sure, the documentation says
that's what's being done, but an example would show it more clearly.


-- 
Simen

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

 You can have this if you want, by setting the work unit size to 1.
 Setting it to a larger size just causes more elements to be buffered,
 which may be more efficient in some cases.

 Please add an example showing that, too. Sure, the documentation says
 that's what's being done, but an example would show it more clearly.

I don't understand how this can be demonstrated in an example.  It's an
under-the-hood thing.  The only place this appears in the API is in the
workUnitSize parameter.

Mar 19 2011

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

On Sat, 19 Mar 2011 13:51:56 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works  


 on
 1, 5, 9, ... and so on.

 You can have this if you want, by setting the work unit size to 1.
 Setting it to a larger size just causes more elements to be buffered,
 which may be more efficient in some cases.

 Please add an example showing that, too. Sure, the documentation says
 that's what's being done, but an example would show it more clearly.

 I don't understand how this can be demonstrated in an example.  It's an
 under-the-hood thing.  The only place this appears in the API is in the
 workUnitSize parameter.

Yeah, scratch that. I for some reason thought the "array of size
workUnitSize" was global, but it's per thread, innit? Seems so logical now.


-- 
Simen

Mar 19 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/18/2011 11:40 PM, dsimcha wrote:
 Thanks for the advice. You mentioned in the past that the documentation
 was inadequate but didn't give enough specifics as to how until now. As
 the author of the library, things seem obvious to me that don't seem
 obvious to anyone else, so I don't feel that I'm in a good position to
 judge the quality of the documentation and where it needs improvement. I
 plan to fix most of the issues you raised, but I've left comments for
 the few that I can't/won't fix or believe are based on misunderstandings
 below.

Great, thanks.

 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

 You can have this if you want, by setting the work unit size to 1.
 Setting it to a larger size just causes more elements to be buffered,
 which may be more efficient in some cases.

Got it.

 * Why not make workerIndex a ulong and be done with it?

 I doubt anyone's really going to create anywhere near 4 billion TaskPool
 threads over the lifetime of a program. Part of the point of TaskPool is
 recycling threads rather than paying the overhead of creating and
 destroying them. Using a ulong on a 32-bit architecture would make
 worker-local storage substantially slower. workerIndex is how
 worker-local storage works under the hood, so it needs to be fast.

If you're confident that overflow won't occur, you may want to eliminate 
that detail from the docs. It throws off the reader.

  > * No example for workerIndex and why it's useful.

 It should just be private. The fact that it's public is an artifact of
 when I was designing worker-local storage and didn't know how it was
 going to work yet. I never thought to revisit this until now. It really
 isn't useful to client code.

It could be public and undocumented.

 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.

 It's not forcibly killing threads. As the documentation states, it has
 no effect on jobs already executing, only ones in the queue.
 Furthermore, it's needed unless makeDaemon is called. Speaking of which,
 makeDaemon and makeAngel should probably be trusted, too.

Great. The more safe/trusted, the better.

 * defaultPoolThreads - should it be a  property?

 Yes. In spirit it's a global variable. It requires some extra
 machinations, though, to be threadsafe, which is why it's not
 implemented as a simple global variable.

Then it should be a  property. I think ddoc doesn't reflect that, but an 
example could.

 * No example for task().

 ???? Yes there is, for both flavors, though these could admittedly be
 improved. Only the safe version doesn't have an example, and this is
 just a more restricted version of the function pointer case, so it seems
 silly to make a separate example for it.

Towards the bottom of the document there are overloads of task that 
don't have examples.

 * What is 'run' in the definition of safe task()?

 It's just the run() adapter function. Isn't that obvious?

I'm referring to this:

Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, 
Args args);

What is "run"?


Andrei

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
 On 03/18/2011 11:40 PM, dsimcha wrote:
 It should just be private. The fact that it's public is an artifact of
 when I was designing worker-local storage and didn't know how it was
 going to work yet. I never thought to revisit this until now. It really
 isn't useful to client code.

 It could be public and undocumented.

I've already made it private.  I can't see what purpose having it public 
would serve.  The fact that it was public before was **purely** an 
oversight.

 * defaultPoolThreads - should it be a  property?

 Yes. In spirit it's a global variable. It requires some extra
 machinations, though, to be threadsafe, which is why it's not
 implemented as a simple global variable.

 Then it should be a  property. I think ddoc doesn't reflect that, but an
 example could.

Right.  It's always been  property but ddoc doesn't reflect this.  I've 
changed the docs slightly to call it a "property" instead of a 
"function".  It seems like overkill to me to give examples for this, 
though, since it's just a getter and a setter.

 * No example for task().

 ???? Yes there is, for both flavors, though these could admittedly be
 improved. Only the safe version doesn't have an example, and this is
 just a more restricted version of the function pointer case, so it seems
 silly to make a separate example for it.

 Towards the bottom of the document there are overloads of task that
 don't have examples.

You mean TaskPool.task()?  Since these are such slight variations of the 
other overloads, I thought an example would be overkill.  Since people 
less familiar with the library don't think so, though, I've added 
examples that are accordingly slight variations of the examples for the 
other overloads.

 * What is 'run' in the definition of safe task()?

 It's just the run() adapter function. Isn't that obvious?

 I'm referring to this:

 Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp,
 Args args);

 What is "run"?

Ok, I ended up moving the docs for run() directly above the stuff that 
uses it.  run() is described as "Calls a delegate or function pointer 
with args. This is an adapter that makes Task work with delegates, 
function pointers and functors instead of just aliases. It is included 
in the documentation to clarify how this case is handled, but is not 
meant to be used directly by client code."

I know the Higher Principles of Encapsulation say this should be private 
and the relevant overloads should return auto.  I strongly believe, 
though, that being anal about encapsulation of this detail is silly 
since it is so unlikely to change and that exposing it helps to clarify 
what's really going on here.  Encapsulation is good up to a point, but 
sometimes it's just easier to think about things when you know how they 
really work at a concrete level, and this tradeoff needs to be weighted.

Mar 19 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/19/2011 10:28 AM, dsimcha wrote:
 On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
 Towards the bottom of the document there are overloads of task that
 don't have examples.

 You mean TaskPool.task()? Since these are such slight variations of the
 other overloads, I thought an example would be overkill. Since people
 less familiar with the library don't think so, though, I've added
 examples that are accordingly slight variations of the examples for the
 other overloads.

A great way to handle bunches of almost-identical overloads is to group 
them together with /// ditto and explain the slight differences in the 
consolidated documentation.

Andrei

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

Ok, thanks again for clarifying **how** the docs could be improved. 
I've implemented the suggestions and generally given the docs a good 
reading over and clean up.  The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 0. Overview and vote

 I think the library delivers the high-level goods (parallel foreach,
 map, reduce) but is a bit fuzzy in the lower level details.

 Documentation hurts understanding of the capabilities of the library and
 essentially is of inadequate quality. Entity documentation and examples
 do little more than give the impression of dispensing with an unpleasant
 chore.

 My vote in favor of acceptance is contingent upon a radical improvement
 in the quality of documentation and examples. Most if not all artifacts
 should be motivated by simple, compelling examples. The introductory
 section must contain a brief and attractive synopsis of the flagship
 capabilities. All terms must be defined before being used and introduced
 in a carefully-chosen order. The relationship between various entities
 should be clarified.

 I've seen the argument that simple and strong examples are difficult to
 find. Though I agree such examples are not easy to come up with, I also
 believe the author of library is in the best position to produce them.

 ================

 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

 * Example with squares would be better if double replaced uint, and if a
 more complicated operation (sqrt, log...) were involved.

 * I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
 method should be better: just use one circular buffer. The presence of
 two dependent parameters makes this abstraction difficult to operate with.

 * Same question about asyncBuf. What is wrong with a circular buffer
 filled on one side by threads and on the consumed from the other by the
 client? I can think of a couple of answers but it would be great if they
 were part of the documentation.

 * Why not make workerIndex a ulong and be done with it?

 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.

 * uint size() should be size_t for conformity.

 * defaultPoolThreads - should it be a  property?

 * I don't understand what Task is supposed to do. It is circularly
 defined: "A struct that encapsulates the information about a task,
 including its current status, what pool it was submitted to, and its
 arguments." OK, but what's a task? Could a task be used outside a task
 pool, and if so to what effect(s)?

 * If LazyMap is only necessary as the result of lazyMap, could that
 become a local type inside lazyMap?


 2. Documentation:

 * Documentation unacceptable. It lacks precision and uses terms before
 defining them. For example: "This class encapsulates a task queue and a
 set of worker threads" comes before the notions of "task queue" and
 "worker thread" are defined. Clearly there is an intuition of what those
 are supposed to mean, but in practice each library lays down some fairly
 detailed definition of such.

 * "Note: Initializing a pool with zero threads (as would happen in the
 case of a single-core CPU) is well-tested and does work." The absence of
 a bug should not be advertised in so many words. Simpler: "Note:
 Single-CPU machines will operate transparently with zero-sized pools."

 * "Allows for custom pool size." I have no idea what pool size means.

 * "// Do something interesting." Worst example ever. You are the creator
 of the library. If _you_ can't find a brief compelling example, who could?

 * "Immediately after the range argument, an optional block size argument
 may be provided. If none is provided, the default block size is used. An
 optional buffer for returining the results may be provided as the last
 argument. If one is not provided, one will be automatically allocated.
 If one is provided, it must be the same length as the range." An example
 should be inserted after each of these sentences.

 * "// Do something expensive with line." Better you do something
 expensive with line.

 * "This is not the same thing as commutativity." I think this is obvious
 enough to be left out. The example of matrices (associative but not
 commutative) is nice though.

 * "immutable n = 1000000000;" -> use underscores

 * Would be great if the documentation examples included some rough
 experimental results, e.g. "3.8 times faster on a 4-core machine than
 the equivalent serial loop".

 * No example for workerIndex and why it's useful.

 * Is LocalWorker better than WorkerLocal? No, because it's not the
 worker that's local, it's the storage - which is missing from the name!
 WorkerLocalStorage? Also replace "create" with "make" or drop it
 entirely. The example doesn't tell me how I can use bufs. I suspect
 workerIndex has something to do with it but the example fails to divulge
 that relationship.

 * No example for join(), which is quite a central primitive.

 * No example for put(), which would be interesting to see.

 * No example for task().

 * No example for run().

 * The example for Task uses myFuture without defining it. In case Task
 defines the promise/future idiom, the documentation misses a terrific
 opportunity of clarifying that.

 * What is 'run' in the definition of safe task()?

 * Why is AsyncBuf's doc far away from the function returning it? Same
 about WorkerLocal.


 Andrei

Mar 19 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

* When using "parallelism" as a common noun, prefix is with a '_' so 
ddoc doesn't underline it.

* "Most of this module completely subverts..." Vague characterizations 
("most", "completely", "some") don't belong in a technical 
documentation. (For example there's either subversion going on or there 
isn't.) Also, std.concurrency and std.parallelism address different 
needs so there's little competition between them. Better: "Unless 
explicitly marked as $(D  trusted) or $(D  safe), artifacts in this 
module are not provably memory-safe and cannot be used with SafeD. If 
used as documented, memory safety is guaranteed."

* Speaking of std.concurrency vs. std.parallelism, the first paragraph 
might be something like: "This module implements high-level primitives 
for shared memory SMP parallelism. These include parallel foreach, 
parallel reduce, parallel eager map, pipelining and future/promise 
parallelism primitives. $(D std._parallelism) is best recommended when 
the same operation is to be executed in parallel over different data. 
For communication between arbitrary threads, see $(D std.concurrency)."

* Still no synopsis example that illustrates in a catchy way the most 
attractive artifacts.

* "After creation, Task objects are submitted to a TaskPool for 
execution." I understand it's possible to use Task straight as a 
promise/future, so s/are/may be/. Also it is my understanding that 
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary 
number of tasks. If that's the case, it's worth mentioning.

* "A call to workWait(), yieldWait(), or spinWait() can be used to 
retrive the return value after the function is finished executing." As 
an aside, a spell checking step would be great ("retrieve") - just put 
the text in an editor with spellchecking. I think what this means is: 
"The methods workWait(), yieldWait(), or spinWait() make sure that the 
function finishes execution and then return its result to the initiating 
thread. Each uses a different waiting strategy as detailed below."

* "If a Task has been submitted to a TaskPool instance, is being stored 
in a stack frame, and has not yet finished, the destructor for this 
struct will automatically call yieldWait() so that the task can finish 
and the stack frame can be destroyed safely." At this point in the doc 
the reader doesn't understand that at all because TaskPool has not been 
seen yet. The reader gets worried that she'll be essentially serializing 
the entire process by mistake. Either move this explanation down or 
provide an example.

* "Function results are returned from yieldWait() and friends by ref." 
Someone coming from C++ may be thrown off by this sudden casual use of 
"friends" and think there's a notion of frienship by reference in D. 
Better: "The forcing methods yieldWait(), workWait(), and spinWait() 
return the result by reference."

* Speaking of which, I'd replace "Wait" with "Force". Right now the 
nomenclature is far removed from futures and promises.

* Is done() a property?

* The example that reads two files at the same time should NOT use 
taskPool. It's just one task, why would the pool ever be needed? If you 
also provided an example that reads n files in memory at the same time 
using a pool, that would illustrate nicely why you need it. If a Task 
can't be launched without being put in a pool, there should be a 
possibility to do so. At my work we have a simple function called 
callInNewThread that does what's needed to launch a function in a new 
thread.

* The note below that example gets me thinking: it is an artificial 
limitation to force users of Task to worry about scope and such. One 
should be able to create a Future object (Task I think in your 
terminology), pass it around like a normal value, and ultimately force 
it. This is the case for all other languages that implement futures. I 
suspect the "scope" parameter associated with the delegate a couple of 
definitions below plays a role here, but I think we need to work for 
providing the smoothest interface possible (possibly prompting 
improvements in the language along the way).

* I'm not sure how to interpret the docs for

ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);

So it's documented but I'm not supposed to care. Why not just remove? 
Surely there must be higher-level examples that clarify that I can use 
delegates etc.

* The examples have code at top-level. That's fine for short snippets 
but not when using import etc. I recommend putting the code inside 
unittests or function bodies for such cases.

* "If you want to escape the Task object from the function in which it 
was created or prefer to heap allocate and automatically submit to the 
pool, see TaskPool.task()." I'm uncomfortable that I need to remember a 
subtle distinction of the same primitive name ("task") depending on 
membership in a TaskPool or not, which is a tenuous relationship to 
remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking 
ourselves what's the relative overhead of closure allocation vs. task 
switching etc. Maybe we get to simplify things a lot at a small cost in 
efficiency.

* As I mentioned, in the definition:

Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

I can't tell what "run" is.

* osReportedNcpu - a case of naming convention gone wild... how about 
totalCPUs or reportedCPUs.

* "A goto from inside the parallel foreach loop to a label outside the 
loop will result in undefined behavior." Would this be a bug in dmd?

* I would group these two together:

ParallelForeach!(R) parallel(R)(R range, size_t workUnitSize);
ParallelForeach!(R) parallel(R)(R range);

Just put a /// ditto for the second, use only one code snippet that 
illustrates two foreach loops, one with the explicit parameter and one 
without. In fact swap the two - most of the time people will use the one 
with the default work unit size. At best don't specify 512 because later 
on you may come with better algorithm to determine the optimal block size.

* "workUnitSize: The number of elements to evaluate in a single Task. 
Must be less than or equal to bufSize, and in practice should be a 
fraction of bufSize such that all worker threads can be used." Then why 
not specify a different parameter such a multiplier or something? The 
dependence between bufSize and worUnitSize is a sign that these two 
should be improved. If you have good reasons that the user must have the 
parameters in this form, give an example substantiating that.

* "auto myMax = taskPool.reduce!"a + b * b"(myArr);" This is not 
computing anything meaningful. To compute the sum of squares, you need 
to give the seed 0.0.

* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. 
on a usual machine?

* The join example is fine, but the repetitive code suggests that loops 
should be better:

import std.file;

void main() {
     auto pool = new TaskPool();
     foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
       pool.put(task!read(filename));
     }

     // Call join() to guarantee that all tasks are done running, the worker
     // threads have terminated and that the results of all of the tasks can
     // be accessed without any synchronization primitives.
     pool.join();

     ubyte[][] data; // void[], meh
     // Use spinWait() since the results are guaranteed to have been 
computed
     // and spinWait() is the cheapest of the wait functions.
     foreach (task; pool.howDoIEnumerateTasks()) {
         data ~= task1.spinWait();
     }


Andrei

Mar 19 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations 
 ("most", "completely", "some") don't belong in a technical 
 documentation. (For example there's either subversion going on or there 
 isn't.) Also, std.concurrency and std.parallelism address different 
 needs so there's little competition between them. Better: "Unless 
 explicitly marked as $(D  trusted) or $(D  safe), artifacts in this 
 module are not provably memory-safe and cannot be used with SafeD. If 
 used as documented, memory safety is guaranteed."

Actually, I think this is a bad description of what it subverts. What 
it subverts isn't the memory-safety that SafeD provides, but the safety 
against low-level races that even unsafe D protects against unless you 
cast shared away. For instance:

	void main() {
		int sum = 0;
		foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
			sum += value;
		}
		writeln(sum);
	}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared 
delegate. I define shared delegate as a delegate that does not 
reference any non-shared variables of its outer scope. The problem is 
that DMD currently doesn't know how to determine whether a delegate 
literal is shared or not, thus a delegate literal is never shared and 
if ParallelForeach's opApply asked a shared delegate as it should it 
would just not work. Fix DMD to create shared delegate literals where 
appropriate and everything can be guarantied race-free.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 1:09 PM, Michel Fortin wrote:
 On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations
 ("most", "completely", "some") don't belong in a technical
 documentation. (For example there's either subversion going on or
 there isn't.) Also, std.concurrency and std.parallelism address
 different needs so there's little competition between them. Better:
 "Unless explicitly marked as $(D  trusted) or $(D  safe), artifacts in
 this module are not provably memory-safe and cannot be used with
 SafeD. If used as documented, memory safety is guaranteed."

 Actually, I think this is a bad description of what it subverts. What it
 subverts isn't the memory-safety that SafeD provides, but the safety
 against low-level races that even unsafe D protects against unless you
 cast shared away. For instance:

 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }

 The "+=" would need to be an atomic operation to avoid low-level races.

 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.

If you want pedal-to-metal parallelism without insane levels of 
verbosity, you can't have these safety features.  I thought long and 
hard about this issue before submitting this lib for review and 
concluded that any solution would make std.parallelism so slow, so 
limited and/or such a PITA to use that I'd rather it just punt these 
issues to the programmer.  In practice, parallel foreach is used with 
very small, performance-critical snippets that are fairly easy to reason 
about even without any language-level race safety.  This is a 
fundamental design decision that will not be changing.  If it's 
unacceptable then:

1.  I give up.

2.  I wish someone had told me earlier.

Mar 19 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }
 
 The "+=" would need to be an atomic operation to avoid low-level races.
 
 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.

 
 If you want pedal-to-metal parallelism without insane levels of 
 verbosity, you can't have these safety features.

I'm not sure where my proposal asks for more verbosity or less 
performance. All I can see is a few less casts in std.parallelism and 
that it'd disallow the case in my example above that is totally wrong. 
Either you're interpreting it wrong or there are things I haven't 
thought about (and I'd be happy to know about them).

But by looking at all the examples in the documentation, I cannot find 
one that would need to be changed... well, except the one I'll discuss 
below.


 I thought long and hard about this issue before submitting this lib for 
 review and concluded that any solution would make std.parallelism so 
 slow, so limited and/or such a PITA to use that I'd rather it just punt 
 these issues to the programmer.  In practice, parallel foreach is used 
 with very small, performance-critical snippets that are fairly easy to 
 reason about even without any language-level race safety.

I'm not too convinced about the "I know what I'm doing" argument when I 
look at this example from asyncBuf's documentation:

    auto lines = File("foo.txt").byLine();
    auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() 
recycles buffer

    // Fetch more lines in the background while we process the lines already
    // read into memory into a matrix of doubles.
    double[][] matrix;
    auto asyncReader = taskPool.asyncBuf(duped);

    foreach(line; asyncReader) {
        auto ls = line.split("\t");
        matrix ~= to!(double[])(ls);
    }

Look at the last line of the foreach. You are appending to a non-shared 
array from many different threads. How is that not a race condition?

With my proposal, the compiler would have caught that because opApply 
would want the foreach body to be  a shared delegate, and 
reading/writing to non-shared variable "matrix" in the outer scope from 
a shared delegate literal would be an error.

I'm not too sure how hard it'd be to do that in the compiler, but I 
think it's the right thing to do. Once the compiler can do this we can 
have safety; until that time I'd agree to see std.parallelism stay 
unsafe.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 19 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com> said:

 I'm not too convinced about the "I know what I'm doing" argument when I 
 look at this example from asyncBuf's documentation:
 
     auto lines = File("foo.txt").byLine();
     auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() 
 recycles buffer
 
     // Fetch more lines in the background while we process the lines already
     // read into memory into a matrix of doubles.
     double[][] matrix;
     auto asyncReader = taskPool.asyncBuf(duped);
 
     foreach(line; asyncReader) {
         auto ls = line.split("\t");
         matrix ~= to!(double[])(ls);
     }
 
 Look at the last line of the foreach. You are appending to a non-shared 
 array from many different threads. How is that not a race condition?

... or maybe I just totally misunderstood asyncBuf. Rereading the 
documentation I'm under the impression I'd have to write this to get 
what I expected:

	foreach (line; parallel(asyncReader))
		...

And that would cause a race condition. If that's the case, the example 
is fine. Sorry for the misunderstanding.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 2:25 PM, Michel Fortin wrote:
 On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com>
 said:

 I'm not too convinced about the "I know what I'm doing" argument when
 I look at this example from asyncBuf's documentation:

 auto lines = File("foo.txt").byLine();
 auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles
 buffer

 // Fetch more lines in the background while we process the lines already
 // read into memory into a matrix of doubles.
 double[][] matrix;
 auto asyncReader = taskPool.asyncBuf(duped);

 foreach(line; asyncReader) {
 auto ls = line.split("\t");
 matrix ~= to!(double[])(ls);
 }

 Look at the last line of the foreach. You are appending to a
 non-shared array from many different threads. How is that not a race
 condition?

 .... or maybe I just totally misunderstood asyncBuf. Rereading the
 documentation I'm under the impression I'd have to write this to get
 what I expected:

 foreach (line; parallel(asyncReader))
 ...

 And that would cause a race condition. If that's the case, the example
 is fine. Sorry for the misunderstanding.

Right.  And this is pretty obviously a race.  The other example (without 
the parallel) is completely safe.

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 2:14 PM, Michel Fortin wrote:
 On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:

 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }

 The "+=" would need to be an atomic operation to avoid low-level races.

 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.

 If you want pedal-to-metal parallelism without insane levels of
 verbosity, you can't have these safety features.

 I'm not sure where my proposal asks for more verbosity or less
 performance. All I can see is a few less casts in std.parallelism and
 that it'd disallow the case in my example above that is totally wrong.
 Either you're interpreting it wrong or there are things I haven't
 thought about (and I'd be happy to know about them).

 But by looking at all the examples in the documentation, I cannot find
 one that would need to be changed... well, except the one I'll discuss
 below.

Ok, I've had some time to think about this.  The following example is 
safe, but wouldn't work if I understand correctly what you're proposing:

auto logs = new double[1_000_000];

foreach(i, ref elem; taskPool.parallel(logs)) {
     elem = log(i + 1.0);
}

The problem is that you're writing to the same array from multiple 
threads, which the compiler would interpret as writing to the same 
variable.  However, you can guarantee that no two threads ever write to 
the same element, so it's safe.

Note:  I'm aware of the false sharing issue when writing to adjacent 
memory addresses.  However, when writing to an array this big it occurs 
to a negligible extent.  If you were using a smaller array, then either 
the loop body would be so expensive that the cost of false sharing would 
be negligible, or the whole loop would be too cheap to be worth 
parallelizing.

I'm also aware that word tearing is a concern on some architectures, 
though not x86.  IMHO std.parallelism and its documentation should not 
be pedantic about portability to architectures that a D compiler doesn't 
even exist for yet.

Also, your example can be trivially modified to be safe.

void main() {
     int sum = 0;
     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
          synchronized sum += value;
      }
      writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I 
use this pattern often.  I find it very common to have an expensive loop 
body can be performed in parallel, except for a tiny portion that must 
update a shared data structure.  I'm aware that it might be possible, in 
theory, to write this more formally using reduce() or something.  However:

1.  If the portion of the loop that deals with shared data is very small 
(and therefore the serialization caused by the synchronized block is 
negligible), it's often more efficient to only keep one data structure 
in memory and update it concurrently, rather than use stronger isolation 
between threads like reduce() does, and have to maintain one data 
structure for each thread.

2.  In my experience synchronizing on a small portion of the loop body 
works very well in practice.  My general philosophy is that, in a 
library like this, dangerous but useful constructs must be supported and 
treated as innocent until proven guilty, not the other way round.

Mar 20 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-20 11:42:04 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 2:14 PM, Michel Fortin wrote:
 On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:
 
 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }
 
 The "+=" would need to be an atomic operation to avoid low-level races.
 
 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.

 
 If you want pedal-to-metal parallelism without insane levels of
 verbosity, you can't have these safety features.

 
 I'm not sure where my proposal asks for more verbosity or less
 performance. All I can see is a few less casts in std.parallelism and
 that it'd disallow the case in my example above that is totally wrong.
 Either you're interpreting it wrong or there are things I haven't
 thought about (and I'd be happy to know about them).
 
 But by looking at all the examples in the documentation, I cannot find
 one that would need to be changed... well, except the one I'll discuss
 below.
 

 
 Ok, I've had some time to think about this.  The following example is 
 safe, but wouldn't work if I understand correctly what you're proposing:
 
 auto logs = new double[1_000_000];
 
 foreach(i, ref elem; taskPool.parallel(logs)) {
      elem = log(i + 1.0);
 }
 
 The problem is that you're writing to the same array from multiple 
 threads, which the compiler would interpret as writing to the same 
 variable.  However, you can guarantee that no two threads ever write to 
 the same element, so it's safe.

I don't see a problem with the above. The array elements you modify are 
passed through parallel's opApply which can check easily whether it's 
safe or not to pass them by ref to different threads (by checking the 
element's size) and allow or disallow the operation accordingly.

It could even do a clever trick to make it safe to pass things such as 
elements of array of bytes by ref (by coalescing loop iterations for 
all bytes sharing the same word into one task). That might not work for 
ranges which are not arrays however.

That said, feel free to suggest more problematic examples.


 Note:  I'm aware of the false sharing issue when writing to adjacent 
 memory addresses.  However, when writing to an array this big it occurs 
 to a negligible extent.  If you were using a smaller array, then either 
 the loop body would be so expensive that the cost of false sharing 
 would be negligible, or the whole loop would be too cheap to be worth 
 parallelizing.
 
 I'm also aware that word tearing is a concern on some architectures, 
 though not x86.  IMHO std.parallelism and its documentation should not 
 be pedantic about portability to architectures that a D compiler 
 doesn't even exist for yet.

No question there.


 Also, your example can be trivially modified to be safe.
 
 void main() {
      int sum = 0;
      foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
           synchronized sum += value;
       }
       writeln(sum);
 }
 
 In this case that kills all parallelism, but in more realistic cases I 
 use this pattern often.  I find it very common to have an expensive 
 loop body can be performed in parallel, except for a tiny portion that 
 must update a shared data structure.  I'm aware that it might be 
 possible, in theory, to write this more formally using reduce() or 
 something.  However:
 
 1.  If the portion of the loop that deals with shared data is very 
 small (and therefore the serialization caused by the synchronized block 
 is negligible), it's often more efficient to only keep one data 
 structure in memory and update it concurrently, rather than use 
 stronger isolation between threads like reduce() does, and have to 
 maintain one data structure for each thread.
 
 2.  In my experience synchronizing on a small portion of the loop body 
 works very well in practice.  My general philosophy is that, in a 
 library like this, dangerous but useful constructs must be supported 
 and treated as innocent until proven guilty, not the other way round.

Your second example is not really a good justification of anything. 
I'll refer you to how synchronized classes work. It was decided that 
synchronized in a class protects everything that is directly stored in 
the class. Anything behind an indirection is considered shared by the 
compiler. The implication of this is that if you have an array or a 
pointer to something that you want semantically to be protected by the 
class's mutex, you have to cast things to unshared. It was decided that 
things should be safe against low-level races first, and convenience 
was relegated as a secondary concern. I don't like it very much, but 
that's what was decided and written in TDPL.

Now we're in the exact same situation (except that no classes are 
involved) and you're proposing that for this case we make convenience 
prime over low-level race safety? To me this would be an utter lack of 
coherency in the design of D's "synchronized" feature to go that route.

For the case above, wouldn't it be better to use an atomic add instead 
of a synchronized block? In this case you can make "sum" shared and 
have the type system check that everything is safe (using my proposed 
rules). And if your data structure is bigger, likely it'll be a 
synchronized class and you won't have to resort to bypass type-system 
safeties inside your loop (although you might have to inside your 
synchronized class, but that's another matter).

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 20 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/20/2011 10:44 PM, Michel Fortin wrote:
 I don't see a problem with the above. The array elements you modify are
 passed through parallel's opApply which can check easily whether it's
 safe or not to pass them by ref to different threads (by checking the
 element's size) and allow or disallow the operation accordingly.

 It could even do a clever trick to make it safe to pass things such as
 elements of array of bytes by ref (by coalescing loop iterations for all
 bytes sharing the same word into one task). That might not work for
 ranges which are not arrays however.

 That said, feel free to suggest more problematic examples.

Ok, I completely agree in principle, though I question whether it's 
worth actually implementing something like this, especially until we get 
some kind of support for shared delegates.

 Also, your example can be trivially modified to be safe.

 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 synchronized sum += value;
 }
 writeln(sum);
 }

 In this case that kills all parallelism, but in more realistic cases I
 use this pattern often. I find it very common to have an expensive
 loop body can be performed in parallel, except for a tiny portion that
 must update a shared data structure. I'm aware that it might be
 possible, in theory, to write this more formally using reduce() or
 something. However:

 1. If the portion of the loop that deals with shared data is very
 small (and therefore the serialization caused by the synchronized
 block is negligible), it's often more efficient to only keep one data
 structure in memory and update it concurrently, rather than use
 stronger isolation between threads like reduce() does, and have to
 maintain one data structure for each thread.

 2. In my experience synchronizing on a small portion of the loop body
 works very well in practice. My general philosophy is that, in a
 library like this, dangerous but useful constructs must be supported
 and treated as innocent until proven guilty, not the other way round.

 Your second example is not really a good justification of anything. I'll
 refer you to how synchronized classes work. It was decided that
 synchronized in a class protects everything that is directly stored in
 the class. Anything behind an indirection is considered shared by the
 compiler. The implication of this is that if you have an array or a
 pointer to something that you want semantically to be protected by the
 class's mutex, you have to cast things to unshared. It was decided that
 things should be safe against low-level races first, and convenience was
 relegated as a secondary concern. I don't like it very much, but that's
 what was decided and written in TDPL.

I'd go a little further.  If the guarantees that shared was supposed to 
provide are strong, i.e. apply no matter what threading module is used, 
then I utterly despise it.  It's one of the worst decisions made in the 
design of D.  Making things pedantically strict, so that the type system 
gets in the way more than it helps, encourages the user to reflexively 
circumvent the type system without thinking hard about doing this, thus 
defeating its purpose.  (The alternative of always complying with what 
the type system "expects" you to do is too inflexible to even be worth 
considering.)  Type systems should err on the side of accepting a 
superset of what's correct and treating code as innocent until proven 
guilty, not the other way around.  I still believe this even if some of 
the bugs it could be letting pass through might be very difficult to 
debug.  See the discussion we had a few weeks ago about implicit integer 
casting and porting code to 64.

My excuse for std.parallelism is that it's pedal-to-metal parallelism, 
so it's more acceptable for it to be dangerous than general case 
concurrency.  IMHO when you use the non- safe parts of std.parallelism 
(i.e. most of the library), that's equivalent to casting away shared in 
a whole bunch of places.  Typing "import std.parallelism;" in a 
non- safe module is an explicit enough step here.

The guarantee is still preserved that, if you only use std.concurrency 
(D's flagship "safe" concurrency module) for multithreading and don't 
cast away shared, there can be no low level data races.  IMHO this is 
still a substantial accomplishment in that there exists a way to do 
safe, statically checkable concurrency in D, even if it's not the 
**only** way concurrency can be done.  BTW, core.thread can also be used 
to get around D's type system, not just std.parallelism.  If you want to 
check that only safe concurrency is used, importing std.parallelism and 
core.thread can be grepped just as easily as casting away shared.

If, on the other hand, the guarantees of shared are supposed to be weak 
in that they only apply to programs where only std.concurrency is used 
for multithreading, then I think strictness is the right thing to do. 
The whole point of std.concurrency is to give strong guarantees, but if 
you prefer more dangerous but more flexible multithreading, other 
paradigms should be readily available.

 Now we're in the exact same situation (except that no classes are
 involved) and you're proposing that for this case we make convenience
 prime over low-level race safety? To me this would be an utter lack of
 coherency in the design of D's "synchronized" feature to go that route.

 For the case above, wouldn't it be better to use an atomic add instead
 of a synchronized block?

Yes.  I've used this pattern in much more complicated cases, though, 
where atomic wouldn't cut it.

 In this case you can make "sum" shared and have
 the type system check that everything is safe (using my proposed rules).
 And if your data structure is bigger, likely it'll be a synchronized
 class and you won't have to resort to bypass type-system safeties inside
 your loop (although you might have to inside your synchronized class,
 but that's another matter).

I'm **still** totally confused about how shared is supposed to work, 
because I don't have a fully debugged/implemented implementation or good 
examples of stuff written in this paradigm to play around with.  TDPL 
helps to some degree, but for me it's a lot easier to understand 
something by actually trying to use it.

Mar 20 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-20 23:21:49 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/20/2011 10:44 PM, Michel Fortin wrote:
 
 I don't see a problem with the above. The array elements you modify are
 passed through parallel's opApply which can check easily whether it's
 safe or not to pass them by ref to different threads (by checking the
 element's size) and allow or disallow the operation accordingly.
 
 It could even do a clever trick to make it safe to pass things such as
 elements of array of bytes by ref (by coalescing loop iterations for all
 bytes sharing the same word into one task). That might not work for
 ranges which are not arrays however.
 
 That said, feel free to suggest more problematic examples.

 
 Ok, I completely agree in principle, though I question whether it's 
 worth actually implementing something like this, especially until we 
 get some kind of support for shared delegates.

Well, it'll work irrespective of whether shared delegates are used or 
not. I think you could add a compile-time check that the array element 
size is a multiple of the word size when the element is passed by ref 
in the loop and leave the clever trick as a possible future 
improvements. Would that work?


 Also, your example can be trivially modified to be safe.
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 synchronized sum += value;
 }
 writeln(sum);
 }
 
 In this case that kills all parallelism, but in more realistic cases I
 use this pattern often. I find it very common to have an expensive
 loop body can be performed in parallel, except for a tiny portion that
 must update a shared data structure. I'm aware that it might be
 possible, in theory, to write this more formally using reduce() or
 something. However:
 
 1. If the portion of the loop that deals with shared data is very
 small (and therefore the serialization caused by the synchronized
 block is negligible), it's often more efficient to only keep one data
 structure in memory and update it concurrently, rather than use
 stronger isolation between threads like reduce() does, and have to
 maintain one data structure for each thread.
 
 2. In my experience synchronizing on a small portion of the loop body
 works very well in practice. My general philosophy is that, in a
 library like this, dangerous but useful constructs must be supported
 and treated as innocent until proven guilty, not the other way round.

 
 Your second example is not really a good justification of anything. I'll
 refer you to how synchronized classes work. It was decided that
 synchronized in a class protects everything that is directly stored in
 the class. Anything behind an indirection is considered shared by the
 compiler. The implication of this is that if you have an array or a
 pointer to something that you want semantically to be protected by the
 class's mutex, you have to cast things to unshared. It was decided that
 things should be safe against low-level races first, and convenience was
 relegated as a secondary concern. I don't like it very much, but that's
 what was decided and written in TDPL.

 
 I'd go a little further.  If the guarantees that shared was supposed to 
 provide are strong, i.e. apply no matter what threading module is used, 
 then I utterly despise it.  It's one of the worst decisions made in the 
 design of D.  Making things pedantically strict, so that the type 
 system gets in the way more than it helps, encourages the user to 
 reflexively circumvent the type system without thinking hard about 
 doing this, thus defeating its purpose.  (The alternative of always 
 complying with what the type system "expects" you to do is too 
 inflexible to even be worth considering.)  Type systems should err on 
 the side of accepting a superset of what's correct and treating code as 
 innocent until proven guilty, not the other way around.  I still 
 believe this even if some of the bugs it could be letting pass through 
 might be very difficult to debug.  See the discussion we had a few 
 weeks ago about implicit integer casting and porting code to 64.

I agree with you that this is a serious problem. I think part of why it 
hasn't been talked much yet is that nobody is currently using D2 
seriously for multithreaded stuff at this time (apart from you I 
guess), so we're missing experience with it. Andrei seems to think it's 
fine to required casts as soon as you need to protect something beyond 
an indirection inside synchronized classes, with the mitigation measure 
that you can make classes share their mutex (not implemented yet I 
think) so if the indirection leads to a class it is less of a problem. 
Personally, I don't.


 My excuse for std.parallelism is that it's pedal-to-metal parallelism, 
 so it's more acceptable for it to be dangerous than general case 
 concurrency.  IMHO when you use the non- safe parts of std.parallelism 
 (i.e. most of the library), that's equivalent to casting away shared in 
 a whole bunch of places.  Typing "import std.parallelism;" in a 
 non- safe module is an explicit enough step here.

I still think this "pedal-to-metal" qualification needs to be 
justified. Not having shared delegates in the language seems like an 
appropriate justification to me. Wanting to bypass casts you normally 
have to do around synchronized as the sole reason seems like a bad 
justification to me.

It's not that I like how synchronized works, it's just that I think it 
should work the same everywhere.


 The guarantee is still preserved that, if you only use std.concurrency 
 (D's flagship "safe" concurrency module) for multithreading and don't 
 cast away shared, there can be no low level data races. IMHO this is 
 still a substantial accomplishment in that there exists a way to do 
 safe, statically checkable concurrency in D, even if it's not the 
 **only** way concurrency can be done.  BTW, core.thread can also be 
 used to get around D's type system, not just std.parallelism.  If you 
 want to check that only safe concurrency is used, importing 
 std.parallelism and core.thread can be grepped just as easily as 
 casting away shared.

Unless I'm mistaken, the only thing that bypasses race-safety in 
core.thread is the Thread constructor that takes a delegate. Which 
means it could easily be made race-safe by making that delegate 
parameter shared (once shared delegates are implemented).


 If, on the other hand, the guarantees of shared are supposed to be weak 
 in that they only apply to programs where only std.concurrency is used 
 for multithreading, then I think strictness is the right thing to do. 
 The whole point of std.concurrency is to give strong guarantees, but if 
 you prefer more dangerous but more flexible multithreading, other 
 paradigms should be readily available.

I think the language as a whole is designed to have strong guaranties, 
otherwise synchronized classes wouldn't require out-of-guaranty casts 
at every indirection.

I'm not too pleased with the way synchronized classes are supposed to 
work, nor am I too pleased with how it impacts the rest of the 
language. But if this is a problem (and I think it is), it ought to be 
fixed globally, not by shutting down safeties in every module dealing 
with multithreading that isn't std.concurrency.


 I'm **still** totally confused about how shared is supposed to work, 
 because I don't have a fully debugged/implemented implementation or 
 good examples of stuff written in this paradigm to play around with.

I think nobody have played much with the paradigm at this point, or 
we'd have heard some feedback. Well, actually we have your feedback 
which seem to indicate that it's better to shut off safeties than to 
play nice with them.

 - - -

Quoting Andrei, February 4 of 2010 in "there is no escape" on the 
dmd-concurrency mailing list:

 As we already knew, shared/synchronized limit quite drastically the 
 range of lock-based designs without casts. Fortunately, using a class 
 reference member inside a synchronized object will be possible 
 because... well I'll explain in the text.
 
 I continue to believe this is the right bet to make, but I expect push 
 back from experienced lock-based programmers.

Now is the beginning of that push back, I guess.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 21 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/21/2011 8:37 AM, Michel Fortin wrote:
 Well, it'll work irrespective of whether shared delegates are used or
 not. I think you could add a compile-time check that the array element
 size is a multiple of the word size when the element is passed by ref in
 the loop and leave the clever trick as a possible future improvements.
 Would that work?

On second thought, no, but for practical, not theoretical reasons:  One, 
you can't introspect whether a foreach loop is using a ref or a value 
parameter.  This is an issue with how opApply works.  Two, AFAIK there's 
no way to get the native word size.

 I'd go a little further. If the guarantees that shared was supposed to
 provide are strong, i.e. apply no matter what threading module is
 used, then I utterly despise it. It's one of the worst decisions made
 in the design of D. Making things pedantically strict, so that the
 type system gets in the way more than it helps, encourages the user to
 reflexively circumvent the type system without thinking hard about
 doing this, thus defeating its purpose. (The alternative of always
 complying with what the type system "expects" you to do is too
 inflexible to even be worth considering.) Type systems should err on
 the side of accepting a superset of what's correct and treating code
 as innocent until proven guilty, not the other way around. I still
 believe this even if some of the bugs it could be letting pass through
 might be very difficult to debug. See the discussion we had a few
 weeks ago about implicit integer casting and porting code to 64.

 I agree with you that this is a serious problem. I think part of why it
 hasn't been talked much yet is that nobody is currently using D2
 seriously for multithreaded stuff at this time (apart from you I guess),
 so we're missing experience with it. Andrei seems to think it's fine to
 required casts as soon as you need to protect something beyond an
 indirection inside synchronized classes, with the mitigation measure
 that you can make classes share their mutex (not implemented yet I
 think) so if the indirection leads to a class it is less of a problem.
 Personally, I don't.


 My excuse for std.parallelism is that it's pedal-to-metal parallelism,
 so it's more acceptable for it to be dangerous than general case
 concurrency. IMHO when you use the non- safe parts of std.parallelism
 (i.e. most of the library), that's equivalent to casting away shared
 in a whole bunch of places. Typing "import std.parallelism;" in a
 non- safe module is an explicit enough step here.

 I still think this "pedal-to-metal" qualification needs to be justified.
 Not having shared delegates in the language seems like an appropriate
 justification to me. Wanting to bypass casts you normally have to do
 around synchronized as the sole reason seems like a bad justification to
 me.

 It's not that I like how synchronized works, it's just that I think it
 should work the same everywhere.

This is where you and I disagree.  I think that the type system's 
guarantees should be weak, i.e. only apply to std.concurrency.  IMHO the 
strictness is reasonable when using message passing as your primary 
method of multithreading and only very little shared state.  However, 
it's completely unreasonable if you want to use a paradigm where shared 
state is more heavily used.  D, being a systems language, needs to allow 
other styles of multithreading without making them a huge PITA that 
requires casts everywhere.

 The guarantee is still preserved that, if you only use std.concurrency
 (D's flagship "safe" concurrency module) for multithreading and don't
 cast away shared, there can be no low level data races. IMHO this is
 still a substantial accomplishment in that there exists a way to do
 safe, statically checkable concurrency in D, even if it's not the
 **only** way concurrency can be done. BTW, core.thread can also be
 used to get around D's type system, not just std.parallelism. If you
 want to check that only safe concurrency is used, importing
 std.parallelism and core.thread can be grepped just as easily as
 casting away shared.

 Unless I'm mistaken, the only thing that bypasses race-safety in
 core.thread is the Thread constructor that takes a delegate. Which means
 it could easily be made race-safe by making that delegate parameter
 shared (once shared delegates are implemented).

And then you'd only need one cast to break it if you wanted to, not 
casts everywhere.  Just cast an unshared delegate to shared when passing 
it to core.thread.

 If, on the other hand, the guarantees of shared are supposed to be
 weak in that they only apply to programs where only std.concurrency is
 used for multithreading, then I think strictness is the right thing to
 do. The whole point of std.concurrency is to give strong guarantees,
 but if you prefer more dangerous but more flexible multithreading,
 other paradigms should be readily available.

 I think the language as a whole is designed to have strong guaranties,
 otherwise synchronized classes wouldn't require out-of-guaranty casts at
 every indirection.

Well, there's an easy way around that, too.  Just declare the whole 
method body synchronized, but don't declare the method synchronized in 
the signature.

 I'm not too pleased with the way synchronized classes are supposed to
 work, nor am I too pleased with how it impacts the rest of the language.
 But if this is a problem (and I think it is), it ought to be fixed
 globally, not by shutting down safeties in every module dealing with
 multithreading that isn't std.concurrency.

Again, I completely disagree.  IMHO it's not fixable globally such that 
both of the following are achieved:

1.  The strong guarantees when using only std.concurrency are preserved.

2.  More shared state-intensive multithreading can be done without the 
type system getting in the way more than it helps.

 I'm **still** totally confused about how shared is supposed to work,
 because I don't have a fully debugged/implemented implementation or
 good examples of stuff written in this paradigm to play around with.

 I think nobody have played much with the paradigm at this point, or we'd
 have heard some feedback. Well, actually we have your feedback which
 seem to indicate that it's better to shut off safeties than to play nice
 with them.

I agree that it's better to shut off the safeties **unless** you're 
doing very coarse-grained multithreading with very little shared state 
like std.concurrency had in mind.  If this is what you want, then the 
safeties are great.

Unfortunately, I'm going to have to take a hard line on this one.  The 
issue of integrating std.parallelism into the race safety system had 
been discussed a bunch in the past and it was basically agreed that 
std.parallelism is a "here be dragons" module that cannot reasonably be 
made to conform to such a model.  Given the choice (and I hope I'm not 
forced to make this choice) I'd rather std.parallelism be a third-party 
module that's actually usable than a Phobos module that is such a PITA 
to use that noone does in practice.

Mar 21 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-21 09:50:09 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/21/2011 8:37 AM, Michel Fortin wrote:
 Well, it'll work irrespective of whether shared delegates are used or
 not. I think you could add a compile-time check that the array element
 size is a multiple of the word size when the element is passed by ref in
 the loop and leave the clever trick as a possible future improvements.
 Would that work?

 
 On second thought, no, but for practical, not theoretical reasons:  
 One, you can't introspect whether a foreach loop is using a ref or a 
 value parameter.  This is an issue with how opApply works.

Indeed a problem. Either we fix the compiler to support that, or we 
change the syntax to something like this:

	taskPool.apply(range, (ref int value) {
		...
	});

Or we leave things as they are.

 Two, AFAIK there's no way to get the native word size.

Right. That's a problem too... you could probably alleviate this by 
doing a runtime check with some fancy instructions to get the native 
word size, but I'd expect that to be rather convoluted.

I'd like to check if I understand that well. For instance this code:

	int[100] values;
	foreach (i, ref value; parallel(values))
		value = i;

would normally run fine on a 32-bit processor, but it'd create 
low-level a race on a 64-bit processor (even a 64-bit processor running 
a 32-bit program in 32-bit compatibility mode). And even that is a 
generalization, some 32-bit processors out there *might* have 64-bit 
native words. So the code above isn't portable. Is that right?

Which makes me think... we need to document those pitfalls somewhere. 
Perhaps std.parallelism's documentation should link to a related page 
about what you can and what you can't do safely. People who read that 
"all the safeties are off" in std.parallelism aren't going to 
understand what you're talking about unless you explain the pitfalls 
with actual examples (like the one above).


 Unfortunately, I'm going to have to take a hard line on this one.  The 
 issue of integrating std.parallelism into the race safety system had 
 been discussed a bunch in the past and it was basically agreed that 
 std.parallelism is a "here be dragons" module that cannot reasonably be 
 made to conform to such a model.

About the "cannot reasonably be made to conform to such a model" part: 
that is certainly true today, but might turn out not to be true as the 
model evolves. It certainly is true as long as we don't have shared 
delegates. Beyond that it becomes more fuzzy. The final goal of the 
module shouldn't be to bypass safeties but to provide good parallelism 
primitives. By all means, if safeties can be enabled reasonably they 
should be (but as of today, they can't).

And I totally agree with you that it's quite silly to require casts 
everywhere to use synchronized. I'll be the first to admit that it's 
hard to see synchronized classes being very practical as they're 
implemented today. There's room for improvements there too.


 Given the choice (and I hope I'm not forced to make this choice) I'd 
 rather std.parallelism be a third-party module that's actually usable 
 than a Phobos module that is such a PITA to use that noone does in 
 practice.

Which is understandable.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 21 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.

 Indeed a problem. Either we fix the compiler to support that, or we
 change the syntax to something like this:
 	taskPool.apply(range, (ref int value) {
 		...
 	});
 Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.

 Right. That's a problem too... you could probably alleviate this by
 doing a runtime check with some fancy instructions to get the native
 word size, but I'd expect that to be rather convoluted.
 I'd like to check if I understand that well. For instance this code:
 	int[100] values;
 	foreach (i, ref value; parallel(values))
 		value = i;
 would normally run fine on a 32-bit processor, but it'd create
 low-level a race on a 64-bit processor (even a 64-bit processor running
 a 32-bit program in 32-bit compatibility mode). And even that is a
 generalization, some 32-bit processors out there *might* have 64-bit
 native words. So the code above isn't portable. Is that right?
 Which makes me think... we need to document those pitfalls somewhere.
 Perhaps std.parallelism's documentation should link to a related page
 about what you can and what you can't do safely. People who read that
 "all the safeties are off" in std.parallelism aren't going to
 understand what you're talking about unless you explain the pitfalls
 with actual examples (like the one above).

This problem is **much** less severe than you are suggesting.  x86 can address
single bytes, so it's not a problem even if you're iterating over bytes on a
64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
natively address individual bytes.  Therefore, writing to a byte would be
implemented much like writing to a bit is on x86:  You'd read the full word in,
change one byte, write the full word back.  I'm not sure exactly how it would be
implemented at the compiler level, or whether you'd even be allowed to have a
reference to a byte in such an implementation.  This is why I consider this
more a
theoretical problem than a serious practical issue.


 Unfortunately, I'm going to have to take a hard line on this one.  The
 issue of integrating std.parallelism into the race safety system had
 been discussed a bunch in the past and it was basically agreed that
 std.parallelism is a "here be dragons" module that cannot reasonably be
 made to conform to such a model.

 About the "cannot reasonably be made to conform to such a model" part:
 that is certainly true today, but might turn out not to be true as the
 model evolves. It certainly is true as long as we don't have shared
 delegates. Beyond that it becomes more fuzzy. The final goal of the
 module shouldn't be to bypass safeties but to provide good parallelism
 primitives. By all means, if safeties can be enabled reasonably they
 should be (but as of today, they can't).

I'll agree that, if the model evolves sufficiently, everything I've said
deserves
to be reconsidered.  My comments only apply to the model as it exists currently
or
will exist in the foreseeable future.  It's just that I have **serious** doubts
that it will evolve much and I don't want to delay things to have an academic
discussion about what-ifs with regard to this.

 And I totally agree with you that it's quite silly to require casts
 everywhere to use synchronized. I'll be the first to admit that it's
 hard to see synchronized classes being very practical as they're
 implemented today. There's room for improvements there too.

This is the point I'm trying to make:  Requiring these casts is perfectly
reasonable if you assume that shared state is going to be rare, as it is in the
std.concurrency model.  When the user is looking for a more fine-grained, more
shared state-heavy paradigm, I don't think it's possible to make it safe without
also making it virtually unusable.  This is why we punted and decided to go with
message passing and very limited shared data as the flagship multithreading
model.
 On the other hand, in a systems language, we can't virtually prohibit shared
state-heavy multithreading by making it virtually unusable.  Thus, the no low
level race guarantee needs to apply to std.concurrency only.

Mar 21 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.

 Indeed a problem. Either we fix the compiler to support that, or we
 change the syntax to something like this:
 	taskPool.apply(range, (ref int value) {
 		...
 	});
 Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.

 Right. That's a problem too... you could probably alleviate this by
 doing a runtime check with some fancy instructions to get the native
 word size, but I'd expect that to be rather convoluted.
 I'd like to check if I understand that well. For instance this code:
 	int[100] values;
 	foreach (i, ref value; parallel(values))
 		value = i;
 would normally run fine on a 32-bit processor, but it'd create
 low-level a race on a 64-bit processor (even a 64-bit processor running
 a 32-bit program in 32-bit compatibility mode). And even that is a
 generalization, some 32-bit processors out there *might* have 64-bit
 native words. So the code above isn't portable. Is that right?
 Which makes me think... we need to document those pitfalls somewhere.
 Perhaps std.parallelism's documentation should link to a related page
 about what you can and what you can't do safely. People who read that
 "all the safeties are off" in std.parallelism aren't going to
 understand what you're talking about unless you explain the pitfalls
 with actual examples (like the one above).

 This problem is **much** less severe than you are suggesting.  x86 can address
 single bytes, so it's not a problem even if you're iterating over bytes on a
 64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
 natively address individual bytes.  Therefore, writing to a byte would be
 implemented much like writing to a bit is on x86:  You'd read the full word in,
 change one byte, write the full word back.  I'm not sure exactly how it would
be
 implemented at the compiler level, or whether you'd even be allowed to have a
 reference to a byte in such an implementation.  This is why I consider this
more a
 theoretical problem than a serious practical issue.

Actually, just remembered that word tearing is also an issue with unaligned
memory
access.  I guess I could just include a warning that says not to do this with
unaligned memory.

Mar 21 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/21/2011 11:58 AM, dsimcha wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.

 Indeed a problem. Either we fix the compiler to support that, or we
 change the syntax to something like this:
 	taskPool.apply(range, (ref int value) {
 		...
 	});
 Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.

 Right. That's a problem too... you could probably alleviate this by
 doing a runtime check with some fancy instructions to get the native
 word size, but I'd expect that to be rather convoluted.
 I'd like to check if I understand that well. For instance this code:
 	int[100] values;
 	foreach (i, ref value; parallel(values))
 		value = i;
 would normally run fine on a 32-bit processor, but it'd create
 low-level a race on a 64-bit processor (even a 64-bit processor running
 a 32-bit program in 32-bit compatibility mode). And even that is a
 generalization, some 32-bit processors out there *might* have 64-bit
 native words. So the code above isn't portable. Is that right?
 Which makes me think... we need to document those pitfalls somewhere.
 Perhaps std.parallelism's documentation should link to a related page
 about what you can and what you can't do safely. People who read that
 "all the safeties are off" in std.parallelism aren't going to
 understand what you're talking about unless you explain the pitfalls
 with actual examples (like the one above).

 This problem is **much** less severe than you are suggesting.  x86 can address
 single bytes, so it's not a problem even if you're iterating over bytes on a
 64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
 natively address individual bytes.  Therefore, writing to a byte would be
 implemented much like writing to a bit is on x86:  You'd read the full word in,
 change one byte, write the full word back.  I'm not sure exactly how it would
be
 implemented at the compiler level, or whether you'd even be allowed to have a
 reference to a byte in such an implementation.  This is why I consider this
more a
 theoretical problem than a serious practical issue.

 Actually, just remembered that word tearing is also an issue with unaligned
memory
 access.  I guess I could just include a warning that says not to do this with
 unaligned memory.

s/is/may be .  I'm trying to read up on word tearing and post questions 
here and to StackOverflow.  Good documentation about it seems 
ridiculously hard to come by.  About the only solid pieces of 
information I've found are that x86 definitely **can** do byte 
granularity (which doesn't mean it always does) and that Java makes 
these guarantees (which is only useful if you're writing in Java).  The 
general gut feeling people here and on StackOverflow seem to have is 
that it's not an issue, but there doesn't seem to be much backing up of 
this.

Maybe I'm just being paranoid and this is a complete non-issue except on 
the most obscure/ancient hardware and/or most stupidly implemented 
compilers, which is why few people even think of it.

Mar 21 2011

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:

 1.  I give up.
 
 2.  I wish someone had told me earlier.

Please, don't give up. It's a lot of time from the first time I've asked a
parallel map in Phobos :-)

Bye,
bearophile

Mar 19 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Sat, 19 Mar 2011 13:09:23 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations  
 ("most", "completely", "some") don't belong in a technical  
 documentation. (For example there's either subversion going on or there  
 isn't.) Also, std.concurrency and std.parallelism address different  
 needs so there's little competition between them. Better: "Unless  
 explicitly marked as $(D  trusted) or $(D  safe), artifacts in this  
 module are not provably memory-safe and cannot be used with SafeD. If  
 used as documented, memory safety is guaranteed."

 Actually, I think this is a bad description of what it subverts. What it  
 subverts isn't the memory-safety that SafeD provides, but the safety  
 against low-level races that even unsafe D protects against unless you  
 cast shared away. For instance:

 	void main() {
 		int sum = 0;
 		foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 			sum += value;
 		}
 		writeln(sum);
 	}

 The "+=" would need to be an atomic operation to avoid low-level races.

 I think that ParallelForeach's opApply should only accept a shared  
 delegate. I define shared delegate as a delegate that does not reference  
 any non-shared variables of its outer scope. The problem is that DMD  
 currently doesn't know how to determine whether a delegate literal is  
 shared or not, thus a delegate literal is never shared and if  
 ParallelForeach's opApply asked a shared delegate as it should it would  
 just not work. Fix DMD to create shared delegate literals where  
 appropriate and everything can be guarantied race-free.

Technically, you'd only need a shared delegate or a const delegate to  
guarantee race safety for parallel foreach.

Mar 20 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

 * Still no synopsis example that illustrates in a catchy way the most
 attractive artifacts.

I don't see what I could put here that isn't totally redundant with the 
rest of the documentation.  Anything I could think of would basically 
just involve concatentating all the examples.  Furthermore, none of the 
other Phobos modules have this, so I don't know what one should look 
like.  In general I feel like std.parallelism is being held to a 
ridiculously high standard that none of the other Phobos modules 
currently meet.

 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.

No.  The only way Task is useful is by submitting it to a pool to be 
executed.  (Though this may change, see below.)

 Also it is my understanding that
 TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
 number of tasks. If that's the case, it's worth mentioning.

Isn't this kind of obvious from the examples, etc.?

 * "If a Task has been submitted to a TaskPool instance, is being stored
 in a stack frame, and has not yet finished, the destructor for this
 struct will automatically call yieldWait() so that the task can finish
 and the stack frame can be destroyed safely." At this point in the doc
 the reader doesn't understand that at all because TaskPool has not been
 seen yet. The reader gets worried that she'll be essentially serializing
 the entire process by mistake. Either move this explanation down or
 provide an example.

This is getting ridiculous.  There are too many circular dependencies 
between Task and TaskPool that are impossible to remove here that I'm 
not even going to try.  One or the other has to be introduced first, but 
neither can be explained without mentioning the other.  This is why I 
explain the relationship briefly in the module level summary, so that 
the user has at least some idea.  I think this is about the best I can do.

 * Is done() a property?

Yes.  DDoc sucks.

 * The example that reads two files at the same time should NOT use
 taskPool. It's just one task, why would the pool ever be needed? If you
 also provided an example that reads n files in memory at the same time
 using a pool, that would illustrate nicely why you need it. If a Task
 can't be launched without being put in a pool, there should be a
 possibility to do so. At my work we have a simple function called
 callInNewThread that does what's needed to launch a function in a new
 thread.

I guess I could add something like this to Task.  Under the hood it 
would (for implementation simplicity, to reuse a bunch of code from 
TaskPool) fire up a new single-thread pool, submit the task, call 
TaskPool.finish(), and then return.  Since you're already creating a new 
thread, the extra overhead of creating a new TaskPool for the thread 
would be negligible and it would massively simplify the implementation. 
  My only concern is that, when combined with scoped versus non-scoped 
tasks (which are definitely here to stay, see below) this small 
convenience function would add way more API complexity than it's worth. 
  Besides, task pools don't need to be created explicitly anyhow. 
That's what the default instantiation is for.  I don't see how 
callInNewThread would really solve much.

Secondly, I think you're reading **WAY** too much into what was meant to 
be a simple example to illustrate usage mechanics.  This is another case 
where I can't think of a small, cute example of where you'd really need 
the pool.  There are plenty of larger examples, but the smallest/most 
self-contained one I can think of is a parallel sort.  I decided to use 
file reading because it was good enough to illustrate the mechanics of 
usage, even if it didn't illustrate a particularly good use case.

 * The note below that example gets me thinking: it is an artificial
 limitation to force users of Task to worry about scope and such. One
 should be able to create a Future object (Task I think in your
 terminology), pass it around like a normal value, and ultimately force
 it. This is the case for all other languages that implement futures. I
 suspect the "scope" parameter associated with the delegate a couple of
 definitions below plays a role here, but I think we need to work for
 providing the smoothest interface possible (possibly prompting
 improvements in the language along the way).

This is what TaskPool.task is for.  Maybe this should be moved to the 
top of the definition of TaskPool and emphasized, and the scoped/stack 
allocated versions should be moved below TaskPool and de-emphasized?

At any rate, though, anyone who's playing with parallelism should be an 
advanced enough programmer that concepts like scope and destructors are 
second nature.

 * I'm not sure how to interpret the docs for

 ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);

 So it's documented but I'm not supposed to care. Why not just remove?
 Surely there must be higher-level examples that clarify that I can use
 delegates etc.

Yes, and there are such examples.  It's just that I want to explicitly 
state what type is returned in this case rather than returning auto.  If 
I omitted the docs for run(), you'd ask what run() is.

 * "If you want to escape the Task object from the function in which it
 was created or prefer to heap allocate and automatically submit to the
 pool, see TaskPool.task()." I'm uncomfortable that I need to remember a
 subtle distinction of the same primitive name ("task") depending on
 membership in a TaskPool or not, which is a tenuous relationship to
 remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking
 ourselves what's the relative overhead of closure allocation vs. task
 switching etc. Maybe we get to simplify things a lot at a small cost in
 efficiency.

We absolutely **need** scope delegates for calling stuff where a closure 
can't be allocated due to objects on the stack frame having scoped 
destruction.  This is not going anywhere.  Also, I **much** prefer to 
have everything that does conceptually the same thing have the same 
name.  I think this is clearer, not less clear.

 * As I mentioned, in the definition:

 Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

 I can't tell what "run" is.

run() is just the adapter function described right above this code.  I 
cannot for the life of me understand how this could possibly be unclear.

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful 
semantics.  I should probably mention in the docs that the same applies 
to labeled break and continue.

I have no idea what semantics these should have, and even if I did, 
given the long odds that even one person would actually need them, I 
think they'd be more trouble than they're worth to implement.  For 
example, once you break out of a parallel foreach loop to some arbitrary 
address (and different threads can goto different labels, etc.), well, 
it's no longer a parallel foreach loop.  It's just a bunch of completely 
unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says, 
"Just don't do it."  This also has the advantage that, if anyone ever 
thinks of any good, clearly useful semantics, these will be 
implementable without breaking code later.
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?

I **STRONGLY** believe this does not belong in API documentation because 
it's too machine specific, compiler specific, stack alignment specific, 
etc. and almost any benchmark worth doing takes up more space than an 
example should. Furthermore, anyone who wants to know this can easily 
time it themselves.  I have absolutely no intention of including this. 
While in general I appreciate and have tried to accommodate your 
suggestions, this is one I'll be standing firm on.

 * The join example is fine, but the repetitive code suggests that loops
 should be better:

 import std.file;

 void main() {
 auto pool = new TaskPool();
 foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
 pool.put(task!read(filename));
 }

 // Call join() to guarantee that all tasks are done running, the worker
 // threads have terminated and that the results of all of the tasks can
 // be accessed without any synchronization primitives.
 pool.join();

 ubyte[][] data; // void[], meh
 // Use spinWait() since the results are guaranteed to have been computed
 // and spinWait() is the cheapest of the wait functions.
 foreach (task; pool.howDoIEnumerateTasks()) {
 data ~= task1.spinWait();
 }

You can't enumerate the tasks like this, but there's a good reason for 
this limitation.  The design requires the pool not hold onto any 
references to the tasks after they're out of the queue, so that they can 
be safely destroyed as soon as the caller wants to destroy them.  If you 
need to enumerate over them like this, then:

1.  You're probably doing something wrong and parallel foreach would be 
a better choice.

2.  If you insist on doing this, you need to explicitly store them in an 
array or something.

As a more general comment, I know the join() example sucks.  join() is 
actually a seldom-used primitive, not a frequently used one as you 
suggest.  Most of the time you wait for individual tasks (explicitly or 
implicitly through the higher level functions), not the whole pool, to 
finish executing.  I feel like it's obligatory that join() and finish() 
exist, but I never use them myself.  They were much more useful in 
pre-release versions of this lib, when parallel foreach didn't exist and 
it made sense to have explicit arrays of Tasks.  The difficulty I've had 
in coming up with a decent example for them makes me halfway tempted to

Mar 19 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

 
 No, it's because a goto of this form has no reasonable, useful 
 semantics.  I should probably mention in the docs that the same applies 
 to labeled break and continue.
 
 I have no idea what semantics these should have, and even if I did, 
 given the long odds that even one person would actually need them, I 
 think they'd be more trouble than they're worth to implement.  For 
 example, once you break out of a parallel foreach loop to some 
 arbitrary address (and different threads can goto different labels, 
 etc.), well, it's no longer a parallel foreach loop.  It's just a bunch 
 of completely unstructured threading doing god-knows-what.
 
 Therefore, I slapped undefined behavior on it as a big sign that says, 
 "Just don't do it."  This also has the advantage that, if anyone ever 
 thinks of any good, clearly useful semantics, these will be 
 implementable without breaking code later.

I think an improvement over undefined behaviour would be to throw an exception.

Also, what happens if one of the tasks throws an exception?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 2:35 PM, Michel Fortin wrote:
 On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

 No, it's because a goto of this form has no reasonable, useful
 semantics. I should probably mention in the docs that the same applies
 to labeled break and continue.

 I have no idea what semantics these should have, and even if I did,
 given the long odds that even one person would actually need them, I
 think they'd be more trouble than they're worth to implement. For
 example, once you break out of a parallel foreach loop to some
 arbitrary address (and different threads can goto different labels,
 etc.), well, it's no longer a parallel foreach loop. It's just a bunch
 of completely unstructured threading doing god-knows-what.

 Therefore, I slapped undefined behavior on it as a big sign that says,
 "Just don't do it." This also has the advantage that, if anyone ever
 thinks of any good, clearly useful semantics, these will be
 implementable without breaking code later.

 I think an improvement over undefined behaviour would be to throw an
 exception.

The only problem is that there's no easy, well-documented way to tell 
from the return value of opApply whether it was a break, a goto, a 
labeled break/continue, etc.  This would be implementable only if I 
changed the semantics of break to also throw.  This might not be a bad 
thing (IMHO any type of breaking out of a parallel foreach loop is just 
silly) but others had asked for different semantics for break.

 Also, what happens if one of the tasks throws an exception?

It gets rethrown when yieldWait()/spinWait()/workWait() is called.  In 
the case of the higher-level primitives, it gets re-thrown to the 
calling thread at some non-deterministic point in the execution of these 
functions.  I didn't see the need to document this explicitly because it 
"just works".

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 3:36 PM, dsimcha wrote:
 On 3/19/2011 2:35 PM, Michel Fortin wrote:
 It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
 the case of the higher-level primitives, it gets re-thrown to the
 calling thread at some non-deterministic point in the execution of these
 functions. I didn't see the need to document this explicitly because it
 "just works".

...Though now that you make me think of it I need to support exception 
chaining for the case of multiple concurrently thrown exceptions instead 
of just arbitrarily and non-deterministically rethrowing one.  The code 
that handles this was written before exception chaining existed.  Will 
get on that.

Mar 19 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 2:35 PM, Michel Fortin wrote:
 On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:
 
 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

 
 No, it's because a goto of this form has no reasonable, useful
 semantics. I should probably mention in the docs that the same applies
 to labeled break and continue.
 
 I have no idea what semantics these should have, and even if I did,
 given the long odds that even one person would actually need them, I
 think they'd be more trouble than they're worth to implement. For
 example, once you break out of a parallel foreach loop to some
 arbitrary address (and different threads can goto different labels,
 etc.), well, it's no longer a parallel foreach loop. It's just a bunch
 of completely unstructured threading doing god-knows-what.
 
 Therefore, I slapped undefined behavior on it as a big sign that says,
 "Just don't do it." This also has the advantage that, if anyone ever
 thinks of any good, clearly useful semantics, these will be
 implementable without breaking code later.

 
 I think an improvement over undefined behaviour would be to throw an
 exception.

 
 The only problem is that there's no easy, well-documented way to tell 
 from the return value of opApply whether it was a break, a goto, a 
 labeled break/continue, etc.  This would be implementable only if I 
 changed the semantics of break to also throw.  This might not be a bad 
 thing (IMHO any type of breaking out of a parallel foreach loop is just 
 silly) but others had asked for different semantics for break.

It's not that silly.

Essentially, what you'd express like this with a normal function taking 
a delegate:

	taskPool.apply([1,2,3], (int i) {
		if (i == 1)
			return;
		// do some things
	});

you'd express like this in a parallel foreach:

	foreach (int i; parallel([1,2,3])) {
		if (i == 1)
			break;
		// do some things
	}

It's not following the semantics of break within a foreach, but it's 
still useful to be able to return early from a function (from a loop 
iteration in this case), so I see the use case for making 'break' do 
what it does.

My only gripe is that the semantics are distorted. In fact, just making 
foreach parallel distorts its semantics. I was confused earlier about a 
foreach being parallel when it was not, someone could also be confused 
in the other direction, thinking foreach is the normal foreach when it 
actually is parallel. This makes code harder to review. Don't consider 
only my opinion on this, but in my opinion the first form above 
(taskPool.apply) is preferable because you absolutely can't mistake it 
with a normal foreach. And I think it's especially important to make it 
stand out given that the compiler can't check for low-level races.


 Also, what happens if one of the tasks throws an exception?

 
 It gets rethrown when yieldWait()/spinWait()/workWait() is called.  In 
 the case of the higher-level primitives, it gets re-thrown to the 
 calling thread at some non-deterministic point in the execution of 
 these functions.  I didn't see the need to document this explicitly 
 because it "just works".

That's indeed what I'd expect. But I think it'd still be worth 
mentioning in a short sentense in yeildWait()/spinWait()/workWait()'s 
documentation. It's comforting when the documentation confirms your 
expectations.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 5:33 PM, Michel Fortin wrote:
 On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:
 The only problem is that there's no easy, well-documented way to tell
 from the return value of opApply whether it was a break, a goto, a
 labeled break/continue, etc. This would be implementable only if I
 changed the semantics of break to also throw. This might not be a bad
 thing (IMHO any type of breaking out of a parallel foreach loop is
 just silly) but others had asked for different semantics for break.

 It's not that silly.

 Essentially, what you'd express like this with a normal function taking
 a delegate:

 taskPool.apply([1,2,3], (int i) {
 if (i == 1)
 return;
 // do some things
 });

 you'd express like this in a parallel foreach:

 foreach (int i; parallel([1,2,3])) {
 if (i == 1)
 break;
 // do some things
 }

 It's not following the semantics of break within a foreach, but it's
 still useful to be able to return early from a function (from a loop
 iteration in this case), so I see the use case for making 'break' do
 what it does.

Using continue is well defined behavior and does exactly what I think 
you're suggesting.  The problem with break is that it's supposed to stop 
all subsequent iterations of the loop from being executed, not just end 
the current one early.  This only makes sense in a serial context, where 
"subsequent iterations" is well-defined.  Currently break breaks from 
the current work unit but continues executing all other work units.  I 
put this semantic in because I couldn't think of anything else to make 
it do and someone (I don't remember who) asked for it.  I have never 
encountered a use case for it.

IMHO all early termination that affects subsequent loop iterations as 
well (break, goto, labeled break and continue, but not regular continue) 
should just throw because they make absolutely no sense in a parallel 
context.

Mar 19 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 19.03.2011 22:45, schrieb dsimcha:
 IMHO all early termination that affects subsequent loop iterations as well
 (break, goto, labeled break and continue, but not regular continue) should just
 throw because they make absolutely no sense in a parallel context.

What about some kind of parallel search? You just want to know if something is
there and on first sight you're done, so you want to break out of this iteration
and don't want any new parallel iterations to be done.
(Not sure what should be done with currently running iterations.. should they be
killed or go on until they're done - anyway, the executing thread shouldn't
start a new iteration after that)

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 7:33 PM, Daniel Gibson wrote:
 Am 19.03.2011 22:45, schrieb dsimcha:
 IMHO all early termination that affects subsequent loop iterations as well
 (break, goto, labeled break and continue, but not regular continue) should just
 throw because they make absolutely no sense in a parallel context.

 What about some kind of parallel search? You just want to know if something is
 there and on first sight you're done, so you want to break out of this
iteration
 and don't want any new parallel iterations to be done.
 (Not sure what should be done with currently running iterations.. should they
be
 killed or go on until they're done - anyway, the executing thread shouldn't
 start a new iteration after that)

It's an interesting suggestion, and I'll give some thought to it, but on 
first glance it sounds **very** difficult to implement efficiently.  To 
avoid doing a lot of extra work after you've found what you need, you'd 
need to use very small work units.  To avoid excessive overhead you'd 
need to use fairly large work units.  The only case I can think of where 
this would be do-able is when evaluating the predicate is very expensive.

Mar 19 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

 * Still no synopsis example that illustrates in a catchy way the most
 attractive artifacts.

 I don't see what I could put here that isn't totally redundant with the
 rest of the documentation. Anything I could think of would basically
 just involve concatentating all the examples. Furthermore, none of the
 other Phobos modules have this, so I don't know what one should look
 like.

I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the 
synopsis. It's simple, clean, and easy to relate to. Generally, you'd 
put here not all details but the stuff you think would make it easiest 
for people to get into your library.

 In general I feel like std.parallelism is being held to a
 ridiculously high standard that none of the other Phobos modules
 currently meet.

I agree, but that goes without saying. This is not a double standard; 
it's a simple means to improve quality of Phobos overall. Clearly 
certain modules that are already in Phobos would not make it if proposed 
today. And that's a good thing! Comparing our current ambitions to the 
quality of the average Phobos module would not help us.
	
Generally it seems you have grown already tired of the exchange and it 
would take only a few more exchanges for you to say, "you know what, 
either let's get it over it or forget about it."

Please consider for a minute how this is the wrong tone and attitude to 
be having on several levels. This is not a debate and not the place to 
get defensive. Your role in the discussion is not symmetric with the 
others' and at best you'd use the review as an opportunity to improve 
your design, not stick heels in the ground to defend its current 
incarnation (within reason). Your potential customers are attempting to 
help you by asking questions (some of which no doubt are silly) and by 
making suggestions (some of which, again, are ill-founded). 
Nevertheless, we _are_ your potential customers and in a way the 
customer is always right. Your art is to steer customers from what they 
think they want to what you know they need - because you're the expert! 
- and to improve design, nomenclature, and implementation to the extent 
that would help them.

Furthermore, you should expect that the review process will prompt 
changes. My perception is that you consider the submission more or less 
final modulo possibly a few minor nits. You shouldn't. I'm convinced you 
know much more about SMP than most or all others in this group, but in 
no way that means your design has reached perfection and is beyond 
improvement even from a non-expert.

I know you'd have no problem finding the right voice in this discussion 
if you only frame it in the right light. Again, people are trying to 
help (however awkwardly) and in no way is that ridiculous.

 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.

 No. The only way Task is useful is by submitting it to a pool to be
 executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in the 
design could be drastically reduced (e.g. nested type inside of 
TaskPool) without prejudice. At the minimum I want to be able to create 
a task, launch it, and check its result later without involving a pool. 
A pool is when I have many tasks that may exceed the number of CPUs etc. 
Simplicity would be great.

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

There's no need at this level for a task pool. What would be nice would 
be to have a join() that joins all tasks spawned by the current thread:

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();

The way I see it is, task pools are an advanced means that coordinate m 
threads over n CPUs. If I don't care about that (as above) there should 
be no need for a pool at all. (Of course it's fine if used by the 
implementation.)

 Also it is my understanding that
 TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
 number of tasks. If that's the case, it's worth mentioning.

 Isn't this kind of obvious from the examples, etc.?

It wasn't to me. I had an intuition that that might be the case, but 
obvious - far from it. The fact that I need a task pool even when I 
don't care about m:n issues made me doubtful.

 * "If a Task has been submitted to a TaskPool instance, is being stored
 in a stack frame, and has not yet finished, the destructor for this
 struct will automatically call yieldWait() so that the task can finish
 and the stack frame can be destroyed safely." At this point in the doc
 the reader doesn't understand that at all because TaskPool has not been
 seen yet. The reader gets worried that she'll be essentially serializing
 the entire process by mistake. Either move this explanation down or
 provide an example.

 This is getting ridiculous. There are too many circular dependencies
 between Task and TaskPool that are impossible to remove here that I'm
 not even going to try. One or the other has to be introduced first, but
 neither can be explained without mentioning the other. This is why I
 explain the relationship briefly in the module level summary, so that
 the user has at least some idea. I think this is about the best I can do.

Perhaps cross-references could help (see macro XREF). As I mentioned, an 
example here could go a long way, too.

 * The example that reads two files at the same time should NOT use
 taskPool. It's just one task, why would the pool ever be needed? If you
 also provided an example that reads n files in memory at the same time
 using a pool, that would illustrate nicely why you need it. If a Task
 can't be launched without being put in a pool, there should be a
 possibility to do so. At my work we have a simple function called
 callInNewThread that does what's needed to launch a function in a new
 thread.

 I guess I could add something like this to Task. Under the hood it would
 (for implementation simplicity, to reuse a bunch of code from TaskPool)
 fire up a new single-thread pool, submit the task, call
 TaskPool.finish(), and then return. Since you're already creating a new
 thread, the extra overhead of creating a new TaskPool for the thread
 would be negligible and it would massively simplify the implementation.
 My only concern is that, when combined with scoped versus non-scoped
 tasks (which are definitely here to stay, see below) this small
 convenience function would add way more API complexity than it's worth.
 Besides, task pools don't need to be created explicitly anyhow. That's
 what the default instantiation is for. I don't see how callInNewThread
 would really solve much.

If we sit down for a second and think outside of the current design, the 
simplest Task use would be as such: "I start a task and get a sort of a 
future. I do something else. Then when I nede the result of the future I 
call some method force() against it." Nowhere is the notion of a task 
pool to be found. To me this looks like an artifact of the 
implementation has spilled into the API, which I understand how is 
natural to you. But also it's unnatural to me.

 Secondly, I think you're reading **WAY** too much into what was meant to
 be a simple example to illustrate usage mechanics. This is another case
 where I can't think of a small, cute example of where you'd really need
 the pool. There are plenty of larger examples, but the smallest/most
 self-contained one I can think of is a parallel sort. I decided to use
 file reading because it was good enough to illustrate the mechanics of
 usage, even if it didn't illustrate a particularly good use case.

It's impossible to not have a good small example. Sorting is great. You 
have the partition primitive already in std.algorithm, then off you go 
with tasks. Dot product on dense vectors is another good one. There's 
just plenty of operations that people understand are important to make fast.

So to me it looks like this:

* there are already some good examples of "I want 1-3 of tasks but don't 
care about pooling them": file reading and processing etc.

* there should be 1-2 good example of partitioning a large chunk of work 
into many tasks and let taskPool carry them.

With such, people would clearly understand the capabilities of the 
library and what they can achieve with various levels of involvement.

 * The note below that example gets me thinking: it is an artificial
 limitation to force users of Task to worry about scope and such. One
 should be able to create a Future object (Task I think in your
 terminology), pass it around like a normal value, and ultimately force
 it. This is the case for all other languages that implement futures. I
 suspect the "scope" parameter associated with the delegate a couple of
 definitions below plays a role here, but I think we need to work for
 providing the smoothest interface possible (possibly prompting
 improvements in the language along the way).

 This is what TaskPool.task is for. Maybe this should be moved to the top
 of the definition of TaskPool and emphasized, and the scoped/stack
 allocated versions should be moved below TaskPool and de-emphasized?

 At any rate, though, anyone who's playing with parallelism should be an
 advanced enough programmer that concepts like scope and destructors are
 second nature.

I agree. That's not a reason to not find good names for entities. To me 
the relationship "global task() <=> scoped delegate; TaskPool method 
task() <=> closure" is inscrutable. I see no way of remembering it 
except rote memorization.

 We absolutely **need** scope delegates for calling stuff where a closure
 can't be allocated due to objects on the stack frame having scoped
 destruction. This is not going anywhere. Also, I **much** prefer to have
 everything that does conceptually the same thing have the same name. I
 think this is clearer, not less clear.

I agree in principle, but this case is just not that.

 * As I mentioned, in the definition:

 Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

 I can't tell what "run" is.

 run() is just the adapter function described right above this code. I
 cannot for the life of me understand how this could possibly be unclear.

OK.

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

 No, it's because a goto of this form has no reasonable, useful
 semantics. I should probably mention in the docs that the same applies
 to labeled break and continue.

 I have no idea what semantics these should have, and even if I did,
 given the long odds that even one person would actually need them, I
 think they'd be more trouble than they're worth to implement. For
 example, once you break out of a parallel foreach loop to some arbitrary
 address (and different threads can goto different labels, etc.), well,
 it's no longer a parallel foreach loop. It's just a bunch of completely
 unstructured threading doing god-knows-what.

 Therefore, I slapped undefined behavior on it as a big sign that says,
 "Just don't do it." This also has the advantage that, if anyone ever
 thinks of any good, clearly useful semantics, these will be
 implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegate 
everywhere.

 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?

 I **STRONGLY** believe this does not belong in API documentation because
 it's too machine specific, compiler specific, stack alignment specific,
 etc. and almost any benchmark worth doing takes up more space than an
 example should. Furthermore, anyone who wants to know this can easily
 time it themselves. I have absolutely no intention of including this.
 While in general I appreciate and have tried to accommodate your
 suggestions, this is one I'll be standing firm on.

If scalability information is present in however a non-committal form, 
then people would be compelled ("ok, so this shape of the loop would 
actually scale linearly with CPUs... neat").

Speaking of efficiency, I assume parallel foreach uses opApply with a 
delegate and the inherent overhead. So I'm thinking that a practical way 
to take advantage of parallel foreach would be to parallelize at some 
block level and then do a regular foreach inside the block?

foreach (i, ref elem; taskPool.parallel(logs)) {
     foreach (???)
         elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs serially 
inside each pass?

 * The join example is fine, but the repetitive code suggests that loops
 should be better:

 import std.file;

 void main() {
 auto pool = new TaskPool();
 foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
 pool.put(task!read(filename));
 }

 // Call join() to guarantee that all tasks are done running, the worker
 // threads have terminated and that the results of all of the tasks can
 // be accessed without any synchronization primitives.
 pool.join();

 ubyte[][] data; // void[], meh
 // Use spinWait() since the results are guaranteed to have been computed
 // and spinWait() is the cheapest of the wait functions.
 foreach (task; pool.howDoIEnumerateTasks()) {
 data ~= task1.spinWait();
 }

 You can't enumerate the tasks like this, but there's a good reason for
 this limitation. The design requires the pool not hold onto any
 references to the tasks after they're out of the queue, so that they can
 be safely destroyed as soon as the caller wants to destroy them. If you
 need to enumerate over them like this, then:

 1. You're probably doing something wrong and parallel foreach would be a
 better choice.

 2. If you insist on doing this, you need to explicitly store them in an
 array or something.

 As a more general comment, I know the join() example sucks. join() is
 actually a seldom-used primitive, not a frequently used one as you
 suggest. Most of the time you wait for individual tasks (explicitly or
 implicitly through the higher level functions), not the whole pool, to
 finish executing. I feel like it's obligatory that join() and finish()
 exist, but I never use them myself. They were much more useful in
 pre-release versions of this lib, when parallel foreach didn't exist and
 it made sense to have explicit arrays of Tasks. The difficulty I've had
 in coming up with a decent example for them makes me halfway tempted to



I'm confused here. I use join() pretty much all the time. I launch stuff 
(e.g. run a large matrix-vector multiplication for distinct vectors) and 
then I join that. Then again and again. The thread of execution has a 
repeated hourglass shape - it fans out and then becomes single-threaded 
at join points. I assume some of those computations are indeed the 
charter of parallel foreach, but I'd guess not all. But then what do I 
know - I'm the customer :o).


Andrei

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved.
 I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

 * Still no synopsis example that illustrates in a catchy way the most
 attractive artifacts.

 I don't see what I could put here that isn't totally redundant with the
 rest of the documentation. Anything I could think of would basically
 just involve concatentating all the examples. Furthermore, none of the
 other Phobos modules have this, so I don't know what one should look
 like.

 I'm thinking along the lines of:

 http://www.digitalmars.com/d/2.0/phobos/std_exception.html

 A nice synopsis would be the pi computation. Just move that up to the
 synopsis. It's simple, clean, and easy to relate to. Generally, you'd
 put here not all details but the stuff you think would make it easiest
 for people to get into your library.

 In general I feel like std.parallelism is being held to a
 ridiculously high standard that none of the other Phobos modules
 currently meet.

 I agree, but that goes without saying. This is not a double standard;
 it's a simple means to improve quality of Phobos overall. Clearly
 certain modules that are already in Phobos would not make it if proposed
 today. And that's a good thing! Comparing our current ambitions to the
 quality of the average Phobos module would not help us.

 Generally it seems you have grown already tired of the exchange and it
 would take only a few more exchanges for you to say, "you know what,
 either let's get it over it or forget about it."

 Please consider for a minute how this is the wrong tone and attitude to
 be having on several levels. This is not a debate and not the place to
 get defensive. Your role in the discussion is not symmetric with the
 others' and at best you'd use the review as an opportunity to improve
 your design, not stick heels in the ground to defend its current
 incarnation (within reason). Your potential customers are attempting to
 help you by asking questions (some of which no doubt are silly) and by
 making suggestions (some of which, again, are ill-founded).
 Nevertheless, we _are_ your potential customers and in a way the
 customer is always right. Your art is to steer customers from what they
 think they want to what you know they need - because you're the expert!
 - and to improve design, nomenclature, and implementation to the extent
 that would help them.

 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.

 I know you'd have no problem finding the right voice in this discussion
 if you only frame it in the right light. Again, people are trying to
 help (however awkwardly) and in no way is that ridiculous.

Fair enough.  Now that I think of it most of my frustration is that 
these details are only getting looked at now, when I have a week (and an 
otherwise very busy week) to fix all this stuff, when this module has 
been officially in review for the past two weeks and unofficially for 
several months.  I would be much more open to this process if the issues 
raised could be fixed at my leisure rather than on a hard and tight 
deadline.  This is exacerbated by the fact that I have another 
important, unrelated deadline, also next Friday.

At the same time, though, I'm afraid that if we didn't fix a vote date 
and put some urgency into things, the pace of the reviews would be 
glacial at best, like it was for the first two weeks of official review 
and the months of unofficial review.

How about we make next Friday a soft deadline/start of voting?  It can 
be extended as necessary by mutual agreement of me and Lars (the review 
manager), and likely will be if the review process is still yielding 
good suggestions and/or I haven't had time to implement/clean up some 
key things.  Having a deadline breathing down your neck like this is 
really not conducive to being open to suggestions and thoughtful 
consideration, especially for issues that seem like fairly minor details.

Also increasing the deadline pressure issue, Michael Fortin just caused 
me to rethink the issue of exception handling in parallel foreach.  I 
had more-or-less working code for this, but I realized it's severely 
broken in subtle ways that I've (knock on wood) never actually run into 
in real world code.  It's gonna take some time to fix.  These kinds of 
issues with error handling code can very easily slip under the radar in 
a library with only a few users, and unfortunately, one has.  I 
definitely know how to fix it, but I have to implement the fix and it's 
somewhat non-trivial, i.e. I'm debugging it now and it's looking like a 
marathon debugging session.

Mar 19 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this discussion
 if you only frame it in the right light. Again, people are trying to
 help (however awkwardly) and in no way is that ridiculous.

 Fair enough. Now that I think of it most of my frustration is that these
 details are only getting looked at now, when I have a week (and an
 otherwise very busy week) to fix all this stuff, when this module has
 been officially in review for the past two weeks and unofficially for
 several months. I would be much more open to this process if the issues
 raised could be fixed at my leisure rather than on a hard and tight
 deadline. This is exacerbated by the fact that I have another important,
 unrelated deadline, also next Friday.

 At the same time, though, I'm afraid that if we didn't fix a vote date
 and put some urgency into things, the pace of the reviews would be
 glacial at best, like it was for the first two weeks of official review
 and the months of unofficial review.

Exactly. I understand. Well here's a suggestion. How about we "git 
stash" this review? I recall another submission was vying for the queue 
so we can proceed with that while you have time to operate the changes 
you want. If not, we can simply wait 2-3 weeks and then have you 
resubmit for a shorter process (1-2 weeks) that would gather more 
feedback (hopefully minor that time) and votes.

Lars?

 Also increasing the deadline pressure issue, Michael Fortin just caused
 me to rethink the issue of exception handling in parallel foreach. I had
 more-or-less working code for this, but I realized it's severely broken
 in subtle ways that I've (knock on wood) never actually run into in real
 world code. It's gonna take some time to fix. These kinds of issues with
 error handling code can very easily slip under the radar in a library
 with only a few users, and unfortunately, one has. I definitely know how
 to fix it, but I have to implement the fix and it's somewhat
 non-trivial, i.e. I'm debugging it now and it's looking like a marathon
 debugging session.

Fixing these sooner rather than later would be great, so  all the better 
for suspending the review.

Andrei

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 6:58 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this discussion
 if you only frame it in the right light. Again, people are trying to
 help (however awkwardly) and in no way is that ridiculous.

 Fair enough. Now that I think of it most of my frustration is that these
 details are only getting looked at now, when I have a week (and an
 otherwise very busy week) to fix all this stuff, when this module has
 been officially in review for the past two weeks and unofficially for
 several months. I would be much more open to this process if the issues
 raised could be fixed at my leisure rather than on a hard and tight
 deadline. This is exacerbated by the fact that I have another important,
 unrelated deadline, also next Friday.

 At the same time, though, I'm afraid that if we didn't fix a vote date
 and put some urgency into things, the pace of the reviews would be
 glacial at best, like it was for the first two weeks of official review
 and the months of unofficial review.

 Exactly. I understand. Well here's a suggestion. How about we "git
 stash" this review? I recall another submission was vying for the queue
 so we can proceed with that while you have time to operate the changes
 you want. If not, we can simply wait 2-3 weeks and then have you
 resubmit for a shorter process (1-2 weeks) that would gather more
 feedback (hopefully minor that time) and votes.

Sounds like a plan.

Mar 19 2011

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Sat, 19 Mar 2011 17:58:46 -0500, Andrei Alexandrescu wrote:

 On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this
 discussion if you only frame it in the right light. Again, people are
 trying to help (however awkwardly) and in no way is that ridiculous.

 Fair enough. Now that I think of it most of my frustration is that
 these details are only getting looked at now, when I have a week (and
 an otherwise very busy week) to fix all this stuff, when this module
 has been officially in review for the past two weeks and unofficially
 for several months. I would be much more open to this process if the
 issues raised could be fixed at my leisure rather than on a hard and
 tight deadline. This is exacerbated by the fact that I have another
 important, unrelated deadline, also next Friday.

 At the same time, though, I'm afraid that if we didn't fix a vote date
 and put some urgency into things, the pace of the reviews would be
 glacial at best, like it was for the first two weeks of official review
 and the months of unofficial review.

 
 Exactly. I understand. Well here's a suggestion. How about we "git
 stash" this review? I recall another submission was vying for the queue
 so we can proceed with that while you have time to operate the changes
 you want. If not, we can simply wait 2-3 weeks and then have you
 resubmit for a shorter process (1-2 weeks) that would gather more
 feedback (hopefully minor that time) and votes.
 
 Lars?

Sounds good.  I'll make the announcement.

-Lars

Mar 20 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.

In addition the the deadline issues already mentioned and resolved, I 
did misunderstand the review process somewhat.  I didn't participate in 
the reviews for std.datetime (because I know nothing about what makes a 
good date/time lib) or for std.unittest (because I was fairly ambivalent 
about it), so I didn't learn anything from them.  I was under the 
impression that the module is **expected** to be very close to its final 
form and that, if a lot of issues are found, then that basically means 
the proposal is going to be rejected.

Mar 19 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.

 
 In addition the the deadline issues already mentioned and resolved, I
 did misunderstand the review process somewhat.  I didn't participate in
 the reviews for std.datetime (because I know nothing about what makes a
 good date/time lib) or for std.unittest (because I was fairly ambivalent
 about it), so I didn't learn anything from them.  I was under the
 impression that the module is **expected** to be very close to its final
 form and that, if a lot of issues are found, then that basically means
 the proposal is going to be rejected.

Both std.datetime and std.unittests underwent a fair number of changes over the 
course the review process. A lot of the stuff stayed the same, but a lot of it 
changed too. On the whole, though, the end results were much better for it.

- Jonathan M Davis

Mar 19 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.

 
 In addition the the deadline issues already mentioned and resolved, I
 did misunderstand the review process somewhat.  I didn't participate in
 the reviews for std.datetime (because I know nothing about what makes a
 good date/time lib) or for std.unittest (because I was fairly ambivalent
 about it), so I didn't learn anything from them.  I was under the
 impression that the module is **expected** to be very close to its final
 form and that, if a lot of issues are found, then that basically means
 the proposal is going to be rejected.

Both std.datetime and std.unittests underwent a fair number of changes over the 
course the review process. A lot of the stuff stayed the same, but a lot of it 
changed too. On the whole, though, the end results were much better for it.

- Jonathan M Davis

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 8:48 PM, Jonathan M Davis wrote:
 On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.

 In addition the the deadline issues already mentioned and resolved, I
 did misunderstand the review process somewhat.  I didn't participate in
 the reviews for std.datetime (because I know nothing about what makes a
 good date/time lib) or for std.unittest (because I was fairly ambivalent
 about it), so I didn't learn anything from them.  I was under the
 impression that the module is **expected** to be very close to its final
 form and that, if a lot of issues are found, then that basically means
 the proposal is going to be rejected.

 Both std.datetime and std.unittests underwent a fair number of changes over the
 course the review process. A lot of the stuff stayed the same, but a lot of it
 changed too. On the whole, though, the end results were much better for it.

 - Jonathan M Davis

Please check your newsreader settings.  You've been double-posting a lot 
lately.

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved.
 I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

 * Still no synopsis example that illustrates in a catchy way the most
 attractive artifacts.

 I don't see what I could put here that isn't totally redundant with the
 rest of the documentation. Anything I could think of would basically
 just involve concatentating all the examples. Furthermore, none of the
 other Phobos modules have this, so I don't know what one should look
 like.

 I'm thinking along the lines of:

 http://www.digitalmars.com/d/2.0/phobos/std_exception.html

 A nice synopsis would be the pi computation. Just move that up to the
 synopsis. It's simple, clean, and easy to relate to. Generally, you'd
 put here not all details but the stuff you think would make it easiest
 for people to get into your library.

Good example, will do.

 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.

 No. The only way Task is useful is by submitting it to a pool to be
 executed. (Though this may change, see below.)

 I very much hope this does change. Otherwise the role of Task in the
 design could be drastically reduced (e.g. nested type inside of
 TaskPool) without prejudice. At the minimum I want to be able to create
 a task, launch it, and check its result later without involving a pool.
 A pool is when I have many tasks that may exceed the number of CPUs etc.
 Simplicity would be great.

 // start three reads
 auto readFoo = task!readText("foo.txt");
 auto readBar = task!readText("bar.txt");
 auto readBaz = task!readText("baz.txt");
 // join'em all
 auto foo = readFoo.yieldWait();
 auto bar = readBar.yieldWait();
 auto baz = readBaz.yieldWait();

This is definitely feasible in principle.  I'd like to implement it, but 
there's a few annoying, hairy details standing in the way.  For reasons 
I detailed previously, we need both scoped and non-scoped tasks.  We 
also have alias vs. callable (i.e. function pointer or delegate) tasks. 
  Now we're adding pool vs. new-thread tasks.  This is turning into a 
combinatorial explosion and needs to be simplified somehow.  I propose 
the following:

1.  I've reconsidered and actually like the idea of task() vs. 
scopedTask().  task() returns a pointer on the heap.  scopedTask() 
returns a struct on the stack.  Neither would be a member function of 
TaskPool.

2.  Non-scoped Task pointers would need to be explicitly submitted to 
the task pool via the put() method.  This means getting rid of 
TaskPool.task().

3.  The Task struct would grow a function runInNewThread() or something 
similar.  (If you think this would be a common case, even just execute() 
might cut it.)

The work flow would now be that you call task() to get a heap-allocated 
Task*, or scopedTask to get a stack-allocated Task.  You then call 
either TaskPool.put() to execute it on a pool or Task.runInNewThread() 
to run it in a new thread.  The creation of the Task is completely 
orthogonal to how it's run.

 There's no need at this level for a task pool. What would be nice would
 be to have a join() that joins all tasks spawned by the current thread:

 // start three reads
 auto readFoo = task!readText("foo.txt");
 auto readBar = task!readText("bar.txt");
 auto readBaz = task!readText("baz.txt");
 // join'em all
 join();
 // fetch results
 auto foo = readFoo.spinWait();
 auto bar = readBar.spinWait();
 auto baz = readBaz.spinWait();

I don't understand how this would be a substantial improvement over the 
first example, where you just call yieldWait() on all three. 
Furthermore, implementing join() as shown in this example would require 
some kind of central registry of all tasks/worker threads/task 
pools/something similar, which would be a huge PITA to implement 
efficiently.

 Secondly, I think you're reading **WAY** too much into what was meant to
 be a simple example to illustrate usage mechanics. This is another case
 where I can't think of a small, cute example of where you'd really need
 the pool. There are plenty of larger examples, but the smallest/most
 self-contained one I can think of is a parallel sort. I decided to use
 file reading because it was good enough to illustrate the mechanics of
 usage, even if it didn't illustrate a particularly good use case.

 It's impossible to not have a good small example. Sorting is great. You
 have the partition primitive already in std.algorithm, then off you go
 with tasks. Dot product on dense vectors is another good one. There's
 just plenty of operations that people understand are important to make
 fast.

I forgot about std.algorithm.partition.  This makes a parallel quick 
sort so trivial to implement (ignoring the issue of selecting a good 
pivot, which I think can be safely ignored in example code) that it 
might actually make a good example.


 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?

 No, it's because a goto of this form has no reasonable, useful
 semantics. I should probably mention in the docs that the same applies
 to labeled break and continue.

 I have no idea what semantics these should have, and even if I did,
 given the long odds that even one person would actually need them, I
 think they'd be more trouble than they're worth to implement. For
 example, once you break out of a parallel foreach loop to some arbitrary
 address (and different threads can goto different labels, etc.), well,
 it's no longer a parallel foreach loop. It's just a bunch of completely
 unstructured threading doing god-knows-what.

 Therefore, I slapped undefined behavior on it as a big sign that says,
 "Just don't do it." This also has the advantage that, if anyone ever
 thinks of any good, clearly useful semantics, these will be
 implementable without breaking code later.

 Yah, I was actually thinking of disabling goto outside a local delegate
 everywhere.

See the discussion I'm having with Michael Fortin.  My latest idea is 
that break, labeled break/continue, return, and goto should all throw 
exceptions when found inside a parallel foreach loop.  They affect the 
flow of subsequent iterations, and "subsequent iterations" only makes 
sense when the loop is being executed in serial.

 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?

 I **STRONGLY** believe this does not belong in API documentation because
 it's too machine specific, compiler specific, stack alignment specific,
 etc. and almost any benchmark worth doing takes up more space than an
 example should. Furthermore, anyone who wants to know this can easily
 time it themselves. I have absolutely no intention of including this.
 While in general I appreciate and have tried to accommodate your
 suggestions, this is one I'll be standing firm on.

 If scalability information is present in however a non-committal form,
 then people would be compelled ("ok, so this shape of the loop would
 actually scale linearly with CPUs... neat").

Ok, I thought you were asking for something much more rigorous than 
this.  I therefore didn't want to provide it because I figured that, no 
matter what I did, someone would be able to say that the benchmark is 
flawed some how, yada, yada, yada.  Given how inexact a science 
benchmarking is, I'm still hesitant to put such results in API docs, but 
I can see where you're coming from here.

 Speaking of efficiency, I assume parallel foreach uses opApply with a
 delegate and the inherent overhead. So I'm thinking that a practical way
 to take advantage of parallel foreach would be to parallelize at some
 block level and then do a regular foreach inside the block?

 foreach (i, ref elem; taskPool.parallel(logs)) {
 foreach (???)
 elem = log(i + 1.0);
 }

 How can I arrange things such that I compute e.g. 64 logs serially
 inside each pass?

Three things here:

1.  For this kind of nano-parallelism, map() might be better suited.  To 
fill an existing array, just use map's explicit buffer feature.

taskPool.map!log(iota(1, logs.length + 1), logs);

The option of using map() for nano-parallelism is part of my rationale 
for keeping the pretty but mildly inefficient delegate call in parallel 
foreach.

2.  You're severely overestimating the overhead of the delegate call.
log() is a pretty cheap function and even so speedups are decent with 
parallel foreach compared to a regular foreach loop.

import std.stdio, std.parallelism, std.datetime, std.math;


void main() {
     auto logs = new float[10_000_000];
     taskPool();  // Initialize TaskPool before timing.

     auto sw = StopWatch(autoStart);
     foreach(i, ref elem; logs) {
         elem = log(i + 1);
     }
     writeln("Serial:  ", sw.peek.msecs);

     sw.reset();
     foreach(i, ref elem; parallel(logs)) {
         elem = log(i + 1);
     }
     writeln("Parallel Foreach:  ", sw.peek.msecs);
}


Results:

Serial:  619
Parallel Foreach:  388

I'd include parallel map, too, but for some reason (probably stack 
alignment or something) including it changes the results for parallel 
foreach.

3.  If you really want to do as you suggest, just make a chunks range or 
something (maybe this should be in std.range) that iterates over all 
non-overlapping size N slices of a sliceable range.  Use this for the 
parallel loop, then loop over the individual elements of the slice inside.

 I'm confused here. I use join() pretty much all the time. I launch stuff
 (e.g. run a large matrix-vector multiplication for distinct vectors) and
 then I join that. Then again and again. The thread of execution has a
 repeated hourglass shape - it fans out and then becomes single-threaded
 at join points. I assume some of those computations are indeed the
 charter of parallel foreach, but I'd guess not all. But then what do I
 know - I'm the customer :o).

Yes, parallel foreach is idiomatic here, or maybe parallel map.

In the pre-release versions of std.parallelism (my experimentation with 
parallelism libraries in D goes back to mid-2008, over a year before I 
released the first version as Parallelfuture), only the task primitive 
existed.  I discovered that, in practice, creating Task objects in a 
loop is almost always a PITA.  It can also be inefficient in that, in a 
naive implementation all of these exist in memory simultaneously when 
this might be completely unnecessary.  I decided to create higher level 
primitives to handle all the use cases I could think of where you might 
otherwise want to create Task objects in a loop.  If you're explicitly 
creating Task objects in a loop in std.parallelism, I can just about 
guarantee that there's an easier and at least equally efficient way to 
accomplish what you want to accomplish.  If there are any important use 
cases I missed, I'd be much more interested in creating a few more 
high-level primitives rather than making it easier to work with Task 
objects created in a loop.

Mar 19 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 10:16 PM, dsimcha wrote:
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?

 I **STRONGLY** believe this does not belong in API documentation because
 it's too machine specific, compiler specific, stack alignment specific,
 etc. and almost any benchmark worth doing takes up more space than an
 example should. Furthermore, anyone who wants to know this can easily
 time it themselves. I have absolutely no intention of including this.
 While in general I appreciate and have tried to accommodate your
 suggestions, this is one I'll be standing firm on.

 If scalability information is present in however a non-committal form,
 then people would be compelled ("ok, so this shape of the loop would
 actually scale linearly with CPUs... neat").

 Ok, I thought you were asking for something much more rigorous than
 this. I therefore didn't want to provide it because I figured that, no
 matter what I did, someone would be able to say that the benchmark is
 flawed some how, yada, yada, yada. Given how inexact a science
 benchmarking is, I'm still hesitant to put such results in API docs, but
 I can see where you're coming from here.

I tried putting a few benchmarks in for the performance of parallel 
foreach, map and reduce.  Basically, I just put the timings in for the 
examples on my dual core machine, and for an equivalent serial version 
that is not shown in the interest of brevity but is easily inferred.  I 
didn't do this for the pipelining or future/promise primitives because 
these aren't as capable of automatically taking full advantage of as 
many cores as you can throw at them as map, reduce and foreach.  This 
means that coming up with a benchmark that shows where a good speedup 
can be obtained is a much more difficult art.

Mar 20 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 * "workUnitSize: The number of elements to evaluate in a single Task.
 Must be less than or equal to bufSize, and in practice should be a
 fraction of bufSize such that all worker threads can be used." Then why
 not specify a different parameter such a multiplier or something? The
 dependence between bufSize and worUnitSize is a sign that these two
 should be improved. If you have good reasons that the user must have the
 parameters in this form, give an example substantiating that.

I did this for consistency with all the other functions, where work unit 
size is specified directly.  I don't want lazyMap to be the oddball 
function where it's specified in a completely different way.

Mar 19 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 19 March 2011 09:03:51 Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:
 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

 
 * When using "parallelism" as a common noun, prefix is with a '_' so
 ddoc doesn't underline it.

Oooh. That's a neat trick. I didn't know that you could do that.

- Jonathan M Davis

Mar 19 2011

D Programming

C/C++ Programming

Other

digitalmars.D - review of std.parallelism