www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - review of std.parallelism

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
0. Overview and vote

I think the library delivers the high-level goods (parallel foreach, 
map, reduce) but is a bit fuzzy in the lower level details.

Documentation hurts understanding of the capabilities of the library and 
essentially is of inadequate quality. Entity documentation and examples 
do little more than give the impression of dispensing with an unpleasant 
chore.

My vote in favor of acceptance is contingent upon a radical improvement 
in the quality of documentation and examples. Most if not all artifacts 
should be motivated by simple, compelling examples. The introductory 
section must contain a brief and attractive synopsis of the flagship 
capabilities. All terms must be defined before being used and introduced 
in a carefully-chosen order. The relationship between various entities 
should be clarified.

I've seen the argument that simple and strong examples are difficult to 
find. Though I agree such examples are not easy to come up with, I also 
believe the author of library is in the best position to produce them.

================

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still 
usable but buffers lazily to an array..." Wouldn't strided processing 
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on 
1, 5, 9, ... and so on.

* Example with squares would be better if double replaced uint, and if a 
more complicated operation (sqrt, log...) were involved.

* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious 
method should be better: just use one circular buffer. The presence of 
two dependent parameters makes this abstraction difficult to operate with.

* Same question about asyncBuf. What is wrong with a circular buffer 
filled on one side by threads and on the consumed from the other by the 
client? I can think of a couple of answers but it would be great if they 
were part of the documentation.

* Why not make workerIndex a ulong and be done with it?

* Is stop() really trusted or just unsafe? If it's forcibly killing 
threads then its unsafe.

* uint size() should be size_t for conformity.

* defaultPoolThreads - should it be a  property?

* I don't understand what Task is supposed to do. It is circularly 
defined: "A struct that encapsulates the information about a task, 
including its current status, what pool it was submitted to, and its 
arguments." OK, but what's a task? Could a task be used outside a task 
pool, and if so to what effect(s)?

* If LazyMap is only necessary as the result of lazyMap, could that 
become a local type inside lazyMap?


2. Documentation:

* Documentation unacceptable. It lacks precision and uses terms before 
defining them. For example: "This class encapsulates a task queue and a 
set of worker threads" comes before the notions of "task queue" and 
"worker thread" are defined. Clearly there is an intuition of what those 
are supposed to mean, but in practice each library lays down some fairly 
detailed definition of such.

* "Note: Initializing a pool with zero threads (as would happen in the 
case of a single-core CPU) is well-tested and does work." The absence of 
a bug should not be advertised in so many words. Simpler: "Note: 
Single-CPU machines will operate transparently with zero-sized pools."

* "Allows for custom pool size." I have no idea what pool size means.

* "// Do something interesting." Worst example ever. You are the creator 
of the library. If _you_ can't find a brief compelling example, who could?

* "Immediately after the range argument, an optional block size argument 
may be provided. If none is provided, the default block size is used. An 
optional buffer for returining the results may be provided as the last 
argument. If one is not provided, one will be automatically allocated. 
If one is provided, it must be the same length as the range." An example 
should be inserted after each of these sentences.

* "// Do something expensive with line." Better you do something 
expensive with line.

* "This is not the same thing as commutativity." I think this is obvious 
enough to be left out. The example of matrices (associative but not 
commutative) is nice though.

* "immutable n = 1000000000;" -> use underscores

* Would be great if the documentation examples included some rough 
experimental results, e.g. "3.8 times faster on a 4-core machine than 
the equivalent serial loop".

* No example for workerIndex and why it's useful.

* Is LocalWorker better than WorkerLocal? No, because it's not the 
worker that's local, it's the storage - which is missing from the name! 
WorkerLocalStorage? Also replace "create" with "make" or drop it 
entirely. The example doesn't tell me how I can use bufs. I suspect 
workerIndex has something to do with it but the example fails to divulge 
that relationship.

* No example for join(), which is quite a central primitive.

* No example for put(), which would be interesting to see.

* No example for task().

* No example for run().

* The example for Task uses myFuture without defining it. In case Task 
defines the promise/future idiom, the documentation misses a terrific 
opportunity of clarifying that.

* What is 'run' in the definition of safe task()?

* Why is AsyncBuf's doc far away from the function returning it? Same 
about WorkerLocal.


Andrei
Mar 18 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Thanks for the advice.  You mentioned in the past that the documentation 
was inadequate but didn't give enough specifics as to how until now.  As 
the author of the library, things seem obvious to me that don't seem 
obvious to anyone else, so I don't feel that I'm in a good position to 
judge the quality of the documentation and where it needs improvement. 
I plan to fix most of the issues you raised, but I've left comments for 
the few that I can't/won't fix or believe are based on misunderstandings 
below.

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.
You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
 * I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
 method should be better: just use one circular buffer. The presence of
 two dependent parameters makes this abstraction difficult to operate with.

 * Same question about asyncBuf. What is wrong with a circular buffer
 filled on one side by threads and on the consumed from the other by the
 client? I can think of a couple of answers but it would be great if they
 were part of the documentation.
Are you really suggesting I give detailed rationales for implementation decisions in the documentation? Anyhow, the two reasons for this choice are to avoid needing synchronization/atomic ops/etc. on every write to the buffer (which we would need since it can be read and written concurrently and we need to track whether we have space to write to) and because parallel map works best when it operates on relatively large buffers, resulting in minimal synchronization overhead per element. (Under the hood, the buffer is filled and then eager parallel map is called.)
 * Why not make workerIndex a ulong and be done with it?
I doubt anyone's really going to create anywhere near 4 billion TaskPool threads over the lifetime of a program. Part of the point of TaskPool is recycling threads rather than paying the overhead of creating and destroying them. Using a ulong on a 32-bit architecture would make worker-local storage substantially slower. workerIndex is how worker-local storage works under the hood, so it needs to be fast.
 * No example for workerIndex and why it's useful.
It should just be private. The fact that it's public is an artifact of when I was designing worker-local storage and didn't know how it was going to work yet. I never thought to revisit this until now. It really isn't useful to client code.
 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.
It's not forcibly killing threads. As the documentation states, it has no effect on jobs already executing, only ones in the queue. Furthermore, it's needed unless makeDaemon is called. Speaking of which, makeDaemon and makeAngel should probably be trusted, too.
 * defaultPoolThreads - should it be a  property?
Yes. In spirit it's a global variable. It requires some extra machinations, though, to be threadsafe, which is why it's not implemented as a simple global variable.
 * No example for task().
???? Yes there is, for both flavors, though these could admittedly be improved. Only the safe version doesn't have an example, and this is just a more restricted version of the function pointer case, so it seems silly to make a separate example for it.
 * What is 'run' in the definition of safe task()?
It's just the run() adapter function. Isn't that obvious?
Mar 18 2011
next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:

 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.
You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly. -- Simen
Mar 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.
You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly.
I don't understand how this can be demonstrated in an example. It's an under-the-hood thing. The only place this appears in the API is in the workUnitSize parameter.
Mar 19 2011
parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Sat, 19 Mar 2011 13:51:56 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works  
on
 1, 5, 9, ... and so on.
You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly.
I don't understand how this can be demonstrated in an example. It's an under-the-hood thing. The only place this appears in the API is in the workUnitSize parameter.
Yeah, scratch that. I for some reason thought the "array of size workUnitSize" was global, but it's per thread, innit? Seems so logical now. -- Simen
Mar 19 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/18/2011 11:40 PM, dsimcha wrote:
 Thanks for the advice. You mentioned in the past that the documentation
 was inadequate but didn't give enough specifics as to how until now. As
 the author of the library, things seem obvious to me that don't seem
 obvious to anyone else, so I don't feel that I'm in a good position to
 judge the quality of the documentation and where it needs improvement. I
 plan to fix most of the issues you raised, but I've left comments for
 the few that I can't/won't fix or believe are based on misunderstandings
 below.
Great, thanks.
 On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.
You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
Got it.
 * Why not make workerIndex a ulong and be done with it?
I doubt anyone's really going to create anywhere near 4 billion TaskPool threads over the lifetime of a program. Part of the point of TaskPool is recycling threads rather than paying the overhead of creating and destroying them. Using a ulong on a 32-bit architecture would make worker-local storage substantially slower. workerIndex is how worker-local storage works under the hood, so it needs to be fast.
If you're confident that overflow won't occur, you may want to eliminate that detail from the docs. It throws off the reader.
  > * No example for workerIndex and why it's useful.

 It should just be private. The fact that it's public is an artifact of
 when I was designing worker-local storage and didn't know how it was
 going to work yet. I never thought to revisit this until now. It really
 isn't useful to client code.
It could be public and undocumented.
 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.
It's not forcibly killing threads. As the documentation states, it has no effect on jobs already executing, only ones in the queue. Furthermore, it's needed unless makeDaemon is called. Speaking of which, makeDaemon and makeAngel should probably be trusted, too.
Great. The more safe/trusted, the better.
 * defaultPoolThreads - should it be a  property?
Yes. In spirit it's a global variable. It requires some extra machinations, though, to be threadsafe, which is why it's not implemented as a simple global variable.
Then it should be a property. I think ddoc doesn't reflect that, but an example could.
 * No example for task().
???? Yes there is, for both flavors, though these could admittedly be improved. Only the safe version doesn't have an example, and this is just a more restricted version of the function pointer case, so it seems silly to make a separate example for it.
Towards the bottom of the document there are overloads of task that don't have examples.
 * What is 'run' in the definition of safe task()?
It's just the run() adapter function. Isn't that obvious?
I'm referring to this: Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, Args args); What is "run"? Andrei
Mar 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
 On 03/18/2011 11:40 PM, dsimcha wrote:
 It should just be private. The fact that it's public is an artifact of
 when I was designing worker-local storage and didn't know how it was
 going to work yet. I never thought to revisit this until now. It really
 isn't useful to client code.
It could be public and undocumented.
I've already made it private. I can't see what purpose having it public would serve. The fact that it was public before was **purely** an oversight.
 * defaultPoolThreads - should it be a  property?
Yes. In spirit it's a global variable. It requires some extra machinations, though, to be threadsafe, which is why it's not implemented as a simple global variable.
Then it should be a property. I think ddoc doesn't reflect that, but an example could.
Right. It's always been property but ddoc doesn't reflect this. I've changed the docs slightly to call it a "property" instead of a "function". It seems like overkill to me to give examples for this, though, since it's just a getter and a setter.
 * No example for task().
???? Yes there is, for both flavors, though these could admittedly be improved. Only the safe version doesn't have an example, and this is just a more restricted version of the function pointer case, so it seems silly to make a separate example for it.
Towards the bottom of the document there are overloads of task that don't have examples.
You mean TaskPool.task()? Since these are such slight variations of the other overloads, I thought an example would be overkill. Since people less familiar with the library don't think so, though, I've added examples that are accordingly slight variations of the examples for the other overloads.
 * What is 'run' in the definition of safe task()?
It's just the run() adapter function. Isn't that obvious?
I'm referring to this: Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, Args args); What is "run"?
Ok, I ended up moving the docs for run() directly above the stuff that uses it. run() is described as "Calls a delegate or function pointer with args. This is an adapter that makes Task work with delegates, function pointers and functors instead of just aliases. It is included in the documentation to clarify how this case is handled, but is not meant to be used directly by client code." I know the Higher Principles of Encapsulation say this should be private and the relevant overloads should return auto. I strongly believe, though, that being anal about encapsulation of this detail is silly since it is so unlikely to change and that exposing it helps to clarify what's really going on here. Encapsulation is good up to a point, but sometimes it's just easier to think about things when you know how they really work at a concrete level, and this tradeoff needs to be weighted.
Mar 19 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/19/2011 10:28 AM, dsimcha wrote:
 On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
 Towards the bottom of the document there are overloads of task that
 don't have examples.
You mean TaskPool.task()? Since these are such slight variations of the other overloads, I thought an example would be overkill. Since people less familiar with the library don't think so, though, I've added examples that are accordingly slight variations of the examples for the other overloads.
A great way to handle bunches of almost-identical overloads is to group them together with /// ditto and explain the slight differences in the consolidated documentation. Andrei
Mar 19 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Ok, thanks again for clarifying **how** the docs could be improved. 
I've implemented the suggestions and generally given the docs a good 
reading over and clean up.  The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
 0. Overview and vote

 I think the library delivers the high-level goods (parallel foreach,
 map, reduce) but is a bit fuzzy in the lower level details.

 Documentation hurts understanding of the capabilities of the library and
 essentially is of inadequate quality. Entity documentation and examples
 do little more than give the impression of dispensing with an unpleasant
 chore.

 My vote in favor of acceptance is contingent upon a radical improvement
 in the quality of documentation and examples. Most if not all artifacts
 should be motivated by simple, compelling examples. The introductory
 section must contain a brief and attractive synopsis of the flagship
 capabilities. All terms must be defined before being used and introduced
 in a carefully-chosen order. The relationship between various entities
 should be clarified.

 I've seen the argument that simple and strong examples are difficult to
 find. Though I agree such examples are not easy to come up with, I also
 believe the author of library is in the best position to produce them.

 ================

 1. Library proper:

 * "In the case of non-random access ranges, parallel foreach is still
 usable but buffers lazily to an array..." Wouldn't strided processing
 help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
 1, 5, 9, ... and so on.

 * Example with squares would be better if double replaced uint, and if a
 more complicated operation (sqrt, log...) were involved.

 * I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
 method should be better: just use one circular buffer. The presence of
 two dependent parameters makes this abstraction difficult to operate with.

 * Same question about asyncBuf. What is wrong with a circular buffer
 filled on one side by threads and on the consumed from the other by the
 client? I can think of a couple of answers but it would be great if they
 were part of the documentation.

 * Why not make workerIndex a ulong and be done with it?

 * Is stop() really trusted or just unsafe? If it's forcibly killing
 threads then its unsafe.

 * uint size() should be size_t for conformity.

 * defaultPoolThreads - should it be a  property?

 * I don't understand what Task is supposed to do. It is circularly
 defined: "A struct that encapsulates the information about a task,
 including its current status, what pool it was submitted to, and its
 arguments." OK, but what's a task? Could a task be used outside a task
 pool, and if so to what effect(s)?

 * If LazyMap is only necessary as the result of lazyMap, could that
 become a local type inside lazyMap?


 2. Documentation:

 * Documentation unacceptable. It lacks precision and uses terms before
 defining them. For example: "This class encapsulates a task queue and a
 set of worker threads" comes before the notions of "task queue" and
 "worker thread" are defined. Clearly there is an intuition of what those
 are supposed to mean, but in practice each library lays down some fairly
 detailed definition of such.

 * "Note: Initializing a pool with zero threads (as would happen in the
 case of a single-core CPU) is well-tested and does work." The absence of
 a bug should not be advertised in so many words. Simpler: "Note:
 Single-CPU machines will operate transparently with zero-sized pools."

 * "Allows for custom pool size." I have no idea what pool size means.

 * "// Do something interesting." Worst example ever. You are the creator
 of the library. If _you_ can't find a brief compelling example, who could?

 * "Immediately after the range argument, an optional block size argument
 may be provided. If none is provided, the default block size is used. An
 optional buffer for returining the results may be provided as the last
 argument. If one is not provided, one will be automatically allocated.
 If one is provided, it must be the same length as the range." An example
 should be inserted after each of these sentences.

 * "// Do something expensive with line." Better you do something
 expensive with line.

 * "This is not the same thing as commutativity." I think this is obvious
 enough to be left out. The example of matrices (associative but not
 commutative) is nice though.

 * "immutable n = 1000000000;" -> use underscores

 * Would be great if the documentation examples included some rough
 experimental results, e.g. "3.8 times faster on a 4-core machine than
 the equivalent serial loop".

 * No example for workerIndex and why it's useful.

 * Is LocalWorker better than WorkerLocal? No, because it's not the
 worker that's local, it's the storage - which is missing from the name!
 WorkerLocalStorage? Also replace "create" with "make" or drop it
 entirely. The example doesn't tell me how I can use bufs. I suspect
 workerIndex has something to do with it but the example fails to divulge
 that relationship.

 * No example for join(), which is quite a central primitive.

 * No example for put(), which would be interesting to see.

 * No example for task().

 * No example for run().

 * The example for Task uses myFuture without defining it. In case Task
 defines the promise/future idiom, the documentation misses a terrific
 opportunity of clarifying that.

 * What is 'run' in the definition of safe task()?

 * Why is AsyncBuf's doc far away from the function returning it? Same
 about WorkerLocal.


 Andrei
Mar 19 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* When using "parallelism" as a common noun, prefix is with a '_' so ddoc doesn't underline it. * "Most of this module completely subverts..." Vague characterizations ("most", "completely", "some") don't belong in a technical documentation. (For example there's either subversion going on or there isn't.) Also, std.concurrency and std.parallelism address different needs so there's little competition between them. Better: "Unless explicitly marked as $(D trusted) or $(D safe), artifacts in this module are not provably memory-safe and cannot be used with SafeD. If used as documented, memory safety is guaranteed." * Speaking of std.concurrency vs. std.parallelism, the first paragraph might be something like: "This module implements high-level primitives for shared memory SMP parallelism. These include parallel foreach, parallel reduce, parallel eager map, pipelining and future/promise parallelism primitives. $(D std._parallelism) is best recommended when the same operation is to be executed in parallel over different data. For communication between arbitrary threads, see $(D std.concurrency)." * Still no synopsis example that illustrates in a catchy way the most attractive artifacts. * "After creation, Task objects are submitted to a TaskPool for execution." I understand it's possible to use Task straight as a promise/future, so s/are/may be/. Also it is my understanding that TaskPool efficiently adapts the concrete number of CPUs to an arbitrary number of tasks. If that's the case, it's worth mentioning. * "A call to workWait(), yieldWait(), or spinWait() can be used to retrive the return value after the function is finished executing." As an aside, a spell checking step would be great ("retrieve") - just put the text in an editor with spellchecking. I think what this means is: "The methods workWait(), yieldWait(), or spinWait() make sure that the function finishes execution and then return its result to the initiating thread. Each uses a different waiting strategy as detailed below." * "If a Task has been submitted to a TaskPool instance, is being stored in a stack frame, and has not yet finished, the destructor for this struct will automatically call yieldWait() so that the task can finish and the stack frame can be destroyed safely." At this point in the doc the reader doesn't understand that at all because TaskPool has not been seen yet. The reader gets worried that she'll be essentially serializing the entire process by mistake. Either move this explanation down or provide an example. * "Function results are returned from yieldWait() and friends by ref." Someone coming from C++ may be thrown off by this sudden casual use of "friends" and think there's a notion of frienship by reference in D. Better: "The forcing methods yieldWait(), workWait(), and spinWait() return the result by reference." * Speaking of which, I'd replace "Wait" with "Force". Right now the nomenclature is far removed from futures and promises. * Is done() a property? * The example that reads two files at the same time should NOT use taskPool. It's just one task, why would the pool ever be needed? If you also provided an example that reads n files in memory at the same time using a pool, that would illustrate nicely why you need it. If a Task can't be launched without being put in a pool, there should be a possibility to do so. At my work we have a simple function called callInNewThread that does what's needed to launch a function in a new thread. * The note below that example gets me thinking: it is an artificial limitation to force users of Task to worry about scope and such. One should be able to create a Future object (Task I think in your terminology), pass it around like a normal value, and ultimately force it. This is the case for all other languages that implement futures. I suspect the "scope" parameter associated with the delegate a couple of definitions below plays a role here, but I think we need to work for providing the smoothest interface possible (possibly prompting improvements in the language along the way). * I'm not sure how to interpret the docs for ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args); So it's documented but I'm not supposed to care. Why not just remove? Surely there must be higher-level examples that clarify that I can use delegates etc. * The examples have code at top-level. That's fine for short snippets but not when using import etc. I recommend putting the code inside unittests or function bodies for such cases. * "If you want to escape the Task object from the function in which it was created or prefer to heap allocate and automatically submit to the pool, see TaskPool.task()." I'm uncomfortable that I need to remember a subtle distinction of the same primitive name ("task") depending on membership in a TaskPool or not, which is a tenuous relationship to remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking ourselves what's the relative overhead of closure allocation vs. task switching etc. Maybe we get to simplify things a lot at a small cost in efficiency. * As I mentioned, in the definition: Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args); I can't tell what "run" is. * osReportedNcpu - a case of naming convention gone wild... how about totalCPUs or reportedCPUs. * "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd? * I would group these two together: ParallelForeach!(R) parallel(R)(R range, size_t workUnitSize); ParallelForeach!(R) parallel(R)(R range); Just put a /// ditto for the second, use only one code snippet that illustrates two foreach loops, one with the explicit parameter and one without. In fact swap the two - most of the time people will use the one with the default work unit size. At best don't specify 512 because later on you may come with better algorithm to determine the optimal block size. * "workUnitSize: The number of elements to evaluate in a single Task. Must be less than or equal to bufSize, and in practice should be a fraction of bufSize such that all worker threads can be used." Then why not specify a different parameter such a multiplier or something? The dependence between bufSize and worUnitSize is a sign that these two should be improved. If you have good reasons that the user must have the parameters in this form, give an example substantiating that. * "auto myMax = taskPool.reduce!"a + b * b"(myArr);" This is not computing anything meaningful. To compute the sum of squares, you need to give the seed 0.0. * Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine? * The join example is fine, but the repetitive code suggests that loops should be better: import std.file; void main() { auto pool = new TaskPool(); foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) { pool.put(task!read(filename)); } // Call join() to guarantee that all tasks are done running, the worker // threads have terminated and that the results of all of the tasks can // be accessed without any synchronization primitives. pool.join(); ubyte[][] data; // void[], meh // Use spinWait() since the results are guaranteed to have been computed // and spinWait() is the cheapest of the wait functions. foreach (task; pool.howDoIEnumerateTasks()) { data ~= task1.spinWait(); } Andrei
Mar 19 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations 
 ("most", "completely", "some") don't belong in a technical 
 documentation. (For example there's either subversion going on or there 
 isn't.) Also, std.concurrency and std.parallelism address different 
 needs so there's little competition between them. Better: "Unless 
 explicitly marked as $(D  trusted) or $(D  safe), artifacts in this 
 module are not provably memory-safe and cannot be used with SafeD. If 
 used as documented, memory safety is guaranteed."
Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked a shared delegate as it should it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guarantied race-free. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 1:09 PM, Michel Fortin wrote:
 On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations
 ("most", "completely", "some") don't belong in a technical
 documentation. (For example there's either subversion going on or
 there isn't.) Also, std.concurrency and std.parallelism address
 different needs so there's little competition between them. Better:
 "Unless explicitly marked as $(D  trusted) or $(D  safe), artifacts in
 this module are not provably memory-safe and cannot be used with
 SafeD. If used as documented, memory safety is guaranteed."
Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked a shared delegate as it should it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guarantied race-free.
If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features. I thought long and hard about this issue before submitting this lib for review and concluded that any solution would make std.parallelism so slow, so limited and/or such a PITA to use that I'd rather it just punt these issues to the programmer. In practice, parallel foreach is used with very small, performance-critical snippets that are fairly easy to reason about even without any language-level race safety. This is a fundamental design decision that will not be changing. If it's unacceptable then: 1. I give up. 2. I wish someone had told me earlier.
Mar 19 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }
 
 The "+=" would need to be an atomic operation to avoid low-level races.
 
 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.
If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.
I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.
 I thought long and hard about this issue before submitting this lib for 
 review and concluded that any solution would make std.parallelism so 
 slow, so limited and/or such a PITA to use that I'd rather it just punt 
 these issues to the programmer.  In practice, parallel foreach is used 
 with very small, performance-critical snippets that are fairly easy to 
 reason about even without any language-level race safety.
I'm not too convinced about the "I know what I'm doing" argument when I look at this example from asyncBuf's documentation: auto lines = File("foo.txt").byLine(); auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles buffer // Fetch more lines in the background while we process the lines already // read into memory into a matrix of doubles. double[][] matrix; auto asyncReader = taskPool.asyncBuf(duped); foreach(line; asyncReader) { auto ls = line.split("\t"); matrix ~= to!(double[])(ls); } Look at the last line of the foreach. You are appending to a non-shared array from many different threads. How is that not a race condition? With my proposal, the compiler would have caught that because opApply would want the foreach body to be a shared delegate, and reading/writing to non-shared variable "matrix" in the outer scope from a shared delegate literal would be an error. I'm not too sure how hard it'd be to do that in the compiler, but I think it's the right thing to do. Once the compiler can do this we can have safety; until that time I'd agree to see std.parallelism stay unsafe. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com> said:

 I'm not too convinced about the "I know what I'm doing" argument when I 
 look at this example from asyncBuf's documentation:
 
     auto lines = File("foo.txt").byLine();
     auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() 
 recycles buffer
 
     // Fetch more lines in the background while we process the lines already
     // read into memory into a matrix of doubles.
     double[][] matrix;
     auto asyncReader = taskPool.asyncBuf(duped);
 
     foreach(line; asyncReader) {
         auto ls = line.split("\t");
         matrix ~= to!(double[])(ls);
     }
 
 Look at the last line of the foreach. You are appending to a non-shared 
 array from many different threads. How is that not a race condition?
... or maybe I just totally misunderstood asyncBuf. Rereading the documentation I'm under the impression I'd have to write this to get what I expected: foreach (line; parallel(asyncReader)) ... And that would cause a race condition. If that's the case, the example is fine. Sorry for the misunderstanding. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 2:25 PM, Michel Fortin wrote:
 On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com>
 said:

 I'm not too convinced about the "I know what I'm doing" argument when
 I look at this example from asyncBuf's documentation:

 auto lines = File("foo.txt").byLine();
 auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles
 buffer

 // Fetch more lines in the background while we process the lines already
 // read into memory into a matrix of doubles.
 double[][] matrix;
 auto asyncReader = taskPool.asyncBuf(duped);

 foreach(line; asyncReader) {
 auto ls = line.split("\t");
 matrix ~= to!(double[])(ls);
 }

 Look at the last line of the foreach. You are appending to a
 non-shared array from many different threads. How is that not a race
 condition?
.... or maybe I just totally misunderstood asyncBuf. Rereading the documentation I'm under the impression I'd have to write this to get what I expected: foreach (line; parallel(asyncReader)) ... And that would cause a race condition. If that's the case, the example is fine. Sorry for the misunderstanding.
Right. And this is pretty obviously a race. The other example (without the parallel) is completely safe.
Mar 19 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 2:14 PM, Michel Fortin wrote:
 On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:

 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }

 The "+=" would need to be an atomic operation to avoid low-level races.

 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.
If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.
I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.
Ok, I've had some time to think about this. The following example is safe, but wouldn't work if I understand correctly what you're proposing: auto logs = new double[1_000_000]; foreach(i, ref elem; taskPool.parallel(logs)) { elem = log(i + 1.0); } The problem is that you're writing to the same array from multiple threads, which the compiler would interpret as writing to the same variable. However, you can guarantee that no two threads ever write to the same element, so it's safe. Note: I'm aware of the false sharing issue when writing to adjacent memory addresses. However, when writing to an array this big it occurs to a negligible extent. If you were using a smaller array, then either the loop body would be so expensive that the cost of false sharing would be negligible, or the whole loop would be too cheap to be worth parallelizing. I'm also aware that word tearing is a concern on some architectures, though not x86. IMHO std.parallelism and its documentation should not be pedantic about portability to architectures that a D compiler doesn't even exist for yet. Also, your example can be trivially modified to be safe. void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { synchronized sum += value; } writeln(sum); } In this case that kills all parallelism, but in more realistic cases I use this pattern often. I find it very common to have an expensive loop body can be performed in parallel, except for a tiny portion that must update a shared data structure. I'm aware that it might be possible, in theory, to write this more formally using reduce() or something. However: 1. If the portion of the loop that deals with shared data is very small (and therefore the serialization caused by the synchronized block is negligible), it's often more efficient to only keep one data structure in memory and update it concurrently, rather than use stronger isolation between threads like reduce() does, and have to maintain one data structure for each thread. 2. In my experience synchronizing on a small portion of the loop body works very well in practice. My general philosophy is that, in a library like this, dangerous but useful constructs must be supported and treated as innocent until proven guilty, not the other way round.
Mar 20 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-20 11:42:04 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 2:14 PM, Michel Fortin wrote:
 On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:
 
 On 3/19/2011 1:09 PM, Michel Fortin wrote:
 For instance:
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 sum += value;
 }
 writeln(sum);
 }
 
 The "+=" would need to be an atomic operation to avoid low-level races.
 
 I think that ParallelForeach's opApply should only accept a shared
 delegate. I define shared delegate as a delegate that does not reference
 any non-shared variables of its outer scope. The problem is that DMD
 currently doesn't know how to determine whether a delegate literal is
 shared or not, thus a delegate literal is never shared and if
 ParallelForeach's opApply asked a shared delegate as it should it would
 just not work. Fix DMD to create shared delegate literals where
 appropriate and everything can be guarantied race-free.
If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.
I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.
Ok, I've had some time to think about this. The following example is safe, but wouldn't work if I understand correctly what you're proposing: auto logs = new double[1_000_000]; foreach(i, ref elem; taskPool.parallel(logs)) { elem = log(i + 1.0); } The problem is that you're writing to the same array from multiple threads, which the compiler would interpret as writing to the same variable. However, you can guarantee that no two threads ever write to the same element, so it's safe.
I don't see a problem with the above. The array elements you modify are passed through parallel's opApply which can check easily whether it's safe or not to pass them by ref to different threads (by checking the element's size) and allow or disallow the operation accordingly. It could even do a clever trick to make it safe to pass things such as elements of array of bytes by ref (by coalescing loop iterations for all bytes sharing the same word into one task). That might not work for ranges which are not arrays however. That said, feel free to suggest more problematic examples.
 Note:  I'm aware of the false sharing issue when writing to adjacent 
 memory addresses.  However, when writing to an array this big it occurs 
 to a negligible extent.  If you were using a smaller array, then either 
 the loop body would be so expensive that the cost of false sharing 
 would be negligible, or the whole loop would be too cheap to be worth 
 parallelizing.
 
 I'm also aware that word tearing is a concern on some architectures, 
 though not x86.  IMHO std.parallelism and its documentation should not 
 be pedantic about portability to architectures that a D compiler 
 doesn't even exist for yet.
No question there.
 Also, your example can be trivially modified to be safe.
 
 void main() {
      int sum = 0;
      foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
           synchronized sum += value;
       }
       writeln(sum);
 }
 
 In this case that kills all parallelism, but in more realistic cases I 
 use this pattern often.  I find it very common to have an expensive 
 loop body can be performed in parallel, except for a tiny portion that 
 must update a shared data structure.  I'm aware that it might be 
 possible, in theory, to write this more formally using reduce() or 
 something.  However:
 
 1.  If the portion of the loop that deals with shared data is very 
 small (and therefore the serialization caused by the synchronized block 
 is negligible), it's often more efficient to only keep one data 
 structure in memory and update it concurrently, rather than use 
 stronger isolation between threads like reduce() does, and have to 
 maintain one data structure for each thread.
 
 2.  In my experience synchronizing on a small portion of the loop body 
 works very well in practice.  My general philosophy is that, in a 
 library like this, dangerous but useful constructs must be supported 
 and treated as innocent until proven guilty, not the other way round.
Your second example is not really a good justification of anything. I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated as a secondary concern. I don't like it very much, but that's what was decided and written in TDPL. Now we're in the exact same situation (except that no classes are involved) and you're proposing that for this case we make convenience prime over low-level race safety? To me this would be an utter lack of coherency in the design of D's "synchronized" feature to go that route. For the case above, wouldn't it be better to use an atomic add instead of a synchronized block? In this case you can make "sum" shared and have the type system check that everything is safe (using my proposed rules). And if your data structure is bigger, likely it'll be a synchronized class and you won't have to resort to bypass type-system safeties inside your loop (although you might have to inside your synchronized class, but that's another matter). -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 20 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/20/2011 10:44 PM, Michel Fortin wrote:
 I don't see a problem with the above. The array elements you modify are
 passed through parallel's opApply which can check easily whether it's
 safe or not to pass them by ref to different threads (by checking the
 element's size) and allow or disallow the operation accordingly.

 It could even do a clever trick to make it safe to pass things such as
 elements of array of bytes by ref (by coalescing loop iterations for all
 bytes sharing the same word into one task). That might not work for
 ranges which are not arrays however.

 That said, feel free to suggest more problematic examples.
Ok, I completely agree in principle, though I question whether it's worth actually implementing something like this, especially until we get some kind of support for shared delegates.
 Also, your example can be trivially modified to be safe.

 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 synchronized sum += value;
 }
 writeln(sum);
 }

 In this case that kills all parallelism, but in more realistic cases I
 use this pattern often. I find it very common to have an expensive
 loop body can be performed in parallel, except for a tiny portion that
 must update a shared data structure. I'm aware that it might be
 possible, in theory, to write this more formally using reduce() or
 something. However:

 1. If the portion of the loop that deals with shared data is very
 small (and therefore the serialization caused by the synchronized
 block is negligible), it's often more efficient to only keep one data
 structure in memory and update it concurrently, rather than use
 stronger isolation between threads like reduce() does, and have to
 maintain one data structure for each thread.

 2. In my experience synchronizing on a small portion of the loop body
 works very well in practice. My general philosophy is that, in a
 library like this, dangerous but useful constructs must be supported
 and treated as innocent until proven guilty, not the other way round.
Your second example is not really a good justification of anything. I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated as a secondary concern. I don't like it very much, but that's what was decided and written in TDPL.
I'd go a little further. If the guarantees that shared was supposed to provide are strong, i.e. apply no matter what threading module is used, then I utterly despise it. It's one of the worst decisions made in the design of D. Making things pedantically strict, so that the type system gets in the way more than it helps, encourages the user to reflexively circumvent the type system without thinking hard about doing this, thus defeating its purpose. (The alternative of always complying with what the type system "expects" you to do is too inflexible to even be worth considering.) Type systems should err on the side of accepting a superset of what's correct and treating code as innocent until proven guilty, not the other way around. I still believe this even if some of the bugs it could be letting pass through might be very difficult to debug. See the discussion we had a few weeks ago about implicit integer casting and porting code to 64. My excuse for std.parallelism is that it's pedal-to-metal parallelism, so it's more acceptable for it to be dangerous than general case concurrency. IMHO when you use the non- safe parts of std.parallelism (i.e. most of the library), that's equivalent to casting away shared in a whole bunch of places. Typing "import std.parallelism;" in a non- safe module is an explicit enough step here. The guarantee is still preserved that, if you only use std.concurrency (D's flagship "safe" concurrency module) for multithreading and don't cast away shared, there can be no low level data races. IMHO this is still a substantial accomplishment in that there exists a way to do safe, statically checkable concurrency in D, even if it's not the **only** way concurrency can be done. BTW, core.thread can also be used to get around D's type system, not just std.parallelism. If you want to check that only safe concurrency is used, importing std.parallelism and core.thread can be grepped just as easily as casting away shared. If, on the other hand, the guarantees of shared are supposed to be weak in that they only apply to programs where only std.concurrency is used for multithreading, then I think strictness is the right thing to do. The whole point of std.concurrency is to give strong guarantees, but if you prefer more dangerous but more flexible multithreading, other paradigms should be readily available.
 Now we're in the exact same situation (except that no classes are
 involved) and you're proposing that for this case we make convenience
 prime over low-level race safety? To me this would be an utter lack of
 coherency in the design of D's "synchronized" feature to go that route.

 For the case above, wouldn't it be better to use an atomic add instead
 of a synchronized block?
Yes. I've used this pattern in much more complicated cases, though, where atomic wouldn't cut it.
 In this case you can make "sum" shared and have
 the type system check that everything is safe (using my proposed rules).
 And if your data structure is bigger, likely it'll be a synchronized
 class and you won't have to resort to bypass type-system safeties inside
 your loop (although you might have to inside your synchronized class,
 but that's another matter).
I'm **still** totally confused about how shared is supposed to work, because I don't have a fully debugged/implemented implementation or good examples of stuff written in this paradigm to play around with. TDPL helps to some degree, but for me it's a lot easier to understand something by actually trying to use it.
Mar 20 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-20 23:21:49 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/20/2011 10:44 PM, Michel Fortin wrote:
 
 I don't see a problem with the above. The array elements you modify are
 passed through parallel's opApply which can check easily whether it's
 safe or not to pass them by ref to different threads (by checking the
 element's size) and allow or disallow the operation accordingly.
 
 It could even do a clever trick to make it safe to pass things such as
 elements of array of bytes by ref (by coalescing loop iterations for all
 bytes sharing the same word into one task). That might not work for
 ranges which are not arrays however.
 
 That said, feel free to suggest more problematic examples.
Ok, I completely agree in principle, though I question whether it's worth actually implementing something like this, especially until we get some kind of support for shared delegates.
Well, it'll work irrespective of whether shared delegates are used or not. I think you could add a compile-time check that the array element size is a multiple of the word size when the element is passed by ref in the loop and leave the clever trick as a possible future improvements. Would that work?
 Also, your example can be trivially modified to be safe.
 
 void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 synchronized sum += value;
 }
 writeln(sum);
 }
 
 In this case that kills all parallelism, but in more realistic cases I
 use this pattern often. I find it very common to have an expensive
 loop body can be performed in parallel, except for a tiny portion that
 must update a shared data structure. I'm aware that it might be
 possible, in theory, to write this more formally using reduce() or
 something. However:
 
 1. If the portion of the loop that deals with shared data is very
 small (and therefore the serialization caused by the synchronized
 block is negligible), it's often more efficient to only keep one data
 structure in memory and update it concurrently, rather than use
 stronger isolation between threads like reduce() does, and have to
 maintain one data structure for each thread.
 
 2. In my experience synchronizing on a small portion of the loop body
 works very well in practice. My general philosophy is that, in a
 library like this, dangerous but useful constructs must be supported
 and treated as innocent until proven guilty, not the other way round.
Your second example is not really a good justification of anything. I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated as a secondary concern. I don't like it very much, but that's what was decided and written in TDPL.
I'd go a little further. If the guarantees that shared was supposed to provide are strong, i.e. apply no matter what threading module is used, then I utterly despise it. It's one of the worst decisions made in the design of D. Making things pedantically strict, so that the type system gets in the way more than it helps, encourages the user to reflexively circumvent the type system without thinking hard about doing this, thus defeating its purpose. (The alternative of always complying with what the type system "expects" you to do is too inflexible to even be worth considering.) Type systems should err on the side of accepting a superset of what's correct and treating code as innocent until proven guilty, not the other way around. I still believe this even if some of the bugs it could be letting pass through might be very difficult to debug. See the discussion we had a few weeks ago about implicit integer casting and porting code to 64.
I agree with you that this is a serious problem. I think part of why it hasn't been talked much yet is that nobody is currently using D2 seriously for multithreaded stuff at this time (apart from you I guess), so we're missing experience with it. Andrei seems to think it's fine to required casts as soon as you need to protect something beyond an indirection inside synchronized classes, with the mitigation measure that you can make classes share their mutex (not implemented yet I think) so if the indirection leads to a class it is less of a problem. Personally, I don't.
 My excuse for std.parallelism is that it's pedal-to-metal parallelism, 
 so it's more acceptable for it to be dangerous than general case 
 concurrency.  IMHO when you use the non- safe parts of std.parallelism 
 (i.e. most of the library), that's equivalent to casting away shared in 
 a whole bunch of places.  Typing "import std.parallelism;" in a 
 non- safe module is an explicit enough step here.
I still think this "pedal-to-metal" qualification needs to be justified. Not having shared delegates in the language seems like an appropriate justification to me. Wanting to bypass casts you normally have to do around synchronized as the sole reason seems like a bad justification to me. It's not that I like how synchronized works, it's just that I think it should work the same everywhere.
 The guarantee is still preserved that, if you only use std.concurrency 
 (D's flagship "safe" concurrency module) for multithreading and don't 
 cast away shared, there can be no low level data races. IMHO this is 
 still a substantial accomplishment in that there exists a way to do 
 safe, statically checkable concurrency in D, even if it's not the 
 **only** way concurrency can be done.  BTW, core.thread can also be 
 used to get around D's type system, not just std.parallelism.  If you 
 want to check that only safe concurrency is used, importing 
 std.parallelism and core.thread can be grepped just as easily as 
 casting away shared.
Unless I'm mistaken, the only thing that bypasses race-safety in core.thread is the Thread constructor that takes a delegate. Which means it could easily be made race-safe by making that delegate parameter shared (once shared delegates are implemented).
 If, on the other hand, the guarantees of shared are supposed to be weak 
 in that they only apply to programs where only std.concurrency is used 
 for multithreading, then I think strictness is the right thing to do. 
 The whole point of std.concurrency is to give strong guarantees, but if 
 you prefer more dangerous but more flexible multithreading, other 
 paradigms should be readily available.
I think the language as a whole is designed to have strong guaranties, otherwise synchronized classes wouldn't require out-of-guaranty casts at every indirection. I'm not too pleased with the way synchronized classes are supposed to work, nor am I too pleased with how it impacts the rest of the language. But if this is a problem (and I think it is), it ought to be fixed globally, not by shutting down safeties in every module dealing with multithreading that isn't std.concurrency.
 I'm **still** totally confused about how shared is supposed to work, 
 because I don't have a fully debugged/implemented implementation or 
 good examples of stuff written in this paradigm to play around with.
I think nobody have played much with the paradigm at this point, or we'd have heard some feedback. Well, actually we have your feedback which seem to indicate that it's better to shut off safeties than to play nice with them. - - - Quoting Andrei, February 4 of 2010 in "there is no escape" on the dmd-concurrency mailing list:
 As we already knew, shared/synchronized limit quite drastically the 
 range of lock-based designs without casts. Fortunately, using a class 
 reference member inside a synchronized object will be possible 
 because... well I'll explain in the text.
 
 I continue to believe this is the right bet to make, but I expect push 
 back from experienced lock-based programmers.
Now is the beginning of that push back, I guess. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 21 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/21/2011 8:37 AM, Michel Fortin wrote:
 Well, it'll work irrespective of whether shared delegates are used or
 not. I think you could add a compile-time check that the array element
 size is a multiple of the word size when the element is passed by ref in
 the loop and leave the clever trick as a possible future improvements.
 Would that work?
On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works. Two, AFAIK there's no way to get the native word size.
 I'd go a little further. If the guarantees that shared was supposed to
 provide are strong, i.e. apply no matter what threading module is
 used, then I utterly despise it. It's one of the worst decisions made
 in the design of D. Making things pedantically strict, so that the
 type system gets in the way more than it helps, encourages the user to
 reflexively circumvent the type system without thinking hard about
 doing this, thus defeating its purpose. (The alternative of always
 complying with what the type system "expects" you to do is too
 inflexible to even be worth considering.) Type systems should err on
 the side of accepting a superset of what's correct and treating code
 as innocent until proven guilty, not the other way around. I still
 believe this even if some of the bugs it could be letting pass through
 might be very difficult to debug. See the discussion we had a few
 weeks ago about implicit integer casting and porting code to 64.
I agree with you that this is a serious problem. I think part of why it hasn't been talked much yet is that nobody is currently using D2 seriously for multithreaded stuff at this time (apart from you I guess), so we're missing experience with it. Andrei seems to think it's fine to required casts as soon as you need to protect something beyond an indirection inside synchronized classes, with the mitigation measure that you can make classes share their mutex (not implemented yet I think) so if the indirection leads to a class it is less of a problem. Personally, I don't.
 My excuse for std.parallelism is that it's pedal-to-metal parallelism,
 so it's more acceptable for it to be dangerous than general case
 concurrency. IMHO when you use the non- safe parts of std.parallelism
 (i.e. most of the library), that's equivalent to casting away shared
 in a whole bunch of places. Typing "import std.parallelism;" in a
 non- safe module is an explicit enough step here.
I still think this "pedal-to-metal" qualification needs to be justified. Not having shared delegates in the language seems like an appropriate justification to me. Wanting to bypass casts you normally have to do around synchronized as the sole reason seems like a bad justification to me. It's not that I like how synchronized works, it's just that I think it should work the same everywhere.
This is where you and I disagree. I think that the type system's guarantees should be weak, i.e. only apply to std.concurrency. IMHO the strictness is reasonable when using message passing as your primary method of multithreading and only very little shared state. However, it's completely unreasonable if you want to use a paradigm where shared state is more heavily used. D, being a systems language, needs to allow other styles of multithreading without making them a huge PITA that requires casts everywhere.
 The guarantee is still preserved that, if you only use std.concurrency
 (D's flagship "safe" concurrency module) for multithreading and don't
 cast away shared, there can be no low level data races. IMHO this is
 still a substantial accomplishment in that there exists a way to do
 safe, statically checkable concurrency in D, even if it's not the
 **only** way concurrency can be done. BTW, core.thread can also be
 used to get around D's type system, not just std.parallelism. If you
 want to check that only safe concurrency is used, importing
 std.parallelism and core.thread can be grepped just as easily as
 casting away shared.
Unless I'm mistaken, the only thing that bypasses race-safety in core.thread is the Thread constructor that takes a delegate. Which means it could easily be made race-safe by making that delegate parameter shared (once shared delegates are implemented).
And then you'd only need one cast to break it if you wanted to, not casts everywhere. Just cast an unshared delegate to shared when passing it to core.thread.
 If, on the other hand, the guarantees of shared are supposed to be
 weak in that they only apply to programs where only std.concurrency is
 used for multithreading, then I think strictness is the right thing to
 do. The whole point of std.concurrency is to give strong guarantees,
 but if you prefer more dangerous but more flexible multithreading,
 other paradigms should be readily available.
I think the language as a whole is designed to have strong guaranties, otherwise synchronized classes wouldn't require out-of-guaranty casts at every indirection.
Well, there's an easy way around that, too. Just declare the whole method body synchronized, but don't declare the method synchronized in the signature.
 I'm not too pleased with the way synchronized classes are supposed to
 work, nor am I too pleased with how it impacts the rest of the language.
 But if this is a problem (and I think it is), it ought to be fixed
 globally, not by shutting down safeties in every module dealing with
 multithreading that isn't std.concurrency.
Again, I completely disagree. IMHO it's not fixable globally such that both of the following are achieved: 1. The strong guarantees when using only std.concurrency are preserved. 2. More shared state-intensive multithreading can be done without the type system getting in the way more than it helps.
 I'm **still** totally confused about how shared is supposed to work,
 because I don't have a fully debugged/implemented implementation or
 good examples of stuff written in this paradigm to play around with.
I think nobody have played much with the paradigm at this point, or we'd have heard some feedback. Well, actually we have your feedback which seem to indicate that it's better to shut off safeties than to play nice with them.
I agree that it's better to shut off the safeties **unless** you're doing very coarse-grained multithreading with very little shared state like std.concurrency had in mind. If this is what you want, then the safeties are great. Unfortunately, I'm going to have to take a hard line on this one. The issue of integrating std.parallelism into the race safety system had been discussed a bunch in the past and it was basically agreed that std.parallelism is a "here be dragons" module that cannot reasonably be made to conform to such a model. Given the choice (and I hope I'm not forced to make this choice) I'd rather std.parallelism be a third-party module that's actually usable than a Phobos module that is such a PITA to use that noone does in practice.
Mar 21 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-21 09:50:09 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/21/2011 8:37 AM, Michel Fortin wrote:
 Well, it'll work irrespective of whether shared delegates are used or
 not. I think you could add a compile-time check that the array element
 size is a multiple of the word size when the element is passed by ref in
 the loop and leave the clever trick as a possible future improvements.
 Would that work?
On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works.
Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.
Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
 Unfortunately, I'm going to have to take a hard line on this one.  The 
 issue of integrating std.parallelism into the race safety system had 
 been discussed a bunch in the past and it was basically agreed that 
 std.parallelism is a "here be dragons" module that cannot reasonably be 
 made to conform to such a model.
About the "cannot reasonably be made to conform to such a model" part: that is certainly true today, but might turn out not to be true as the model evolves. It certainly is true as long as we don't have shared delegates. Beyond that it becomes more fuzzy. The final goal of the module shouldn't be to bypass safeties but to provide good parallelism primitives. By all means, if safeties can be enabled reasonably they should be (but as of today, they can't). And I totally agree with you that it's quite silly to require casts everywhere to use synchronized. I'll be the first to admit that it's hard to see synchronized classes being very practical as they're implemented today. There's room for improvements there too.
 Given the choice (and I hope I'm not forced to make this choice) I'd 
 rather std.parallelism be a third-party module that's actually usable 
 than a Phobos module that is such a PITA to use that noone does in 
 practice.
Which is understandable. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 21 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.
Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.
Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
This problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.
 Unfortunately, I'm going to have to take a hard line on this one.  The
 issue of integrating std.parallelism into the race safety system had
 been discussed a bunch in the past and it was basically agreed that
 std.parallelism is a "here be dragons" module that cannot reasonably be
 made to conform to such a model.
About the "cannot reasonably be made to conform to such a model" part: that is certainly true today, but might turn out not to be true as the model evolves. It certainly is true as long as we don't have shared delegates. Beyond that it becomes more fuzzy. The final goal of the module shouldn't be to bypass safeties but to provide good parallelism primitives. By all means, if safeties can be enabled reasonably they should be (but as of today, they can't).
I'll agree that, if the model evolves sufficiently, everything I've said deserves to be reconsidered. My comments only apply to the model as it exists currently or will exist in the foreseeable future. It's just that I have **serious** doubts that it will evolve much and I don't want to delay things to have an academic discussion about what-ifs with regard to this.
 And I totally agree with you that it's quite silly to require casts
 everywhere to use synchronized. I'll be the first to admit that it's
 hard to see synchronized classes being very practical as they're
 implemented today. There's room for improvements there too.
This is the point I'm trying to make: Requiring these casts is perfectly reasonable if you assume that shared state is going to be rare, as it is in the std.concurrency model. When the user is looking for a more fine-grained, more shared state-heavy paradigm, I don't think it's possible to make it safe without also making it virtually unusable. This is why we punted and decided to go with message passing and very limited shared data as the flagship multithreading model. On the other hand, in a systems language, we can't virtually prohibit shared state-heavy multithreading by making it virtually unusable. Thus, the no low level race guarantee needs to apply to std.concurrency only.
Mar 21 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.
Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.
Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
This problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.
Actually, just remembered that word tearing is also an issue with unaligned memory access. I guess I could just include a warning that says not to do this with unaligned memory.
Mar 21 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/21/2011 11:58 AM, dsimcha wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On second thought, no, but for practical, not theoretical reasons:
 One, you can't introspect whether a foreach loop is using a ref or a
 value parameter.  This is an issue with how opApply works.
Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.
 Two, AFAIK there's no way to get the native word size.
Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
This problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.
Actually, just remembered that word tearing is also an issue with unaligned memory access. I guess I could just include a warning that says not to do this with unaligned memory.
s/is/may be . I'm trying to read up on word tearing and post questions here and to StackOverflow. Good documentation about it seems ridiculously hard to come by. About the only solid pieces of information I've found are that x86 definitely **can** do byte granularity (which doesn't mean it always does) and that Java makes these guarantees (which is only useful if you're writing in Java). The general gut feeling people here and on StackOverflow seem to have is that it's not an issue, but there doesn't seem to be much backing up of this. Maybe I'm just being paranoid and this is a complete non-issue except on the most obscure/ancient hardware and/or most stupidly implemented compilers, which is why few people even think of it.
Mar 21 2011
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 1.  I give up.
 
 2.  I wish someone had told me earlier.
Please, don't give up. It's a lot of time from the first time I've asked a parallel map in Phobos :-) Bye, bearophile
Mar 19 2011
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 19 Mar 2011 13:09:23 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> said:

 * "Most of this module completely subverts..." Vague characterizations  
 ("most", "completely", "some") don't belong in a technical  
 documentation. (For example there's either subversion going on or there  
 isn't.) Also, std.concurrency and std.parallelism address different  
 needs so there's little competition between them. Better: "Unless  
 explicitly marked as $(D  trusted) or $(D  safe), artifacts in this  
 module are not provably memory-safe and cannot be used with SafeD. If  
 used as documented, memory safety is guaranteed."
Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked a shared delegate as it should it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guarantied race-free.
Technically, you'd only need a shared delegate or a const delegate to guarantee race safety for parallel foreach.
Mar 20 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.
I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatentating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like. In general I feel like std.parallelism is being held to a ridiculously high standard that none of the other Phobos modules currently meet.
 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.
No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)
 Also it is my understanding that
 TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
 number of tasks. If that's the case, it's worth mentioning.
Isn't this kind of obvious from the examples, etc.?
 * "If a Task has been submitted to a TaskPool instance, is being stored
 in a stack frame, and has not yet finished, the destructor for this
 struct will automatically call yieldWait() so that the task can finish
 and the stack frame can be destroyed safely." At this point in the doc
 the reader doesn't understand that at all because TaskPool has not been
 seen yet. The reader gets worried that she'll be essentially serializing
 the entire process by mistake. Either move this explanation down or
 provide an example.
This is getting ridiculous. There are too many circular dependencies between Task and TaskPool that are impossible to remove here that I'm not even going to try. One or the other has to be introduced first, but neither can be explained without mentioning the other. This is why I explain the relationship briefly in the module level summary, so that the user has at least some idea. I think this is about the best I can do.
 * Is done() a property?
Yes. DDoc sucks.
 * The example that reads two files at the same time should NOT use
 taskPool. It's just one task, why would the pool ever be needed? If you
 also provided an example that reads n files in memory at the same time
 using a pool, that would illustrate nicely why you need it. If a Task
 can't be launched without being put in a pool, there should be a
 possibility to do so. At my work we have a simple function called
 callInNewThread that does what's needed to launch a function in a new
 thread.
I guess I could add something like this to Task. Under the hood it would (for implementation simplicity, to reuse a bunch of code from TaskPool) fire up a new single-thread pool, submit the task, call TaskPool.finish(), and then return. Since you're already creating a new thread, the extra overhead of creating a new TaskPool for the thread would be negligible and it would massively simplify the implementation. My only concern is that, when combined with scoped versus non-scoped tasks (which are definitely here to stay, see below) this small convenience function would add way more API complexity than it's worth. Besides, task pools don't need to be created explicitly anyhow. That's what the default instantiation is for. I don't see how callInNewThread would really solve much. Secondly, I think you're reading **WAY** too much into what was meant to be a simple example to illustrate usage mechanics. This is another case where I can't think of a small, cute example of where you'd really need the pool. There are plenty of larger examples, but the smallest/most self-contained one I can think of is a parallel sort. I decided to use file reading because it was good enough to illustrate the mechanics of usage, even if it didn't illustrate a particularly good use case.
 * The note below that example gets me thinking: it is an artificial
 limitation to force users of Task to worry about scope and such. One
 should be able to create a Future object (Task I think in your
 terminology), pass it around like a normal value, and ultimately force
 it. This is the case for all other languages that implement futures. I
 suspect the "scope" parameter associated with the delegate a couple of
 definitions below plays a role here, but I think we need to work for
 providing the smoothest interface possible (possibly prompting
 improvements in the language along the way).
This is what TaskPool.task is for. Maybe this should be moved to the top of the definition of TaskPool and emphasized, and the scoped/stack allocated versions should be moved below TaskPool and de-emphasized? At any rate, though, anyone who's playing with parallelism should be an advanced enough programmer that concepts like scope and destructors are second nature.
 * I'm not sure how to interpret the docs for

 ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);

 So it's documented but I'm not supposed to care. Why not just remove?
 Surely there must be higher-level examples that clarify that I can use
 delegates etc.
Yes, and there are such examples. It's just that I want to explicitly state what type is returned in this case rather than returning auto. If I omitted the docs for run(), you'd ask what run() is.
 * "If you want to escape the Task object from the function in which it
 was created or prefer to heap allocate and automatically submit to the
 pool, see TaskPool.task()." I'm uncomfortable that I need to remember a
 subtle distinction of the same primitive name ("task") depending on
 membership in a TaskPool or not, which is a tenuous relationship to
 remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking
 ourselves what's the relative overhead of closure allocation vs. task
 switching etc. Maybe we get to simplify things a lot at a small cost in
 efficiency.
We absolutely **need** scope delegates for calling stuff where a closure can't be allocated due to objects on the stack frame having scoped destruction. This is not going anywhere. Also, I **much** prefer to have everything that does conceptually the same thing have the same name. I think this is clearer, not less clear.
 * As I mentioned, in the definition:

 Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

 I can't tell what "run" is.
run() is just the adapter function described right above this code. I cannot for the life of me understand how this could possibly be unclear.
 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?
I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.
 * The join example is fine, but the repetitive code suggests that loops
 should be better:

 import std.file;

 void main() {
 auto pool = new TaskPool();
 foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
 pool.put(task!read(filename));
 }

 // Call join() to guarantee that all tasks are done running, the worker
 // threads have terminated and that the results of all of the tasks can
 // be accessed without any synchronization primitives.
 pool.join();

 ubyte[][] data; // void[], meh
 // Use spinWait() since the results are guaranteed to have been computed
 // and spinWait() is the cheapest of the wait functions.
 foreach (task; pool.howDoIEnumerateTasks()) {
 data ~= task1.spinWait();
 }
You can't enumerate the tasks like this, but there's a good reason for this limitation. The design requires the pool not hold onto any references to the tasks after they're out of the queue, so that they can be safely destroyed as soon as the caller wants to destroy them. If you need to enumerate over them like this, then: 1. You're probably doing something wrong and parallel foreach would be a better choice. 2. If you insist on doing this, you need to explicitly store them in an array or something. As a more general comment, I know the join() example sucks. join() is actually a seldom-used primitive, not a frequently used one as you suggest. Most of the time you wait for individual tasks (explicitly or implicitly through the higher level functions), not the whole pool, to finish executing. I feel like it's obligatory that join() and finish() exist, but I never use them myself. They were much more useful in pre-release versions of this lib, when parallel foreach didn't exist and it made sense to have explicit arrays of Tasks. The difficulty I've had in coming up with a decent example for them makes me halfway tempted to
Mar 19 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
I think an improvement over undefined behaviour would be to throw an exception. Also, what happens if one of the tasks throws an exception? -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 2:35 PM, Michel Fortin wrote:
 On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:

 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
I think an improvement over undefined behaviour would be to throw an exception.
The only problem is that there's no easy, well-documented way to tell from the return value of opApply whether it was a break, a goto, a labeled break/continue, etc. This would be implementable only if I changed the semantics of break to also throw. This might not be a bad thing (IMHO any type of breaking out of a parallel foreach loop is just silly) but others had asked for different semantics for break.
 Also, what happens if one of the tasks throws an exception?
It gets rethrown when yieldWait()/spinWait()/workWait() is called. In the case of the higher-level primitives, it gets re-thrown to the calling thread at some non-deterministic point in the execution of these functions. I didn't see the need to document this explicitly because it "just works".
Mar 19 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 3:36 PM, dsimcha wrote:
 On 3/19/2011 2:35 PM, Michel Fortin wrote:
 It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
 the case of the higher-level primitives, it gets re-thrown to the
 calling thread at some non-deterministic point in the execution of these
 functions. I didn't see the need to document this explicitly because it
 "just works".
...Though now that you make me think of it I need to support exception chaining for the case of multiple concurrently thrown exceptions instead of just arbitrarily and non-deterministically rethrowing one. The code that handles this was written before exception chaining existed. Will get on that.
Mar 19 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:

 On 3/19/2011 2:35 PM, Michel Fortin wrote:
 On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:
 
 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
I think an improvement over undefined behaviour would be to throw an exception.
The only problem is that there's no easy, well-documented way to tell from the return value of opApply whether it was a break, a goto, a labeled break/continue, etc. This would be implementable only if I changed the semantics of break to also throw. This might not be a bad thing (IMHO any type of breaking out of a parallel foreach loop is just silly) but others had asked for different semantics for break.
It's not that silly. Essentially, what you'd express like this with a normal function taking a delegate: taskPool.apply([1,2,3], (int i) { if (i == 1) return; // do some things }); you'd express like this in a parallel foreach: foreach (int i; parallel([1,2,3])) { if (i == 1) break; // do some things } It's not following the semantics of break within a foreach, but it's still useful to be able to return early from a function (from a loop iteration in this case), so I see the use case for making 'break' do what it does. My only gripe is that the semantics are distorted. In fact, just making foreach parallel distorts its semantics. I was confused earlier about a foreach being parallel when it was not, someone could also be confused in the other direction, thinking foreach is the normal foreach when it actually is parallel. This makes code harder to review. Don't consider only my opinion on this, but in my opinion the first form above (taskPool.apply) is preferable because you absolutely can't mistake it with a normal foreach. And I think it's especially important to make it stand out given that the compiler can't check for low-level races.
 Also, what happens if one of the tasks throws an exception?
It gets rethrown when yieldWait()/spinWait()/workWait() is called. In the case of the higher-level primitives, it gets re-thrown to the calling thread at some non-deterministic point in the execution of these functions. I didn't see the need to document this explicitly because it "just works".
That's indeed what I'd expect. But I think it'd still be worth mentioning in a short sentense in yeildWait()/spinWait()/workWait()'s documentation. It's comforting when the documentation confirms your expectations. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 5:33 PM, Michel Fortin wrote:
 On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:
 The only problem is that there's no easy, well-documented way to tell
 from the return value of opApply whether it was a break, a goto, a
 labeled break/continue, etc. This would be implementable only if I
 changed the semantics of break to also throw. This might not be a bad
 thing (IMHO any type of breaking out of a parallel foreach loop is
 just silly) but others had asked for different semantics for break.
It's not that silly. Essentially, what you'd express like this with a normal function taking a delegate: taskPool.apply([1,2,3], (int i) { if (i == 1) return; // do some things }); you'd express like this in a parallel foreach: foreach (int i; parallel([1,2,3])) { if (i == 1) break; // do some things } It's not following the semantics of break within a foreach, but it's still useful to be able to return early from a function (from a loop iteration in this case), so I see the use case for making 'break' do what it does.
Using continue is well defined behavior and does exactly what I think you're suggesting. The problem with break is that it's supposed to stop all subsequent iterations of the loop from being executed, not just end the current one early. This only makes sense in a serial context, where "subsequent iterations" is well-defined. Currently break breaks from the current work unit but continues executing all other work units. I put this semantic in because I couldn't think of anything else to make it do and someone (I don't remember who) asked for it. I have never encountered a use case for it. IMHO all early termination that affects subsequent loop iterations as well (break, goto, labeled break and continue, but not regular continue) should just throw because they make absolutely no sense in a parallel context.
Mar 19 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 19.03.2011 22:45, schrieb dsimcha:
 IMHO all early termination that affects subsequent loop iterations as well
 (break, goto, labeled break and continue, but not regular continue) should just
 throw because they make absolutely no sense in a parallel context.
What about some kind of parallel search? You just want to know if something is there and on first sight you're done, so you want to break out of this iteration and don't want any new parallel iterations to be done. (Not sure what should be done with currently running iterations.. should they be killed or go on until they're done - anyway, the executing thread shouldn't start a new iteration after that)
Mar 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 7:33 PM, Daniel Gibson wrote:
 Am 19.03.2011 22:45, schrieb dsimcha:
 IMHO all early termination that affects subsequent loop iterations as well
 (break, goto, labeled break and continue, but not regular continue) should just
 throw because they make absolutely no sense in a parallel context.
What about some kind of parallel search? You just want to know if something is there and on first sight you're done, so you want to break out of this iteration and don't want any new parallel iterations to be done. (Not sure what should be done with currently running iterations.. should they be killed or go on until they're done - anyway, the executing thread shouldn't start a new iteration after that)
It's an interesting suggestion, and I'll give some thought to it, but on first glance it sounds **very** difficult to implement efficiently. To avoid doing a lot of extra work after you've found what you need, you'd need to use very small work units. To avoid excessive overhead you'd need to use fairly large work units. The only case I can think of where this would be do-able is when evaluating the predicate is very expensive.
Mar 19 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.
I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatentating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.
I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.
 In general I feel like std.parallelism is being held to a
 ridiculously high standard that none of the other Phobos modules
 currently meet.
I agree, but that goes without saying. This is not a double standard; it's a simple means to improve quality of Phobos overall. Clearly certain modules that are already in Phobos would not make it if proposed today. And that's a good thing! Comparing our current ambitions to the quality of the average Phobos module would not help us. Generally it seems you have grown already tired of the exchange and it would take only a few more exchanges for you to say, "you know what, either let's get it over it or forget about it." Please consider for a minute how this is the wrong tone and attitude to be having on several levels. This is not a debate and not the place to get defensive. Your role in the discussion is not symmetric with the others' and at best you'd use the review as an opportunity to improve your design, not stick heels in the ground to defend its current incarnation (within reason). Your potential customers are attempting to help you by asking questions (some of which no doubt are silly) and by making suggestions (some of which, again, are ill-founded). Nevertheless, we _are_ your potential customers and in a way the customer is always right. Your art is to steer customers from what they think they want to what you know they need - because you're the expert! - and to improve design, nomenclature, and implementation to the extent that would help them. Furthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way that means your design has reached perfection and is beyond improvement even from a non-expert. I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.
 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.
No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)
I very much hope this does change. Otherwise the role of Task in the design could be drastically reduced (e.g. nested type inside of TaskPool) without prejudice. At the minimum I want to be able to create a task, launch it, and check its result later without involving a pool. A pool is when I have many tasks that may exceed the number of CPUs etc. Simplicity would be great. // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all auto foo = readFoo.yieldWait(); auto bar = readBar.yieldWait(); auto baz = readBaz.yieldWait(); There's no need at this level for a task pool. What would be nice would be to have a join() that joins all tasks spawned by the current thread: // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all join(); // fetch results auto foo = readFoo.spinWait(); auto bar = readBar.spinWait(); auto baz = readBaz.spinWait(); The way I see it is, task pools are an advanced means that coordinate m threads over n CPUs. If I don't care about that (as above) there should be no need for a pool at all. (Of course it's fine if used by the implementation.)
 Also it is my understanding that
 TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
 number of tasks. If that's the case, it's worth mentioning.
Isn't this kind of obvious from the examples, etc.?
It wasn't to me. I had an intuition that that might be the case, but obvious - far from it. The fact that I need a task pool even when I don't care about m:n issues made me doubtful.
 * "If a Task has been submitted to a TaskPool instance, is being stored
 in a stack frame, and has not yet finished, the destructor for this
 struct will automatically call yieldWait() so that the task can finish
 and the stack frame can be destroyed safely." At this point in the doc
 the reader doesn't understand that at all because TaskPool has not been
 seen yet. The reader gets worried that she'll be essentially serializing
 the entire process by mistake. Either move this explanation down or
 provide an example.
This is getting ridiculous. There are too many circular dependencies between Task and TaskPool that are impossible to remove here that I'm not even going to try. One or the other has to be introduced first, but neither can be explained without mentioning the other. This is why I explain the relationship briefly in the module level summary, so that the user has at least some idea. I think this is about the best I can do.
Perhaps cross-references could help (see macro XREF). As I mentioned, an example here could go a long way, too.
 * The example that reads two files at the same time should NOT use
 taskPool. It's just one task, why would the pool ever be needed? If you
 also provided an example that reads n files in memory at the same time
 using a pool, that would illustrate nicely why you need it. If a Task
 can't be launched without being put in a pool, there should be a
 possibility to do so. At my work we have a simple function called
 callInNewThread that does what's needed to launch a function in a new
 thread.
I guess I could add something like this to Task. Under the hood it would (for implementation simplicity, to reuse a bunch of code from TaskPool) fire up a new single-thread pool, submit the task, call TaskPool.finish(), and then return. Since you're already creating a new thread, the extra overhead of creating a new TaskPool for the thread would be negligible and it would massively simplify the implementation. My only concern is that, when combined with scoped versus non-scoped tasks (which are definitely here to stay, see below) this small convenience function would add way more API complexity than it's worth. Besides, task pools don't need to be created explicitly anyhow. That's what the default instantiation is for. I don't see how callInNewThread would really solve much.
If we sit down for a second and think outside of the current design, the simplest Task use would be as such: "I start a task and get a sort of a future. I do something else. Then when I nede the result of the future I call some method force() against it." Nowhere is the notion of a task pool to be found. To me this looks like an artifact of the implementation has spilled into the API, which I understand how is natural to you. But also it's unnatural to me.
 Secondly, I think you're reading **WAY** too much into what was meant to
 be a simple example to illustrate usage mechanics. This is another case
 where I can't think of a small, cute example of where you'd really need
 the pool. There are plenty of larger examples, but the smallest/most
 self-contained one I can think of is a parallel sort. I decided to use
 file reading because it was good enough to illustrate the mechanics of
 usage, even if it didn't illustrate a particularly good use case.
It's impossible to not have a good small example. Sorting is great. You have the partition primitive already in std.algorithm, then off you go with tasks. Dot product on dense vectors is another good one. There's just plenty of operations that people understand are important to make fast. So to me it looks like this: * there are already some good examples of "I want 1-3 of tasks but don't care about pooling them": file reading and processing etc. * there should be 1-2 good example of partitioning a large chunk of work into many tasks and let taskPool carry them. With such, people would clearly understand the capabilities of the library and what they can achieve with various levels of involvement.
 * The note below that example gets me thinking: it is an artificial
 limitation to force users of Task to worry about scope and such. One
 should be able to create a Future object (Task I think in your
 terminology), pass it around like a normal value, and ultimately force
 it. This is the case for all other languages that implement futures. I
 suspect the "scope" parameter associated with the delegate a couple of
 definitions below plays a role here, but I think we need to work for
 providing the smoothest interface possible (possibly prompting
 improvements in the language along the way).
This is what TaskPool.task is for. Maybe this should be moved to the top of the definition of TaskPool and emphasized, and the scoped/stack allocated versions should be moved below TaskPool and de-emphasized? At any rate, though, anyone who's playing with parallelism should be an advanced enough programmer that concepts like scope and destructors are second nature.
I agree. That's not a reason to not find good names for entities. To me the relationship "global task() <=> scoped delegate; TaskPool method task() <=> closure" is inscrutable. I see no way of remembering it except rote memorization.
 We absolutely **need** scope delegates for calling stuff where a closure
 can't be allocated due to objects on the stack frame having scoped
 destruction. This is not going anywhere. Also, I **much** prefer to have
 everything that does conceptually the same thing have the same name. I
 think this is clearer, not less clear.
I agree in principle, but this case is just not that.
 * As I mentioned, in the definition:

 Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

 I can't tell what "run" is.
run() is just the adapter function described right above this code. I cannot for the life of me understand how this could possibly be unclear.
OK.
 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
Yah, I was actually thinking of disabling goto outside a local delegate everywhere.
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?
I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.
If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat"). Speaking of efficiency, I assume parallel foreach uses opApply with a delegate and the inherent overhead. So I'm thinking that a practical way to take advantage of parallel foreach would be to parallelize at some block level and then do a regular foreach inside the block? foreach (i, ref elem; taskPool.parallel(logs)) { foreach (???) elem = log(i + 1.0); } How can I arrange things such that I compute e.g. 64 logs serially inside each pass?
 * The join example is fine, but the repetitive code suggests that loops
 should be better:

 import std.file;

 void main() {
 auto pool = new TaskPool();
 foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
 pool.put(task!read(filename));
 }

 // Call join() to guarantee that all tasks are done running, the worker
 // threads have terminated and that the results of all of the tasks can
 // be accessed without any synchronization primitives.
 pool.join();

 ubyte[][] data; // void[], meh
 // Use spinWait() since the results are guaranteed to have been computed
 // and spinWait() is the cheapest of the wait functions.
 foreach (task; pool.howDoIEnumerateTasks()) {
 data ~= task1.spinWait();
 }
You can't enumerate the tasks like this, but there's a good reason for this limitation. The design requires the pool not hold onto any references to the tasks after they're out of the queue, so that they can be safely destroyed as soon as the caller wants to destroy them. If you need to enumerate over them like this, then: 1. You're probably doing something wrong and parallel foreach would be a better choice. 2. If you insist on doing this, you need to explicitly store them in an array or something. As a more general comment, I know the join() example sucks. join() is actually a seldom-used primitive, not a frequently used one as you suggest. Most of the time you wait for individual tasks (explicitly or implicitly through the higher level functions), not the whole pool, to finish executing. I feel like it's obligatory that join() and finish() exist, but I never use them myself. They were much more useful in pre-release versions of this lib, when parallel foreach didn't exist and it made sense to have explicit arrays of Tasks. The difficulty I've had in coming up with a decent example for them makes me halfway tempted to
I'm confused here. I use join() pretty much all the time. I launch stuff (e.g. run a large matrix-vector multiplication for distinct vectors) and then I join that. Then again and again. The thread of execution has a repeated hourglass shape - it fans out and then becomes single-threaded at join points. I assume some of those computations are indeed the charter of parallel foreach, but I'd guess not all. But then what do I know - I'm the customer :o). Andrei
Mar 19 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved.
 I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.
I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatentating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.
I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.
 In general I feel like std.parallelism is being held to a
 ridiculously high standard that none of the other Phobos modules
 currently meet.
I agree, but that goes without saying. This is not a double standard; it's a simple means to improve quality of Phobos overall. Clearly certain modules that are already in Phobos would not make it if proposed today. And that's a good thing! Comparing our current ambitions to the quality of the average Phobos module would not help us. Generally it seems you have grown already tired of the exchange and it would take only a few more exchanges for you to say, "you know what, either let's get it over it or forget about it." Please consider for a minute how this is the wrong tone and attitude to be having on several levels. This is not a debate and not the place to get defensive. Your role in the discussion is not symmetric with the others' and at best you'd use the review as an opportunity to improve your design, not stick heels in the ground to defend its current incarnation (within reason). Your potential customers are attempting to help you by asking questions (some of which no doubt are silly) and by making suggestions (some of which, again, are ill-founded). Nevertheless, we _are_ your potential customers and in a way the customer is always right. Your art is to steer customers from what they think they want to what you know they need - because you're the expert! - and to improve design, nomenclature, and implementation to the extent that would help them. Furthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way that means your design has reached perfection and is beyond improvement even from a non-expert. I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.
Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review. How about we make next Friday a soft deadline/start of voting? It can be extended as necessary by mutual agreement of me and Lars (the review manager), and likely will be if the review process is still yielding good suggestions and/or I haven't had time to implement/clean up some key things. Having a deadline breathing down your neck like this is really not conducive to being open to suggestions and thoughtful consideration, especially for issues that seem like fairly minor details. Also increasing the deadline pressure issue, Michael Fortin just caused me to rethink the issue of exception handling in parallel foreach. I had more-or-less working code for this, but I realized it's severely broken in subtle ways that I've (knock on wood) never actually run into in real world code. It's gonna take some time to fix. These kinds of issues with error handling code can very easily slip under the radar in a library with only a few users, and unfortunately, one has. I definitely know how to fix it, but I have to implement the fix and it's somewhat non-trivial, i.e. I'm debugging it now and it's looking like a marathon debugging session.
Mar 19 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this discussion
 if you only frame it in the right light. Again, people are trying to
 help (however awkwardly) and in no way is that ridiculous.
Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.
Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes. Lars?
 Also increasing the deadline pressure issue, Michael Fortin just caused
 me to rethink the issue of exception handling in parallel foreach. I had
 more-or-less working code for this, but I realized it's severely broken
 in subtle ways that I've (knock on wood) never actually run into in real
 world code. It's gonna take some time to fix. These kinds of issues with
 error handling code can very easily slip under the radar in a library
 with only a few users, and unfortunately, one has. I definitely know how
 to fix it, but I have to implement the fix and it's somewhat
 non-trivial, i.e. I'm debugging it now and it's looking like a marathon
 debugging session.
Fixing these sooner rather than later would be great, so all the better for suspending the review. Andrei
Mar 19 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 6:58 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this discussion
 if you only frame it in the right light. Again, people are trying to
 help (however awkwardly) and in no way is that ridiculous.
Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.
Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes.
Sounds like a plan.
Mar 19 2011
prev sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Sat, 19 Mar 2011 17:58:46 -0500, Andrei Alexandrescu wrote:

 On 03/19/2011 04:25 PM, dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 I know you'd have no problem finding the right voice in this
 discussion if you only frame it in the right light. Again, people are
 trying to help (however awkwardly) and in no way is that ridiculous.
Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.
Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes. Lars?
Sounds good. I'll make the announcement. -Lars
Mar 20 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.
In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Mar 19 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.
In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Both std.datetime and std.unittests underwent a fair number of changes over the course the review process. A lot of the stuff stayed the same, but a lot of it changed too. On the whole, though, the end results were much better for it. - Jonathan M Davis
Mar 19 2011
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.
In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Both std.datetime and std.unittests underwent a fair number of changes over the course the review process. A lot of the stuff stayed the same, but a lot of it changed too. On the whole, though, the end results were much better for it. - Jonathan M Davis
Mar 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 8:48 PM, Jonathan M Davis wrote:
 On Saturday 19 March 2011 17:31:18 dsimcha wrote:
 On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 Furthermore, you should expect that the review process will prompt
 changes. My perception is that you consider the submission more or less
 final modulo possibly a few minor nits. You shouldn't. I'm convinced you
 know much more about SMP than most or all others in this group, but in
 no way that means your design has reached perfection and is beyond
 improvement even from a non-expert.
In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Both std.datetime and std.unittests underwent a fair number of changes over the course the review process. A lot of the stuff stayed the same, but a lot of it changed too. On the whole, though, the end results were much better for it. - Jonathan M Davis
Please check your newsreader settings. You've been double-posting a lot lately.
Mar 19 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 12:16 PM, dsimcha wrote:
 On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved.
 I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.
I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatentating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.
I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.
Good example, will do.
 * "After creation, Task objects are submitted to a TaskPool for
 execution." I understand it's possible to use Task straight as a
 promise/future, so s/are/may be/.
No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)
I very much hope this does change. Otherwise the role of Task in the design could be drastically reduced (e.g. nested type inside of TaskPool) without prejudice. At the minimum I want to be able to create a task, launch it, and check its result later without involving a pool. A pool is when I have many tasks that may exceed the number of CPUs etc. Simplicity would be great. // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all auto foo = readFoo.yieldWait(); auto bar = readBar.yieldWait(); auto baz = readBaz.yieldWait();
This is definitely feasible in principle. I'd like to implement it, but there's a few annoying, hairy details standing in the way. For reasons I detailed previously, we need both scoped and non-scoped tasks. We also have alias vs. callable (i.e. function pointer or delegate) tasks. Now we're adding pool vs. new-thread tasks. This is turning into a combinatorial explosion and needs to be simplified somehow. I propose the following: 1. I've reconsidered and actually like the idea of task() vs. scopedTask(). task() returns a pointer on the heap. scopedTask() returns a struct on the stack. Neither would be a member function of TaskPool. 2. Non-scoped Task pointers would need to be explicitly submitted to the task pool via the put() method. This means getting rid of TaskPool.task(). 3. The Task struct would grow a function runInNewThread() or something similar. (If you think this would be a common case, even just execute() might cut it.) The work flow would now be that you call task() to get a heap-allocated Task*, or scopedTask to get a stack-allocated Task. You then call either TaskPool.put() to execute it on a pool or Task.runInNewThread() to run it in a new thread. The creation of the Task is completely orthogonal to how it's run.
 There's no need at this level for a task pool. What would be nice would
 be to have a join() that joins all tasks spawned by the current thread:

 // start three reads
 auto readFoo = task!readText("foo.txt");
 auto readBar = task!readText("bar.txt");
 auto readBaz = task!readText("baz.txt");
 // join'em all
 join();
 // fetch results
 auto foo = readFoo.spinWait();
 auto bar = readBar.spinWait();
 auto baz = readBaz.spinWait();
I don't understand how this would be a substantial improvement over the first example, where you just call yieldWait() on all three. Furthermore, implementing join() as shown in this example would require some kind of central registry of all tasks/worker threads/task pools/something similar, which would be a huge PITA to implement efficiently.
 Secondly, I think you're reading **WAY** too much into what was meant to
 be a simple example to illustrate usage mechanics. This is another case
 where I can't think of a small, cute example of where you'd really need
 the pool. There are plenty of larger examples, but the smallest/most
 self-contained one I can think of is a parallel sort. I decided to use
 file reading because it was good enough to illustrate the mechanics of
 usage, even if it didn't illustrate a particularly good use case.
It's impossible to not have a good small example. Sorting is great. You have the partition primitive already in std.algorithm, then off you go with tasks. Dot product on dense vectors is another good one. There's just plenty of operations that people understand are important to make fast.
I forgot about std.algorithm.partition. This makes a parallel quick sort so trivial to implement (ignoring the issue of selecting a good pivot, which I think can be safely ignored in example code) that it might actually make a good example.
 * "A goto from inside the parallel foreach loop to a label outside the
 loop will result in undefined behavior." Would this be a bug in dmd?
No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
Yah, I was actually thinking of disabling goto outside a local delegate everywhere.
See the discussion I'm having with Michael Fortin. My latest idea is that break, labeled break/continue, return, and goto should all throw exceptions when found inside a parallel foreach loop. They affect the flow of subsequent iterations, and "subsequent iterations" only makes sense when the loop is being executed in serial.
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?
I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.
If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat").
Ok, I thought you were asking for something much more rigorous than this. I therefore didn't want to provide it because I figured that, no matter what I did, someone would be able to say that the benchmark is flawed some how, yada, yada, yada. Given how inexact a science benchmarking is, I'm still hesitant to put such results in API docs, but I can see where you're coming from here.
 Speaking of efficiency, I assume parallel foreach uses opApply with a
 delegate and the inherent overhead. So I'm thinking that a practical way
 to take advantage of parallel foreach would be to parallelize at some
 block level and then do a regular foreach inside the block?

 foreach (i, ref elem; taskPool.parallel(logs)) {
 foreach (???)
 elem = log(i + 1.0);
 }

 How can I arrange things such that I compute e.g. 64 logs serially
 inside each pass?
Three things here: 1. For this kind of nano-parallelism, map() might be better suited. To fill an existing array, just use map's explicit buffer feature. taskPool.map!log(iota(1, logs.length + 1), logs); The option of using map() for nano-parallelism is part of my rationale for keeping the pretty but mildly inefficient delegate call in parallel foreach. 2. You're severely overestimating the overhead of the delegate call. log() is a pretty cheap function and even so speedups are decent with parallel foreach compared to a regular foreach loop. import std.stdio, std.parallelism, std.datetime, std.math; void main() { auto logs = new float[10_000_000]; taskPool(); // Initialize TaskPool before timing. auto sw = StopWatch(autoStart); foreach(i, ref elem; logs) { elem = log(i + 1); } writeln("Serial: ", sw.peek.msecs); sw.reset(); foreach(i, ref elem; parallel(logs)) { elem = log(i + 1); } writeln("Parallel Foreach: ", sw.peek.msecs); } Results: Serial: 619 Parallel Foreach: 388 I'd include parallel map, too, but for some reason (probably stack alignment or something) including it changes the results for parallel foreach. 3. If you really want to do as you suggest, just make a chunks range or something (maybe this should be in std.range) that iterates over all non-overlapping size N slices of a sliceable range. Use this for the parallel loop, then loop over the individual elements of the slice inside.
 I'm confused here. I use join() pretty much all the time. I launch stuff
 (e.g. run a large matrix-vector multiplication for distinct vectors) and
 then I join that. Then again and again. The thread of execution has a
 repeated hourglass shape - it fans out and then becomes single-threaded
 at join points. I assume some of those computations are indeed the
 charter of parallel foreach, but I'd guess not all. But then what do I
 know - I'm the customer :o).
Yes, parallel foreach is idiomatic here, or maybe parallel map. In the pre-release versions of std.parallelism (my experimentation with parallelism libraries in D goes back to mid-2008, over a year before I released the first version as Parallelfuture), only the task primitive existed. I discovered that, in practice, creating Task objects in a loop is almost always a PITA. It can also be inefficient in that, in a naive implementation all of these exist in memory simultaneously when this might be completely unnecessary. I decided to create higher level primitives to handle all the use cases I could think of where you might otherwise want to create Task objects in a loop. If you're explicitly creating Task objects in a loop in std.parallelism, I can just about guarantee that there's an easier and at least equally efficient way to accomplish what you want to accomplish. If there are any important use cases I missed, I'd be much more interested in creating a few more high-level primitives rather than making it easier to work with Task objects created in a loop.
Mar 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 10:16 PM, dsimcha wrote:
 * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
 on a usual machine?
I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.
If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat").
Ok, I thought you were asking for something much more rigorous than this. I therefore didn't want to provide it because I figured that, no matter what I did, someone would be able to say that the benchmark is flawed some how, yada, yada, yada. Given how inexact a science benchmarking is, I'm still hesitant to put such results in API docs, but I can see where you're coming from here.
I tried putting a few benchmarks in for the performance of parallel foreach, map and reduce. Basically, I just put the timings in for the examples on my dual core machine, and for an equivalent serial version that is not shown in the interest of brevity but is easily inferred. I didn't do this for the pipelining or future/promise primitives because these aren't as capable of automatically taking full advantage of as many cores as you can throw at them as map, reduce and foreach. This means that coming up with a benchmark that shows where a good speedup can be obtained is a much more difficult art.
Mar 20 2011
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
 * "workUnitSize: The number of elements to evaluate in a single Task.
 Must be less than or equal to bufSize, and in practice should be a
 fraction of bufSize such that all worker threads can be used." Then why
 not specify a different parameter such a multiplier or something? The
 dependence between bufSize and worUnitSize is a sign that these two
 should be improved. If you have good reasons that the user must have the
 parameters in this form, give an example substantiating that.
I did this for consistency with all the other functions, where work unit size is specified directly. I don't want lazyMap to be the oddball function where it's specified in a completely different way.
Mar 19 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 19 March 2011 09:03:51 Andrei Alexandrescu wrote:
 On 03/19/2011 02:32 AM, dsimcha wrote:
 Ok, thanks again for clarifying **how** the docs could be improved. I've
 implemented the suggestions and generally given the docs a good reading
 over and clean up. The new docs are at:
 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
* When using "parallelism" as a common noun, prefix is with a '_' so ddoc doesn't underline it.
Oooh. That's a neat trick. I didn't know that you could do that. - Jonathan M Davis
Mar 19 2011