
digitalmars.D - std.parallelism changes done

reply dsimcha <dsimcha yahoo.com> writes:
I've finished all of the changes that were discussed in the initial 
std.parallelism review.  I know I said I needed more time than this, but 
honestly, I hit a best-case scenario.  I had more time than I 
anticipated to work on it *and* the changes (especially fixing the 
exception handling issue) took less time than I anticipated.

I would say that the documentation has now improved "radically", as 
Andrei suggested it needed to.  I want to thank Andrei and Michel Fortin 
for their extremely useful suggestions, and apologize for getting 
defensive at times.  I felt compelled to defend my initial design in 
cases where I shouldn't have, due to a combination of time pressure and 
a misunderstanding of the review process.  Andrei's suggestions in 
particular led to a tremendous improvement of the documentation and, to 
a lesser extent, an improvement of the API.

In addition to improving the documentation, I added 
Task.executeInNewThread() to allow Task to be useful without a TaskPool. 
  (Should this have a less verbose name?)  I also fixed some exception 
handling bugs, implemented exception chaining for exceptions thrown 
concurrently, and fixed some silliness with respect to seed values in 
reduce().

One thing Andrei mentioned that I'm really not sure about is what to do 
with TaskPool.join().  My example for it is still terrible, because I 
think it's an evolutionary artifact.  It was useful in earlier designs 
that were never released and didn't have high-level data parallelism 
primitives.  I never use it, don't have any good use cases for it and 
would be inclined to remove it entirely.  Andrei seems to have some good 
use cases in mind, but he has not detailed any that I believe are 
reasonably implementable and I'm not sure whether they could be solved 
better using the higher-level data parallelism primitives.

As far as the vote, I know I asked for more time and to allow another 
module ahead of me, and of course I'll honor that.  I'd like to un-git 
stash this review as soon as isemail is done, though.  I anticipate a 
decent amount more suggestions now, given how major the changes have 
been, but most will probably be minor things like documentation 
clarifications and renaming stuff.

The new docs are at 
http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .
Mar 23 2011
next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 24 Mar 2011 05:32:26 +0100, dsimcha <dsimcha yahoo.com> wrote:

 In addition to improving the documentation, I added  
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.  
   (Should this have a less verbose name?)

spawnAndRun? -- Simen
Mar 23 2011
prev sibling next sibling parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks.

Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Sönke
Mar 24 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 03:00:01 -0400, Sönke Ludwig 
<ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
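A rough sketch of what I mean, using the documented TaskPool API. Whether (and how) a TaskPool exposes the OS priority of its worker threads is an open question at this point, so that part is left as a comment:

```d
import std.parallelism;

void urgentWork()     { /* latency-sensitive job */ }
void backgroundWork() { /* bulk job */ }

void main()
{
    // One pool per priority class.
    auto highPool = new TaskPool(2);
    auto lowPool  = new TaskPool(2);

    // Hypothetical: lower the OS priority of lowPool's workers.
    // lowPool.priority = Thread.PRIORITY_MIN;

    auto a = task!urgentWork();
    highPool.put(a);            // goes to the high-priority pool

    auto b = task!backgroundWork();
    lowPool.put(b);             // goes to the low-priority pool

    a.yieldForce;
    b.yieldForce;

    highPool.finish();
    lowPool.finish();
}
```

The point is simply that priority selection becomes a matter of which pool you put the task into.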
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 9:15 PM, Sönke Ludwig wrote:
 Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a
 TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)

My main concern here is that these kinds of use cases are getting far beyond the scope of std.parallelism. By definition (at least as I understand it) parallelism is focused on throughput, not responsiveness/latency, and is about utilizing as many execution resources as possible for useful work. (This article, originally posted here by Andrei, describes the distinction nicely: http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/)

If you're implementing parallelism, then it is correct to use only one thread on a single-core machine (std.parallelism does this by default), since one thread will utilize all execution resources. If you're implementing concurrency, this is not correct. Concurrency is used to implement parallelism, but that's different from saying concurrency _is_ parallelism. When you start talking about application responsiveness, prioritization, etc., you're getting beyond _parallelism_ and into general-case concurrency.

I have neither the expertise nor the desire to build a general-case concurrency library. D already has a general-case concurrency library (std.concurrency), and that might be a better place to implement suggestions dealing with general-case concurrency. std.parallelism was designed from the ground up to focus on parallelism, not general-case concurrency. I don't mind implementing features useful to general-case concurrency if they're trivial in both interface and implementation, but I'd rather not do any that require major changes to the interface or implementation.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 02:51, schrieb dsimcha:
 On 3/24/2011 9:15 PM, Sönke Ludwig wrote:
 Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a
 TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)

My main concern here is that these kinds of use cases are getting far beyond the scope of std.parallelism. By definition (at least as I understand it) parallelism is focused on throughput, not responsiveness/latency and is about utilizing as many execution resources as possible for useful work. (This article, originally posted here by Andrei, describes the distinction nicely: http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/)

If you're implementing parallelism, then it is correct to only use one thread on a single-core machine (std.parallelism does this by default), since one thread will utilize all execution resources. If you're implementing concurrency, this is not correct. Concurrency is used to implement parallelism, but that's different from saying concurrency _is_ parallelism. When you start talking about application responsiveness, prioritization, etc., you're getting beyond _parallelism_ and into general-case concurrency.

I have neither the expertise nor the desire to build a general case concurrency library. D already has a general case concurrency library (std.concurrency), and this might be a better place to implement suggestions dealing with general-case concurrency. std.parallelism was designed from the ground up to focus on parallelism, not general-case concurrency. I don't mind implementing features useful to general-case concurrency if they're trivial in both interface and implementation, but I'd rather not do any that require major changes to the interface or implementation.

Well, what can I say.. things can become more complex, and you cannot always say this is parallelism and this is concurrency or something. It's just nice when the library does not get in the way when you are in a situation where e.g. throughput and responsiveness or whatever else matter. Sometimes it can be a small change that can make or break the deal. It's sad that I'm not able to get my points across.. but I'm too tired right now to go down and discuss the parallelism/concurrency topic.
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:31 PM, Sönke Ludwig wrote:
 Well, what can I say.. things can become more complex, and you cannot
 always say this is parallelism and this is concurrency or something.
 It's just nice when the library does not get in the way when you are in
 a situation where e.g. throughput and responsiveness or whatever else
 matter. Sometimes it can be a small change that can make or break the
 deal.

Agreed. I'm not trying to be pedantic here, and I'm certainly willing to make **small** changes even if they stretch the scope somewhat into general concurrency. It's just that I don't want to make big changes, especially if they will make the interface more complex, reduce efficiency and/or lock me into certain implementations. (For example, using a priority queue precludes supporting work stealing later without breaking the priority feature.)
Mar 24 2011
parent Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 04:32, schrieb dsimcha:
 On 3/24/2011 10:31 PM, Sönke Ludwig wrote:
 Well, what can I say.. things can become more complex, and you cannot
 always say this is parallelism and this is concurrency or something.
 It's just nice when the library does not get in the way when you are in
 a situation where e.g. throughput and responsiveness or whatever else
 matter. Sometimes it can be a small change that can make or break the
 deal.

Agreed. I'm not trying to be pedantic here, and I'm certainly willing to make **small** changes even if they stretch the scope somewhat into general concurrency. It's just that I don't want to make big changes, especially if they will make the interface more complex, reduce efficiency and/or lock me into certain implementations. (For example, using a priority queue precludes supporting work stealing later without breaking the priority feature.)

I agree, the priorities are something that can be important in some cases, but most of the time they are not really _necessary_ in that sense. And maybe in most of the cases where someone would like to have them, Michel's suggestion to create a second thread pool with a different priority may be just fine. The more important aspect was the OOB part with the cached threads for something like executeInNewThread. And even that is not a real deal-breaker.
Mar 25 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:00 AM, Sönke Ludwig wrote:
 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses priority per task to control which tasks can overtake others. A special priority is out-of-bands (the name my be debatable), which will guarantee that the task will run in its own thread so it can safely wait for other tasks.

This may not be an issue in the std.parallelism design. A TaskPool task can safely wait on other tasks. What prevents this from causing a deadlock is that calling yieldForce, spinForce, or waitForce on a task that has not started executing yet will execute the task immediately in the thread that tried to force it, regardless of where it is in the queue.

 However, those threads that process OOB tasks are also cached in
 the thread pool and reused for new OOB tasks. Only if the number of
 parallel OOB tasks goes over a specific number, new threads will be
 created and destroyed. This can safe quite a bit of time for those tasks.

Unfortunately this is not implementable without a massive overhaul of TaskPool. There are some baked in assumptions that the number of worker threads in a pool will not change after the pool's creation. Furthermore, I'm not sure how worker-local storage would be efficiently implementable without this restriction.
 Both kinds of priority have been very useful and I would suggest to put
 at least the executeInNewThread() method into ThreadPool to be later
 able to make such an optimization.

Can you elaborate on this? The whole point of executeInNewThread() was supposed to be that a TaskPool is not needed for simple cases.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 14:25, schrieb dsimcha:
 On 3/24/2011 3:00 AM, Sönke Ludwig wrote:
 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses priority per task to control which tasks can overtake others. A special priority is out-of-bands (the name my be debatable), which will guarantee that the task will run in its own thread so it can safely wait for other tasks.

This may not be an issue in the std.parallelism design. A TaskPool task can safely wait on other tasks. What prevents this from causing a deadlock is that calling yieldForce, spinForce, or waitForce on a task that has not started executing yet will execute the task immediately in the thread that tried to force it, regardless of where it is in the queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.
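To make the safe case concrete, a small sketch using the names from the docs (assuming yieldForce behaves as described, i.e. it runs a not-yet-started task inline in the forcing thread):

```d
import std.parallelism;

int produce() { return 21; }

void main()
{
    auto t = task!produce();
    taskPool.put(t);

    // Even if `t` has not been picked up by a worker yet, yieldForce
    // executes it inline in this thread instead of blocking on it,
    // so a task can safely wait on another task this way.
    int x = t.yieldForce;
    assert(x == 21);
}
```

The problematic case is precisely when the waiting goes through a mutex or condition variable instead of one of the force methods, because then the pool cannot run the dependency inline.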
 However, those threads that process OOB tasks are also cached in
 the thread pool and reused for new OOB tasks. Only if the number of
 parallel OOB tasks goes over a specific number, new threads will be
 created and destroyed. This can safe quite a bit of time for those tasks.

Unfortunately this is not implementable without a massive overhaul of TaskPool. There are some baked in assumptions that the number of worker threads in a pool will not change after the pool's creation. Furthermore, I'm not sure how worker-local storage would be efficiently implementable without this restriction.
 Both kinds of priority have been very useful and I would suggest to put
 at least the executeInNewThread() method into ThreadPool to be later
 able to make such an optimization.

Can you elaborate on this? The whole point of executeInNewThread() was supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

But what I wanted to say is: even if it may be difficult to implement such thread caching now, putting the means to execute a Task in its own thread into the ThreadPool now allows for such an optimization later (it could even coexist with Task.executeInNewThread()).

Btw., another more advanced thing that is interesting is to use IO completion ports to dynamically increase the number of threads in cases of contention or IO waits. But this is of course a whole different story, and I certainly would not expect it to get in at this stage, or maybe ever. Just wanted to throw that in.
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 9:05 PM, Sönke Ludwig wrote:
 This may not be an issue in the std.parallelism design. A TaskPool task
 can safely wait on other tasks. What prevents this from causing a
 deadlock is that calling yieldForce, spinForce, or waitForce on a task
 that has not started executing yet will execute the task immediately in
 the thread that tried to force it, regardless of where it is in the
 queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.
 Can you elaborate on this? The whole point of executeInNewThread() was
 supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

executeInNewThread() is useful where you only have a few Tasks, rather than needing to map a large number of Tasks onto a small number of threads. Using core.thread here doesn't cut it, because core.thread doesn't automate passing the function's arguments to the new thread. Also, I figured out that some of the conditions for safe tasks can be relaxed a little if they run in a dedicated thread instead of a TaskPool.
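A rough side-by-side sketch of the difference (the delegate capture on the core.thread side is just one way to do it, and `download` is a made-up example function):

```d
import core.thread;
import std.parallelism;

void download(string url) { /* fetch url ... */ }

void main()
{
    // With core.thread you bundle the argument into a delegate yourself:
    auto t1 = new Thread({ download("http://example.com/a"); });
    t1.start();
    t1.join();

    // Task bundles the function and its arguments for you, and
    // yieldForce hands any thrown exception back to this thread:
    auto t2 = task!download("http://example.com/b");
    t2.executeInNewThread();
    t2.yieldForce;
}
```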
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 02:17, schrieb dsimcha:
 On 3/24/2011 9:05 PM, Sönke Ludwig wrote:
 This may not be an issue in the std.parallelism design. A TaskPool task
 can safely wait on other tasks. What prevents this from causing a
 deadlock is that calling yieldForce, spinForce, or waitForce on a task
 that has not started executing yet will execute the task immediately in
 the thread that tried to force it, regardless of where it is in the
 queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion so it can spawn new tasks for which the results are now available.

The first thing here is that you do not want to do the waitForce() kind of waiting in the coordinator task, because this might keep the coordinator busy with an expensive task while it could already be spawning new tasks, since in the meantime some other tasks may have finished. However, if you wait on a condition variable instead (which is signaled after each finished task), and if you can have multiple computations of this kind running in parallel, you can immediately run into the situation that the thread pool is crowded with coordinator tasks that are all waiting on their condition variables, which will never be signaled because no worker tasks can be executed.

This is only one example; basically this problem can arise in all cases where one task depends on another task through some form of waiting that does not execute the dependency the way waitForce() does. I also have a completely different example involving the main thread (doing GPU work) which is much more difficult, but I don't think I can make that clear with only text.
 Can you elaborate on this? The whole point of executeInNewThread() was
 supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

executeInNewThread() is useful where you only have a few Tasks, rather than needing to map a large number of Tasks onto a small number of threads. Using core.thread here doesn't cut it, because core.thread doesn't automate passing the function's arguments to the new thread. Also, I figured out that some of the conditions for safe tasks can be relaxed a little if they run in a dedicated thread instead of a TaskPool.

I see.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little clearer what I mean. The problem is just that the system I'm talking about is quite complex, and it's not easy to find good and simple examples in that system. The problems of course arise only in the most complex paths of execution.

What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because the thread pool is not created. In the latter case, caching some threads to be reused for an executeInOwnThread()-style method should lead to better performance in almost any case where thread creation overhead is relevant.

.. OK, my writing skills are degrading rapidly as I fight against my closing eyes.. will go to sleep now
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:21 PM, Sönke Ludwig wrote:
 Indeed, this pattern solves the problem of waiting for the completion
 of a specific task. It also avoids the huge deadlock potential that a
 general yield() not taking a task would have. However, it does not
 solve the general problem of one task waiting for another, which could
 happen through a condition variable or just a mutex used in the middle
 of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion, spawning new tasks as the results they depend on become available.

As I've said before in related discussions, you are _probably_ better off using one of the high level primitives instead of using tasks directly in these cases. If not, I'd prefer to improve the higher level primitives and/or create new ones if possible. (Feel free to suggest one if you can think of it.) Tasks are, IMHO, too low level for anything except basic future/promise parallelism and implementing higher level primitives. Incidentally the situation you describe (a coordinator task creating lots of worker tasks) is exactly how amap(), reduce() and parallel foreach work under the hood. This complexity is completely encapsulated, though.
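To make the encapsulation point concrete, a sketch of using the eager parallel map (using amap as its name; the naming was still being settled elsewhere in this thread):

```d
import std.math : sqrt;
import std.parallelism : taskPool;

void main()
{
    // One call; the coordinator/worker-task pattern described above is
    // hidden inside amap(), which submits work-unit tasks to the pool
    // internally and joins them before returning.
    auto roots = taskPool.amap!sqrt([1.0, 4.0, 9.0, 16.0]);
    assert(roots == [1.0, 2.0, 3.0, 4.0]);
}
```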
 First thing here is that you do not want to do the waitForce() kind of
 waiting in the coordinator task because this might cause the coordinator
 to be busy with an expensive task while it could already spawn new tasks
 because maybe in the meantime some other tasks have already finished.

I assume you mean yieldForce().
 However, if you wait for a condition variable instead (which is fired
 after each finished task) and if you can have multiple computations of
 this kind running in parallel, you can immediately run into the
 situation that the thread pool is crowded with coordinator tasks that
 are all waiting for their condition variables which will never be
 triggered because no worker tasks can be executed.

I assume you're talking about a condition variable other than the one yieldForce() uses. As mentioned previously, in the specific case of yieldForce() this is a solved problem. In the general case I can see the problem.
 This is only one example and basically this problem can arise in all
 cases where one task depends on another task by some form of waiting
 that will not execute the dependency like waitForce() does.

Hmm, ok, I definitely understand the problem now.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little more clear what I mean. The problem is just that the system I'm talking about is quite complex and it's not easy to find good and simple examples in it. The problems of course arise only in the most complex paths of execution.. What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because no thread pool is created. In the latter case, caching some threads to be reused by an executeInOwnThread() method should lead to better performance in almost any case where thread creation overhead is relevant.

Ok, now I'm starting to understand this. Please correct me (once you've gotten a good night's sleep and can think again) wherever this is wrong:

1. As is currently the case, executeInNewThread() is _guaranteed_ to start the task immediately. There is never a queue involved.

2. Unlike the current implementation, executeInNewThread() may use a cached thread. It will **NOT**, however, put the task on a queue or otherwise delay its execution. If no cached thread is available, it will create a new one and possibly destroy it when the task is done.

Thanks for this suggestion. Now that I (think I) understand it, it makes sense in principle. The devil may be in the details, though.

1. How many threads should be cached? I guess this could just be configurable with some reasonable default.

2. Should the cache be lazily or eagerly created? I'd assume lazily.

3. Where would these threads be stored? I think they probably belong in some kind of thread-safe global data structure, **NOT** in a TaskPool instance.

4. How would we explain to people what the cache is good for and how to use it? The fact that you're proposing it and even you find this difficult to convey makes me skeptical that this feature is worth the weight it adds to the API. Maybe you'll find it easier once you get some sleep. (I understand the problem it solves at an abstract level but have never encountered a concrete use case. It also took me a long time to understand it.)

5. It would break some relaxations of what safe tasks can do when started via executeInNewThread().

6. This whole proposal might fit better in std.concurrency, by using a thread cache for spawn().
Mar 24 2011
parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 05:14, schrieb dsimcha:
 On 3/24/2011 10:21 PM, Sönke Ludwig wrote:
 Can you elaborate and/or provide an example of the "general" problem?
 I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion, spawning new tasks as the results they depend on become available.

As I've said before in related discussions, you are _probably_ better off using one of the high level primitives instead of using tasks directly in these cases. If not, I'd prefer to improve the higher level primitives and/or create new ones if possible. (Feel free to suggest one if you can think of it.) Tasks are, IMHO, too low level for anything except basic future/promise parallelism and implementing higher level primitives. Incidentally the situation you describe (a coordinator task creating lots of worker tasks) is exactly how amap(), reduce() and parallel foreach work under the hood. This complexity is completely encapsulated, though.

I would certainly agree that this belongs in a higher level structure. This structure would basically get a set of special tasks, where each of those tasks has a list of all the tasks it depends on. All tasks would then be executed in parallel on a thread pool in an order that satisfies their dependencies - possibly with some form of cost function that controls which task should come first if multiple orders are possible.

One problem here, for example, is that in the system I have, I need to execute several tasks in the main thread by sending a message to it (the main thread executes window messages in a loop). Specifically, this is for tasks that use OpenGL or a similar API that is bound to a single thread - and the main thread is the most efficient one to use because it already has an OpenGL context. The important thing is to either support such things or to make the design general enough that the user can add them from the outside. Otherwise, if you really need such things, the only option is to use a completely custom thread pool, and that means no parallel foreach, map, reduce or whatever might be added later.
 First thing here is that you do not want to do the waitForce() kind of
 waiting in the coordinator task because this might cause the coordinator
 to be busy with an expensive task while it could already spawn new tasks
 because maybe in the meantime some other tasks have already finished.

I assume you mean yieldForce().

Yes, sorry, got the names mixed up.
 However, if you wait for a condition variable instead (which is fired
 after each finished task) and if you can have multiple computations of
 this kind running in parallel, you can immediately run into the
 situation that the thread pool is crowded with coordinator tasks that
 are all waiting for their condition variables which will never be
 triggered because no worker tasks can be executed.

I assume you're talking about a condition variable other than the one yieldForce() uses. As mentioned previously, in the specific case of yieldForce() this is a solved problem. In the general case I can see the problem.

Yes, just the general problem with other condition variables.
 This is only one example and basically this problem can arise in all
 cases where one task depends on another task by some form of waiting
 that will not execute the dependency like waitForce() does.

Hmm, ok, I definitely understand the problem now.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later
 (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little more clear what I mean. The problem is just that the system I'm talking about is quite complex and it's not easy to find good and simple examples in it. The problems of course arise only in the most complex paths of execution.. What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because no thread pool is created. In the latter case, caching some threads to be reused by an executeInOwnThread() method should lead to better performance in almost any case where thread creation overhead is relevant.

Ok, now I'm starting to understand this. Please correct me (once you've gotten a good night's sleep and can think again) wherever this is wrong:

1. As is currently the case, executeInNewThread() is _guaranteed_ to start the task immediately. There is never a queue involved.

2. Unlike the current implementation, executeInNewThread() may use a cached thread. It will **NOT**, however, put the task on a queue or otherwise delay its execution. If no cached thread is available, it will create a new one and possibly destroy it when the task is done.

Exactly.
 Thanks for this suggestion. Now that I (think I) understand it, it makes
 sense in principle. The devil may be in the details, though.

 1. How many threads should be cached? I guess this could just be
 configurable with some reasonable default.

A configurable minimum number of threads sounds reasonable. The default could probably be a fixed small number like 1 or 2.
 2. Should the cache be lazily or eagerly created? I'd assume lazily.

Lazy sounds good.
 3. Where would these threads be stored? I think they probably belong in
 some kind of thread-safe global data structure, **NOT** in a TaskPool
 instance.

That's a good question.. ThreadPool would be nice because it is the class you may already be dragging an instance of through your code anyway. Global would certainly work.
 4. How would we explain to people what the cache is good for and how to
 use it? The fact that you're proposing it and even you find this
 difficult to convey makes me skeptical that this feature is worth the
 weight it adds to the API. Maybe you'll find it easier once you get some
 sleep. (I understand the problem it solves at an abstract level but have
 never encountered a concrete use case. It also took me a long time to
 understand it.)

I would basically just say it's a faster way than creating a new thread each time you start a task. Use it whenever you need a task to run outside of the thread pool threads - candidates are tasks that wait a lot, either because of I/O or because of waiting primitives other than the ones present in ThreadPool (message queues, condition variables).

But please don't make the mistake of dismissing the problem because it is complex. Being complex and maybe rare does not mean it cannot be important. It's like a bug that deletes your data, but only in very rare and complex use cases of the application. You would not want to ignore that just for those reasons. Also, I'm not sure if the primitives of std.concurrency are allowed within a Task; maybe not. But if they are, it would be really easy to construct a higher level example without condition variables and the like.
 5. It would break some relaxations of what  safe tasks can do when
 started via executeInNewThread().

You mean because of TLS that is not reinitialized each time? I have to admit that I can't really gauge the impact of this.
 6. This whole proposal might fit better in std.concurrency, by using a
 thread cache for spawn().

But isn't the previous problem (5.) even more relevant in std.concurrency? Putting it near ThreadPool could be a good idea because it still is some sort of thread pool in the abstract sense. It could also be something that std.concurrency uses for its spawn(). Anyway, I would be happy if a place were allocated for this wherever it fits. If it is std.concurrency, that's fine as long as std.concurrency and std.parallelism play well together.

One problem with std.concurrency is that it also does not really work well when you need to wait for primitives other than a message. Having something like WaitForMultipleObjects is critical in many cases, but that's another topic.

Having said all this, I just want to make sure that you don't get this wrong. I certainly do not want to push in complex changes for no reason - and no complex changes at all, for that matter. And in this case I see how it is _functionally_ independent from the ThreadPool itself. My problem here is just that there is an executeInNewThread() function that really should not be used for _many_ tasks, and maybe in most other cases it would be cleaner to use spawn() - it may still have its place if you count in the safety implications, but I would also like to see a function supporting the same threading guarantees but suitable for many tasks, if only to avoid bad usage patterns of executeInNewThread(). However, it definitely is in the gray area between std.concurrency and std.parallelism.
Mar 25 2011
parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 10:33, schrieb Sönke Ludwig:
yadda-yadda

Apart from all this - I just want to make this a known problem; what you (or maybe Andrei for std.concurrency) decide is up to you, and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++). And I would like to say that I like this module in general very much. It also fits almost 99% of the design I have in my libraries ;)
Mar 25 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/25/2011 5:42 AM, Sönke Ludwig wrote:
 Am 25.03.2011 10:33, schrieb Sönke Ludwig:
 yadda-yadda

Apart from all this - I just want to make this a known problem, what you (or maybe Andrei for std.concurrency) decide is up to you and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++).

You've done a good job of explaining a complex problem. I appreciate it.

I think we should make this a long-term todo, like Michael Fortin's suggestion that std.concurrency should be able to create tasks or std.parallelism should handle message passing. You should probably file a Bugzilla enhancement request saying Phobos should support thread caching, so that this proposal doesn't get lost.

Your proposal is feasible and solves an important problem. On the other hand, the design and implementation details are still vague and would require substantial discussion. The feature is tangential to the purpose of std.parallelism and clearly crosses the line into general case concurrency.

Bottom line: This proposal should not hold up the vote and adoption of std.parallelism, but it should not be discarded permanently either.
Mar 25 2011
parent =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 15:40, schrieb dsimcha:
 On 3/25/2011 5:42 AM, Sönke Ludwig wrote:
 Am 25.03.2011 10:33, schrieb Sönke Ludwig:
 yadda-yadda

Apart from all this - I just want to make this a known problem, what you (or maybe Andrei for std.concurrency) decide is up to you and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++).

You've done a good job of explaining a complex problem. I appreciate it. I think we should make this a long-term todo, like Michael Fortin's suggestion that std.concurrency should be able to create tasks or std.parallelism should handle message passing. You should probably file a Bugzilla enhancement request saying Phobos should support thread caching, so that this proposal doesn't get lost. Your proposal is feasible and solves an important problem. On the other hand, the design and implementation details are still vague and would require substantial discussion. The feature is tangential to the purpose of std.parallelism and clearly crosses the line into general case concurrency. Bottom line: This proposal should not hold up the vote and adoption of std.parallelism, but it should not be discarded permanently either.

Agreed
Mar 26 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 and apologize for getting defensive at times.

It happens to mammals, don't worry.
 The new docs are at 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

    real getTerm(int i) {
        immutable x = ( i - 0.5 ) * delta;
        return delta / ( 1.0 + x * x ) ;
    }
    immutable pi = 4.0 * taskPool.reduce!"a + b"(
        std.algorithm.map!getTerm(iota(n))
    );

For the examples I suggest using q{a + b} instead of "a + b". When D gains a good implementation of conditional purity, I think taskPool.reduce and taskPool.map may accept only pure functions to map on and pure iterables to work on.
template map(functions...)
    Eager parallel map.

So in Phobos you have std.algorithm.map, which is lazy, and taskPool.map, which is eager. I think it's not so good to have two functions with the same name but subtly different purposes. I have suggested that Andrei add a std.algorithm.amap, that is array(map()): http://d.puremagic.com/issues/show_bug.cgi?id=5756 So I suggest renaming "taskPool.map" to "taskPool.amap" and "taskPool.lazyMap" to "taskPool.map".
    auto squareRoots = new float[numbers.length];
    taskPool.map!sqrt(numbers, squareRoots);

Currently, to do the same thing with the normal std.algorithm.map you need something like:

    auto squareRoots = new double[numbers.length];
    copy(map!sqrt(numbers), squareRoots);
 A semi-lazy parallel map that can be used for pipelining.

The idea of such vectorized laziness may give a performance enhancement even for the serial std.algorithm.map.

In the module documentation I'd like to see a graph that shows how the parallel map/reduce/foreach scale as the number of cores goes from 1 to 2 to 4 to 8 (or more) :-)

Thank you for this very nice Phobos module.

Bye, bearophile
Mar 24 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:23 AM, bearophile wrote:
 dsimcha:

 and apologize for getting defensive at times.

It happens to mammals, don't worry.
 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

     real getTerm(int i) {
         immutable x = ( i - 0.5 ) * delta;
         return delta / ( 1.0 + x * x ) ;
     }
     immutable pi = 4.0 * taskPool.reduce!"a + b"(
         std.algorithm.map!getTerm(iota(n))
     );

For the examples I suggest to use q{a + b} instead of "a + b".

I tried to keep it as consistent as possible with std.algorithm.
 When D will gain a good implementation of conditional purity, I think
taskPool.reduce and taskPool.map may accept only pure functions to map on and
pure iterables to work on.

Eventually, maybe. Definitely not now, though, because in practice it would severely affect usability.
 In the module documentation I'd like to see a graph that shows how the
parallel map/reduce/foreach scale as the number of cores goes to 1 to 2 to 4 to
8 (or more) :-)

Unfortunately I don't have access to this kind of hardware except at work.
Mar 24 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 I tried to keep it as consistent as possible with std.algorithm.

OK. Then the question is why std.algorithm uses normal strings instead of q{} ones. And regarding consistency with std.algorithm, a more important factor is that std.algorithm.map is lazy, while you have an eager map, and the lazy version has lazy in the name, so the names are kind of the opposite of std.algorithm.
 Unfortunately I don't have access to this kind of hardware except at work.

Maybe some other person in this group has access to an 8 core CPU and is willing to take some numbers, to create a few little graphs. Bye, bearophile
Mar 24 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 I tried to keep it as consistent as possible with std.algorithm.


I personally think "" strings look nicer for simple cases like "a + b". At any rate, this is a bikeshed issue.
 And regarding consistency with std.algorithm, a more important factor is that
 std.algorithm.map is lazy, while you have an eager map, and the lazy version
 has lazy in the name, so the names are kind of opposite of std.algorithm.

Hmm, you do have a point there. Two reasons:

1. map() was there first and at the time I didn't feel like renaming it.

2. I think map() is much more frequently useful than lazyMap(), and name verbosity should be inversely proportional to usage frequency. (Which is why I really need help thinking of a better name than executeInNewThread().)

I'm not sure whether I want to change this. Input from the rest of the community would be useful, as long as it doesn't end up going off onto some wild tangent and starting Bikeshed War III.
 Unfortunately I don't have access to this kind of hardware except at work.


I can kinda see Andrei's point about adding a few benchmarks just to give a rough idea of the level of fine-grainedness this library is capable of and how to get it, but these requests are exactly what I was afraid of when I put them in. I really don't think rigorous, detailed benchmarks have any place in API documentation. The purpose of API documentation is for people to figure out what the library does and how to use it, not to sell a product or precisely quantify its performance.

Next people will be asking for standard deviations, error bars, P-values, confidence intervals, regression models of performance vs. cores and basically things that dstats is good for, or for me to port their favorite non-trivial benchmark from their favorite non-D language.
Mar 24 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.

I agree, but I have suggested to replace "map" => "amap" and "lazyMap" => "map" (and to add a fully eager amap to std.algorithm). The increase of verbosity is not big.
 (Which is why I really need
 help thinking of a better name than executeInNewThread().)

Andrei is good at this :-)
 I really don't think rigorous, detailed benchmarks have any place in API
 documentation.

OK. I will be happy to see a few graphs outside the API doc :-) Bye, bearophile
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from spir (denis.spir gmail.com)'s article
 On 03/24/2011 06:04 PM, dsimcha wrote:
 Hmm, you do have a point there.  Two reasons:

 1.  map() was there first and at the time I didn't feel like renaming it.

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.  (Which is why I really
need
 help thinking of a better name than executeInNewThread().)

std.algo funcs, I guess. Denis

Ok, I'm convinced that the eagerness of map() may throw people off. Unless anyone objects, I'm going with map() for the lazy version and amap() for the eager version. This is both consistent and terse.
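Under the naming settled on here, a quick illustration (a sketch assuming the renamed API; the exact default buffer sizes are unspecified assumptions):

```d
import std.math : sqrt;
import std.parallelism : taskPool;

void main()
{
    auto numbers = [1.0, 4.0, 9.0];

    // Eager: amap allocates and returns a new array, like array(map(...)).
    double[] eager = taskPool.amap!sqrt(numbers);

    // (Semi-)lazy: map returns a range evaluated in buffered chunks on demand,
    // consistent with std.algorithm.map being the lazy one.
    auto lazyRange = taskPool.map!sqrt(numbers);
    foreach (r; lazyRange)
    {
        // consumed lazily, chunk by chunk
    }
}
```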
Mar 24 2011
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Russel Winder (russel russel.org.uk)'s article
 Is there actually any point in having a lazy parallel map?

It's for pipelining. Please read the API documentation for details about how it works. It's actually only semi-lazy.
 Unfortunately I don't have access to this kind of hardware except at work.

 Maybe some other person in this group has access to an 8 core CPU and
 is willing to take some numbers, to create a few little graphs.

 I have such hardware, if someone sends me a program to execute I can run
 it a few times to get a dataset.

Just try it with the examples in the API docs if you want.
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 1:34 PM, spir wrote:
 On 03/24/2011 05:32 PM, bearophile wrote:
 I tried to keep it as consistent as possible with std.algorithm.

 OK. Then the question is why std.algorithm uses normal strings instead
 of q{} ones. And regarding consistency with std.algorithm, a more important factor is that std.algorithm.map is lazy, while you have an eager map, and the lazy version has lazy in the name, so the names are kind of opposite of std.algorithm.

I agree there should be a consistent naming scheme here. Aligning with std.algorithm and treating lazy as the standard sounds good. (Better than the opposite anyway; maybe not as good as being explicit for both cases - but this is a lost fight ;-) Denis

Thanks for the suggestions, everyone. I've made all the changes suggested today, plus a few others:

1. map -> amap (eager), lazyMap -> map (lazy)

2. isPure is no longer documented, since it was never meant to be part of the public API.

3. I put TaskPool.join() in a version(none) statement and removed all references to it from the documentation. If anyone has a compelling reason why this is useful (preferably with a good, short example), I'll add it back. Otherwise, it's gone, since IMHO it's an evolutionary artifact that makes sense in theory but that I can't think of a use for anymore in practice.

4. After some prodding and thinking I've finally been convinced that task() should just return auto. I thought documenting the return types explicitly would clarify things, but from the questions I've gotten about it, it seems to obfuscate more than it clarifies.

5. This one should have been done a long time ago: amap() now accepts explicit buffers with element types that are implicit conversion targets of the return type of the map function.
Mar 24 2011
prev sibling parent Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-03-24 at 12:32 -0400, bearophile wrote:
 dsimcha:
=20
 I tried to keep it as consistent as possible with std.algorithm.

 OK. Then the question is why std.algorithm uses normal strings instead of
 q{} ones.

Actually the question is why use strings at all - why not have a lambda function syntax that is less verbose?
 And regarding consistency with std.algorithm, a more important factor
 is that std.algorithm.map is lazy, while you have a eager map, and the
 lazy version has lazy in the name, so the names are kind of opposite
 of std.algorithm.

Is there actually any point in having a lazy parallel map?
 Unfortunately I don't have access to this kind of hardware except at work.

 Maybe some other person in this group has access to an 8 core CPU and
 is willing to take some numbers, to create a few little graphs.

I have such hardware, if someone sends me a program to execute I can run it a few times to get a dataset.

--
Russel.

Dr Russel Winder          t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road        m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK       w: www.russel.org.uk  skype: russel_winder
Mar 24 2011
prev sibling next sibling parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Hm, depending on the way the pool is used, it might be a better default
to have the number of threads equal to the number of CPU cores. In my
experience the control thread mostly either waits for tasks or
processes messages, blocking in between, so it rarely uses a full
core, wasting the available computation time in this case.

However, I'm not really sure if it is like this for the majority of all 
applications or if there are more cases where the control thread will 
continue to do computations in parallel. Maybe we could collect some 
opinions on this?

On another note, I would like to see a rough description of what the
default workUnitSize is depending on the size of the input. Otherwise it
feels rather uncomfortable to use this version of parallel().
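For reference, a sketch of passing an explicit work unit size so the granularity is at least under the caller's control (the value 500 is an arbitrary choice for illustration):

```d
import std.parallelism : taskPool;

void main()
{
    auto data = new double[10_000];
    // Explicit work unit size: each pool task processes 500 consecutive
    // elements, so the granularity is visible instead of an unspecified
    // default.
    foreach (i, ref x; taskPool.parallel(data, 500))
        x = i;
}
```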

Another small addition would be to state whether the object returned by
asyncBuf is an InputRange, or which useful methods it might have
(some kind of progress counter could also be useful here).
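A small sketch of the input-range behavior being asked about here (the buffer size 64 and the doubling map are arbitrary choices for illustration):

```d
import std.algorithm : map;
import std.parallelism : taskPool;
import std.range : iota;

void main()
{
    // A lazy source range; asyncBuf evaluates it eagerly in buffers of 64
    // elements in a background task, and the wrapper it returns is itself
    // an InputRange, so it composes with foreach and range algorithms.
    auto source = iota(1_000).map!(i => i * 2);
    foreach (x; taskPool.asyncBuf(source, 64))
    {
        // consume x while the next buffer is filled concurrently
    }
}
```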

Btw., sorry if anything of this has already been discussed. I have 
missed the previous discussion unfortunately.

Sönke
Mar 24 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 03:29:52 -0400, Sönke Ludwig 
<ludwig informatik.uni-luebeck.de> said:

 Hm depending on the way the pool is used, it might be a better default 
 to have the number of threads equal the number of cpu cores. In my 
 experience the control thread is mostly either waiting for tasks or 
 processing messages and blocking in between so it rarely uses a full 
 core, wasting the available computation time in this case.
 
 However, I'm not really sure if it is like this for the majority of all 
 applications or if there are more cases where the control thread will 
 continue to do computations in parallel. Maybe we could collect some 
 opinions on this?

The current default is good for command line applications, where the main thread generally blocks while you're doing your work. The default you're proposing is good when you're using the task pool to pile up tasks to perform in the background, which is generally what you do in an event-driven application. The current default keeps the simpler, more linear programs simple.

My use case is like yours: an event-driven main thread which starts tasks to be performed in the background.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 8:03 AM, Michel Fortin wrote:
 On 2011-03-24 03:29:52 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Hm depending on the way the pool is used, it might be a better default
 to have the number of threads equal the number of cpu cores. In my
 experience the control thread is mostly either waiting for tasks or
 processing messages and blocking in between so it rarely uses a full
 core, wasting the available computation time in this case.

 However, I'm not really sure if it is like this for the majority of
 all applications or if there are more cases where the control thread
 will continue to do computations in parallel. Maybe we could collect
 some opinions on this?

 The current default is good for command-line applications, where the main 
 thread generally blocks while you're doing your work. The default you're 
 proposing is good when you're using the task pool to pile up tasks to 
 perform in the background, which is generally what you do in an 
 event-driven application. The current default keeps the simpler, more 
 linear programs simple. My use case is like yours: an event-driven main 
 thread which starts tasks to be performed in the background.

Please review the changes carefully, then, because this is a use case I 
know next to nothing about and didn't design for. I added a few 
(previously discussed) things to help this use case, at your request.

BTW, one hint that I'd like to mention (not sure how to work it into the 
docs) is that, if you **don't** want to execute a task in the current 
thread if it's not started, but want its return value if it's done (as may 
be the case when you need threading for responsiveness, not throughput), 
you can call Task.done to check whether it's done, and then 
Task.spinForce() to get the return value once you know it has already 
finished.
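A rough sketch of that pattern, assuming the API as documented in the review (`square` is just a made-up workload):

```d
import std.parallelism;

int square(int x) { return x * x; }

void main() {
    auto t = task!square(3);
    taskPool.put(t);

    // ... event loop keeps running; check back later without blocking ...

    if (t.done) {
        // Safe: spinForce() returns immediately because the task has
        // already finished, so the current thread never executes it.
        int result = t.spinForce();
        assert(result == 9);
    }
}
```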
Mar 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 09:46:01 -0400, dsimcha <dsimcha yahoo.com> said:

 Please review the changes carefully, then, because this is a use case I 
 know next to nothing about and didn't design for.

Well, it's practically the same thing, except you never want to execute a 
task in the main thread, because the main thread acts more like a 
coordinator for various things and must stay responsive. And since your 
main thread might be creating various kinds of tasks, you need a way to 
prioritize some tasks over others. I think creating a few task pools with 
various priorities and relying on OS thread scheduling would be adequate 
in most cases, but I haven't tried.

One thing I'd want to be sure of, however, is that you can use a parallel 
foreach from within a task, so that if you have one or two tasks that 
could benefit from data parallelism it won't bring the whole system down. 
From the API I don't think it'll be a problem.
 I added a few (previously discussed) things to help this use case, at 
 your request. BTW, one hint that I'd like to mention (not sure how to 
 work it into the docs) is that, if you **don't** want to execute a task 
 in the current thread if it's not started, but want its return value if 
 it's done (as may be the case when you need threading for 
 responsiveness, not throughput), you can call Task.done to check if 
 it's done, and then Task.spinForce() to get the return value after you 
 know it's already done.

The problem with using Task.done that way is that it requires polling. It 
might be appropriate in some cases, but in general you want to receive a 
message telling you when the task is done. That's not really complicated, 
however: all the task has to do is send back its result through 
std.concurrency's "send" or through some other event dispatch mechanism 
instead of through a return statement. So I hope your tasks can accept 
"void" as a return type.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
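Something along these lines, perhaps (an untested sketch; `compute` is a hypothetical task body):

```d
import std.concurrency, std.parallelism;

// The task itself returns void; the result travels back as a message.
void compute(Tid owner, int x) {
    send(owner, x * x);
}

void main() {
    auto t = task!compute(thisTid, 7);
    taskPool.put(t);

    // No polling: block (or fold this into an event loop) until the
    // result message arrives.
    auto result = receiveOnly!int();
    assert(result == 49);
}
```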
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:34 AM, Michel Fortin wrote:
 One thing I'd want to be sure however is that you can use a parallel
 foreach from within a task. So if you have one or two tasks that could
 benefit from data parallelism it won't bring the whole system down. From
 the API I don't think it'll be a problem.

Right. This is completely do-able and I've done it before in practice.
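For example, a minimal sketch (the doubling loop is just a placeholder workload):

```d
import std.parallelism;

// A task body that itself uses data parallelism: the inner parallel
// foreach draws on the same pool of worker threads.
void bigJob(int[] data) {
    foreach (ref x; taskPool.parallel(data)) {
        x *= 2;
    }
}

void main() {
    auto data = new int[1_000];
    auto t = task!bigJob(data);
    t.executeInNewThread();  // keep the pool free for the inner foreach
    t.yieldForce();
}
```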
 I added a few (previously discussed) things to help this use case, at
 your request. BTW, one hint that I'd like to mention (not sure how to
 work it into the docs) is that, if you **don't** want to execute a
 task in the current thread if it's not started, but want its return
 value if it's done (as may be the case when you need threading for
 responsiveness, not throughput), you can call Task.done to check if
 it's done, and then Task.spinForce() to get the return value after you
 know it's already done.

 The problem with using Task.done that way is that it requires polling. It 
 might be appropriate in some cases, but in general you want to receive a 
 message telling you when the task is done. That's not really complicated 
 however: all the task has to do is send back its result through 
 std.concurrency's "send" or through some other event dispatch mechanism 
 instead of through a return statement.

Sounds like a good plan. In general, I've tried to keep the design of std.parallelism simple but composable. I have no intention of re-implementing any kind of message system when std.concurrency already does this well. If this is what you want to do, though, maybe you should just use std.concurrency. I'm not sure what std.parallelism would add.
 So I hope your tasks can accept "void" as a return type.

Yes, this works.
Mar 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 10:43:08 -0400, dsimcha <dsimcha yahoo.com> said:

 Sounds like a good plan.  In general, I've tried to keep the design of 
 std.parallelism simple but composable.  I have no intention of 
 re-implementing any kind of message system when std.concurrency already 
 does this well.  If this is what you want to do, though, maybe you 
 should just use std.concurrency.  I'm not sure what std.parallelism 
 would add.

What it adds is a task pool, where you have a fixed number of threads for 
an unlimited number of tasks. Spawning 10,000 threads because you have 
10,000 parallelizable tasks generally isn't a good idea. That said, 
perhaps std.concurrency's "spawn" should have the ability to create tasks 
instead of always creating new threads...

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Mar 24 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 11:00 AM, Michel Fortin wrote:

 What it adds is a task pool, where you have a fixed number of threads for 
 an unlimited number of tasks. Spawning 10,000 threads because you have 
 10,000 parallelizable tasks generally isn't a good idea. That said, 
 perhaps std.concurrency's "spawn" should have the ability to create tasks 
 instead of always creating new threads...

This is a great **long-term** todo. Please file an enhancement request in 
Bugzilla w.r.t. std.concurrency pooling threads so it doesn't get lost. 
TaskPool would probably be a good back end for this, but IMHO any use of 
TaskPool by std.concurrency should be regarded as an implementation 
detail.

On the other hand, this kind of inter-module cooperation requires lots of 
discussion about how it should be designed. It is also well beyond the 
scope of what std.parallelism was designed to do. This will take a long 
time to design and implement, have ripple effects into std.concurrency, 
etc. I don't think it needs to be implemented **now** or should hold up 
the vote and inclusion of std.parallelism in Phobos.
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:29 AM, Sönke Ludwig wrote:
 Hm depending on the way the pool is used, it might be a better default
 to have the number of threads equal the number of cpu cores. In my
 experience the control thread is mostly either waiting for tasks or
 processing messages and blocking in between so it rarely uses a full
 core, wasting the available computation time in this case.

It's funny, it seems like the task parallelism stuff is getting much more 
attention from the community than the data parallelism stuff. I hardly 
ever use the task parallelism and use mostly data parallelism. I'm 
inclined to leave this as-is because:

1.  It's definitely the right answer for data parallelism, and the task 
parallelism case is much less obvious.

2.  The main thread is utilized in the situation you describe. As I 
mentioned in a previous post, when a task that has not been started by a 
worker thread yet is forced, it is executed immediately in the thread that 
tried to force it, regardless of its position in the queue. There are two 
reasons for this:

    a.  It guarantees that there won't be any deadlocks where a task waits 
    for another task that's behind it in the queue.

    b.  If you're trying to force a task, then you obviously need the 
    results ASAP, so it's an ad-hoc form of prioritization.
 However, I'm not really sure if it is like this for the majority of all
 applications or if there are more cases where the control thread will
 continue to do computations in parallel. Maybe we could collect some
 opinions on this?

 On another note, I would like to see a rough description on what the
 default workUnitSize is depending on the size of the input. Otherwise it
 feels rather uncomfortable to use this version of parallel().

Hmm, this was there in the old documentation. Andrei recommended against 
documenting it for one of the cases because it might change. I can tell 
you that, right now, it's:

1.  Whatever workUnitSize would create TaskPool.size * 4 work units, if 
the range has a length.

2.  512 if the range doesn't have a length.
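Of course, passing the work unit size explicitly sidesteps the default entirely; a sketch, assuming the documented overload of parallel():

```d
import std.parallelism;

void main() {
    auto nums = new double[10_000];

    // Explicit work unit size: each work unit covers 100 consecutive
    // elements. Pick it large enough to amortize submission overhead,
    // small enough to keep all workers busy.
    foreach (i, ref x; taskPool.parallel(nums, 100)) {
        x = cast(double) i * i;
    }
}
```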
 Another small addition would be to state that the object returned by
 asyncBuf either is an InputRange or which useful methods it might have
 (some kind of progress counter could also be useful here).

I guess this could be a little clearer, but it's really just a plain 
vanilla input range that has a length iff the wrapped range has a length. 
There are no other public methods.
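In other words, on the consuming end it's used like any other input range; a sketch ("foo.txt" is a hypothetical input file):

```d
import std.algorithm, std.parallelism, std.stdio;

void main() {
    // byLine() reuses its buffer, so dup each line before it crosses
    // threads; asyncBuf then reads ahead in a worker thread while this
    // thread consumes the results.
    auto lines = map!"a.idup"(File("foo.txt").byLine());
    foreach (line; taskPool.asyncBuf(lines)) {
        // process line; the next buffer fills concurrently
    }
}
```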
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 AM, dsimcha wrote:
 One thing Andrei mentioned that I'm really not sure about is what to do with
 TaskPool.join().  My example for it is still terrible, because I think it's an
 evolutionary artifact.  It was useful in earlier designs that were never
 released and didn't have high-level data parallelism primitives.  I never use
 it, don't have any good use cases for it and would be inclined to remove it
 entirely.  Andrei seems to have some good use cases in mind, but he has not
 detailed any that I believe are reasonably implementable and I'm not sure
 whether they could be solved better using the higher-level data parallelism
 primitives.

If I may make a suggestion: just leave it aside. Instead of piling up 
features, just have the core available, and let actual needs show up in 
real life. An exception may be when a given potential feature would 
require major re-design; then it's better to anticipate with a design that 
could accommodate it. Otherwise, better to apply the famous phrase that 
design is finished when nothing is left to remove. Very true (even more in 
PL design, where it's about impossible to remove a feature once released, 
as an after-thought).

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling next sibling parent reply spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]

 The new docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

About the doc: very good. I could understand most of it, while knowing 
nearly nothing about parallelism prior to reading. 2 details:

* highlight key words only on first occurrence (bold online)
* wrong doc for Task.isPure (gets a copy of Task.args' doc)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 8:35 AM, spir wrote:
 On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]

 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

 About the doc: very good. I could understand most of it, while knowing 
 nearly nothing about parallelism prior to reading. 2 details:
 * highlight key words only on first occurrence (bold online)
 * wrong doc for Task.isPure (gets a copy of Task.args' doc)
 Denis

Task.isPure isn't even supposed to be public. I didn't notice that it had slipped into the docs. WTF, DDoc?
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 PM, bearophile wrote:
 I tried to keep it as consistent as possible with std.algorithm.

 And regarding consistency with std.algorithm, a more important factor is 
 that std.algorithm.map is lazy, while you have an eager map, and the lazy 
 version has lazy in the name, so the names are kind of the opposite of 
 std.algorithm's.

I agree there should be a consistent naming scheme here. Aligning with 
std.algorithm to consider lazy as the standard sounds good. (Better than 
the opposite anyway, maybe not as good as being explicit in both cases 
--but this is a lost fight ;-)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 06:04 PM, dsimcha wrote:
 Hmm, you do have a point there.  Two reasons:

 1.  map() was there first and at the time I didn't feel like renaming it.

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.  (Which is why I really
need
 help thinking of a better name than executeInNewThread().)

If we decide eager is (an implicit) standard, then we must rename 
(several?) std.algorithm funcs, I guess.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
 On 3/24/2011 8:35 AM, spir wrote:
 On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]
 
 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

 About the doc: very good. I could understand most of it, while knowing 
 nearly nothing about parallelism prior to reading. 2 details:
 * highlight key words only on first occurrence (bold online)
 * wrong doc for Task.isPure (gets a copy of Task.args' doc)
 Denis

 Task.isPure isn't even supposed to be public. I didn't notice that it had 
 slipped into the docs. WTF, DDoc?

It's not really DDoc's fault; it's dmd's fault. Currently, _all_ templated 
_anything_ is public. _Period_. It doesn't matter what access level you 
actually give it. Task is a template, isPure is a member of Task, and you 
put ddoc comments on it. dmd wrongly thinks that it's public, so ddoc sees 
it as public and puts it in the generated docs. You need to use /+ instead 
of /++ or /* instead of /** so that the comment is not an actual ddoc 
comment.

I think that there are actually multiple bug reports on the issue, but 
here's one: http://d.puremagic.com/issues/show_bug.cgi?id=2775

- Jonathan M Davis
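i.e., the workaround looks something like this (a sketch; the member body is a placeholder):

```d
// /++ ... +/ would be a ddoc comment and leak into the generated docs;
// the single-plus form below is an ordinary comment, so ddoc skips it.
/+
Returns true iff the function wrapped by this Task is pure.
+/
private bool isPure() { return false; /* placeholder body */ }
```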
Mar 24 2011