
digitalmars.D - std.parallelism changes done

reply dsimcha <dsimcha yahoo.com> writes:
I've finished all of the changes that were discussed in the initial 
std.parallelism review.  I know I said I needed more time than this, but 
honestly, I hit a best-case scenario.  I had more time than I 
anticipated to work on it *and* the changes (especially fixing the 
exception handling issue) took less time than I anticipated.

I would say that the documentation has now improved "radically", as 
Andrei suggested it needed to.  I want to thank Andrei and Michel Fortin 
for their extremely useful suggestions, and apologize for getting 
defensive at times.  I felt compelled to defend my initial design in 
cases where I shouldn't have, due to a combination of time pressure and 
a misunderstanding of the review process.  Andrei's suggestions in 
particular led to a tremendous improvement of the documentation and, to 
a lesser extent, an improvement of the API.

In addition to improving the documentation, I added 
Task.executeInNewThread() to allow Task to be useful without a TaskPool. 
  (Should this have a less verbose name?)  I also fixed some exception 
handling bugs, implemented exception chaining for exceptions thrown 
concurrently, and fixed some silliness with respect to seed values in 
reduce().

One thing Andrei mentioned that I'm really not sure about is what to do 
with TaskPool.join().  My example for it is still terrible, because I 
think it's an evolutionary artifact.  It was useful in earlier designs 
that were never released and didn't have high-level data parallelism 
primitives.  I never use it, don't have any good use cases for it and 
would be inclined to remove it entirely.  Andrei seems to have some good 
use cases in mind, but he has not detailed any that I believe are 
reasonably implementable and I'm not sure whether they could be solved 
better using the higher-level data parallelism primitives.

As far as the vote, I know I asked for more time and to allow another 
module ahead of me, and of course I'll honor that.  I'd like to un-git 
stash this review as soon as isemail is done, though.  I anticipate a 
decent amount more suggestions now, given how major the changes have 
been, but most will probably be minor things like documentation 
clarifications and renaming stuff.

The new docs are at 
http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .
Mar 23 2011
next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 24 Mar 2011 05:32:26 +0100, dsimcha <dsimcha yahoo.com> wrote:

 In addition to improving the documentation, I added  
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.  
   (Should this have a less verbose name?)

spawnAndRun? -- Simen
Mar 23 2011
prev sibling next sibling parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks.

Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Sönke
Mar 24 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 03:00:01 -0400, Sönke Ludwig 
<ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
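A rough sketch of what I mean, using the documented TaskPool API. Whether (and how) a TaskPool exposes the OS priority of its worker threads is an open question at this point, so that part is left as a comment:

```d
import std.parallelism;

void urgentWork()     { /* latency-sensitive job */ }
void backgroundWork() { /* bulk job */ }

void main()
{
    // One pool per priority class.
    auto highPool = new TaskPool(2);
    auto lowPool  = new TaskPool(2);

    // Hypothetical: lower the OS priority of lowPool's workers.
    // lowPool.priority = Thread.PRIORITY_MIN;

    auto a = task!urgentWork();
    highPool.put(a);            // goes to the high-priority pool

    auto b = task!backgroundWork();
    lowPool.put(b);             // goes to the low-priority pool

    a.yieldForce;
    b.yieldForce;

    highPool.finish();
    lowPool.finish();
}
```

The point is simply that priority selection becomes a matter of which pool you put the task into.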
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 9:15 PM, Sönke Ludwig wrote:
 Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a
 TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)

My main concern here is that these kinds of use cases are getting far beyond the scope of std.parallelism. By definition (at least as I understand it) parallelism is focused on throughput, not responsiveness/latency, and is about utilizing as many execution resources as possible for useful work. (This article, originally posted here by Andrei, describes the distinction nicely: http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/)

If you're implementing parallelism, then it is correct to use only one thread on a single-core machine (std.parallelism does this by default), since one thread will utilize all execution resources. If you're implementing concurrency, this is not correct. Concurrency is used to implement parallelism, but that's different from saying concurrency _is_ parallelism. When you start talking about application responsiveness, prioritization, etc., you're getting beyond _parallelism_ and into general-case concurrency.

I have neither the expertise nor the desire to build a general-case concurrency library. D already has a general-case concurrency library (std.concurrency), and that might be a better place to implement suggestions dealing with general-case concurrency. std.parallelism was designed from the ground up to focus on parallelism, not general-case concurrency. I don't mind implementing features useful to general-case concurrency if they're trivial in both interface and implementation, but I'd rather not do any that require major changes to the interface or implementation.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 02:51, schrieb dsimcha:
 On 3/24/2011 9:15 PM, Sönke Ludwig wrote:
 Am 24.03.2011 13:03, schrieb Michel Fortin:
 On 2011-03-24 03:00:01 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a
 TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses a priority per task to control which tasks can overtake others. A special priority is out-of-band (the name may be debatable), which guarantees that the task will run in its own thread so it can safely wait for other tasks. However, the threads that process OOB tasks are also cached in the thread pool and reused for new OOB tasks; only if the number of parallel OOB tasks exceeds a certain limit are new threads created and destroyed. This can save quite a bit of time for those tasks. Both kinds of priority have been very useful, and I would suggest putting at least the executeInNewThread() method into ThreadPool so that such an optimization remains possible later. The task priority feature in general may only be necessary for complex applications with user interaction, where you have to satisfy certain interactivity needs. I wouldn't be too sad if this is not implemented now, but it would be good to keep it in mind as a possible improvement for later.

Do you think having multiple task pools each with a different thread priority would do the trick? Simply put tasks in the task pool with the right priority... I had a similar use case in mind and this is what I proposed in the previous discussion.

Yes, that may actually be enough, because although you would normally want to avoid the overhead of additional threads running in parallel, in the scenarios I have in mind you always have unrelated things in different priority classes. And for these different tasks, running in parallel should only be the exception (otherwise using priorities would be strange in the first place). The only thing that is a bit of a pity is that you now have to manage multiple thread pools instead of simply using the one singleton instance in the whole application. And this could really cause some headaches if you have a lot of different types of workload that may all have different priorities, but may also share the same one - you would somehow have to share several thread pools across those types of workload. (Type of workload = copying files, computing preview images, computing some physics calculations, etc.)

My main concern here is that these kinds of use cases are getting far beyond the scope of std.parallelism. By definition (at least as I understand it) parallelism is focused on throughput, not responsiveness/latency and is about utilizing as many execution resources as possible for useful work. (This article, originally posted here by Andrei, describes the distinction nicely: http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/)

If you're implementing parallelism, then it is correct to only use one thread on a single-core machine (std.parallelism does this by default), since one thread will utilize all execution resources. If you're implementing concurrency, this is not correct. Concurrency is used to implement parallelism, but that's different from saying concurrency _is_ parallelism. When you start talking about application responsiveness, prioritization, etc., you're getting beyond _parallelism_ and into general-case concurrency.

I have neither the expertise nor the desire to build a general case concurrency library. D already has a general case concurrency library (std.concurrency), and this might be a better place to implement suggestions dealing with general-case concurrency. std.parallelism was designed from the ground up to focus on parallelism, not general-case concurrency. I don't mind implementing features useful to general-case concurrency if they're trivial in both interface and implementation, but I'd rather not do any that require major changes to the interface or implementation.

Well, what can I say.. things can become more complex, and you cannot always say this is parallelism and this is concurrency or something. It's just nice when the library does not get in the way when you are in a situation where e.g. throughput and responsiveness or whatever else matter. Sometimes it can be a small change that can make or break the deal. It's sad that I'm not able to get my points across.. but I'm too tired right now to go down and discuss the parallelism/concurrency topic.
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:31 PM, Sönke Ludwig wrote:
 Well, what can I say.. things can become more complex, and you cannot
 always say this is parallelism and this is concurrency or something.
 It's just nice when the library does not get in the way when you are in
 a situation where e.g. throughput and responsiveness or whatever else
 matter. Sometimes it can be a small change that can make or break the
 deal.

Agreed. I'm not trying to be pedantic here, and I'm certainly willing to make **small** changes even if they stretch the scope somewhat into general concurrency. It's just that I don't want to make big changes, especially if they will make the interface more complex, reduce efficiency and/or lock me into certain implementations. (For example, using a priority queue precludes supporting work stealing later without breaking the priority feature.)
Mar 24 2011
parent Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 04:32, schrieb dsimcha:
 On 3/24/2011 10:31 PM, Sönke Ludwig wrote:
 Well, what can I say.. things can become more complex, and you cannot
 always say this is parallelism and this is concurrency or something.
 It's just nice when the library does not get in the way when you are in
 a situation where e.g. throughput and responsiveness or whatever else
 matter. Sometimes it can be a small change that can make or break the
 deal.

Agreed. I'm not trying to be pedantic here, and I'm certainly willing to make **small** changes even if they stretch the scope somewhat into general concurrency. It's just that I don't want to make big changes, especially if they will make the interface more complex, reduce efficiency and/or lock me into certain implementations. (For example, using a priority queue precludes supporting work stealing later without breaking the priority feature.)

I agree, the priorities are something that can be important in some cases, but most of the time they are not really _necessary_ in that sense. And maybe in most of the cases where someone would like to have them, Michel's suggestion to create a second thread pool with a different priority may be just fine. The more important aspect was the OOB part with the cached threads for something like executeInNewThread. And even that is not a real deal-breaker.
Mar 25 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:00 AM, Sönke Ludwig wrote:
 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses priority per task to control which tasks can overtake others. A special priority is out-of-bands (the name my be debatable), which will guarantee that the task will run in its own thread so it can safely wait for other tasks.

This may not be an issue in the std.parallelism design. A TaskPool task can safely wait on other tasks. What prevents this from causing a deadlock is that calling yieldForce, spinForce, or waitForce on a task that has not started executing yet will execute the task immediately in the thread that tried to force it, regardless of where it is in the queue.

 However, those threads that process OOB tasks are also cached in
 the thread pool and reused for new OOB tasks. Only if the number of
 parallel OOB tasks goes over a specific number, new threads will be
 created and destroyed. This can safe quite a bit of time for those tasks.

Unfortunately this is not implementable without a massive overhaul of TaskPool. There are some baked in assumptions that the number of worker threads in a pool will not change after the pool's creation. Furthermore, I'm not sure how worker-local storage would be efficiently implementable without this restriction.
 Both kinds of priority have been very useful and I would suggest to put
 at least the executeInNewThread() method into ThreadPool to be later
 able to make such an optimization.

Can you elaborate on this? The whole point of executeInNewThread() was supposed to be that a TaskPool is not needed for simple cases.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 24.03.2011 14:25, schrieb dsimcha:
 On 3/24/2011 3:00 AM, Sönke Ludwig wrote:
 Am 24.03.2011 05:32, schrieb dsimcha:
 In addition to improving the documentation, I added
 Task.executeInNewThread() to allow Task to be useful without a TaskPool.
 (Should this have a less verbose name?)

The threading system I designed for the company I work for uses priority per task to control which tasks can overtake others. A special priority is out-of-bands (the name my be debatable), which will guarantee that the task will run in its own thread so it can safely wait for other tasks.

This may not be an issue in the std.parallelism design. A TaskPool task can safely wait on other tasks. What prevents this from causing a deadlock is that calling yieldForce, spinForce, or waitForce on a task that has not started executing yet will execute the task immediately in the thread that tried to force it, regardless of where it is in the queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.
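To make the safe case concrete, a small sketch using the names from the docs (assuming yieldForce behaves as described, i.e. it runs a not-yet-started task inline in the forcing thread):

```d
import std.parallelism;

int produce() { return 21; }

void main()
{
    auto t = task!produce();
    taskPool.put(t);

    // Even if `t` has not been picked up by a worker yet, yieldForce
    // executes it inline in this thread instead of blocking on it,
    // so a task can safely wait on another task this way.
    int x = t.yieldForce;
    assert(x == 21);
}
```

The problematic case is precisely when the waiting goes through a mutex or condition variable instead of one of the force methods, because then the pool cannot run the dependency inline.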
 However, those threads that process OOB tasks are also cached in
 the thread pool and reused for new OOB tasks. Only if the number of
 parallel OOB tasks goes over a specific number, new threads will be
 created and destroyed. This can safe quite a bit of time for those tasks.

Unfortunately this is not implementable without a massive overhaul of TaskPool. There are some baked in assumptions that the number of worker threads in a pool will not change after the pool's creation. Furthermore, I'm not sure how worker-local storage would be efficiently implementable without this restriction.
 Both kinds of priority have been very useful and I would suggest to put
 at least the executeInNewThread() method into ThreadPool to be later
 able to make such an optimization.

Can you elaborate on this? The whole point of executeInNewThread() was supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

But what I wanted to say is: even if it may be difficult to implement such thread caching now, putting the means to execute a Task in its own thread into the ThreadPool now allows for such an optimization later (it could even coexist with Task.executeInNewThread()).

Btw., another more advanced thing that is interesting is to use IO completion ports to dynamically increase the number of threads in cases of contention or IO waits. But this is of course a whole different story, and I certainly would not expect it to get in at this stage, or maybe ever. Just wanted to throw that in.
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 9:05 PM, Sönke Ludwig wrote:
 This may not be an issue in the std.parallelism design. A TaskPool task
 can safely wait on other tasks. What prevents this from causing a
 deadlock is that calling yieldForce, spinForce, or waitForce on a task
 that has not started executing yet will execute the task immediately in
 the thread that tried to force it, regardless of where it is in the
 queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.
 Can you elaborate on this? The whole point of executeInNewThread() was
 supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

executeInNewThread() is useful where you only have a few Tasks, rather than needing to map a large number of Tasks onto a small number of threads. Using core.thread here doesn't cut it, because core.thread doesn't automate passing the function's arguments to the new thread. Also, I figured out that some of the conditions for safe tasks can be relaxed a little if they run in a dedicated thread instead of a TaskPool.
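A rough side-by-side sketch of the difference (the delegate capture on the core.thread side is just one way to do it, and `download` is a made-up example function):

```d
import core.thread;
import std.parallelism;

void download(string url) { /* fetch url ... */ }

void main()
{
    // With core.thread you bundle the argument into a delegate yourself:
    auto t1 = new Thread({ download("http://example.com/a"); });
    t1.start();
    t1.join();

    // Task bundles the function and its arguments for you, and
    // yieldForce hands any thrown exception back to this thread:
    auto t2 = task!download("http://example.com/b");
    t2.executeInNewThread();
    t2.yieldForce;
}
```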
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.
Mar 24 2011
parent reply Sönke Ludwig <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 02:17, schrieb dsimcha:
 On 3/24/2011 9:05 PM, Sönke Ludwig wrote:
 This may not be an issue in the std.parallelism design. A TaskPool task
 can safely wait on other tasks. What prevents this from causing a
 deadlock is that calling yieldForce, spinForce, or waitForce on a task
 that has not started executing yet will execute the task immediately in
 the thread that tried to force it, regardless of where it is in the
 queue.

Indeed, this pattern solves the problem of waiting for the completion of a specific task. It also avoids the huge deadlock potential that a general yield() not taking a task would have. However, it does not solve the general problem of one task waiting for another, which could happen through a condition variable or just a mutex used in the middle of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion so it can spawn new tasks for which the results are now available.

The first thing here is that you do not want to do the waitForce() kind of waiting in the coordinator task, because this might keep the coordinator busy with an expensive task while it could already be spawning new tasks, since in the meantime some other tasks may have finished. However, if you wait on a condition variable instead (which is signaled after each finished task), and if you can have multiple computations of this kind running in parallel, you can immediately run into the situation that the thread pool is crowded with coordinator tasks that are all waiting on their condition variables, which will never be signaled because no worker tasks can be executed.

This is only one example; basically this problem can arise in all cases where one task depends on another task through some form of waiting that does not execute the dependency the way waitForce() does. I also have a completely different example involving the main thread (doing GPU work) which is much more difficult, but I don't think I can make that clear with only text.
 Can you elaborate on this? The whole point of executeInNewThread() was
 supposed to be that a TaskPool is not needed for simple cases.

Well, OK - if the only purpose is to provide a shortcut for (new Thread(&fun)).start(), then my suggestion may not make too much sense. However, I have some doubts that the advantage of having this shortcut justifies the functional duplication of core.thread. Is there some specific use case where you would use a Task object but not a ThreadPool?

executeInNewThread() is useful where you only have a few Tasks, rather than needing to map a large number of Tasks onto a small number of threads. Using core.thread here doesn't cut it, because core.thread doesn't automate passing the function's arguments to the new thread. Also, I figured out that some of the conditions for safe tasks can be relaxed a little if they run in a dedicated thread instead of a TaskPool.

I see.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little clearer what I mean. The problem is just that the system I'm talking about is quite complex, and it's not easy to find good and simple examples in that system. The problems of course arise only in the most complex paths of execution.

What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because the thread pool is not created. In the latter case, caching some threads to be reused for an executeInOwnThread()-style method should lead to better performance in almost any case where thread creation overhead is relevant.

.. OK, my writing skills are degrading rapidly as I fight against my closing eyes.. will go to sleep now
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:21 PM, Sönke Ludwig wrote:
 Indeed, this pattern solves the problem of waiting for the completion
 of a specific task. It also avoids the huge deadlock potential that a
 general yield() not taking a task would have. However, it does not
 solve the general problem of one task waiting for another, which could
 happen through a condition variable or just a mutex used in the middle
 of the task's execution.

Can you elaborate and/or provide an example of the "general" problem? I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion, spawning new tasks as the results they depend on become available.

As I've said before in related discussions, you are _probably_ better off using one of the high level primitives instead of using tasks directly in these cases. If not, I'd prefer to improve the higher level primitives and/or create new ones if possible. (Feel free to suggest one if you can think of it.) Tasks are, IMHO, too low level for anything except basic future/promise parallelism and implementing higher level primitives. Incidentally the situation you describe (a coordinator task creating lots of worker tasks) is exactly how amap(), reduce() and parallel foreach work under the hood. This complexity is completely encapsulated, though.
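To make the encapsulation point concrete, a sketch of using the eager parallel map (using amap as its name; the naming was still being settled elsewhere in this thread):

```d
import std.math : sqrt;
import std.parallelism : taskPool;

void main()
{
    // One call; the coordinator/worker-task pattern described above is
    // hidden inside amap(), which submits work-unit tasks to the pool
    // internally and joins them before returning.
    auto roots = taskPool.amap!sqrt([1.0, 4.0, 9.0, 16.0]);
    assert(roots == [1.0, 2.0, 3.0, 4.0]);
}
```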
 First thing here is that you do not want to do the waitForce() kind of
 waiting in the coordinator task because this might cause the coordinator
 to be busy with an expensive task while it could already spawn new tasks
 because maybe in the meantime some other tasks have already finished.

I assume you mean yieldForce().
 However, if you wait for a condition variable instead (which is fired
 after each finished task) and if you can have multiple computations of
 this kind running in parallel, you can immediately run into the
 situation that the thread pool is crowded with coordinator tasks that
 are all waiting for their condition variables which will never be
 triggered because no worker tasks can be executed.

I assume you're talking about a condition variable other than the one yieldForce() uses. As mentioned previously, in the specific case of yieldForce() this is a solved problem. In the general case I can see the problem.
 This is only one example and basically this problem can arise in all
 cases where one task depends on another task by some form of waiting
 that will not execute the dependency like waitForce() does.

Hmm, ok, I definitely understand the problem now.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little more clear what I mean. The problem is just that the system I'm talking about is quite complex and it's not easy to find good and simple examples in it. The problems of course arise only in the most complex paths of execution.. What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because no thread pool is created. In the latter case, caching some threads to be reused by an executeInOwnThread() method should lead to better performance in almost any case where thread creation overhead is relevant.

Ok, now I'm starting to understand this. Please correct me (once you've gotten a good night's sleep and can think again) wherever this is wrong:

1. As is currently the case, executeInNewThread() is _guaranteed_ to start the task immediately. There is never a queue involved.

2. Unlike the current implementation, executeInNewThread() may use a cached thread. It will **NOT**, however, put the task on a queue or otherwise delay its execution. If no cached thread is available, it will create a new one and possibly destroy it when the task is done.

Thanks for this suggestion. Now that I (think I) understand it, it makes sense in principle. The devil may be in the details, though.

1. How many threads should be cached? I guess this could just be configurable with some reasonable default.

2. Should the cache be lazily or eagerly created? I'd assume lazily.

3. Where would these threads be stored? I think they probably belong in some kind of thread-safe global data structure, **NOT** in a TaskPool instance.

4. How would we explain to people what the cache is good for and how to use it? The fact that you're proposing it and even you find this difficult to convey makes me skeptical that this feature is worth the weight it adds to the API. Maybe you'll find it easier once you get some sleep. (I understand the problem it solves at an abstract level but have never encountered a concrete use case. It also took me a long time to understand it.)

5. It would break some relaxations of what safe tasks can do when started via executeInNewThread().

6. This whole proposal might fit better in std.concurrency, by using a thread cache for spawn().
Mar 24 2011
parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 05:14, schrieb dsimcha:
 On 3/24/2011 10:21 PM, Sönke Ludwig wrote:
 Can you elaborate and/or provide an example of the "general" problem?
 I'm not quite sure what you're getting at.

I have one very specific scenario that I can only sketch. Suppose you have some kind of complex computation going on in the ThreadPool. This computation is done by a large set of tasks where each task depends on the result of one or more other tasks. One task is responsible for coordinating the work - it spawns tasks and waits for their completion, spawning new tasks as the results they depend on become available.

As I've said before in related discussions, you are _probably_ better off using one of the high level primitives instead of using tasks directly in these cases. If not, I'd prefer to improve the higher level primitives and/or create new ones if possible. (Feel free to suggest one if you can think of it.) Tasks are, IMHO, too low level for anything except basic future/promise parallelism and implementing higher level primitives. Incidentally the situation you describe (a coordinator task creating lots of worker tasks) is exactly how amap(), reduce() and parallel foreach work under the hood. This complexity is completely encapsulated, though.

I would certainly agree that this belongs in a higher level structure. This structure would basically get a set of special tasks, where each of those tasks has a list of all the tasks it depends on. All tasks would then be executed in parallel on a thread pool in an order that satisfies their dependencies - possibly with some form of cost function that controls which task should come first if multiple orders are possible.

One problem here, for example, is that in the system I have, I need to execute several tasks in the main thread by sending a message to it (the main thread executes window messages in a loop). Specifically, this is for tasks that use OpenGL or a similar API that is bound to a single thread - and the main thread is the most efficient one to use because it already has an OpenGL context. The important thing is to either support such things or to make the design general enough that the user can add them from the outside. Otherwise, if you really need such things, the only option is to use a completely custom thread pool, and that means no parallel foreach, map, reduce or whatever might be added later.
 First thing here is that you do not want to do the waitForce() kind of
 waiting in the coordinator task because this might cause the coordinator
 to be busy with an expensive task while it could already spawn new tasks
 because maybe in the meantime some other tasks have already finished.

I assume you mean yieldForce().

Yes, sorry, got the names mixed up.
 However, if you wait for a condition variable instead (which is fired
 after each finished task) and if you can have multiple computations of
 this kind running in parallel, you can immediately run into the
 situation that the thread pool is crowded with coordinator tasks that
 are all waiting for their condition variables which will never be
 triggered because no worker tasks can be executed.

I assume you're talking about a condition variable other than the one yieldForce() uses. As mentioned previously, in the specific case of yieldForce() this is a solved problem. In the general case I can see the problem.

Yes, just the general problem with other condition variables.
 This is only one example and basically this problem can arise in all
 cases where one task depends on another task by some form of waiting
 that will not execute the dependency like waitForce() does.

Hmm, ok, I definitely understand the problem now.
 But what I wanted to say is, even if it may be difficult to implement
 such thread caching now, putting means to execute a Task in its own
 thread now into the ThreadPool allows for such an optimization later
 (it
 could even exist while still keeping Task.executeInNewThread()).

I can't really comment because I still don't understand this very well.

I hope I could make it a little more clear what I mean. The problem is just that the system I'm talking about is quite complex and it's not easy to find good and simple examples in it. The problems of course arise only in the most complex paths of execution.. What I'm not sure about is whether executeInNewThread() is supposed to be useful just because it is sometimes nice to have the fine-grained parallelism of the OS scheduler as opposed to task granularity, or whether the advantage is supposed to be the efficiency gained because no thread pool is created. In the latter case, caching some threads to be reused by an executeInOwnThread() method should lead to better performance in almost any case where thread creation overhead is relevant.

Ok, now I'm starting to understand this. Please correct me (once you've gotten a good night's sleep and can think again) wherever this is wrong:

1. As is currently the case, executeInNewThread() is _guaranteed_ to start the task immediately. There is never a queue involved.

2. Unlike the current implementation, executeInNewThread() may use a cached thread. It will **NOT**, however, put the task on a queue or otherwise delay its execution. If no cached thread is available, it will create a new one and possibly destroy it when the task is done.

Exactly.
 Thanks for this suggestion. Now that I (think I) understand it, it makes
 sense in principle. The devil may be in the details, though.

 1. How many threads should be cached? I guess this could just be
 configurable with some reasonable default.

A configurable minimum number of threads sounds reasonable. The default could probably be a fixed small number like 1 or 2.
 2. Should the cache be lazily or eagerly created? I'd assume lazily.

Lazy sounds good.
 3. Where would these threads be stored? I think they probably belong in
 some kind of thread-safe global data structure, **NOT** in a TaskPool
 instance.

That's a good question.. ThreadPool would be nice because it is the class you may already be dragging an instance of through your code anyway. Global would certainly work.
 4. How would we explain to people what the cache is good for and how to
 use it? The fact that you're proposing it and even you find this
 difficult to convey makes me skeptical that this feature is worth the
 weight it adds to the API. Maybe you'll find it easier once you get some
 sleep. (I understand the problem it solves at an abstract level but have
 never encountered a concrete use case. It also took me a long time to
 understand it.)

I would basically just say it's a faster way than creating a new thread each time you start a task. Use it whenever you need a task to run outside of the thread pool threads - candidates are tasks that wait a lot, either because of I/O or because of waiting primitives other than the ones present in ThreadPool (message queues, condition variables).

But please don't make the mistake of dismissing the problem because it is complex. Being complex and maybe rare does not mean it cannot be important. It's like a bug that deletes your data, but only in very rare and complex use cases of the application. You would not want to ignore that just for those reasons. Also, I'm not sure if the primitives of std.concurrency are allowed within a Task; maybe not. But if they are, it would be really easy to construct a higher level example without condition variables and the like.
 5. It would break some relaxations of what  safe tasks can do when
 started via executeInNewThread().

You mean because of TLS that is not reinitialized each time? I have to admit that I can't really gauge the impact of this.
 6. This whole proposal might fit better in std.concurrency, by using a
 thread cache for spawn().

But isn't the previous problem (5.) even more relevant in std.concurrency? Putting it near ThreadPool could be a good idea because it still is some sort of thread pool in the abstract sense. It could also be something that std.concurrency uses for its spawn(). Anyway, I would be happy if a place were allocated for this wherever it fits. If it is std.concurrency, that's fine as long as std.concurrency and std.parallelism play well together.

One problem with std.concurrency is that it also does not really work well when you need to wait for primitives other than a message. Having something like WaitForMultipleObjects is critical in many cases, but that's another topic.

Having said all this, I just want to make sure that you don't get this wrong. I certainly do not want to push in complex changes for no reason - and no complex changes at all, for that matter. And in this case I see how it is _functionally_ independent from the ThreadPool itself. My problem here is just that there is an executeInNewThread() function that really should not be used for _many_ tasks, and maybe in most other cases it would be cleaner to use spawn() - it may still have its place if you count in the safety implications, but I would also like to see a function supporting the same threading guarantees but suitable for many tasks, if only to avoid bad usage patterns of executeInNewThread(). However, it definitely is in the gray area between std.concurrency and std.parallelism.
Mar 25 2011
parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 10:33, schrieb Sönke Ludwig:
yadda-yadda

Apart from all this - I just want to make this a known problem; what you (or maybe Andrei for std.concurrency) decide is up to you, and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++). And I would like to say that I like this module in general very much. It also fits almost 99% of the design I have in my libraries ;)
Mar 25 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/25/2011 5:42 AM, Sönke Ludwig wrote:
 Am 25.03.2011 10:33, schrieb Sönke Ludwig:
 yadda-yadda

Apart from all this - I just want to make this a known problem, what you (or maybe Andrei for std.concurrency) decide is up to you and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++).

You've done a good job of explaining a complex problem. I appreciate it.

I think we should make this a long-term todo, like Michael Fortin's suggestion that std.concurrency should be able to create tasks or std.parallelism should handle message passing. You should probably file a Bugzilla enhancement request saying Phobos should support thread caching, so that this proposal doesn't get lost.

Your proposal is feasible and solves an important problem. On the other hand, the design and implementation details are still vague and would require substantial discussion. The feature is tangential to the purpose of std.parallelism and clearly crosses the line into general case concurrency.

Bottom line: This proposal should not hold up the vote and adoption of std.parallelism, but it should not be discarded permanently either.
Mar 25 2011
parent =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Am 25.03.2011 15:40, schrieb dsimcha:
 On 3/25/2011 5:42 AM, Sönke Ludwig wrote:
 Am 25.03.2011 10:33, schrieb Sönke Ludwig:
 yadda-yadda

Apart from all this - I just want to make this a known problem, what you (or maybe Andrei for std.concurrency) decide is up to you and I'm fine with any outcome for my personal stuff because I do not have such a complex system apart from work (C++).

You've done a good job of explaining a complex problem. I appreciate it. I think we should make this a long-term todo, like Michael Fortin's suggestion that std.concurrency should be able to create tasks or std.parallelism should handle message passing. You should probably file a Bugzilla enhancement request saying Phobos should support thread caching, so that this proposal doesn't get lost. Your proposal is feasible and solves an important problem. On the other hand, the design and implementation details are still vague and would require substantial discussion. The feature is tangential to the purpose of std.parallelism and clearly crosses the line into general case concurrency. Bottom line: This proposal should not hold up the vote and adoption of std.parallelism, but it should not be discarded permanently either.

Agreed
Mar 26 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 and apologize for getting defensive at times.

It happens to mammals, don't worry.
 The new docs are at 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

    real getTerm(int i) {
        immutable x = ( i - 0.5 ) * delta;
        return delta / ( 1.0 + x * x ) ;
    }
    immutable pi = 4.0 * taskPool.reduce!"a + b"(
        std.algorithm.map!getTerm(iota(n))
    );

For the examples I suggest using q{a + b} instead of "a + b". When D gains a good implementation of conditional purity, I think taskPool.reduce and taskPool.map may accept only pure functions to map on and pure iterables to work on.
template map(functions...)
    Eager parallel map.

So in Phobos you have std.algorithm.map, which is lazy, and taskPool.map, which is eager. I think it's not so good to have two functions with the same name but subtly different purposes. I have suggested that Andrei add a std.algorithm.amap, that is array(map()): http://d.puremagic.com/issues/show_bug.cgi?id=5756 So I suggest renaming "taskPool.map" to "taskPool.amap" and "taskPool.lazyMap" to "taskPool.map".
    auto squareRoots = new float[numbers.length];
    taskPool.map!sqrt(numbers, squareRoots);

Currently, to do the same thing with the normal std.algorithm.map you need something like:

    auto squareRoots = new double[numbers.length];
    copy(map!sqrt(numbers), squareRoots);
 A semi-lazy parallel map that can be used for pipelining.

The idea of such vectorized laziness may give a performance enhancement even for the serial std.algorithm.map.

In the module documentation I'd like to see a graph that shows how the parallel map/reduce/foreach scale as the number of cores goes from 1 to 2 to 4 to 8 (or more) :-)

Thank you for this very nice Phobos module.

Bye, bearophile
Mar 24 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:23 AM, bearophile wrote:
 dsimcha:

 and apologize for getting defensive at times.

It happens to mammals, don't worry.
 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

     real getTerm(int i) {
         immutable x = ( i - 0.5 ) * delta;
         return delta / ( 1.0 + x * x ) ;
     }
     immutable pi = 4.0 * taskPool.reduce!"a + b"(
         std.algorithm.map!getTerm(iota(n))
     );

For the examples I suggest to use q{a + b} instead of "a + b".

I tried to keep it as consistent as possible with std.algorithm.
 When D will gain a good implementation of conditional purity, I think
taskPool.reduce and taskPool.map may accept only pure functions to map on and
pure iterables to work on.

Eventually, maybe. Definitely not now, though, because in practice it would severely affect usability.
 In the module documentation I'd like to see a graph that shows how the
parallel map/reduce/foreach scale as the number of cores goes to 1 to 2 to 4 to
8 (or more) :-)

Unfortunately I don't have access to this kind of hardware except at work.
Mar 24 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 I tried to keep it as consistent as possible with std.algorithm.

OK. Then the question is why std.algorithm uses normal strings instead of q{} ones. And regarding consistency with std.algorithm, a more important factor is that std.algorithm.map is lazy, while you have an eager map, and the lazy version has lazy in the name, so the names are kind of the opposite of std.algorithm.
 Unfortunately I don't have access to this kind of hardware except at work.

Maybe some other person in this group has access to an 8 core CPU and is willing to take some numbers, to create a few little graphs. Bye, bearophile
Mar 24 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 I tried to keep it as consistent as possible with std.algorithm.


I personally think "" strings look nicer for simple cases like "a + b". At any rate, this is a bikeshed issue.
 And regarding consistency with std.algorithm, a more important factor is that
 std.algorithm.map is lazy, while you have an eager map, and the lazy version
 has lazy in the name, so the names are kind of opposite of std.algorithm.

Hmm, you do have a point there. Two reasons:

1. map() was there first and at the time I didn't feel like renaming it.

2. I think map() is much more frequently useful than lazyMap(), and name verbosity should be inversely proportional to usage frequency. (Which is why I really need help thinking of a better name than executeInNewThread().)

I'm not sure whether I want to change this. Input from the rest of the community would be useful, as long as it doesn't end up going off onto some wild tangent and starting Bikeshed War III.
 Unfortunately I don't have access to this kind of hardware except at work.


I can kinda see Andrei's point about adding a few benchmarks just to give a rough idea of the level of fine-grainedness this library is capable of and how to get it, but these requests are exactly what I was afraid of when I put them in. I really don't think rigorous, detailed benchmarks have any place in API documentation. The purpose of API documentation is for people to figure out what the library does and how to use it, not to sell a product or precisely quantify its performance.

Next people will be asking for standard deviations, error bars, P-values, confidence intervals, regression models of performance vs. cores and basically things that dstats is good for, or for me to port their favorite non-trivial benchmark from their favorite non-D language.
Mar 24 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.

I agree, but I have suggested to replace "map" => "amap" and "lazyMap" => "map" (and to add a fully eager amap to std.algorithm). The increase of verbosity is not big.
 (Which is why I really need
 help thinking of a better name than executeInNewThread().)

Andrei is good at this :-)
 I really don't think rigorous, detailed benchmarks have any place in API
 documentation.

OK. I will be happy to see a few graphs outside the API doc :-) Bye, bearophile
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from spir (denis.spir gmail.com)'s article
 On 03/24/2011 06:04 PM, dsimcha wrote:
 Hmm, you do have a point there.  Two reasons:

 1.  map() was there first and at the time I didn't feel like renaming it.

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.  (Which is why I really
need
 help thinking of a better name than executeInNewThread().)

std.algo funcs, I guess. Denis

Ok, I'm convinced that the eagerness of map() may throw people off. Unless anyone objects, I'm going with map() for the lazy version and amap() for the eager version. This is both consistent and terse.
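Under the naming settled on here, a quick illustration (a sketch assuming the renamed API; the exact default buffer sizes are unspecified assumptions):

```d
import std.math : sqrt;
import std.parallelism : taskPool;

void main()
{
    auto numbers = [1.0, 4.0, 9.0];

    // Eager: amap allocates and returns a new array, like array(map(...)).
    double[] eager = taskPool.amap!sqrt(numbers);

    // (Semi-)lazy: map returns a range evaluated in buffered chunks on demand,
    // consistent with std.algorithm.map being the lazy one.
    auto lazyRange = taskPool.map!sqrt(numbers);
    foreach (r; lazyRange)
    {
        // consumed lazily, chunk by chunk
    }
}
```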
Mar 24 2011
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Russel Winder (russel russel.org.uk)'s article
 Is there actually any point in having a lazy parallel map?

It's for pipelining. Please read the API documentation for details about how it works. It's actually only semi-lazy.
 Unfortunately I don't have access to this kind of hardware except at work.

 Maybe some other person in this group has access to an 8 core CPU and
 is willing to take some numbers, to create a few little graphs.

 I have such hardware, if someone sends me a program to execute I can run
 it a few times to get a dataset.

Just try it with the examples in the API docs if you want.
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 1:34 PM, spir wrote:
 On 03/24/2011 05:32 PM, bearophile wrote:
 I tried to keep it as consistent as possible with std.algorithm.

 OK. Then the question is why std.algorithm uses normal strings instead
 of q{} ones. And regarding consistency with std.algorithm, a more important factor is that std.algorithm.map is lazy, while you have an eager map, and the lazy version has lazy in the name, so the names are kind of opposite of std.algorithm.

I agree there should be a consistent naming scheme here. Aligning with std.algorithm and treating lazy as the standard sounds good. (Better than the opposite anyway; maybe not as good as being explicit for both cases - but this is a lost fight ;-) Denis

Thanks for the suggestions, everyone. I've made all the changes suggested today, plus a few others:

1. map -> amap (eager), lazyMap -> map (lazy)

2. isPure is no longer documented, since it was never meant to be part of the public API.

3. I put TaskPool.join() in a version(none) statement and removed all references to it from the documentation. If anyone has a compelling reason why this is useful (preferably with a good, short example), I'll add it back. Otherwise, it's gone, since IMHO it's an evolutionary artifact that makes sense in theory but that I can't think of a use for anymore in practice.

4. After some prodding and thinking I've finally been convinced that task() should just return auto. I thought documenting the return types explicitly would clarify things, but from the questions I've gotten about it, it seems to obfuscate more than it clarifies.

5. This one should have been done a long time ago: amap() now accepts explicit buffers with element types that are implicit conversion targets of the return type of the map function.
Mar 24 2011
prev sibling parent Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-03-24 at 12:32 -0400, bearophile wrote:
 dsimcha:
=20
 I tried to keep it as consistent as possible with std.algorithm.

 OK. Then the question is why std.algorithm uses normal strings instead of
 q{} ones.

Actually the question is why use strings at all - why not have a lambda function syntax that is less verbose?
 And regarding consistency with std.algorithm, a more important factor
 is that std.algorithm.map is lazy, while you have a eager map, and the
 lazy version has lazy in the name, so the names are kind of opposite
 of std.algorithm.

Is there actually any point in having a lazy parallel map?
 Unfortunately I don't have access to this kind of hardware except at work.

 Maybe some other person in this group has access to an 8 core CPU and
 is willing to take some numbers, to create a few little graphs.

I have such hardware, if someone sends me a program to execute I can run it a few times to get a dataset.

--
Russel.

Dr Russel Winder          t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road        m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK       w: www.russel.org.uk  skype: russel_winder
Mar 24 2011
prev sibling next sibling parent reply =?ISO-8859-1?Q?S=F6nke_Ludwig?= <ludwig informatik.uni-luebeck.de> writes:
Hm, depending on the way the pool is used, it might be a better default
to have the number of threads equal to the number of CPU cores. In my
experience the control thread mostly either waits for tasks or
processes messages, blocking in between, so it rarely uses a full
core, wasting the available computation time in this case.

However, I'm not really sure if it is like this for the majority of all 
applications or if there are more cases where the control thread will 
continue to do computations in parallel. Maybe we could collect some 
opinions on this?

On another note, I would like to see a rough description of what the
default workUnitSize is depending on the size of the input. Otherwise it
feels rather uncomfortable to use this version of parallel().
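For reference, a sketch of passing an explicit work unit size so the granularity is at least under the caller's control (the value 500 is an arbitrary choice for illustration):

```d
import std.parallelism : taskPool;

void main()
{
    auto data = new double[10_000];
    // Explicit work unit size: each pool task processes 500 consecutive
    // elements, so the granularity is visible instead of an unspecified
    // default.
    foreach (i, ref x; taskPool.parallel(data, 500))
        x = i;
}
```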

Another small addition would be to state whether the object returned by
asyncBuf is an InputRange, or which useful methods it might have
(some kind of progress counter could also be useful here).
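A small sketch of the input-range behavior being asked about here (the buffer size 64 and the doubling map are arbitrary choices for illustration):

```d
import std.algorithm : map;
import std.parallelism : taskPool;
import std.range : iota;

void main()
{
    // A lazy source range; asyncBuf evaluates it eagerly in buffers of 64
    // elements in a background task, and the wrapper it returns is itself
    // an InputRange, so it composes with foreach and range algorithms.
    auto source = iota(1_000).map!(i => i * 2);
    foreach (x; taskPool.asyncBuf(source, 64))
    {
        // consume x while the next buffer is filled concurrently
    }
}
```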

Btw., sorry if anything of this has already been discussed. I have 
missed the previous discussion unfortunately.

Sönke
Mar 24 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 03:29:52 -0400, Sönke Ludwig 
<ludwig informatik.uni-luebeck.de> said:

 Hm depending on the way the pool is used, it might be a better default 
 to have the number of threads equal the number of cpu cores. In my 
 experience the control thread is mostly either waiting for tasks or 
 processing messages and blocking in between so it rarely uses a full 
 core, wasting the available computation time in this case.
 
 However, I'm not really sure if it is like this for the majority of all 
 applications or if there are more cases where the control thread will 
 continue to do computations in parallel. Maybe we could collect some 
 opinions on this?

The current default is good for command line applications, where the main thread generally blocks while you're doing your work. The default you're proposing is good when you're using the task pool to pile up tasks to perform in the background, which is generally what you do in an event-driven application. The current default keeps the simpler, more linear programs simple.

My use case is like yours: an event-driven main thread which starts tasks to be performed in the background.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 8:03 AM, Michel Fortin wrote:
 On 2011-03-24 03:29:52 -0400, Sönke Ludwig
 <ludwig informatik.uni-luebeck.de> said:

 Hm depending on the way the pool is used, it might be a better default
 to have the number of threads equal the number of cpu cores. In my
 experience the control thread is mostly either waiting for tasks or
 processing messages and blocking in between so it rarely uses a full
 core, wasting the available computation time in this case.

 However, I'm not really sure if it is like this for the majority of
 all applications or if there are more cases where the control thread
 will continue to do computations in parallel. Maybe we could collect
 some opinions on this?

 The current default is good for command-line applications, where the main 
 thread generally blocks while you're doing your work. The default you're 
 proposing is good when you're using the task pool to pile up tasks to 
 perform in the background, which is generally what you do in an 
 event-driven application. The current default keeps the simpler, more 
 linear programs simple. My use case is like yours: an event-driven main 
 thread which starts tasks to be performed in the background.

Please review the changes carefully, then, because this is a use case I 
know next to nothing about and didn't design for. I added a few 
(previously discussed) things to help this use case, at your request.

BTW, one hint that I'd like to mention (not sure how to work it into the 
docs) is that, if you **don't** want to execute a task in the current 
thread if it's not started, but want its return value if it's done (as may 
be the case when you need threading for responsiveness, not throughput), 
you can call Task.done to check whether it's done, and then 
Task.spinForce() to get the return value once you know it has already 
finished.
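A rough sketch of that pattern, assuming the API as documented in the review (`square` is just a made-up workload):

```d
import std.parallelism;

int square(int x) { return x * x; }

void main() {
    auto t = task!square(3);
    taskPool.put(t);

    // ... event loop keeps running; check back later without blocking ...

    if (t.done) {
        // Safe: spinForce() returns immediately because the task has
        // already finished, so the current thread never executes it.
        int result = t.spinForce();
        assert(result == 9);
    }
}
```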
Mar 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 09:46:01 -0400, dsimcha <dsimcha yahoo.com> said:

 Please review the changes carefully, then, because this is a use case I 
 know next to nothing about and didn't design for.

Well, it's practically the same thing, except you never want to execute a 
task in the main thread, because the main thread acts more like a 
coordinator for various things and must stay responsive. And since your 
main thread might be creating various kinds of tasks, you need a way to 
prioritize some tasks over others. I think creating a few task pools with 
various priorities and relying on OS thread scheduling would be adequate 
in most cases, but I haven't tried.

One thing I'd want to be sure of, however, is that you can use a parallel 
foreach from within a task, so that if you have one or two tasks that 
could benefit from data parallelism it won't bring the whole system down. 
From the API I don't think it'll be a problem.
 I added a few (previously discussed) things to help this use case, at 
 your request. BTW, one hint that I'd like to mention (not sure how to 
 work it into the docs) is that, if you **don't** want to execute a task 
 in the current thread if it's not started, but want its return value if 
 it's done (as may be the case when you need threading for 
 responsiveness, not throughput), you can call Task.done to check if 
 it's done, and then Task.spinForce() to get the return value after you 
 know it's already done.

The problem with using Task.done that way is that it requires polling. It 
might be appropriate in some cases, but in general you want to receive a 
message telling you when the task is done. That's not really complicated, 
however: all the task has to do is send back its result through 
std.concurrency's "send" or through some other event dispatch mechanism 
instead of through a return statement. So I hope your tasks can accept 
"void" as a return type.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
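Something along these lines, perhaps (an untested sketch; `compute` is a hypothetical task body):

```d
import std.concurrency, std.parallelism;

// The task itself returns void; the result travels back as a message.
void compute(Tid owner, int x) {
    send(owner, x * x);
}

void main() {
    auto t = task!compute(thisTid, 7);
    taskPool.put(t);

    // No polling: block (or fold this into an event loop) until the
    // result message arrives.
    auto result = receiveOnly!int();
    assert(result == 49);
}
```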
Mar 24 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 10:34 AM, Michel Fortin wrote:
 One thing I'd want to be sure however is that you can use a parallel
 foreach from within a task. So if you have one or two tasks that could
 benefit from data parallelism it won't bring the whole system down. From
 the API I don't think it'll be a problem.

Right. This is completely do-able and I've done it before in practice.
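For example, a minimal sketch (the doubling loop is just a placeholder workload):

```d
import std.parallelism;

// A task body that itself uses data parallelism: the inner parallel
// foreach draws on the same pool of worker threads.
void bigJob(int[] data) {
    foreach (ref x; taskPool.parallel(data)) {
        x *= 2;
    }
}

void main() {
    auto data = new int[1_000];
    auto t = task!bigJob(data);
    t.executeInNewThread();  // keep the pool free for the inner foreach
    t.yieldForce();
}
```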
 I added a few (previously discussed) things to help this use case, at
 your request. BTW, one hint that I'd like to mention (not sure how to
 work it into the docs) is that, if you **don't** want to execute a
 task in the current thread if it's not started, but want its return
 value if it's done (as may be the case when you need threading for
 responsiveness, not throughput), you can call Task.done to check if
 it's done, and then Task.spinForce() to get the return value after you
 know it's already done.

 The problem with using Task.done that way is that it requires polling. It 
 might be appropriate in some cases, but in general you want to receive a 
 message telling you when the task is done. That's not really complicated 
 however: all the task has to do is send back its result through 
 std.concurrency's "send" or through some other event dispatch mechanism 
 instead of through a return statement.

Sounds like a good plan. In general, I've tried to keep the design of std.parallelism simple but composable. I have no intention of re-implementing any kind of message system when std.concurrency already does this well. If this is what you want to do, though, maybe you should just use std.concurrency. I'm not sure what std.parallelism would add.
 So I hope your tasks can accept "void" as a return type.

Yes, this works.
Mar 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-24 10:43:08 -0400, dsimcha <dsimcha yahoo.com> said:

 Sounds like a good plan.  In general, I've tried to keep the design of 
 std.parallelism simple but composable.  I have no intention of 
 re-implementing any kind of message system when std.concurrency already 
 does this well.  If this is what you want to do, though, maybe you 
 should just use std.concurrency.  I'm not sure what std.parallelism 
 would add.

What it adds is a task pool, where you have a fixed number of threads for 
an unlimited number of tasks. Spawning 10,000 threads because you have 
10,000 parallelizable tasks generally isn't a good idea. That said, 
perhaps std.concurrency's "spawn" should have the ability to create tasks 
instead of always creating new threads...

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Mar 24 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 11:00 AM, Michel Fortin wrote:

 What it adds is a task pool, where you have a fixed number of threads for 
 an unlimited number of tasks. Spawning 10,000 threads because you have 
 10,000 parallelizable tasks generally isn't a good idea. That said, 
 perhaps std.concurrency's "spawn" should have the ability to create tasks 
 instead of always creating new threads...

This is a great **long-term** todo. Please file an enhancement request in 
Bugzilla w.r.t. std.concurrency pooling threads so it doesn't get lost. 
TaskPool would probably be a good back end for this, but IMHO any use of 
TaskPool by std.concurrency should be regarded as an implementation 
detail.

On the other hand, this kind of inter-module cooperation requires lots of 
discussion about how it should be designed. It is also well beyond the 
scope of what std.parallelism was designed to do. This will take a long 
time to design and implement, have ripple effects into std.concurrency, 
etc. I don't think it needs to be implemented **now** or should hold up 
the vote and inclusion of std.parallelism in Phobos.
Mar 24 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 3:29 AM, Sönke Ludwig wrote:
 Hm depending on the way the pool is used, it might be a better default
 to have the number of threads equal the number of cpu cores. In my
 experience the control thread is mostly either waiting for tasks or
 processing messages and blocking in between so it rarely uses a full
 core, wasting the available computation time in this case.

It's funny, it seems like the task parallelism stuff is getting much more 
attention from the community than the data parallelism stuff. I hardly 
ever use the task parallelism and use mostly data parallelism. I'm 
inclined to leave this as-is because:

1.  It's definitely the right answer for data parallelism, and the task 
parallelism case is much less obvious.

2.  The main thread is utilized in the situation you describe. As I 
mentioned in a previous post, when a task that has not been started by a 
worker thread yet is forced, it is executed immediately in the thread that 
tried to force it, regardless of its position in the queue. There are two 
reasons for this:

    a.  It guarantees that there won't be any deadlocks where a task waits 
    for another task that's behind it in the queue.

    b.  If you're trying to force a task, then you obviously need the 
    results ASAP, so it's an ad-hoc form of prioritization.
 However, I'm not really sure if it is like this for the majority of all
 applications or if there are more cases where the control thread will
 continue to do computations in parallel. Maybe we could collect some
 opinions on this?

 On another note, I would like to see a rough description on what the
 default workUnitSize is depending on the size of the input. Otherwise it
 feels rather uncomfortable to use this version of parallel().

Hmm, this was there in the old documentation. Andrei recommended against 
documenting it for one of the cases because it might change. I can tell 
you that, right now, it's:

1.  Whatever workUnitSize would create TaskPool.size * 4 work units, if 
the range has a length.

2.  512 if the range doesn't have a length.
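Of course, passing the work unit size explicitly sidesteps the default entirely; a sketch, assuming the documented overload of parallel():

```d
import std.parallelism;

void main() {
    auto nums = new double[10_000];

    // Explicit work unit size: each work unit covers 100 consecutive
    // elements. Pick it large enough to amortize submission overhead,
    // small enough to keep all workers busy.
    foreach (i, ref x; taskPool.parallel(nums, 100)) {
        x = cast(double) i * i;
    }
}
```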
 Another small addition would be to state that the object returned by
 asyncBuf either is an InputRange or which useful methods it might have
 (some kind of progress counter could also be useful here).

I guess this could be a little clearer, but it's really just a plain 
vanilla input range that has a length iff the wrapped range has a length. 
There are no other public methods.
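In other words, on the consuming end it's used like any other input range; a sketch ("foo.txt" is a hypothetical input file):

```d
import std.algorithm, std.parallelism, std.stdio;

void main() {
    // byLine() reuses its buffer, so dup each line before it crosses
    // threads; asyncBuf then reads ahead in a worker thread while this
    // thread consumes the results.
    auto lines = map!"a.idup"(File("foo.txt").byLine());
    foreach (line; taskPool.asyncBuf(lines)) {
        // process line; the next buffer fills concurrently
    }
}
```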
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 AM, dsimcha wrote:
 One thing Andrei mentioned that I'm really not sure about is what to do with
 TaskPool.join().  My example for it is still terrible, because I think it's an
 evolutionary artifact.  It was useful in earlier designs that were never
 released and didn't have high-level data parallelism primitives.  I never use
 it, don't have any good use cases for it and would be inclined to remove it
 entirely.  Andrei seems to have some good use cases in mind, but he has not
 detailed any that I believe are reasonably implementable and I'm not sure
 whether they could be solved better using the higher-level data parallelism
 primitives.

If I may make a suggestion: just leave it aside. Instead of piling up 
features, just have the core available, and let actual needs show up in 
real life. An exception may be when a given potential feature would 
require major re-design; then it's better to anticipate with a design that 
could accommodate it. Otherwise, better to apply the famous phrase that 
design is finished when nothing is left to remove. Very true (even more in 
PL design, where it's about impossible to remove a feature once released, 
as an after-thought).

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling next sibling parent reply spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]

 The new docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

About the doc: very good. I could understand most of it, while knowing 
nearly nothing about parallelism prior to reading. 2 details:

* highlight key words only on first occurrence (bold online)
* wrong doc for Task.isPure (gets a copy of Task.args' doc)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 3/24/2011 8:35 AM, spir wrote:
 On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]

 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

 About the doc: very good. I could understand most of it, while knowing 
 nearly nothing about parallelism prior to reading. 2 details:
 * highlight key words only on first occurrence (bold online)
 * wrong doc for Task.isPure (gets a copy of Task.args' doc)
 Denis

Task.isPure isn't even supposed to be public. I didn't notice that it had slipped into the docs. WTF, DDoc?
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 05:32 PM, bearophile wrote:
 I tried to keep it as consistent as possible with std.algorithm.

 And regarding consistency with std.algorithm, a more important factor is 
 that std.algorithm.map is lazy, while you have an eager map, and the lazy 
 version has lazy in the name, so the names are kind of the opposite of 
 std.algorithm's.

I agree there should be a consistent naming scheme here. Aligning with 
std.algorithm to consider lazy as the standard sounds good. (Better than 
the opposite anyway, maybe not as good as being explicit in both cases 
--but this is a lost fight ;-)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/24/2011 06:04 PM, dsimcha wrote:
 Hmm, you do have a point there.  Two reasons:

 1.  map() was there first and at the time I didn't feel like renaming it.

 2.  I think map() is much more frequently useful than lazyMap() and name
verbosity
 should be inversely proportional to usage frequency.  (Which is why I really
need
 help thinking of a better name than executeInNewThread().)

If we decide eager is (an implicit) standard, then we must rename 
(several?) std.algorithm funcs, I guess.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 24 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
 On 3/24/2011 8:35 AM, spir wrote:
 On 03/24/2011 05:32 AM, dsimcha wrote:
 [...]
 
 The new docs are at
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .

 About the doc: very good. I could understand most of it, while knowing 
 nearly nothing about parallelism prior to reading. 2 details:
 * highlight key words only on first occurrence (bold online)
 * wrong doc for Task.isPure (gets a copy of Task.args' doc)
 Denis

 Task.isPure isn't even supposed to be public. I didn't notice that it had 
 slipped into the docs. WTF, DDoc?

It's not really DDoc's fault; it's dmd's fault. Currently, _all_ templated 
_anything_ is public. _Period_. It doesn't matter what access level you 
actually give it. Task is a template, isPure is a member of Task, and you 
put ddoc comments on it. dmd wrongly thinks that it's public, so ddoc sees 
it as public and puts it in the generated docs. You need to use /+ instead 
of /++ or /* instead of /** so that the comment is not an actual ddoc 
comment.

I think that there are actually multiple bug reports on the issue, but 
here's one: http://d.puremagic.com/issues/show_bug.cgi?id=2775

- Jonathan M Davis
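i.e., the workaround looks something like this (a sketch; the member body is a placeholder):

```d
// /++ ... +/ would be a ddoc comment and leak into the generated docs;
// the single-plus form below is an ordinary comment, so ddoc skips it.
/+
Returns true iff the function wrapped by this Task is pure.
+/
private bool isPure() { return false; /* placeholder body */ }
```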
Mar 24 2011