
digitalmars.D.learn - Parallelizing code -- design problem in networkish code

reply Charles Hixson <charleshixsn earthlink.net> writes:
I'm trying to parallelize some code which is essentially a network of 
cells with an index.  I can make the cells immutable, if I route all 
access to them via the index, AND if I can update the index.  The index 
would not need to have items deleted, but updates would cause it to 
point to different addresses as new versions of the cells replaced old 
versions.  Also the index is LARGE.  I can't size it exactly, but it 
would have over 100,000 entries.  To be safe I've said it's indexed by a 
ulong, though I don't have enough RAM to allow that to be reasonable.

So now I'm trying to parallelize it.  I don't want a per thread copy for 
multiple reasons (it's large, it gets updates, etc.).  But if I read 
TDPL correctly, shared won't allow me to update it.  If I can't update 
it, I need to restart the program to resize it AND the cells can't be 
immutable.

It would appear that making the cells immutable is silly.  All I need is 
to protect a few small places where I adjust various weights.  But TDPL 
says that if I adopt that approach, I also need write locks around every 
read.  So that looks infeasible.

I need the index mainly to sequentially step through the cells, so I 
could implement an equivalent functionality by having prior and next 
pointers in each cell...but if I use pointers then making one cell 
immutable makes the whole network immutable, and that's no good at all.

Is there a decent answer in D?  This is basically one class replicated a 
huge number of times.  Ideally each instance would be in a separate 
thread, but that would overwhelm the OS, and I don't have more than 
around 6 cores on my current system.  But if each instance were in a 
separate thread, then I could do this with message passing....except for 
stepping through all the cells sequentially. (Well, I could do that by 
implementing prior thread and next thread variables, I guess, but that 
seems contrary to the purpose of threads.  Still, that's a mandatory 
part of the problem.)
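(For later readers: the structure being described is roughly the following. The names and fields are hypothetical sketches of the design under discussion, not the original poster's code.)

```d
import std.stdio;

// Hypothetical shape of the design: one large index array, and cells
// that refer to each other only through that index.
class Cell
{
    ulong id;        // this cell's slot in the index
    ulong[] ups;     // Id#s (indexes into the array) of cells above
    ulong[] downs;   // Id#s of cells below
    float weight = 0;

    this(ulong id) { this.id = id; }
}

void main()
{
    Cell[] cells;                  // the index (would be 100,000+ entries)
    foreach (i; 0 .. 1000)
        cells ~= new Cell(i);

    // Sequential stepping happens through the index, not through
    // prior/next pointers stored in the cells themselves.
    foreach (cell; cells)
    {
        // ... process cell ...
    }
    writeln(cells.length);  // 1000
}
```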
Dec 13 2012
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
Parallelism, concurrency, and multi-threading in general are fascinating 
topics. I can't claim that I have extensive experience on these topics 
but I have written the following two chapters after studying the 
std.parallelism and std.concurrency modules:

   http://ddili.org/ders/d.en/parallelism.html

   http://ddili.org/ders/d.en/concurrency.html

Still, when I tried some of those ideas, I did not always see the 
performance gains that I had expected, at least with a particular test 
program. After reading articles like the following, now I understand how 
perhaps counter-intuitively, single-threading can be way faster than 
multi-threading, including the CAS-style lock-free multi-threading. The 
following has been posted to D forums recently:

   http://martinfowler.com/articles/lmax.html

Sorry to ramble on about this topic. :) I am thinking out loud that next 
time I will try to make the actual processing of data single-threaded as 
I have learned from the Disruptor architecture and see whether that 
makes it faster.

On 12/13/2012 01:32 PM, Charles Hixson wrote:
 I'm trying to parallelize some code which is essentially a network of
 cells with an index. I can make the cells immutable,
Because you say "parallelize", I don't think you need to make your data immutable at all, because parallelism requires that the data is not shared.

Even if you need to share the data, if you are doing message passing concurrency, then you can safely cast your immutable data to mutable as long as you know that there is only one thread that accesses it at a given time. Message passing makes it easy to reason about.

But I think 'shared' is likely better than 'immutable' in your case.

Not knowing your exact situation, I would recommend the simplest option first, which is std.parallelism:

    import std.stdio;
    import std.parallelism;

    void main()
    {
        auto array = [ 1, 2 ];

        foreach (element; parallel(array)) {
            // This block is executed on the elements in parallel
            writeln(element);
        }
    }

std.parallelism works on ranges, which makes it possible to parallelize operations lazily, which enables making the objects as needed. Here is a thousand ints that are processed in parallel, which never need to live in an actual array:

    import std.stdio;
    import std.parallelism;
    import std.range;

    void main()
    {
        auto numbers = iota(1000);

        foreach (element; parallel(numbers)) {
            writeln(element);
        }
    }

In case you are not already familiar, ranges are fundamentally a very simple concept. There is the following chapter on ranges:

    http://ddili.org/ders/d.en/ranges.html

Ali
Dec 13 2012
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 12/13/2012 07:30 PM, Ali Çehreli wrote:
 Parallelism, concurrency, and multi-threading in general are fascinating
 topics. I can't claim that I have extensive experience on these topics
 but I have written the following two chapters after studying the
 std.parallelism and std.concurrency modules:

 http://ddili.org/ders/d.en/parallelism.html

 http://ddili.org/ders/d.en/concurrency.html

 Still, when I tried some of those ideas, I did not always see the
 performance gains that I had expected, at least with a particular test
 program. After reading articles like the following, now I understand how
 perhaps counter-intuitively, single-threading can be way faster than
 multi-threading, including the CAS-style lock-free multi-threading. The
 following has been posted to D forums recently:

 http://martinfowler.com/articles/lmax.html

 Sorry to ramble on about this topic. :) I am thinking out loud that next
 time I will try to make the actual processing of data single-threaded as
 I have learned from the Disruptor architecture and see whether that
 makes it faster.

 On 12/13/2012 01:32 PM, Charles Hixson wrote:
  > I'm trying to parallelize some code which is essentially a network of
  > cells with an index. I can make the cells immutable,

 Because you say "parallelize", I don't think you need to make your data
 immutable at all, because parallelism requires that the data is not shared.

 Even if you need to share the data, if you are doing message passing
 concurrency, then you can safely cast your immutable data to mutable as
 long as you know that there is only one thread that accesses it at a
 given time. Message-passing makes it easy to reason about.

 But I think 'shared' is likely better than 'immutable' in your case.

 Not knowing your exact situation, I would recommend the simplest option
 first, which is std.parallelism:

 import std.stdio;
 import std.parallelism;

 void main()
 {
     auto array = [ 1, 2 ];

     foreach (element; parallel(array)) {
         // This block is executed on the elements in parallel
         writeln(element);
     }
 }

 std.parallelism works on ranges, which makes it possible to parallelize
 operations lazily, which enables making the objects as needed. Here is a
 thousand ints that are processed in parallel, which never need to live
 in an actual array:

 import std.stdio;
 import std.parallelism;
 import std.range;

 void main()
 {
     auto numbers = iota(1000);

     foreach (element; parallel(numbers)) {
         writeln(element);
     }
 }

 In case you are not already familiar, ranges are fundamentally a very
 simple concept. There is the following chapter on ranges:

 http://ddili.org/ders/d.en/ranges.html

 Ali
std.parallelism is indeed what I was originally thinking of using. But it explicitly states that it doesn't support thread safety except on routines that are marked safe or trusted, which ParallelForEach isn't. It's true that the TaskPool.this methods are marked trusted, but it's not clear to me that this implies that the indirect use in:

    foreach (i, ref elem; taskPool.parallel(logs, 100))

is trusted. If it is, then that's a good answer.

Now what I was thinking of involved an array in another class (i.e., not the Cell class) defined:

    Cell[] cells;

and the Cell class, which includes:

    public class Cell
    {   ...
        Idno[] ups;
        ...
    }

where ups is an array of Id#s which are indexes into cells. While processing a cell, it's likely that ups will be used to index into cells to adjust the weight of the similar reference down to this cell. (And yes, there's an array of downs also, which also does an indirection through cells to adjust weights of cells below it that refer to the current cell.)

This process offers several possibilities of collision when done in parallel, which was why I thought that the alterations should probably be done via creating a new cell and substituting it for the original one. If this isn't necessary, it would be a real benefit. Additionally, it's quite possible that a cell would be added to the cells array during activation, but your point about this being handled via ranges is a good answer to that worry.

Perhaps part of the reason I'm not understanding is that all the examples involve cases where there isn't this kind of mutual interaction. But to me it looks as if my choices are making the cells immutable (doable, if unpleasant) or placing write synchronization locks around every place where I alter a weight. Only TDPL indicates that if I do that, I also need to place the same locks around every place where a value might be read. Which is much worse than just making the cells immutable.
(Making the cells immutable is bad enough, as there are lots more pieces than I've shown, and if they are to be immutable, the constructor has to set ALL of them in their final form.)

P.S.: I haven't checked, but does std.parallelism's ParallelForeach!(R) parallel(R)(R range) copy the specified range to the RAM space of the other thread? If so, I may not be able to use it due to excessive RAM consumption. If it just passes the range, and gets the items from the originating thread as needed, this would be no problem. Actually, reading the chapter you wrote, it appears that parallel() does all its work in the main thread, and only passes the items to be processed to tasks as free threads in the thread pool become available. Which is what I would want. I wish that I were a bit more certain.

But at this point the main question is "Is there a decent way to avoid the requirement of making cells immutable?" Perhaps it would be something like:

    void modifyWeight (shared Cell thisCell, shared Cell upCell, float weight)
    {
        synchronized (thisCell, upCell)
        {
            ....
        }
    }

Only shared variables aren't at all what I want...or I don't think they are. Certainly the cells array can't be shared, as that would prevent me from adding new cells to it. But on the other hand, I sure don't want the cells array copied into every new thread. That would grind the system to a halt in no time.

Am I just confusing myself by reading about different approaches to parallelism, or is this just something that can't really be handled? When a thread gets started, just what gets copied to it? If the array is in a class instance, does only the reference to the class get copied, or does the entire class instance get copied? If the whole thing gets copied, then this is clearly not going to work, but I really can't be sure. All of the examples that I find concern themselves with trivial amounts of data, so that it doesn't really matter if it gets copied around, but that's not the case here.
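(For later readers: one way to make the modifyWeight sketch above compile without 'shared' is to rely on every class instance's built-in monitor and lock the two cells in a fixed order, so that two threads adjusting the same pair from opposite directions cannot deadlock. A minimal sketch; the Cell fields are hypothetical:)

```d
// Hypothetical Cell with just enough fields for the sketch.
class Cell
{
    ulong id;
    float weight = 0;
    this(ulong id) { this.id = id; }
}

// Lock the lower-id cell first: a consistent global lock order
// prevents the deadly-embrace scenario discussed in this thread.
void modifyWeight(Cell thisCell, Cell upCell, float delta)
{
    auto first  = thisCell.id < upCell.id ? thisCell : upCell;
    auto second = thisCell.id < upCell.id ? upCell   : thisCell;

    synchronized (first)
    {
        synchronized (second)
        {
            thisCell.weight += delta;
        }
    }
}
```

(D monitors are recursive, so this also behaves if both arguments happen to be the same cell.)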
If the data is going to get copied, then the entire threading approach needs to be scrapped. Although....

Perhaps another approach is to start a thread, and store the data in that thread. Then to find the data, you query that thread; to save the data, you save it to that thread. All data file I/O would take place within that thread. This would require a lot of planning that I haven't yet done, but it doesn't seem totally implausible. I'd probably need to do manual management of sending things to the thread pool rather than using the parallel method. But I still need to ensure that there aren't write collisions, so that still looks as if I need to make the cells immutable. And lots more data will be flowing around, but not as much at any one time, so the top RAM requirement should be less. (My rough guess is that twice as much data would be flowing, but the top RAM requirement might be 0.01 of an approach that involved copying the entire array once. I can't even take once per thread seriously as a possibility.)

OTOH...it does look as if on my current system (6 cores) the serial form would be faster. But I don't expect to stay on my current system forever. Intel has announced that tablets in a few years will have over 40 cores. So the design should be for running on lots of cores, even if that's not what I have right now.
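(For later readers: the owner-thread idea above maps directly onto std.concurrency message passing. A minimal sketch, with hypothetical message types and a plain associative array standing in for the real Cell store:)

```d
import std.concurrency;
import std.stdio;

// Hypothetical messages understood by the owner thread.
struct Get  { ulong id; Tid replyTo; }
struct Put  { ulong id; float weight; }
struct Done {}

// One thread owns all the data; every other thread queries it,
// so no locking is needed on the store itself.
void owner()
{
    float[ulong] weights;   // stand-in for the real Cell store
    bool running = true;
    while (running)
    {
        receive(
            (Put p)  { weights[p.id] = p.weight; },
            (Get g)  { g.replyTo.send(weights.get(g.id, 0.0f)); },
            (Done _) { running = false; }
        );
    }
}

void main()
{
    auto tid = spawn(&owner);
    tid.send(Put(42, 0.5f));
    tid.send(Get(42, thisTid));
    writeln(receiveOnly!float());
    tid.send(Done());
}
```

Only messages flow between threads; the store itself is never copied.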
Dec 13 2012
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 12/13/2012 08:38 PM, Charles Hixson wrote:

 Now what I was thinking of involved an array in another class (i.e., not
 the Cell class) defined:
 Cell[] cells;
 and the Cell class, which includes:
 public class Cell
 { ...
 Idno[] ups;
 ...
 }
 where ups is an array of Id#s which are indexes into cells. While
 processing a cell, it's likely that ups will be used to index into cells
 to adjust the weight of the similar reference down to this cell. (And
 yes, there's an array of downs also, which also does an indirection
 through cells to adjust weights of cells below it that refer to the
 current cell.
That description makes me think that there will be data races on accesses to weights: Some threads will be reading while others will be writing. If I understood correctly, that would disqualify std.parallelism.
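(For later readers: if the only contended write really is the weight adjustment itself, one narrow escape hatch is an atomic read-modify-write from core.atomic, which avoids both locks and the race, though it only covers that single operation. A minimal sketch with hypothetical slot counts:)

```d
import core.atomic;
import std.parallelism;
import std.range;
import std.stdio;

void main()
{
    // 10 shared weight slots, updated concurrently by 1000 parallel tasks.
    auto weights = new shared(float)[10];
    foreach (i; 0 .. 10)
        atomicStore(weights[i], 0.0f);   // floats default to NaN in D

    foreach (i; parallel(iota(1000)))
    {
        // Atomic "+=": a CAS loop under the hood, so concurrent
        // updates to the same slot are neither torn nor lost.
        atomicOp!"+="(weights[i % 10], 1.0f);
    }

    // The parallel foreach joins before continuing, so reading is safe.
    writeln(atomicLoad(weights[0]));  // 100
}
```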
 This process offers several possibilities of collision when done in
 parallel, which was why I though that the alterations should probably be
 done via creating a new cell and substituting it for the original one.
 If this isn't necessary, it would be a real benefit.
I hope others have answers to your questions. Sorry... Ali
Dec 14 2012
parent Charles Hixson <charleshixsn earthlink.net> writes:
On 12/14/2012 01:50 PM, Ali Çehreli wrote:
 On 12/13/2012 08:38 PM, Charles Hixson wrote:

  > Now what I was thinking of involved an array in another class (i.e., not
  > the Cell class) defined:
  > Cell[] cells;
  > and the Cell class, which includes:
  > public class Cell
  > { ...
  > Idno[] ups;
  > ...
  > }
  > where ups is an array of Id#s which are indexes into cells. While
  > processing a cell, it's likely that ups will be used to index into cells
  > to adjust the weight of the similar reference down to this cell. (And
  > yes, there's an array of downs also, which also does an indirection
  > through cells to adjust weights of cells below it that refer to the
  > current cell.

 That description makes me think that there will be data races on
 accesses to weights: Some threads will be reading while others will be
 writing. If I understood correctly, that would disqualify std.parallelism.

  > This process offers several possibilities of collision when done in
  > parallel, which was why I though that the alterations should probably be
  > done via creating a new cell and substituting it for the original one.
  > If this isn't necessary, it would be a real benefit.

 I hope others have answers to your questions. Sorry...

 Ali
Actually, data races will probably need to be accepted, at least in the sense of some (external) reads getting stale data. That's no worse than using immutable data and creating a new version with every change. (At that point, any cell holding a pointer to the old version would be working with stale data.) Any approach involving locking reads of data will quickly devolve into a HUGE number of pairs of nodes in deadly embrace.

The thing is, stale data will generally only have a minor effect, and I think that an optimal solution is probably impossible. What I'm looking for is a "sufficiently good" solution. Think more of "eventual consistency" than of consistency.

(Note: This reply is over 10 days after the original post, and is only for the benefit of later readers.)
Dec 27 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Do updates have to happen concurrently with read operations?  Could you maybe queue updates and batch them at specific times instead?  That would save you from having to protect every access.  Or maybe different portions of the network could be owned by different threads?  That makes it essentially a map/reduce problem.  A ReadWriteMutex might be another option to consider too.  Finally, maybe you could multiplex all of this across a thread pool--have each thread iterate across a list of delegates, each of which is tied to a single node.  That's pretty close to the std.parallelism solution.
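(For later readers: the ownership-partitioning idea can be sketched roughly as follows. The types and chunk sizes are hypothetical; the point is that each work unit is a contiguous slice touched by exactly one thread, so no locking is needed within a pass:)

```d
import std.parallelism;
import std.range;
import std.stdio;

struct Cell { float weight = 0; }

void main()
{
    auto cells = new Cell[100];

    // Partition the network into contiguous chunks and hand each work
    // unit to one thread; no two threads ever share a chunk during
    // the pass, so the writes below need no synchronization.
    foreach (chunk; parallel(chunks(cells, 25), 1))
    {
        foreach (ref cell; chunk)
            cell.weight += 1.0f;   // exclusively owned during this pass
    }

    writeln(cells[0].weight, " ", cells[99].weight);  // 1 1
}
```

Cross-chunk effects (the ups/downs adjustments) would still need to be queued and applied between passes, which is where the batching suggestion above comes in.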
Dec 14 2012