
digitalmars.D.learn - Idiomatic way to share mutable data?

reply "Charles McAnany" <dlang charlesmcanany.com> writes:
Friends,
I'm writing a little molecular simulator. Without boring you with 
the details, here's the gist of it:

struct Atom{
     double x, vx;
     double interaction(Atom a2){
         return (a2.x-this.x)^^2; //more complicated in reality
     }
}

void main(){
     Atom[] atoms = (a bunch of atoms in random positions);
     foreach(timestep; 1..1000){ //L0
         foreach(atom; atoms){ //L1
             foreach(partner; atoms){ //L2
                 atom.vx += atom.interaction(partner)/mass; //F=ma
             }
         }
         foreach(atom; atoms){ //L3
             atom.x += atom.vx * deltaT;
         }
     }
}

So here's the conundrum: How do I parallelize this efficiently? 
The first loop, L0, is not parallelizable at all, and I think the 
best speedup will be in parallelizing L1. But I immediately run 
into trouble: all the threads need access to all of atoms, and 
every atom's position is changed on every pass through L0. So to 
do this purely with message passing would involve copying the 
entirety of atoms to every thread every L0 pass. Clearly, shared 
state is desirable.

But I don't need to be careful about the shared state at all; L1 
only reads Atom.x, and only writes Atom.vx. L3 only reads Atom.vx 
and only writes Atom.x. There's no data dependency at all inside 
L1 and L3.

Is there a way to inform the compiler of this without just 
aggressively casting things to shared and immutable?

On that note, how do you pass a reference to a thread (via send) 
without the compiler yelling at you? Do you cast(immutable 
Atom[]) on send and cast(Atom[]) on receive?
Dec 22 2013
next sibling parent "Frustrated" <c1514843 drdrb.com> writes:
On Sunday, 22 December 2013 at 21:07:11 UTC, Charles McAnany 
wrote:
 How do I parallelize this efficiently?
 [...]
Partition the atoms into n sets and have one thread per set. A thread only writes to its own set; it can read from the other sets. No locks are needed, and there's no need to partition the time.

If there are many atoms spread over a large scale, you can also try grouping the sets by distance. Then you can greatly optimize the calculations by ignoring sets that are far away and contribute little to the force (or by using a point mass as an approximation). This could reduce the calculations from n^2 to something like n or n log n.
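A minimal sketch of the first idea, assuming std.parallelism is acceptable and letting its parallel foreach do the set-per-worker split (velocityStep and mass are illustrative placeholders, not the OP's real code):

```
import std.parallelism : parallel;

struct Atom
{
    double x = 0, vx = 0;

    double interaction(in Atom a2) const
    {
        immutable dx = a2.x - x;
        return dx * dx; // more complicated in reality
    }
}

// One velocity update (the OP's L1/L2 loops): each worker writes
// only the atoms in its own chunk while reading all of them.
// Reads touch only .x and writes touch only .vx, so the chunks
// never race with one another.
void velocityStep(Atom[] atoms, double mass)
{
    foreach (ref atom; parallel(atoms))
    {
        double f = 0;
        foreach (partner; atoms)
            f += atom.interaction(partner);
        atom.vx += f / mass;
    }
}
```

std.parallelism guarantees each element is handed to exactly one worker, which is what makes the "only writes to its own set" rule hold automatically here.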
Dec 22 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 22/12/13 22:07, Charles McAnany wrote:
 So here's the conundrum: How do I parallelize this efficiently?
Does your simulation rely at all on pseudo-random number generation _inside_ the loop? That is, apart from generating the initial array of atoms?

If it's purely deterministic, it ought to suffice to use std.parallelism. If not, then things can be more tricky. OTOH you might find it best to declare the array of atoms as shared; I don't have good personal experience here, so I don't know how to advise you.

You might also find it beneficial -- since in each of the inner loops you're reading from one set of values and writing to another -- to split your array of atoms into two arrays, double[] x and double[] vx, and to find another way of doing the interaction calculation (e.g. do it as interaction(size_t i, size_t j), where i and j are array indexes).
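A minimal sketch of that two-array layout (velocityStep, mass, and the quadratic force are illustrative placeholders):

```
// Positions are read-only and velocities write-only during the
// velocity update, so the two arrays have disjoint access patterns.
double[] x;  // positions
double[] vx; // velocities

double interaction(size_t i, size_t j)
{
    immutable dx = x[j] - x[i];
    return dx * dx; // stand-in for the real force law
}

void velocityStep(double mass)
{
    foreach (i; 0 .. x.length) // a natural candidate for a parallel foreach
    {
        double f = 0;
        foreach (j; 0 .. x.length)
            f += interaction(i, j);
        vx[i] += f / mass;
    }
}
```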
 On that note, how do you pass a reference to a thread (via send) without the
 compiler yelling at you? Do you cast(immutable Atom[]) on send and cast(Atom[])
 on receive?
This can be one way of handling things -- of course, you have to be careful to treat the data with respect, e.g. not to modify it through a remaining mutable reference even though you have cast away the immutability.
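A minimal sketch of that cast dance with std.concurrency (purely illustrative; the worker body is a placeholder):

```
import std.concurrency;

struct Atom { double x = 0, vx = 0; }

void worker()
{
    // Receive the immutable view, then cast the mutability back on.
    // This is only safe by convention: the owner must not touch the
    // array while the worker is using it.
    auto atoms = cast(Atom[]) receiveOnly!(immutable(Atom)[])();
    // ... update atoms here ...
    ownerTid.send(true); // signal completion
}

void main()
{
    auto atoms = new Atom[](1000);
    auto tid = spawn(&worker);
    tid.send(cast(immutable(Atom)[]) atoms); // pretend it's immutable
    receiveOnly!bool();                      // wait for the worker
}
```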
Dec 22 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 22/12/13 22:42, Joseph Rushton Wakeling wrote:
 You might also find it beneficial -- since in each of the inner loops 
 you're reading from one set of values and writing to another -- to 
 split your array of atoms into two arrays, double[] x and double[] vx, 
 and to find another way of doing the interaction calculation (e.g. do 
 it as interaction(size_t i, size_t j), where i and j are array indexes).
I should clarify what I mean there. By splitting x and vx into two separate arrays, you get the benefit that in the inner loops you have one purely read-only array and one that is write-only.

Then, as Frustrated says, you can split the write-only array into manageable chunks to be dealt with by separate threads (using separate RNGs if that's necessary: note that the default random generator rndGen is thread-local, so calls to it in different threads should generate sufficiently independent pseudo-random values).

But depending on your total system size, you might find that your CPU's ability to vectorize loop operations is actually more efficient than any split into different threads (this has certainly happened to me). And if you don't have any random number generation inside the inner loops, again, you'll probably find it better just to use std.parallelism.
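A minimal sketch of that chunked split, assuming std.parallelism (the work-unit size of 64 is an arbitrary placeholder):

```
import std.parallelism : taskPool;

// x is purely read-only here and vx purely write-only, so handing
// each worker its own 64-element work unit of vx is free of data
// races.  (64 is arbitrary; tune it, or omit it and let
// std.parallelism pick a size.)
void velocityStep(const(double)[] x, double[] vx, double mass)
{
    assert(x.length == vx.length);
    foreach (i, ref v; taskPool.parallel(vx, 64))
    {
        double f = 0;
        foreach (j; 0 .. x.length)
        {
            immutable dx = x[j] - x[i];
            f += dx * dx; // stand-in for the real interaction
        }
        v += f / mass;
    }
}
```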
Dec 22 2013
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Sunday, 22 December 2013 at 21:07:11 UTC, Charles McAnany 
wrote:
 How do I parallelize this efficiently?
From what you describe, std.parallelism would probably be appropriate for this. However, from your problem description, you're going to have a _lot_ of trouble with false sharing.

I suggest a trick like this: the atoms list probably needs to be pretty big (I'd assume this is the case, since you're modeling atoms and you're concerned about using parallelism to speed things up). Split the atoms list into several lists of approximately equal size. I'd suggest splitting into about 8 * totalCPUs lists to start with and going from there. That should be very efficient with slices. So, something like this (pseudocode):

```
Atom[] allAtoms = ...;
Atom[][] atomsList;

assert(allAtoms.length > (64 * totalCPUs),
       "Probably not worth parallelizing...");

atomsList.length = 8 * totalCPUs;
size_t partitionSize = allAtoms.length / atomsList.length;
foreach(i, ref list; atomsList)
{
    list = allAtoms[i * partitionSize .. (i + 1) * partitionSize];
}

size_t leftOvers = allAtoms.length % atomsList.length;
if(leftOvers)
{
    atomsList.length++;
    atomsList[$ - 1] = allAtoms[(atomsList.length - 1) * partitionSize .. $];
}
// Since the leftovers are the only partition that has an odd
// size, it might be appropriate to just save it for last, after
// you parallelize everything else.

foreach(a, b; parallelPickPairs(atomsList))
{
    foreach(ref atom; a)
    {
        foreach(partner; b)
        {
            atom.vx += atom.interaction(partner) / mass;
        }
    }
}
```

Obviously, the bulk of the work left is implementing "parallelPickPairs", which is just pseudocode for "use std.parallelism while picking all possible pairs of items in some list". In this case, it's picking pairs of atom arrays in atomsList. The pair-picking process should queue the work up in such a way as to prevent tasks from writing to the same list at the same time (with 8 * totalCPUs lists, there are tons of ways to do that); one possible sketch follows below. There's some room for improvement, but this kind of idea should get you started down the right path, I think.

One kind of landmine with foreach loops, though, is forgetting to use "ref" and trying to modify the thing. For instance, in your OP you're modifying a _copy_ of atom, so it wouldn't work (although in your original code that might not be how you did it). Just a heads up for you.
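A minimal sketch of one such parallelPickPairs (hypothetical; the round-robin offset scheme is just one of the "tons of ways"):

```
import std.parallelism : parallel;

// Visits every ordered pair (lists[i], lists[j]) in n rounds; within
// a round, each parallel task writes to a distinct list, so no two
// tasks ever *write* the same list at the same time.  Tasks may
// still *read* a list another task is writing, which is fine for
// the access pattern above (reads touch only .x, writes only .vx).
// `break` inside the foreach body is not supported, just as with
// std.parallelism's own parallel foreach.
struct ParallelPairs(T)
{
    T[][] lists;

    int opApply(scope int delegate(ref T[], ref T[]) dg)
    {
        immutable n = lists.length;
        foreach (offset; 0 .. n)                    // one round per offset
        {
            foreach (i, ref a; parallel(lists, 1))  // one task per list
            {
                auto b = lists[(i + offset) % n];
                dg(a, b);                           // runs on a pool thread
            }
        }
        return 0;
    }
}

auto parallelPickPairs(T)(T[][] lists)
{
    return ParallelPairs!T(lists);
}
```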
Dec 22 2013
prev sibling parent "Dan Killebrew" <nospam gmail.com> writes:
On Sunday, 22 December 2013 at 21:07:11 UTC, Charles McAnany 
wrote:
 How do I parallelize this efficiently?
 [...]
If you're doing a range-limited interaction, partition the atoms spatially and have each core handle a fixed 3D volume. Check out the NT method:

http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/shaw05_ntmethod.pdf

When the core that owns an atom detects that it may be in interaction range of atom(s) owned by another core, it sends updates to that other core.
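The NT method itself is involved; as a much simpler illustration of the same spatial-partitioning idea, here is a plain cell-list sketch in 1D (not the NT method itself; the cell width equal to the cutoff is an arbitrary choice, and the quadratic force is the OP's stand-in):

```
import std.math : abs, floor;

struct Atom { double x = 0, vx = 0; }

// Bin atoms into cells of width `cutoff`; an atom can then only be
// within range of atoms in its own cell or the two neighbouring
// cells, so far-away atoms are never even looked at.
void velocityStep(Atom[] atoms, double mass, double cutoff)
{
    size_t[][long] cells; // cell index -> indexes into `atoms`
    foreach (i, a; atoms)
        cells[cast(long) floor(a.x / cutoff)] ~= i;

    foreach (ref a; atoms)
    {
        immutable c = cast(long) floor(a.x / cutoff);
        double f = 0;
        foreach (nc; c - 1 .. c + 2) // own cell plus both neighbours
            if (auto bucket = nc in cells)
                foreach (j; *bucket)
                {
                    immutable dx = atoms[j].x - a.x;
                    if (abs(dx) <= cutoff)
                        f += dx * dx; // stand-in for the real force
                }
        a.vx += f / mass;
    }
}
```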
Dec 23 2013