
digitalmars.D.learn - How does ParallelForeach work internally?

reply "Zardoz" <luis.panadero gmail.com> writes:
How does ParallelForeach work internally?
I read that it launches multiple tasks, and each task processes 
a chunk of the range. But are the tasks synchronized? Do they 
have some kind of communication between them, are they using 
shared memory, or what?

In this example code:
import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];
   double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
}

I understand that N tasks are launched, each doing a chunk of 
100 elements from the logs array. But what happens with "total"? 
Is there only one "total", with D using memory barriers / atomic 
operations to write to it? Or does each task have its own 
"total" that is later joined into the outer "total"?
Dec 01 2012
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Saturday, 1 December 2012 at 10:35:38 UTC, Zardoz wrote:
   auto logs = new double[10_000_000];
   double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
 }

 I understand that N tasks are launched, each doing a chunk of 
 100 elements from the logs array. But what happens with 
 "total"? Is there only one "total", with D using memory 
 barriers / atomic operations to write to it? Or does each task 
 have its own "total" that is later joined into the outer 
 "total"?
taskPool.parallel is a library function; it doesn't make the compiler smarter and doesn't get much help from the compiler. That means your "total" variable gets no special treatment: it's still a single local variable, referenced from the loop body, which foreach turns into a function. .parallel runs that function in several threads, so you get a race condition and most probably an incorrect total. You should avoid mutating the same memory inside a parallel foreach: processing different elements of one array (even a local one) is fine; writing to a single variable from all the threads is not.
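
For example, a race-free way to keep this loop and still accumulate a sum is to give each worker thread its own accumulator via taskPool.workerLocalStorage and combine the per-thread partial sums after the loop. A minimal sketch of that idea:

import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];

   // One accumulator per worker thread, so no two threads
   // ever write to the same memory.
   auto partialSums = taskPool.workerLocalStorage(0.0);

   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     partialSums.get += elem;  // updates this thread's own copy
   }

   // Combine the per-thread partial sums sequentially.
   double total = 0;
   foreach(partial; partialSums.toRange)
     total += partial;

   writeln(total);
}

Each thread only ever touches its own slot, so no synchronization is needed inside the loop; the only sequential work is the short combining loop at the end.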
Dec 01 2012
parent reply "Zardoz" <luis.panadero gmail.com> writes:
On Saturday, 1 December 2012 at 10:58:55 UTC, thedeemon wrote:
 taskPool.parallel is a library function; it doesn't make the 
 compiler smarter and doesn't get much help from the compiler. 
 That means your "total" variable gets no special treatment: 
 it's still a single local variable, referenced from the loop 
 body, which foreach turns into a function. .parallel runs that 
 function in several threads, so you get a race condition and 
 most probably an incorrect total. You should avoid mutating the 
 same memory inside a parallel foreach: processing different 
 elements of one array (even a local one) is fine; writing to a 
 single variable from all the threads is not.
Hmm... So ParallelForeach only launches N tasks, each doing work over a slice of the range, and nothing more. Would the previous code work better if I set "total" to be shared, hoping that D's shared variables now have the internal barriers working, or do I need to use semaphores manually?

import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];
   shared double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
}

PS: I know that I can use a reduction to do the same thing much better...
Dec 01 2012
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

 Would the previous code work better if I set "total" to be 
 shared, hoping that D's shared variables now have the internal 
 barriers working, or do I need to use semaphores manually?
Probably core.atomic is the way to go. A semaphore would be overkill.
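
A minimal sketch of that suggestion (illustrative only; note that atomicOp!"+=" on a floating-point value is essentially a compare-and-swap loop, so it is correct but can be slow under contention):

import std.stdio;
import std.parallelism;
import std.math;
import core.atomic;

void main() {
   auto logs = new double[10_000_000];
   shared double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     atomicOp!"+="(total, elem);  // atomic read-modify-write on the shared total
   }
   writeln(atomicLoad(total));
}

Because every iteration contends on the same variable (and the same cache line), this will usually be much slower than per-thread accumulation.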
Dec 01 2012
parent "jerro" <a a.com> writes:
On Saturday, 1 December 2012 at 12:51:27 UTC, thedeemon wrote:
 On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

 Would the previous code work better if I set "total" to be 
 shared, hoping that D's shared variables now have the internal 
 barriers working, or do I need to use semaphores manually?
Probably core.atomic is the way to go. A semaphore would be overkill.
The easiest and fastest way is probably to use taskPool.reduce, like this:

import std.algorithm, std.range;  // for map and iota

auto total = taskPool.reduce!"a+b"(
    iota(10_000_000).map!(a => log(a + 1.0)));
writeln(total);

Functions in core.atomic use instructions with a lock prefix, and according to http://www.agner.org/optimize/instruction_tables.pdf that "typically costs more than a hundred clock cycles", so calling them for every element will probably slow things down significantly. It's best to just avoid accessing the same memory from multiple threads wherever possible.
Dec 01 2012