
digitalmars.D.learn - How does ParallelForeach work internally?

reply "Zardoz" <luis.panadero gmail.com> writes:
How does ParallelForeach work internally?
I read that it launches multiple tasks, and each task processes 
a chunk of the range. But are the tasks synchronized? Do they 
have some kind of communication between them, are they using 
shared memory, or what?

In this example code:
import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];
   double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
}

I understand that N tasks are launched, each doing a chunk of 
100 elements from the logs array. But what happens with "total"? 
Is there only one "total", with D using memory barriers / atomic 
operations to write to it? Or does each task have its own 
"total" that is later joined into the outer "total"?
Dec 01 2012
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Saturday, 1 December 2012 at 10:35:38 UTC, Zardoz wrote:
   auto logs = new double[10_000_000];
   double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
 }

 I understand that N tasks are launched, each doing a chunk of 
 100 elements from the logs array. But what happens with 
 "total"? Is there only one "total", with D using memory 
 barriers / atomic operations to write to it? Or does each task 
 have its own "total" that is later joined into the outer 
 "total"?
taskPool.parallel is a library function; it doesn't make the compiler smarter and doesn't get much help from the compiler. That means your "total" variable gets no special treatment: it's still a single local variable, referenced from the loop body, which foreach turns into a function. .parallel runs that function in several threads, so you get a race condition and most probably an incorrect total. You should avoid mutating the same memory inside a parallel foreach: processing different elements of one array (even a local one) is fine; writing to a single variable from all the threads is not.
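
For example, a race-free way to keep this loop and still accumulate a sum is to give each worker thread its own accumulator via taskPool.workerLocalStorage and combine the per-thread partial sums after the loop. A minimal sketch of that idea:

import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];

   // One accumulator per worker thread, so no two threads
   // ever write to the same memory.
   auto partialSums = taskPool.workerLocalStorage(0.0);

   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     partialSums.get += elem;  // updates this thread's own copy
   }

   // Combine the per-thread partial sums sequentially.
   double total = 0;
   foreach(partial; partialSums.toRange)
     total += partial;

   writeln(total);
}

Each thread only ever touches its own slot, so no synchronization is needed inside the loop; the only sequential work is the short combining loop at the end.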
Dec 01 2012
parent reply "Zardoz" <luis.panadero gmail.com> writes:
On Saturday, 1 December 2012 at 10:58:55 UTC, thedeemon wrote:
 taskPool.parallel is a library function; it doesn't make the 
 compiler smarter and doesn't get much help from the compiler. 
 That means your "total" variable gets no special treatment: 
 it's still a single local variable, referenced from the loop 
 body, which foreach turns into a function. .parallel runs that 
 function in several threads, so you get a race condition and 
 most probably an incorrect total. You should avoid mutating the 
 same memory inside a parallel foreach: processing different 
 elements of one array (even a local one) is fine; writing to a 
 single variable from all the threads is not.
Hmm... So ParallelForeach only launches N tasks, each doing work over a slice of the range, and nothing more. Would the previous code work better if I set "total" to be shared, hoping that D's shared variables now have the internal barriers working, or do I need to use semaphores manually?

import std.stdio;
import std.parallelism;
import std.math;

void main() {
   auto logs = new double[10_000_000];
   shared double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     total += elem;
   }

   writeln(total);
}

PS: I know that I can use a reduction to do the same thing much better...
Dec 01 2012
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

 Would the previous code work better if I set "total" to be 
 shared, hoping that D's shared variables now have the internal 
 barriers working, or do I need to use semaphores manually?
Probably core.atomic is the way to go. A semaphore would be overkill.
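
A minimal sketch of that suggestion (illustrative only; note that atomicOp!"+=" on a floating-point value is essentially a compare-and-swap loop, so it is correct but can be slow under contention):

import std.stdio;
import std.parallelism;
import std.math;
import core.atomic;

void main() {
   auto logs = new double[10_000_000];
   shared double total = 0;
   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
     elem = log(i + 1.0);
     atomicOp!"+="(total, elem);  // atomic read-modify-write on the shared total
   }
   writeln(atomicLoad(total));
}

Because every iteration contends on the same variable (and the same cache line), this will usually be much slower than per-thread accumulation.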
Dec 01 2012
parent "jerro" <a a.com> writes:
On Saturday, 1 December 2012 at 12:51:27 UTC, thedeemon wrote:
 On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

 Would the previous code work better if I set "total" to be 
 shared, hoping that D's shared variables now have the internal 
 barriers working, or do I need to use semaphores manually?
Probably core.atomic is the way to go. A semaphore would be overkill.
The easiest and fastest way is probably to use taskPool.reduce, like this:

import std.algorithm, std.range;  // for map and iota

auto total = taskPool.reduce!"a+b"(
    iota(10_000_000).map!(a => log(a + 1.0)));
writeln(total);

Functions in core.atomic use instructions with a lock prefix, and according to http://www.agner.org/optimize/instruction_tables.pdf that "typically costs more than a hundred clock cycles", so calling them for every element will probably slow things down significantly. It's best to just avoid accessing the same memory from multiple threads wherever possible.
Dec 01 2012