digitalmars.D.learn - What's the "right" way to do openmp-style parallelism?


"Charles" <dlang charlesmcanany.com> writes:

Friends,

I have a program that would be pretty easy to parallelize with an
OpenMP pragma in C. I'd like to avoid the performance cost of
using message passing, and the shared qualifier seems like it's
enforcing guarantees I don't need. Essentially, I have

x = float[imax][jmax]; //x is about 8 GB of floats
for(j = 0; j < jmax; j++){
     //create some local variables.
     for(i = 0; i < imax; i++){
         x[j][i] = complicatedFunction(i, x[j-1], other, local, variables);
     }
}

In C, I'd just stick a #pragma omp parallel for around the inner 
loop (since the outer loop obviously can't be parallelized).

How should I go about this in D? I want to avoid copying data 
around if it's possible since these arrays are huge.

Cheers,
Charles.

Sep 06 2015

"Meta" <jared771 gmail.com> writes:

On Monday, 7 September 2015 at 02:56:04 UTC, Charles wrote:
 Friends,

 I have a program that would be pretty easy to parallelize with
 an OpenMP pragma in C. I'd like to avoid the performance cost
 of using message passing, and the shared qualifier seems like
 it's enforcing guarantees I don't need. Essentially, I have

 x = float[imax][jmax]; //x is about 8 GB of floats
 for(j = 0; j < jmax; j++){
 //create some local variables.
     for(i = 0; i < imax; i++){
         x[j][i] = complicatedFunction(i, x[j-1], other, local, 
 variables);
     }
 }

 In C, I'd just stick a #pragma omp parallel for around the 
 inner loop (since the outer loop obviously can't be 
 parallelized).

 How should I go about this in D? I want to avoid copying data 
 around if it's possible since these arrays are huge.

 Cheers,
 Charles.

I believe this is what you want: 


I believe that all you would need to change is to have your inner 
loop become:

foreach(i, ref f; x[j].parallel)
{
     f = complicatedFunction(i, x[j-1], etc...);
}

Don't quote me on that, though, as I'm not very experienced with 
std.parallelism.
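
A minimal sketch of that suggestion as a complete program, under stated assumptions: the rows are heap-allocated slices (float[][]) rather than static arrays, fillRows is a hypothetical helper name, and the body of complicatedFunction is a placeholder that only reads the previous row. Nothing is copied; each parallel iteration writes one element of row j through a reference.

```d
import std.parallelism : parallel;

// Hypothetical placeholder: the real function is not shown in the thread.
// It reads the previous row and touches no other shared state.
float complicatedFunction(size_t i, const(float)[] prev) pure
{
    return prev[i] + 1.0f;
}

// Hypothetical helper: the outer loop stays sequential because row j
// depends on row j-1; only the inner loop is spread over the task pool.
void fillRows(float[][] x)
{
    foreach (j; 1 .. x.length)
    {
        const(float)[] prev = x[j - 1];   // a slice, so no data is copied
        foreach (i, ref f; parallel(x[j]))
        {
            f = complicatedFunction(i, prev);
        }
    }
}

void main()
{
    enum imax = 10, jmax = 10;
    auto x = new float[][](jmax, imax);
    x[0][] = 0.0f;                        // floats default to NaN in D
    fillRows(x);
    assert(x[jmax - 1][0] == 9.0f);
}
```

Whether this is safe for the real code depends on complicatedFunction writing nothing that other iterations read.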

Sep 06 2015

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Mon, 2015-09-07 at 02:56 +0000, Charles via Digitalmars-d-learn
wrote:
 […]

 x = float[imax][jmax]; //x is about 8 GB of floats
 for(j = 0; j < jmax; j++){
 //create some local variables.
      for(i = 0; i < imax; i++){
          x[j][i] = complicatedFunction(i, x[j-1], other, local,
 variables);
      }
 }

So as to run things through a compiler, I expanded the code fragment
to:

    float complicatedFunction(int i, float[] x) pure {
      return 0.0;
    }

    void main() {
      immutable imax = 10;
      immutable jmax = 10;
      float[imax][jmax] x;
      for(int j = 1; j < jmax; j++){
        for(int i = 0; i < imax; i++){
          x[j][i] = complicatedFunction(i, x[j-1]);
        }
      }
    }

Hopefully this is an accurate representation of the original code. Note
the change in the j iteration since j-1 is being used as an index. Of
course I immediately wanted to change this to:

    float complicatedFunction(int i, float[] x) pure {
      return 0.0;
    }

    void main() {
      immutable imax = 10;
      immutable jmax = 10;
      float[imax][jmax] x;
      foreach(int j; 1..jmax){
        foreach(int i; 0..imax){
          x[j][i] = complicatedFunction(i, x[j-1]);
        }
      }
    }

Hopefully this is not now wrong as a representation of the original
problem.

 In C, I'd just stick a #pragma omp parallel for around the inner
 loop (since the outer loop obviously can't be parallelized).

I would hope you meant C++, not C, there. ;-)

I am not sure OpenMP would work to parallelize your C++ (or C) code.

Given that complicatedFunction has access to the whole of x[j-1], there
could be coupling between x[j-1][m] and x[j-1][n] in the function, which
would lead to potentially different results being computed in the
sequential and parallel cases. This is not a C/C++/D thing; it is a
data coupling thing.

So although Meta suggested a parallel foreach (the D equivalent of
OpenMP parallel for pragma), something along the lines:

    import std.parallelism: parallel;

    float complicatedFunction(int i, float[] x) pure {
      return 0.0;
    }

    void main() {
      immutable imax = 10;
      immutable jmax = 10;
      float[imax][jmax] x;
      foreach(int j; 1..jmax){
        foreach(int i, ref item; parallel(x[j-1])){
          x[j][i] = complicatedFunction(i, item);
        }
      }
    }

(though sadly, this doesn't compile for a reason I can't fathom
instantly), this brings into stark relief the fact that there is a
potential coupling between x[j-1][m] and x[j-1][n], which means
enforcing parallelism here will almost certainly result in the wrong
values being calculated.
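
On the compile failure: one definite problem in the fragment is that item is a single float while complicatedFunction expects a float[]. Two further candidates, offered as assumptions rather than verified diagnoses, are that a static array such as x[j-1] is not a range and needs slicing with [] before being handed to parallel, and that the foreach index supplied by parallel is a size_t rather than an int. A sketch that keeps the static arrays and avoids all three, with a placeholder body for complicatedFunction and assuming it only reads the previous row:

```d
import std.parallelism : parallel;

// Hypothetical placeholder body; it only reads the previous row.
float complicatedFunction(size_t i, const(float)[] prev) pure
{
    return prev[i] + 1.0f;
}

void main()
{
    enum imax = 10;
    enum jmax = 10;
    float[imax][jmax] x;
    x[0][] = 0.0f;                          // seed row 0 (default is NaN)
    foreach (j; 1 .. jmax)
    {
        const(float)[] prev = x[j - 1][];   // slice of the static row, no copy
        // Iterate over the row being written; i is inferred as size_t.
        foreach (i, ref item; parallel(x[j][]))
        {
            item = complicatedFunction(i, prev);
        }
    }
    assert(x[jmax - 1][0] == 9.0f);
}
```

The parallel loop here runs over the destination row x[j] and passes the whole previous row as a read-only slice, so any coupling inside complicatedFunction is limited to reads of data that no task is writing.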

This is a standard pipeline describable with a map, something along the
lines of:

    import std.algorithm: map;

    float complicatedFunction(int i, float[] x) pure {
      return 0.0;
    }

    void main() {
      immutable imax = 10;
      immutable jmax = 10;
      float[imax][jmax] x;
      foreach(int j; 1..jmax){
        x[j] = map!(a => complicatedFunction(a, x[j-1]))(x[j-1]);
      }
    }

(but this also has a compilation error, which I hope someone can fix…)
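
One way the map formulation might be made to compile, as a sketch under assumptions: map is applied to the element indices (via iota) rather than to the previous row's values, since complicatedFunction takes an index, and the lazy result is written into the destination row with copy, because a lazy range cannot be assigned to a static array directly. The function body is again a placeholder.

```d
import std.algorithm : copy, map;
import std.range : iota;

// Hypothetical placeholder body; it only reads the previous row.
float complicatedFunction(size_t i, const(float)[] prev) pure
{
    return prev[i] + 1.0f;
}

void main()
{
    enum imax = 10;
    enum jmax = 10;
    float[imax][jmax] x;
    x[0][] = 0.0f;                          // seed row 0 (default is NaN)
    foreach (j; 1 .. jmax)
    {
        const(float)[] prev = x[j - 1][];   // read-only view of the input row
        // Map each index to its new value, then copy into row j in place.
        iota(imax)
            .map!(i => complicatedFunction(i, prev))
            .copy(x[j][]);
    }
    assert(x[jmax - 1][imax - 1] == 9.0f);
}
```

This version is still sequential; it only makes the per-row data flow explicit.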

This is the step prior to using parallel map, but cast in this way
highlights that in order to then be parallelized at all in any way,
complicatedFunction must have no couplings between x[j-1][m] and
x[j-1][n].

(I am guessing this is some form of cellular automaton or some Markov
process problem?)

 How should I go about this in D? I want to avoid copying data
 around if it's possible since these arrays are huge.

Indeed. With C, C++, D, (Go, Rust,…) you have to use references (aka
pointers) and hope you do not get any ownership problems. It might be
interesting to see how a language such as Haskell, which has copy
semantics but optimizes away as much as it can, would fare with this.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Sep 07 2015

"Dominikus Dittes Scherkl" writes:

On Tuesday, 8 September 2015 at 05:50:30 UTC, Russel Winder wrote:
     void main() {
       immutable imax = 10;
       immutable jmax = 10;
       float[imax][jmax] x;
       foreach(int j; 1..jmax){
         foreach(int i, ref item; parallel(x[j-1])){
           x[j][i] = complicatedFunction(i, item);
         }
       }
     }

 (though sadly, this doesn't compile for a reason I can't fathom 
 instantly)

Hmm. Shouldn't you instead parallel the outer loop?

Sep 08 2015

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Tue, 2015-09-08 at 07:33 +0000, Dominikus Dittes Scherkl via
Digitalmars-d-learn wrote:
 On Tuesday, 8 September 2015 at 05:50:30 UTC, Russel Winder wrote:
     void main() {
       immutable imax = 10;
       immutable jmax = 10;
       float[imax][jmax] x;
       foreach(int j; 1..jmax){
         foreach(int i, ref item; parallel(x[j-1])){
           x[j][i] = complicatedFunction(i, item);
         }
       }
     }

 (though sadly, this doesn't compile for a reason I can't fathom
 instantly)

 Hmm. Shouldn't you instead parallel the outer loop?

Can't do that because it is a pipeline: the current computation is
input to the next one. As far as I can tell there is no way the code as
presented can be parallelized in the general case. If there were some
guarantees on complicatedFunction, then it is a different game.

-- 
Russel.

Sep 09 2015
