digitalmars.D.learn - Not able to get scaled performance on increasing number of threads
- Sparsh Mittal (9/9) Feb 01 2013 I am parallelizing a program which follows this structure:
- Sparsh Mittal (23/23) Feb 01 2013 It got posted before I completed it! Sorry.
- Dmitry Olshansky (7/30) Feb 01 2013 Can't tell much without the whole source or at least compilable
- Sparsh Mittal (1/3) Feb 01 2013 Give me a moment. I will post.
- Sparsh Mittal (107/107) Feb 01 2013 Here is the code:
- Dmitry Olshansky (114/115) Feb 01 2013 Mine reiteration on it, with a bit of help from std.parallelism.
- FG (4/7) Feb 01 2013 Interestingly, threads+barrier here wasn't much slower than tasks:
- Dmitry Olshansky (9/17) Feb 02 2013 I'm think that taskPool should provide better operation for 'wait till
- Sparsh Mittal (2/2) Feb 01 2013 Excellent. Thank you so much for your suggestion and code. It now
- FG (7/12) Feb 01 2013 Probably because the SolverSlave doesn't have enough work to do to call ...
- Sparsh Mittal (1/1) Feb 01 2013 Thanks. Yes, you are right. I have increased the dimension.
I am parallelizing a program which follows this structure:

for iter = 1 to MAX_ITERATION {
    myLocalBarrier = new Barrier(numberOfThreads+1);
    for i = 1 to numberOfThreads {
        spawn(&myFunc, args)
    }
}
Feb 01 2013
It got posted before I completed it! Sorry. I am parallelizing a program which follows this structure:

immutable int numberOfThreads = 2;

for iter = 1 to MAX_ITERATION {
    myLocalBarrier = new Barrier(numberOfThreads+1);
    for i = 1 to numberOfThreads {
        spawn(&myFunc, args)
    }
    myLocalBarrier.wait();
}

void myFunc(args) {
    //do the task
    myLocalBarrier.wait()
}

When I run it and compare this parallel version with its serial version, I only get a speedup of less than 1.3 for 2 threads. When I write the same program in Go, the scaling is nearly 2. Also, in D, when running "top", I see the usage as only 130% CPU, not nearly 200% or even 180%. So I was wondering if I am doing it properly. Please help me.
Feb 01 2013
01-Feb-2013 19:42, Sparsh Mittal writes:
> It got posted before I completed it! Sorry. I am parallelizing a
> program which follows this structure:
> [...]
> When I run it, and compare this parallel version with its serial
> version, I only get speedup of nearly <1.3 for 2 threads. When I
> write same program in Go, scaling is nearly 2.

Can't tell much without the whole source, or at least a compilable standalone piece. The "//do the task" part is critical to understanding, as is the declaration of myLocalBarrier.

Also, why not use std.parallelism?

-- 
Dmitry Olshansky
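(For readers following along: Dmitry's std.parallelism suggestion can be illustrated with a minimal sketch. The array and the per-element work here are placeholders, not the thread's actual program.)

```d
import std.parallelism; // taskPool, parallel

void main()
{
    auto cells = new double[1_000_000];
    // Each worker thread in the shared taskPool processes a chunk of
    // indices; no new threads are spawned per sweep.
    foreach (i, ref c; taskPool.parallel(cells))
        c = i * 0.25; // placeholder work
}
```

The key difference from spawn-per-iteration is that the pool's threads are created once and reused, so the per-sweep cost is only handing out chunks of the index range.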
Feb 01 2013
> Can't tell much without the whole source or at least compilable
> standalone piece.

Give me a moment. I will post.
Feb 01 2013
Here is the code:

import std.stdio;
import std.concurrency;
import core.thread;
import std.datetime;
import std.conv;
import core.sync.barrier;

immutable int gridSize = 256;
immutable int MAXSTEPS = 50000;      /* Maximum number of iterations */
immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
immutable double omega = 0.376;
immutable double one_minus_omega = 1.0 - 0.376;
immutable int numberOfThreads = 2;

double MAX_FUNC(double a, double b) { return a > b ? a : b; }
double ABS_VAL(double a) { return a > 0 ? a : -a; }

__gshared Barrier myLocalBarrier = null;
shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string[] args)
{
    for (int i = 0; i < gridSize+2; i++)
    {
        for (int j = 0; j < gridSize+2; j++)
        {
            if (i == 0)
                gridInfo[i][j] = 1.0;
            else
                gridInfo[i][j] = 0.0;
        }
    }

    bool shouldCheck = false;
    bool isConverged = false;

    for (int iter = 1; iter <= MAXSTEPS; iter++)
    {
        shouldCheck = false;
        if (iter % 400 == 0)
        {
            shouldCheck = true;
            maxError = 0.0;
        }

        //This is Phase 1
        {
            myLocalBarrier = new Barrier(numberOfThreads+1);
            for (int cc = 0; cc < numberOfThreads; cc++)
            {
                spawn(&SolverSlave, thisTid, cc, 0, shouldCheck);
            }
            myLocalBarrier.wait();
        }

        //This is Phase 2
        {
            myLocalBarrier = new Barrier(numberOfThreads+1);
            for (int cc = 0; cc < numberOfThreads; cc++)
            {
                spawn(&SolverSlave, thisTid, cc, 1, shouldCheck);
            }
            myLocalBarrier.wait();
        }

        if (maxError < TOL_VAL)
        {
            isConverged = true;
            break;
        }
    }

    if (isConverged)
        writeln("It converged");
    else
        writeln("It did not converge");
}

void SolverSlave(Tid owner, int myNumber, int remainder, bool shouldCheckHere)
{
    double sum = 0;
    //Divide task among threads
    int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
    int iEnd = ((myNumber+1)*gridSize)/numberOfThreads;

    for (int i = iStart; i <= iEnd; i++)
    {
        for (int j = 1; j < gridSize+1; j++)
        {
            if ((i+j)%2 == remainder) //Phase 1 or 2
            {
                sum = ( gridInfo[i][j+1] + gridInfo[i+1][j]
                      + gridInfo[i-1][j] + gridInfo[i][j-1] ) * 0.25;
                //Should not check every time, to reduce synchronization overhead
                if (shouldCheckHere)
                {
                    maxError = MAX_FUNC(ABS_VAL(omega*(sum - gridInfo[i][j])), maxError);
                }
                gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
            }
        }
    }
    myLocalBarrier.wait();
}
Feb 01 2013
01-Feb-2013 20:08, Sparsh Mittal writes:
> Here is the code:

My reiteration of it, with a bit of help from std.parallelism. std.parallelism uses a thread pool, so it's somewhat faster than creating threads anew. Still, it's nearly instantaneous for me, in the range of 30-40 ms, even with a grid size of 1024 and 5M iterations. Have you enabled all of the optimizations? The correct switches are:

dmd -inline -O -release optimize_me.d

or, to run after compiling:

rdmd -inline -O -release optimize_me.d

import std.stdio;
import std.parallelism;
import std.datetime;
import std.conv;

immutable int gridSize = 1024;
immutable int MAXSTEPS = 5000_000;   /* Maximum number of iterations */
immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
immutable double omega = 0.376;
immutable double one_minus_omega = 1.0 - 0.376;
immutable int numberOfThreads = 2;

double MAX_FUNC(double a, double b) { return a > b ? a : b; }
double ABS_VAL(double a) { return a > 0 ? a : -a; }

shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string[] args)
{
    for (int i = 0; i < gridSize+2; i++)
    {
        for (int j = 0; j < gridSize+2; j++)
        {
            if (i == 0)
                gridInfo[i][j] = 1.0;
            else
                gridInfo[i][j] = 0.0;
        }
    }

    bool shouldCheck = false;
    bool isConverged = false;

    for (int iter = 1; iter <= MAXSTEPS; iter++)
    {
        shouldCheck = false;
        if (iter % 400 == 0)
        {
            shouldCheck = true;
            maxError = 0.0;
        }

        alias MyTask = typeof(task!(SolverSlave)(0, 0, false));

        //This is Phase 1
        {
            MyTask[numberOfThreads] tasks;
            foreach (cc; 0..numberOfThreads)
            {
                tasks[cc] = task!(SolverSlave)(cc, 0, shouldCheck);
                taskPool.put(tasks[cc]);
            }
            foreach (cc; 0..numberOfThreads)
                tasks[cc].yieldForce();
        }

        //This is Phase 2
        {
            MyTask[numberOfThreads] tasks;
            foreach (cc; 0..numberOfThreads)
            {
                tasks[cc] = task!(SolverSlave)(cc, 1, shouldCheck);
                taskPool.put(tasks[cc]);
            }
            foreach (cc; 0..numberOfThreads)
                tasks[cc].yieldForce();
        }

        if (maxError < TOL_VAL)
        {
            isConverged = true;
            break;
        }
    }

    /*if(isConverged)
        writeln("It converged");
    else
        writeln("It did not converge");*/
}

void SolverSlave(int myNumber, int remainder, bool shouldCheckHere)
{
    double sum = 0;
    //Divide task among threads
    int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
    int iEnd = ((myNumber+1)*gridSize)/numberOfThreads;

    for (int i = iStart; i <= iEnd; i++)
    {
        for (int j = 1; j < gridSize+1; j++)
        {
            if ((i+j)%2 == remainder) //Phase 1 or 2
            {
                sum = ( gridInfo[i][j+1] + gridInfo[i+1][j]
                      + gridInfo[i-1][j] + gridInfo[i][j-1] ) * 0.25;
                //Should not check every time, to reduce synchronization overhead
                if (shouldCheckHere)
                {
                    maxError = MAX_FUNC(ABS_VAL(omega*(sum - gridInfo[i][j])), maxError);
                }
                gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
            }
        }
    }
}

-- 
Dmitry Olshansky
Feb 01 2013
On 2013-02-01 20:33, Dmitry Olshansky wrote:
> My reiteration of it, with a bit of help from std.parallelism.
> std.parallelism uses a thread pool, so it's somewhat faster than
> creating threads anew.

Interestingly, threads+barrier here wasn't much slower than tasks: 14% slower for dmd32, only 5% for gdc64 (and taskPool in dmd 13% slower than in gdc).
Feb 01 2013
02-Feb-2013 00:39, FG writes:
> Interestingly, threads+barrier here wasn't much slower than tasks: 14%
> slower for dmd32, only 5% for gdc64 (and taskPool in dmd 13% slower
> than in gdc).

I think taskPool should provide a better operation for "wait till all tasks are finished", as yielding on each task (i.e. future) looks less efficient. That being said, I might be missing something in the std.parallelism API; I haven't used it all that much before.

But I love the foreach(v; parallel(range)) stuff :)

-- 
Dmitry Olshansky
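(The "wait till all tasks are finished" idiom being discussed can be wrapped as a small helper over the existing yieldForce primitive. The helper name waitAll is made up for this sketch and is not part of std.parallelism.)

```d
import std.parallelism;

// Hypothetical wait-all helper: block until every task in the slice
// has finished, lending this thread to the pool while waiting.
void waitAll(T)(T[] tasks)
{
    foreach (ref t; tasks)
        t.yieldForce();
}
```

Because yieldForce lets the waiting thread execute queued tasks instead of just blocking, looping over the tasks is not as wasteful as it may look, but a dedicated pool-wide wait could still avoid touching each future in turn.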
Feb 02 2013
Excellent. Thank you so much for your suggestion and code. It now produces a near-linear speedup.
Feb 01 2013
On 2013-02-01 16:42, Sparsh Mittal wrote:
> When I run it, and compare this parallel version with its serial
> version, I only get speedup of nearly <1.3 for 2 threads. When I
> write same program in Go, scaling is nearly 2. Also, in D, on doing
> "top", I see the usage as only 130% CPU and not nearly 200% or 180%.

Probably because SolverSlave doesn't have enough work to do to justify dividing it into threads and barriers, with their overhead. Like Dmitry wrote, std.parallelism may be a better tool for the job. I've tested your code on 2 cores (but put the whole main() in a loop). It was taking about 82% of both cores. After increasing gridSize to 1024, it was using 88% of the CPU.
Feb 01 2013
Thanks. Yes, you are right. I have increased the dimension.
Feb 01 2013