www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Not able to get scaled performance on increasing number of threads

reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
I am parallelizing a program which follows this structure:

for iter = 1 to MAX_ITERATION
{
      myLocalBarrier = new Barrier(numberOfThreads+1);
      for i= 1 to numberOfThreads
       {
spawn(&myFunc, args)
       }

}
Feb 01 2013
parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
It got posted before I completed it! Sorry.


I am parallelizing a program which follows this structure:

immutable int numberOfThreads= 2

for iter = 1 to MAX_ITERATION
{
      myLocalBarrier = new Barrier(numberOfThreads+1);
      for i= 1 to numberOfThreads
       {
         spawn(&myFunc, args)
       }
       myLocalBarrier.wait();

}

void myFunc(args)
{
      //do the task

        myLocalBarrier.wait()
}

When I run it, and compare this parallel version with its serial 
version, I only get speedup of nearly <1.3 for 2 threads. When I 
write same program in Go, scaling is nearly 2.

Also, in D, on doing "top", I see the usage as only 130% CPU and 
not nearly 200% or 180%. So I was wondering, if I am doing it 
properly. Please help me.
Feb 01 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Feb-2013 19:42, Sparsh Mittal пишет:
 It got posted before I completed it! Sorry.


 I am parallelizing a program which follows this structure:

 immutable int numberOfThreads= 2

 for iter = 1 to MAX_ITERATION
 {
       myLocalBarrier = new Barrier(numberOfThreads+1);
       for i= 1 to numberOfThreads
        {
          spawn(&myFunc, args)
        }
        myLocalBarrier.wait();

 }

 void myFunc(args)
 {
       //do the task

         myLocalBarrier.wait()
 }

 When I run it, and compare this parallel version with its serial
 version, I only get speedup of nearly <1.3 for 2 threads. When I write
 same program in Go, scaling is nearly 2.

 Also, in D, on doing "top", I see the usage as only 130% CPU and not
 nearly 200% or 180%. So I was wondering, if I am doing it properly.
 Please help me.
Can't tell much without the whole source or at least compilable standalone piece. The '//do task part' is critical to understanding as well as declaration of myLocalBarrier. Also why not use std.parallelism? -- Dmitry Olshansky
Feb 01 2013
parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
 Can't tell much without the whole source or at least compilable 
 standalone piece.
Give me a moment. I will post.
Feb 01 2013
parent reply "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Here is the code:


#!/usr/bin/env rdmd
import std.stdio;
import std.concurrency;
import core.thread;
import std.datetime;
import std.conv;
import core.sync.barrier;



immutable int gridSize = 256;
immutable int MAXSTEPS = 50000;       /* Maximum number of 
iterations               */
immutable double TOL_VAL =0.00001;         /* Numerical Tolerance 
*/
immutable double omega =  0.376;
immutable double one_minus_omega = 1.0 - 0.376;


immutable int numberOfThreads = 2;


double MAX_FUNC(double a, double b)
{
   return a> b? a: b;
}

double ABS_VAL(double a)
{
   return a> 0? a: -a;
}

__gshared Barrier myLocalBarrier = null;
shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string args[])
{

   for(int i=0; i<gridSize+2; i++)
   {
     for(int j=0; j<gridSize+2; j++)
     {
       if(i==0)
         gridInfo[i][j] = 1.0;
       else
         gridInfo[i][j] = 0.0;
     }
   }

   bool shouldCheck = false;
   bool isConverged = false;
   for(int iter = 1; iter <= MAXSTEPS; iter++)
   {
     shouldCheck = false;
     if(iter % 400 ==0)
     {
       shouldCheck = true;
       maxError = 0.0;
     }

       //This is Phase 1
     {
       myLocalBarrier = new Barrier(numberOfThreads+1);
       for (int cc=0; cc<numberOfThreads; cc++)
       {
         spawn(&SolverSlave, thisTid,cc, 0 ,shouldCheck);
       }


       myLocalBarrier.wait();
     }

      //This is Phase 2
     {
       myLocalBarrier = new Barrier(numberOfThreads+1);
       for (int cc=0; cc<numberOfThreads; cc++)
       {
        spawn(&SolverSlave, thisTid,cc, 1 ,shouldCheck );
       }

       myLocalBarrier.wait();
     }

     if( maxError <  TOL_VAL)
       {
         isConverged = true;
         break;
       }

   }
   if(isConverged)
     writeln("It converged");
   else
     writeln("It did not converge");
}



void SolverSlave(Tid owner, int myNumber, int remainder, bool 
shouldCheckHere)
{

   double sum =0;

   //Divide task among threads
   int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
   int iEnd =  (((myNumber+1)*gridSize)/numberOfThreads) ;



   for(int i=iStart; i<= iEnd; i++)
   {
     for(int j=1; j< gridSize+1; j++)
     {
       if( ((i+j)%2 ==remainder)) //Phase 1 or 2
       {
         sum = ( gridInfo[i  ][j+1] + gridInfo[i+1][j  ] +
             gridInfo[i-1][j  ] + gridInfo[i  ][j-1] )*0.25;

         //Should not check everytime to reduce synchronization 
overhead
         if(shouldCheckHere)
         {
           maxError = MAX_FUNC(ABS_VAL(omega 
*(sum-gridInfo[i][j])), maxError);
         }
         gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + 
omega*sum;
       }

     }
   }

   myLocalBarrier.wait();
}
Feb 01 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Feb-2013 20:08, Sparsh Mittal пишет:
 Here is the code:
Mine reiteration on it, with a bit of help from std.parallelism. std.parallelism uses thread pool thus it's somewhat faster then creating threads anew. Still it's instantaneous for me in a range of 30-40ms even with grid size of 1024 and 5M of iterations. Have you enabled all of the optimizations? Correct switches are: dmd -inline -O -release optimize_me.d or rdmd -inline -O -release optimize_me.d to run after compile import std.stdio; import std.parallelism; import std.datetime; import std.conv; immutable int gridSize = 1024; immutable int MAXSTEPS = 5000_000; /* Maximum number of iterations */ immutable double TOL_VAL =0.00001; /* Numerical Tolerance */ immutable double omega = 0.376; immutable double one_minus_omega = 1.0 - 0.376; immutable int numberOfThreads = 2; double MAX_FUNC(double a, double b) { return a> b? a: b; } double ABS_VAL(double a) { return a> 0? a: -a; } shared double[gridSize+2][gridSize+2] gridInfo; shared double maxError = 0.0; void main(string args[]) { for(int i=0; i<gridSize+2; i++) { for(int j=0; j<gridSize+2; j++) { if(i==0) gridInfo[i][j] = 1.0; else gridInfo[i][j] = 0.0; } } bool shouldCheck = false; bool isConverged = false; for(int iter = 1; iter <= MAXSTEPS; iter++) { shouldCheck = false; if(iter % 400 ==0) { shouldCheck = true; maxError = 0.0; } alias MyTask = typeof(task!(SolverSlave)(0, 0, false)); //This is Phase 1 { MyTask[numberOfThreads] tasks; foreach(cc; 0..numberOfThreads) { tasks[cc] = task!(SolverSlave)(cc, 0, shouldCheck); taskPool.put(tasks[cc]); } foreach(cc; 0..numberOfThreads) tasks[cc].yieldForce(); } //This is Phase 2 { MyTask[numberOfThreads] tasks; foreach(cc; 0..numberOfThreads) { tasks[cc] = task!(SolverSlave)(cc, 1, shouldCheck); taskPool.put(tasks[cc]); } foreach(cc; 0..numberOfThreads) tasks[cc].yieldForce(); } if( maxError < TOL_VAL) { isConverged = true; break; } } /*if(isConverged) writeln("It converged"); else writeln("It did not converge");*/ } void SolverSlave(int myNumber, int remainder, bool shouldCheckHere) { double sum =0; //Divide task among threads int iStart = ((myNumber*gridSize)/numberOfThreads) + 1; int iEnd = (((myNumber+1)*gridSize)/numberOfThreads) ; for(int i=iStart; i<= iEnd; i++) { for(int j=1; j< gridSize+1; j++) { if( ((i+j)%2 ==remainder)) //Phase 1 or 2 { sum = ( gridInfo[i ][j+1] + gridInfo[i+1][j ] + gridInfo[i-1][j ] + gridInfo[i ][j-1] )*0.25; //Should not check everytime to reduce synchronization overhead if(shouldCheckHere) { maxError = MAX_FUNC(ABS_VAL(omega *(sum-gridInfo[i][j])), maxError); } gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum; } } } } -- Dmitry Olshansky
Feb 01 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-01 20:33, Dmitry Olshansky wrote:
 Mine reiteration on it, with a bit of help from std.parallelism.
 std.parallelism uses thread pool thus it's somewhat faster then creating
threads
 anew.
Interestingly, threads+barrier here wasn't much slower than tasks: 14% slower for dmd32, only 5% for gdc64 (and taskpool in dmd 13% slower than in gdc).
Feb 01 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Feb-2013 00:39, FG пишет:
 On 2013-02-01 20:33, Dmitry Olshansky wrote:
 Mine reiteration on it, with a bit of help from std.parallelism.
 std.parallelism uses thread pool thus it's somewhat faster then
 creating threads
 anew.
Interestingly, threads+barrier here wasn't much slower than tasks: 14% slower for dmd32, only 5% for gdc64 (and taskpool in dmd 13% slower than in gdc).
I'm think that taskPool should provide better operation for 'wait till all tasks are finished' as yielding on each task (i.e. future) looks less efficient. That being said I might be missing something in std.parallelism API, I haven't used it all that much before. But I love foreach(v; parallel(range) stuff :) -- Dmitry Olshansky
Feb 02 2013
prev sibling parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Excellent. Thank you so much for your suggestion and code. It now 
produces near linear speedup.
Feb 01 2013
prev sibling parent reply FG <home fgda.pl> writes:
On 2013-02-01 16:42, Sparsh Mittal wrote:
 When I run it, and compare this parallel version with its serial version, I
only
 get speedup of nearly <1.3 for 2 threads. When I write same program in Go,
 scaling is nearly 2.

 Also, in D, on doing "top", I see the usage as only 130% CPU and not nearly
200%
 or 180%. So I was wondering, if I am doing it properly. Please help me.
Probably because the SolverSlave doesn't have enough work to do to call for dividing it into threads and barriers with their overhead. Like Dmitry wrote, std.parallelism may be a better tool for the job. I've tested your code on 2 cores (but have put whole main() in a loop). It's taking about 82% of both cores. After increasing gridSize to 1024 it was using 88% of the CPU.
Feb 01 2013
parent "Sparsh Mittal" <sparsh0mittal gmail.com> writes:
Thanks. Yes, you are right. I have increased the dimension.
Feb 01 2013