digitalmars.D.learn - Not able to get scaled performance on increasing number of threads
- Sparsh Mittal (9/9) Feb 01 2013 I am parallelizing a program which follows this structure:
- Sparsh Mittal (23/23) Feb 01 2013 It got posted before I completed it! Sorry.
- Dmitry Olshansky (7/30) Feb 01 2013 Can't tell much without the whole source or at least compilable
- Sparsh Mittal (1/3) Feb 01 2013 Give me a moment. I will post.
- Sparsh Mittal (107/107) Feb 01 2013 Here is the code:
- Dmitry Olshansky (114/115) Feb 01 2013 Mine reiteration on it, with a bit of help from std.parallelism.
- FG (4/7) Feb 01 2013 Interestingly, threads+barrier here wasn't much slower than tasks:
- Dmitry Olshansky (9/17) Feb 02 2013 I'm think that taskPool should provide better operation for 'wait till
- Sparsh Mittal (2/2) Feb 01 2013 Excellent. Thank you so much for your suggestion and code. It now
- FG (7/12) Feb 01 2013 Probably because the SolverSlave doesn't have enough work to do to call ...
- Sparsh Mittal (1/1) Feb 01 2013 Thanks. Yes, you are right. I have increased the dimension.
I am parallelizing a program which follows this structure:

for iter = 1 to MAX_ITERATION {
    myLocalBarrier = new Barrier(numberOfThreads+1);
    for i = 1 to numberOfThreads {
        spawn(&myFunc, args)
    }
}
Feb 01 2013
It got posted before I completed it! Sorry. I am parallelizing a program which follows this structure:

immutable int numberOfThreads = 2;

for iter = 1 to MAX_ITERATION {
    myLocalBarrier = new Barrier(numberOfThreads+1);
    for i = 1 to numberOfThreads {
        spawn(&myFunc, args)
    }
    myLocalBarrier.wait();
}

void myFunc(args) {
    //do the task
    myLocalBarrier.wait()
}

When I run it and compare this parallel version with its serial version, I only get a speedup of less than 1.3 for 2 threads. When I write the same program in Go, the scaling is nearly 2. Also, in D, when running "top", I see the usage as only 130% CPU, not nearly 200% or even 180%. So I was wondering if I am doing it properly. Please help me.
Feb 01 2013
01-Feb-2013 19:42, Sparsh Mittal writes:
> It got posted before I completed it! Sorry. I am parallelizing a
> program which follows this structure:
> [...]
> When I run it, and compare this parallel version with its serial
> version, I only get speedup of nearly <1.3 for 2 threads. When I
> write same program in Go, scaling is nearly 2.

Can't tell much without the whole source, or at least a compilable standalone piece. The "//do the task" part is critical to understanding, as is the declaration of myLocalBarrier.

Also, why not use std.parallelism?

-- 
Dmitry Olshansky
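(For readers following along: Dmitry's std.parallelism suggestion can be illustrated with a minimal sketch. The array and the per-element work here are placeholders, not the thread's actual program.)

```d
import std.parallelism; // taskPool, parallel

void main()
{
    auto cells = new double[1_000_000];
    // Each worker thread in the shared taskPool processes a chunk of
    // indices; no new threads are spawned per sweep.
    foreach (i, ref c; taskPool.parallel(cells))
        c = i * 0.25; // placeholder work
}
```

The key difference from spawn-per-iteration is that the pool's threads are created once and reused, so the per-sweep cost is only handing out chunks of the index range.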
Feb 01 2013
> Can't tell much without the whole source or at least compilable
> standalone piece.

Give me a moment. I will post.
Feb 01 2013
Here is the code:

import std.stdio;
import std.concurrency;
import core.thread;
import std.datetime;
import std.conv;
import core.sync.barrier;

immutable int gridSize = 256;
immutable int MAXSTEPS = 50000;      /* Maximum number of iterations */
immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
immutable double omega = 0.376;
immutable double one_minus_omega = 1.0 - 0.376;
immutable int numberOfThreads = 2;

double MAX_FUNC(double a, double b) { return a > b ? a : b; }
double ABS_VAL(double a) { return a > 0 ? a : -a; }

__gshared Barrier myLocalBarrier = null;
shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string[] args)
{
    for (int i = 0; i < gridSize+2; i++)
    {
        for (int j = 0; j < gridSize+2; j++)
        {
            if (i == 0)
                gridInfo[i][j] = 1.0;
            else
                gridInfo[i][j] = 0.0;
        }
    }

    bool shouldCheck = false;
    bool isConverged = false;

    for (int iter = 1; iter <= MAXSTEPS; iter++)
    {
        shouldCheck = false;
        if (iter % 400 == 0)
        {
            shouldCheck = true;
            maxError = 0.0;
        }

        //This is Phase 1
        {
            myLocalBarrier = new Barrier(numberOfThreads+1);
            for (int cc = 0; cc < numberOfThreads; cc++)
            {
                spawn(&SolverSlave, thisTid, cc, 0, shouldCheck);
            }
            myLocalBarrier.wait();
        }

        //This is Phase 2
        {
            myLocalBarrier = new Barrier(numberOfThreads+1);
            for (int cc = 0; cc < numberOfThreads; cc++)
            {
                spawn(&SolverSlave, thisTid, cc, 1, shouldCheck);
            }
            myLocalBarrier.wait();
        }

        if (maxError < TOL_VAL)
        {
            isConverged = true;
            break;
        }
    }

    if (isConverged)
        writeln("It converged");
    else
        writeln("It did not converge");
}

void SolverSlave(Tid owner, int myNumber, int remainder, bool shouldCheckHere)
{
    double sum = 0;
    //Divide task among threads
    int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
    int iEnd = ((myNumber+1)*gridSize)/numberOfThreads;

    for (int i = iStart; i <= iEnd; i++)
    {
        for (int j = 1; j < gridSize+1; j++)
        {
            if ((i+j)%2 == remainder) //Phase 1 or 2
            {
                sum = ( gridInfo[i][j+1] + gridInfo[i+1][j]
                      + gridInfo[i-1][j] + gridInfo[i][j-1] ) * 0.25;
                //Should not check every time, to reduce synchronization overhead
                if (shouldCheckHere)
                {
                    maxError = MAX_FUNC(ABS_VAL(omega*(sum - gridInfo[i][j])), maxError);
                }
                gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
            }
        }
    }
    myLocalBarrier.wait();
}
Feb 01 2013
01-Feb-2013 20:08, Sparsh Mittal writes:
> Here is the code:

My reiteration of it, with a bit of help from std.parallelism. std.parallelism uses a thread pool, so it's somewhat faster than creating threads anew. Still, it's nearly instantaneous for me, in the range of 30-40 ms, even with a grid size of 1024 and 5M iterations. Have you enabled all of the optimizations? The correct switches are:

dmd -inline -O -release optimize_me.d

or, to run after compiling:

rdmd -inline -O -release optimize_me.d

import std.stdio;
import std.parallelism;
import std.datetime;
import std.conv;

immutable int gridSize = 1024;
immutable int MAXSTEPS = 5000_000;   /* Maximum number of iterations */
immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
immutable double omega = 0.376;
immutable double one_minus_omega = 1.0 - 0.376;
immutable int numberOfThreads = 2;

double MAX_FUNC(double a, double b) { return a > b ? a : b; }
double ABS_VAL(double a) { return a > 0 ? a : -a; }

shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string[] args)
{
    for (int i = 0; i < gridSize+2; i++)
    {
        for (int j = 0; j < gridSize+2; j++)
        {
            if (i == 0)
                gridInfo[i][j] = 1.0;
            else
                gridInfo[i][j] = 0.0;
        }
    }

    bool shouldCheck = false;
    bool isConverged = false;

    for (int iter = 1; iter <= MAXSTEPS; iter++)
    {
        shouldCheck = false;
        if (iter % 400 == 0)
        {
            shouldCheck = true;
            maxError = 0.0;
        }

        alias MyTask = typeof(task!(SolverSlave)(0, 0, false));

        //This is Phase 1
        {
            MyTask[numberOfThreads] tasks;
            foreach (cc; 0..numberOfThreads)
            {
                tasks[cc] = task!(SolverSlave)(cc, 0, shouldCheck);
                taskPool.put(tasks[cc]);
            }
            foreach (cc; 0..numberOfThreads)
                tasks[cc].yieldForce();
        }

        //This is Phase 2
        {
            MyTask[numberOfThreads] tasks;
            foreach (cc; 0..numberOfThreads)
            {
                tasks[cc] = task!(SolverSlave)(cc, 1, shouldCheck);
                taskPool.put(tasks[cc]);
            }
            foreach (cc; 0..numberOfThreads)
                tasks[cc].yieldForce();
        }

        if (maxError < TOL_VAL)
        {
            isConverged = true;
            break;
        }
    }

    /*if(isConverged)
        writeln("It converged");
    else
        writeln("It did not converge");*/
}

void SolverSlave(int myNumber, int remainder, bool shouldCheckHere)
{
    double sum = 0;
    //Divide task among threads
    int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
    int iEnd = ((myNumber+1)*gridSize)/numberOfThreads;

    for (int i = iStart; i <= iEnd; i++)
    {
        for (int j = 1; j < gridSize+1; j++)
        {
            if ((i+j)%2 == remainder) //Phase 1 or 2
            {
                sum = ( gridInfo[i][j+1] + gridInfo[i+1][j]
                      + gridInfo[i-1][j] + gridInfo[i][j-1] ) * 0.25;
                //Should not check every time, to reduce synchronization overhead
                if (shouldCheckHere)
                {
                    maxError = MAX_FUNC(ABS_VAL(omega*(sum - gridInfo[i][j])), maxError);
                }
                gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
            }
        }
    }
}

-- 
Dmitry Olshansky
Feb 01 2013
On 2013-02-01 20:33, Dmitry Olshansky wrote:
> My reiteration of it, with a bit of help from std.parallelism.
> std.parallelism uses a thread pool, so it's somewhat faster than
> creating threads anew.

Interestingly, threads+barrier here wasn't much slower than tasks: 14% slower for dmd32, only 5% for gdc64 (and taskPool in dmd 13% slower than in gdc).
Feb 01 2013
02-Feb-2013 00:39, FG writes:
> Interestingly, threads+barrier here wasn't much slower than tasks: 14%
> slower for dmd32, only 5% for gdc64 (and taskPool in dmd 13% slower
> than in gdc).

I think taskPool should provide a better operation for "wait till all tasks are finished", as yielding on each task (i.e. future) looks less efficient. That being said, I might be missing something in the std.parallelism API; I haven't used it all that much before.

But I love the foreach(v; parallel(range)) stuff :)

-- 
Dmitry Olshansky
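(The "wait till all tasks are finished" idiom being discussed can be wrapped as a small helper over the existing yieldForce primitive. The helper name waitAll is made up for this sketch and is not part of std.parallelism.)

```d
import std.parallelism;

// Hypothetical wait-all helper: block until every task in the slice
// has finished, lending this thread to the pool while waiting.
void waitAll(T)(T[] tasks)
{
    foreach (ref t; tasks)
        t.yieldForce();
}
```

Because yieldForce lets the waiting thread execute queued tasks instead of just blocking, looping over the tasks is not as wasteful as it may look, but a dedicated pool-wide wait could still avoid touching each future in turn.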
Feb 02 2013
Excellent. Thank you so much for your suggestion and code. It now produces a near-linear speedup.
Feb 01 2013
On 2013-02-01 16:42, Sparsh Mittal wrote:
> When I run it, and compare this parallel version with its serial
> version, I only get speedup of nearly <1.3 for 2 threads. When I
> write same program in Go, scaling is nearly 2. Also, in D, on doing
> "top", I see the usage as only 130% CPU and not nearly 200% or 180%.

Probably because SolverSlave doesn't have enough work to do to justify dividing it into threads and barriers, with their overhead. Like Dmitry wrote, std.parallelism may be a better tool for the job. I've tested your code on 2 cores (but put the whole main() in a loop). It was taking about 82% of both cores. After increasing gridSize to 1024, it was using 88% of the CPU.
Feb 01 2013
Thanks. Yes, you are right. I have increased the dimension.
Feb 01 2013