
digitalmars.D.learn - multithread/concurrency/parallel methods and performance

SrMordred <patric.dexheimer gmail.com> writes:
I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this baseline:

foreach(i ; 0 .. SIZE)
{
     results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn, and Thread.

these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )
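
In case the link rots, the parallel version is roughly this (a simplified sketch):

import std.parallelism : parallel;
import std.range : iota;

// std.parallelism splits the index range across a cached pool of
// worker threads (the default pool has totalCPUs - 1 workers)
foreach(i; parallel(iota(SIZE)))
{
     results[i] = values1[i] * values2[i];
}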

All methods show only minor speedups. I was expecting a lot more:
since I have 7 cores, I expected something below 100 ms.

I'm not seeing false sharing in this case. Or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!
Feb 18 2018
Jordan Wilson <wilsonjord gmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.

 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
It may be due to thread local storage:

https://tour.dlang.org/tour/en/multithreading/thread-local-storage
https://dlang.org/articles/migrate-to-shared.html

I'm not sure though, as I don't know how your results array is initialised.

Jordan
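
PS: a minimal illustration of the thread-local default (a sketch, assuming the arrays are plain double[]; the names are illustrative):

import std.concurrency : spawn;

// module-level variables are thread-local by default in D:
// a spawned thread sees its own (empty) copy of this array
double[] results;

// 'shared' gives one instance visible to all threads
shared double[] sharedResults;

// or hand the data to the thread explicitly; spawn only accepts
// shared or immutable arguments
void worker(shared(double)[] res, immutable(double)[] a, immutable(double)[] b)
{
    foreach(i; 0 .. res.length)
        res[i] = a[i] * b[i];
}

// spawn(&worker, sharedResults, aImm, bImm);  // aImm/bImm: immutable inputs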
Feb 18 2018
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.

 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache for 32-bit values) it may be that the cost of setting up the concurrency adds a significant amount of overhead. Also, the run.dlang.io link has no -O flag and thus no optimisations.

Without -O I get:

_base : 323 ms, 92 μs, and 6 hnsecs
_parallel : 276 ms, 649 μs, and 3 hnsecs
_concurrency : 221 ms, 931 μs, and 7 hnsecs
_thread : 212 ms, 277 μs, and 3 hnsecs

With it I get:

_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs

With SIZE = 16*1024*1024 without -O I get:

_base : 5 secs, 835 ms, 240 μs, and 9 hnsecs
_parallel : 4 secs, 802 ms, 279 μs, and 8 hnsecs
_concurrency : 2 secs, 133 ms, 685 μs, and 3 hnsecs
_thread : 2 secs, 108 ms, 860 μs, and 9 hnsecs

With SIZE = 16*1024*1024 with -O I get:

_base : 2 secs, 502 ms, 523 μs, and 4 hnsecs
_parallel : 1 sec, 769 ms, 945 μs, and 3 hnsecs
_concurrency : 1 sec, 362 ms, 747 μs, and 1 hnsec
_thread : 1 sec, 335 ms, 720 μs, and 1 hnsec
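
(For local runs, the switches to compare would be something like the following; app.d stands in for the actual file name:)

# DMD: optimise, inline, and drop bounds checks / contracts
dmd -O -inline -release app.d
# LDC tends to auto-vectorise this kind of loop much better
ldc2 -O3 -release app.d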
Feb 18 2018
SrMordred <patric.dexheimer gmail.com> writes:
On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson 
wrote:
 As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache 
 for 32-bit values) it may be that the cost of setting up the 
 concurrency adds a significant amount of overhead.
That 'concurrency overhead' is what I'm not getting. Since the array is big, dividing it across 6 or 7 cores will not thrash L1, since the blocks are very far from each other, right? Or is L2 cache thrashing also a problem in this case?
 _base : 150 ms, 728 μs, and 5 hnsecs
 _parallel : 120 ms, 78 μs, and 5 hnsecs
 _concurrency : 134 ms, 787 μs, and 4 hnsecs
 _thread : 129 ms, 476 μs, and 2 hnsecs
Yes, on my PC I was using -release. Still: 150 ms on 1 core versus 120-134 ms on X cores. Shouldn't that be way faster?

I'm trying to understand where the overhead is, and whether it is possible to get rid of it (perfect thread scaling).
Feb 19 2018
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.
The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:

results[] = values1[] * values2[];

Parallelism gets more interesting with more expensive operations. You may also try bigger sizes, or both.
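
A quick way to compare the two (a sketch; assumes SIZE and the arrays from the original snippet are in scope, and uses benchmark from std.datetime.stopwatch):

import std.datetime.stopwatch : benchmark;

void scalar()
{
    foreach(i; 0 .. SIZE)
        results[i] = values1[i] * values2[i];
}

void vectorised()
{
    results[] = values1[] * values2[];  // array op; the compiler may emit SIMD
}

// run each version 100 times and compare the total durations
auto times = benchmark!(scalar, vectorised)(100);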
 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
Feb 18 2018
SrMordred <patric.dexheimer gmail.com> writes:
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
wrote:
 The operation is trivial and the dataset is rather small. In 
 such cases SIMD with e.g. array ops is the way to go:

 results[] = values1[] * values2[];
Yes, absolutely right :)

I made this simple example to understand why the threads are not scaling the way I thought they would.

I imagined that if one core finishes the work in 200 ms, then 4 cores would finish it in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronisation and no false sharing, etc. (at least I think I don't have that problem here).
Feb 19 2018
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
 On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
 wrote:
 The operation is trivial and the dataset is rather small. In 
 such cases SIMD with e.g. array ops is the way to go:

 results[] = values1[] * values2[];
 Yes, absolutely right :) I made this simple example to 
 understand why the threads are not scaling the way I thought 
 they would.
Yeah, the world is an ugly place where trivial math sometimes doesn't work. I suggest you:

- run with different numbers of threads, from 1 to n
- vary sizes from 100k to 10m

It's 6/7 ~ 0.86 ms, which is a good deal smaller than a CPU timeslice. In essence a single core runs fast b/c it doesn't wait for all the others to complete via join, easily burning its quota in one go. In MT I bet some of the overhead comes from not all threads finishing (and starting) at once, so the joins block in the kernel.

You could run your MT code with strace to see if it hits the futex call or some such; if it does, that's where you are getting delays. (That's assuming you are on Linux.)

The std.parallelism version is a bit faster b/c I think it caches the created thread pool, so you don't start threads anew on each run.
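
To sweep the thread count, std.parallelism lets you build a pool with an explicit number of workers (a sketch; nWorkers is the value you'd vary from 1 to n):

import std.parallelism : TaskPool;
import std.range : iota;

auto pool = new TaskPool(nWorkers);  // instead of the default totalCPUs - 1
scope(exit) pool.finish(true);       // wind the pool down when done

foreach(i; pool.parallel(iota(SIZE)))
{
    results[i] = values1[i] * values2[i];
}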
 I imagined that if one core finishes the work in 200 ms, then 
 4 cores would finish it in 50 ms plus some overhead, since they 
 are working on separate blocks of memory, with no need for 
 synchronisation and no false sharing, etc. (at least I think I 
 don't have that problem here).
If you had a long queue of small tasks like that, and you didn't wait to join all threads until absolutely required, you would get near-perfect scalability (unless you hit other bottlenecks, like RAM).
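
With std.parallelism that pattern might look roughly like this (a sketch; nChunks is illustrative and assumed to divide SIZE evenly):

import std.algorithm : map;
import std.array : array;
import std.parallelism : task, taskPool;
import std.range : iota;

void work(size_t lo, size_t hi)
{
    foreach(i; lo .. hi)
        results[i] = values1[i] * values2[i];
}

// enqueue a long list of small chunks up front; workers pull
// from the queue as they free up, so nobody idles at a barrier
immutable chunk = SIZE / nChunks;
auto tasks = iota(nChunks)
    .map!(c => task!work(c * chunk, (c + 1) * chunk))
    .array;
foreach(t; tasks)
    taskPool.put(t);

// ...and only wait at the point the results are actually needed
foreach(t; tasks)
    t.yieldForce();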
Feb 19 2018