
digitalmars.D.learn - multithread/concurrency/parallel methods and performance

SrMordred <patric.dexheimer gmail.com> writes:
I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this baseline:

foreach(i ; 0 .. SIZE)
{
     results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn, and Thread.

these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )
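
In case the link rots, the parallel version is roughly this (a simplified sketch):

import std.parallelism : parallel;
import std.range : iota;

// std.parallelism splits the index range across a cached pool of
// worker threads (the default pool has totalCPUs - 1 workers)
foreach(i; parallel(iota(SIZE)))
{
     results[i] = values1[i] * values2[i];
}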

All methods show only minor speedups. I was expecting a lot more:
since I have 7 cores, I expected something below 100 ms.

I'm not seeing false sharing in this case. Or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!
Feb 18 2018
Jordan Wilson <wilsonjord gmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.

 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
It may be due to thread local storage:

https://tour.dlang.org/tour/en/multithreading/thread-local-storage
https://dlang.org/articles/migrate-to-shared.html

I'm not sure though, as I don't know how your results array is initialised.

Jordan
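
PS: a minimal illustration of the thread-local default (a sketch, assuming the arrays are plain double[]; the names are illustrative):

import std.concurrency : spawn;

// module-level variables are thread-local by default in D:
// a spawned thread sees its own (empty) copy of this array
double[] results;

// 'shared' gives one instance visible to all threads
shared double[] sharedResults;

// or hand the data to the thread explicitly; spawn only accepts
// shared or immutable arguments
void worker(shared(double)[] res, immutable(double)[] a, immutable(double)[] b)
{
    foreach(i; 0 .. res.length)
        res[i] = a[i] * b[i];
}

// spawn(&worker, sharedResults, aImm, bImm);  // aImm/bImm: immutable inputs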
Feb 18 2018
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.

 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache for 32-bit values) it may be that the cost of setting up the concurrency adds a significant amount of overhead. Also, the run.dlang.io link has no -O flag and thus no optimisations.

Without -O I get:

_base : 323 ms, 92 μs, and 6 hnsecs
_parallel : 276 ms, 649 μs, and 3 hnsecs
_concurrency : 221 ms, 931 μs, and 7 hnsecs
_thread : 212 ms, 277 μs, and 3 hnsecs

With it I get:

_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs

With SIZE = 16*1024*1024 without -O I get:

_base : 5 secs, 835 ms, 240 μs, and 9 hnsecs
_parallel : 4 secs, 802 ms, 279 μs, and 8 hnsecs
_concurrency : 2 secs, 133 ms, 685 μs, and 3 hnsecs
_thread : 2 secs, 108 ms, 860 μs, and 9 hnsecs

With SIZE = 16*1024*1024 with -O I get:

_base : 2 secs, 502 ms, 523 μs, and 4 hnsecs
_parallel : 1 sec, 769 ms, 945 μs, and 3 hnsecs
_concurrency : 1 sec, 362 ms, 747 μs, and 1 hnsec
_thread : 1 sec, 335 ms, 720 μs, and 1 hnsec
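
(For local runs, the switches to compare would be something like the following; app.d stands in for the actual file name:)

# DMD: optimise, inline, and drop bounds checks / contracts
dmd -O -inline -release app.d
# LDC tends to auto-vectorise this kind of loop much better
ldc2 -O3 -release app.d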
Feb 18 2018
SrMordred <patric.dexheimer gmail.com> writes:
On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson 
wrote:
 As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache 
 for 32-bit values) it may be that the cost of setting up the 
 concurrency adds a significant amount of overhead.
That 'concurrency overhead' is what I'm not getting. Since the array is big, dividing it across 6 or 7 cores will not thrash L1, since the blocks are very far from each other, right? Or is L2 cache thrashing also a problem in this case?
 _base : 150 ms, 728 μs, and 5 hnsecs
 _parallel : 120 ms, 78 μs, and 5 hnsecs
 _concurrency : 134 ms, 787 μs, and 4 hnsecs
 _thread : 129 ms, 476 μs, and 2 hnsecs
Yes, on my PC I was using -release. Still: 150 ms on 1 core versus 120-134 ms on X cores. Shouldn't that be way faster?

I'm trying to understand where the overhead is, and whether it is possible to get rid of it (perfect thread scaling).
Feb 19 2018
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
 I'm experimenting with threads and related topics recently.
 (I've just started, so there may be some terrible mistakes here.)

 With this baseline:

 foreach(i ; 0 .. SIZE)
 {
     results[i] = values1[i] * values2[i];
 }

 and then with these 3 other methods: parallel, spawn, and 
 Thread.

 these were my results:

 _base : 456 ms and 479 us
 _parallel : 331 ms, 324 us, and 4 hnsecs
 _concurrency : 367 ms, 348 us, and 2 hnsecs
 _thread : 369 ms, 565 us, and 3 hnsecs

 (code here : https://run.dlang.io/is/2pdmmk )

 All methods show only minor speedups. I was expecting a lot 
 more: since I have 7 cores, I expected something below 100 ms.
The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:

results[] = values1[] * values2[];

Parallelism gets more interesting with more expensive operations. You may also try bigger sizes, or both.
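
A quick way to compare the two (a sketch; assumes SIZE and the arrays from the original snippet are in scope, and uses benchmark from std.datetime.stopwatch):

import std.datetime.stopwatch : benchmark;

void scalar()
{
    foreach(i; 0 .. SIZE)
        results[i] = values1[i] * values2[i];
}

void vectorised()
{
    results[] = values1[] * values2[];  // array op; the compiler may emit SIMD
}

// run each version 100 times and compare the total durations
auto times = benchmark!(scalar, vectorised)(100);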
 I'm not seeing false sharing in this case. Or am I wrong?

 If someone can expand on this, I'll be grateful.

 Thanks!
Feb 18 2018
SrMordred <patric.dexheimer gmail.com> writes:
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
wrote:
 The operation is trivial and the dataset is rather small. In 
 such cases SIMD with e.g. array ops is the way to go:

 results[] = values1[] * values2[];
Yes, absolutely right :)

I made this simple example to understand why the threads are not scaling the way I thought they would.

I imagined that if one core finishes the work in 200 ms, then 4 cores would finish it in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronisation and no false sharing, etc. (at least I think I don't have that problem here).
Feb 19 2018
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
 On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
 wrote:
 The operation is trivial and the dataset is rather small. In 
 such cases SIMD with e.g. array ops is the way to go:

 results[] = values1[] * values2[];
 Yes, absolutely right :) I made this simple example to 
 understand why the threads are not scaling the way I thought 
 they would.
Yeah, the world is an ugly place where trivial math sometimes doesn't work. I suggest you:

- run with different numbers of threads, from 1 to n
- vary sizes from 100k to 10m

It's 6/7 ~ 0.86 ms, which is a good deal smaller than a CPU timeslice. In essence a single core runs fast b/c it doesn't wait for all the others to complete via join, easily burning its quota in one go. In MT I bet some of the overhead comes from not all threads finishing (and starting) at once, so the joins block in the kernel.

You could run your MT code with strace to see if it hits the futex call or some such; if it does, that's where you are getting delays. (That's assuming you are on Linux.)

The std.parallelism version is a bit faster b/c I think it caches the created thread pool, so you don't start threads anew on each run.
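
To sweep the thread count, std.parallelism lets you build a pool with an explicit number of workers (a sketch; nWorkers is the value you'd vary from 1 to n):

import std.parallelism : TaskPool;
import std.range : iota;

auto pool = new TaskPool(nWorkers);  // instead of the default totalCPUs - 1
scope(exit) pool.finish(true);       // wind the pool down when done

foreach(i; pool.parallel(iota(SIZE)))
{
    results[i] = values1[i] * values2[i];
}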
 I imagined that if one core finishes the work in 200 ms, then 
 4 cores would finish it in 50 ms plus some overhead, since they 
 are working on separate blocks of memory, with no need for 
 synchronisation and no false sharing, etc. (at least I think I 
 don't have that problem here).
If you had a long queue of small tasks like that, and you didn't wait to join all threads until absolutely required, you would get near-perfect scalability (unless you hit other bottlenecks, like RAM).
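
With std.parallelism that pattern might look roughly like this (a sketch; nChunks is illustrative and assumed to divide SIZE evenly):

import std.algorithm : map;
import std.array : array;
import std.parallelism : task, taskPool;
import std.range : iota;

void work(size_t lo, size_t hi)
{
    foreach(i; lo .. hi)
        results[i] = values1[i] * values2[i];
}

// enqueue a long list of small chunks up front; workers pull
// from the queue as they free up, so nobody idles at a barrier
immutable chunk = SIZE / nChunks;
auto tasks = iota(nChunks)
    .map!(c => task!work(c * chunk, (c + 1) * chunk))
    .array;
foreach(t; tasks)
    taskPool.put(t);

// ...and only wait at the point the results are actually needed
foreach(t; tasks)
    t.yieldForce();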
Feb 19 2018