digitalmars.D.learn - std.parallelism curious results

flamencofantasy (54/54) Oct 05 2014 Hello,

Artem Tarasov (3/3) Oct 05 2014 Welcome to the world of multithreading.
Russel Winder via Digitalmars-d-learn (30/95) Oct 05 2014 -----BEGIN PGP SIGNED MESSAGE-----
Sativa (44/44) Oct 05 2014 Two problems, one, you should create your threads outside the
=?UTF-8?B?QWxpIMOHZWhyZWxp?= (26/31) Oct 05 2014 Reducing the number of threads is key. However, unlike what others said,...

Sativa (39/39) Oct 05 2014 On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:

=?UTF-8?B?QWxpIMOHZWhyZWxp?= (10/12) Oct 05 2014 The for loop condition is executed at every iteration and division is an...

Sativa (13/26) Oct 05 2014 Yes, it is a common problem when doing a computation in a for

flamencofantasy (5/5) Oct 05 2014 Thanks everyone for the replies.

"flamencofantasy" <flamencofantasy gmail.com> writes:

Hello,

I am summing up the first 1 billion integers in parallel and in a 
single thread and I'm observing some curious results;

parallel sum : 499999999500000000, elapsed 102833 ms
single thread sum : 499999999500000000, elapsed 1667 ms

The parallel version is 60+ times slower on my i7-3770K CPU. I 
think that maybe due to the CPU constantly flushing and reloading 
the caches in the parallel version but I don't know for sure.

Here is the D code;

	shared ulong sum = 0;
	ulong iter = 1_000_000_000UL;

	StopWatch sw;

	sw.start();

	foreach(i; parallel(iota(0, iter)))
	{
		atomicOp!"+="(sum, i);
	}

	sw.stop();

	writefln("parallel sum : %s, elapsed %s ms", sum, 
sw.peek().msecs);

	sum = 0;

	sw.reset();

	sw.start();

	for (ulong i = 0; i < iter; ++i)
	{
		sum += i;
	}

	sw.stop();

	writefln("single thread sum : %s, elapsed %s ms", sum, 
sw.peek().msecs);



parallel sum : 499999999500000000, elapsed 20320 ms
single thread sum : 499999999500000000, elapsed 1901 ms


is strange on the exact same CPU.



long sum = 0;
long iter = 1000000000L;

var sw = Stopwatch.StartNew();

Parallel.For(0, iter, i =>
{
	Interlocked.Add(ref sum, i);
});

Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, 
sw.ElapsedMilliseconds);

sum = 0;

sw = Stopwatch.StartNew();

for (long i = 0; i < iter; ++i)
{
	sum += i;
}

Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, 
sw.ElapsedMilliseconds);

Thoughts?

Oct 05 2014

"Artem Tarasov" <lomereiter gmail.com> writes:

Welcome to the world of multithreading.
You have just discovered that atomic operations are performance 
killers, congratulations on this.

Oct 05 2014

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/10/14 15:27, flamencofantasy via Digitalmars-d-learn wrote:
 Hello,
 
 I am summing up the first 1 billion integers in parallel and in a
 single thread and I'm observing some curious results;

I am fairly certain that your use of "parallel for" introduces quite a
lot of threads other than you "master" one.

 parallel sum : 499999999500000000, elapsed 102833 ms single thread
 sum : 499999999500000000, elapsed 1667 ms
 
 The parallel version is 60+ times slower on my i7-3770K CPU. I
 think that maybe due to the CPU constantly flushing and reloading
 the caches in the parallel version but I don't know for sure.

I would bet there are cache problems, but far more likely that the
core problem is all the thread activity and in particular all the
synchronization.

 Here is the D code;
 
 shared ulong sum = 0; ulong iter = 1_000_000_000UL;
 
 StopWatch sw;
 
 sw.start();
 
 foreach(i; parallel(iota(0, iter))) { atomicOp!"+="(sum, i); }

Well that will be the problem then, lots and lots of synchronization
with the billion tasks you have set up. I am highly surprised this is
only 60 times slower than sequential!

 sw.stop();
 
 writefln("parallel sum : %s, elapsed %s ms", sum,
 sw.peek().msecs);
 
 sum = 0;
 
 sw.reset();
 
 sw.start();
 
 for (ulong i = 0; i < iter; ++i) { sum += i; }
 
 sw.stop();
 
 writefln("single thread sum : %s, elapsed %s ms", sum, 
 sw.peek().msecs);
 

 
 parallel sum : 499999999500000000, elapsed 20320 ms single thread
 sum : 499999999500000000, elapsed 1901 ms
 

 is strange on the exact same CPU.
 

 
 long sum = 0; long iter = 1000000000L;
 
 var sw = Stopwatch.StartNew();
 
 Parallel.For(0, iter, i => { Interlocked.Add(ref sum, i); });


(somewhat perverse) context is relatively much more efficient than
that of D.

There is almost certainly a useful benchmark test that can come of
this for the std.parallelism implementation (if only I had a few
cycles to get really stuck in to a review and analysis of the module :-(

 Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, 
 sw.ElapsedMilliseconds);
 
 sum = 0;
 
 sw = Stopwatch.StartNew();
 
 for (long i = 0; i < iter; ++i) { sum += i; }
 
 Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, 
 sw.ElapsedMilliseconds);
 
 Thoughts?


- -- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip:
sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlQxeZ0ACgkQ+ooS3F10Be+DKQCgu2Ro+2bVmEua3oPHZ6kAqUVv
cg8AoLpN3BRvLBQLT8qDaiP0wVMS5dQZ
=w4Gx
-----END PGP SIGNATURE-----

Oct 05 2014

"Sativa" <Sativa Indica.org> writes:

Two problems, one, you should create your threads outside the 
stop watch, it is not generally a fair comparison in the real 
world. It throws of the results for short tasks.

Second, you are creating one thread per integer, this is bad. Do 
you really want to create 1B threads when you only have probably 
4 cores?

Below there are 4 threads used. Each thread adds up 1/4 of the 
integers. So it is like 4 threads, each adding up 250M integers. 
The speed, compared to a single thread adding up 250M integers, 
shows how much the parallelism costs per thread.

import std.stdio, std.parallelism, std.datetime, std.range, 
core.atomic;

void main()
{	
	StopWatch sw;
	shared ulong sum1 = 0, sum2 = 0, sum3 = 0, time1, time2, time3;
	
	auto numThreads = 4;
	ulong iter = numThreads*100000UL;
	

	auto thds = parallel(iota(0, iter, iter/numThreads));
	
	sw.start();
	foreach(i; thds) { ulong s = 0; for(ulong k = 0; k < 
iter/numThreads; k++) { s += k; } s += i*iter/numThreads; 
atomicOp!"+="(sum1, s); }
	sw.stop(); time1 = sw.peek().usecs;
	
	
	
	sw.reset();	sw.start();	for (ulong i = 0; i < iter; ++i) { sum2 
+= i; } sw.stop(); time2 = sw.peek().usecs;
	
	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	writefln("Efficiency : %s%%", 100*time2/time1);
}

http://dpaste.dzfl.pl/bfda7bb2e2b7

Some results:

parallel sum : 79999800000, elapsed 3356 us
single thread sum : 79999800000, elapsed 1984 us Efficiency : 59%


(Not sure all the code is correct, the point is you were creating 
1B threads with 1B atomic operations. The worse possible 
comparison one can do between single and multi-threaded tests.

Oct 05 2014

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 10/05/2014 07:27 AM, flamencofantasy wrote:

 I am summing up the first 1 billion integers in parallel and in a single
 thread and I'm observing some curious results;

 parallel sum : 499999999500000000, elapsed 102833 ms
 single thread sum : 499999999500000000, elapsed 1667 ms

 The parallel version is 60+ times slower

Reducing the number of threads is key. However, unlike what others said, 
parallel() does not use that many threads. By default, TaskPool objects 
are constructed by 'totalCPUs - 1' worker threads. All of parallel()'s 
iteration are executed on that few threads.

The main problem here is the use of atomicOp, which necessarily 
synchronizes the whole process.

Something like the following takes advantage of parallelism and reduces 
the execution time by half on my machine (4 cores (hyperthreaded 2 actul 
ones)).

     ulong adder(ulong beg, ulong end)
     {
         ulong localSum = 0;

         foreach (i; beg .. end) {
             localSum += i;
         }

         return localSum;
     }

     enum totalTasks = 10;

     foreach(i; parallel(iota(0, totalTasks)))
     {
         ulong beg = i * iter / totalTasks;
         ulong end = beg + iter / totalTasks;

         atomicOp!"+="(sum, adder(beg, end));
     }

Ali

Oct 05 2014

"Sativa" <Sativa mindphuck.com> writes:

On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:
import std.stdio, std.cstream, std.parallelism, std.datetime, 
std.range, core.atomic;

void main()
{	
	StopWatch sw;
	shared ulong sum1 = 0; ulong sum2 = 0, sum3 = 0, time1, time2, 
time3;

	enum numThreads = 4; // If numThreads is a variable then it 
significantly slows down the process
	ulong iter = 1000000L;
	iter = numThreads*cast(ulong)(iter/numThreads); // Force iter to 
be a multiple of the number of threads so we can partition 
uniformly

	auto thds = parallel(iota(0, cast(uint)iter, 
cast(uint)(iter/numThreads)));

	sw.reset(); sw.start();
	foreach(i; thds) { ulong s = 0; for(ulong k = 0; k < 
iter/numThreads; k++) { s += k; } s += i*iter/numThreads; 
atomicOp!"+="(sum1, s); }
	sw.stop(); time1 = sw.peek().usecs;



	sw.reset();	sw.start();	for (ulong i = 0; i < iter; ++i) { sum2 
+= i; } sw.stop(); time2 = sw.peek().usecs;

	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	if (time1 > 0) writefln("Efficiency : %s%%", 100*time2/time1);
	din.getc();
}

Playing around with the code above, it seems when numThreads is 
an enum, the execution time is significantly effected(that from 
being < 100% to being >100% efficiency).

results on a 4 core laptop with release builds:

parallel sum : 499999500000, elapsed 2469 us
single thread sum : 499999500000, elapsed 8054 us
Efficiency : 326%


when numThreads is an int:

parallel sum : 499999500000, elapsed 21762 us
single thread sum : 499999500000, elapsed 8033 us
Efficiency : 36%

Oct 05 2014

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 10/05/2014 02:40 PM, Sativa wrote:

      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
 iter/numThreads; k++)

The for loop condition is executed at every iteration and division is an 
expensive operation. Apparently, the compiled does some optimization 
when the divisor is known at compile time.

Being 4, it is just a shift of 2 bits. Try something like 5, it is slow 
even for enum.

This solves the problem:

         const end = iter/numThreads;

         for(ulong k = 0; k < end; k++) {

Ali

Oct 05 2014

"Sativa" <Sativa mindphuck.com> writes:

On Sunday, 5 October 2014 at 21:53:23 UTC, Ali Çehreli wrote:
 On 10/05/2014 02:40 PM, Sativa wrote:

      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
 iter/numThreads; k++)

 The for loop condition is executed at every iteration and 
 division is an expensive operation. Apparently, the compiled 
 does some optimization when the divisor is known at compile 
 time.

 Being 4, it is just a shift of 2 bits. Try something like 5, it 
 is slow even for enum.

 This solves the problem:

         const end = iter/numThreads;

         for(ulong k = 0; k < end; k++) {

 Ali

Yes, it is a common problem when doing a computation in a for 
loop on the bounds. Most of the time they are constant for the 
loop but the compiler computes it every iteration. When doing a 
simple sum(when the loop does not do much), it becomes expensive 
since it is comparable to what is happening inside the loop.

It's surprising just how slow it makes it though. One can't 
really make numThreads const in the real world though as it 
wouldn't optimal(unless one had a version for each number of 
possible threads).

Obviously one can just move the computation outside the loop. I 
would expect better results if the loops actually did some real 
work.

Oct 05 2014

"flamencofantasy" <flamencofantasy gmail.com> writes:

Thanks everyone for the replies.

I wasn't sure how std.parallel operated but I thought it would 
launch at most a number of threads equal to the number of cores 
on the machine, just as Ali confirmed and similar to what 
Windows' thread pool does.

Oct 05 2014

D Programming

C/C++ Programming

Other

digitalmars.D.learn - std.parallelism curious results