www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - std.parallelism curious results

reply "flamencofantasy" <flamencofantasy gmail.com> writes:
Hello,

I am summing up the first 1 billion integers in parallel and in a 
single thread and I'm observing some curious results;

parallel sum : 499999999500000000, elapsed 102833 ms
single thread sum : 499999999500000000, elapsed 1667 ms

The parallel version is 60+ times slower on my i7-3770K CPU. I 
think that maybe due to the CPU constantly flushing and reloading 
the caches in the parallel version but I don't know for sure.

Here is the D code;

	shared ulong sum = 0;
	ulong iter = 1_000_000_000UL;

	StopWatch sw;

	sw.start();

	foreach(i; parallel(iota(0, iter)))
	{
		atomicOp!"+="(sum, i);
	}

	sw.stop();

	writefln("parallel sum : %s, elapsed %s ms", sum, 
sw.peek().msecs);

	sum = 0;

	sw.reset();

	sw.start();

	for (ulong i = 0; i < iter; ++i)
	{
		sum += i;
	}

	sw.stop();

	writefln("single thread sum : %s, elapsed %s ms", sum, 
sw.peek().msecs);

Out of curiosity I tried the equivalent code in C# and I got this;

parallel sum : 499999999500000000, elapsed 20320 ms
single thread sum : 499999999500000000, elapsed 1901 ms

The C# parallel is about 3 times faster than the D parallel which 
is strange on the exact same CPU.

And here is the C# code;

long sum = 0;
long iter = 1000000000L;

var sw = Stopwatch.StartNew();

Parallel.For(0, iter, i =>
{
	Interlocked.Add(ref sum, i);
});

Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, 
sw.ElapsedMilliseconds);

sum = 0;

sw = Stopwatch.StartNew();

for (long i = 0; i < iter; ++i)
{
	sum += i;
}

Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, 
sw.ElapsedMilliseconds);

Thoughts?
Oct 05 2014
next sibling parent "Artem Tarasov" <lomereiter gmail.com> writes:
Welcome to the world of multithreading.
You have just discovered that atomic operations are performance 
killers, congratulations on this.
Oct 05 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/10/14 15:27, flamencofantasy via Digitalmars-d-learn wrote:
 Hello,
 
 I am summing up the first 1 billion integers in parallel and in a
 single thread and I'm observing some curious results;
I am fairly certain that your use of "parallel for" introduces quite a lot of threads other than you "master" one.
 parallel sum : 499999999500000000, elapsed 102833 ms single thread
 sum : 499999999500000000, elapsed 1667 ms
 
 The parallel version is 60+ times slower on my i7-3770K CPU. I
 think that maybe due to the CPU constantly flushing and reloading
 the caches in the parallel version but I don't know for sure.
I would bet there are cache problems, but far more likely that the core problem is all the thread activity and in particular all the synchronization.
 Here is the D code;
 
 shared ulong sum = 0; ulong iter = 1_000_000_000UL;
 
 StopWatch sw;
 
 sw.start();
 
 foreach(i; parallel(iota(0, iter))) { atomicOp!"+="(sum, i); }
Well that will be the problem then, lots and lots of synchronization with the billion tasks you have set up. I am highly surprised this is only 60 times slower than sequential!
 sw.stop();
 
 writefln("parallel sum : %s, elapsed %s ms", sum,
 sw.peek().msecs);
 
 sum = 0;
 
 sw.reset();
 
 sw.start();
 
 for (ulong i = 0; i < iter; ++i) { sum += i; }
 
 sw.stop();
 
 writefln("single thread sum : %s, elapsed %s ms", sum, 
 sw.peek().msecs);
 
 Out of curiosity I tried the equivalent code in C# and I got this;
 
 parallel sum : 499999999500000000, elapsed 20320 ms single thread
 sum : 499999999500000000, elapsed 1901 ms
 
 The C# parallel is about 3 times faster than the D parallel which
 is strange on the exact same CPU.
 
 And here is the C# code;
 
 long sum = 0; long iter = 1000000000L;
 
 var sw = Stopwatch.StartNew();
 
 Parallel.For(0, iter, i => { Interlocked.Add(ref sum, i); });
Useful moral of this story is that C# synchronization in this (somewhat perverse) context is relatively much more efficient than that of D. There is almost certainly a useful benchmark test that can come of this for the std.parallelism implementation (if only I had a few cycles to get really stuck in to a review and analysis of the module :-(
 Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, 
 sw.ElapsedMilliseconds);
 
 sum = 0;
 
 sw = Stopwatch.StartNew();
 
 for (long i = 0; i < iter; ++i) { sum += i; }
 
 Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, 
 sw.ElapsedMilliseconds);
 
 Thoughts?
- -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEARECAAYFAlQxeZ0ACgkQ+ooS3F10Be+DKQCgu2Ro+2bVmEua3oPHZ6kAqUVv cg8AoLpN3BRvLBQLT8qDaiP0wVMS5dQZ =w4Gx -----END PGP SIGNATURE-----
Oct 05 2014
prev sibling next sibling parent "Sativa" <Sativa Indica.org> writes:
Two problems, one, you should create your threads outside the 
stop watch, it is not generally a fair comparison in the real 
world. It throws of the results for short tasks.

Second, you are creating one thread per integer, this is bad. Do 
you really want to create 1B threads when you only have probably 
4 cores?

Below there are 4 threads used. Each thread adds up 1/4 of the 
integers. So it is like 4 threads, each adding up 250M integers. 
The speed, compared to a single thread adding up 250M integers, 
shows how much the parallelism costs per thread.

import std.stdio, std.parallelism, std.datetime, std.range, 
core.atomic;

void main()
{	
	StopWatch sw;
	shared ulong sum1 = 0, sum2 = 0, sum3 = 0, time1, time2, time3;
	
	auto numThreads = 4;
	ulong iter = numThreads*100000UL;
	

	auto thds = parallel(iota(0, iter, iter/numThreads));
	
	sw.start();
	foreach(i; thds) { ulong s = 0; for(ulong k = 0; k < 
iter/numThreads; k++) { s += k; } s += i*iter/numThreads; 
atomicOp!"+="(sum1, s); }
	sw.stop(); time1 = sw.peek().usecs;
	
	
	
	sw.reset();	sw.start();	for (ulong i = 0; i < iter; ++i) { sum2 
+= i; } sw.stop(); time2 = sw.peek().usecs;
	
	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	writefln("Efficiency : %s%%", 100*time2/time1);
}

http://dpaste.dzfl.pl/bfda7bb2e2b7

Some results:

parallel sum : 79999800000, elapsed 3356 us
single thread sum : 79999800000, elapsed 1984 us Efficiency : 59%


(Not sure all the code is correct, the point is you were creating 
1B threads with 1B atomic operations. The worse possible 
comparison one can do between single and multi-threaded tests.
Oct 05 2014
prev sibling parent reply =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 10/05/2014 07:27 AM, flamencofantasy wrote:

 I am summing up the first 1 billion integers in parallel and in a single
 thread and I'm observing some curious results;

 parallel sum : 499999999500000000, elapsed 102833 ms
 single thread sum : 499999999500000000, elapsed 1667 ms

 The parallel version is 60+ times slower
Reducing the number of threads is key. However, unlike what others said, parallel() does not use that many threads. By default, TaskPool objects are constructed by 'totalCPUs - 1' worker threads. All of parallel()'s iteration are executed on that few threads. The main problem here is the use of atomicOp, which necessarily synchronizes the whole process. Something like the following takes advantage of parallelism and reduces the execution time by half on my machine (4 cores (hyperthreaded 2 actul ones)). ulong adder(ulong beg, ulong end) { ulong localSum = 0; foreach (i; beg .. end) { localSum += i; } return localSum; } enum totalTasks = 10; foreach(i; parallel(iota(0, totalTasks))) { ulong beg = i * iter / totalTasks; ulong end = beg + iter / totalTasks; atomicOp!"+="(sum, adder(beg, end)); } Ali
Oct 05 2014
parent reply "Sativa" <Sativa mindphuck.com> writes:
On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:
import std.stdio, std.cstream, std.parallelism, std.datetime, 
std.range, core.atomic;

void main()
{	
	StopWatch sw;
	shared ulong sum1 = 0; ulong sum2 = 0, sum3 = 0, time1, time2, 
time3;

	enum numThreads = 4; // If numThreads is a variable then it 
significantly slows down the process
	ulong iter = 1000000L;
	iter = numThreads*cast(ulong)(iter/numThreads); // Force iter to 
be a multiple of the number of threads so we can partition 
uniformly

	auto thds = parallel(iota(0, cast(uint)iter, 
cast(uint)(iter/numThreads)));

	sw.reset(); sw.start();
	foreach(i; thds) { ulong s = 0; for(ulong k = 0; k < 
iter/numThreads; k++) { s += k; } s += i*iter/numThreads; 
atomicOp!"+="(sum1, s); }
	sw.stop(); time1 = sw.peek().usecs;



	sw.reset();	sw.start();	for (ulong i = 0; i < iter; ++i) { sum2 
+= i; } sw.stop(); time2 = sw.peek().usecs;

	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	if (time1 > 0) writefln("Efficiency : %s%%", 100*time2/time1);
	din.getc();
}

Playing around with the code above, it seems when numThreads is 
an enum, the execution time is significantly effected(that from 
being < 100% to being >100% efficiency).

results on a 4 core laptop with release builds:

parallel sum : 499999500000, elapsed 2469 us
single thread sum : 499999500000, elapsed 8054 us
Efficiency : 326%


when numThreads is an int:

parallel sum : 499999500000, elapsed 21762 us
single thread sum : 499999500000, elapsed 8033 us
Efficiency : 36%
Oct 05 2014
parent reply =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 10/05/2014 02:40 PM, Sativa wrote:

      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
 iter/numThreads; k++)
The for loop condition is executed at every iteration and division is an expensive operation. Apparently, the compiled does some optimization when the divisor is known at compile time. Being 4, it is just a shift of 2 bits. Try something like 5, it is slow even for enum. This solves the problem: const end = iter/numThreads; for(ulong k = 0; k < end; k++) { Ali
Oct 05 2014
parent reply "Sativa" <Sativa mindphuck.com> writes:
On Sunday, 5 October 2014 at 21:53:23 UTC, Ali Çehreli wrote:
 On 10/05/2014 02:40 PM, Sativa wrote:

      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
 iter/numThreads; k++)
The for loop condition is executed at every iteration and division is an expensive operation. Apparently, the compiled does some optimization when the divisor is known at compile time. Being 4, it is just a shift of 2 bits. Try something like 5, it is slow even for enum. This solves the problem: const end = iter/numThreads; for(ulong k = 0; k < end; k++) { Ali
Yes, it is a common problem when doing a computation in a for loop on the bounds. Most of the time they are constant for the loop but the compiler computes it every iteration. When doing a simple sum(when the loop does not do much), it becomes expensive since it is comparable to what is happening inside the loop. It's surprising just how slow it makes it though. One can't really make numThreads const in the real world though as it wouldn't optimal(unless one had a version for each number of possible threads). Obviously one can just move the computation outside the loop. I would expect better results if the loops actually did some real work.
Oct 05 2014
parent "flamencofantasy" <flamencofantasy gmail.com> writes:
Thanks everyone for the replies.

I wasn't sure how std.parallel operated but I thought it would 
launch at most a number of threads equal to the number of cores 
on the machine, just as Ali confirmed and similar to what 
Windows' thread pool does.
Oct 05 2014