
digitalmars.D.learn - Scalability in std.parallelism

reply "Nordlöw" <per.nordlow gmail.com> writes:
In the std.parallelism test code given below I get some interesting 
results:

when compiled as

dmd -release -noboundscheck -O -inline -w -wi -wi  
~/Work/justd/t_parallelism.d -oft_parallelism

My scalability measurements say the following:

3.14159 took 221[ms]
3.14159 took 727[ms]
Speedup 3.28959
-5.80829e+09 took 33[ms]
-5.80829e+09 took 201[ms]
Speedup 6.09091

Why do I get a larger speedup for a simpler map function?
Shouldn't it be the opposite?
I've always read that the more calculations I perform on each 
memory access the better the speedup...

Anyhow the speedups are great!

I'm sitting on an Intel quad core with 8 hyperthreads.


Sample code follows:



import std.algorithm, std.parallelism, std.range, std.datetime, 
std.stdio;

void test1 () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     auto piTerm(int i) {
         immutable x = (i - 0.5) * delta;
         return delta / (1.0 + x*x);
     }

     auto nums = n.iota.map!piTerm; // numbers
     StopWatch sw;

     sw.reset();
     sw.start();
     immutable pi = 4.0*taskPool.reduce!"a+b"(nums);
     sw.stop();
     immutable ms = sw.peek().msecs;
     writeln(pi, " took ", ms, "[ms]");

     sw.reset();
     sw.start();
     immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums);
     sw.stop();
     immutable ms_ = sw.peek().msecs;
     writeln(pi_, " took ", ms_, "[ms]");

     writeln("Speedup ", cast(real)ms_ / ms);
}

auto square(T)(T i) @safe pure nothrow { return i*i; }

void test2 () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     auto nums = n.iota.map!square; // numbers
     StopWatch sw;

     sw.reset();
     sw.start();
     immutable pi = 4.0*taskPool.reduce!"a+b"(nums);
     sw.stop();
     immutable ms = sw.peek().msecs;
     writeln(pi, " took ", ms, "[ms]");

     sw.reset();
     sw.start();
     immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums);
     sw.stop();
     immutable ms_ = sw.peek().msecs;
     writeln(pi_, " took ", ms_, "[ms]");

     writeln("Speedup ", cast(real)ms_ / ms);
}

void main () {
     test1();
     test2();
}
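
A note on the -5.80829e+09 figure from test2: square on an int wraps 
around for i above roughly 46000, and the running int sum wraps as well, 
so the printed value is an overflow artefact rather than a meaningful sum. 
A minimal sketch of a variant that squares and accumulates in double 
(squareD is a made-up name, the rest follows the code above):

import std.algorithm, std.parallelism, std.range, std.stdio;

// Hypothetical double-based variant of square: the multiplication and the
// running sum both happen in double, so nothing wraps around.
auto squareD(int i) @safe pure nothrow { return cast(double) i * i; }

void main () {
     immutable n = 100_000_000;
     auto nums = n.iota.map!squareD;
     writeln(taskPool.reduce!"a+b"(nums)); // roughly 3.3e23, no overflow
}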
Feb 22 2014
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 02/22/2014 08:21 AM, "Nordlöw" wrote:

 In the std.parallelism test code given below I get some
 interesting results:

 when compiled as

 dmd -release -noboundscheck -O -inline -w -wi -wi
 ~/Work/justd/t_parallelism.d -oft_parallelism

 My scalability measurements say the following:

 3.14159 took 221[ms]
 3.14159 took 727[ms]
 Speedup 3.28959
 -5.80829e+09 took 33[ms]
 -5.80829e+09 took 201[ms]
 Speedup 6.09091
On my quad-core Intel I get the following (I have two actual cores, four 
hyperthreads):

3.14159 took 441[ms]
3.14159 took 878[ms]
Speedup 1.99093
-5.80829e+09 took 98[ms]
-5.80829e+09 took 328[ms]
Speedup 3.34694

I am not an expert at all, but it looks like the first test cannot take 
advantage of hyperthreading while the second one can to some degree.
      auto piTerm(int i) {
          immutable x = (i - 0.5) * delta;
          return delta / (1.0 + x*x);
      }
 auto square(T)(T i) @safe pure nothrow { return i*i; }
Ali
Feb 23 2014
next sibling parent "Per =?UTF-8?B?Tm9yZGzDtnci?= <per.nordlow gmail.com> writes:
 I am not an expert at all but it looks like the first test 
 cannot take advantage of hyperthreading but the second one can 
 to some degree.
Yes, that is my hypothesis as well :) Thx
Feb 24 2014
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sun, 2014-02-23 at 22:55 -0800, Ali Çehreli wrote:
[…]
 On my quad-core Intel I get the following (I have two actual cores, four 
 hyperthreads):
 
 3.14159 took 441[ms]
 3.14159 took 878[ms]
 Speedup 1.99093
 -5.80829e+09 took 98[ms]
 -5.80829e+09 took 328[ms]
 Speedup 3.34694
 
 I am not an expert at all but it looks like the first test cannot take 
 advantage of hyperthreading but the second one can to some degree.
I'm fairly sure I don't agree with this hypothesis. Two cores with 
hyperthreads generally mean a maximum speedup of 2 with optimized native 
code. So for me the first one looks fine, whereas the second one seems to 
indicate a problem in std.parallelism.

-- 
Russel.
Feb 24 2014
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:

 Two cores with hyperthreads generally mean a maximum speedup 
 of 2 with optimized native code.
Not true. If the code is not trivial and the threads are not doing exactly 
the same instructions (i.e. they can do some search where the number of 
operations depends on data), then 2 cores x 2 hyperthreads can easily 
provide more than a 2x speedup (but far from 4x of course). I see it very 
often in my video processing code.
Feb 25 2014
parent Russel Winder <russel winder.org.uk> writes:
On Tue, 2014-02-25 at 12:33 +0000, thedeemon wrote:
 On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:
 
 Two cores with hyperthreads generally mean a maximum speedup 
 of 2 with optimized native code.
 Not true. If the code is not trivial and the threads are not doing 
 exactly the same instructions (i.e. they can do some search where the 
 number of operations depends on data), then 2 cores x 2 hyperthreads can 
 easily provide more than a 2x speedup (but far from 4x of course). I see 
 it very often in my video processing code.
I suspect the issue here is how compute-intensive the code is, whether 
there are cache line misses, requests out to memory, etc., i.e. how 
non-trivial it is. My observation, and gross over-simplification, stems 
from CPU-bound jobs with very localized data, no need for memory writes, 
and definitely no I/O. That leaves no opportunity for the hyperthreads to 
contribute anything.

I would guess that your video processing uses a cache-friendly 
(streaming?) algorithm, so that one hyperthread can operate on data 
already in cache whilst the other gets more data into the cache. This 
could easily get a >2x speedup on a 2-core, 2-hyperthread machine if the 
data chunks are of a suitable size and the memory reads and writes are in 
good rhythm with the calculation done on the data.

-- 
Russel.
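
For what it's worth, a minimal sketch of that kind of chunked, 
streaming-friendly parallel loop in D, assuming std.range.chunks and 
taskPool.parallel are available; chunkSize is just a made-up tuning knob, 
not anything from the code in this thread:

import std.array, std.parallelism, std.range, std.stdio;

void main () {
     auto data = iota(0, 1_000_000).array; // stand-in for frames/samples
     enum chunkSize = 4096;                // hypothetical tuning knob

     // Each work unit is one cache-sized slice; while one hyperthread
     // crunches its slice, the other can be pulling the next slice in.
     foreach (chunk; taskPool.parallel(data.chunks(chunkSize), 1)) {
         foreach (ref x; chunk)
             x = x * x;                    // some per-element work
     }
     writeln(data[0 .. 5]);
}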
Feb 25 2014
prev sibling next sibling parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Don't rely on dmd when making raw performance tests.

On my machine (i3-2100, two cores):

dmd2 -O -release -inline

3.14159 took 368[ms]
3.14159 took 713[ms]
Speedup 1.9375
-5.80829e+09 took 61[ms]
-5.80829e+09 took 201[ms]
Speedup 3.29508

ldc2 -O3 -release

3.14159 took 360[ms]
3.14159 took 718[ms]
Speedup 1.99444
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan

ldc2 -O3 -release -vectorize -vectorize-loops

3.14159 took 193[ms]
3.14159 took 721[ms]
Speedup 3.73575
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan
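
The 0[ms] rows presumably mean LDC's optimiser collapsed the integer 
reduction more or less completely (that is a guess on my part, not 
something I have verified). One way to rule that out is to make n a 
runtime value so nothing about the reduction can be pre-computed; a rough 
sketch, with the command-line argument being purely illustrative:

import std.algorithm, std.conv, std.parallelism, std.range, std.stdio;

auto square(T)(T i) @safe pure nothrow { return i*i; }

void main (string[] args) {
     // Take n from the command line (default 100_000_000) so the compiler
     // cannot treat the whole reduction as a compile-time constant.
     immutable n = args.length > 1 ? args[1].to!int : 100_000_000;

     auto nums = n.iota.map!square;
     writeln(taskPool.reduce!"a+b"(nums)); // still overflows int, as before
}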
Feb 24 2014
prev sibling next sibling parent "safety0ff" <safety0ff.dev gmail.com> writes:
On Saturday, 22 February 2014 at 16:21:21 UTC, Nordlöw wrote:
 In the std.parallelism test code given below I get 
 some interesting results:
Don't forget that "n.iota.map" returns a lazily evaluated range. 
std.parallelism might have to convert the lazy range to a random-access 
range (i.e. an array) before it can schedule the work. If I add ".array" 
after the map call (e.g. auto nums = n.iota.map!piTerm.array;) I get 
numbers closer to the ideal for test2.

Now compare the differences between test1 and test2: test1 is reducing 
doubles and test2 is reducing ints. I believe the reason for the 
difference in speedup is that you have hyperthreads and not truly 
independent threads. Hyperthreads can contend for shared resources in the 
CPU (e.g. cache and FPU). On my computer, forcing nums to be a range of 
doubles in test2 causes the speedup to drop to approximately the same as 
test1.

Regards.
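
A minimal sketch of the .array change described above, with a smaller n 
than the original so the materialised array stays modest; only the eager 
conversion is new, the rest mirrors test2:

import std.algorithm, std.array, std.parallelism, std.range, std.stdio;

auto square(T)(T i) @safe pure nothrow { return i*i; }

void main () {
     immutable n = 10_000_000; // smaller than 100M: values now sit in memory

     // Eagerly turn the lazy map result into a random-access array so
     // std.parallelism can hand out slices of it as work units directly.
     auto nums = n.iota.map!square.array;

     writeln(taskPool.reduce!"a+b"(nums)); // int sum still overflows, as in test2
}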
Feb 24 2014
prev sibling parent Russel Winder <russel winder.org.uk> writes:
On Sat, 2014-02-22 at 16:21 +0000, "Nordlöw" wrote:
 In the std.parallelism test code given below I get 
 some interesting results:
 
 when compiled as
 
 dmd -release -noboundscheck -O -inline -w -wi -wi  
 ~/Work/justd/t_parallelism.d -oft_parallelism
 
 My scalability measurements say the following:
 
 3.14159 took 221[ms]
 3.14159 took 727[ms]
 Speedup 3.28959
 -5.80829e+09 took 33[ms]
 -5.80829e+09 took 201[ms]
 Speedup 6.09091
 
 Why do I get a larger speedup for a simpler map function?
 Shouldn't it be the opposite?
I'm not sure just now, but it is an interesting issue. On my 7-year-old 
twin 2.33GHz Xeon I get:

3.14159 took 128[ms]
3.14159 took 860[ms]
Speedup 6.71875
-5.80829e+09 took 28[ms]
-5.80829e+09 took 302[ms]
Speedup 10.7857
 I've always read that the more calculations I perform on each 
 memory access the better the speedup...
Not necessarily, it depends on the caches and lots of other fun things.
 Anyhow the speedups are great!
 
 I'm sitting on an Intel quad core with 8 hyperthreads.
As anyone involved in native-code CPU-bound benchmarking will tell you, 
hyperthreads are generally a waste of time, so I'd expect a "flat out" 
speedup of 4 on your machine and 8 on mine. Virtual machines, e.g. the JVM 
and PVM, can sometimes do interesting things that make hyperthreads 
useful.
 Sample code follows:
Hey, I recognize the core of this code, it's my π by quadrature example. 
Sadly, π is not well approximated by -5.80829e+09 :-)

I shall tinker with this a bit as I want to understand why the speedup is 
greater than the number of cores. It implies overhead in one of the 
algorithms.

-- 
Russel.
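
One thing that would help rule out measurement artefacts before digging 
into std.parallelism itself is to repeat each reduction several times, so 
one-off costs such as thread-pool start-up are amortised over several 
runs. A rough sketch using std.datetime.benchmark; the repetition count of 
5 is arbitrary:

import std.algorithm, std.datetime, std.parallelism, std.range, std.stdio;

void main () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     // Same π-by-quadrature term as in test1.
     auto nums = n.iota.map!((int i) {
         immutable x = (i - 0.5) * delta;
         return delta / (1.0 + x*x);
     });

     double serialPi, parallelPi;
     void runSerial()   { serialPi   = 4.0 * std.algorithm.reduce!"a+b"(nums); }
     void runParallel() { parallelPi = 4.0 * taskPool.reduce!"a+b"(nums); }

     // Run each version 5 times so start-up costs are spread out.
     auto times = benchmark!(runSerial, runParallel)(5);
     writeln(serialPi, " vs ", parallelPi);
     writeln("serial ", times[0].msecs, "[ms], parallel ", times[1].msecs,
             "[ms], speedup ", cast(real)times[0].msecs / times[1].msecs);
}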
Feb 24 2014