
digitalmars.D.learn - Scalability in std.parallelism

reply "Nordlöw" <per.nordlow gmail.com> writes:
In the std.parallelism test code given below I get some interesting 
results:

when compiled as

dmd -release -noboundscheck -O -inline -w -wi -wi  
~/Work/justd/t_parallelism.d -oft_parallelism

My scalability measurements say the following:

3.14159 took 221[ms]
3.14159 took 727[ms]
Speedup 3.28959
-5.80829e+09 took 33[ms]
-5.80829e+09 took 201[ms]
Speedup 6.09091

Why do I get a larger speedup for a simpler map function?
Shouldn't it be the opposite?
I've always read that the more calculations I perform on each 
memory access the better the speedup...

Anyhow the speedups are great!

I'm sitting on an Intel quad core with 8 hyperthreads.


Sample code follows:



import std.algorithm, std.parallelism, std.range, std.datetime, 
std.stdio;

void test1 () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     auto piTerm(int i) {
         immutable x = (i - 0.5) * delta;
         return delta / (1.0 + x*x);
     }

     auto nums = n.iota.map!piTerm; // numbers
     StopWatch sw;

     sw.reset();
     sw.start();
     immutable pi = 4.0*taskPool.reduce!"a+b"(nums);
     sw.stop();
     immutable ms = sw.peek().msecs;
     writeln(pi, " took ", ms, "[ms]");

     sw.reset();
     sw.start();
     immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums);
     sw.stop();
     immutable ms_ = sw.peek().msecs;
     writeln(pi_, " took ", ms_, "[ms]");

     writeln("Speedup ", cast(real)ms_ / ms);
}

auto square(T)(T i) @safe pure nothrow { return i*i; }

void test2 () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     auto nums = n.iota.map!square; // numbers
     StopWatch sw;

     sw.reset();
     sw.start();
     immutable pi = 4.0*taskPool.reduce!"a+b"(nums);
     sw.stop();
     immutable ms = sw.peek().msecs;
     writeln(pi, " took ", ms, "[ms]");

     sw.reset();
     sw.start();
     immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums);
     sw.stop();
     immutable ms_ = sw.peek().msecs;
     writeln(pi_, " took ", ms_, "[ms]");

     writeln("Speedup ", cast(real)ms_ / ms);
}

void main () {
     test1();
     test2();
}
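
A note on the -5.80829e+09 figure from test2: square on an int wraps 
around for i above roughly 46000, and the running int sum wraps as well, 
so the printed value is an overflow artefact rather than a meaningful sum. 
A minimal sketch of a variant that squares and accumulates in double 
(squareD is a made-up name, the rest follows the code above):

import std.algorithm, std.parallelism, std.range, std.stdio;

// Hypothetical double-based variant of square: the multiplication and the
// running sum both happen in double, so nothing wraps around.
auto squareD(int i) @safe pure nothrow { return cast(double) i * i; }

void main () {
     immutable n = 100_000_000;
     auto nums = n.iota.map!squareD;
     writeln(taskPool.reduce!"a+b"(nums)); // roughly 3.3e23, no overflow
}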
Feb 22 2014
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 02/22/2014 08:21 AM, "Nordlöw" wrote:

 In the std.parallelism test code given below I get some
 interesting results:

 when compiled as

 dmd -release -noboundscheck -O -inline -w -wi -wi
 ~/Work/justd/t_parallelism.d -oft_parallelism

 My scalability measurements say the following:

 3.14159 took 221[ms]
 3.14159 took 727[ms]
 Speedup 3.28959
 -5.80829e+09 took 33[ms]
 -5.80829e+09 took 201[ms]
 Speedup 6.09091
On my quad-core Intel I get the following (I have two actual cores, four 
hyperthreads):

3.14159 took 441[ms]
3.14159 took 878[ms]
Speedup 1.99093
-5.80829e+09 took 98[ms]
-5.80829e+09 took 328[ms]
Speedup 3.34694

I am not an expert at all, but it looks like the first test cannot take 
advantage of hyperthreading while the second one can to some degree.
      auto piTerm(int i) {
          immutable x = (i - 0.5) * delta;
          return delta / (1.0 + x*x);
      }
 auto square(T)(T i) @safe pure nothrow { return i*i; }
Ali
Feb 23 2014
next sibling parent "Per =?UTF-8?B?Tm9yZGzDtnci?= <per.nordlow gmail.com> writes:
 I am not an expert at all but it looks like the first test 
 cannot take advantage of hyperthreading but the second one can 
 to some degree.
Yes, that is my hypothesis as well :) Thx
Feb 24 2014
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sun, 2014-02-23 at 22:55 -0800, Ali Çehreli wrote:
[…]
 On my quad-core Intel I get the following (I have two actual cores, four 
 hyperthreads):
 
 3.14159 took 441[ms]
 3.14159 took 878[ms]
 Speedup 1.99093
 -5.80829e+09 took 98[ms]
 -5.80829e+09 took 328[ms]
 Speedup 3.34694
 
 I am not an expert at all but it looks like the first test cannot take 
 advantage of hyperthreading but the second one can to some degree.
I'm fairly sure I don't agree with this hypothesis. Two cores with 
hyperthreads generally mean a maximum speedup of 2 with optimized native 
code. So for me the first one looks fine, whereas the second one seems to 
indicate a problem in std.parallelism.

-- 
Russel.
Feb 24 2014
parent reply "thedeemon" <dlang thedeemon.com> writes:
On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:

 Two cores with hyperthreads generally mean a maximum speedup 
 of 2 with optimized native code.
Not true. If the code is not trivial and the threads are not doing exactly 
the same instructions (i.e. they can do some search where the number of 
operations depends on data), then 2 cores x 2 hyperthreads can easily 
provide more than a 2x speedup (but far from 4x of course). I see it very 
often in my video processing code.
Feb 25 2014
parent Russel Winder <russel winder.org.uk> writes:
On Tue, 2014-02-25 at 12:33 +0000, thedeemon wrote:
 On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:
 
 Two cores with hyperthreads generally mean a maximum speedup 
 of 2 with optimized native code.
 Not true. If the code is not trivial and the threads are not doing 
 exactly the same instructions (i.e. they can do some search where the 
 number of operations depends on data), then 2 cores x 2 hyperthreads can 
 easily provide more than a 2x speedup (but far from 4x of course). I see 
 it very often in my video processing code.
I suspect the issue here is how compute-intensive the code is, whether 
there are cache line misses, requests out to memory, etc., i.e. how 
non-trivial it is. My observation, and gross over-simplification, stems 
from CPU-bound jobs with very localized data, no need for memory writes, 
and definitely no I/O. That leaves no opportunity for the hyperthreads to 
contribute anything.

I would guess that your video processing uses a cache-friendly 
(streaming?) algorithm, so that one hyperthread can operate on data 
already in cache whilst the other gets more data into the cache. This 
could easily get a >2x speedup on a 2-core, 2-hyperthread machine if the 
data chunks are of a suitable size and the memory reads and writes are in 
good rhythm with the calculation done on the data.

-- 
Russel.
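
For what it's worth, a minimal sketch of that kind of chunked, 
streaming-friendly parallel loop in D, assuming std.range.chunks and 
taskPool.parallel are available; chunkSize is just a made-up tuning knob, 
not anything from the code in this thread:

import std.array, std.parallelism, std.range, std.stdio;

void main () {
     auto data = iota(0, 1_000_000).array; // stand-in for frames/samples
     enum chunkSize = 4096;                // hypothetical tuning knob

     // Each work unit is one cache-sized slice; while one hyperthread
     // crunches its slice, the other can be pulling the next slice in.
     foreach (chunk; taskPool.parallel(data.chunks(chunkSize), 1)) {
         foreach (ref x; chunk)
             x = x * x;                    // some per-element work
     }
     writeln(data[0 .. 5]);
}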
Feb 25 2014
prev sibling next sibling parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Don't rely on dmd when making raw performance tests.

On my machine (i3-2100, two cores):

dmd2 -O -release -inline

3.14159 took 368[ms]
3.14159 took 713[ms]
Speedup 1.9375
-5.80829e+09 took 61[ms]
-5.80829e+09 took 201[ms]
Speedup 3.29508

ldc2 -O3 -release

3.14159 took 360[ms]
3.14159 took 718[ms]
Speedup 1.99444
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan

ldc2 -O3 -release -vectorize -vectorize-loops

3.14159 took 193[ms]
3.14159 took 721[ms]
Speedup 3.73575
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan
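
The 0[ms] rows presumably mean LDC's optimiser collapsed the integer 
reduction more or less completely (that is a guess on my part, not 
something I have verified). One way to rule that out is to make n a 
runtime value so nothing about the reduction can be pre-computed; a rough 
sketch, with the command-line argument being purely illustrative:

import std.algorithm, std.conv, std.parallelism, std.range, std.stdio;

auto square(T)(T i) @safe pure nothrow { return i*i; }

void main (string[] args) {
     // Take n from the command line (default 100_000_000) so the compiler
     // cannot treat the whole reduction as a compile-time constant.
     immutable n = args.length > 1 ? args[1].to!int : 100_000_000;

     auto nums = n.iota.map!square;
     writeln(taskPool.reduce!"a+b"(nums)); // still overflows int, as before
}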
Feb 24 2014
prev sibling next sibling parent "safety0ff" <safety0ff.dev gmail.com> writes:
On Saturday, 22 February 2014 at 16:21:21 UTC, Nordlöw wrote:
 In the std.parallelism test code given below I get 
 some interesting results:
Don't forget that "n.iota.map" returns a lazily evaluated range. 
std.parallelism might have to convert the lazy range to a random-access 
range (i.e. an array) before it can schedule the work. If I add ".array" 
after the map call (e.g. auto nums = n.iota.map!piTerm.array;) I get 
numbers closer to the ideal for test2.

Now compare the differences between test1 and test2: test1 is reducing 
doubles and test2 is reducing ints. I believe the reason for the 
difference in speedup is that you have hyperthreads and not truly 
independent threads. Hyperthreads can contend for shared resources in the 
CPU (e.g. cache and FPU). On my computer, forcing nums to be a range of 
doubles in test2 causes the speedup to drop to approximately the same as 
test1.

Regards.
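
A minimal sketch of the .array change described above, with a smaller n 
than the original so the materialised array stays modest; only the eager 
conversion is new, the rest mirrors test2:

import std.algorithm, std.array, std.parallelism, std.range, std.stdio;

auto square(T)(T i) @safe pure nothrow { return i*i; }

void main () {
     immutable n = 10_000_000; // smaller than 100M: values now sit in memory

     // Eagerly turn the lazy map result into a random-access array so
     // std.parallelism can hand out slices of it as work units directly.
     auto nums = n.iota.map!square.array;

     writeln(taskPool.reduce!"a+b"(nums)); // int sum still overflows, as in test2
}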
Feb 24 2014
prev sibling parent Russel Winder <russel winder.org.uk> writes:
On Sat, 2014-02-22 at 16:21 +0000, "Nordlöw" wrote:
 In the std.parallelism test code given below I get 
 some interesting results:
 
 when compiled as
 
 dmd -release -noboundscheck -O -inline -w -wi -wi  
 ~/Work/justd/t_parallelism.d -oft_parallelism
 
 My scalability measurements say the following:
 
 3.14159 took 221[ms]
 3.14159 took 727[ms]
 Speedup 3.28959
 -5.80829e+09 took 33[ms]
 -5.80829e+09 took 201[ms]
 Speedup 6.09091
 
 Why do I get a larger speedup for a simpler map function?
 Shouldn't it be the opposite?
I'm not sure just now, but it is an interesting issue. On my 7-year-old 
twin 2.33GHz Xeon I get:

3.14159 took 128[ms]
3.14159 took 860[ms]
Speedup 6.71875
-5.80829e+09 took 28[ms]
-5.80829e+09 took 302[ms]
Speedup 10.7857
 I've always read that the more calculations I perform on each 
 memory access the better the speedup...
Not necessarily, it depends on the caches and lots of other fun things.
 Anyhow the speedups are great!
 
 I'm sitting on an Intel quad core with 8 hyperthreads.
As anyone involved in native-code CPU-bound benchmarking will tell you, 
hyperthreads are generally a waste of time, so I'd expect a "flat out" 
speedup of 4 on your machine and 8 on mine. Virtual machines, e.g. the JVM 
and PVM, can sometimes do interesting things that make hyperthreads 
useful.
 Sample code follows:
Hey, I recognize the core of this code, it's my π by quadrature example. 
Sadly, π is not well approximated by -5.80829e+09 :-)

I shall tinker with this a bit as I want to understand why the speedup is 
greater than the number of cores. It implies overhead in one of the 
algorithms.

-- 
Russel.
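
One thing that would help rule out measurement artefacts before digging 
into std.parallelism itself is to repeat each reduction several times, so 
one-off costs such as thread-pool start-up are amortised over several 
runs. A rough sketch using std.datetime.benchmark; the repetition count of 
5 is arbitrary:

import std.algorithm, std.datetime, std.parallelism, std.range, std.stdio;

void main () {
     immutable n = 100_000_000;
     immutable delta = 1.0 / n;

     // Same π-by-quadrature term as in test1.
     auto nums = n.iota.map!((int i) {
         immutable x = (i - 0.5) * delta;
         return delta / (1.0 + x*x);
     });

     double serialPi, parallelPi;
     void runSerial()   { serialPi   = 4.0 * std.algorithm.reduce!"a+b"(nums); }
     void runParallel() { parallelPi = 4.0 * taskPool.reduce!"a+b"(nums); }

     // Run each version 5 times so start-up costs are spread out.
     auto times = benchmark!(runSerial, runParallel)(5);
     writeln(serialPi, " vs ", parallelPi);
     writeln("serial ", times[0].msecs, "[ms], parallel ", times[1].msecs,
             "[ms], speedup ", cast(real)times[0].msecs / times[1].msecs);
}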
Feb 24 2014