digitalmars.D - std.parallelism: Request for Review

reply dsimcha <dsimcha yahoo.com> writes:
I've taken care of all of the issues Andrei mentioned a while back with 
regard to std.parallelism.  I've moved the repository to Github 
(https://github.com/dsimcha/std.parallelism/wiki), updated/improved the 
documentation 
(http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html), cleaned up 
a couple miscellaneous minor issues, added a LazyMap object for 
pipelining, and added a benchmarks folder (in the Github source tree) 
with at least one benchmark for each of the parallelism types (parallel 
foreach, map, reduce, task-based, pipelining).  This is an official 
request for review of this module for inclusion in Phobos.

I didn't include any benchmark results on the wiki yet, because so far 
I've only had a chance to run them on a dual core I have at home. 
They're also admittedly slightly tuned to my hardware, so the results 
would be biased, but all seem to provide at least close to linear 
speedups.  Others may feel free to post their benchmark results to the wiki.
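
As a rough illustration of three of the parallelism types listed above (parallel foreach, map, reduce), here is a minimal sketch. It is written against the API as it exists in current Phobos, which may differ from the version under review, and the names square and euclideanDistance are made up for this example rather than taken from the benchmarks folder.

```d
import std.math : sqrt;
import std.parallelism : parallel, taskPool;
import std.stdio : writeln;

// Module-level so it can be passed as a template alias argument.
double square(double x) { return x * x; }

// Hypothetical helper, loosely modelled on the idea of the euclidean benchmark.
double euclideanDistance(const double[] a, const double[] b)
{
    assert(a.length == b.length);
    auto diffSq = new double[a.length];

    // Parallel foreach: each element is written by exactly one iteration.
    foreach (i, ref d; parallel(diffSq))
        d = square(a[i] - b[i]);

    // Parallel reduce: partial sums are computed per work unit, then combined.
    return sqrt(taskPool.reduce!"a + b"(0.0, diffSq));
}

void main()
{
    auto a = new double[2_500];
    auto b = new double[2_500];
    foreach (i; 0 .. a.length)
    {
        a[i] = i;
        b[i] = i + 2.0;
    }

    // Parallel map into a newly allocated array.
    auto squares = taskPool.amap!square(a);

    writeln(squares[3]);
    writeln(euclideanDistance(a, b));
}
```

Floating-point addition is not strictly associative, so a parallel reduce can differ from a serial one in the last bits; the inputs here are chosen so that every partial sum is exact.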

One last note:  Due to Bug 5612 
(http://d.puremagic.com/issues/show_bug.cgi?id=5612), the benchmarks 
don't work on 64-bit because core.cpuid won't realize that your CPU is 
multicore.  There are two ways around this.  One is to use 32-bit mode. 
  The other is to change the benchmark files to manually set the number 
of cores by setting the defaultPoolThreads property.
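
A minimal sketch of the second workaround, assuming the defaultPoolThreads property behaves as it does in current Phobos (it must be set before the global taskPool is first used). The value knownCores is a placeholder for whatever the user knows the machine to have, and totalCPUs is a later Phobos addition that is not mentioned in this post.

```d
import std.parallelism : defaultPoolThreads, taskPool, totalCPUs;
import std.stdio : writeln;

void main()
{
    // Hypothetical core count supplied by the user because detection is wrong.
    enum uint knownCores = 4;

    // The main thread also does work, so use one worker fewer than cores.
    // This has to happen before the first access to taskPool.
    defaultPoolThreads = knownCores - 1;

    writeln("detected CPUs:  ", totalCPUs);
    writeln("worker threads: ", taskPool.size);
}
```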
Feb 26 2011
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Some results on an Athlon II X4, 2.8Ghz (quad-core):

https://gist.github.com/845676
Feb 26 2011
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Without release, only the euclidean benchmark shows a more dramatic
speed difference:
Serial reduce:  6298 milliseconds.
Parallel reduce with 4 cores:  567 milliseconds.

I forgot to mention I'm on XP32. I could test these on a virtualized
Linux, if that's worth testing.
Feb 26 2011
parent dsimcha <dsimcha yahoo.com> writes:
I have no idea why the euclidean benchmark shows a superlinear speedup 
without -release, though I'm able to reproduce this on my box.  Must 
have something to do with std.algorithm's use of asserts or something.

As far as operating systems, I'm glad you tested on XP32.  One thing 
that can make a **huge** difference is that, on XP, synchronized blocks 
immediately hit kernel calls and context switches unless you use the 
Windows API directly to explicitly override this behavior.  On Vista and 
7, the default behavior (which D uses) is to spin for a short period of 
time before context switching when waiting on a lock.  This is usually 
vastly more efficient in the case of heavily contested, fine grained 
locking.  I tested on Windows 7 and I'm very happy that none of the 
numbers completely blew up on XP because of this issue.

On 2/26/2011 5:30 PM, Andrej Mitrovic wrote:
 Without release, only the euclidean benchmark shows a more dramatic
 speed difference:
 Serial reduce:  6298 milliseconds.
 Parallel reduce with 4 cores:  567 milliseconds.

 I forgot to mention I'm on XP32. I could test these on a virtualized
 Linux, if that's worth testing.

Feb 26 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/26/11 3:13 PM, dsimcha wrote:
 I've taken care of all of the issues Andrei mentioned a while back with
 regard to std.parallelism. I've moved the repository to Github
 (https://github.com/dsimcha/std.parallelism/wiki), updated/improved the
 documentation
 (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html), cleaned up
 a couple miscellaneous minor issues, added a LazyMap object for
 pipelining, and added a benchmarks folder (in the Github source tree)
 with at least one benchmark for each of the parallelism types (parallel
 foreach, map, reduce, task-based, pipelining). This is an official
 request for review of this module for inclusion in Phobos.

 I didn't include any benchmark results on the wiki yet, because so far
 I've only had a chance to run them on a dual core I have at home.
 They're also admittedly slightly tuned to my hardware, so the results
 would be biased, but all seem to provide at least close to linear
 speedups. Others may feel free to post their benchmark results to the wiki.

 One last note: Due to Bug 5612
 (http://d.puremagic.com/issues/show_bug.cgi?id=5612), the benchmarks
 don't work on 64-bit because core.cpuid won't realize that your CPU is
 multicore. There are two ways around this. One is to use 32-bit mode.
 The other is to change the benchmark files to manually set the number of
 cores by setting the defaultPoolThreads property.

Who'd like to be review manager? That entails putting together a schedule, tallying votes, summarizing discussions, and such.

Andrei
Feb 26 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 2/26/2011 6:08 PM, Andrej Mitrovic wrote:
 The example code is quite simple to digest. The makeAngel name is funny. :p

 I wonder how this compares to other languages.

 Should the return values "Task!(run,TypeTuple!(F,Args))" and
 "Task!(run,TypeTuple!(F,Args))*" be exposed like that? I'd maybe vote
 for auto on this one, if possible. Although auto does hide what it
 returns..

Yeah, this was a somewhat difficult choice. In the end I decided that the clarity of making what's really going on transparent outweighs the fairly mild exposure of implementation details that are unlikely to change. The only detail being exposed is that I use an adapter function to make callable objects work with the same infrastructure as aliases.
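
To make the two forms concrete, here is a small sketch assuming the task() overloads of current Phobos; the function add is invented for the example. The alias form takes the function as a compile-time argument, while the callable form goes through the adapter mentioned above, which is why its type mentions run.

```d
import std.parallelism : task, taskPool;
import std.stdio : writeln;

int add(int a, int b) { return a + b; }

void main()
{
    // Alias form: the function is a template alias argument.
    auto t1 = task!add(1, 2);
    taskPool.put(t1);

    // Callable form: a function pointer is wrapped by an adapter.
    auto t2 = task(&add, 3, 4);
    taskPool.put(t2);

    // With auto the caller never has to spell these types out.
    writeln(typeof(t1).stringof);
    writeln(typeof(t2).stringof);

    // yieldForce waits for the task to finish and returns its result.
    writeln(t1.yieldForce + t2.yieldForce);
}
```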
Feb 26 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
The example code is quite simple to digest. The makeAngel name is funny. :p

I wonder how this compares to other languages.

Should the return values "Task!(run,TypeTuple!(F,Args))" and
"Task!(run,TypeTuple!(F,Args))*" be exposed like that? I'd maybe vote
for auto on this one, if possible. Although auto does hide what it
returns..
Feb 26 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Sat, 2011-02-26 at 16:13 -0500, dsimcha wrote:
[ . . . ]
 One last note:  Due to Bug 5612
 (http://d.puremagic.com/issues/show_bug.cgi?id=5612), the benchmarks
 don't work on 64-bit because core.cpuid won't realize that your CPU is
 multicore.  There are two ways around this.  One is to use 32-bit mode.
 The other is to change the benchmark files to manually set the number
 of cores by setting the defaultPoolThreads property.

Actually this bug is worse than it first appears: 32-bit is badly broken too, in that only the core count of one processor is reported; it ignores having multiple processors.  So on my twin-Xeon, core.cpuid reports 1 in 64-bit mode and 4 instead of 8 in 32-bit mode.  Aaarrrgggghhhhhh....

NB  The source code link on http://digitalmars.com/d/2.0/phobos/core_cpuid.html points to https://github.com/D-Programming-Language/phobos/blob/master/core/cpuid.d which is a 404.

I guess I'll have to clone Phobos in its entirety in order to take a look at this.  Damn, I only want to look at the one file.

In 64-bit mode:

|> scons runall
/usr/bin/python /home/Checkouts/Mercurial/SCons/bootstrap/src/script/scons.py runall
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
dmd -I. -I/home/users/russel/lib/D -m64 -release -O -inline -c -ofeuclidean.o euclidean.d
gcc -o euclidean -m64 euclidean.o -L/home/users/russel/lib.Linux.x86_64 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib64 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m64 -release -O -inline -c -ofmatrixInversion.o matrixInversion.d
gcc -o matrixInversion -m64 matrixInversion.o -L/home/users/russel/lib.Linux.x86_64 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib64 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m64 -release -O -inline -c -ofmillionSqrt.o millionSqrt.d
gcc -o millionSqrt -m64 millionSqrt.o -L/home/users/russel/lib.Linux.x86_64 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib64 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m64 -release -O -inline -c -ofparallelSort.o parallelSort.d
gcc -o parallelSort -m64 parallelSort.o -L/home/users/russel/lib.Linux.x86_64 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib64 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m64 -release -O -inline -c -ofpipelining.o pipelining.d
gcc -o pipelining -m64 pipelining.o -L/home/users/russel/lib.Linux.x86_64 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib64 -lparallelism -lphobos2 -lpthread -lm -lrt
runEverything(["runall"], ["euclidean", "matrixInversion", "millionSqrt", "parallelSort", "pipelining"])

========  euclidean  ============
Serial reduce:  1158 milliseconds.
Parallel reduce with 1 cores:  1159 milliseconds.

========  matrixInversion  ============
Inverted a 256 x 256 matrix serially in 62 milliseconds.
Inverted a 256 x 256 matrix using 1 cores in 60 milliseconds.

========  millionSqrt  ============
Parallel benchmarks being done with 1 cores.
Did serial millionSqrt in 956 milliseconds.
Did parallel foreach millionSqrt in 990 milliseconds.
Did parallel map millionSqrt in 985 milliseconds.

========  parallelSort  ============
Serial quick sort:  4542 milliseconds.
Parallel quick sort:  4616 milliseconds.

========  pipelining  ============
Did serial string -> float, euclid in 2097 milliseconds.
Did parallel string -> float, euclid with 1 cores in 1939 milliseconds.

scons: done building targets.

In 32-bit mode:

<pending, I need to fix the SCons D mode to get this to work :-(>

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 27 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On an ancient 32-bit dual core Mac Mini:

|> scons runall
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /home/Checkouts/Mercurial/SCons/bootstrap/src/script/scons.py runall
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
dmd -I. -I/Users/russel/lib/D -m32 -release -O -inline -c -ofeuclidean.o euclidean.d
gcc -o euclidean -m32 euclidean.o -L/Users/russel/lib.Darwin.ix86 -L/Users/russel/lib.Darwin.ix86/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm
dmd -I. -I/Users/russel/lib/D -m32 -release -O -inline -c -ofmatrixInversion.o matrixInversion.d
gcc -o matrixInversion -m32 matrixInversion.o -L/Users/russel/lib.Darwin.ix86 -L/Users/russel/lib.Darwin.ix86/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm
dmd -I. -I/Users/russel/lib/D -m32 -release -O -inline -c -ofmillionSqrt.o millionSqrt.d
gcc -o millionSqrt -m32 millionSqrt.o -L/Users/russel/lib.Darwin.ix86 -L/Users/russel/lib.Darwin.ix86/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm
dmd -I. -I/Users/russel/lib/D -m32 -release -O -inline -c -ofparallelSort.o parallelSort.d
gcc -o parallelSort -m32 parallelSort.o -L/Users/russel/lib.Darwin.ix86 -L/Users/russel/lib.Darwin.ix86/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm
dmd -I. -I/Users/russel/lib/D -m32 -release -O -inline -c -ofpipelining.o pipelining.d
gcc -o pipelining -m32 pipelining.o -L/Users/russel/lib.Darwin.ix86 -L/Users/russel/lib.Darwin.ix86/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm
runEverything(["runall"], ["euclidean", "matrixInversion", "millionSqrt", "parallelSort", "pipelining"])

========  euclidean  ============
Serial reduce:  1426 milliseconds.
Parallel reduce with 2 cores:  644 milliseconds.

========  matrixInversion  ============
Inverted a 256 x 256 matrix serially in 109 milliseconds.
Inverted a 256 x 256 matrix using 2 cores in 65 milliseconds.

========  millionSqrt  ============
Parallel benchmarks being done with 2 cores.
Did serial millionSqrt in 4122 milliseconds.
Did parallel foreach millionSqrt in 2094 milliseconds.
Did parallel map millionSqrt in 2110 milliseconds.

========  parallelSort  ============
Serial quick sort:  8547 milliseconds.
Parallel quick sort:  4654 milliseconds.

========  pipelining  ============
Did serial string -> float, euclid in 3021 milliseconds.
Did parallel string -> float, euclid with 2 cores in 1712 milliseconds.

scons: done building targets.

--
Russel.
Feb 27 2011
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:

32-bit mode on a 8-core (twin Xeon) Linux box.  That core.cpuid bug
really, really sucks.

I see matrix inversion takes longer with 4 cores than with 1!

|> scons runall
/usr/bin/python /home/Checkouts/Mercurial/SCons/bootstrap/src/script/scons.py runall
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
dmd -I. -I/home/users/russel/lib/D -m32 -release -O -inline -c -ofeuclidean.o euclidean.d
gcc -o euclidean -m32 euclidean.o -L/home/users/russel/lib.Linux.ix86 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m32 -release -O -inline -c -ofmatrixInversion.o matrixInversion.d
gcc -o matrixInversion -m32 matrixInversion.o -L/home/users/russel/lib.Linux.ix86 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m32 -release -O -inline -c -ofmillionSqrt.o millionSqrt.d
gcc -o millionSqrt -m32 millionSqrt.o -L/home/users/russel/lib.Linux.ix86 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m32 -release -O -inline -c -ofparallelSort.o parallelSort.d
gcc -o parallelSort -m32 parallelSort.o -L/home/users/russel/lib.Linux.ix86 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm -lrt
dmd -I. -I/home/users/russel/lib/D -m32 -release -O -inline -c -ofpipelining.o pipelining.d
gcc -o pipelining -m32 pipelining.o -L/home/users/russel/lib.Linux.ix86 -L/home/users/russel/lib.Linux.x86_64/DMD2/lib32 -lparallelism -lphobos2 -lpthread -lm -lrt
runEverything(["runall"], ["euclidean", "matrixInversion", "millionSqrt", "parallelSort", "pipelining"])

========  euclidean  ============
Serial reduce:  1104 milliseconds.
Parallel reduce with 4 cores:  324 milliseconds.

========  matrixInversion  ============
Inverted a 256 x 256 matrix serially in 60 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.

========  millionSqrt  ============
Parallel benchmarks being done with 4 cores.
Did serial millionSqrt in 980 milliseconds.
Did parallel foreach millionSqrt in 355 milliseconds.
Did parallel map millionSqrt in 249 milliseconds.

========  parallelSort  ============
Serial quick sort:  5191 milliseconds.
Parallel quick sort:  1913 milliseconds.

========  pipelining  ============
Did serial string -> float, euclid in 2236 milliseconds.
Did parallel string -> float, euclid with 4 cores in 1528 milliseconds.

scons: done building targets.


--
Russel.
Feb 27 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/27/2011 8:03 AM, Russel Winder wrote:
 32-bit mode on a 8-core (twin Xeon) Linux box.  That core.cpuid bug
 really, really sucks.

 I see matrix inversion takes longer with 4 cores than with 1!

Can you please re-run the benchmark to make sure that this isn't just a one-time anomaly? I can't seem to make the parallel matrix inversion run slower than serial on my hardware, even with ridiculous tuning parameters that I was almost sure would bottleneck the thing on the task queue. Also, all the other benchmarks actually look pretty good. It's possible that machines with multiple physical CPUs are much more likely to bottleneck on the task queue because synchronized blocks cost a few more clock cycles. It's also possible that stack alignment issues are creeping in somewhere I hadn't anticipated, or that using 4 cores instead of two on a fairly fine-grained benchmark is enough to bottleneck on the queue (though I doubt this because this benchmark worked well for others with quad cores).
Feb 27 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 2/27/2011 9:48 AM, dsimcha wrote:
 On 2/27/2011 8:03 AM, Russel Winder wrote:
 32-bit mode on a 8-core (twin Xeon) Linux box. That core.cpuid bug
 really, really sucks.

 I see matrix inversion takes longer with 4 cores than with 1!


Actually, I am able to reproduce this, but only on Linux, and I think I figured out why. I think it's related to my Posix workaround for Bug 3753 (http://d.puremagic.com/issues/show_bug.cgi?id=3753). This workaround causes GC heap allocations to occur in a loop inside the matrix inversion routine (one for each call to parallel(), so 256 over the course of the benchmark). This was intended to be a very quick and dirty workaround for a DMD bug that I thought would get fixed a long time ago. It also seemed good enough at the time because I was using this lib for very coarse grained parallelism, where the effect is negligible.

Originally, I was using alloca() all over the place to efficiently deal with memory management. However, under Posix, I ran into Bug 3753 a long time ago and put in the following workaround, which simply forwards alloca() calls to the GC. From near the top of parallelism.d:

    // Workaround for bug 3753.
    version(Posix) {
        // Can't use alloca() because it can't be used with exception
        // handling.
        // Use the GC instead even though it's slightly less efficient.
        void* alloca(size_t nBytes) {
            return GC.malloc(nBytes);
        }
    } else {
        // Can really use alloca().
        import core.stdc.stdlib : alloca;
    }

In this particular use case the performance hit is probably substantial. There are ways to mitigate it (maybe having TaskPool maintain a free list, etc.), but I can't bring myself to put a lot of effort into optimizing a workaround for a compiler bug.
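
For illustration only, the free-list idea mentioned above could look roughly like the sketch below. BufferFreeList is a hypothetical helper, not something std.parallelism contains, and it is deliberately not thread-safe; a real pool would have to guard it with a lock or keep one list per thread.

```d
import core.memory : GC;

// Hypothetical helper: recycles fixed-size buffers instead of asking the
// GC for a fresh one on every call. Not thread-safe.
struct BufferFreeList
{
    private void*[] spare;     // buffers handed back by release()
    private size_t blockSize;  // every buffer has this size

    this(size_t blockSize) { this.blockSize = blockSize; }

    // Reuse a spare buffer if there is one, otherwise allocate from the GC.
    void* acquire()
    {
        if (spare.length > 0)
        {
            auto p = spare[$ - 1];
            spare = spare[0 .. $ - 1];
            return p;
        }
        return GC.malloc(blockSize);
    }

    // Hand a buffer back for later reuse.
    void release(void* p)
    {
        spare.assumeSafeAppend();
        spare ~= p;
    }
}

void main()
{
    auto list = BufferFreeList(64);
    auto first = list.acquire();
    list.release(first);
    auto second = list.acquire();
    assert(first is second); // the second request reused the first buffer
}
```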
Feb 27 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
I've looked into this more.  I realized that I'm only able to reproduce 
it when running Linux in a VM on top of Windows.  When I reboot and run 
my Linux distro in bare metal instead, I get decent (but not linear) 
speedups on the matrix benchmark.  I'm guessing this is due to things 
like locking and context switches being less efficient/more expensive in 
a VM than on bare metal.  In your case, having two physical CPUs in 
separate sockets probably makes the atomic ops required for locking, 
context switches, etc. more expensive.  From fiddling around, the GC 
thing actually appears to be a non-issue.

Since only the inner loop, not the outer loop, is easily parallelizable, 
I think a 256x256 matrix is really at the very edge of what's feasible 
in terms of fine-grainedness.  Each iteration of the outer loop only 
takes on the order of half a millisecond, in serial.  This means we're 
trying to parallelize an inner loop that only takes on the order of half 
a CPU-millisecond to run.  (This is the cost of the whole loop, start to 
finish, not the cost of one iteration.)  Slight changes in the costs of 
various primitives (or having more cores to contest locks, invoke 
context switches, etc.) can have a huge effect.  I've changed to using a 
1024x1024 matrix instead, although this seems to be somewhat memory 
bandwidth-bound.
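
The shape of the problem can be sketched as follows. This is a generic elimination step written for illustration, not the code of the matrixInversion benchmark; eliminateColumn and its workUnitSize parameter are assumptions. The outer loop over pivots has to stay serial, and only the loop over rows inside one step is handed to parallel(), so the per-step work has to be large enough to pay for the scheduling.

```d
import std.parallelism : parallel;
import std.range : iota;

// One elimination step: clear column `pivot` in every row except the
// pivot row. Each iteration writes a different row and only reads the
// pivot row, so the iterations are independent.
void eliminateColumn(double[][] m, size_t pivot, size_t workUnitSize)
{
    foreach (r; parallel(iota(m.length), workUnitSize))
    {
        if (r != pivot)
        {
            immutable factor = m[r][pivot] / m[pivot][pivot];
            m[r][] -= factor * m[pivot][];
        }
    }
}

void main()
{
    auto m = [[2.0, 1.0], [4.0, 5.0]];
    eliminateColumn(m, 0, 1);
    assert(m == [[2.0, 1.0], [0.0, 3.0]]);
}
```

A larger workUnitSize means fewer trips to the task queue per step, at the cost of coarser load balancing; which value wins depends on the hardware, as the discussion above suggests.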

As a general statement, these benchmarks are much more fine-grained than 
what I use std.parallelism for in the real world, both because 
fine-grained examples were the only simple, non-domain-specific, 
dependency-free ones I could think of and to show that std.parallelism 
works reasonably well (though certainly not perfectly) even with fairly 
fine-grained parallelism.  The unfortunate reality, though, is that this 
kind of micro-parallelism is hard to implement efficiently and will 
probably always (on every lib, not just mine) have performance 
characteristics that are highly dependent on hardware, OS primitives, 
etc. and require some tuning.  This isn't to say that std.parallelism is 
the best micro-parallelism lib out there, just that I highly doubt that 
efficient general-case micro-parallelism is a totally solved problem, or 
is even practically solvable, and that these benchmarks illustrate a 
far-from-ideal case.

On 2/27/2011 1:44 PM, Russel Winder wrote:
 David,

 On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
 [ . . . ]
 Can you please re-run the benchmark to make sure that this isn't just a
 one-time anomaly?  I can't seem to make the parallel matrix inversion
 run slower than serial on my hardware, even with ridiculous tuning
 parameters that I was almost sure would bottleneck the thing on the task
 queue.  Also, all the other benchmarks actually look pretty good.

Sadly the result is consistent :-(

|> matrixInversion
Inverted a 256 x 256 matrix serially in 60 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 61 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 59 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|>

Feb 27 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/26/2011 4:13 PM, dsimcha wrote:
 One last note: Due to Bug 5612
 (http://d.puremagic.com/issues/show_bug.cgi?id=5612), the benchmarks
 don't work on 64-bit because core.cpuid won't realize that your CPU is
 multicore. There are two ways around this. One is to use 32-bit mode.
 The other is to change the benchmark files to manually set the number of
 cores by setting the defaultPoolThreads property.

I realized the obvious kludge and "fixed" this. Now, all benchmarks take a --nCpu command line argument that allows you to set the number of cores manually. This is an absolute must if running on 64. If you don't set this, the default is whatever core.cpuid.coresPerCPU returns, and on 64 bit this will always be 1.
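
A sketch of how such a flag might be wired up, assuming std.getopt and the defaultPoolThreads property of current Phobos. Only the flag name --nCpu comes from the post; parseNCpu is a made-up helper, and treating nCpu - 1 as the worker count (because the main thread also works) is an assumption about the benchmarks, not a description of them.

```d
import std.getopt : getopt;
import std.parallelism : defaultPoolThreads, taskPool;
import std.stdio : writeln;

// Hypothetical helper: returns the value of --nCpu, or 0 if it was not given.
uint parseNCpu(string[] args)
{
    uint nCpu = 0;
    getopt(args, "nCpu", &nCpu);
    return nCpu;
}

void main(string[] args)
{
    immutable nCpu = parseNCpu(args);

    // Only override the default when the user asked for it, and do so
    // before taskPool is touched for the first time.
    if (nCpu > 0)
        defaultPoolThreads = nCpu - 1;

    writeln("worker threads: ", taskPool.size);
}
```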
Feb 27 2011
next sibling parent reply Russel Winder <russel russel.org.uk> writes:

David,

On Sun, 2011-02-27 at 10:40 -0500, dsimcha wrote:

 I realized the obvious kludge and "fixed" this.  Now, all benchmarks
 take a --nCpu command line argument that allows you to set the number of
 cores manually.  This is an absolute must if running on 64.  If you
 don't set this, the default is whatever core.cpuid.coresPerCPU returns,
 and on 64 bit this will always be 1.

I agree it is sad to have had to do this, core.cpuid.coresPerCPU should be fixed and fixed asap.

In the interim using "--nCpu 8" with the 64-bit build on Ubuntu Maverick, I get:

========  euclidean  ============
Serial reduce:  1154 milliseconds.
Parallel reduce with 8 cores:  266 milliseconds.

========  matrixInversion  ============
Inverted a 256 x 256 matrix serially in 62 milliseconds.
Inverted a 256 x 256 matrix using 8 cores in 88 milliseconds.

========  millionSqrt  ============
Parallel benchmarks being done with 8 cores.
Did serial millionSqrt in 957 milliseconds.
Did parallel foreach millionSqrt in 212 milliseconds.
Did parallel map millionSqrt in 164 milliseconds.

========  parallelSort  ============
Serial quick sort:  4548 milliseconds.
Parallel quick sort:  1502 milliseconds.

========  pi  ============
Calculated pi = 3.141596556 in 172 milliseconds serially.
Calculated pi = 3.141596794 in 41 milliseconds using 8 cores.

========  pipelining  ============
Did serial string -> float, euclid in 2047 milliseconds.
Did parallel string -> float, euclid with 8 cores in 1756 milliseconds.

I get the feeling that the pi calculation is showing that perhaps the computations are too short to show proper scaling?

--
Russel.
Feb 27 2011
parent reply Don <nospam nospam.com> writes:
Russel Winder wrote:
 David,
 
 On Sun, 2011-02-27 at 10:40 -0500, dsimcha wrote:
 
 I realized the obvious kludge and "fixed" this.  Now, all benchmarks 
 take a --nCpu command line argument that allows you to set the number of 
 cores manually.  This is an absolute must if running on 64.  If you 
 don't set this, the default is whatever core.cpuid.coresPerCPU returns, 
 and on 64 bit this will always be 1.

I agree it is sad to have had to do this, core.cpuid.coresPerCPU should be fixed and fixed asap.

As the name says, it is * cores per CPU *. That is _not_ the same as the total number of cores in the machine. The documentation probably doesn't state that strongly enough, but currently there is no known bug in the implementation of core.cpuid.coresPerCPU. It's an OS issue, anyway: just because a machine has 64 cores, is no guarantee that it will give you access to all of them. std.parallelism shouldn't be using core.cpuid for that purpose.
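
The distinction can be seen by printing the values side by side. This sketch assumes current druntime and Phobos: coresPerCPU and threadsPerCPU come from core.cpuid, while totalCPUs is a std.parallelism symbol that was added after this discussion and asks the operating system instead.

```d
import core.cpuid : coresPerCPU, threadsPerCPU;
import std.parallelism : totalCPUs;
import std.stdio : writeln;

void main()
{
    // Per-package figures, read from the CPU itself.
    writeln("core.cpuid.coresPerCPU:    ", coresPerCPU);
    writeln("core.cpuid.threadsPerCPU:  ", threadsPerCPU);

    // Machine-wide figure, as reported by the OS.
    writeln("std.parallelism.totalCPUs: ", totalCPUs);
}
```

On a multi-socket machine such as the twin-Xeon box in this thread, the first and last lines would be expected to differ.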
Feb 28 2011
next sibling parent Daniel Gibson <metalcaedes gmail.com> writes:
On 28.02.2011 10:41, Don wrote:
 Russel Winder wrote:
 David,

 On Sun, 2011-02-27 at 10:40 -0500, dsimcha wrote:

 I realized the obvious kludge and "fixed" this. Now, all benchmarks
 take a --nCpu command line argument that allows you to set the number
 of cores manually. This is an absolute must if running on 64. If you
 don't set this, the default is whatever core.cpuid.coresPerCPU
 returns, and on 64 bit this will always be 1.

I agree it is sad to have had to do this, core.cpuid.coresPerCPU should be fixed and fixed asap.

As the name says, it is * cores per CPU *. That is _not_ the same as the total number of cores in the machine. The documentation probably doesn't state that strongly enough, but currently there is no known bug in the implementation of core.cpuid.coresPerCPU. It's an OS issue, anyway: just because a machine has 64 cores, is no guarantee that it will give you access to all of them. std.parallelism shouldn't be using core.cpuid for that purpose.

Having a platform-independent way (=> core.cpuid.availableCores wrapping whatever is necessary on specific platforms) to get the number of available cores would probably be a useful addition.

Cheers,
- Daniel
Feb 28 2011
prev sibling parent reply Don <nospam nospam.com> writes:
Russel Winder wrote:
 On Mon, 2011-02-28 at 10:41 +0100, Don wrote:
 [ . . . ]
 As the name says, it is * cores per CPU *. That is _not_ the same as the 
 total number of cores in the machine.

I guess then the missing extension is to have a function that returns an array of processor references so that the core count can be measured?

Yes.
 
 The documentation probably doesn't state that strongly enough, but 
 currently there is no known bug in the implementation of 
 core.cpuid.coresPerCPU.

OK so my complaints were somewhat misdirected, apologies. The fact remains though that it seems there is no way in D or Phobos to pick up the number of "processors" available to a Linux or Mac OS X machine -- I guess the same applies to Windows but I personally don't use that OS so don't really care about that one ;-)
 It's an OS issue, anyway: just because a machine has 64 cores, is no 
 guarantee that it will give you access to all of them. std.parallelism 
 shouldn't be using core.cpuid for that purpose.

I had thought you were arguing that D/Phobos couldn't rely on waiting for OS functionality, that it had to use direct access to the processor techniques. Maybe I misunderstood what you were saying. If so, my apologies.

The primary purpose of core.cpuid is to determine which instruction set is supported. Obviously this needs to happen at a very early stage in the runtime initialization. It turns out that on x86 (and Itanium), it is possible to determine the cache sizes as well, and this can be used to optimize array operations. It would very nice if you could get the number of processors at the same time, but it's just not possible.
 Linux claims to have 8 processors on my twin-Xeon box.  std.parallelism
 therefore needs some form of support from D and/or Phobos to say "how
 many parallel threads can this platform sustain".  All other parallelism
 frameworks I use handle this no problem.
 
 I accept your argument about core.cpuid and so will not investigate it
 further, other than to say it needs to work on 64-bit processors as well
 as 32-bit ones.  The campaign now must be to have an OS query capability
 to find the number of processors.

Yes, it definitely needs to be a priority. It's a fundamental function for a modern system programming language. I'm not sure where this stuff should go, probably not into core.cpuid. I could imagine something with a name like std.sysinfo, which dealt with system configuration, and also provided OS-independent abstractions for things like setting process affinity. I think it should go into std.parallelism for now.
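To make the per-package versus machine-wide distinction concrete, here is a minimal sketch in D. It assumes a Phobos release in which std.parallelism exposes totalCPUs, an OS-reported count that was not yet public at the time of this thread; coresPerCPU and threadsPerCPU are the core.cpuid properties discussed above. The numbers printed depend entirely on the machine, so no expected output is shown.

```d
import core.cpuid : coresPerCPU, threadsPerCPU;
import std.parallelism : totalCPUs;
import std.stdio : writefln;

void main()
{
    // Per-package figures, read from the CPUID instruction (x86 only).
    immutable uint cores = coresPerCPU;
    immutable uint threads = threadsPerCPU;
    writefln("core.cpuid: %s cores, %s threads per CPU package", cores, threads);

    // Machine-wide figure, as reported by the operating system.
    // Assumption: totalCPUs is available in the Phobos version in use.
    immutable uint total = totalCPUs;
    writefln("OS reports: %s logical CPUs in total", total);
}
```

On a multi-socket box the two figures can legitimately differ, which is the point made above: the first describes one package, the second what the OS will schedule on.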
Feb 28 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/28/2011 7:22 AM, Don wrote:
 Russel Winder wrote:
 I accept your argument about core.cpuid and so will not investigate it
 further, other than to say it needs to work on 64-bit processors as well
 as 32-bit ones. The campaign now must be to have an OS query capability
 to find the number of processors.

Yes, it definitely needs to be a priority. It's a fundamental function for a modern system programming language. I'm not sure where this stuff should go, probably not into core.cpuid. I could imagine something with a name like std.sysinfo, which dealt with system configuration, and also provided OS-independent abstractions for things like setting process affinity. I think it should go into std.parallelism for now.

Done. This was actually much easier than I thought. I didn't document/expose it, though, because I didn't put any thought into creating an API for it. I just implemented the bare minimum to make std.parallelism work properly. https://github.com/dsimcha/std.parallelism/commit/db7751ba436af3f7ffcaa1b65070f3981a75f98a This code is tested (at least on my hardware) on Windows 7 and Ubuntu 10.10 in both 32 and 64 mode. I did not test on Mac OS because I don't own any such hardware, though it **should** work because Mac OS is also POSIX. Someone please confirm. BTW, I don't imagine we care about supporting ancient (pre-Windows 2000) Windows. The Windows code will only work for Win2k and up.
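For readers who want to see the general shape of such an OS query, here is a rough sketch in D. It is not the code from the commit above: the helper name onlineLogicalCPUs is hypothetical, the Linux branch uses sysconf(_SC_NPROCESSORS_ONLN), the Windows branch uses GetSystemInfo, and every other platform falls back to std.parallelism.totalCPUs (assumed to be available in the Phobos version in use).

```d
import std.stdio : writeln;

/// Hypothetical helper (not a Phobos name): number of logical CPUs
/// the operating system currently reports as online.
uint onlineLogicalCPUs()
{
    version (Windows)
    {
        import core.sys.windows.windows;

        SYSTEM_INFO si;
        GetSystemInfo(&si);            // fills in dwNumberOfProcessors
        return si.dwNumberOfProcessors;
    }
    else version (linux)
    {
        import core.sys.posix.unistd : sysconf, _SC_NPROCESSORS_ONLN;

        immutable n = sysconf(_SC_NPROCESSORS_ONLN);
        return n > 0 ? cast(uint) n : 1;   // sysconf returns -1 on error
    }
    else
    {
        // Assumption: fall back to what std.parallelism itself reports.
        import std.parallelism : totalCPUs;

        return totalCPUs;
    }
}

void main()
{
    writeln("logical CPUs online: ", onlineLogicalCPUs());
}
```

Keeping each platform in its own version block means an unsupported platform takes the fallback branch instead of failing on an undefined identifier.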
Feb 28 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Russel Winder (russel russel.org.uk)'s article
 std.parallelism.d fails to compile on Mac OS X 32-bit:

Crap. Didn't notice that half of unistd.d is in version(Linux) blocks. Will fix later, but don't have time now and am waiting on a stackoverflow answer to figure out where the proper API is.
Feb 28 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/28/2011 10:14 AM, Russel Winder wrote:
 This code is tested (at least on my hardware) on Windows 7 and Ubuntu
 10.10 in both 32 and 64 mode.  I did not test on Mac OS because I don't
 own any such hardware, though it **should** work because Mac OS is also
 POSIX.  Someone please confirm.

std.parallelism.d fails to compile on Mac OS X 32-bit:

https://github.com/dsimcha/std.parallelism/commit/dc1aae7560d46126c7db2ed6dbb56e8309c7f9e9 Please let me know if it works. Again, it is completely untested because I don't have a Mac box to test it on. I did the best I could by reading various documentation.
Feb 28 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
Ok, so that's one issue to cross off the list.  To summarize the discussion so
far, most of it's revolved around the issue of automatically determining how
many
CPUs are available and therefore how many threads the default pool should have.
Previously, std.parallelism had been using core.cpuid for this task.  This
module
doesn't work yet on 64 bits and doesn't and isn't supposed to determine how many
sockets/physical CPUs are available.  This was a point of miscommunication.

std.parallelism now uses OS-specific APIs to determine the total number of cores
available across all physical CPUs.  This appears to Just Work (TM) on 32-bit
Windows, 32- and 64-bit Linux, and 32-bit Mac OS.

We still need a volunteer to manage the review process.  As a reminder, for
those
of you who have been meaning to have a look but haven't, the Git repository is
at:

https://github.com/dsimcha/std.parallelism

The pre-compiled documentation is at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
Mar 01 2011
next sibling parent reply jasonw <user webmails.org> writes:
dsimcha Wrote:

 Ok, so that's one issue to cross off the list.  To summarize the discussion so
 far, most of it's revolved around the issue of automatically determining how
many
 CPUs are available and therefore how many threads the default pool should have.
 Previously, std.parallelism had been using core.cpuid for this task.  This
module
 doesn't work yet on 64 bits and doesn't and isn't supposed to determine how
many
 sockets/physical CPUs are available.  This was a point of miscommunication.
 
 std.parallelism now uses OS-specific APIs to determine the total number of
cores
 available across all physical CPUs.  This appears to Just Work (TM) on 32-bit
 Windows, 32- and 64-bit Linux, and 32-bit Mac OS.

Does a Hyperthreading machine have 2x as many cores & worker threads? On the Pentium 4, HT might reduce throughput; on the Core i7, it might increase it.
Mar 01 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from jasonw (user webmails.org)'s article
 dsimcha Wrote:
 Ok, so that's one issue to cross off the list.  To summarize the discussion so
 far, most of it's revolved around the issue of automatically determining how
many
 CPUs are available and therefore how many threads the default pool should have.
 Previously, std.parallelism had been using core.cpuid for this task.  This
module
 doesn't work yet on 64 bits and doesn't and isn't supposed to determine how
many
 sockets/physical CPUs are available.  This was a point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total number of
cores
 available across all physical CPUs.  This appears to Just Work (TM) on 32-bit
 Windows, 32- and 64-bit Linux, and 32-bit Mac OS.


Someone please check on this for me. I'd assume that these OS functions return the number of logical CPUs, but the documentation doesn't really seem to say and I don't have the relevant hardware.
Mar 01 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 01.03.2011 20:19, schrieb dsimcha:
 == Quote from jasonw (user webmails.org)'s article
 dsimcha Wrote:
 Ok, so that's one issue to cross off the list.  To summarize the discussion so
 far, most of it's revolved around the issue of automatically determining how
many
 CPUs are available and therefore how many threads the default pool should have.
 Previously, std.parallelism had been using core.cpuid for this task.  This
module
 doesn't work yet on 64 bits and doesn't and isn't supposed to determine how
many
 sockets/physical CPUs are available.  This was a point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total number of
cores
 available across all physical CPUs.  This appears to Just Work (TM) on 32-bit
 Windows, 32- and 64-bit Linux, and 32-bit Mac OS.


Someone please check on this for me. I'd assume that these OS functions return the number of logical CPUs, but the documentation doesn't really seem to say and I don't have the relevant hardware.

Who cares, the Pentium4 sucks anyway :P Intel decided to implement Hyperthreading and to report 2 cores when there really is only one, so it should be treated like 2 cores. If this makes things slower.. bad luck (Why did Intel introduce Hyperthreading if it makes things slower, anyway?). Pentium4 (and probably also Pentium D, it's a dual-core Pentium4) users still can set the number of worker threads manually. Furthermore people (hopefully!) don't use Pentium4 for serious heavy calculations. Cheers, - Daniel
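Setting the number of worker threads by hand could look roughly like the sketch below. It assumes the std.parallelism API as it later shipped in Phobos (defaultPoolThreads, taskPool, TaskPool and its size property); the helper sumOnPool is hypothetical and exists only for illustration.

```d
import std.parallelism : TaskPool, defaultPoolThreads, taskPool;
import std.range : iota;
import std.stdio : writeln;

/// Hypothetical helper: sums 1..100 on a private pool with
/// `workers` worker threads.
int sumOnPool(uint workers)
{
    auto pool = new TaskPool(workers);
    scope (exit) pool.finish();   // let the workers terminate once idle
    return pool.reduce!"a + b"(iota(1, 101));
}

void main()
{
    // Option 1: size the global pool. This has to happen before
    // taskPool is used for the first time.
    defaultPoolThreads = 1;
    writeln("global pool workers: ", taskPool.size);

    // Option 2: a private pool of an explicit size.
    writeln("sum = ", sumOnPool(2));
}
```

A private pool leaves the global default untouched, which may suit a user who only wants fewer threads for one workload on a Hyperthreading machine.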
Mar 01 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore how
 many threads the default pool should have. Previously, std.parallelism
 had been using core.cpuid for this task.  This module doesn't work yet
 on 64 bits and doesn't and isn't supposed to determine how many
 sockets/physical CPUs are available.  This was a point of
 miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total number
 of cores available across all physical CPUs.  This appears to Just Work
 (TM) on 32-bit Windows, 32- and 64-bit Linux, and 32-bit Mac OS.

 We still need a volunteer to manage the review process.  As a reminder,
 for those of you who have been meaning to have a look but haven't, the
 Git repository is at:

 https://github.com/dsimcha/std.parallelism

 The pre-compiled documentation is at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following: - We give it one more week for the final review, starting today, 4 March. - If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March. How does this sound? -Lars

I suggest let's make the review three weeks and the vote one week. Andrei
Mar 04 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/11 11:52 AM, Russel Winder wrote:
 On Fri, 2011-03-04 at 09:27 -0600, Andrei Alexandrescu wrote:
 [ . . . ]
 - We give it one more week for the final review, starting today, 4 March.
 - If this review does not lead to major API changes, we start the vote
 next Friday, 11 March.  Vote closes after one week, 18 March.

 How does this sound?

 -Lars

I suggest let's make the review three weeks and the vote one week.

I am guessing there is a constituency of eligible voters, i.e. there is not universal suffrage.

Like in Boost, anyone on this list can participate in discussions and vote. In theory this system can be abused (i.e. by multiple accounts, sock puppets etc.) but in practice it works fairly well. Andrei
Mar 04 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 04, 2011 09:52:17 Russel Winder wrote:
 On Fri, 2011-03-04 at 09:27 -0600, Andrei Alexandrescu wrote:
 [ . . . ]
 
 - We give it one more week for the final review, starting today, 4
 March. - If this review does not lead to major API changes, we start
 the vote next Friday, 11 March.  Vote closes after one week, 18 March.
 
 How does this sound?
 
 -Lars

I suggest let's make the review three weeks and the vote one week.

I am guessing there is a constituency of eligible voters, i.e. there is not universal suffrage.

We've never really discussed that. Thus far, anyone who posted on the newsgroup could vote. Now, if there were a bunch of votes from unknown folks and that definitely shifted the vote, then I would fully expect those votes to be thrown out or the vote redone or whatnot (if nothing else, they could be sock puppets). But it's not like we've selected a list of people and said that they were the ones allowed to vote. The few times that we've voted on including something in Phobos thus far, it hasn't been an issue. - Jonathan M Davis
Mar 04 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore how
 many threads the default pool should have. Previously, std.parallelism
 had been using core.cpuid for this task.  This module doesn't work yet
 on 64 bits and doesn't and isn't supposed to determine how many
 sockets/physical CPUs are available.  This was a point of
 miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total number
 of cores available across all physical CPUs.  This appears to Just Work
 (TM) on 32-bit Windows, 32- and 64-bit Linux, and 32-bit Mac OS.

 We still need a volunteer to manage the review process.  As a reminder,
 for those of you who have been meaning to have a look but haven't, the
 Git repository is at:

 https://github.com/dsimcha/std.parallelism

 The pre-compiled documentation is at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following: - We give it one more week for the final review, starting today, 4 March. - If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March. How does this sound? -Lars

Andrei

This sounds reasonable. Should I be doing anything besides following the thread and reacting accordingly?
Mar 04 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/11 12:34 PM, dsimcha wrote:
 This sounds reasonable.  Should I be doing anything besides following the
thread
 and reacting accordingly?

Basically yes. Here's a good set of notes: http://www.boost.org/community/reviews.html#Review_Manager Don't forget that the ultimate accept/reject decision is taken by you, of course ideally best reflecting the discussion and votes. Andrei
Mar 04 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Lars T. Kyllingstad (public kyllingen.NOSPAMnet)'s article
 On Fri, 04 Mar 2011 18:34:39 +0000, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore
 how many threads the default pool should have. Previously,
 std.parallelism had been using core.cpuid for this task.  This
 module doesn't work yet on 64 bits and doesn't and isn't supposed to
 determine how many sockets/physical CPUs are available.  This was a
 point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total
 number of cores available across all physical CPUs.  This appears to
 Just Work (TM) on 32-bit Windows, 32- and 64-bit Linux, and 32-bit
 Mac OS.

 We still need a volunteer to manage the review process.  As a
 reminder, for those of you who have been meaning to have a look but
 haven't, the Git repository is at:

 https://github.com/dsimcha/std.parallelism

 The pre-compiled documentation is at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following: - We give it one more week for the final review, starting today, 4 March. - If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March. How does this sound? -Lars

Andrei

This sounds reasonable.

-Lars

But then official "judgement day" will be April Fool's Day. I don't want anyone thinking std.parallelism is an April Fool's joke.
Mar 04 2011
prev sibling parent Russel Winder <russel russel.org.uk> writes:

On Fri, 2011-03-04 at 10:10 -0800, Jonathan M Davis wrote:
[ . . . ]
 We've never really discussed that. Thus far, anyone who posted on the newsgroup
 could vote. Now, if there were a bunch of votes from unknown folks and that
 definitely shifted the vote, then I would fully expect those votes to be thrown
 out or the vote redone or whatnot (if nothing else, they could be sock puppets).
 But it's not like we've selected a list of people and said that they were the
 ones allowed to vote. The few times that we've voted on including something in
 Phobos thus far, it hasn't been an issue.

Works for me. It's much nicer to have an informal system that works -- as long as it is possible to tell when the system is being subverted. Presumably this is a four-state vote:

+1 approve
0 cannot decide
-1 disapprove
-- no opinion

Anyone not emailing is deemed to have cast a -- vote all of which are automatically discarded. Votes such as +100 will presumably be renormalized to +1.

--
Russel.
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: russel russel.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Mar 04 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday 28 February 2011 06:39:02 dsimcha wrote:
 On 2/28/2011 7:22 AM, Don wrote:
 Russel Winder wrote:
 I accept your argument about core.cpuid and so will not investigate it
 further, other than to say it needs to work on 64-bit processors as well
 as 32-bit ones. The campaign now must be to have an OS query capability
 to find the number of processors.

Yes, it definitely needs to be a priority. It's a fundamental function for a modern system programming language. I'm not sure where this stuff should go, probably not into core.cpuid. I could imagine something with a name like std.sysinfo, which dealt with system configuration, and also provided OS-independent abstractions for things like setting process affinity. I think it should go into std.parallelism for now.

Done. This was actually much easier than I thought. I didn't document/expose it, though, because I didn't put any thought into creating an API for it. I just implemented the bare minimum to make std.parallelism work properly. https://github.com/dsimcha/std.parallelism/commit/db7751ba436af3f7ffcaa1b65070f3981a75f98a This code is tested (at least on my hardware) on Windows 7 and Ubuntu 10.10 in both 32 and 64 mode. I did not test on Mac OS because I don't own any such hardware, though it **should** work because Mac OS is also POSIX. Someone please confirm. BTW, I don't imagine we care about supporting ancient (pre-Windows 2000) Windows. The Windows code will only work for Win2k and up.

I think that we care enough that if it can easily be made to work on pre-Win2k, then we might as well do so, but if we can't we can't. I believe that there's at least one example of something in druntime or Phobos where something was done the way it was because it worked on pre-Win2k that way (though I can't recall where that code is at the moment). However, some of std.datetime wouldn't work pre-Win2k, because the system calls that it makes didn't exist pre-Win2k, and there isn't a way around that. On the other hand, std.datetime.WindowsTimeZone could have been improved with regards to DST if it were to assume Vista or later, but we obviously can't do that. At this point, I think that we need to assume that Phobos needs to work with XP and later for sure - and maybe Win2k (though I don't get the impression that Microsoft added much in the way of system calls with XP anyway) - but versions prior to that are supported only if we can reasonably do so. Some things just plain require Win2k or newer, and I think that for the most part it's reasonable to require Win2k or newer at this point. - Jonathan M Davis
Feb 28 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Mon, 2011-02-28 at 10:41 +0100, Don wrote:
[ . . . ]
 As the name says, it is * cores per CPU *. That is _not_ the same as the
 total number of cores in the machine.

I guess then the missing extension is to have a function that returns an array of processor references so that the core count can be measured?
 The documentation probably doesn't state that strongly enough, but
 currently there is no known bug in the implementation of
 core.cpuid.coresPerCPU.

OK so my complaints were somewhat misdirected, apologies. The fact remains though that it seems there is no way in D or Phobos to pick up the number of "processors" available to a Linux or Mac OS X machine -- I guess the same applies to Windows but I personally don't use that OS so don't really care about that one ;-)
 It's an OS issue, anyway: just because a machine has 64 cores, is no
 guarantee that it will give you access to all of them. std.parallelism
 shouldn't be using core.cpuid for that purpose.

I had thought you were arguing that D/Phobos couldn't rely on waiting for OS functionality, that it had to use direct access to the processor techniques. Maybe I misunderstood what you were saying. If so, my apologies.

Linux claims to have 8 processors on my twin-Xeon box. std.parallelism therefore needs some form of support from D and/or Phobos to say "how many parallel threads can this platform sustain". All other parallelism frameworks I use handle this no problem.

I accept your argument about core.cpuid and so will not investigate it further, other than to say it needs to work on 64-bit processors as well as 32-bit ones. The campaign now must be to have an OS query capability to find the number of processors.

Thanks.

--
Russel.
Feb 28 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Mon, 2011-02-28 at 09:39 -0500, dsimcha wrote:
[ . . . ]
 Done.  This was actually much easier than I thought.  I didn't
 document/expose it, though, because I didn't put any thought into
 creating an API for it.  I just implemented the bare minimum to make
 std.parallelism work properly.

 https://github.com/dsimcha/std.parallelism/commit/db7751ba436af3f7ffcaa1b=

Cool. Updated, and tested on Ubuntu 10.10 64-bit and it seems to work fine.
 This code is tested (at least on my hardware) on Windows 7 and Ubuntu
 10.10 in both 32 and 64 mode.  I did not test on Mac OS because I don't
 own any such hardware, though it **should** work because Mac OS is also
 POSIX.  Someone please confirm.

std.parallelism.d fails to compile on Mac OS X 32-bit:

|> sh -x ./all.sh
+ root=parallelism
++ uname -s
+ platform=Darwin
+ dmd -m32 -lib std/parallelism.d
std/parallelism.d(83): Error: undefined identifier _SC_NPROCESSORS_ONLN
std/parallelism.d(100): Error: template core.atomic.cas(T,V1,V2) if (is(NakedType!(V1) == NakedType!(T)) && is(NakedType!(V2) == NakedType!(T))) does not match any function template declaration
std/parallelism.d(100): Error: template core.atomic.cas(T,V1,V2) if (is(NakedType!(V1) == NakedType!(T)) && is(NakedType!(V2) == NakedType!(T))) cannot deduce template function from argument types !()(shared(ubyte*),ubyte,ubyte)
std/parallelism.d(129): Error: template core.atomic.cas(T,V1,V2) if (is(NakedType!(V1) == NakedType!(T)) && is(NakedType!(V2) == NakedType!(T))) does not match any function template declaration
std/parallelism.d(129): Error: template core.atomic.cas(T,V1,V2) if (is(NakedType!(V1) == NakedType!(T)) && is(NakedType!(V2) == NakedType!(T))) cannot deduce template function from argument types !()(shared(ubyte*),ubyte,ubyte)
std/parallelism.d(133): Error: template core.atomic.atomicOp(string op,T,V1) if (is(NakedType!(V1) == NakedType!(T))) does not match any function template declaration
std/parallelism.d(133): Error: template core.atomic.atomicOp(string op,T,V1) if (is(NakedType!(V1) == NakedType!(T))) cannot deduce template function from argument types !("+=")(uint,uint)
std/parallelism.d(133): Error: template instance errors instantiating template

 BTW, I don't imagine we care about supporting ancient (pre-Windows 2000)
 Windows.  The Windows code will only work for Win2k and up.

I don't care about Windows of any sort ;-) More widely though I would guess saying "XP and above only" probably covers the field with the best cost/benefit.

--
Russel.
Feb 28 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Mon, 2011-02-28 at 18:54 -0500, dsimcha wrote:
 On 2/28/2011 10:14 AM, Russel Winder wrote:
 This code is tested (at least on my hardware) on Windows 7 and Ubuntu
 10.10 in both 32 and 64 mode.  I did not test on Mac OS because I don't
 own any such hardware, though it **should** work because Mac OS is also
 POSIX.  Someone please confirm.

std.parallelism.d fails to compile on Mac OS X 32-bit:

https://github.com/dsimcha/std.parallelism/commit/dc1aae7560d46126c7db2ed=

 Please let me know if it works.  Again, it is completely untested
 because I don't have a Mac box to test it on.  I did the best I could by
 reading various documentation.

Pulled, compiled and tried with the benchmarks, it all seems to work fine. Thanks for attacking this one quickly. Now all that is needed is a 64-bit DMD for Mac OS X, but that is SEP ;-)

--
Russel.
Feb 28 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Tue, 2011-03-01 at 13:06 -0500, jasonw wrote:
 dsimcha Wrote:

 Ok, so that's one issue to cross off the list.  To summarize the discussion so
 far, most of it's revolved around the issue of automatically determining how many
 CPUs are available and therefore how many threads the default pool should have.
 Previously, std.parallelism had been using core.cpuid for this task.  This module
 doesn't work yet on 64 bits and doesn't and isn't supposed to determine how many
 sockets/physical CPUs are available.  This was a point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total number of cores
 available across all physical CPUs.  This appears to Just Work (TM) on 32-bit
 Windows, 32- and 64-bit Linux, and 32-bit Mac OS.

Does a Hyperthreading machine have 2x as many cores & worker threads? On the Pentium 4, HT might reduce throughput; on the Core i7, it might increase it.

There appear to be Core i7s and other Core i7s. Some of them may actually be quad core, but some of them are dual core with 2 hyperthreads per core. To Java these dual-core-twin-hyperthread processors actually behave like 4 processors, to C, C++, etc. they look just like any other dual core processor. Sadly they announce themselves to Linux as 4 processors, which is clearly a lie.

--
Russel.
Mar 01 2011
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore how
 many threads the default pool should have. Previously, std.parallelism
 had been using core.cpuid for this task.  This module doesn't work yet
 on 64 bits and doesn't and isn't supposed to determine how many
 sockets/physical CPUs are available.  This was a point of
 miscommunication.
 
 std.parallelism now uses OS-specific APIs to determine the total number
 of cores available across all physical CPUs.  This appears to Just Work
 (TM) on 32-bit Windows, 32- and 64-bit Linux, and 32-bit Mac OS.
 
 We still need a volunteer to manage the review process.  As a reminder,
 for those of you who have been meaning to have a look but haven't, the
 Git repository is at:
 
 https://github.com/dsimcha/std.parallelism
 
 The pre-compiled documentation is at:
 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following: - We give it one more week for the final review, starting today, 4 March. - If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March. How does this sound? -Lars
Mar 04 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Fri, 2011-03-04 at 09:27 -0600, Andrei Alexandrescu wrote:
[ . . . ]
 - We give it one more week for the final review, starting today, 4 March.
 - If this review does not lead to major API changes, we start the vote
 next Friday, 11 March.  Vote closes after one week, 18 March.

 How does this sound?

 -Lars

I suggest let's make the review three weeks and the vote one week.

I am guessing there is a constituency of eligible voters, i.e. there is not universal suffrage.

--
Russel.
Mar 04 2011
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 04 Mar 2011 18:34:39 +0000, dsimcha wrote:

 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore
 how many threads the default pool should have. Previously,
 std.parallelism had been using core.cpuid for this task.  This
 module doesn't work yet on 64 bits and doesn't and isn't supposed to
 determine how many sockets/physical CPUs are available.  This was a
 point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total
 number of cores available across all physical CPUs.  This appears to
 Just Work (TM) on 32-bit Windows, 32- and 64-bit Linux, and 32-bit
 Mac OS.

 We still need a volunteer to manage the review process.  As a
 reminder, for those of you who have been meaning to have a look but
 haven't, the Git repository is at:

 https://github.com/dsimcha/std.parallelism

 The pre-compiled documentation is at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following:

- We give it one more week for the final review, starting today, 4 March.
- If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March.

How does this sound?

-Lars

I suggest let's make the review three weeks and the vote one week.

Andrei

This sounds reasonable.

3+1 weeks it is, then. I'll announce it in a separate thread.

-Lars
Mar 04 2011
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 04, 2011 12:53:56 dsimcha wrote:
 == Quote from Lars T. Kyllingstad (public kyllingen.NOSPAMnet)'s article
 
 On Fri, 04 Mar 2011 18:34:39 +0000, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 
 On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:
 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and therefore
 how many threads the default pool should have. Previously,
 std.parallelism had been using core.cpuid for this task.  This
 module doesn't work yet on 64 bits and doesn't and isn't supposed
 to determine how many sockets/physical CPUs are available.  This
 was a point of miscommunication.
 
 std.parallelism now uses OS-specific APIs to determine the total
 number of cores available across all physical CPUs.  This appears
 to Just Work (TM) on 32-bit Windows, 32- and 64-bit Linux, and
 32-bit Mac OS.
 
 We still need a volunteer to manage the review process.  As a
 reminder, for those of you who have been meaning to have a look but
 haven't, the Git repository is at:
 
 https://github.com/dsimcha/std.parallelism
 
 The pre-compiled documentation is at:
 
 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following:

- We give it one more week for the final review, starting today, 4 March.
- If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March.

How does this sound?

-Lars

I suggest let's make the review three weeks and the vote one week.

Andrei

This sounds reasonable.

3+1 weeks it is, then. I'll announce it in a separate thread.

-Lars

But then official "judgement day" will be April Fool's Day. I don't want anyone thinking std.parallelism is an April Fool's joke.

LOL. That was my thought exactly, though I doubt that anyone will really take it that way.

- Jonathan M Davis
Mar 04 2011
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 04 Mar 2011 20:53:56 +0000, dsimcha wrote:

 == Quote from Lars T. Kyllingstad (public kyllingen.NOSPAMnet)'s article
 On Fri, 04 Mar 2011 18:34:39 +0000, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 On 3/4/11 5:32 AM, Lars T. Kyllingstad wrote:
 On Tue, 01 Mar 2011 16:23:43 +0000, dsimcha wrote:

 Ok, so that's one issue to cross off the list.  To summarize the
 discussion so far, most of it's revolved around the issue of
 automatically determining how many CPUs are available and
 therefore how many threads the default pool should have.
 Previously, std.parallelism had been using core.cpuid for this
 task.  This module doesn't work yet on 64 bits and doesn't and
 isn't supposed to determine how many sockets/physical CPUs are
 available.  This was a point of miscommunication.

 std.parallelism now uses OS-specific APIs to determine the total
 number of cores available across all physical CPUs.  This appears
 to Just Work (TM) on 32-bit Windows, 32- and 64-bit Linux, and
 32-bit Mac OS.

 We still need a volunteer to manage the review process.  As a
 reminder, for those of you who have been meaning to have a look
 but haven't, the Git repository is at:

 https://github.com/dsimcha/std.parallelism

 The pre-compiled documentation is at:

 http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

I'll volunteer as the review manager. Since the module has been through a few reviews already, both in this group and on the Phobos mailing list, I don't think we need a lot more time for that. I suggest the following:

- We give it one more week for the final review, starting today, 4 March.
- If this review does not lead to major API changes, we start the vote next Friday, 11 March. Vote closes after one week, 18 March.

How does this sound?

-Lars

I suggest let's make the review three weeks and the vote one week.

Andrei

This sounds reasonable.


But then official "judgement day" will be April Fool's Day. I don't want anyone thinking std.parallelism is an April Fool's joke.

Too late. :)

-Lars
Mar 04 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 04 Mar 2011 15:53:56 -0500, dsimcha <dsimcha yahoo.com> wrote:


 But then official "judgement day" will be April Fool's Day. I don't want anyone
 thinking std.parallelism is an April Fool's joke.

IIRC, that day is reserved for the big release of preprocessor macros for D. We'll have to find another day.

-Steve
Mar 04 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

David,

On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
[ . . . ]
 Can you please re-run the benchmark to make sure that this isn't just a
 one-time anomaly?  I can't seem to make the parallel matrix inversion
 run slower than serial on my hardware, even with ridiculous tuning
 parameters that I was almost sure would bottleneck the thing on the task
 queue.  Also, all the other benchmarks actually look pretty good.

Sadly the result is consistent :-(

|> matrixInversion
Inverted a 256 x 256 matrix serially in 60 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 61 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 59 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|> matrixInversion
Inverted a 256 x 256 matrix serially in 58 milliseconds.
Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
|>

--
Russel.
Feb 27 2011
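The "tuning parameters" in the quoted question presumably include the work unit size, i.e. how many elements a worker takes from the task queue at a time. The following is a rough sketch of how one might time a serial loop against parallel foreach with different work unit sizes. It is not the matrixInversion benchmark: the workload (filling an array with square roots) and the function names are stand-ins chosen for illustration, and the calls assume current Phobos (std.parallelism.parallel, std.datetime.stopwatch).

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math : sqrt;
import std.parallelism : parallel;
import std.stdio : writefln;

/// Stand-in workload: one cheap operation per element, done serially.
void fillSerial(double[] data)
{
    foreach (i, ref x; data)
        x = sqrt(cast(double) i);
}

/// Same workload via parallel foreach; each worker takes workUnitSize
/// elements from the task queue at a time.
void fillParallel(double[] data, size_t workUnitSize)
{
    foreach (i, ref x; parallel(data, workUnitSize))
        x = sqrt(cast(double) i);
}

void main()
{
    auto data = new double[](1_000_000);

    auto sw = StopWatch(AutoStart.yes);
    fillSerial(data);
    writefln("serial: %s ms", sw.peek.total!"msecs");

    // A very small work unit means many trips to the task queue per
    // element of useful work; a larger one amortises that overhead.
    size_t[] units = [1, 100, 10_000];
    foreach (unit; units)
    {
        sw.reset(); // the stopwatch keeps running after a reset
        fillParallel(data, unit);
        writefln("parallel, work unit %s: %s ms", unit, sw.peek.total!"msecs");
    }
}
```

Timings from a sketch like this depend heavily on the machine and on how cheap the per-element work is, so they say nothing by themselves about why the 256 x 256 inversion above runs slower in parallel.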
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Sun, 2011-02-27 at 11:36 -0500, dsimcha wrote:
[ . . . ]
 figured out why.  I think it's related to my Posix workaround for Bug
 3753 (http://d.puremagic.com/issues/show_bug.cgi?id=3753).  This
 workaround causes GC heap allocations to occur in a loop inside the

 list, etc.), but I can't bring myself to put a lot of effort into
 optimizing a workaround for a compiler bug.

Is the compiler bug fixable or is it one of those that will last for a 1000 years?

--
Russel.
Feb 27 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 04, 2011 11:12:00 Russel Winder wrote:
 On Fri, 2011-03-04 at 10:10 -0800, Jonathan M Davis wrote:
 [ . . . ]
 
 We've never really discussed that. Thus far, anyone who posted on the
 newsgroup could vote. Now, if there were a bunch of votes from unknown
 folks and that definitely shifted the vote, then I would fully expect
 those votes to be thrown out or the vote redone or whatnot (if nothing
 else, they could be sock puppets). But it's not like we've selected a
 list of people and said that they were the ones allowed to vote. The few
 times that we've voted on including something in Phobos thus far, it
 hasn't been an issue.

Works for me. It's much nicer to have an informal system that works -- as long as it is possible to tell when the system is being subverted.

Presumably this is a four-state vote:

    +1 approve
    0 cannot decide
    -1 disapprove
    -- no opinion

Anyone not emailing is deemed to have cast a -- vote, all of which are automatically discarded. Votes such as +100 will presumably be renormalized to +1.

All you really do is vote whether you want it in Phobos or not. The total number of votes/voters is then taken, and the majority decides. If the majority vote for inclusion, then it's included. If the majority vote for it not to be included, then it's not included. There's no weighting of votes or point system or whatnot.

- Jonathan M Davis
Mar 04 2011
prev sibling parent Russel Winder <russel russel.org.uk> writes:

On Fri, 2011-03-04 at 11:27 -0800, Jonathan M Davis wrote:
[ . . . ]
 Presumably this is a four-state vote:

     +1 approve
     0 cannot decide
     -1 disapprove
     -- no opinion

 Anyone not emailing is deemed to have cast a -- vote all of which are
 automatically discarded.  Votes such as +100 will presumably be
 renormalized to +1.

 All you really do is vote whether you want it in Phobos or not. The total number
 of votes/voters is then taken, and the majority decides. If the majority vote
 for inclusion, then it's included. If the majority vote for it not to be
 included, then it's not included. There's no weighting of votes or point system
 or whatnot.

I think we are saying fundamentally the same thing, but you are keeping it informal and I have started my Friday night whisky (*) drinking.

I think I have already decided to say +1, but I better not say that in case it is deemed as trying to influence the electorate ;-)

Have a good weekend!

(*) Definitely not whiskey.

--
Russel.
Mar 04 2011