
digitalmars.D - Thoughts on parallel programming?

reply jfd <jfd nospam.com> writes:
Any thoughts on parallel programming?  I was looking at something about Chapel
and X10 languages etc. for parallelism, and it looks interesting.  I know that
it is still an area of active research, and it is not yet (far from?) done,
but does anyone have thoughts on this as a future direction?  Thank you.
Nov 10 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
jfd:

 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
In the past I have posted here two long posts about Chapel; it's a language that contains several good ideas worth stealing, but my posts were mostly ignored. Chapel is designed for heavy numerical computing on multi-core or multi-CPU machines, and it has good ideas about CPU-localization of work, while D isn't yet very serious about that kind of parallelism. So far D has instead embraced message passing, which fits other purposes. Bye, bearophile
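P.S. For concreteness, a minimal sketch of that message-passing style using D2's std.concurrency (spawn/send/receiveOnly) as it exists in current Phobos; it only illustrates the model, nothing Chapel-specific:

import std.concurrency;
import std.stdio;

// A worker thread with isolated state: it receives an int from its owner
// and sends a result back.
void worker()
{
    auto n = receiveOnly!int();   // block until an int arrives
    send(ownerTid, n * 2);        // reply to the spawning thread
}

void main()
{
    auto tid = spawn(&worker);
    send(tid, 21);
    writeln(receiveOnly!int());   // prints 42
}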
Nov 10 2010
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from jfd (jfd nospam.com)'s article
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Well, there's my std.parallelism library, which is in review for inclusion in Phobos. (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html, http://www.dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d) One unfortunate thing about it is that it doesn't use (and actually bypasses) D's thread isolation system and allows unchecked sharing. I couldn't think of any way to create a pedal-to-metal parallelism library that was simultaneously useful, safe and worked with the language as-is, and I wanted something that worked **now**, not next year or in D3 or whatever, so I decided to omit safety. Given that the library is in review, now would be the perfect time to offer any suggestions on how it can be improved.
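As a taste of the API, here is a minimal sketch of the parallel foreach usage following the linked documentation (details and names in the draft under review may differ):

import std.parallelism;
import std.math;

void main()
{
    auto nums = new double[](1_000_000);
    foreach (i, ref x; nums) x = i;

    // The loop body runs on worker threads from the default task pool;
    // iterations are handed out in work units across the cores.
    foreach (ref x; taskPool.parallel(nums))
        x = sqrt(x);
}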
Nov 10 2010
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with many such chips on a board, possibly clustered, doesn't have much of a future. Timescale: 5--10 years.

Intel's 80-core, 48-core and 50-core devices show the way server, workstation and laptop architectures are going. There may be a large central memory unit as now, but it will be secondary storage, not primary storage. All the chip architectures are shifting to distributed memory -- basically, cache coherence is too hard a problem to solve, so instead of solving it, they are getting rid of it. The memory bus also stops being the bottleneck for computations, which is actually the biggest problem with current architectures.

Windows, Linux and Mac OS X have a serious problem and will either die or be revolutionized. Apple at least recognize the issue, hence they pushed OpenCL.

Actor model, CSP, dataflow, and similar distributed memory/process-based architectures will become increasingly important for software. There will be an increasing move to declarative expression, but I doubt functional languages will ever make the mainstream. The issue here is that parallelism generally requires programmers not to try to tell the computer every detail of how to do something, but instead to specify the start and end conditions and allow the runtime system to handle the realization of the transformation. Hence the move in Fortran from lots of "do" loops to "whole array" operations.

MPI and all the SPMD approaches have a severely limited future, but I bet the HPC codes are still using Fortran and MPI in 50 years' time.

You mentioned Chapel and X10, but don't forget the other one of the original three HPCS projects, Fortress. Whilst all three are PGAS (partitioned global address space) languages, Fortress takes a very different viewpoint compared to Chapel and X10.

The summary of the summary is: programmers will either be developing parallelism systems or they will be unemployed.

<shameless-plug>
To hear more, I am doing a session on all this stuff for ACCU London, 2010-11-18 18:30+00:00
http://skillsmatter.com/event/java-jee/java-python-ruby-linux-windows-are-all-doomed
</shameless-plug>

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 09:58, Russel Winder wrote:

 On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something  
 about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.   
 I know that
 it is still an area of active research, and it is not yet (far  
 from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with there being many such chips on a board, possibly clustered, doesn't have much of a future. Timescale 5--10 years.
On this I am not so sure: heterogeneous clusters are more difficult to program, and GPUs & co. are slowly becoming more and more general purpose. Being able to take advantage of them is useful, but I am not convinced they are necessarily the future.
 Intel's 80-core, 48-core and 50-core devices show the way server,
 workstation and laptop architectures are going.  There may be a large
 central memory unit as now, but it will be secondary storage not  
 primary
 storage.  All the chip architectures are shifting to distributed  
 memory
 -- basically cache coherence is too hard a problem to solve, so  
 instead
 of solving it, they are getting rid of it.  Also the memory bus stops
 being the bottleneck for computations, which is actually the biggest
 problem with current architectures.
Yes, many-core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors. But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.
 Windows, Linux and Mac OS X have a serious problem and will either die
 or be revolutionized.  Apple at least recognize the issue, hence they
 pushed OpenCL.
Again, I am not sure the situation is as dire as you paint it; Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.
 Actor model, CSP, dataflow, and similar distributed memory/process- 
 based
 architectures will become increasingly important for software.  There
 will be an increasing move to declarative expression, but I doubt
 functional languages will ever make the main stream.  The issue here  
 is
 that parallelism generally requires programmers not to try and tell  
 the
 computer every detail how to do something, but instead specify the  
 start
 and end conditions and allow the runtime system to handle the
 realization of the transformation.  Hence the move in Fortran from  
 lots
 of "do" loops to "whole array" operations.
Whole-array operations are useful, and when possible one gains much by using them; unfortunately not all problems can be reduced to a few large array operations, which is why data-parallel languages are not the main type of language.
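As an aside, plain D already has whole-array (vector) operations for the simple cases; a small sequential illustration of the style under discussion (nothing parallel here):

// One array-wise statement instead of an indexed "do" loop; the intent is
// declarative and the compiler is free to vectorize it.
void axpy(double alpha, const(double)[] x, double[] y)
{
    assert(x.length == y.length);
    y[] += x[] * alpha;   // y = y + alpha*x
}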
 MPI and all the SPMD approaches have a severely limited future, but I
 bet the HPC codes are still using Fortran and MPI in 50 years time.
Well, whole-array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with more difficult optimization, as the hardware is more complex). About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model. The main weakness of this model is that it assumes some kind of reliability, but then it offers a clear computational model, with processors ordered in a linear or higher-dimensional structure, and efficient collective communication primitives. Yes, MPI is not the right choice for all problems, but when usable it is very powerful, often superior to the alternatives, and programming with it is *simpler* than thinking about a generic distributed system. So I think that for problems that are not trivially parallel or easily parallelizable, MPI will remain the best choice.
 You mentioned Chapel and X10, but don't forget the other one of the
 original three HPCS projects, Fortress.  Whilst all three are PGAS
 (partitioned global address space) languages, Fortress takes a very
 different viewpoint compared to Chapel and X10.
It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separate address spaces. Using MPI one can define a PGAS-like object by wrapping local storage with an object that sends remote requests to access remote memory pieces. This means having a local server where these wrapped objects can be "published" and that can respond at any moment to external requests. I call this rpc (remote procedure call), and it can be realized easily on top of MPI. As not all objects are distributed, and in a complex program it does not always make sense to distribute these objects on all processors or none, I find the robust partitioning and collective communication primitives of MPI superior to PGAS. With enough effort you can probably get everything from PGAS too, but then you lose all its simplicity.
 The summary of the summary is:  programmers will either be developing
 parallelism systems or they will be unemployed.
The situation is not so dire: some problems are trivially parallel or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough; but I do agree that being able to develop parallel systems is increasingly important.

In fact it is something that I like to do and have thought about a lot. I did program parallel systems, and out of my experience I tried to build something to do parallel programs "the way it should be", or at least the way I would like it to be ;) The result is what I did with blip, http://dsource.org/projects/blip .

I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible. At some point being parallel is more complex, and it puts an extra burden on the programmer. Still, it is possible to have several levels of parallelization, and if you write a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.

What I did is a basic SMP parallelization for programs with shared memory. This level tries to schedule independent recursive tasks using all processors as efficiently as possible (using the topology detected by libhwloc). It leverages an event-based framework (libev) to avoid blocking while waiting for external tasks. The ability to describe complex asynchronous processes can also be very useful for working with GPUs.

MPI parallelization is part of the hierarchy of parallelization, for the reasons I described before; it is wrapped so that on a single processor one can use a "pseudo" MPI.

rpc (remote procedure call), which might be better described as distributed objects, offers a server that can respond to external requests at any moment and the possibility to publish objects that will then be identified by URLs. These URLs can be used to create local proxies that call the remote object and get results from it. This can be done using MPI, or directly with sockets. If one uses sockets one has the whole flexibility (but also the whole complexity) of a fully distributed system. The basic building blocks of this can also be used in a distributed protocol like distributed hashtables.

blip is available now, and works on OS X and Linux. It should be possible to port it to Windows (both libhwloc and libev work on Windows), but I didn't do it. It needs D1 and Tango; Tango trunk can be compiled using the scripts in blip/buildTango, and then programs using blip can be compiled more easily with the dbuild script (which uses xfbuild behind the scenes). I planned to make an official release this w.e., but you can look already now, the code is all there...

Fawzi
-----------------------------------------------------
Dr. Fawzi Mohamed, Office: 3'322
Humboldt-Universitaet zu Berlin, Institut fuer Chemie
Post: Unter den Linden 6, 10099, Berlin
Besucher/Pakete: Brook-Taylor-Str. 2, 12489 Berlin
Tel: +49 30 20 93 7140
Fax: +49 30 2093 7136
-----------------------------------------------------
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 15:16, Fawzi Mohamed wrote:

 On 11-nov-10, at 09:58, Russel Winder wrote:

 On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something  
 about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.   
 I know that
 it is still an area of active research, and it is not yet (far  
 from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
I just finished reading "Parallel Programmability and the Chapel Language" by Chamberlain, Callahan and Zima. A very nice read, and an overview of several languages and approaches. Still, I stand by my earlier view: an MPI-like approach is more flexible, but indeed having a nice parallel implementation of distributed arrays (which on MPI one can have using Global Arrays, for example) can be very useful. I think that a language like D can hide these behind wrapper objects and reach, for these objects (which are not the only ones present in a complex parallel program), an expressivity similar to Chapel's using the approach I have in blip. A direct implementation might be more efficient on shared-memory machines, though.
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:


On 11-nov-10, at 15:16, Fawzi Mohamed wrote:

 On 11-nov-10, at 09:58, Russel Winder wrote:

 MPI and all the SPMD approaches have a severely limited future, but I
 bet the HPC codes are still using Fortran and MPI in 50 years time.
well whole array operations are a generalization of the SPMD approach, so I this sense you said that that kind of approach will have a future (but with a more difficult optimization as the hardware is more complex.
Sorry, I read that as SIMD, not SPMD, but the answer below still holds in my opinion: if one has a complex parallel problem, MPI is a worthy contender; the thing is that on many occasions one doesn't need all its power. If a client/server, a distributed, or a map/reduce approach works, then simpler and more flexible solutions are superior. That (and its reliability problem, which PGAS also shares) is, in my opinion, the reason MPI is not much used outside the computational community. Being able to tackle MPMD in a good way can also be useful, and that is what the rpc level does between computers, and the event-based scheduling within a single computer (ensuring that one processor can do meaningful work while another waits).
 About MPI I think that many don't see what MPI really does, mpi  
 offers a simplified parallel model.
 The main weakness of this model is that it assumes some kind of  
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of  
 higher dimensional structure and efficient collective communication  
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it  
 is very powerful, often superior to the alternatives, and  
 programming with it is *simpler* than thinking about a generic  
 distributed system.
 So I think that for problems that are not trivially parallel, or  
 easily parallelizable MPI will remain as the best choice.
Nov 11 2010
prev sibling next sibling parent reply Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ? Thanks!
Nov 11 2010
next sibling parent reply Trass3r <un known.com> writes:
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight  
 multithreading in D, that is, something like OpenMP ?
That would require compiler support for it. Other than that there only seems to be dsimcha's std.parallelism
Nov 11 2010
parent Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 07:01 PM, Trass3r wrote:
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
That would require compiler support for it. Other than that there only seems to be dsimcha's std.parallelism
OK, that's what I suspected. std.parallelism doesn't look too bad though, I'll play around with that...
Nov 11 2010
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
[ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well.

However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward.

Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
next sibling parent Trass3r <un known.com> writes:
 Having a D binding to OpenCL is probably going to be a good thing.
http://bitbucket.org/trass3r/cl4d/wiki/Home
Nov 11 2010
prev sibling parent reply Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :)

Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation...

Cheers,
Tobias
Nov 11 2010
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Tobias Pfaff (nospam spam.no)'s article
 On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :) Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation... Cheers, Tobias
I think you'll be very pleased with std.parallelism when/if it gets into Phobos. The design philosophy is exactly what you're looking for: simple shared memory parallelism on multicore computers, assuming no fancy/unusual OS-, compiler- or hardware-level infrastructure. Basically, it's got parallel foreach, parallel map, parallel reduce and parallel tasks. All you need to fully utilize it is DMD and a multicore PC. As a reminder, the docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html and the code is at http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d . If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.
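For the parallel_for and map/reduce use case above, a hedged sketch following the linked documentation (the draft under review may name things slightly differently, e.g. map vs. amap):

import std.parallelism;
import std.math;

void main()
{
    auto nums = new double[](1_000_000);
    foreach (i, ref x; nums) x = i + 1;

    // Parallel map: apply log to every element using the task pool.
    auto logs = taskPool.amap!log(nums);

    // Parallel reduce: sum the mapped values across worker threads.
    auto sum = taskPool.reduce!"a + b"(logs);
}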
Nov 11 2010
parent Tobias Pfaff <nospam spam.no> writes:
On 11/12/2010 12:44 AM, dsimcha wrote:
 == Quote from Tobias Pfaff (nospam spam.no)'s article
 On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :) Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation... Cheers, Tobias
I think you'll be very pleased with std.parallelism when/if it gets into Phobos. The design philosophy is exactly what you're looking for: simple shared memory parallelism on multicore computers, assuming no fancy/unusual OS-, compiler- or hardware-level infrastructure. Basically, it's got parallel foreach, parallel map, parallel reduce and parallel tasks. All you need to fully utilize it is DMD and a multicore PC. As a reminder, the docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html and the code is at http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d . If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.
I did a quick test of the module, and it looks really good so far, thanks for providing this! (Is this module scheduled for inclusion in Phobos 2?) If I find issues with it I'll let you know.
Nov 12 2010
prev sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 12-nov-10, at 00:29, Tobias Pfaff wrote:

 [...]
 Well, I am looking for an easy & efficient way to perform parallel  
 numerical calculations on our 4-8 core machines. With C++, that's  
 OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe  
 lightweight was the wrong word, what I meant is that OpenMP is easy  
 to use, and efficient for the problems we are solving. There  
 actually might be better tools for that, honestly we didn't look  
 into that much options -- we are no HPC guys, 1000-cpu clusters are  
 not a relevant scenario and we are happy that we even started  
 parallelizing our code at all :)

 Anyways, I was thinking about the logical thing to use in D for this  
 scenario. It's nothing super-fancy, in cases just a parallel_for we  
 will, and sometimes a map/reduce operation...
If you use D1, blip.parallel.smp offers that, and it does scale well to 4-8 cores.
Nov 12 2010
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Tobias Pfaff Wrote:

 On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ?
I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. The only issue is that TLS (ie. non-shared static storage) is thread-local, not fiber-local. One idea, however, is to do OSX-style manual TLS inside Fiber, so each fiber would have its own automatic local storage. Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.
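Not a description of the actual plan, just a hedged sketch of the fiber-local-storage idea on top of core.thread (the StorageFiber name and the string-keyed store are made up for illustration):

import core.thread;

// A Fiber derivative carrying its own storage, so state that would normally
// live in TLS follows the fiber even when a thread pool resumes it on
// different threads.
class StorageFiber : Fiber
{
    Object[string] locals;   // per-fiber key/value store

    this(void delegate() dg) { super(dg); }

    // The fiber currently running on this thread, or null if the caller
    // is not executing inside a StorageFiber.
    static StorageFiber current()
    {
        return cast(StorageFiber) Fiber.getThis();
    }
}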
Nov 11 2010
parent sybrandy <sybrandy gmail.com> writes:
On 11/11/2010 02:41 PM, Sean Kelly wrote:
 Tobias Pfaff Wrote:

 On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ?
I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. The only issue is that TLS (ie. non-shared static storage) is thread-local, not fiber-local. One idea, however, is to do OSX-style manual TLS inside Fiber, so each fiber would have its own automatic local storage. Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.
I actually did something similar for a very simple web server I was experimenting with. It works much like Erlang, where the Erlang processes are, at least to me, similar to fibers run in one of several threads in the interpreter. The only problem I had was ensuring that my logging was thread-safe. If you could implement a TLS-like system for fibers, I think that would help prevent that issue. Casey
Nov 11 2010
prev sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 20:10, Russel Winder wrote:

 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution.
I agree. I think that TBB offers primitives for many kinds of parallelization, and is cleaner and more flexible than OpenMP, but in my opinion it has a big weakness: it cannot cope well with independent tasks. Coping well with both nested parallelism and independent tasks is crucial for a generic solution that can be applied to several problems. As far as I know this is missing from Chapel as well. I think that a solution that copes well with both nested parallelism and independent tasks is an excellent starting point on which to build almost all other higher-level parallelization schemes. It is important to handle this centrally, because the number of threads that one spawns should ideally stay limited to the number of execution units.
Nov 12 2010
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
[ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult to
 program, and GPU & co are slowly becoming more and more general purpose.
 Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.

[ . . . ]
 yes many core is the future I agree on this, and also that distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with a hierarchy of communication costs is the overarching architecture, with each process potentially being SMP or CSP or . . .
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams; it implies a centralized, monolithic approach to the operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$, despite throwing large amounts of money at the problem, and indeed buying some very high profile names to try to do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system?

My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from today's OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.

[ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays!

[ . . . ]
 well whole array operations are a generalization of the SPMD approach,
 so I this sense you said that that kind of approach will have a future
 (but with a more difficult optimization as the hardware is more complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handled correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it
 is very powerful, often superior to the alternatives, and programming
 with it is *simpler* than thinking about a generic distributed system.
 So I think that for problems that are not trivially parallel, or
 easily parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK, so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.

[ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS, I find a generalized MPI model better than PGAS when you want to
 have separated address spaces.
 Using MPI one can define a PGAS like object wrapping local storage
 with an object that sends remote requests to access remote memory
 pieces.
 This means having a local server where this wrapped objects can be
 "published" and that can respond in any moment to external requests. I
 call this rpc (remote procedure call) and it can be realized easily on
 the top of MPI.
 As not all objects are distributed and in a complex program it does
 not always makes sense to distribute these objects on all processors
 or none, I find that the robust partitioning and collective
 communication primitives of MPI superior to PGAS.
 With enough effort you probably can get everything also from PGAS, but
 then you loose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.

[ . . . ]
 The situation is not so dire, some problems are trivially parallel, or
 can be solved with simple parallel patterns, others don't need to be
 solved in parallel, as the sequential solution if fast enough, but I
 do agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and I thought about a lot.
 I did program parallel systems, and out of my experience I tried to
 build something to do parallel programs "the way it should be", or at
 least the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/blip .
 I don't think that (excluding some simple examples) fully automatic
 (trasparent) parallelization is really feasible.
 At some point being parallel is more complex, and it puts an extra
 burden on the programmer.
 Still it is possible to have several levels of parallelization, and if
 you program a fully parallel program it should still be possible to
 use it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

[ . . . ]

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
next sibling parent reply retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:

 [ . . . ]

 The real question is whether future computers will run Word,
 OpenOffice.org, Excel, Powerpoint fast enough so that people don't
 complain.  Everything else is an HPC ghetto :-)

 [ . . . ]
FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain. It used to be the case that *all* other systems were single core and only HPC consisted of (hybrid multicore setups of) several nodes. Now what is happening is that both embedded and mainstream PCs are getting multiple cores in both CPU and GPU chips. Multi-socket setups are still rare. The growth rate maybe follows Moore's law, at least in GPUs; in CPUs the problems with programmability are slowing things down and many laptops are still dual-core, even though multiple cores are more energy efficient than higher GHz, and my home PC has 8 virtual cores in a single CPU.

The HPC systems with hundreds of processors are definitely still important, but I see that 99.9(99)% of the market is in desktop and embedded systems. We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.
Nov 11 2010
next sibling parent retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 20:01:09 +0000, retard wrote:

 in CPUs the
 problems with programmability are slowing things down and many laptops
 are still dual-core despite multiple cores are more energy efficient
 than higher GHz and my home PC has 8 virtual cores in a single CPU.
At least it seems so to me. My last 1- and 2-core systems had a TDP of 65 and 105 W. Now it's 130 W, and the next generation has 12 cores at 130 W TDP. So I currently have 8 CPU cores and 480 GPU cores. Unfortunately many open source applications don't use the GPU (maybe OpenGL 1.0, but usually software rendering; the GPU-accelerated desktops are still buggy and crash prone) and are single threaded. Even some heavier tasks like video encoding use the cores very inefficiently. Would MPI help?
Nov 11 2010
prev sibling parent Gary Whatmore <no spam.sp> writes:
retard Wrote:

 Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:
 
 On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote: [ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult to
 program, and GPU & co are slowly becoming more and more general
 purpose. Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level. [ . . . ]
 yes many core is the future I agree on this, and also that distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with an hierarchy of communications costs is the overarching architecture with each process potentially being SMP or CSP or . . .
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams, it implies a central monolithic approach to operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$ despite throwing large amounts of money at the problem, and indeed bought some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system? My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from the now OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here. [ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays! [ . . . ]
 well whole array operations are a generalization of the SPMD approach,
 so I this sense you said that that kind of approach will have a future
 (but with a more difficult optimization as the hardware is more
 complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handles correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it is
 very powerful, often superior to the alternatives, and programming with
 it is *simpler* than thinking about a generic distributed system. So I
 think that for problems that are not trivially parallel, or easily
 parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions. [ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS; I find a generalized MPI model better than PGAS when you want to
 have separate address spaces.
 Using MPI one can define a PGAS-like object wrapping local storage with
 an object that sends remote requests to access remote memory pieces.
 This means having a local server where these wrapped objects can be
 "published" and that can respond at any moment to external requests. I
 call this RPC (remote procedure call) and it can be realized easily on
 top of MPI.
 As not all objects are distributed, and in a complex program it does not
 always make sense to distribute these objects on all processors or
 none, I find the robust partitioning and collective communication
 primitives of MPI superior to PGAS. With enough effort you can probably
 get everything from PGAS too, but then you lose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true. [ . . . ]
 The situation is not so dire: some problems are trivially parallel, or
 can be solved with simple parallel patterns, and others don't need to be
 solved in parallel, as the sequential solution is fast enough, but I do
 agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and have thought about a lot. I
 did program parallel systems, and out of my experience I tried to build
 something to do parallel programs "the way it should be", or at least
 the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/blip .
 I don't think that (excluding some simple examples) fully automatic
 (transparent) parallelization is really feasible. At some point being
 parallel is more complex, and it puts an extra burden on the
 programmer.
 Still it is possible to have several levels of parallelization, and if
 you program a fully parallel program it should still be possible to use
 it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things. [ . . . ]
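(dsimcha's std.parallelism, mentioned up-thread, is one attempt at exactly this: the loop still reads as a sequential foreach and the parallelism lives in one word. A minimal sketch, assuming the API as proposed for Phobos; note that, as dsimcha says above, this deliberately bypasses the language's isolation checking.)

import std.math : sqrt;
import std.parallelism;
import std.range : iota;

void main()
{
    auto results = new double[10_000];

    // Reads like an ordinary sequential loop; taskPool hands index ranges
    // to worker threads behind the scenes. Writing into the shared results
    // array from several threads is exactly the unchecked sharing
    // mentioned earlier in the thread.
    foreach (i; taskPool.parallel(iota(results.length)))
        results[i] = sqrt(cast(double) i);
}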
FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain. It used to be the case that *all* other systems were single core and only HPC consisted of (hybrid multicore setups of) several nodes. Now what is happening is that both embedded and mainstream PCs are getting multiple cores in both CPU and GPU chips. Multi-socket setups are still rare. The growth rate maybe follows Moore's law, at least in GPUs; in CPUs the problems with programmability are slowing things down, and many laptops are still dual-core even though multiple cores are more energy efficient than higher clock speeds (my home PC has 8 virtual cores in a single CPU). The HPC systems with hundreds of processors are definitely still important, but I see that 99.9(99)% of the market is in desktop and embedded systems. We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.
You're unfortunately completely wrong. The industry is moving away from desktop applications. The reason is simple: software as a service brings more profit and provides a handy vendor lock-in that customers don't even realize yet. Advertisers pay for the services now; the customers will pay directly in the future, once the local desktop application market has been crushed. Another reason is the amount of open source software out there. You can't compete with free (as in beer) and it's considered good enough by typical users. Desktop applications suffer from segfaults and bugs. You can hide those with server technology; just blame the infrastructure. Conceptually the internet is so complex that people accept broken behavior more often: "yep, it wasn't facebook's fault - some pipe exploded in africa and your net is down". High performance web servers need MPI and similar technologies to scale on huge clusters. The streaming services and browser plugins guarantee that you don't need to do video encoding at home anymore. Low upload bandwidth guarantees that you won't share your personal content (e.g. images or videos taken with a camera) even when IPv6 comes. All games will only be available on game consoles. The multicore PC will just die away. In the future the client systems will be even dumber than they are now - maybe even X-like thin clients on top of http/html5/ajax. These systems run just fine on a single-tasking single core.
Nov 11 2010
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Russel Winder wrote:
 Agreed.  My point was that in 1960s code people explicitly handled array
 operations using do loops because they had to.  Nowadays such code is
 anathema to efficient execution.  My complaint here is that people have
 put effort into compiler technology instead of rewriting the codes in a
 better language and/or idiom.  Clearly whole array operations only apply
 to algorithms that involve arrays!
Yup. I am bemused by the efforts put into analyzing loops so that they can (by the compiler) be re-written into a higher level construct, and then the higher level construct is compiled. It's just backwards from what the compiler should be doing. The high level construct is what the programmer should be writing. It shouldn't be something the compiler reconstructs from low level source code.
Nov 11 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 Yup. I am bemused by the efforts put into analyzing loops so that they can (by 
 the compiler) be re-written into a higher level construct, and then the higher 
 level construct is compiled.
 
 It just is backwards what the compiler should be doing. The high level
construct 
 is what the programmer should be writing. It shouldn't be something the
compiler 
 reconstructs from low level source code.
I agree a lot. The language has to offer means to express all the semantics and constraints: that the arrays are disjoint, that the operations done on them are pure or not pure, that the operations are not pure but determined only by a small window in the arrays, and so on. And then the compiler has to optimize the code according to the presence of SIMD registers, multi-cores, etc. This may not be enough for maximum-performance applications, but in most situations it's plenty. (Incidentally, this is a lot of what the Chapel language does (and D doesn't), and what I have explained in two past posts about Chapel, that were mostly ignored.) Bye, bearophile
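(D already has a little of this vocabulary: pure marks an element-wise kernel as side-effect free, and writing into a freshly allocated array makes the disjointness explicit, so a library, or eventually the compiler, is entitled to parallelize. A hedged sketch using std.parallelism's amap; whether the compiler itself will ever exploit such annotations is exactly the open question here.)

import std.parallelism;

// A pure element-wise kernel: the annotation tells reader and compiler
// alike that calls have no side effects and can be freely reordered.
pure double kernel(double x)
{
    return x * x + 1.0;
}

void main()
{
    auto input = new double[100_000];
    input[] = 0.5;   // whole-array initialization

    // amap applies the kernel across the input on the task pool and
    // writes into a freshly allocated (hence disjoint) result array.
    double[] output = taskPool.amap!kernel(input);
}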
Nov 11 2010
parent retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 16:32:03 -0500, bearophile wrote:

 Walter:
 
 Yup. I am bemused by the efforts put into analyzing loops so that they
 can (by the compiler) be re-written into a higher level construct, and
 then the higher level construct is compiled.
 
 It just is backwards what the compiler should be doing. The high level
 construct is what the programmer should be writing. It shouldn't be
 something the compiler reconstructs from low level source code.
I agree a lot. The language has to offer means to express all the semantics and constraints, that the arrays are disjointed, that the operations done on them are pure or not pure, that the operations are not pure but determined only by a small window in the arrays, and so on and on. And then the compiler has to optimize the code according to the presence of SIMD registers, multi-cores, etc. This maybe is not enough for max performance applications, but in most situations it's plenty enough. (Incidentally, this is a lot what the Chapel language does (and D doesn't), and what I have explained in two past posts about Chapel, that were mostly ignored.)
How does Chapel work when I need to sort data (just a basic quicksort on 12 cores, for instance), or e.g. compile many files in parallel, or encode Xvid? What is the content of the array in the Xvid case?
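(Chapel specifics aside, the "compile many files" case needs no data-parallel array at all, only independent tasks; in std.parallelism terms it is roughly the sketch below, where compileOne is a hypothetical stand-in for invoking the real compiler.)

import std.parallelism;
import std.stdio;

// Hypothetical stand-in for running the actual compiler on one file.
void compileOne(string source)
{
    writeln("compiling ", source);
}

void main()
{
    auto sources = ["a.d", "b.d", "c.d", "d.d"];

    // Each file is an independent unit of work; the pool spreads the
    // iterations over however many worker threads the machine offers.
    foreach (file; taskPool.parallel(sources))
        compileOne(file);
}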
Nov 11 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Nov 11 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
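(That is more or less the model D's std.concurrency exposes: each spawned thread is an ordinary sequential function, and the only interaction is messages. A minimal sketch of two such little sequential programs; the counter task is invented for illustration.)

import std.concurrency;
import std.stdio;

// One "little sequential program": loops, does its work, reports back.
void counter(Tid owner, int upTo)
{
    int total = 0;
    foreach (i; 1 .. upTo + 1)
        total += i;
    owner.send(total);   // interaction happens only through messages
}

void main()
{
    auto tid = spawn(&counter, thisTid, 100);
    auto total = receiveOnly!int();
    writeln("sum 1..100 = ", total);   // 5050
}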
Nov 11 2010
next sibling parent reply %u <user web.news> writes:
Sean Kelly Wrote:

 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times for parallel computing. Lots of new media for distributed message-passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPUs soon if Larrabee comes.
Nov 11 2010
parent reply Gary Whatmore <no spam.sp> writes:
%u Wrote:

 Sean Kelly Wrote:
 
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times parallel computing. Lots of new medias for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPU soon if Larrabee comes.
AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard of it here). Fortran. These are all good examples. AVX is just CPU intrinsics stuff in std.intrinsics.
Nov 11 2010
parent %u <user web.news> writes:
Gary Whatmore Wrote:

 %u Wrote:
 
 Sean Kelly Wrote:
 
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times parallel computing. Lots of new medias for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPU soon if Larrabee comes.
AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard it here). Fortran. These are all good examples. AVX is just a cpu intrinsics stuff in std.intrinsics
Currently the amount of information available is scarce. I have no idea how I use AVX or SSE in D. Auto-vectorization? Does it cover all use cases? So: SSE & auto-vectorization & intrinsics => loops, hand-written inline assembly parts, very small scale; local worker threads / fibers => dsimcha's lib, medium scale; local area network => the great flagship distributed message-passing system; huge clusters with 1000+ computers? Why is the message-passing system so important? Assume I have a dual-core laptop with AVX instructions next year. Use of 2 threads doubles my processor power. Use of AVX gives 8 times more power in good loops. I have no cluster, so the flagship system provides zero benefit.
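(On the SSE/AVX end of that list: the idiomatic D answer today is mostly the whole-array operations discussed above, which a backend is free, though not obliged, to lower to SSE/AVX, plus inline asm when a specific instruction is really needed. A tiny, hedged sketch of the former:)

import std.stdio;

// y[i] += alpha * x[i], written as one whole-array expression. An
// optimizing backend may emit SSE/AVX for this, but the language itself
// makes no promise about vectorization.
void saxpy(float[] y, float alpha, const(float)[] x)
{
    assert(y.length == x.length);
    y[] += x[] * alpha;
}

void main()
{
    float[] y = [1, 2, 3, 4];
    float[] x = [10, 20, 30, 40];
    saxpy(y, 0.5f, x);
    writeln(y);   // [6, 12, 18, 24]
}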
Nov 11 2010
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Kelly wrote:
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact. I agree with Walter -- I don't think it's got much to do with programmer training. It's a problem that hasn't been solved in the real world in the general case. The analogy with the real world suggests to me that there are three cases that work well:
* massively parallel;
* _completely_ independent tasks; and
* very small teams.
Large teams are a management nightmare, and I see no reason to believe that wouldn't hold true for a large number of cores as well.
Nov 12 2010
parent sybrandy <sybrandy gmail.com> writes:
 Distributed programming is essentially a bunch of little sequential
 program that interact, which is basically how people cooperate in the
 real world. I think that is by far the most intuitive of any
 concurrent programming model, though it's still a significant
 conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.
That's only part of the reasoning behind all of the little programs in Erlang. One of the more important aspects is the concept of supervisor trees, where you have processes that monitor* other processes. In the event that a child process fails, the parent process will try to perform a simpler version of what needs to occur until it is successful. The other aspect is the concept of failing fast: it is assumed that a process that fails does not know how to resolve the issue, therefore it should just stop running and allow the parent process to do the right thing. If you build your software the Erlang way, then you implicitly build software that is multi-core friendly. How well it uses multiple cores depends on the software that is written, however I believe that Erlang is supposed to be better than most other languages at obtaining something close to linear scaling across cores. Not 100% sure, though. Does this mean that I believe distributed programming is easy in Erlang? Well, that depends on what you're doing, but I will say that being able to spawn functions on different machines is dirt simple. Doing it efficiently... well, that's where I think the programmer needs to know what they're doing. Casey * The monitoring is something implicit to the language.
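(Erlang's supervisor idea has a rough analogue in D's std.concurrency via spawnLinked: the supervisor blocks in receive and is told, via LinkTerminated, when a linked worker has terminated, at which point it can restart it. A hedged sketch; the job protocol is invented, and real supervision needs restart limits and backoff.)

import std.concurrency;
import std.stdio;

// A worker that handles one job and fails fast on bad input: it stops
// instead of limping on, and its termination notifies linked threads.
void worker()
{
    auto job = receiveOnly!int();
    if (job < 0)
    {
        writeln("worker: bad job, giving up");
        return;   // fail fast; the supervisor will notice
    }
    writeln("worker: processed job ", job);
}

void main()
{
    auto tid = spawnLinked(&worker);   // link supervisor and worker
    tid.send(-1);                      // make the first worker give up

    try
    {
        receiveOnly!int();             // supervisor waits for news
    }
    catch (LinkTerminated)
    {
        // Simplest possible policy: restart and hand over a sane job.
        writeln("supervisor: worker terminated, restarting");
        tid = spawnLinked(&worker);
        tid.send(2);
    }
}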
Nov 13 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 20:41, Russel Winder wrote:

 On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
 [ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult  
 to
 program, and GPU & co are slowly becoming more and more general  
 purpose.
 Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.
Vector coprocessors, yes, I see that, and short term the effect of things like AMD Fusion (CPU/GPU merging). Is this necessarily the future? I don't know; neither does Intel, I think, as they are still evaluating Larrabee. But CPU/GPU will stay around for some time more, for sure.
 [ . . . ]
 yes many core is the future I agree on this, and also that  
 distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past.
 I disagree that distributed systems are more complex per se.  I  
 suspect
 comments are getting so general here that anything anyone writes can  
 be
 seen as both true and false simultaneously.  My perception is that
 shared memory multithreading is less and less a tool that applications
 programmers should be thinking in terms of.  Multiple processes with  
 an
 hierarchy of communications costs is the overarching architecture with
 each process potentially being SMP or CSP or . . .
I agree that on not-too-large shared memory machines a hierarchy of tasks is the correct approach. This is what I did in blip.parallel.smp. Using that, one can have fairly efficient automatic scheduling, and so forget most of the complexities and the actual hardware configuration.
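(blip's own API isn't shown in this thread, so the following is only the same hierarchical-task idea expressed in std.parallelism terms, assuming the task/yieldForce names as they later appeared in Phobos: each level pushes half of the work onto the pool and does the other half itself, giving a tree of tasks the scheduler can balance.)

import std.parallelism;
import std.stdio;

// A toy divide-and-conquer computation producing a hierarchy of tasks.
ulong sumRange(ulong lo, ulong hi)
{
    if (hi - lo < 10_000)
    {
        ulong s = 0;
        foreach (i; lo .. hi)
            s += i;
        return s;
    }
    auto mid = lo + (hi - lo) / 2;
    auto left = task!sumRange(lo, mid);   // child task
    taskPool.put(left);                   // schedule it on the pool
    auto right = sumRange(mid, hi);       // do the other half here
    return left.yieldForce + right;       // wait for (or inline) the child
}

void main()
{
    writeln(sumRange(0, 1_000_000));      // 499999500000
}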
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS  
 for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams, it implies a central monolithic approach to operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$ despite throwing large amounts of money at the problem, and indeed bought some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system? My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from the now OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.
Yes, microkernels & co will be more and more important (but I wonder how much this will be the case for the desktop). ABI mobility? Not so sure; for HPC I can imagine having to compile to different ABIs (but maybe that is what you mean by ABI mobility).
 [ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few  
 large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays! [ . . . ]
 well whole array operations are a generalization of the SPMD  
 approach,
 so I this sense you said that that kind of approach will have a  
 future
 (but with a more difficult optimization as the hardware is more  
 complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handles correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi  
 offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it
 is very powerful, often superior to the alternatives, and programming
 with it is *simpler* than thinking about a generic distributed  
 system.
 So I think that for problems that are not trivially parallel, or
 easily parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.
PGAS and MPI both have the same executable everywhere, but MPI is more flexible with respect to making different parts execute different things, and MPI does provide more generic packing/unpacking, but I guess I see your problems with it. Having the same executable is a big constraint, but it is also a simplification.
 [ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS, I find a generalized MPI model better than PGAS when you want  
 to
 have separated address spaces.
 Using MPI one can define a PGAS like object wrapping local storage
 with an object that sends remote requests to access remote memory
 pieces.
 This means having a local server where this wrapped objects can be
 "published" and that can respond in any moment to external  
 requests. I
 call this rpc (remote procedure call) and it can be realized easily  
 on
 the top of MPI.
 As not all objects are distributed and in a complex program it does
 not always makes sense to distribute these objects on all processors
 or none, I find that the robust partitioning and collective
 communication primitives of MPI superior to PGAS.
 With enough effort you probably can get everything also from PGAS,  
 but
 then you loose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.
Yes I guess that is true
 [ . . . ]
 The situation is not so dire, some problems are trivially parallel,  
 or
 can be solved with simple parallel patterns, others don't need to be
 solved in parallel, as the sequential solution if fast enough, but I
 do agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and I thought about a lot.
 I did program parallel systems, and out of my experience I tried to
 build something to do parallel programs "the way it  should be", or  
 at
 least the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/ 
 blip .
 I don't think that (excluding some simple examples) fully automatic
 (trasparent) parallelization is really feasible.
 At some point being parallel is more complex, and it puts an extra
 burden on the programmer.
 Still it is possible to have several levels of parallelization, and  
 if
 you program a fully parallel program it should still be possible to
 use it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.
When you have a network of things communicating (and I think that once you have a distributed system you come to that level), then it is not sufficient anymore to think about each piece in isolation; you have to think about the interactions too. There are some patterns that might help reduce the complexity: client/server, map/reduce, ... but in general it is more complex.
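(The map/reduce pattern at least maps directly onto std.parallelism; a hedged sketch of a parallel sum of squares using taskPool.reduce over a lazy map:)

import std.algorithm : map;
import std.parallelism;
import std.range : iota;
import std.stdio;

void main()
{
    // map/reduce: square each element lazily, then reduce in parallel.
    // Each worker reduces a chunk of the range and the per-worker partial
    // sums are combined with the same associative operation.
    auto nums = iota(100_000.0);
    auto sumSquares = taskPool.reduce!"a + b"(map!"a * a"(nums));
    writeln(sumSquares);
}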
Nov 12 2010
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Don Wrote:

 Sean Kelly wrote:
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.
True enough. But it's certainly more natural to think about than mutex-based concurrency, automatic parallelization, etc. In the long term there may turn out to be better models, but I don't know of one today. Also, there are other goals for such a design than increasing computation speed: decreased maintenance cost, system reliability, etc. Erlang processes are equivalent to objects in C++ or Java, with the added benefit of asynchronous execution in instances where an immediate response (i.e. RPC) is not required. Performance gain is a direct function of how often this is true. But even where it's not, the other benefits exist.
 I agree with Walter -- I don't think it's got much to do with programmer 
 training. It's a problem that hasn't been solved in the real world in 
 the general case.
I agree. But we still need something better than the traditional approach now :-)
 The analogy with the real world suggests to me that there are three 
 cases that work well:
 * massively parallel;
 * _completely_ independent tasks; and
 * very small teams.
 
 Large teams are a management nightmare, and I see no reason to believe 
 that wouldn't hold true for a large number of cores as well.
Back when the Java OS was announced I envisioned a modular system backed by a database of objects serving different functions. Kind of like the old OpenDoc model, but at an OS level. It clearly didn't work out this way, but I'd be interested to see something along these lines. I honestly couldn't say whether apps would turn out to be easier or more difficult to create in such an environment though.
Nov 13 2010
parent sybrandy <sybrandy gmail.com> writes:
 True enough.  But it's certainly more natural to think about than mutex-based
concurrency, automatic parallelization, etc.  In the long term there may turn
out to be better models, but I don't know of one today.

 Also, there are other goals for such a design than increasing computation
speed: decreased maintenance cost, system reliability, etc.  Erlang processes
are equivalent to objects in C++ or Java with the added benefit of asynchronous
execution in instances where an immediate response (ie. RPC) is not required. 
Performance gain is a direct function of how often this is true.  But even
where it's not, the other benefits exist.
I like that description! Casey
Nov 13 2010