
digitalmars.D - Thoughts on parallel programming?

reply jfd <jfd nospam.com> writes:
Any thoughts on parallel programming?  I was looking at something about Chapel
and X10 languages etc. for parallelism, and it looks interesting.  I know that
it is still an area of active research, and it is not yet (far from?) done,
but does anyone have thoughts on this as a future direction?  Thank you.
Nov 10 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
jfd:

 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
In the past I have posted here two long posts about Chapel; it's a language that contains several good ideas worth stealing, but my posts were mostly ignored. Chapel is designed for heavy numerical computing on multi-core or multi-CPU machines, and it has good ideas about CPU-localization of work, while D isn't yet very serious about that kind of parallelism. So far D has instead embraced message passing, which fits other purposes. Bye, bearophile
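P.S. For concreteness, a minimal sketch of that message-passing style using D2's std.concurrency (spawn/send/receiveOnly) as it exists in current Phobos; it only illustrates the model, nothing Chapel-specific:

import std.concurrency;
import std.stdio;

// A worker thread with isolated state: it receives an int from its owner
// and sends a result back.
void worker()
{
    auto n = receiveOnly!int();   // block until an int arrives
    send(ownerTid, n * 2);        // reply to the spawning thread
}

void main()
{
    auto tid = spawn(&worker);
    send(tid, 21);
    writeln(receiveOnly!int());   // prints 42
}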
Nov 10 2010
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from jfd (jfd nospam.com)'s article
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Well, there's my std.parallelism library, which is in review for inclusion in Phobos. (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html, http://www.dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d) One unfortunate thing about it is that it doesn't use (and actually bypasses) D's thread isolation system and allows unchecked sharing. I couldn't think of any way to create a pedal-to-metal parallelism library that was simultaneously useful, safe and worked with the language as-is, and I wanted something that worked **now**, not next year or in D3 or whatever, so I decided to omit safety. Given that the library is in review, now would be the perfect time to offer any suggestions on how it can be improved.
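As a taste of the API, here is a minimal sketch of the parallel foreach usage following the linked documentation (details and names in the draft under review may differ):

import std.parallelism;
import std.math;

void main()
{
    auto nums = new double[](1_000_000);
    foreach (i, ref x; nums) x = i;

    // The loop body runs on worker threads from the default task pool;
    // iterations are handed out in work units across the cores.
    foreach (ref x; taskPool.parallel(nums))
        x = sqrt(x);
}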
Nov 10 2010
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with many such chips on a board, possibly clustered, doesn't have much of a future. Timescale: 5--10 years.

Intel's 80-core, 48-core and 50-core devices show the way server, workstation and laptop architectures are going. There may be a large central memory unit as now, but it will be secondary storage, not primary storage. All the chip architectures are shifting to distributed memory -- basically, cache coherence is too hard a problem to solve, so instead of solving it, they are getting rid of it. The memory bus also stops being the bottleneck for computations, which is actually the biggest problem with current architectures.

Windows, Linux and Mac OS X have a serious problem and will either die or be revolutionized. Apple at least recognize the issue, hence they pushed OpenCL.

Actor model, CSP, dataflow, and similar distributed memory/process-based architectures will become increasingly important for software. There will be an increasing move to declarative expression, but I doubt functional languages will ever make the mainstream. The issue here is that parallelism generally requires programmers not to try to tell the computer every detail of how to do something, but instead to specify the start and end conditions and allow the runtime system to handle the realization of the transformation. Hence the move in Fortran from lots of "do" loops to "whole array" operations.

MPI and all the SPMD approaches have a severely limited future, but I bet the HPC codes are still using Fortran and MPI in 50 years' time.

You mentioned Chapel and X10, but don't forget the other one of the original three HPCS projects, Fortress. Whilst all three are PGAS (partitioned global address space) languages, Fortress takes a very different viewpoint compared to Chapel and X10.

The summary of the summary is: programmers will either be developing parallelism systems or they will be unemployed.

<shameless-plug>
To hear more, I am doing a session on all this stuff for ACCU London, 2010-11-18 18:30+00:00
http://skillsmatter.com/event/java-jee/java-python-ruby-linux-windows-are-all-doomed
</shameless-plug>

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 09:58, Russel Winder wrote:

 On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something  
 about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.   
 I know that
 it is still an area of active research, and it is not yet (far  
 from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with there being many such chips on a board, possibly clustered, doesn't have much of a future. Timescale 5--10 years.
On this I am not so sure: heterogeneous clusters are more difficult to program, and GPUs & co. are slowly becoming more and more general purpose. Being able to take advantage of them is useful, but I am not convinced they are necessarily the future.
 Intel's 80-core, 48-core and 50-core devices show the way server,
 workstation and laptop architectures are going.  There may be a large
 central memory unit as now, but it will be secondary storage not  
 primary
 storage.  All the chip architectures are shifting to distributed  
 memory
 -- basically cache coherence is too hard a problem to solve, so  
 instead
 of solving it, they are getting rid of it.  Also the memory bus stops
 being the bottleneck for computations, which is actually the biggest
 problem with current architectures.
Yes, many-core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors. But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.
 Windows, Linux and Mac OS X have a serious problem and will either die
 or be revolutionized.  Apple at least recognize the issue, hence they
 pushed OpenCL.
Again, I am not sure the situation is as dire as you paint it; Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.
 Actor model, CSP, dataflow, and similar distributed memory/process- 
 based
 architectures will become increasingly important for software.  There
 will be an increasing move to declarative expression, but I doubt
 functional languages will ever make the main stream.  The issue here  
 is
 that parallelism generally requires programmers not to try and tell  
 the
 computer every detail how to do something, but instead specify the  
 start
 and end conditions and allow the runtime system to handle the
 realization of the transformation.  Hence the move in Fortran from  
 lots
 of "do" loops to "whole array" operations.
Whole-array operations are useful, and when possible one gains much by using them; unfortunately not all problems can be reduced to a few large array operations, which is why data-parallel languages are not the main type of language.
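As an aside, plain D already has whole-array (vector) operations for the simple cases; a small sequential illustration of the style under discussion (nothing parallel here):

// One array-wise statement instead of an indexed "do" loop; the intent is
// declarative and the compiler is free to vectorize it.
void axpy(double alpha, const(double)[] x, double[] y)
{
    assert(x.length == y.length);
    y[] += x[] * alpha;   // y = y + alpha*x
}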
 MPI and all the SPMD approaches have a severely limited future, but I
 bet the HPC codes are still using Fortran and MPI in 50 years time.
Well, whole-array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with more difficult optimization, as the hardware is more complex). About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model. The main weakness of this model is that it assumes some kind of reliability, but then it offers a clear computational model, with processors ordered in a linear or higher-dimensional structure, and efficient collective communication primitives. Yes, MPI is not the right choice for all problems, but when usable it is very powerful, often superior to the alternatives, and programming with it is *simpler* than thinking about a generic distributed system. So I think that for problems that are not trivially parallel or easily parallelizable, MPI will remain the best choice.
 You mentioned Chapel and X10, but don't forget the other one of the
 original three HPCS projects, Fortress.  Whilst all three are PGAS
 (partitioned global address space) languages, Fortress takes a very
 different viewpoint compared to Chapel and X10.
It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separate address spaces. Using MPI one can define a PGAS-like object by wrapping local storage with an object that sends remote requests to access remote memory pieces. This means having a local server where these wrapped objects can be "published" and that can respond at any moment to external requests. I call this rpc (remote procedure call), and it can be realized easily on top of MPI. As not all objects are distributed, and in a complex program it does not always make sense to distribute these objects on all processors or none, I find the robust partitioning and collective communication primitives of MPI superior to PGAS. With enough effort you can probably get everything from PGAS too, but then you lose all its simplicity.
 The summary of the summary is:  programmers will either be developing
 parallelism systems or they will be unemployed.
The situation is not so dire: some problems are trivially parallel or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough; but I do agree that being able to develop parallel systems is increasingly important.

In fact it is something that I like to do and have thought about a lot. I did program parallel systems, and out of my experience I tried to build something to do parallel programs "the way it should be", or at least the way I would like it to be ;) The result is what I did with blip, http://dsource.org/projects/blip .

I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible. At some point being parallel is more complex, and it puts an extra burden on the programmer. Still, it is possible to have several levels of parallelization, and if you write a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.

What I did is a basic SMP parallelization for programs with shared memory. This level tries to schedule independent recursive tasks using all processors as efficiently as possible (using the topology detected by libhwloc). It leverages an event-based framework (libev) to avoid blocking while waiting for external tasks. The ability to describe complex asynchronous processes can also be very useful for working with GPUs.

MPI parallelization is part of the hierarchy of parallelization, for the reasons I described before; it is wrapped so that on a single processor one can use a "pseudo" MPI.

rpc (remote procedure call), which might be better described as distributed objects, offers a server that can respond to external requests at any moment and the possibility to publish objects that will then be identified by URLs. These URLs can be used to create local proxies that call the remote object and get results from it. This can be done using MPI, or directly with sockets. If one uses sockets one has the whole flexibility (but also the whole complexity) of a fully distributed system. The basic building blocks of this can also be used in a distributed protocol like distributed hashtables.

blip is available now, and works on OS X and Linux. It should be possible to port it to Windows (both libhwloc and libev work on Windows), but I didn't do it. It needs D1 and Tango; Tango trunk can be compiled using the scripts in blip/buildTango, and then programs using blip can be compiled more easily with the dbuild script (which uses xfbuild behind the scenes). I planned to make an official release this w.e., but you can look already now, the code is all there...

Fawzi
-----------------------------------------------------
Dr. Fawzi Mohamed, Office: 3'322
Humboldt-Universitaet zu Berlin, Institut fuer Chemie
Post: Unter den Linden 6, 10099, Berlin
Besucher/Pakete: Brook-Taylor-Str. 2, 12489 Berlin
Tel: +49 30 20 93 7140
Fax: +49 30 2093 7136
-----------------------------------------------------
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 15:16, Fawzi Mohamed wrote:

 On 11-nov-10, at 09:58, Russel Winder wrote:

 On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something  
 about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.   
 I know that
 it is still an area of active research, and it is not yet (far  
 from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
I just finished reading "Parallel Programmability and the Chapel Language" by Chamberlain, Callahan and Zima. A very nice read, and an overview of several languages and approaches. Still, I stand by my earlier view: an MPI-like approach is more flexible, but indeed having a nice parallel implementation of distributed arrays (which on MPI one can have using Global Arrays, for example) can be very useful. I think that a language like D can hide these behind wrapper objects and reach, for these objects (which are not the only ones present in a complex parallel program), an expressivity similar to Chapel's using the approach I have in blip. A direct implementation might be more efficient on shared-memory machines, though.
Nov 11 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:


On 11-nov-10, at 15:16, Fawzi Mohamed wrote:

 On 11-nov-10, at 09:58, Russel Winder wrote:

 MPI and all the SPMD approaches have a severely limited future, but I
 bet the HPC codes are still using Fortran and MPI in 50 years time.
well whole array operations are a generalization of the SPMD approach, so I this sense you said that that kind of approach will have a future (but with a more difficult optimization as the hardware is more complex.
Sorry, I read that as SIMD, not SPMD, but the answer below still holds in my opinion: if one has a complex parallel problem, MPI is a worthy contender; the thing is that on many occasions one doesn't need all its power. If a client/server, a distributed, or a map/reduce approach works, then simpler and more flexible solutions are superior. That (and its reliability problem, which PGAS also shares) is, in my opinion, the reason MPI is not much used outside the computational community. Being able to tackle MPMD in a good way can also be useful, and that is what the rpc level does between computers, and the event-based scheduling within a single computer (ensuring that one processor can do meaningful work while another waits).
 About MPI I think that many don't see what MPI really does, mpi  
 offers a simplified parallel model.
 The main weakness of this model is that it assumes some kind of  
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of  
 higher dimensional structure and efficient collective communication  
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it  
 is very powerful, often superior to the alternatives, and  
 programming with it is *simpler* than thinking about a generic  
 distributed system.
 So I think that for problems that are not trivially parallel, or  
 easily parallelizable MPI will remain as the best choice.
Nov 11 2010
prev sibling next sibling parent reply Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ? Thanks!
Nov 11 2010
next sibling parent reply Trass3r <un known.com> writes:
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight  
 multithreading in D, that is, something like OpenMP ?
That would require compiler support for it. Other than that there only seems to be dsimcha's std.parallelism
Nov 11 2010
parent Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 07:01 PM, Trass3r wrote:
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
That would require compiler support for it. Other than that there only seems to be dsimcha's std.parallelism
OK, that's what I suspected. std.parallelism doesn't look too bad though, I'll play around with that...
Nov 11 2010
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
[ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well.

However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward.

Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
next sibling parent Trass3r <un known.com> writes:
 Having a D binding to OpenCL is probably going to be a good thing.
http://bitbucket.org/trass3r/cl4d/wiki/Home
Nov 11 2010
prev sibling parent reply Tobias Pfaff <nospam spam.no> writes:
On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :)

Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation...

Cheers,
Tobias
Nov 11 2010
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Tobias Pfaff (nospam spam.no)'s article
 On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :) Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation... Cheers, Tobias
I think you'll be very pleased with std.parallelism when/if it gets into Phobos. The design philosophy is exactly what you're looking for: simple shared memory parallelism on multicore computers, assuming no fancy/unusual OS-, compiler- or hardware-level infrastructure. Basically, it's got parallel foreach, parallel map, parallel reduce and parallel tasks. All you need to fully utilize it is DMD and a multicore PC. As a reminder, the docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html and the code is at http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d . If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.
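For the parallel_for and map/reduce use case above, a hedged sketch following the linked documentation (the draft under review may name things slightly differently, e.g. map vs. amap):

import std.parallelism;
import std.math;

void main()
{
    auto nums = new double[](1_000_000);
    foreach (i, ref x; nums) x = i + 1;

    // Parallel map: apply log to every element using the task pool.
    auto logs = taskPool.amap!log(nums);

    // Parallel reduce: sum the mapped values across worker threads.
    auto sum = taskPool.reduce!"a + b"(logs);
}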
Nov 11 2010
parent Tobias Pfaff <nospam spam.no> writes:
On 11/12/2010 12:44 AM, dsimcha wrote:
 == Quote from Tobias Pfaff (nospam spam.no)'s article
 On 11/11/2010 08:10 PM, Russel Winder wrote:
 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward. Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.
Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There might actually be better tools for that; honestly, we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :) Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy: in most cases just a parallel_for will do, and sometimes a map/reduce operation... Cheers, Tobias
I think you'll be very pleased with std.parallelism when/if it gets into Phobos. The design philosophy is exactly what you're looking for: simple shared memory parallelism on multicore computers, assuming no fancy/unusual OS-, compiler- or hardware-level infrastructure. Basically, it's got parallel foreach, parallel map, parallel reduce and parallel tasks. All you need to fully utilize it is DMD and a multicore PC. As a reminder, the docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html and the code is at http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d . If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.
I did a quick test of the module, and it looks really good so far, thanks for providing this! (Is this module scheduled for inclusion in Phobos 2?) If I find issues with it I'll let you know.
Nov 12 2010
prev sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 12-nov-10, at 00:29, Tobias Pfaff wrote:

 [...]
 Well, I am looking for an easy & efficient way to perform parallel  
 numerical calculations on our 4-8 core machines. With C++, that's  
 OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now. Maybe  
 lightweight was the wrong word, what I meant is that OpenMP is easy  
 to use, and efficient for the problems we are solving. There  
 actually might be better tools for that, honestly we didn't look  
 into that much options -- we are no HPC guys, 1000-cpu clusters are  
 not a relevant scenario and we are happy that we even started  
 parallelizing our code at all :)

 Anyways, I was thinking about the logical thing to use in D for this  
 scenario. It's nothing super-fancy, in cases just a parallel_for we  
 will, and sometimes a map/reduce operation...
If you use D1, blip.parallel.smp offers that, and it does scale well to 4-8 cores.
Nov 12 2010
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Tobias Pfaff Wrote:

 On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ?
I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. The only issue is that TLS (ie. non-shared static storage) is thread-local, not fiber-local. One idea, however, is to do OSX-style manual TLS inside Fiber, so each fiber would have its own automatic local storage. Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.
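Not a description of the actual plan, just a hedged sketch of the fiber-local-storage idea on top of core.thread (the StorageFiber name and the string-keyed store are made up for illustration):

import core.thread;

// A Fiber derivative carrying its own storage, so state that would normally
// live in TLS follows the fiber even when a thread pool resumes it on
// different threads.
class StorageFiber : Fiber
{
    Object[string] locals;   // per-fiber key/value store

    this(void delegate() dg) { super(dg); }

    // The fiber currently running on this thread, or null if the caller
    // is not executing inside a StorageFiber.
    static StorageFiber current()
    {
        return cast(StorageFiber) Fiber.getThis();
    }
}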
Nov 11 2010
parent sybrandy <sybrandy gmail.com> writes:
On 11/11/2010 02:41 PM, Sean Kelly wrote:
 Tobias Pfaff Wrote:

 On 11/11/2010 03:24 AM, jfd wrote:
 Any thoughts on parallel programming.  I was looking at something about Chapel
 and X10 languages etc. for parallelism, and it looks interesting.  I know that
 it is still an area of active research, and it is not yet (far from?) done,
 but anyone have thoughts on this as future direction?  Thank you.
Unfortunately I only know about the standard stuff, OpenMP/OpenCL... Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP ?
I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. The only issue is that TLS (ie. non-shared static storage) is thread-local, not fiber-local. One idea, however, is to do OSX-style manual TLS inside Fiber, so each fiber would have its own automatic local storage. Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.
I actually did something similar for a very simple web server I was experimenting with. It works much like Erlang, where the Erlang processes are, at least to me, similar to fibers run in one of several threads in the interpreter. The only problem I had was ensuring that my logging was thread-safe. If you could implement a TLS-like system for fibers, I think that would help prevent that issue. Casey
Nov 11 2010
prev sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 20:10, Russel Winder wrote:

 On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
 [ . . . ]
 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
 Speaking of which: Are there any attempts to support lightweight
 multithreading in D, that is, something like OpenMP ?
I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well. However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution.
I agree. I think that TBB offers primitives for many kinds of parallelization, and is cleaner and more flexible than OpenMP, but in my opinion it has a big weakness: it cannot cope well with independent tasks. Coping well with both nested parallelism and independent tasks is crucial for a generic solution that can be applied to several problems. As far as I know this is missing from Chapel as well. I think that a solution that copes well with both nested parallelism and independent tasks is an excellent starting point on which to build almost all other higher-level parallelization schemes. It is important to handle this centrally, because the number of threads that one spawns should ideally stay limited to the number of execution units.
Nov 12 2010
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
[ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult to
 program, and GPU & co are slowly becoming more and more general purpose.
 Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.

[ . . . ]
 yes many core is the future I agree on this, and also that distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with a hierarchy of communication costs is the overarching architecture, with each process potentially being SMP or CSP or . . .
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams; it implies a centralized, monolithic approach to the operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$, despite throwing large amounts of money at the problem, and indeed buying some very high profile names to try to do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system?

My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from today's OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.

[ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays!

[ . . . ]
 well whole array operations are a generalization of the SPMD approach,
 so I this sense you said that that kind of approach will have a future
 (but with a more difficult optimization as the hardware is more complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handled correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it
 is very powerful, often superior to the alternatives, and programming
 with it is *simpler* than thinking about a generic distributed system.
 So I think that for problems that are not trivially parallel, or
 easily parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK, so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.

[ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS, I find a generalized MPI model better than PGAS when you want to
 have separated address spaces.
 Using MPI one can define a PGAS like object wrapping local storage
 with an object that sends remote requests to access remote memory
 pieces.
 This means having a local server where this wrapped objects can be
 "published" and that can respond in any moment to external requests. I
 call this rpc (remote procedure call) and it can be realized easily on
 the top of MPI.
 As not all objects are distributed and in a complex program it does
 not always makes sense to distribute these objects on all processors
 or none, I find that the robust partitioning and collective
 communication primitives of MPI superior to PGAS.
 With enough effort you probably can get everything also from PGAS, but
 then you loose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.

[ . . . ]
 The situation is not so dire, some problems are trivially parallel, or
 can be solved with simple parallel patterns, others don't need to be
 solved in parallel, as the sequential solution if fast enough, but I
 do agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and I thought about a lot.
 I did program parallel systems, and out of my experience I tried to
 build something to do parallel programs "the way it should be", or at
 least the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/blip .
 I don't think that (excluding some simple examples) fully automatic
 (trasparent) parallelization is really feasible.
 At some point being parallel is more complex, and it puts an extra
 burden on the programmer.
 Still it is possible to have several levels of parallelization, and if
 you program a fully parallel program it should still be possible to
 use it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

[ . . . ]

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
next sibling parent reply retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:

 [ . . . ]

 The real question is whether future computers will run Word,
 OpenOffice.org, Excel, Powerpoint fast enough so that people don't
 complain.  Everything else is an HPC ghetto :-)

 [ . . . ]
FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain. It used to be the case that *all* other systems were single core and only HPC consisted of (hybrid multicore setups of) several nodes. Now what is happening is that both embedded and mainstream PCs are getting multiple cores in both CPU and GPU chips. Multi-socket setups are still rare. The growth rate maybe follows Moore's law, at least in GPUs; in CPUs the problems with programmability are slowing things down and many laptops are still dual-core, even though multiple cores are more energy efficient than higher GHz, and my home PC has 8 virtual cores in a single CPU.

The HPC systems with hundreds of processors are definitely still important, but I see that 99.9(99)% of the market is in desktop and embedded systems. We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.
Nov 11 2010
next sibling parent retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 20:01:09 +0000, retard wrote:

 in CPUs the
 problems with programmability are slowing things down and many laptops
 are still dual-core despite multiple cores are more energy efficient
 than higher GHz and my home PC has 8 virtual cores in a single CPU.
At least it seems so to me. My last 1- and 2-core systems had a TDP of 65 and 105 W. Now it's 130 W, and the next generation has 12 cores at 130 W TDP. So I currently have 8 CPU cores and 480 GPU cores. Unfortunately many open source applications don't use the GPU (maybe OpenGL 1.0, but usually software rendering; the GPU-accelerated desktops are still buggy and crash prone) and are single threaded. Even some heavier tasks like video encoding use the cores very inefficiently. Would MPI help?
Nov 11 2010
prev sibling parent Gary Whatmore <no spam.sp> writes:
retard Wrote:

 Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:
 
 On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote: [ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult to
 program, and GPU & co are slowly becoming more and more general
 purpose. Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level. [ . . . ]
 yes many core is the future I agree on this, and also that distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with an hierarchy of communications costs is the overarching architecture with each process potentially being SMP or CSP or . . .
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams, it implies a central monolithic approach to operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$ despite throwing large amounts of money at the problem, and indeed bought some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system? My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from the now OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here. [ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays! [ . . . ]
 well whole array operations are a generalization of the SPMD approach,
 so I this sense you said that that kind of approach will have a future
 (but with a more difficult optimization as the hardware is more
 complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handles correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it is
 very powerful, often superior to the alternatives, and programming with
 it is *simpler* than thinking about a generic distributed system. So I
 think that for problems that are not trivially parallel, or easily
 parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions. [ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS; I find a generalized MPI model better than PGAS when you want to
 have separate address spaces.
 Using MPI one can define a PGAS-like object wrapping local storage with
 an object that sends remote requests to access remote memory pieces.
 This means having a local server where these wrapped objects can be
 "published" and that can respond at any moment to external requests. I
 call this RPC (remote procedure call) and it can be realized easily on
 top of MPI.
 As not all objects are distributed, and in a complex program it does not
 always make sense to distribute these objects on all processors or
 none, I find the robust partitioning and collective communication
 primitives of MPI superior to PGAS. With enough effort you can probably
 get everything from PGAS too, but then you lose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true. [ . . . ]
 The situation is not so dire: some problems are trivially parallel, or
 can be solved with simple parallel patterns, and others don't need to be
 solved in parallel, as the sequential solution is fast enough, but I do
 agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and have thought about a lot. I
 did program parallel systems, and out of my experience I tried to build
 something to do parallel programs "the way it should be", or at least
 the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/blip .
 I don't think that (excluding some simple examples) fully automatic
 (transparent) parallelization is really feasible. At some point being
 parallel is more complex, and it puts an extra burden on the
 programmer.
 Still it is possible to have several levels of parallelization, and if
 you program a fully parallel program it should still be possible to use
 it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things. [ . . . ]
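(dsimcha's std.parallelism, mentioned up-thread, is one attempt at exactly this: the loop still reads as a sequential foreach and the parallelism lives in one word. A minimal sketch, assuming the API as proposed for Phobos; note that, as dsimcha says above, this deliberately bypasses the language's isolation checking.)

import std.math : sqrt;
import std.parallelism;
import std.range : iota;

void main()
{
    auto results = new double[10_000];

    // Reads like an ordinary sequential loop; taskPool hands index ranges
    // to worker threads behind the scenes. Writing into the shared results
    // array from several threads is exactly the unchecked sharing
    // mentioned earlier in the thread.
    foreach (i; taskPool.parallel(iota(results.length)))
        results[i] = sqrt(cast(double) i);
}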
FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain. It used to be the case that *all* other systems were single core and only HPC consisted of (hybrid multicore setups of) several nodes. Now what is happening is that both embedded and mainstream PCs are getting multiple cores in both CPU and GPU chips. Multi-socket setups are still rare. The growth rate maybe follows Moore's law, at least in GPUs; in CPUs the problems with programmability are slowing things down, and many laptops are still dual-core even though multiple cores are more energy efficient than higher clock speeds (my home PC has 8 virtual cores in a single CPU). The HPC systems with hundreds of processors are definitely still important, but I see that 99.9(99)% of the market is in desktop and embedded systems. We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.
You're unfortunately completely wrong. The industry is moving away from desktop applications. The reason is simple: software as a service brings more profit and provides a handy vendor lock-in that customers don't even realize yet. Advertisers pay for the services now; the customers will pay directly in the future, once the local desktop application market has been crushed. Another reason is the amount of open source software out there. You can't compete with free (as in beer) and it's considered good enough by typical users. Desktop applications suffer from segfaults and bugs. You can hide those with server technology; just blame the infrastructure. Conceptually the internet is so complex that people accept broken behavior more often: "yep, it wasn't facebook's fault - some pipe exploded in africa and your net is down". High performance web servers need MPI and similar technologies to scale on huge clusters. The streaming services and browser plugins guarantee that you don't need to do video encoding at home anymore. Low upload bandwidth guarantees that you won't share your personal content (e.g. images or videos taken with a camera) even when IPv6 comes. All games will only be available on game consoles. The multicore PC will just die away. In the future the client systems will be even dumber than they are now - maybe even X-like thin clients on top of http/html5/ajax. These systems run just fine on a single-tasking single core.
Nov 11 2010
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Russel Winder wrote:
 Agreed.  My point was that in 1960s code people explicitly handled array
 operations using do loops because they had to.  Nowadays such code is
 anathema to efficient execution.  My complaint here is that people have
 put effort into compiler technology instead of rewriting the codes in a
 better language and/or idiom.  Clearly whole array operations only apply
 to algorithms that involve arrays!
Yup. I am bemused by the efforts put into analyzing loops so that they can (by the compiler) be re-written into a higher level construct, and then the higher level construct is compiled. It's just backwards from what the compiler should be doing. The high level construct is what the programmer should be writing. It shouldn't be something the compiler reconstructs from low level source code.
Nov 11 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 Yup. I am bemused by the efforts put into analyzing loops so that they can (by 
 the compiler) be re-written into a higher level construct, and then the higher 
 level construct is compiled.
 
 It just is backwards what the compiler should be doing. The high level
construct 
 is what the programmer should be writing. It shouldn't be something the
compiler 
 reconstructs from low level source code.
I agree a lot. The language has to offer means to express all the semantics and constraints: that the arrays are disjoint, that the operations done on them are pure or not pure, that the operations are not pure but determined only by a small window in the arrays, and so on. And then the compiler has to optimize the code according to the presence of SIMD registers, multi-cores, etc. This may not be enough for maximum-performance applications, but in most situations it's plenty. (Incidentally, this is a lot of what the Chapel language does (and D doesn't), and what I have explained in two past posts about Chapel, that were mostly ignored.) Bye, bearophile
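(D already has a little of this vocabulary: pure marks an element-wise kernel as side-effect free, and writing into a freshly allocated array makes the disjointness explicit, so a library, or eventually the compiler, is entitled to parallelize. A hedged sketch using std.parallelism's amap; whether the compiler itself will ever exploit such annotations is exactly the open question here.)

import std.parallelism;

// A pure element-wise kernel: the annotation tells reader and compiler
// alike that calls have no side effects and can be freely reordered.
pure double kernel(double x)
{
    return x * x + 1.0;
}

void main()
{
    auto input = new double[100_000];
    input[] = 0.5;   // whole-array initialization

    // amap applies the kernel across the input on the task pool and
    // writes into a freshly allocated (hence disjoint) result array.
    double[] output = taskPool.amap!kernel(input);
}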
Nov 11 2010
parent retard <re tard.com.invalid> writes:
Thu, 11 Nov 2010 16:32:03 -0500, bearophile wrote:

 Walter:
 
 Yup. I am bemused by the efforts put into analyzing loops so that they
 can (by the compiler) be re-written into a higher level construct, and
 then the higher level construct is compiled.
 
 It just is backwards what the compiler should be doing. The high level
 construct is what the programmer should be writing. It shouldn't be
 something the compiler reconstructs from low level source code.
I agree a lot. The language has to offer means to express all the semantics and constraints, that the arrays are disjointed, that the operations done on them are pure or not pure, that the operations are not pure but determined only by a small window in the arrays, and so on and on. And then the compiler has to optimize the code according to the presence of SIMD registers, multi-cores, etc. This maybe is not enough for max performance applications, but in most situations it's plenty enough. (Incidentally, this is a lot what the Chapel language does (and D doesn't), and what I have explained in two past posts about Chapel, that were mostly ignored.)
How does Chapel work when I need to sort data (just a basic quicksort on 12 cores, for instance), or e.g. compile many files in parallel, or encode Xvid? What is the content of the array in the Xvid case?
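(Chapel specifics aside, the "compile many files" case needs no data-parallel array at all, only independent tasks; in std.parallelism terms it is roughly the sketch below, where compileOne is a hypothetical stand-in for invoking the real compiler.)

import std.parallelism;
import std.stdio;

// Hypothetical stand-in for running the actual compiler on one file.
void compileOne(string source)
{
    writeln("compiling ", source);
}

void main()
{
    auto sources = ["a.d", "b.d", "c.d", "d.d"];

    // Each file is an independent unit of work; the pool spreads the
    // iterations over however many worker threads the machine offers.
    foreach (file; taskPool.parallel(sources))
        compileOne(file);
}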
Nov 11 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Nov 11 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
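(That is more or less the model D's std.concurrency exposes: each spawned thread is an ordinary sequential function, and the only interaction is messages. A minimal sketch of two such little sequential programs; the counter task is invented for illustration.)

import std.concurrency;
import std.stdio;

// One "little sequential program": loops, does its work, reports back.
void counter(Tid owner, int upTo)
{
    int total = 0;
    foreach (i; 1 .. upTo + 1)
        total += i;
    owner.send(total);   // interaction happens only through messages
}

void main()
{
    auto tid = spawn(&counter, thisTid, 100);
    auto total = receiveOnly!int();
    writeln("sum 1..100 = ", total);   // 5050
}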
Nov 11 2010
next sibling parent reply %u <user web.news> writes:
Sean Kelly Wrote:

 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times for parallel computing. Lots of new media for distributed message-passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPUs soon if Larrabee comes.
Nov 11 2010
parent reply Gary Whatmore <no spam.sp> writes:
%u Wrote:

 Sean Kelly Wrote:
 
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times parallel computing. Lots of new medias for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPU soon if Larrabee comes.
AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard of it here). Fortran. These are all good examples. AVX is just CPU intrinsics stuff in std.intrinsics.
Nov 11 2010
parent %u <user web.news> writes:
Gary Whatmore Wrote:

 %u Wrote:
 
 Sean Kelly Wrote:
 
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times parallel computing. Lots of new medias for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPU soon if Larrabee comes.
AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard it here). Fortran. These are all good examples. AVX is just a cpu intrinsics stuff in std.intrinsics
Currently the amount of information available is scarce. I have no idea how I use AVX or SSE in D. Auto-vectorization? Does it cover all use cases? So: SSE & auto-vectorization & intrinsics => loops, hand-written inline assembly parts, very small scale; local worker threads / fibers => dsimcha's lib, medium scale; local area network => the great flagship distributed message-passing system; huge clusters with 1000+ computers? Why is the message-passing system so important? Assume I have a dual-core laptop with AVX instructions next year. Use of 2 threads doubles my processor power. Use of AVX gives 8 times more power in good loops. I have no cluster, so the flagship system provides zero benefit.
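(On the SSE/AVX end of that list: the idiomatic D answer today is mostly the whole-array operations discussed above, which a backend is free, though not obliged, to lower to SSE/AVX, plus inline asm when a specific instruction is really needed. A tiny, hedged sketch of the former:)

import std.stdio;

// y[i] += alpha * x[i], written as one whole-array expression. An
// optimizing backend may emit SSE/AVX for this, but the language itself
// makes no promise about vectorization.
void saxpy(float[] y, float alpha, const(float)[] x)
{
    assert(y.length == x.length);
    y[] += x[] * alpha;
}

void main()
{
    float[] y = [1, 2, 3, 4];
    float[] x = [10, 20, 30, 40];
    saxpy(y, 0.5f, x);
    writeln(y);   // [6, 12, 18, 24]
}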
Nov 11 2010
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Kelly wrote:
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact. I agree with Walter -- I don't think it's got much to do with programmer training. It's a problem that hasn't been solved in the real world in the general case. The analogy with the real world suggests to me that there are three cases that work well:
* massively parallel;
* _completely_ independent tasks; and
* very small teams.
Large teams are a management nightmare, and I see no reason to believe that wouldn't hold true for a large number of cores as well.
Nov 12 2010
parent sybrandy <sybrandy gmail.com> writes:
 Distributed programming is essentially a bunch of little sequential
 program that interact, which is basically how people cooperate in the
 real world. I think that is by far the most intuitive of any
 concurrent programming model, though it's still a significant
 conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.
That's only part of the reasoning behind all of the little programs in Erlang. One of the more important aspects is the concept of supervisor trees, where you have processes that monitor* other processes. In the event that a child process fails, the parent process will try to perform a simpler version of what needs to occur until it is successful. The other aspect is the concept of failing fast: it is assumed that a process that fails does not know how to resolve the issue, therefore it should just stop running and allow the parent process to do the right thing. If you build your software the Erlang way, then you implicitly build software that is multi-core friendly. How well it uses multiple cores depends on the software that is written, however I believe that Erlang is supposed to be better than most other languages at obtaining something close to linear scaling across cores. Not 100% sure, though. Does this mean that I believe distributed programming is easy in Erlang? Well, that depends on what you're doing, but I will say that being able to spawn functions on different machines is dirt simple. Doing it efficiently... well, that's where I think the programmer needs to know what they're doing. Casey * The monitoring is something implicit to the language.
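(Erlang's supervisor idea has a rough analogue in D's std.concurrency via spawnLinked: the supervisor blocks in receive and is told, via LinkTerminated, when a linked worker has terminated, at which point it can restart it. A hedged sketch; the job protocol is invented, and real supervision needs restart limits and backoff.)

import std.concurrency;
import std.stdio;

// A worker that handles one job and fails fast on bad input: it stops
// instead of limping on, and its termination notifies linked threads.
void worker()
{
    auto job = receiveOnly!int();
    if (job < 0)
    {
        writeln("worker: bad job, giving up");
        return;   // fail fast; the supervisor will notice
    }
    writeln("worker: processed job ", job);
}

void main()
{
    auto tid = spawnLinked(&worker);   // link supervisor and worker
    tid.send(-1);                      // make the first worker give up

    try
    {
        receiveOnly!int();             // supervisor waits for news
    }
    catch (LinkTerminated)
    {
        // Simplest possible policy: restart and hand over a sane job.
        writeln("supervisor: worker terminated, restarting");
        tid = spawnLinked(&worker);
        tid.send(2);
    }
}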
Nov 13 2010
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
On 11-nov-10, at 20:41, Russel Winder wrote:

 On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
 [ . . . ]
 on this I am not so sure, heterogeneous clusters are more difficult  
 to
 program, and GPU & co are slowly becoming more and more general  
 purpose.
 Being able to take advantage of those is useful, but I am not
 convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.
Vector coprocessors, yes, I see that, and short term the effect of things like AMD Fusion (CPU/GPU merging). Is this necessarily the future? I don't know; neither does Intel, I think, as they are still evaluating Larrabee. But CPU/GPU will stay around for some time more, for sure.
 [ . . . ]
 yes many core is the future I agree on this, and also that  
 distributed
 approach is the only way to scale to a really large number of
 processors.
 Bud distributed systems *are* more complex, so I think that for the
 foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past.
 I disagree that distributed systems are more complex per se.  I  
 suspect
 comments are getting so general here that anything anyone writes can  
 be
 seen as both true and false simultaneously.  My perception is that
 shared memory multithreading is less and less a tool that applications
 programmers should be thinking in terms of.  Multiple processes with  
 an
 hierarchy of communications costs is the overarching architecture with
 each process potentially being SMP or CSP or . . .
I agree that on not-too-large shared memory machines a hierarchy of tasks is the correct approach. This is what I did in blip.parallel.smp. Using that, one can have fairly efficient automatic scheduling, and so forget most of the complexities and the actual hardware configuration.
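(blip's own API isn't shown in this thread, so the following is only the same hierarchical-task idea expressed in std.parallelism terms, assuming the task/yieldForce names as they later appeared in Phobos: each level pushes half of the work onto the pool and does the other half itself, giving a tree of tasks the scheduler can balance.)

import std.parallelism;
import std.stdio;

// A toy divide-and-conquer computation producing a hierarchy of tasks.
ulong sumRange(ulong lo, ulong hi)
{
    if (hi - lo < 10_000)
    {
        ulong s = 0;
        foreach (i; lo .. hi)
            s += i;
        return s;
    }
    auto mid = lo + (hi - lo) / 2;
    auto left = task!sumRange(lo, mid);   // child task
    taskPool.put(left);                   // schedule it on the pool
    auto right = sumRange(mid, hi);       // do the other half here
    return left.yieldForce + right;       // wait for (or inline) the child
}

void main()
{
    writeln(sumRange(0, 1_000_000));      // 499999500000
}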
 again not sure the situation is as dire as you paint it, Linux does
 quite well in the HPC field... but I agree that to be the ideal OS  
 for
 these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams, it implies a central monolithic approach to operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$ despite throwing large amounts of money at the problem, and indeed bought some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system? My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from the now OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.
Yes, microkernels & co will be more and more important (but I wonder how much this will be the case for the desktop). ABI mobility? Not so sure; for HPC I can imagine having to compile to different ABIs (but maybe that is what you mean by ABI mobility).
 [ . . . ]
 Whole array operation are useful, and when possible one gains much
 using them, unfortunately not all problems can be reduced to few  
 large
 array operations, data parallel languages are not the main type of
 language for these reasons.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays! [ . . . ]
 well whole array operations are a generalization of the SPMD  
 approach,
 so I this sense you said that that kind of approach will have a  
 future
 (but with a more difficult optimization as the hardware is more  
 complex.
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handles correctly, code runs very fast.
 About MPI I think that many don't see what MPI really does, mpi  
 offers
 a simplified parallel model.
 The main weakness of this model is that it assumes some kind of
 reliability, but then it offers
 a clear computational model with processors ordered in a linear of
 higher dimensional structure and efficient collective communication
 primitives.
 Yes MPI is not the right choice for all problems, but when usable it
 is very powerful, often superior to the alternatives, and programming
 with it is *simpler* than thinking about a generic distributed  
 system.
 So I think that for problems that are not trivially parallel, or
 easily parallelizable MPI will remain as the best choice.
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.
PGAS and MPI both have the same executable everywhere, but MPI is more flexible with respect to making different parts execute different things, and MPI does provide more generic packing/unpacking, but I guess I see your problems with it. Having the same executable is a big constraint, but it is also a simplification.
 [ . . . ]
 It might be a personal thing, but I am kind of "suspicious" toward
 PGAS, I find a generalized MPI model better than PGAS when you want  
 to
 have separated address spaces.
 Using MPI one can define a PGAS like object wrapping local storage
 with an object that sends remote requests to access remote memory
 pieces.
 This means having a local server where this wrapped objects can be
 "published" and that can respond in any moment to external  
 requests. I
 call this rpc (remote procedure call) and it can be realized easily  
 on
 the top of MPI.
 As not all objects are distributed and in a complex program it does
 not always makes sense to distribute these objects on all processors
 or none, I find that the robust partitioning and collective
 communication primitives of MPI superior to PGAS.
 With enough effort you probably can get everything also from PGAS,  
 but
 then you loose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.
Yes I guess that is true
 [ . . . ]
 The situation is not so dire, some problems are trivially parallel,  
 or
 can be solved with simple parallel patterns, others don't need to be
 solved in parallel, as the sequential solution if fast enough, but I
 do agree that being able to develop parallel systems is increasingly
 important.
 In fact it is something that I like to do, and I thought about a lot.
 I did program parallel systems, and out of my experience I tried to
 build something to do parallel programs "the way it  should be", or  
 at
 least the way I would like it to be ;)
The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)
 The result is what I did with blip, http://dsource.org/projects/ 
 blip .
 I don't think that (excluding some simple examples) fully automatic
 (trasparent) parallelization is really feasible.
 At some point being parallel is more complex, and it puts an extra
 burden on the programmer.
 Still it is possible to have several levels of parallelization, and  
 if
 you program a fully parallel program it should still be possible to
 use it relatively efficiently locally, but a local program will not
 automatically become fully parallel.
At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.
When you have a network of things communicating (and I think that once you have a distributed system you come to that level), then it is not sufficient anymore to think about each piece in isolation; you have to think about the interactions too. There are some patterns that might help reduce the complexity: client/server, map/reduce, ... but in general it is more complex.
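(The map/reduce pattern at least maps directly onto std.parallelism; a hedged sketch of a parallel sum of squares using taskPool.reduce over a lazy map:)

import std.algorithm : map;
import std.parallelism;
import std.range : iota;
import std.stdio;

void main()
{
    // map/reduce: square each element lazily, then reduce in parallel.
    // Each worker reduces a chunk of the range and the per-worker partial
    // sums are combined with the same associative operation.
    auto nums = iota(100_000.0);
    auto sumSquares = taskPool.reduce!"a + b"(map!"a * a"(nums));
    writeln(sumSquares);
}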
Nov 12 2010
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Don Wrote:

 Sean Kelly wrote:
 Walter Bright Wrote:
 
 Russel Winder wrote:
 At the heart of all this is that programmers are taught that algorithm
 is a sequence of actions to achieve a goal.  Programmers are trained to
 think sequentially and this affects their coding.  This means that
 parallelism has to be expressed at a sufficiently high level that
 programmers can still reason about algorithms as sequential things. 
I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.
True enough. But it's certainly more natural to think about than mutex-based concurrency, automatic parallelization, etc. In the long term there may turn out to be better models, but I don't know of one today. Also, there are other goals for such a design than increasing computation speed: decreased maintenance cost, system reliability, etc. Erlang processes are equivalent to objects in C++ or Java, with the added benefit of asynchronous execution in instances where an immediate response (i.e. RPC) is not required. Performance gain is a direct function of how often this is true. But even where it's not, the other benefits exist.
 I agree with Walter -- I don't think it's got much to do with programmer 
 training. It's a problem that hasn't been solved in the real world in 
 the general case.
I agree. But we still need something better than the traditional approach now :-)
 The analogy with the real world suggests to me that there are three 
 cases that work well:
 * massively parallel;
 * _completely_ independent tasks; and
 * very small teams.
 
 Large teams are a management nightmare, and I see no reason to believe 
 that wouldn't hold true for a large number of cores as well.
Back when the Java OS was announced I envisioned a modular system backed by a database of objects serving different functions. Kind of like the old OpenDoc model, but at an OS level. It clearly didn't work out this way, but I'd be interested to see something along these lines. I honestly couldn't say whether apps would turn out to be easier or more difficult to create in such an environment though.
Nov 13 2010
parent sybrandy <sybrandy gmail.com> writes:
 True enough.  But it's certainly more natural to think about than mutex-based
concurrency, automatic parallelization, etc.  In the long term there may turn
out to be better models, but I don't know of one today.

 Also, there are other goals for such a design than increasing computation
speed: decreased maintenance cost, system reliability, etc.  Erlang processes
are equivalent to objects in C++ or Java with the added benefit of asynchronous
execution in instances where an immediate response (ie. RPC) is not required. 
Performance gain is a direct function of how often this is true.  But even
where it's not, the other benefits exist.
I like that description! Casey
Nov 13 2010