digitalmars.D - D and multicore

parallel noob (16/16) Nov 13 2010 Hello

Gary Whatmore (3/30) Nov 13 2010 This question would have been more appropriate in D.learn. I started my ...

Fawzi Mohamed (105/164) Nov 13 2010 there are different kinds of parallel problems, some are trivially, or
Fawzi Mohamed (10/34) Nov 14 2010 I forgot to say that if that part of your problem fits a simple

Peter Alexander (14/15) Nov 13 2010 This is all to do with "what" you are actually parallelizing. The task

Gary Whatmore (2/21) Nov 13 2010 He/she also doesn't really get the idea of GPU. Writing programs for the...

sybrandy (38/42) Nov 13 2010 Ideas: Yes.
retard (4/4) Nov 13 2010 Not really an answer to your questions, but according to my google reade...
Sean Kelly (4/14) Nov 13 2010 Software takes a long time to change. And the apps you picked aren't ea...

parallel noob <silly question.is> writes:

Hello

Intro: people with pseudonyms are often considered trolls here, but this is a
really honest question by a sw engineer now writing mostly sequential web
applications.  (I write "parallel" web apps, but the logic goes that you write
sequential applications for each http query, the frontend distributes queries
among backend http processes, and the database "magically" ensures proper
locking. There's hardly ever any locks in my code.)

D is touted as the next gen of multicore languages. I pondered between D and
D.learn, where to ask this. It just strikes me odd that there isn't any kind of
documentation explaining how I should write (parallel) code for multicore in D.
If D is much different, the general guidelines for PHP web applications, Java,
or Erlang might don't work. From what I've gathered from these discussions,
there are:

 - array operations and auto-parallelization of loops
 - mmx/sse intrinsics via library
 - transactional memory (requires hardware support? doesn't work?)
 - "erlang style" concurrency? == process functions in Phobos 2?
 - threads, locks, and synchronization primitives

Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r,
dennis, and simen clearly have ideas how to work with parallel problems. 

A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing
and http://en.wikipedia.org/wiki/Parallel_programming_model

I fail to map these concepts discussed here with the things listed on those
pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.

Sean mentioned:

"In the long term there may turn out to be better models, but I don't know of
one today."

So he's basically saying that those others listed in the wikipedia pages are
totally unsuitable for real world tasks? Only Erlang style message passing
works?

The next machine I buy comes with 12 or 16 cores or even more -- this one has 4
cores. The typical applications I use take advantage of 1-2 threads. For
example a cd ripper starts a new process for each mp3 encoder. The program runs
at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more
applications run in the browser. The browser actively uses one thread + one
thread per applet. I can't even utilize more than 50% of the power of the
current gen!

The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the
core count doubles, the FPS rating in games almost doubles. They definitely are
not running Erlang style processes (one for GUI, one for sounds, one for
physics, one for network). That would leave 3150 cores unused.

Nov 13 2010

Gary Whatmore <no spam.sp> writes:

parallel noob Wrote:

 Hello
 
 Intro: people with pseudonyms are often considered trolls here, but this is a
really honest question by a sw engineer now writing mostly sequential web
applications.  (I write "parallel" web apps, but the logic goes that you write
sequential applications for each http query, the frontend distributes queries
among backend http processes, and the database "magically" ensures proper
locking. There's hardly ever any locks in my code.)
 
 D is touted as the next gen of multicore languages. I pondered between D and
D.learn, where to ask this. It just strikes me odd that there isn't any kind of
documentation explaining how I should write (parallel) code for multicore in D.
If D is much different, the general guidelines for PHP web applications, Java,
or Erlang might don't work. From what I've gathered from these discussions,
there are:
 
  - array operations and auto-parallelization of loops
  - mmx/sse intrinsics via library
  - transactional memory (requires hardware support? doesn't work?)
  - "erlang style" concurrency? == process functions in Phobos 2?
  - threads, locks, and synchronization primitives
 
 Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel,
trass3r, dennis, and simen clearly have ideas how to work with parallel
problems. 
 
 A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing
and http://en.wikipedia.org/wiki/Parallel_programming_model
 
 I fail to map these concepts discussed here with the things listed on those
pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.
 
 Sean mentioned:
 
 "In the long term there may turn out to be better models, but I don't know of
one today."
 
 So he's basically saying that those others listed in the wikipedia pages are
totally unsuitable for real world tasks? Only Erlang style message passing
works?
 
 The next machine I buy comes with 12 or 16 cores or even more -- this one has
4 cores. The typical applications I use take advantage of 1-2 threads. For
example a cd ripper starts a new process for each mp3 encoder. The program runs
at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more
applications run in the browser. The browser actively uses one thread + one
thread per applet. I can't even utilize more than 50% of the power of the
current gen!
 
 The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the
core count doubles, the FPS rating in games almost doubles. They definitely are
not running Erlang style processes (one for GUI, one for sounds, one for
physics, one for network). That would leave 3150 cores unused.

This question would have been more appropriate in D.learn. I started my journey
with http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601

The GPU is a different kind of beast. GPU uses CUDA/OpenCL.

Nov 13 2010

Fawzi Mohamed <fawzi gmx.ch> writes:

On 13-nov-10, at 22:23, Gary Whatmore wrote:

 parallel noob Wrote:

 Hello

 Intro: people with pseudonyms are often considered trolls here, but  
 this is a really honest question by a sw engineer now writing  
 mostly sequential web applications.  (I write "parallel" web apps,  
 but the logic goes that you write sequential applications for each  
 http query, the frontend distributes queries among backend http  
 processes, and the database "magically" ensures proper locking.  
 There's hardly ever any locks in my code.)

 D is touted as the next gen of multicore languages. I pondered  
 between D and D.learn, where to ask this. It just strikes me odd  
 that there isn't any kind of documentation explaining how I should  
 write (parallel) code for multicore in D. If D is much different,  
 the general guidelines for PHP web applications, Java, or Erlang  
 might don't work. From what I've gathered from these discussions,  
 there are:

 - array operations and auto-parallelization of loops
 - mmx/sse intrinsics via library
 - transactional memory (requires hardware support? doesn't work?)
 - "erlang style" concurrency? == process functions in Phobos 2?
 - threads, locks, and synchronization primitives

 Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile,  
 russel, trass3r, dennis, and simen clearly have ideas how to work  
 with parallel problems.

 A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing 
  and http://en.wikipedia.org/wiki/Parallel_programming_model

 I fail to map these concepts discussed here with the things listed  
 on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP,  
 and OpenCL there.

 Sean mentioned:

 "In the long term there may turn out to be better models, but I  
 don't know of one today."

 So he's basically saying that those others listed in the wikipedia  
 pages are totally unsuitable for real world tasks? Only Erlang  
 style message passing works?

 The next machine I buy comes with 12 or 16 cores or even more --  
 this one has 4 cores. The typical applications I use take advantage  
 of 1-2 threads. For example a cd ripper starts a new process for  
 each mp3 encoder. The program runs at most 3 threads (the gui, the  
 mp3 encoder, the cd ripper). More and more applications run in the  
 browser. The browser actively uses one thread + one thread per  
 applet. I can't even utilize more than 50% of the power of the  
 current gen!

 The situation is different with GPUs. My Radeon 5970 has 3200  
 cores. When the core count doubles, the FPS rating in games almost  
 doubles. They definitely are not running Erlang style processes  
 (one for GUI, one for sounds, one for physics, one for network).  
 That would leave 3150 cores unused.


there are different kinds of parallel problems, some are trivially, or  
almost trivially parallel, other are less parallel.
Some tasks are very quick (one talks of micro parallelism), other are  
much more coarse.

typical code has a limited parallelization potential, out of order  
execution of modern processors tries to take advantage of this, but  
normally having a lot of execution hardware is not useful because  
there is a limited amount of instruction level parallelism (ILP) is  
limited.
There is an important exception: vector operations. So processor often  
have vector hardware to do them efficiently. Compiler to take  
advantage of them vectorize loops.
Array operations are a class of operations (that include vector  
operations) that are often very parallel.
If one for example wants to apply a pure operations on an array this  
is trivially parallel.
Data parallel languages are especially good to express this kind of  
parallelism.

GPU are optimized graphical operations which are mainly kind of vector  
and array operations and thus have a large amount of this kind of  
parallelization. This is also present in some scientific programs, and  
indeed GPU (with CUDA or openCL) are increasingly used for that.

The more coarse levels of parallelization use other means.
In my opinion shared memory parallelization can be done efficiently if  
on is able to treat independent recursive tasks.
Recursive task (that come form example from divide & conquer  
approaches, and for example can be used to perform array operations)  
can be evaluated efficiently evaluating subtasks first, and stealing  
supertasks keeping into account the locality of processors (cilk has a  
such an approach).

Independent tasks can be represented by threads, should be executed  
fairly, and work well to represent a single interacting object or  
different requests for a web server.
All OS give support for this, as threads (unlike processes) share  
memory one has to take care that changes from one thread and another  
are done in a meaningful way. To achieve this there are locks, atomic  
ops,...
Transactional memory works for changes that should be done atomically  
(the big problem is that if something fails one has to undo everything  
and retry again, something that becomes more and more likely the  
longer the transaction).

What I tried to achieve with blip.parallel.smp is to accommodate  
reasonably well both kinds of parallelism.
Several higher level abstractions can then be built using this  
framework.
I feel that it is important to handle this centrally because a  
processor is used optimally when each core has a thread.
Actually to hide the latency introduced by some operations that would  
make a processor waste cycles, some processors keep two active threads  
and switch quickly from one to the other when one stalls. This is  
called hyperthreading, and while it doesn't improve the performance of  
a single thread (actually it is worse) it optimizes the throughput and  
and usage of the execution units of a single core.

I think that almost all shared memory parallelization approaches can  
be implemented reasonably efficiently on the top of tasks, possibly  
using some locality hints for the initial distribution.

PGAS (Partitioned global address space) tries to allow one to have a  
global view, while still having local storages, each process can  
access both his local and remote just indexing. the conceptual  
advantage of this is that one can easily migrate to it starting from a  
local view. The hope is that one can optimize the layout and patterns  
used to access the memory later and at least partially independently  
from the algorithm used. At least in some cases it is possible.

In distributed memory approaches each process has its own memory  
space, and cannot access directly the memory of other processes.
This is potentially more complex than PGAS/shared memory approaches,  
because one has to explicitly transfer data between different  
processes to communicate. Normally one uses *messages* to do it.
The big advantage of doing this is that the programmer has to think  
more explicitly about the potential costly communication operations  
(if one has a latency of few ms this is still 10^6 cycles of a typical  
processor), and possibly optimize it better.

The distributed memory space scales very well, as one can create  
easily large clusters of computers.
MPI (message passing interface) uses this approach.
But actually mpi is more about having collective communication  
patterns efficiently implemented and being able to create optimal  
subsets to do subwork. MPI is the correct choice if you have a complex  
problem (also for the parallelization) and you want to efficiently use  
*all* resources available.

Sometime the problem you have is not so costly that you need to commit  
all resources to it, you just want to solve it efficiently and if  
possible taking advantage of the parallelization.
In this case a good model is the actor model where objects communicate  
with messages to each other.
One can have thread objects with mailboxes and pattern matching to  
select the message, or objects with an interface and a remote  
procedure call to invoke them.
You can organize the network of messages in several ways, you can have  
a central server, and clients that connect, you can have a central  
database to communicate, you can have a peer to peer structure, you  
can have producer/consumer relationships.
Normally given a problem one can see how to partition it optimally.

I use blip.parallel.rpc to give that kind of messaging between objects.
Note that one has to think about failure of one part in  this model,  
not necessarily a failure of one process should stop all processes (in  
some cases it might even be undetected).
At this level one could theoretically migrate processes/objects  
automatically, but given that the latency increase can be very large  
(~10^6) this automatic distribution is doable only for tasks that were  
considered by the programmer, a fully automatic redistribution of any  
object is not realistic.

I hope this overview helps a bit
Fawzi

Nov 13 2010

Fawzi Mohamed <fawzi gmx.ch> writes:

On 14-nov-10, at 02:44, Fawzi Mohamed wrote:

 [...]
 Sometime the problem you have is not so costly that you need to  
 commit all resources to it, you just want to solve it efficiently  
 and if possible taking advantage of the parallelization.

 In this case a good model is the actor model where objects  
 communicate with messages to each other.
 One can have thread objects with mailboxes and pattern matching to  
 select the message, or objects with an interface and a remote  
 procedure call to invoke them.
 You can organize the network of messages in several ways, you can  
 have a central server, and clients that connect, you can have a  
 central database to communicate, you can have a peer to peer  
 structure, you can have producer/consumer relationships.
 Normally given a problem one can see how to partition it optimally.

 I use blip.parallel.rpc to give that kind of messaging between  
 objects.
 Note that one has to think about failure of one part in  this model,  
 not necessarily a failure of one process should stop all processes  
 (in some cases it might even be undetected).
 At this level one could theoretically migrate processes/objects  
 automatically, but given that the latency increase can be very large  
 (~10^6) this automatic distribution is doable only for tasks that  
 were considered by the programmer, a fully automatic redistribution  
 of any object is not realistic.

I forgot to say that if that part of your problem fits a simple  
parallelization pattern, for example costly basically independent  
tasks, for which you might (for example) use a client/server approach,  
or a data parallel approach with huge amount of distributed data on  
which you want to apply something like map/reduce.
In these examples the actor model might work well allowing one to be  
more dynamic about the work distribution than MPI (that has no nice  
partial failure), and still use all system resources (in fact I use it  
exactly for those reasons).

Nov 14 2010

Peter Alexander <peter.alexander.au gmail.com> writes:

On 13/11/10 7:04 PM, parallel noob wrote:
 The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the
core count doubles, the FPS rating in games almost doubles. They definitely are
not running Erlang style processes (one for GUI, one for sounds, one for
physics, one for network). That would leave 3150 cores unused.

This is all to do with "what" you are actually parallelizing. The task 
of the GPU (typically transforming vertices and doing lighting 
calculations per pixel) is trivially parallelized.

If you did that task on CPU it would still be trivially parallelized 
(just spawn as many threads as you have cores and split up the screen's 
pixels between the threads).

Besides being optimized for this particular task, GPU's have no silver 
bullet for handling concurrency. Trying to get a GPU to parallelize a 
task that is not easily parallelizable (e.g. depth-first search on a 
graph) is just as hard as getting the CPU to do it. In this case, 
doubling the cores will not help.

GPU batching of tasks concurrency handles rendering well, but Erlang 
style message passing is a better model in the general case.

Nov 13 2010

Gary Whatmore <no spam.sp> writes:

Peter Alexander Wrote:

 On 13/11/10 7:04 PM, parallel noob wrote:
 The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the
core count doubles, the FPS rating in games almost doubles. They definitely are
not running Erlang style processes (one for GUI, one for sounds, one for
physics, one for network). That would leave 3150 cores unused.

 
 This is all to do with "what" you are actually parallelizing. The task 
 of the GPU (typically transforming vertices and doing lighting 
 calculations per pixel) is trivially parallelized.
 
 If you did that task on CPU it would still be trivially parallelized 
 (just spawn as many threads as you have cores and split up the screen's 
 pixels between the threads).
 
 Besides being optimized for this particular task, GPU's have no silver 
 bullet for handling concurrency. Trying to get a GPU to parallelize a 
 task that is not easily parallelizable (e.g. depth-first search on a 
 graph) is just as hard as getting the CPU to do it. In this case, 
 doubling the cores will not help.
 
 GPU batching of tasks concurrency handles rendering well, but Erlang 
 style message passing is a better model in the general case.

He/she also doesn't really get the idea of GPU. Writing programs for the GPU is
really simple. Just use a C enriched with parallel arrays. Few changes here and
there. The old C program now runs 200 times as fast if it's graphics processing
(z-buffer filtering, blur, brightness correction). When you need something more
complex, the control becomes hard and locks don't always work. You're left with
academic non-concepts and Erlang style message passing.

Nov 13 2010

sybrandy <sybrandy gmail.com> writes:

 Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel,
trass3r, dennis, and simen clearly have ideas how to work with parallel
problems.

Ideas: Yes.
Expert: No.  I haven't exercised it much as I haven't been able to do it 
at work.  Others here have shown that they have more knowledge than I.

 Sean mentioned:

 "In the long term there may turn out to be better models, but I don't know of
one today."

 So he's basically saying that those others listed in the wikipedia pages are
totally unsuitable for real world tasks? Only Erlang style message passing
works?

It's the easiest to understand for the most part as it's very similar to 
how the real world works.  Think about a coffee shop: two 
processes/people, let's call them Alice (customer) and Bob (barrista), 
are in the coffee shop.  Alice is ordering coffee and Bob is making 
coffee.  Alice tells Bob that she wants a 
super-duper-extra-foamy-mocha-hazel-choco-latte.  E.g. she sends a 
message to Bob.  Bob then does his magic and sends a message to Alice by 
handing her overpriced coffee.

So, from an understandability standpoint, it's very simple to 
understand.  There are also other benefits that are best expressed in 
Joe Armstrong's thesis, which you can easily find online.  He is one of 
the creators of Erlang and his thesis describes many of the benefits of 
the architecture.

Is it the best for all problems?  No.  For example, there are better 
tools for parallelizing loops.  The Actor Model, which is what Erlang 
uses, doesn't work very well for this.  There are other methods as well, 
but outside of auto parellelization and the actor model, they are harder 
to implement correctly, IIRC.  There may be a couple I don't really know 
about.

Now, will any of these provide you with instant benefits in your apps? 
Probably not.  Most software today was designed to run on one core and 
therefore doesn't use multiple cores very well.  Firefox (3.6 and 
later), Chrome, IE 9, and I think the latest version of Opera all use 
multiple threads now mainly to keep certain pages/plugins from killing 
the entire browser, though you may notice some speedups just from that.

However, this does not mean that multiple cores are not benefiting you 
much.  The OS is (should be) smart enough to distribute multiple apps 
across the cores, so you can have your browser and your word processor 
working hard and not have either one prevent the other one from getting 
any CPU time.  The problem is that most of us don't have more than 2-4 
apps working hard at any time, so past 4 cores you probably won't be 
getting much benefit in this respect.  As threading becomes more popular 
and threaded apps are better designed, we should see a trend of better 
CPU usage on the desktop.

Casey

Nov 13 2010

retard <re tard.com.invalid> writes:

Not really an answer to your questions, but according to my google reader 
this appeared very few minutes ago

http://lambda-the-ultimate.org/node/4134

I don't think we've heard the last word on this domain of programming.

Nov 13 2010

Sean Kelly <sean invisibleduck.org> writes:

parallel noob Wrote:
 
 I fail to map these concepts discussed here with the things listed on those
pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.
 
 Sean mentioned:
 
 "In the long term there may turn out to be better models, but I don't know of
one today."
 
 So he's basically saying that those others listed in the wikipedia pages are
totally unsuitable for real world tasks? Only Erlang style message passing
works?

I was saying that message-passing is the most easily understandable concurrency
model.  But answering the rest of your question is somewhat tricky.  Data
parallelism (ie. OpenMP) is the optimal approach for processing large datasets,
but data parallelism can be achieved (at an API level) in a variety of ways,
from POSIX Threads to message passing.  A specialized API is best though, and
in fact there's a data parallelism package under consideration for Phobos right
now.  I don't know if that clarifies or confuses things however.

 The next machine I buy comes with 12 or 16 cores or even more -- this one has
4 cores. The typical applications I use take advantage of 1-2 threads. For
example a cd ripper starts a new process for each mp3 encoder. The program runs
at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more
applications run in the browser. The browser actively uses one thread + one
thread per applet. I can't even utilize more than 50% of the power of the
current gen!

Software takes a long time to change.  And the apps you picked aren't easily
parallelizable anyway.  I think Russel is exactly right when he says that
future systems will have a collection of heterogeneous cores... a few
traditional ones for apps like the above (say 16) and clusters of others for
different purposes.  Seems like the trick will be dispatching work to the
appropriate one.

 The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the
core count doubles, the FPS rating in games almost doubles. They definitely are
not running Erlang style processes (one for GUI, one for sounds, one for
physics, one for network). That would leave 3150 cores unused.

Graphics are an ideal application for data parallelism.  But those cores are
all organizing their operations via messages passed on the system bus :-)  I
know, that's kind of a lame out, even if it's true.

Nov 13 2010

D Programming

C/C++ Programming

Other

digitalmars.D - D and multicore