
digitalmars.D - Good demo for showing benefits of parallelism

reply Bill Baxter <dnewsgroup billbaxter.com> writes:
I don't remember where it was, but somebody in some thread said 
something like gee it would be great if we had a demo that showed the 
benefits of having parallel threads.  (It was probably in either the 
Futures lib thread or in the discussion about varargs_reduce).

Anyway, a raytracer is a perfect example of "embarrassingly 
parallelizable" code.  In the simplest case you can state it simply as 
"for each pixel do trace_ray(ray_dir)".

Bradley Smith posted code for a little raytracer over on D.learn under 
the heading "Why is this code slower than C++".  If anyone is interested 
in turning this into a parallel raytracer that would be a nice little demo.

Along the same lines, almost any image processing algorithm is in the 
same category where the top loop looks like "foreach pixel do ___". 
This is a big part of why GPUs just keep getting faster and faster, 
because the basic problem is just so inherently parallelizable.

--bb
Jan 25 2007
next sibling parent "David B. Held" <dheld codelogicconsulting.com> writes:
Bill Baxter wrote:
 [...]
 Along the same lines, almost any image processing algorithm is in the 
 same category where the top loop looks like "foreach pixel do ___". This 
 is a big part of why GPUs just keep getting faster and faster, because 
 the basic problem is just so inherently parallelizable.

Yeah, it's funny that people are so impressed with 2, 4, even 32 cores when
the nVidia 8800 has 48 pixel pipelines and 128 shaders. That's also why some
people use GPUs as math coprocessors (and why some video cards require a
dedicated power supply cord!).

Speaking of threading and parallelization, when is D going to get Transactional
Memory? Could it steal it from here:
http://en.wikipedia.org/wiki/Transactional_memory#Implementations? There's
http://libcmt.sourceforge.net/ and https://sourceforge.net/projects/libltx.
Surely something can be kiped?

Dave
Jan 26 2007
prev sibling next sibling parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
i had a simple, unoptimized raytracer from an old CG assignment lying
around that i ported to D and modified for parallel computation.
download here: http://mainia.de/prt.zip

on a single core/cpu machine, the time stays about the same until 4
threads, then the overhead kicks in.
here are some values from an opteron dual core system running linux
kernel 2.6 for different thread counts:
thrds seconds
1     32.123
2     32.182
3     29.329
4     28.556
8     21.661
16    20.186
24    20.423
32    21.410

these aren't quite what i expected. CPU usage shows that both cores get
about 55-80% load with 2 threads, stabilizing at 65-75% with 16
threads. with a single thread it's clearly 100% on one core.

am i missing something about the pthread lib or memory usage/sync that
prevents reasonable speedup? 160% is nice, but why does it take 16
threads to get there? and where exactly do the remaining 40% go?

Bill Baxter wrote:
 I don't remember where it was, but somebody in some thread said
 something like gee it would be great if we had a demo that showed the
 benefits of having parallel threads.  (It was probably in either the
 Futures lib thread or in the discussion about varargs_reduce).
 
 Anyway, a raytracer is a perfect example of "embarrassingly
 parallelizable" code.  In the simplest case you can state it simply as
 "for each pixel do trace_ray(ray_dir)".
 
 Bradley Smith posted code for a little raytracer over on D.learn under
 the heading "Why is this code slower than C++".  If anyone is interested
 in turning this into a parallel raytracer that would be a nice little demo.
 
 Along the same lines, almost any image processing algorithm is in the
 same category where the top loop looks like "foreach pixel do ___". This
 is a big part of why GPUs just keep getting faster and faster, because
 the basic problem is just so inherently parallelizable.
 
 --bb

Jan 26 2007
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Jascha Wetzel wrote:
 i had a simple, unoptimized raytracer from an old CG assignment lying
 around that i ported to D and modified for parallel computation.
 download here: http://mainia.de/prt.zip
 
 on a single core/cpu machine, the time stays about the same until 4
 threads, then the overhead kicks in.
 here are some values from an opteron dual core system running linux
 kernel 2.6 for different thread counts:
 thrds seconds
 1     32.123
 2     32.182
 3     29.329
 4     28.556
 8     21.661
 16    20.186
 24    20.423
 32    21.410
 
 these aren't quite what i expected. CPU usage shows that both cores get
 about 55-80% load with 2 threads, stabilizing at 65-75% with 16
 threads. with a single thread it's clearly 100% on one core.
 
 am i missing something about the pthread lib or memory usage/sync that
 prevents reasonable speedup? 160% is nice, but why does it take 16
 threads to get there? and where exactly do the remaining 40% go?
 
 Bill Baxter wrote:
 I don't remember where it was, but somebody in some thread said
 something like gee it would be great if we had a demo that showed the
 benefits of having parallel threads.  (It was probably in either the
 Futures lib thread or in the discussion about varargs_reduce).

 Anyway, a raytracer is a perfect example of "embarrassingly
 parallelizable" code.  In the simplest case you can state it simply as
 "for each pixel do trace_ray(ray_dir)".

 Bradley Smith posted code for a little raytracer over on D.learn under
 the heading "Why is this code slower than C++".  If anyone is interested
 in turning this into a parallel raytracer that would be a nice little demo.

 Along the same lines, almost any image processing algorithm is in the
 same category where the top loop looks like "foreach pixel do ___". This
 is a big part of why GPUs just keep getting faster and faster, because
 the basic problem is just so inherently parallelizable.

 --bb


Do you know if there was much lock contention?  If there is a lot, then you
can lose efficiency because the threads bump into each other like a traffic
jam.  Failing to get a lock (pausing and waking up again when it frees up) is
a lot slower than getting it successfully on the first try.

There was some code I had at work where I did something like this:

   Mutex.lock();
   int length = (end - begin) * 4;
   char * p = (finds a byte location in an mmapped file);
   int remainder = *p;    // A
   length += remainder;   // A
   Mutex.unlock();

I was able to improve performance measurably by moving the unlock() to a point
before the lines marked "A".  It turns out the *p was (often) causing a soft
page fault, which caused the thread to stall, which caused other threads to
pile up on the mutex.

If you can minimize your critical sections, you can reduce contention.  The
ideal thing is to use a critical section to swap pointers or something else
that can't take much time.  If a memory read can trigger a page fault, you can
sometimes do it *before* the critical section, then get the lock and get the
'correct' value.  This way the memory page is "known to be warm" by the time
you are in the lock, even if the actual value may change.

[ But this kind of fiddling around is pretty low level, so it won't have any
effect most of the time -- make sure you have the right problem before fixing
it. ]

Kevin
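
A minimal D sketch of the same idea -- the names (gLock, mappedFile,
readLength) are made up, not the actual work code: warm the page before taking
the lock, and keep the locked region as small as possible.

import core.sync.mutex : Mutex;

// Hypothetical sketch: mappedFile stands in for the mmapped region.
__gshared Mutex gLock;
__gshared ubyte[] mappedFile;

shared static this() { gLock = new Mutex; }

int readLength(size_t offset, size_t begin, size_t end)
{
    auto warm = mappedFile[offset];  // touch the byte before locking (page fault happens here)

    gLock.lock();
    scope (exit) gLock.unlock();     // the locked region stays tiny

    int length = cast(int)(end - begin) * 4;
    length += mappedFile[offset];    // re-read the 'correct' value under the lock
    return length;
}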
Jan 26 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Kevin Bealer wrote:
 Do you know if there was much lock contention?

there is no explicit synchronization at all. the nice thing about raytracing
and similar algorithms is that you can use disjoint memory and let the threads
work isolated from each other. except for the main thread that waits for the
workers to terminate, there are no mutexes involved.
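
A minimal sketch of that setup in D -- hypothetical names throughout, not the
code from prt.zip: each worker owns a disjoint band of rows, and the only
synchronization is the final join.

import core.thread : Thread;

// tracePixel must not touch shared mutable state for this to stay lock-free.
void renderBands(double[] image, int width, int height, int nThreads,
                 double delegate(int x, int y) tracePixel)
{
    void delegate() makeWorker(int t)
    {
        return delegate()
        {
            int y0 = height * t / nThreads;
            int y1 = height * (t + 1) / nThreads;
            foreach (y; y0 .. y1)                 // this thread's rows only
                foreach (x; 0 .. width)
                    image[y * width + x] = tracePixel(x, y);
        };
    }

    Thread[] workers;
    foreach (t; 0 .. nThreads)
        workers ~= new Thread(makeWorker(t));
    foreach (w; workers) w.start();
    foreach (w; workers) w.join();                // the only synchronization point
}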
Jan 26 2007
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Jascha Wetzel wrote:
 Kevin Bealer wrote:
 Do you know if there was much lock contention?

there is no explicit synchronization at all. the nice thing about raytracing
and similar algorithms is that you can use disjoint memory and let the threads
work isolated from each other. except for the main thread that waits for the
workers to terminate, there are no mutexes involved.

Hmmm... if the chunks you use are too small or too large, it can be an issue
too.  If you divide the workload into small chunks like individual pixels, it
costs overhead due to switching and loss of locality of reference.  On the
other hand, if you have ten threads, I think it is a mistake to divide the
pixel load into exactly ten parts -- some pixels are heavier than others.  The
pixels that raytrace complex refractions will take longer, so some threads
finish first.

What I would tend to do is divide the pixel count into blocks, then let each
thread pull a block of (consecutive or adjacent if possible) pixels to work
on.  Then some threads can do hard blocks and take a while at it, and others
can do simple blocks and process more of them.  The goal is for all the
threads to finish at about the same time.  If they don't, you end up with some
threads waiting idly at the end.  This would require reintroducing a little
synchronization.

I might divide an image into blocks that are each about 2% of the total image.
Let's say you have 4 hardware threads and each block is 2% of the workload.
Then the average inefficiency is maybe something like 1/2 * 0.02 * 4 = 4%.
This is 1/2 of the threads being idle as they finish at randomly uneven rates,
times 2% of the total work, times 4 hardware threads (because it doesn't hurt
much to have idle *software* threads, only idle hardware ones).

Other micro-considerations:

1. Handing out 'harder' sections first is a good idea if you have this info,
because these will "stick out" more if they are the only ones running at the
end (they represent a larger percentage than their block size).

2. You can start by handing out large chunks and then switch to smaller chunks
when the work is mostly done.  For example, for 4 hardware threads, you could
always hand out 1/16th of the remaining work until you get down to handing out
100 pixels at a time.

(I don't know how useful this is for raytracing specifically, but some of
these issues come up where I work.)

Kevin
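
A minimal sketch of the block-pulling scheme in D, with hypothetical names;
the single shared counter is the only synchronization the workers need.

import core.atomic : atomicOp;

// Each worker keeps claiming the next block of pixels until none are left,
// so slow and fast blocks even out on their own.
shared int nextBlock;
enum blockSize = 4096;              // roughly 1-2% of a 640x480 frame

void workerLoop(int totalPixels, void delegate(int begin, int end) traceBlock)
{
    for (;;)
    {
        // atomically claim one block index (atomicOp returns the new value)
        int block = atomicOp!"+="(nextBlock, 1) - 1;
        int begin = block * blockSize;
        if (begin >= totalPixels)
            break;                                  // everything handed out
        int end = begin + blockSize;
        if (end > totalPixels) end = totalPixels;
        traceBlock(begin, end);                     // hard blocks just take longer here
    }
}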
Jan 27 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Kevin Bealer wrote:
 [...]
 The goal is for all the threads to finish at about the same time.  If
 they don't you end up with some threads waiting idly at the end.

 This would require reintroducing a little synchronization.
 [...]

that's all very reasonable, thx. i changed the job distribution by dividing
the image into a grid, assigning (equally sized) cells to threads in an
extending spiral starting in the middle of the image. this is due to the
assumption that the hardest cells are around the middle (which is the case for
the test scene). the distribution of those cells is of course synchronized.
new sources: http://mainia.de/prt_grid.zip

now everything is faster, even with one thread. but the speedup with more
threads decreased (time stays about the same for 1-16 threads). still, i
haven't quite "felt" the multicore performance. meanwhile, a friend tells me
that his speedups with image processing algorithms and naive block assignment
are close to 200%... atm i suspect there's a more technical reason for that,
something with the threading lib or the like...
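
A rough sketch of the centre-out ordering -- not the actual prt_grid code,
just a stand-in that sorts cells by ring distance instead of walking a true
spiral:

import std.algorithm : sort, max;
import std.math : abs;

// Order the grid cells by distance from the centre so the (presumably
// hardest) middle cells are handed out to the worker threads first.
struct Cell { int cx, cy; }

Cell[] centreOutOrder(int cols, int rows)
{
    Cell[] cells;
    foreach (y; 0 .. rows)
        foreach (x; 0 .. cols)
            cells ~= Cell(x, y);

    int mx = cols / 2, my = rows / 2;
    // Chebyshev distance from the centre cell == the ring a square spiral visits
    sort!((a, b) => max(abs(a.cx - mx), abs(a.cy - my))
                  < max(abs(b.cx - mx), abs(b.cy - my)))(cells);
    return cells;
}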
Jan 27 2007
parent reply Philipp Spöth <philipp frischluft.com> writes:
Kevin is right with his optimization ideas of course. But there is something
else going on there. The native approach already should gain way more speed
than Jascha describes. And even worse, on my Core Duo on Windows the tracer
drastically loses time with more threads. With one thread on a 640x480 image
it takes 11 seconds. With two threads it takes 86 seconds!

When looking at the process manager with 2 threads it shows that cpu usage is
constantly only somewhere between 15% and 35%. Increasing the priority of the
threads in D doesn't seem to do anything. Raising the priority in the process
manager lifts cpu usage to near 100%. However it doesn't make the program
finish any faster.

There must be some implicit synchronization involved, though I have no idea
where and why.

 Jascha Wetzel <[firstname] mainia.de> Wrote:


 Kevin Bealer wrote:
 [...]
 The goal is for all the threads to finish at about the same time.  If
 they don't you end up with some threads waiting idly at the end.

 This would require reintroducing a little synchronization.
 [...]

that's all very reasonable, thx. i changed the job distribution by dividing
the image into a grid, assigning (equally sized) cells to threads in an
extending spiral starting in the middle of the image. this is due to the
assumption that the hardest cells are around the middle (which is the case for
the test scene). the distribution of those cells is of course synchronized.
new sources: http://mainia.de/prt_grid.zip

now everything is faster, even with one thread. but the speedup with more
threads decreased (time stays about the same for 1-16 threads). still, i
haven't quite "felt" the multicore performance. meanwhile, a friend tells me
that his speedups with image processing algorithms and naive block assignment
are close to 200%... atm i suspect there's a more technical reason for that,
something with the threading lib or the like...

Jan 27 2007
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
That's really weird.  Unless you are actually straining your L1 or L2 
cache (or there is, as you say, really bad implicit synchronization), it's 
hard to see what could do this.

I think 'valgrind' has some skins that deal with this sort of analysis,
specifically, cachegrind, callgrind and helgrind, but helgrind requires 
version 2.2 (or earlier) and it looks like valgrind is crashing for me 
with dmd-produced binaries -- maybe I need to set up gdc and try it that 
way.

Kevin

Philipp Spöth wrote:
 Kevin is right with his optimization ideas of course. But there is something
 else going on there. The native approach already should gain way more speed
 than Jascha describes. And even worse, on my Core Duo on Windows the tracer
 drastically loses time with more threads. With one thread on a 640x480 image
 it takes 11 seconds. With two threads it takes 86 seconds!
 
 When looking at the process manager with 2 threads it shows that cpu usage is
 constantly only somewhere between 15% and 35%. Increasing the priority of the
 threads in D doesn't seem to do anything. Raising the priority in the process
 manager lifts cpu usage to near 100%. However it doesn't make the program
 finish any faster.
 
 There must be some implicit synchronization involved, though I have no idea
 where and why.
 
  Jascha Wetzel <[firstname] mainia.de> Wrote:
 
 
 Kevin Bealer wrote:
 [...]
 The goal is for all the threads to finish at about the same time.  If
 they don't you end up with some threads waiting idly at the end.

 This would require reintroducing a little synchronization.
 [...]

i changed the job distribution by dividing the image into a grid, assigning
(equally sized) cells to threads in an extending spiral starting in the middle
of the image. this is due to the assumption that the hardest cells are around
the middle (which is the case for the test scene). the distribution of those
cells is of course synchronized. new sources: http://mainia.de/prt_grid.zip

now everything is faster, even with one thread. but the speedup with more
threads decreased (time stays about the same for 1-16 threads). still, i
haven't quite "felt" the multicore performance. meanwhile, a friend tells me
that his speedups with image processing algorithms and naive block assignment
are close to 200%... atm i suspect there's a more technical reason for that,
something with the threading lib or the like...


Jan 27 2007
parent reply Philipp Spöth <no no.no> writes:
I did the good, old, elaborate "remove/insert everything until you
locate the problem" technique. And finally speed improved to about 9.6
seconds for the test scene with one thread (down from 39 seconds).
Also the threading behaviour is reasonable now. Two threads do the job
in 5.4 seconds.

I had to change a couple of things. Too many to list here, but when
the following lines are in - the strange threading behaviour is in as well:

Vec4d	Qo;
[snip]
Vec3d Qo3;
Qo3 = [Qo[0],Qo[1],Qo[2]];


When the last line is replaced with these - everything is fine:
Qo3[0] = Qo[0];
Qo3[1] = Qo[1];
Qo3[2] = Qo[2];


So the problem seems to be related to the vector assign operator which
is defined like this:

alias Vector!(double,3u) Vec3d;
alias Vector!(double,4u) Vec4d;
struct Vector(T, uint dim)
{
	T[dim] data;	// the vector components

	alias Vector!(T,dim) MyType;

	MyType opAssign(T[dim] d)
	{
		data[] = d;	// element-wise copy into the fixed-size member
		return *this;	// D1 style: return the struct by value
	}
}

I tried some variations but didn't find any further hints as to what causes
the problem. A curious little side note - if the operator is declared
directly instead of using 'MyType', the threading behaviour is still wrong
but the program runs much faster.

This is on Windows with dmd v1.004. Any ideas?



Kevin Bealer Wrote:

 
 That's really weird.  Unless you are actually straining your L1 or L2 
 cache (or there is, as you say, really bad implicit synchronization), it's 
 hard to see what could do this.
 
 I think 'valgrind' has some skins that deal with this sort of analysis,
 specifically, cachegrind, callgrind and helgrind, but helgrind requires 
 version 2.2 (or earlier) and it looks like valgrind is crashing for me 
 with dmd-produced binaries -- maybe I need to set up gdc and try it that 
 way.
 
 Kevin
 
 Philipp Spöth wrote:
 Kevin is right with his optimization ideas of course. But there is something
 else going on there. The native approach already should gain way more speed
 than Jascha describes. And even worse, on my Core Duo on Windows the tracer
 drastically loses time with more threads. With one thread on a 640x480 image
 it takes 11 seconds. With two threads it takes 86 seconds!
 
 When looking at the process manager with 2 threads it shows that cpu usage is
 constantly only somewhere between 15% and 35%. Increasing the priority of the
 threads in D doesn't seem to do anything. Raising the priority in the process
 manager lifts cpu usage to near 100%. However it doesn't make the program
 finish any faster.
 
 There must be some implicit synchronization involved, though I have no idea
 where and why.
 
  Jascha Wetzel <[firstname] mainia.de> Wrote:
 
 
 Kevin Bealer wrote:
 [...]
 The goal is for all the threads to finish at about the same time.  If
 they don't you end up with some threads waiting idly at the end.

 This would require reintroducing a little synchronization.
 [...]

i changed the job distribution by dividing the image into a grid, assigning
(equally sized) cells to threads in an extending spiral starting in the middle
of the image. this is due to the assumption that the hardest cells are around
the middle (which is the case for the test scene). the distribution of those
cells is of course synchronized. new sources: http://mainia.de/prt_grid.zip

now everything is faster, even with one thread. but the speedup with more
threads decreased (time stays about the same for 1-16 threads). still, i
haven't quite "felt" the multicore performance. meanwhile, a friend tells me
that his speedups with image processing algorithms and naive block assignment
are close to 200%... atm i suspect there's a more technical reason for that,
something with the threading lib or the like...



Jan 27 2007
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Philipp Spöth wrote:
 I did the good, old, elaborate "remove/insert everything until you
 locate the problem" technique. And finally speed improved to about 9.6
 seconds for the test scene with one thread (down from 39 seconds).
 Also the threading behaviour is reasonable now. Two threads do the job
 in 5.4 seconds.
 
 I had to change a couple of things. Too many to list here, but when
 the following lines are in - the strange threading behaviour is in as well:
 
 Vec4d	Qo;
 [snip]
 Vec3d Qo3;
 Qo3 = [Qo[0],Qo[1],Qo[2]];
 
 
 When the last line is replaced with these - everything is fine:
 Qo3[0] = Qo[0];
 Qo3[1] = Qo[1];
 Qo3[2] = Qo[2];
 
 
 So the problem seems to be related to the vector assign operator

I don't think it is, at least not directly. It's probably related to the
[Qo[0],Qo[1],Qo[2]] expression, which allocates a 3-element array _on the
heap_. AFAIK a mutex needs to be locked to allocate on the heap, which might
explain the bad threading behavior.
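
A small illustration of the difference, with plain arrays standing in for the
Vec3d/Vec4d structs (not code from the tracer):

// The commented-out line builds an array literal -- in the 2007 compiler that
// is a GC heap allocation taken under the GC mutex.  The slice copy below it
// is a plain memory copy, with no heap and no lock involved.
void copyXYZ(ref double[3] dst, ref double[4] src)
{
    // dst = [src[0], src[1], src[2]];   // array literal: allocates on the GC heap
    dst[] = src[0 .. 3];                 // slice copy: no allocation at all
}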
Jan 27 2007
parent Jascha Wetzel <"[firstname]" mainia.de> writes:
Frits van Bommel wrote:
 It's probably related to the [Qo[0],Qo[1],Qo[2]] expression, which
 allocates a 3-element array _on the heap_. AFAIK to allocate on the
 heap a mutex needs to be locked, which might explain bad threading
 behavior.

yep, that's exactly what happens. An array literal always gets allocated on
the heap by _d_arrayliteral, which uses new, which in turn uses GC.malloc,
which locks a mutex. A few allocations wouldn't be a problem, but these were
definitely too many...

I removed the unnecessary use of array literals, and now everything runs the
way it should. Now i get 9.940sec with one thread and 5.011sec with two
(opteron dual core 2.6ghz), which stays about the same for higher thread
counts. Nice! :)

So here is, for your parallel raytracing pleasure, the little demo as intended
in the first place: http://mainia.de/prt.zip
Jan 29 2007
prev sibling parent Kevin Bealer <kevinbealer gmail.com> writes:
Bill Baxter wrote:
 I don't remember where it was, but somebody in some thread said 
 something like gee it would be great if we had a demo that showed the 
 benefits of having parallel threads.  (It was probably in either the 
 Futures lib thread or in the discussion about varargs_reduce).
 
 Anyway, a raytracer is a perfect example of "embarrassingly 
 parallelizable" code.  In the simplest case you can state it simply as 
 "for each pixel do trace_ray(ray_dir)".
 
 Bradley Smith posted code for a little raytracer over on D.learn under 
 the heading "Why is this code slower than C++".  If anyone is interested 
 in turning this into a parallel raytracer that would be a nice little demo.
 
 Along the same lines, almost any image processing algorithm is in the 
 same category where the top loop looks like "foreach pixel do ___". This 
 is a big part of why GPUs just keep getting faster and faster, because 
 the basic problem is just so inherently parallelizable.
 
 --bb

I think it was me -- thanks for the pointer, I may have a look at this. Kevin
Jan 26 2007