digitalmars.D - Optimizing a raytracer


"Róbert László Páli" writes:

Hello!

I am writing an unbiased ray-tracing renderer in D. I have made good 
progress, but I want to make it as fast as possible wherever I can 
do so without compromises.

I use a struct with three doubles for vector and color 
calculations and I have operator overloading for them. Many 
vectors and colors are created during the tracing calculations.
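A minimal sketch of the kind of struct described above, assuming three `double` fields named `x`, `y`, `z` (these names are illustrative, not taken from the actual renderer). Because it is a struct, temporaries such as `a + b` are plain values on the stack or in registers and never go through the GC.

```d
// Illustrative sketch: type and field names are assumptions, not the
// original renderer's code.
struct Vec3
{
    double x = 0, y = 0, z = 0;

    // Component-wise a + b, a - b and a * b between two vectors.
    Vec3 opBinary(string op)(const Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        return Vec3(mixin("x " ~ op ~ " rhs.x"),
                    mixin("y " ~ op ~ " rhs.y"),
                    mixin("z " ~ op ~ " rhs.z"));
    }

    // Scaling by a scalar: v * s and v / s.
    Vec3 opBinary(string op)(double s) const
        if (op == "*" || op == "/")
    {
        return Vec3(mixin("x " ~ op ~ " s"),
                    mixin("y " ~ op ~ " s"),
                    mixin("z " ~ op ~ " s"));
    }
}
```

An expression such as `a + b * 0.5` then only creates short-lived `Vec3` values, with no heap allocation.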

I thought that using classes might require too much memory, because 
class instances are not destroyed at the end of their scope, and they 
might also cause slowdowns when the GC kicks in.

Is my assumption correct that structs are the wiser choice in this case?

To avoid constructing many vectors and colors, I thought of using 
ref arguments, but I have also heard that ref functions are not 
inlined. What would generate the fastest code for a cross product, 
for example?
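For the cross-product question, the two obvious signatures are sketched below: pass by value and pass by `ref const`. The names `cross` and `crossRef` are hypothetical. Whether either version gets inlined depends on the compiler and its flags, so the reliable answer is to benchmark both inside the real tracing loop.

```d
struct Vec3 { double x = 0, y = 0, z = 0; }

// By value: both 24-byte arguments are copied into the function.
Vec3 cross(const Vec3 a, const Vec3 b) pure nothrow @nogc @safe
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}

// By reference: no argument copies, but by default `ref` parameters only
// bind to lvalues, so crossRef(cross(a, b), c) would not compile.
Vec3 crossRef(ref const Vec3 a, ref const Vec3 b) pure nothrow @nogc @safe
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}
```

The by-value form is more convenient because it accepts temporaries directly; the `ref` form is only worth keeping if measurements show a gain.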

What compiler and compilation flags should I use to generate the 
fastest code? My main target is 64-bit machines, cross-platform. 
What optimizations can I assume from the various compilers? Are 
local variables that are used only once inlined? In other words, is 
it safe to extract local variables purely to make the code easier 
to understand?

Thanks in advance!
Róbert László Páli

Oct 16 2013

Jacob Carlborg <doob me.com> writes:

On 2013-10-16 14:02, "Róbert László Páli" wrote:
 I thought that using classes might require too much memory, because
 class instances are not destroyed at the end of their scope, and they
 might also cause slowdowns when the GC kicks in.

 Is my assumption correct that structs are the wiser choice in this case?

 What compiler and compilation flags should I use to generate the
 fastest code? My main target is 64-bit machines, cross-platform.

I would say use structs. For the compiler, I would go with LDC or GDC. 
Both of these are faster for floating-point calculations than DMD. You 
can always benchmark.
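Besides timing the whole program with `time`, hot paths can also be timed from inside the program. A minimal sketch using Phobos' `std.datetime.stopwatch` module (added to Phobos after this thread was written); `bestOf` is a hypothetical helper name.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

// Hypothetical helper: runs `fn` `runs` times and returns the fastest
// wall-clock time, which is less noisy than a single measurement.
Duration bestOf(scope void delegate() fn, size_t runs)
{
    Duration best = Duration.max;
    foreach (_; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        fn();
        sw.stop();
        if (sw.peek() < best)
            best = sw.peek();
    }
    return best;
}
```

Building the same source once with DMD and once with LDC or GDC and comparing `bestOf` results isolates the hot loop from compile time and startup costs.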

-- 
/Jacob Carlborg

Oct 16 2013

"finalpatch" <fengli gmail.com> writes:

I find it critical to ensure that all loops are unrolled in basic 
vector ops (copy/arithmetic/dot etc.).
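For fixed-size vectors, one way to guarantee the unrolling described above is `static foreach`, which expands the loop body once per index at compile time. Note that `static foreach` was added in D 2.076, well after this thread; the `dot` template below is an illustrative sketch.

```d
// Illustrative dot product for fixed-size vectors; N is deduced from
// the static array type, and the loop is unrolled at compile time.
double dot(size_t N)(const double[N] a, const double[N] b)
{
    double sum = 0;
    static foreach (i; 0 .. N)
        sum += a[i] * b[i];   // emitted N times, each with a constant index
    return sum;
}
```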

On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli 
wrote:
 Hello!

 I am writing an unbiased ray-tracing renderer in D. I have made good 
 progress, but I want to make it as fast as possible wherever I can 
 do so without compromises.

Oct 16 2013

"ponce" <contact gam3sfrommars.fr> writes:

On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli 
wrote:
 I thought that using classes might require too much memory, because 
 class instances are not destroyed at the end of their scope, and they 
 might also cause slowdowns when the GC kicks in.

 Is my assumption correct that structs are the wiser choice in this case?

Yes, by all means use struct.


 What would generate the fastest code for a cross-product for 
 example?

If you are on x86, SSE 4.1 introduced an instruction called DPPS 
which performs a dot product. Maybe you can force it into doing a 
cross-product with clever swizzles and masks.
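A related, well-known swizzle trick for the cross product uses lane rotations rather than DPPS: cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx. The sketch below writes it with scalars so that it runs anywhere; in a 4-lane SSE register, each rotation can be done with a single shuffle instruction. All names here are hypothetical.

```d
struct Vec3 { double x = 0, y = 0, z = 0; }

// Lane rotations (swizzles); with SIMD each one maps to a single shuffle.
Vec3 yzx(const Vec3 v) { return Vec3(v.y, v.z, v.x); }
Vec3 zxy(const Vec3 v) { return Vec3(v.z, v.x, v.y); }

// Component-wise helpers standing in for SIMD multiply and subtract.
Vec3 mul(const Vec3 a, const Vec3 b) { return Vec3(a.x * b.x, a.y * b.y, a.z * b.z); }
Vec3 sub(const Vec3 a, const Vec3 b) { return Vec3(a.x - b.x, a.y - b.y, a.z - b.z); }

// cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx
Vec3 crossSwizzle(const Vec3 a, const Vec3 b)
{
    return sub(mul(yzx(a), zxy(b)), mul(zxy(a), yzx(b)));
}
```

Whether the shuffle version beats plain scalar code depends on the surrounding code, so it needs measuring.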

Oct 16 2013

"Róbert László Páli" writes:

 Jacob Carlborg
 I would say use structs. For the compiler, I would go with LDC or 
 GDC. Both of these are faster for floating-point calculations 
 than DMD. You can always benchmark.

Thank you for the advice!
I installed LDC and used ldmd2.
The benchmarks are amazing! :O

        compile (ms)   run (ms)
DMD     2503           26210
LDMD    3953            8935

These are in milliseconds,
measured with the time command.
Both were compiled with the same flags:
-O -inline -release -noboundscheck

 finalpatch
 I find it critical to ensure that all loops are unrolled in basic 
 vector ops (copy/arithmetic/dot etc.).

In these crucial parts I don't use loops;
I wrote these operations out by hand. They
are just three named doubles.
But thanks for the advice.

 ponce
 If you are on x86, SSE 4.1 introduced an instruction called 
 DPPS which performs a dot product. Maybe you can force it into 
 doing a cross-product with clever swizzles and masks.

Could you give me a hint on how it could
be implemented in D to use that dot product?
I am not experienced with such low-level programming.

And would you suggest trying
SIMD double4 for 3D vectors? It would
take some time to change the code.

Oct 17 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Róbert László Páli:

 And would you suggest to try to use
 SIMD double4 for 3D vectors? It would
 take some time to change code.

Using a double4 could improve the performance of your code, but 
it must be used wisely. (One general tip is to avoid mixing SIMD 
and serial code. If you want to use SIMD code, then it's often 
better to keep using SIMD registers even if you have only one value.)
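Here is a sketch of that advice using `core.simd`. The vector stays in one SIMD register, the arithmetic happens there, and the code drops to scalar only for the final horizontal sum. It uses `float4` rather than `double4` because, with DMD, 256-bit vector types such as `double4` require AVX (`-mcpu=avx`). `Vec3S` and `dot3` are hypothetical names, and lane 3 is assumed to be zero padding.

```d
import core.simd;

// Sketch: a 3D vector held in one SSE register; lane 3 is unused padding.
struct Vec3S
{
    float4 v;

    // Each operator compiles to a single packed SIMD instruction.
    Vec3S opBinary(string op)(const Vec3S rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        return Vec3S(mixin("v " ~ op ~ " rhs.v"));
    }
}

// Multiply in SIMD, then sum the three meaningful lanes as scalars.
float dot3(const Vec3S a, const Vec3S b)
{
    float4 p = a.v * b.v;
    return p.array[0] + p.array[1] + p.array[2];
}
```

Chains such as `(a + b) * c` stay in SIMD registers the whole time, which is the point of the tip above. Only `dot3` mixes in serial code.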

Bye,
bearophile

Oct 17 2013

"Róbert László Páli" writes:

 Using a double4 could improve the performance of your code, but 
 it must be used wisely. (One general tip is to avoid mixing SIMD 
 and serial code. If you want to use SIMD code, then it's often 
 better to keep using SIMD registers even if you have one value).

Sadly, I could not get it to work properly, but the performance
seems good so far. Theoretically, I would only need to adjust the
Vector struct and its operations (a small layer of the code; the rest
uses only the Vector type and its operations, not its internals).

In case you are interested:
http://palaes.rudanium.org/SubSpace/render.php

Mar 26 2014

"Róbert László Páli" writes:

Oh, thanks for all of your help. It's nice
to see that the D folks really do help. :)

Mar 26 2014

"Bienlein" <jeti789 web.de> writes:

You can also achieve significant speed-ups by doing things in 
parallel; see, e.g., 
https://groups.google.com/forum/?hl=de#!searchin/golang-nuts/ray$20tracer/golang-nuts/mxYzHQSV3rw/dOA78aeVLgEJ
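The linked discussion is about Go, but D's standard library offers the same idea through `std.parallelism`. Below is a minimal sketch that renders image rows in parallel on the default task pool. `tracePixel` is a hypothetical placeholder for the real per-pixel tracer. Each row is written by exactly one work unit, so no locking is needed.

```d
import std.parallelism : parallel;

// Hypothetical placeholder for the real per-pixel tracing function.
double tracePixel(size_t x, size_t y)
{
    return (x + y) % 2;   // checkerboard stand-in for real shading
}

// Renders an image by distributing rows across the default task pool.
double[][] render(size_t width, size_t height)
{
    auto image = new double[][](height, width);
    foreach (y, row; parallel(image))   // rows may run on different workers
    {
        foreach (x, ref px; row)
            px = tracePixel(x, y);      // writes go into the shared row memory
    }
    return image;
}
```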

Mar 26 2014

"Róbert László Páli" writes:

Thanks! I already trace the samples in parallel.
Strangely, I have a Core 2 Duo, and using
3 threads seems to be best (slightly better than 2), although
this might be accidental. Maybe the more complex
samples are spread more evenly across separate threads.
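To compare thread counts like the 2 vs. 3 observation above, the pool size can be set explicitly instead of relying on the default pool. Below is a sketch built around a hypothetical `traceWithWorkers` function, with a placeholder where real sample tracing would go. `totalCPUs` from the same module reports the core count if you want a starting point.

```d
import std.parallelism : TaskPool;

// Hypothetical sketch: fills `samples` using a pool with an explicit number
// of worker threads, so different thread counts can be benchmarked.
void traceWithWorkers(double[] samples, size_t nWorkers)
{
    auto pool = new TaskPool(nWorkers);
    scope (exit) pool.finish(true);            // block until workers exit
    foreach (i, ref s; pool.parallel(samples, 1))  // work unit size of 1
        s = i * 0.5;                           // placeholder for tracing sample i
}
```

A work unit size of 1 lets fast and slow samples balance out across threads. Larger units reduce scheduling overhead, so this is worth tuning together with the worker count.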

Mar 26 2014
