
digitalmars.D - D slower than C++ by a factor of _two_ for simple raytracer (gdc)

reply downs <default_357-line yahoo.de> writes:
My platform is GDC 4.1.2 vs G++ 4.1.1.

I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++).

During this, I found a nice optimization that brought my D code down to 17s,
within less than a second of C++!

"Glee" I thought!

Then I applied the same optimization to the C++ source and it dropped to 8s.

I haven't been able to get the D code even close to this new speed level.

The outputs of both programs are identical save for off-by-one differences.

The source code for the C++ version is http://paste.dprogramming.com/dpvpm7jv

D version is http://paste.dprogramming.com/dpzal0jd

Before you ask, yes I've tried turning the structs into classes, the classes
into structs and the refs into pointers. That usually made no difference, or
worsened it.

Both programs were built with -O3 -ffast-math, the D version additionally with
-frelease.
Both compilers were built with roughly similar configure flags. The GDC used is
the latest available in SVN, and based on DMD 1.022.

Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Ideas appreciated,

 --downs
Feb 14 2008
next sibling parent Daniel Lewis <murpsoft hotmail.com> writes:
downs Wrote:

 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++). [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Don't have a GC, or statically load all of Phobos just to do simple raytracing?
Feb 14 2008
prev sibling next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Well, I'm on Windows, but comparing DMC and DMD on that code, DMD is 
slightly faster.  I know that gdc isn't really optimizing everything yet...

That said, cl (v15) beats dmc and dmd, taking about 60% of the time, but this 
has less to do with the language itself.

I wonder how gcc and dmd compare here...

-[Unknown]


downs wrote:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++). [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Feb 14 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't
that far from DMD ones; I don't know why. I'd like GDC to emit C++ code (later
to be compiled by GCC) so I can see the spots where it emits slow-looking C++
code.

DMD isn't much good at inlining, etc., so your methods are probably all
function calls, struct methods too.

If you translate your D raytracer to Java 6 with HotSpot, you will probably
find that your D code is 20-50% slower than the Java one, despite the Java one
being a bit higher level :-) (Thanks to HotSpot and the GC.)

If you can stand the ugliness, you can probably reduce your running time by
10-15% by using my TinyVector structs instead of your Vec struct; you can find
them in my d libs (V.2.70 at the moment, their development is going well):
http://www.fantascienza.net/leonardo/so/libs_d.zip . That TinyVector comes
from extensive testing of mine. Adapting your raytracer to TinyVector will
probably take 10-20 minutes, and it's not too difficult. The result will be
ugly...

Bye,
bearophile
Feb 15 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile>If you can stand the ugliness, you can probably reduce your running
time by 10-15% using my TinyVector structs instead of your Vec struct,<

Note that I expect such a speedup on DMD, where I developed them. I don't know
what the outcome is on GDC (which you are using).

Bye,
bearophile
Feb 15 2008
prev sibling next sibling parent reply Marius Muja <mariusm cs.ubc.ca> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. [...]

In my experience GDC code is faster than DMD code (in some cases significantly faster).

Feb 15 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Marius Muja Wrote:
 In my experience GDC code is faster than DMD code (in some cases significantly
faster).

My experience is similar to the results you can see here, that is, about the
same on average, better for some things, worse for others:

http://shootout.alioth.debian.org/sandbox/benchmark.php?test=all&lang=gdc

(I was using GDC based on MinGW based on GCC 3.2. You can find a good newer
MinGW here: http://nuwen.net/mingw.html but I don't know if it works with GDC.)

Note for downs: have you tried the -fprofile-generate/-fprofile-use flags for
the C++ code? They improve the C++ raytracer's speed somewhat.

Bye,
bearophile
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
bearophile wrote:
 Note for downs: have you tried -fprofile-generate/-fprofile-use flags for the
C++ code? They improve the C++ raytracer speed some.
 
 Bye,
 bearophile

My point is not in making GDC's crushing defeat even crushinger :)

But thanks for the advice, anyway.

 --downs
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. [...] DMD isn't much good at inlining, etc, so probably your methods are all function calls, struct methods too. [...]

The weird thing is: even if I inline the one spot where gdc ignores its
opportunity to inline a function, so that I have the _same_ call counts as G++
(as measured with -g -pg), even then, the D code is slower. So it doesn't
depend on missing inlining opportunities. Or am I missing something?

 --downs

PS: for reference, the missing bit is GDC not always inlining
Sphere::ray_sphere. If you look, it's only ever called for cases where the
final type is obvious.
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 bearophile wrote:
 If you can stand the ugliness, you can probably reduce your running time by
10-15% using my TinyVector structs instead of your Vec struct. [...]

 Bye,
 bearophile


To clarify: I know I can get the D code to be as fast as the C++ code if I
optimize it more, or use custom structs, etc. That's not the point. The point
is getting a comparison of C++ and D using equivalent code.

But, again, thanks for the advice.
Feb 15 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 The weird thing is: even if I inline the one spot where gdc ignores
 its opportunity to inline a function, so that I have the _same_
 call-counts as G++ (as measured with -g -pg), even then, the D code
 is slower. So it doesn't depend on missing inlining opportunities. Or
 am I missing something?

It's often worthwhile to run obj2asm on the output of each, and compare.
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
Another interesting observation.

If I change all my opFoo's to opFooAssign's, and use those instead, the running
time drops from 16s to 13s, indicating that returning large structs (12
bytes/vector) causes a significant speed hit. Still not close to the C++
version, though. The weird thing is that all those ops have been inlined (or so
says the assembler dump). Weird.

 --downs
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Another interesting observation. [...]
 
  --downs

Feb 15 2008
prev sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Another interesting observation.
 
 If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.
 
  --downs

Yeah, I was about to say the same. See here:

http://paste.dprogramming.com/dpolmzhw

It's ugly, but no struct returning. On my machine it's about a second slower
than g++ (8.9s vs. 7.8s), compiled via:

gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease
-finline-functions

and

g++ -O3 -fomit-frame-pointer -fweb -finline-functions

There are probably some other optimizations that could be made. But really I
think this comes down to the compiler not being as mature. The stuff that I
did should all be done by an optimizing compiler. You're basically tricking
the compiler into moving fewer bits around.

Tim.
Feb 15 2008
next sibling parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 Another interesting observation. [...]

 Yeah, I was about to say the same. See here:
http://paste.dprogramming.com/dpolmzhw
 It's ugly, but no struct returning. On my machine it's about a second slower
than g++ (8.9s vs. 7.8s), compiled via:

 gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease
-finline-functions

 and

 g++ -O3 -fomit-frame-pointer -fweb -finline-functions [...]

But even using your compiler flags, I'm still looking at 12.8s (D) vs 8.1s
(C++) .. 11.4s (D) vs 7.8s (C++) using -march=nocona.

:ten minutes later:

... Okay, now I'm confused. Your program is three seconds faster than my
op*Assign version. Is there a generic problem with operator overloading?

I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?

Ah well. Let's hope LLVMDC does a better job .. someday.

 --downs
Feb 15 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:fp4593$1kko$1 digitalmars.com...

 I rewrote my version for freestanding functions .. 9.5s :confused: Why do 
 struct members (which are inlined, I checked) take such a speed hit?

I think other people have come to this bizarre realization as well. It really doesn't make any sense. Have you compared the assembly of calling a struct member function and calling a free function?
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I ran a comparison of struct vector methods vs freestanding, and the GDC
generated assembler code is precisely identical.

Here's my test source

struct foo {
  double x, y, z;
  void opAddAssign(ref foo bar) {
    x += bar.x; y += bar.y; z += bar.z;
  }
}

void foo_add(ref foo bar, ref foo baz) {
  baz.x += bar.x; baz.y += bar.y; baz.z += bar.z;
}

// prevents overzealous optimization
// really just returns 0, 0, 0
extern(C) foo complex_external_function();

import std.stdio;
void main() {
  foo a = complex_external_function(), b = complex_external_function();
  asm { int 3; }
  a += b;
  asm { int 3; }
  foo c = complex_external_function(), d = complex_external_function();
  asm { int 3; }
  foo_add(d, c);
  asm { int 3; }
  writefln(a, b, c, d);
}

And here are the relevant two bits of assembler.


#APP
	int	$3
#NO_APP
	fldl	-120(%ebp)
	faddl	-96(%ebp)
	fstpl	-120(%ebp)
	fldl	-112(%ebp)
	faddl	-88(%ebp)
	fstpl	-112(%ebp)
	fldl	-104(%ebp)
	faddl	-80(%ebp)
	fstpl	-104(%ebp)



#APP
	int	$3
#NO_APP
	fldl	-72(%ebp)
	faddl	-48(%ebp)
	fstpl	-72(%ebp)
	fldl	-64(%ebp)
	faddl	-40(%ebp)
	fstpl	-64(%ebp)
	fldl	-56(%ebp)
	faddl	-32(%ebp)
	fstpl	-56(%ebp)

No difference. But then why the obvious speed difference? Color me confused ._.

 --downs
Feb 15 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 No difference. But then why the obvious speed difference? Color me confused ._.

Test to see if the stack is aligned, i.e. if the doubles start on 16 byte address boundaries.
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?
 

My version had a bug. x__X

The correct version takes 11.2s again.

 --downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?

My version had a bug. x__X The correct version takes 11.2s again. --downs

If I fix the bug, the 'external function' version is exactly as fast as the
opFoo version. Sorry.

I think the 8s version posted earlier has a similar bug. Look at the output. :)

 -- downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I've been playing around with the 8-9s version posted earlier.

The problem seems to lie in ray_sphere.

Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
but Vec v = center - ray.orig; runs in 11.1s.

Still investigating why this happens.

 --downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I've been playing around with the 8-9s version posted earlier.
 
 The problem seems to lie in ray_sphere.
 
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
 
 Still investigating why this happens.
 
  --downs

Okay, found the cause, if not the reason, by looking at the assembler output.

For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them.

Here's the disassembly for ray_sphere for both cases:

slow (opSub): http://paste.dprogramming.com/dpcds3p3

fast: http://paste.dprogramming.com/dpd6pi8n

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

 --downs
Feb 15 2008
next sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.

For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them.

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

Hey, good deal on figuring this out! It's good to know, especially for those
of us using D for real-time simulation type stuff.

Is there really a GDC that compiles against gcc >= 4.2?!
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch? [...]

 Hey, good deal on figuring this out! [...] Is there really a GDC that compiles
against gcc >= 4.2?!

I'm not sure; I remember somebody saying he'd managed to build it. And there's
a post on d.gnu from somebody saying he'd gotten it to work, although he
couldn't build phobos.

Since GDC seems to be .. inert at the moment, it'd probably be up to some
volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. I
myself am mostly clueless about both compilers, of course. :/

 --downs
Feb 15 2008
parent Tim Burrell <tim timburrell.net> writes:
downs wrote:
 I'm not sure; I remember somebody saying he'd managed to build it. [...]
Since GDC seems to be .. inert at the moment, it'd probably be up to some
volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. [...]

I notice that the Ubuntu team appears to have a working 4.2-based gdc that the
changelog also says works with 4.3:

http://packages.ubuntu.com/hardy/devel/gdc-4.2

Changelog is here:

http://changelogs.ubuntu.com/changelogs/pool/universe/g/gdc-4.2/gdc-4.2_0.25-4.2.3-0ubuntu1/changelog

It'd be really nice to see a new gdc release! I wonder if David even knows
about these patches!?
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 Here's the disassembly for ray_sphere for both cases:
 
 slow (opSub)
 
 http://paste.dprogramming.com/dpcds3p3
 
 fast
 
 http://paste.dprogramming.com/dpd6pi8n
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?
 
  --downs

Especially interesting to note (slow case):

	fstpl	-24(%ebp)
[...]
	movl	-24(%ebp), %eax
	movl	%eax, -48(%ebp)
	movl	-20(%ebp), %eax
	movl	%eax, -44(%ebp)

Translation: store a floating-point number to ebp[-24]. No, wait, move it to
ebp[-48].

This indicates a pretty serious problem with optimization, since the whole
thing is basically redundant. The "fast" version doesn't have any memory
writes at all during the computation.

 --downs
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Especially interesting to note (slow case):
 
     fstpl    -24(%ebp)
 [...]
     movl    -24(%ebp), %eax
     movl    %eax, -48(%ebp)
     movl    -20(%ebp), %eax
     movl    %eax, -44(%ebp)
 
 Translation:
 	Store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].

I left something out.

	fstpl	-24(%ebp)
[...]
	movl	-24(%ebp), %eax
	movl	%eax, -48(%ebp)
	movl	-20(%ebp), %eax
	movl	%eax, -44(%ebp)
[...]
	fldl	-48(%ebp)

So the whole thing comes down to: "Store FP number to memory. No wait, move it
somewhere else! No wait, read it back!"

No wonder it's slow.
Feb 15 2008
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
downs <default_357-line yahoo.de> wrote:
 For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them. [...]
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

I'm trying to investigate this issue, too. I'm comparing the C++ code
generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and DMD
1.020. Here's the commented comparison of the unitise() function:

http://paste.dprogramming.com/dpl9p4pt

As you can see, the code is very close. But the static opCall() which
initializes the by-value return struct is not inlined, and therefore not
optimized out. So there is an additional call and extra copying of
already-calculated values. If not for that, the code would be nearly identical.

-- SnakE
Feb 16 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sergey Gromov <snake.scaly gmail.com> wrote:
 I'm trying to investigate this issue, too.  I'm comparing the C++ code 
 generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and 
 DMD 1.020.  Here's the commented out comparison of unitise() function:

Continuing investigation. Here are raw results:

> make-cpp-gcc.cmd
gcc ray-cpp.o -o ray-cpp.exe -lstdc++
> test-cpp.cmd
10968

> make-d.cmd
d.d
gdc ray-d.o -o ray-d.exe
> test-d.cmd
10828

The numbers printed by the tests are milliseconds. As you can see, the D
version is slightly faster. The outputs are identical.

The C++ and D programs are here, respectively:

http://paste.dprogramming.com/dpaftqa2
http://paste.dprogramming.com/dptiniar

The only change in C++ is the time output at the end of main(). The D program
is refactored so that all struct manipulations happen in place, without
passing and returning by value. GDC has trouble inlining static opCalls for
some reason.

Microsoft's compiler produces FP/math code about 25% shorter than GCC/GDC on
average, hence the results:

> make-cpp.cmd
> test-cpp.cmd
7656

-- SnakE
Feb 16 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.

Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).
 Microsoft's compiler produces FP/math code about 25% shorter than 
 GCC/GDC in average

Nice. Thank you for your experiments.

Timings of your code (which has a bug, see downs for a fixed version) on Win,
Pentium 3, best of 3 runs, image 256x256:

D DMD v.1.025:
bud -clean -O -release -inline rayD.d
15.8 seconds (memory deallocation too)

C++ MinGW based on GCC 4.2.1:
g++ -O3 -s rayCpp.cpp -o rayCpp0
9.42 s (memory deallocation too)

C++ MinGW (the same):
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer rayCpp.cpp -o rayCpp1
8.89 s (memory deallocation too)

C++ MinGW (the same):
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-generate rayCpp.cpp -o rayCpp2
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-use rayCpp.cpp -o rayCpp2
8.72 s (memory deallocation too)

I haven't tried GDC yet.

Bye,
bearophile
Feb 16 2008
parent Sergey Gromov <snake.scaly gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:
 Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.

Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).

One of programmer's joys is to invent a wheel and pretend it's better than the others. ;)
 Timings of your code (that has a bug, see downs for a fixed version) on 

The only bug I can see is printing out characters through text-mode Windows
stdout, which expands every 0xA into "\r\n". This doesn't have any impact on
the benchmark.

-- SnakE
Feb 16 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:
If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit.<

Tim Burrell:
 Yeah, I was about to say the same.  See here:

Yep, see my TinyVector ;-)

Bye,
bearophile
Feb 15 2008
prev sibling parent downs <default_357-line yahoo.de> writes:
Another other observation: GDC's std.math functions still aren't being inlined
properly, forcing me to use the intrinsics manually.

That didn't cause the speed difference though.

Still, it would be nice to see it fixed some time soon, seeing as I filed the
bug in November :)

 --downs
Feb 15 2008
prev sibling parent "Saaa" <empty needmail.com> writes:
With a little bit of commenting, this could be an excellent tutorial. 
Feb 15 2008