
digitalmars.D - D slower than C++ by a factor of _two_ for simple raytracer (gdc)

reply downs <default_357-line yahoo.de> writes:
My platform is GDC 4.1.2 vs G++ 4.1.1.

I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++).

During this, I found a nice optimization that brought my D code down to 17s,
within less than a second of C++!

"Glee" I thought!

Then I applied the same optimization to the C++ source and it dropped to 8s.

I haven't been able to get the D code even close to this new speed level.

The outputs of both programs are identical save for off-by-one differences.

The source code for the C++ version is http://paste.dprogramming.com/dpvpm7jv

D version is http://paste.dprogramming.com/dpzal0jd

Before you ask, yes I've tried turning the structs into classes, the classes
into structs and the refs into pointers. That usually made no difference, or
worsened it.

Both programs were built with -O3 -ffast-math, the D version additionally with
-frelease.
Both compilers were built with roughly similar configure flags. The GDC used is
the latest available in SVN, and based on DMD 1.022.

Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Ideas appreciated,

 --downs
Feb 14 2008
next sibling parent Daniel Lewis <murpsoft hotmail.com> writes:
downs Wrote:

 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++). [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Don't have a GC, or statically load all of Phobos just to do simple raytracing?
Feb 14 2008
prev sibling next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Well, I'm on Windows, but comparing DMC and DMD on that code, DMD is 
slightly faster.  I know that gdc isn't really optimizing everything yet...

That said, cl (v15) beats dmc and dmd, taking about 60% of the time, but this 
has less to do with the language itself.

I wonder how gcc and dmd compare here...

-[Unknown]


downs wrote:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++). [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Feb 14 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't
that far from DMD ones; I don't know why. I'd like GDC to emit C++ code (later
to be compiled by GCC) so I can see the spots where it emits slow-looking C++
code.

DMD isn't much good at inlining, etc., so your methods are probably all
function calls, struct methods too.

If you translate your D raytracer to Java 6 with HotSpot, you will probably
find that your D code is 20-50% slower than the Java one, despite the Java one
being a bit higher level :-) (Thanks to HotSpot and the GC.)

If you can stand the ugliness, you can probably reduce your running time by
10-15% by using my TinyVector structs instead of your Vec struct; you can find
them in my d libs (V.2.70 at the moment, their development is going well):
http://www.fantascienza.net/leonardo/so/libs_d.zip . That TinyVector comes
from extensive testing of mine. Adapting your raytracer to TinyVector will
probably take 10-20 minutes, and it's not too difficult. The result will be
ugly...

Bye,
bearophile
Feb 15 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile>If you can stand the ugliness, you can probably reduce your running
time by 10-15% using my TinyVector structs instead of your Vec struct,<

Note that I expect such a speedup on DMD, where I developed them. I don't know
what the outcome is on GDC (which you are using).

Bye,
bearophile
Feb 15 2008
prev sibling next sibling parent reply Marius Muja <mariusm cs.ubc.ca> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. [...]

In my experience GDC code is faster than DMD code (in some cases significantly faster).

Feb 15 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Marius Muja Wrote:
 In my experience GDC code is faster than DMD code (in some cases significantly
faster).

My experience is similar to the results you can see here, that is, about the
same on average, better for some things, worse for others:

http://shootout.alioth.debian.org/sandbox/benchmark.php?test=all&lang=gdc

(I was using GDC based on MinGW based on GCC 3.2. You can find a good newer
MinGW here: http://nuwen.net/mingw.html but I don't know if it works with GDC.)

Note for downs: have you tried the -fprofile-generate/-fprofile-use flags for
the C++ code? They improve the C++ raytracer's speed somewhat.

Bye,
bearophile
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
bearophile wrote:
 Note for downs: have you tried -fprofile-generate/-fprofile-use flags for the
C++ code? They improve the C++ raytracer speed some.
 
 Bye,
 bearophile

My point is not in making GDC's crushing defeat even crushinger :)

But thanks for the advice, anyway.

 --downs
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.

DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. [...] DMD isn't much good at inlining, etc, so probably your methods are all function calls, struct methods too. [...]

The weird thing is: even if I inline the one spot where gdc ignores its
opportunity to inline a function, so that I have the _same_ call counts as G++
(as measured with -g -pg), even then, the D code is slower. So it doesn't
depend on missing inlining opportunities. Or am I missing something?

 --downs

PS: for reference, the missing bit is GDC not always inlining
Sphere::ray_sphere. If you look, it's only ever called for cases where the
final type is obvious.
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 bearophile wrote:
 If you can stand the ugliness, you can probably reduce your running time by
10-15% using my TinyVector structs instead of your Vec struct. [...]

 Bye,
 bearophile


To clarify: I know I can get the D code to be as fast as the C++ code if I
optimize it more, or use custom structs, etc. That's not the point. The point
is getting a comparison of C++ and D using equivalent code.

But, again, thanks for the advice.
Feb 15 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 The weird thing is: even if I inline the one spot where gdc ignores
 its opportunity to inline a function, so that I have the _same_
 call-counts as G++ (as measured with -g -pg), even then, the D code
 is slower. So it doesn't depend on missing inlining opportunities. Or
 am I missing something?

It's often worthwhile to run obj2asm on the output of each, and compare.
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
Another interesting observation.

If I change all my opFoo's to opFooAssign's, and use those instead, the running
time drops from 16s to 13s, indicating that returning large structs (12
bytes/vector) causes a significant speed hit. Still not close to the C++
version, though. The weird thing is that all those ops have been inlined (or so
says the assembler dump). Weird.

 --downs
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Another interesting observation. [...]
 
  --downs

Feb 15 2008
prev sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Another interesting observation.
 
 If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.
 
  --downs

Yeah, I was about to say the same. See here:

http://paste.dprogramming.com/dpolmzhw

It's ugly, but no struct returning. On my machine it's about a second slower
than g++ (8.9s vs. 7.8s), compiled via:

gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease
-finline-functions

and

g++ -O3 -fomit-frame-pointer -fweb -finline-functions

There are probably some other optimizations that could be made. But really I
think this comes down to the compiler not being as mature. The stuff that I
did should all be done by an optimizing compiler. You're basically tricking
the compiler into moving fewer bits around.

Tim.
Feb 15 2008
next sibling parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 Another interesting observation. [...]

 Yeah, I was about to say the same. See here:
http://paste.dprogramming.com/dpolmzhw
 It's ugly, but no struct returning. On my machine it's about a second slower
than g++ (8.9s vs. 7.8s), compiled via:

 gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease
-finline-functions

 and

 g++ -O3 -fomit-frame-pointer -fweb -finline-functions [...]

But even using your compiler flags, I'm still looking at 12.8s (D) vs 8.1s
(C++) .. 11.4s (D) vs 7.8s (C++) using -march=nocona.

:ten minutes later:

... Okay, now I'm confused. Your program is three seconds faster than my
op*Assign version. Is there a generic problem with operator overloading?

I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?

Ah well. Let's hope LLVMDC does a better job .. someday.

 --downs
Feb 15 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:fp4593$1kko$1 digitalmars.com...

 I rewrote my version for freestanding functions .. 9.5s :confused: Why do 
 struct members (which are inlined, I checked) take such a speed hit?

I think other people have come to this bizarre realization as well. It really doesn't make any sense. Have you compared the assembly of calling a struct member function and calling a free function?
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I ran a comparison of struct vector methods vs freestanding, and the GDC
generated assembler code is precisely identical.

Here's my test source

struct foo {
  double x, y, z;
  void opAddAssign(ref foo bar) {
    x += bar.x; y += bar.y; z += bar.z;
  }
}

void foo_add(ref foo bar, ref foo baz) {
  baz.x += bar.x; baz.y += bar.y; baz.z += bar.z;
}

// prevents overzealous optimization
// really just returns 0, 0, 0
extern(C) foo complex_external_function();

import std.stdio;
void main() {
  foo a = complex_external_function(), b = complex_external_function();
  asm { int 3; }
  a += b;
  asm { int 3; }
  foo c = complex_external_function(), d = complex_external_function();
  asm { int 3; }
  foo_add(d, c);
  asm { int 3; }
  writefln(a, b, c, d);
}

And here are the relevant two bits of assembler.


#APP
	int	$3
#NO_APP
	fldl	-120(%ebp)
	faddl	-96(%ebp)
	fstpl	-120(%ebp)
	fldl	-112(%ebp)
	faddl	-88(%ebp)
	fstpl	-112(%ebp)
	fldl	-104(%ebp)
	faddl	-80(%ebp)
	fstpl	-104(%ebp)



#APP
	int	$3
#NO_APP
	fldl	-72(%ebp)
	faddl	-48(%ebp)
	fstpl	-72(%ebp)
	fldl	-64(%ebp)
	faddl	-40(%ebp)
	fstpl	-64(%ebp)
	fldl	-56(%ebp)
	faddl	-32(%ebp)
	fstpl	-56(%ebp)

No difference. But then why the obvious speed difference? Color me confused ._.

 --downs
Feb 15 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 No difference. But then why the obvious speed difference? Color me confused ._.

Test to see if the stack is aligned, i.e. if the doubles start on 16 byte address boundaries.
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?
 

My version had a bug. x__X

The correct version takes 11.2s again.

 --downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?

My version had a bug. x__X The correct version takes 11.2s again. --downs

If I fix the bug, the 'external function' version is exactly as fast as the
opFoo version. Sorry.

I think the 8s version posted earlier has a similar bug. Look at the output. :)

 -- downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I've been playing around with the 8-9s version posted earlier.

The problem seems to lie in ray_sphere.

Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
but Vec v = center - ray.orig; runs in 11.1s.

Still investigating why this happens.

 --downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I've been playing around with the 8-9s version posted earlier.
 
 The problem seems to lie in ray_sphere.
 
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
 
 Still investigating why this happens.
 
  --downs

Okay, found the cause, if not the reason, by looking at the assembler output.

For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them.

Here's the disassembly for ray_sphere for both cases:

slow (opSub): http://paste.dprogramming.com/dpcds3p3

fast: http://paste.dprogramming.com/dpd6pi8n

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

 --downs
Feb 15 2008
next sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.

For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them.

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

Hey, good deal on figuring this out! It's good to know, especially for those
of us using D for real-time simulation type stuff.

Is there really a GDC that compiles against gcc >= 4.2?!
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch? [...]

 Hey, good deal on figuring this out! [...] Is there really a GDC that compiles
against gcc >= 4.2?!

I'm not sure; I remember somebody saying he'd managed to build it. And there's
a post on d.gnu from somebody saying he'd gotten it to work, although he
couldn't build phobos.

Since GDC seems to be .. inert at the moment, it'd probably be up to some
volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. I
myself am mostly clueless about both compilers, of course. :/

 --downs
Feb 15 2008
parent Tim Burrell <tim timburrell.net> writes:
downs wrote:
 I'm not sure; I remember somebody saying he'd managed to build it. [...]
Since GDC seems to be .. inert at the moment, it'd probably be up to some
volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. [...]

I notice that the Ubuntu team appears to have a working 4.2-based gdc that the
changelog also says works with 4.3:

http://packages.ubuntu.com/hardy/devel/gdc-4.2

Changelog is here:

http://changelogs.ubuntu.com/changelogs/pool/universe/g/gdc-4.2/gdc-4.2_0.25-4.2.3-0ubuntu1/changelog

It'd be really nice to see a new gdc release! I wonder if David even knows
about these patches!?
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 Here's the disassembly for ray_sphere for both cases:
 
 slow (opSub)
 
 http://paste.dprogramming.com/dpcds3p3
 
 fast
 
 http://paste.dprogramming.com/dpd6pi8n
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?
 
  --downs

Especially interesting to note (slow case):

	fstpl	-24(%ebp)
[...]
	movl	-24(%ebp), %eax
	movl	%eax, -48(%ebp)
	movl	-20(%ebp), %eax
	movl	%eax, -44(%ebp)

Translation: store a floating-point number to ebp[-24]. No, wait, move it to
ebp[-48].

This indicates a pretty serious problem with optimization, since the whole
thing is basically redundant. The "fast" version doesn't have any memory
writes at all during the computation.

 --downs
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Especially interesting to note (slow case):
 
     fstpl    -24(%ebp)
 [...]
     movl    -24(%ebp), %eax
     movl    %eax, -48(%ebp)
     movl    -20(%ebp), %eax
     movl    %eax, -44(%ebp)
 
 Translation:
 	Store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].

I left something out.

	fstpl	-24(%ebp)
[...]
	movl	-24(%ebp), %eax
	movl	%eax, -48(%ebp)
	movl	-20(%ebp), %eax
	movl	%eax, -44(%ebp)
[...]
	fldl	-48(%ebp)

So the whole thing comes down to: "Store FP number to memory. No wait, move it
somewhere else! No wait, read it back!"

No wonder it's slow.
Feb 15 2008
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
downs <default_357-line yahoo.de> wrote:
 For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them. [...]
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?

I'm trying to investigate this issue, too. I'm comparing the C++ code
generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and DMD
1.020. Here's the commented comparison of the unitise() function:

http://paste.dprogramming.com/dpl9p4pt

As you can see, the code is very close. But the static opCall() which
initializes the by-value return struct is not inlined, and therefore not
optimized out. So there is an additional call and extra copying of
already-calculated values. If not for that, the code would be nearly identical.

-- SnakE
Feb 16 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sergey Gromov <snake.scaly gmail.com> wrote:
 I'm trying to investigate this issue, too.  I'm comparing the C++ code 
 generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and 
 DMD 1.020.  Here's the commented out comparison of unitise() function:

Continuing investigation. Here are raw results:

> make-cpp-gcc.cmd
gcc ray-cpp.o -o ray-cpp.exe -lstdc++
> test-cpp.cmd
10968

> make-d.cmd
d.d
gdc ray-d.o -o ray-d.exe
> test-d.cmd
10828

The numbers printed by the tests are milliseconds. As you can see, the D
version is slightly faster. The outputs are identical.

The C++ and D programs are here, respectively:

http://paste.dprogramming.com/dpaftqa2
http://paste.dprogramming.com/dptiniar

The only change in C++ is the time output at the end of main(). The D program
is refactored so that all struct manipulations happen in place, without
passing and returning by value. GDC has trouble inlining static opCalls for
some reason.

Microsoft's compiler produces FP/math code about 25% shorter than GCC/GDC on
average, hence the results:

> make-cpp.cmd
> test-cpp.cmd
7656

-- SnakE
Feb 16 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.

Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).
 Microsoft's compiler produces FP/math code about 25% shorter than 
 GCC/GDC in average

Nice. Thank you for your experiments.

Timings of your code (which has a bug, see downs for a fixed version) on Win,
Pentium 3, best of 3 runs, image 256x256:

D DMD v.1.025:
bud -clean -O -release -inline rayD.d
15.8 seconds (memory deallocation too)

C++ MinGW based on GCC 4.2.1:
g++ -O3 -s rayCpp.cpp -o rayCpp0
9.42 s (memory deallocation too)

C++ MinGW (the same):
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer rayCpp.cpp -o rayCpp1
8.89 s (memory deallocation too)

C++ MinGW (the same):
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-generate rayCpp.cpp -o rayCpp2
g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-use rayCpp.cpp -o rayCpp2
8.72 s (memory deallocation too)

I haven't tried GDC yet.

Bye,
bearophile
Feb 16 2008
parent Sergey Gromov <snake.scaly gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:
 Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.

Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).

One of programmer's joys is to invent a wheel and pretend it's better than the others. ;)
 Timings of your code (that has a bug, see downs for a fixed version) on 

The only bug I can see is printing out characters through text-mode Windows
stdout, which expands every 0xA into "\r\n". This doesn't have any impact on
the benchmark.

-- SnakE
Feb 16 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:
If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit.<

Tim Burrell:
 Yeah, I was about to say the same.  See here:

Yep, see my TinyVector ;-)

Bye,
bearophile
Feb 15 2008
prev sibling parent downs <default_357-line yahoo.de> writes:
Another other observation: GDC's std.math functions still aren't being inlined
properly, forcing me to use the intrinsics manually.

That didn't cause the speed difference though.

Still, it would be nice to see it fixed some time soon, seeing as I filed the
bug in November :)

 --downs
Feb 15 2008
prev sibling parent "Saaa" <empty needmail.com> writes:
With a little bit of commenting, this could be an excellent tutorial. 
Feb 15 2008