D - D calling conventions

Mike Wynn <mike l8night.co.uk> writes:

Walter,

It appears that the D calling convention (even on Linux) is stdcall
(last param pushed first, callee cleans up).
Is there a reason for this? Personally I think stdcall/pascal are a
silly way to pass params; the caller should clean up the stack it
allocates, which makes code more robust.
I've heard people say that stdcall is more efficient on x86, but I don't
see it myself: with callee cleanup you cannot optimise calls within loops.
E.g.
for( int i = 0; i < someval; i++ ) {
	int b = 9*i;
	func( b, i, other, 50 );
}

can become
i :=0 ;
fr = create frame for func;
fr[2] = other;
fr[3] = 50;
jump check;
loop:
  fr[0] = 9*i;
  fr[1] = i;
  call func;
  i :=i+1;
check:
  if i < someval jump loop;
remove frame fr;

In fact, in this case fr[1] can be 'i' itself.
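
The transformation above can be sketched in plain C. This is only an illustration, not what any particular compiler emits: the `Frame` struct stands in for the outgoing argument area, and `func`, its return value, and the helper names are hypothetical, introduced so the two loops can be compared. It also assumes the callee leaves its argument slots untouched (modelled with `const`), which a real calling convention may not guarantee.

```c
#include <assert.h>

/* Hypothetical argument block standing in for the outgoing stack frame. */
typedef struct {
    int b;      /* fr[0]: changes every iteration */
    int i;      /* fr[1]: changes every iteration */
    int other;  /* fr[2]: loop-invariant */
    int k;      /* fr[3]: loop-invariant (the constant 50) */
} Frame;

/* Hypothetical callee: only reads its arguments from the caller-owned frame. */
static long func(const Frame *fr) {
    return (long)fr->b + fr->i + fr->other + fr->k;
}

/* Push-style version: all four argument slots are rebuilt for every call. */
long call_naive(int someval, int other) {
    long total = 0;
    for (int i = 0; i < someval; i++) {
        Frame fr = { 9 * i, i, other, 50 };
        total += func(&fr);
    }
    return total;
}

/* Caller-cleanup version: the frame is created once, the invariant slots
   are filled before the loop, and only fr[0] and fr[1] are rewritten. */
long call_hoisted(int someval, int other) {
    long total = 0;
    Frame fr;
    fr.other = other;
    fr.k = 50;
    for (int i = 0; i < someval; i++) {
        fr.b = 9 * i;
        fr.i = i;
        total += func(&fr);
    }
    return total;
}

/* Small self-check: both loops must produce the same total. */
void self_check(void) {
    assert(call_naive(10, 7) == call_hoisted(10, 7));
}
```

With callee cleanup the frame is gone after every call, so the hoisted form is not available; that is the point being made above.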

Sep 09 2003

"Walter" <walter digitalmars.com> writes:

"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjl8rq$2ccr$1 digitaldaemon.com...
 Walter,

 it appears that the D calling convention (even on linux) is stdcall
 (last param pushed first, callee cleans up)

No, as the last parameter is passed in a register.

 is there a reason for this, personally I think stdcall/pascal are a
 silly way to pass params, the caller should clean up the stack they
 allocate, makes code more robust

It's smaller code.

 I've heard people say that stdcall is more efficient on x86, but I don't
 see it myself: with callee cleanup you cannot optimise calls within loops.
 E.g.
 for( int i = 0; i < someval; i++ ) {
 int b = 9*i;
 func( b, i, other, 50 );
 }

 can become
 i :=0 ;
 fr = create frame for func;
 fr[2] = other;
 fr[3] = 50;
 jump check;
 loop:
   fr[0] = 9*i;
   fr[1] = i;
   call func;
   i :=i+1;
 check:
   if i < someval jump loop;
 remove frame fr;

 In fact, in this case fr[1] can be 'i' itself.

Those kinds of optimizations are possible, and if done, would make the
caller cleanup superior. But my code generator doesn't do them :-(

Sep 09 2003

Ilya Minkov <webmaster midiclub.de.vu> writes:

Walter wrote:

 Those kinds of optimizations are possible, and if done, would make the
 caller cleanup superior. But my code generator doesn't do them :-(

Is there any out there which does?

-eye

Sep 09 2003

Mike Wynn <mike l8night.co.uk> writes:

Ilya Minkov wrote:
 Walter wrote:
 
 Those kinds of optimizations are possible, and if done, would make the
 caller cleanup superior. But my code generator doesn't do them :-(

 
 
 Is there any out there which does?
 
 -eye
 

I thought gcc 3.2.x did ...
Obviously not; it compiles the loop into
push *4
call
reset esp
jump round loop again.

Sep 09 2003

Mike Wynn <mike l8night.co.uk> writes:

Walter wrote:
 "Mike Wynn" <mike l8night.co.uk> wrote in message
 news:bjl8rq$2ccr$1 digitaldaemon.com...
 
Walter,

it appears that the D calling convention (even on linux) is stdcall
(last param pushed first, callee cleans up)

 
 
 No, as the last parameter is passed in a register.

I assume you mean the last to be pushed, i.e. the first parameter,
as in func( int a, int b ): a in a register, b on the stack
(so for member functions "this" is in a register).

 
 
is there a reason for this, personally I think stdcall/pascal are a
silly way to pass params, the caller should clean up the stack they
allocate, makes code more robust

 
 
 It's smaller code.

so you save a few sub esp's

With caller cleanup, you know how many locals and how much param space
the function will require at most, so you only need to allocate once.
push ebp;
mov ebp, esp;


params at [esp + (4*param number)]
locals at [esp + (4*(max param number+local num))] // locals 0..m
or [ebp - 4*local] // locals numbered 1..n (m=n-1)

mov esp, ebp; pop ebp; ret;

or to save ever having to push/pop; (I believe this is then pairable)
mov [esp-4], ebp;
mov ebp, esp;

...
params at [esp + (4*param number)]
locals at [esp + (4*(max param number+local num))] // locals 0..m
or [ebp - 4*local] // locals numbered 2..n+1 (m=n-1)
....
mov esp, ebp; mov ebp, [ebp-4]; ret;

is it not quicker to do


mov [esp+8], eax;
mov [esp+4], ebx;
mov [esp], ecx;

than
push eax;
push ebx;
push ecx;

or can Pentium pair pushes ??


 
 
I've heard people say that stdcall is more efficient on x86, but I don't
see it myself: with callee cleanup you cannot optimise calls within loops.
E.g.
for( int i = 0; i < someval; i++ ) {
int b = 9*i;
func( b, i, other, 50 );
}

can become
i :=0 ;
fr = create frame for func;
fr[2] = other;
fr[3] = 50;
jump check;
loop:
  fr[0] = 9*i;
  fr[1] = i;
  call func;
  i :=i+1;
check:
  if i < someval jump loop;
remove frame fr;

In fact, in this case fr[1] can be 'i' itself.

 
 
 Those kinds of optimizations are possible, and if done, would make the
 caller cleanup superior. But my code generator doesn't do them :-(

Sep 09 2003

"Walter" <walter digitalmars.com> writes:

"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjlq0c$2ts$1 digitaldaemon.com...
 It's smaller code.

 so you save a few sub esp's

Yup, times thousands of function calls <g>.

 or can Pentium pair pushes ??

Which works out faster flip-flops back and forth on successive Intel chip
architectures :-(

Sep 09 2003

Mike Wynn <mike l8night.co.uk> writes:

Walter wrote:
 "Mike Wynn" <mike l8night.co.uk> wrote in message
 news:bjlq0c$2ts$1 digitaldaemon.com...
 
It's smaller code.

so you save a few sub esp's

 
 
 Yup, times thousands of function calls <g>.

From some basic tests I've been doing, it appears that
esp:=esp-N;esp[0] := a;esp[1] := b;esp[2] := c
call X
esp:=esp+N (can be delayed, i.e. lazy frame removal)
is slightly faster for C calls,
but push/pop is faster for D calls.

Interestingly, D with C calls is faster than gcc 3.2.2 :)
and there is little difference between D and C except in a few odd cases
(I have not tried method calls, as I can't do C param with dmd).

Interestingly,
int sum( int a, int b, int c ) { return a+b+c; }
is much slower than
int sum( int a, int b, int c ) { return c+b+a; }
The compiler uses the fact that c is in eax, and although it creates a
frame it does not have to store eax only to pull it back.

One serious speed-up would be the removal of leaf function frames.
In the same time it takes to do
push ebp;

you can do
mov [esp-4], ebx
mov [esp-8], esi

As it's a leaf function, [esp-N] can be used for locals and saved regs
without moving esp, and there is no need to change ebp.
Also, as the GC is pausing, it's not a problem having objects beyond esp.
First, it's a leaf func so it can't call new; and if new was inlined
(making the function a leaf), or it manipulates objects on the heap, the
gc will not be called until after the return. Most concurrent collectors
have to wait to "catch" the thread as it returns, or on a backwards
branch. In the former case there is no problem; in the latter, code would
be put in on the backwards branch, and this could do the movement of esp etc.

I believe this would speed up all those small member functions by a huge
amount (as ebx, esi, edi can all be stored very cheaply); chances are you
don't even need extra locals.

As an aside, I know eax is "this", but would it not make more sense to use
a saved reg instead? That way non-leaf member functions do not have to
save "this" to call their own methods that have return values, i.e. "this"
in ebx or edi.

 
 
or can Pentium pair pushes ??

 
 
 Which works out faster flip-flops back and forth on successive Intel chip
 architectures :-(

Sep 11 2003

"Walter" <walter digitalmars.com> writes:

"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjr3e7$1c9u$1 digitaldaemon.com...
 One serious speed-up would be the removal of leaf function frames.
 In the same time it takes to do
 push ebp;

 you can do
 mov [esp-4], ebx
 mov [esp-8], esi

 As it's a leaf function, [esp-N] can be used for locals and saved regs
 without moving esp, and there is no need to change ebp.
 Also, as the GC is pausing, it's not a problem having objects beyond esp.
 First, it's a leaf func so it can't call new; and if new was inlined
 (making the function a leaf), or it manipulates objects on the heap, the
 gc will not be called until after the return. Most concurrent collectors
 have to wait to "catch" the thread as it returns, or on a backwards
 branch. In the former case there is no problem; in the latter, code would
 be put in on the backwards branch, and this could do the movement of esp etc.

 I believe this would speed up all those small member functions by a huge
 amount (as ebx, esi, edi can all be stored very cheaply); chances are you
 don't even need extra locals.

I wish I could spend more time on the cg and implement some of these great
ideas. Unfortunately, for now all I can do is just fix bugs in it.

Sep 11 2003

"Sean L. Palmer" <palmer.sean verizon.net> writes:

You'd better be very careful about not protecting your stack frame by adjusting
esp in an environment where interrupts can happen that use the same stack
(e.g. DOS, or Win32 ring 0: driver or kernel level).

An interrupt can come along and start using the stack right below esp, and if
your proggy stored some stuff there it will be trashed.  These kinds of bugs
are really hard to track down.  This bit me on the Xbox when using an
Intel-supplied _ftol replacement.  ;)

Sean
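
The hazard described above can be mimicked with a toy model in C. This is a sketch under stated assumptions, not real interrupt handling: `Machine`, `interrupt`, and the two `leaf_*` functions are hypothetical names, the stack is just an array that grows downward, and the simulated interrupt writes two placeholder words below the stack pointer the way a same-stack interrupt would save state there.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy machine: a downward-growing stack plus a stack pointer index. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int sp;  /* index of the current top of stack (plays the role of esp) */
} Machine;

void machine_init(Machine *m) {
    for (int i = 0; i < STACK_WORDS; i++) {
        m->mem[i] = 0;
    }
    m->sp = STACK_WORDS / 2;
}

/* Simulated interrupt on the same stack: it saves state just below sp.
   sp is back at its old value on return, but the memory stays overwritten. */
void interrupt(Machine *m) {
    assert(m->sp >= 2);
    m->mem[m->sp - 1] = 0xDEADBEEFu;  /* stand-in for saved flags */
    m->mem[m->sp - 2] = 0xCAFEF00Du;  /* stand-in for saved return address */
}

/* Leaf that parks a value at [sp-1] without moving sp (the risky scheme). */
uint32_t leaf_unprotected(Machine *m, uint32_t value, int interrupted) {
    m->mem[m->sp - 1] = value;
    if (interrupted) {
        interrupt(m);
    }
    return m->mem[m->sp - 1];
}

/* Leaf that reserves its slot by moving sp first, so the interrupt
   lands below the saved value instead of on top of it. */
uint32_t leaf_protected(Machine *m, uint32_t value, int interrupted) {
    m->sp -= 1;
    m->mem[m->sp] = value;
    if (interrupted) {
        interrupt(m);
    }
    uint32_t result = m->mem[m->sp];
    m->sp += 1;
    return result;
}
```

In this model the unprotected leaf returns its value intact only when no interrupt arrives between the store and the load, which is why the bug shows up rarely and is hard to track down.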

"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjr3e7$1c9u$1 digitaldaemon.com...
 From some basic tests I've been doing, it appears that
 esp:=esp-N;esp[0] := a;esp[1] := b;esp[2] := c
 call X
 esp:=esp+N (can be delayed, i.e. lazy frame removal)
 is slightly faster for C calls,
 but push/pop is faster for D calls.

 Interestingly, D with C calls is faster than gcc 3.2.2 :)
 and there is little difference between D and C except in a few odd cases
 (I have not tried method calls, as I can't do C param with dmd).

 Interestingly,
 int sum( int a, int b, int c ) { return a+b+c; }
 is much slower than
 int sum( int a, int b, int c ) { return c+b+a; }
 The compiler uses the fact that c is in eax, and although it creates a
 frame it does not have to store eax only to pull it back.

 One serious speed-up would be the removal of leaf function frames.
 In the same time it takes to do
 push ebp;

 you can do
 mov [esp-4], ebx
 mov [esp-8], esi

 As it's a leaf function, [esp-N] can be used for locals and saved regs
 without moving esp, and there is no need to change ebp.
 Also, as the GC is pausing, it's not a problem having objects beyond esp.
 First, it's a leaf func so it can't call new; and if new was inlined
 (making the function a leaf), or it manipulates objects on the heap, the
 gc will not be called until after the return. Most concurrent collectors
 have to wait to "catch" the thread as it returns, or on a backwards
 branch. In the former case there is no problem; in the latter, code would
 be put in on the backwards branch, and this could do the movement of esp etc.

Sep 12 2003
