www.digitalmars.com         C & C++   DMDScript  

D - D calling conventions

reply Mike Wynn <mike l8night.co.uk> writes:
Walter,

It appears that the D calling convention (even on Linux) is stdcall
(last param pushed first, callee cleans up).
Is there a reason for this? Personally I think stdcall/pascal are a
silly way to pass params; the caller should clean up the stack it
allocates, which makes code more robust.
I've heard people say that stdcall is more efficient on x86, but I don't
see it myself: with callee cleanup you cannot optimise calls within loops.
E.g.
for( int i = 0; i < someval; i++ ) {
	int b = 9*i;
	func( b, i, other, 50 );
}

can become
i :=0 ;
fr = create frame for func;
fr[2] = other;
fr[3] = 50;
jump check;
loop:
  fr[0] = 9*i;
  fr[1] = i;
  call func;
  i :=i+1;
check:
  if i < someval jump loop;
remove frame fr;

In fact, in this case fr[1] can be 'i' itself, so the loop counter needs no separate storage.
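
The hoisting described above can be imitated in plain C by modelling the argument frame as a caller-owned array. This is only a sketch: `func`, `sum_calls_naive` and `sum_calls_hoisted` are hypothetical names, and the array stands in for stack slots that a real code generator would address through esp. It also assumes the callee leaves its argument slots unmodified, which a real ABI may not guarantee.

```c
/* Hypothetical callee: reads its four arguments from a caller-owned frame. */
static int func(const int frame[4])
{
    return frame[0] + frame[1] + frame[2] + frame[3];
}

/* Baseline: every iteration builds all four argument slots again. */
int sum_calls_naive(int someval, int other)
{
    int total = 0;
    for (int i = 0; i < someval; i++) {
        int frame[4] = { 9 * i, i, other, 50 };
        total += func(frame);
    }
    return total;
}

/* Hoisted: the frame exists once, and the loop-invariant slots
   (other, 50) are written a single time before the loop. */
int sum_calls_hoisted(int someval, int other)
{
    int frame[4];
    int total = 0;

    frame[2] = other;
    frame[3] = 50;
    for (int i = 0; i < someval; i++) {
        frame[0] = 9 * i;   /* only the varying slots are rewritten */
        frame[1] = i;
        total += func(frame);
    }
    return total;
}
```

Both versions return the same total; the hoisted one writes two slots per iteration instead of four. With callee cleanup the frame is gone after each call, so it would have to be rebuilt every time.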
Sep 09 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjl8rq$2ccr$1 digitaldaemon.com...
 Walter,

 it appears that the D calling convention (even on linux) is stdcall
 (last param pushed first, callee cleans up)
No, as the last parameter is passed in a register.
 is there a reason for this, personally I think stdcall/pascal are a
 silly way to pass params, the caller should clean up the stack they
 allocate, makes code more robust
It's smaller code.
 I've heard ppl say that stdcall is more efficient on x86, but don't see
 it myself, you can not optimise calls within loops.
 e.g
 for( int i = 0; i < someval; i++ ) {
 int b = 9*i;
 func( b, i, other, 50 );
 }

 can become
 i :=0 ;
 fr = create frame for func;
 fr[2] = other;
 fr[3] = 50;
 jump check;
 loop:
   fr[0] = 9*i;
   fr[1] = i;
   call func;
   i :=i+1;
 check:
   if i < someval jump loop;
 remove frame fr;

 infact in this case fr[1] can be 'i'
Those kinds of optimizations are possible, and if done, would make the caller cleanup superior. But my code generator doesn't do them :-(
Sep 09 2003
next sibling parent reply Ilya Minkov <webmaster midiclub.de.vu> writes:
Walter wrote:

 Those kinds of optimizations are possible, and if done, would make the
 caller cleanup superior. But my code generator doesn't do them :-(
Is there any out there which does? -eye
Sep 09 2003
parent Mike Wynn <mike l8night.co.uk> writes:
Ilya Minkov wrote:
 Walter wrote:
 
 Those kinds of optimizations are possible, and if done, would make the
 caller cleanup superior. But my code generator doesn't do them :-(
Is there any out there which does? -eye
I thought gcc 3.2.x did ... obviously not: it compiles the loop into push *4, call, reset esp, jump round the loop again.
Sep 09 2003
prev sibling parent reply Mike Wynn <mike l8night.co.uk> writes:
Walter wrote:
 "Mike Wynn" <mike l8night.co.uk> wrote in message
 news:bjl8rq$2ccr$1 digitaldaemon.com...
 
Walter,

it appears that the D calling convention (even on linux) is stdcall
(last param pushed first, callee cleans up)
No, as the last parameter is passed in a register.
I assume you mean the last to be pushed, i.e. the first parameter: in func(int a, int b), a is in a register and b is on the stack (so for member functions "this" is in a register).
 
 
is there a reason for this, personally I think stdcall/pascal are a
silly way to pass params, the caller should clean up the stack they
allocate, makes code more robust
It's smaller code.
So you save a few sub esp's. With caller cleanup, you know how many locals and how much max param space the function will require, so you only need to allocate once:
 push ebp;
 mov ebp, esp;
 params at [esp + (4*param number)]
 locals at [esp + (4*(max param number+local num))] // locals 0..m
   or [ebp - 4*local] // locals numbered 1..n (m=n-1)
 mov esp, ebp;
 pop ebp;
 ret;

or, to save ever having to push/pop (I believe this is then pairable):
 mov [esp-4], ebp;
 mov ebp, esp;
 ...
 params at [esp + (4*param number)]
 locals at [esp + (4*(max param number+local num))] // locals 0..m
   or [ebp - 4*local] // locals numbered 2..n+1 (m=n-1)
 ....
 mov esp, ebp;
 mov ebp, [ebp-4];
 ret;

Is it not quicker to do
 mov [esp+8], eax;
 mov [esp+4], ebx;
 mov [esp], ecx;
than
 push eax;
 push ebx;
 push ecx;
or can the Pentium pair pushes?
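
The single-allocation layout described above comes down to simple offset arithmetic. The sketch below assumes 4-byte slots and uses hypothetical helper names; it illustrates the addressing formulas only and is not how any particular compiler lays out its frames.

```c
enum { WORD_BYTES = 4 };   /* assumption: 32-bit x86, one word per slot */

/* Outgoing argument k lives at [esp + 4*k]. */
int outgoing_param_offset(int param_index)
{
    return WORD_BYTES * param_index;
}

/* Local j sits above the outgoing argument area:
   [esp + 4*(max outgoing params + j)], locals numbered from 0. */
int local_offset(int max_outgoing_params, int local_index)
{
    return WORD_BYTES * (max_outgoing_params + local_index);
}

/* One allocation covers the largest argument area plus all locals. */
int frame_bytes(int max_outgoing_params, int local_count)
{
    return WORD_BYTES * (max_outgoing_params + local_count);
}
```

For a function with three locals whose largest call passes four arguments, this gives a 28-byte frame with the first local at [esp+16].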
 
 
Sep 09 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjlq0c$2ts$1 digitaldaemon.com...
 It's smaller code.
so you save a few sub esp's
Yup, times thousands of function calls <g>.
 or can Pentium pair pushes ??
Which works out faster flip-flops back and forth on successive Intel chip architectures :-(
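
The "smaller code" argument can be put into rough numbers. The sketch below assumes the usual 32-bit x86 encodings (`add esp, imm8` is 3 bytes, `ret` is 1 byte, `ret imm16` is 3 bytes) and assumes every function takes stack arguments; the function and call-site counts are hypothetical inputs, not measurements from any real program.

```c
/* Assumed instruction sizes on 32-bit x86. */
enum {
    ADD_ESP_IMM8_BYTES = 3,   /* caller cleanup, once per call site */
    RET_BYTES          = 1,   /* plain return                       */
    RET_IMM16_BYTES    = 3    /* callee cleanup, once per function  */
};

/* Caller cleanup: a plain ret per function plus an esp fix-up per call. */
long caller_cleanup_bytes(long functions, long call_sites)
{
    return functions * RET_BYTES + call_sites * ADD_ESP_IMM8_BYTES;
}

/* Callee cleanup: the cost is paid once per function, not per call. */
long callee_cleanup_bytes(long functions, long call_sites)
{
    (void)call_sites;   /* call sites need no cleanup instruction */
    return functions * RET_IMM16_BYTES;
}
```

With 100 functions and 1000 call sites this model gives 3100 bytes against 300. It ignores the case argued earlier in the thread, where a caller merges or delays its stack adjustments.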
Sep 09 2003
parent reply Mike Wynn <mike l8night.co.uk> writes:
Walter wrote:
 "Mike Wynn" <mike l8night.co.uk> wrote in message
 news:bjlq0c$2ts$1 digitaldaemon.com...
 
It's smaller code.
so you save a few sub esp's
Yup, times thousands of function calls <g>.
From some basic tests I've been doing, it appears that
 esp := esp-N; esp[0] := a; esp[1] := b; esp[2] := c
 call X
 esp := esp+N (can be delayed, i.e. lazy frame removal)
is slightly faster for C calls, but push/pop is faster for D calls.

Interestingly, D with C calls is faster than gcc 3.2.2 :) and there is little difference between D and C except in a few odd cases (I have not tried method calls, as I can't do C param with dmd).

Interestingly,
 int sum( int a, int b, int c ) { return a+b+c; }
is much slower than
 int sum( int a, int b, int c ) { return c+b+a; }
The compiler uses the fact that c is in eax and, although it creates a frame, it does not have to store eax only to pull it back.

One serious speed-up would be the removal of leaf function frames. In the same time it takes to do
 push ebp;
you can do
 mov [esp-4], ebx
 mov [esp-8], esi

As it's a leaf function, [esp-N] can be used for locals and saved registers without moving esp, and there is no need to change ebp. Also, as the GC is pausing, it's not a problem having objects beyond esp: first, it's a leaf function so it can't call new, and if new was inlined (making the function a leaf) or it manipulates objects on the heap, the GC will not be called until after the return. Most concurrent collectors have to wait to "catch" the thread as it returns, or on a backwards branch. In the former case there is no problem; in the latter, code would be put in on the backwards branch, and this could do the movement of esp etc.

I believe this would speed up all those small member functions by a huge amount (as ebx, esi, edi can all be stored very cheaply); chances are you don't even need extra locals.

As an aside, I know eax is "this", but would it not make more sense to use a saved register instead? That way non-leaf member functions do not have to save "this" to call their own methods that have return values, i.e. "this" in ebx or edi.
 
 
Sep 11 2003
next sibling parent "Walter" <walter digitalmars.com> writes:
"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjr3e7$1c9u$1 digitaldaemon.com...
 one seriour speed up would be the removal of leaf function frames
 in the same time it takes to do
 push ebp;

 you can do
 mov ebx, [esp-4]
 mov esi, [esp-8]

 as its a leaf function [esp-N] can be used for locals and saved reg's
 with out moving esp and there is no need to change ebp
 also as GC is pausing its not a problem having objects beyond esp, first
 it's a leaf func so can't call new, and if new was inlined making the
 function a leaf or it manipulates objects on the heap the gc wil not be
 called until after the return. most concurrent collectors have to wait
 to "catch" the thread as they return, or on backwards branch. in the
 former no problem, in the latter code would be put in on the backwards
 branch, this could do the movement of esp etc.

 I believe this would spped up all those small member functionsby a huge
 amount, (as ebx,esi,edi can all be stored very cheaply) chances are you
 don't even need extra locals.
I wish I could spend more time on the cg and implement some of these great ideas. Unfortunately, for now all I can do is just fix bugs in it.
Sep 11 2003
prev sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
You better be very careful with not protecting your stack frame by adjusting
esp, in an environment where interrupts can happen that use the same stack
(i.e. DOS, or Win32 ring 0, say, driver or kernel level).

An interrupt can come along, start using the stack right below esp, and if
your proggy stored some stuff there it will be trashed.  These kinds of bugs
are really hard to track down.  This bit me on the Xbox when using an
intel-supplied _ftol replacement.  ;)

Sean
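
The hazard described above can be reproduced in a toy model. The sketch below simulates a downward-growing stack as an array and an "interrupt" that pushes onto that same stack; every name here is hypothetical and nothing touches the real machine stack.

```c
#include <stddef.h>

enum { STACK_WORDS = 16 };

/* Simulated stack: grows toward index 0, sp is the index of the top. */
typedef struct {
    int    mem[STACK_WORDS];
    size_t sp;
} SimStack;

static void sim_push(SimStack *s, int v) { s->mem[--s->sp] = v; }
static void sim_pop(SimStack *s)         { s->sp++; }

/* Models an interrupt handler saving two words on the current stack. */
static void sim_interrupt(SimStack *s)
{
    sim_push(s, -1);
    sim_push(s, -1);
    sim_pop(s);
    sim_pop(s);
}

/* Leaf stores a value below sp without moving sp (like mov [esp-4], x). */
int store_below_sp(int value)
{
    SimStack s = { {0}, STACK_WORDS };
    s.mem[s.sp - 1] = value;
    sim_interrupt(&s);          /* handler overwrites the unprotected slot */
    return s.mem[s.sp - 1];
}

/* Leaf reserves the slot first (like sub esp, 4), then stores. */
int store_after_reserving(int value)
{
    SimStack s = { {0}, STACK_WORDS };
    s.sp -= 1;
    s.mem[s.sp] = value;
    sim_interrupt(&s);          /* handler pushes below the reserved slot */
    return s.mem[s.sp];
}
```

In this model `store_below_sp(42)` comes back as -1 because the handler's first push lands on the slot, while `store_after_reserving(42)` returns 42.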

"Mike Wynn" <mike l8night.co.uk> wrote in message
news:bjr3e7$1c9u$1 digitaldaemon.com...
 from some basic tests I've been doing it appear that
 esp:=esp-N;esp[0] := a;esp[1] := b;esp[2] := c
 call X
 esp:=esp+N (can be delayed i.e. lazy frame removal)
 is slightly faster for C calls
 but push/pop faster for D calls

 interestingly D with C calls is faster than gcc 3.2.2 :)
 and there is little difference D or C except in a few odd cases (not
 tried method calls as I can't do C param with dmd)

 interestingly
 int sum( int a, int b, int c ) { return a+b+c; }
 is much slower than
 int sum( int a, int b, int c ) { return c+b+a; }
 the compiler uses the fact c is in eax and although it creates a frame
 it does not have to store eax only to pull it back.

 one seriour speed up would be the removal of leaf function frames
 in the same time it takes to do
 push ebp;

 you can do
 mov ebx, [esp-4]
 mov esi, [esp-8]

 as its a leaf function [esp-N] can be used for locals and saved reg's
 with out moving esp and there is no need to change ebp
 also as GC is pausing its not a problem having objects beyond esp, first
 it's a leaf func so can't call new, and if new was inlined making the
 function a leaf or it manipulates objects on the heap the gc wil not be
 called until after the return. most concurrent collectors have to wait
 to "catch" the thread as they return, or on backwards branch. in the
 former no problem, in the latter code would be put in on the backwards
 branch, this could do the movement of esp etc.
Sep 12 2003