
digitalmars.D.learn - Strange instruction sequence with DMD while calling functions with

reply PatateVerte <patate.verte yahoo.com> writes:
Hello
I noticed strange behaviour from the DMD compiler when it calls a
function with float arguments.

I build with the flags "-mcpu=avx2 -O  -m64" on 64-bit Windows,
using "DMD32 D Compiler v2.090.1-dirty".

I have the following function:
    float mul_add(float a, float b, float c); //Return a * b + c

When I call it:
    float f = mul_add(1.0, 2.0, 3.0);

the following instructions are generated (I tested other functions
with float parameters, and they show the same problem):
    //Loads the values, as expected
    vmovss xmm2,dword [rel 0x64830]
    vmovss xmm1,dword [rel 0x64834]
    vmovss xmm0,dword [rel 0x64838]
    //Why ?
    movq r8,xmm2
    movq rdx,xmm1
    movq rcx,xmm0
    call 0x400   //0x400 is where the mul_add function is located

My questions are:
  - Is there a reason why xmm0/1/2 are copied into rcx/rdx/r8
before the call? The calling convention specifies that
floating-point parameters are passed in xmm registers, not in
GPRs, unless you are using your own calling convention.
  - Why is it done with non-AVX instructions? Mixing AVX and
non-AVX instructions can hurt performance significantly.

Any ideas? Thank you in advance.
Feb 14 2020
parent Basile B. <b2.temp gmx.com> writes:
On Friday, 14 February 2020 at 22:36:20 UTC, PatateVerte wrote:
 [...]
It's simply bad codegen (or rather a missed optimization opportunity) from DMD: its backend doesn't see that the parameters are already in the right order and in the right registers, so it copies them and puts them in the registers for the inner function call. I have observed this in the past too, i.e. unexplained round-tripping between GP and SSE registers.

For good FP codegen, use LDC2 or GDC, or write inline asm (but lose inlining).

For other people who'd like to observe the problem: https://godbolt.org/z/gvqEqz. By the way, I had to deactivate AVX2 targeting because otherwise the result is even weirder (https://godbolt.org/z/T9NwMc).
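
For anyone who would rather reproduce this locally than on godbolt, here is a minimal sketch of a test file. The body of mul_add is only assumed from the "Return a * b + c" comment in the original post, and the pragma is my own addition to keep the call out of line so the argument set-up stays visible in the disassembly.

```d
// Minimal reproduction sketch. Only the signature of mul_add comes from
// the original post; its body is an assumption based on the comment
// "Return a * b + c".
import std.stdio : writeln;

// Ask the compiler not to inline the callee, so the call sequence
// (argument loads followed by the call) can be inspected.
pragma(inline, false)
float mul_add(float a, float b, float c)
{
    return a * b + c;
}

void main()
{
    // Three float constants, as in the original call.
    float f = mul_add(1.0f, 2.0f, 3.0f);
    writeln(f);
}
```

Building it with the flags from the original post (-mcpu=avx2 -O -m64) and disassembling the object file should show the argument set-up just before the call to mul_add. Building the same file with LDC2 or GDC gives something to compare against.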
Feb 14 2020