digitalmars.D.bugs - [Issue 17965] New: Unexplained usage of the FPU while function

https://issues.dlang.org/show_bug.cgi?id=17965

          Issue ID: 17965
           Summary: Unexplained usage of the FPU while function result
                    already in right XMM registers
           Product: D
           Version: D2
          Hardware: x86_64
                OS: All
            Status: NEW
          Keywords: performance
          Severity: normal
          Priority: P1
         Component: dmd
          Assignee: nobody puremagic.com
          Reporter: b2.temp gmx.com

For the following trivial function:

---
struct Point{double x,y;}

Point foo()
{
    Point result;
    return result;
}
---

dmd64 with -O generates:

;------- SUB 000000000044E1C0h -------
000000000044E1C0h  push rbp
000000000044E1C1h  mov rbp, rsp
000000000044E1C4h  sub rsp, 20h
000000000044E1C8h  lea rax, qword ptr [00000000004C92F0h]
000000000044E1CFh  movsd xmm0, qword ptr [rax]     // result.x = double.init; default init, OK
000000000044E1D3h  movsd qword ptr [rbp-10h], xmm0 // spill result.x to a stack temp, because?
000000000044E1D9h  movsd xmm1, qword ptr [rax+08h] // result.y = double.init; default init, OK
000000000044E1DEh  movsd qword ptr [rbp-08h], xmm1 // spill result.y to a stack temp, because?
000000000044E1E4h  fld qword ptr [rbp-10h]         // push the whole result onto the x87 FPU stack, because?
000000000044E1E7h  fld qword ptr [rbp-08h]         // ...
000000000044E1EAh  fstp qword ptr [rbp-20h]        // ...
000000000044E1EDh  movsd xmm1, qword ptr [rbp-20h] // reload the result back into XMM0 and XMM1, because?
000000000044E1F2h  fstp qword ptr [rbp-20h]        //
000000000044E1F5h  movsd xmm0, qword ptr [rbp-20h] // .
000000000044E1FAh  mov rsp, rbp
000000000044E1FDh  pop rbp
000000000044E1FEh  ret 
;-------------------------------------

Point.x is returned in the low half of XMM0 and Point.y in the low half of XMM1.
From 000000000044E1E4h to 000000000044E1F5h, the result is pushed onto the FPU
stack and then loaded back into XMM0 and XMM1 for no apparent reason. In
addition, 32 bytes of stack are allocated for this useless transfer, which
forces the function prologue and epilogue to be emitted.


The expected backend output is something like:

---
lea rax, qword ptr [<address of init>]
movsd xmm0, qword ptr [rax]
movsd xmm1, qword ptr [rax+08h]
ret
---
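
For anyone who wants to reproduce this, here is a minimal sketch of a complete test program built around the function above. The file name `issue17965.d` and the `main` wrapper are assumptions added for illustration and are not part of the original report. The assertion only checks that the returned value is correct (both fields hold `double.init`, which is NaN in D). The codegen itself still has to be inspected by hand in a disassembler.

```d
// issue17965.d (hypothetical file name)
import std.math : isNaN;

struct Point { double x, y; }

Point foo()
{
    Point result;   // default-initialized: x and y are double.init (NaN)
    return result;
}

void main()
{
    const p = foo();
    // The returned value is correct either way; the issue is only about
    // the redundant round trip through the x87 FPU in the generated code.
    assert(isNaN(p.x) && isNaN(p.y));
}
```

Building with `dmd -m64 -O issue17965.d` and disassembling the resulting binary (for example with `objdump -d`) should show whether the `fld`/`fstp` sequence is still emitted for `foo`.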

--
Nov 03