digitalmars.D.bugs - [Issue 16605] New: core.simd generates slow/irrelevant code
- via Digitalmars-d-bugs (100/100) Oct 08 2016 https://issues.dlang.org/show_bug.cgi?id=16605
https://issues.dlang.org/show_bug.cgi?id=16605

Issue ID: 16605
Summary: core.simd generates slow/irrelevant code
Product: D
Version: D2
Hardware: x86_64
OS: Linux
Status: NEW
Severity: enhancement
Priority: P1
Component: dmd
Assignee: nobody puremagic.com
Reporter: malte.kiessling mkalte.me

I tried working with core.simd and noticed that, at least for trivial operations such as += and *=, the generated code is slow (slower than without SSE instructions!). I used asm.dlang.org with the newest dmd to get the results below.

This code:
****
import core.simd;

void doStuff()
{
    float4 x = [1.0,0.4,1234.0,124.0];
    float4 y = [1.0,0.4,1234.0,124.0];
    float4 z = [1.0,0.4,1234.0,123.0];

    for(long i = 0; i<1_000_000; i++)
    {
        x += y;
        x += z;
        z += x;
    }
}
****

results in the following assembly (I pasted only the function):
****
void example.doStuff():
 push   rbp
 mov    rbp,rsp
 sub    rsp,0x40
 movaps xmm0,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
 movaps XMMWORD PTR [rbp-0x40],xmm0
 movaps xmm1,XMMWORD PTR [rip+0x0]        # 1a <void example.doStuff()+0x1a>
 movaps XMMWORD PTR [rbp-0x30],xmm1
 movaps xmm2,XMMWORD PTR [rip+0x0]        # 25 <void example.doStuff()+0x25>
 movaps XMMWORD PTR [rbp-0x20],xmm2
 mov    QWORD PTR [rbp-0x10],0x0
 cmp    QWORD PTR [rbp-0x10],0xf4240
 jge    6e <void example.doStuff()+0x6e>
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
 inc    QWORD PTR [rbp-0x10]
 jmp    31 <void example.doStuff()+0x31>
 leave
 ret
****

The most important part is the body of the for loop:
****
x += y;
x += z;
z += x;
****

becomes
****
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
****

instead of
****
 addps xmm0,xmm1
 addps xmm0,xmm2
 addps xmm2,xmm0
****

So the results of the calculation are written back to memory in every loop iteration, instead of being loaded into the xmm registers before the loop and stored back once afterwards.

Also, at the beginning the values of the float4 variables are loaded into xmm0-xmm2. Instead of being used inside the loop, these registers are ignored there and serve only to copy the initial values to their stack slots.

As a result, the generated code runs slower than the manual operation on an array, instead of giving a significant speedup.
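
The report does not show the "manual operation on an array" it compares against. A minimal sketch of what such a scalar baseline might look like, assuming plain float[4] static arrays and D's array-wise operations; the function name doStuffScalar and its iterations parameter are hypothetical and not taken from the report:

```d
// Hypothetical scalar baseline: same arithmetic as doStuff(), but on
// float[4] static arrays instead of core.simd's float4.
float[4] doStuffScalar(long iterations)
{
    float[4] x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] z = [1.0f, 0.4f, 1234.0f, 123.0f];

    for (long i = 0; i < iterations; i++)
    {
        // Element-wise array operations; whether the compiler emits
        // SSE instructions for them is up to the compiler.
        x[] += y[];
        x[] += z[];
        z[] += x[];
    }

    // Returning z (an assumption, the original returns nothing)
    // makes the result observable to the caller.
    return z;
}
```

Calling doStuffScalar(1_000_000) would correspond to the loop count in the report. Comparing its disassembly with that of doStuff() should show whether the float4 version really performs more loads and stores per iteration.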