digitalmars.D.bugs - [Issue 10636] New: Vector calling convention for D?
- d-bugmail puremagic.com (124/130) Jul 13 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10636
http://d.puremagic.com/issues/show_bug.cgi?id=10636

           Summary: Vector calling convention for D?
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc


The VS2013 designers have added a new calling convention that allows SIMD
values to be passed to functions in registers, avoiding the stack in most
cases:

http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx

An example D program:


import core.stdc.stdio, core.simd;

struct Particle { float4 x, y; }

Particle addParticles(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

void main() {
    auto p1 = Particle([1, 2, 3, 4], [10, 20, 30, 40]);
    printf("%f %f %f %f %f %f %f %f\n",
           p1.x.array[0], p1.x.array[1], p1.x.array[2], p1.x.array[3],
           p1.y.array[0], p1.y.array[1], p1.y.array[2], p1.y.array[3]);

    auto p2 = Particle([100, 200, 300, 400], [1000, 2000, 3000, 4000]);
    printf("%f %f %f %f %f %f %f %f\n",
           p2.x.array[0], p2.x.array[1], p2.x.array[2], p2.x.array[3],
           p2.y.array[0], p2.y.array[1], p2.y.array[2], p2.y.array[3]);

    auto p3 = addParticles(p1, p2);
    printf("%f %f %f %f %f %f %f %f\n",
           p3.x.array[0], p3.x.array[1], p3.x.array[2], p3.x.array[3],
           p3.y.array[0], p3.y.array[1], p3.y.array[2], p3.y.array[3]);
}


Compiling that code with ldc2 v0.11.0 on Windows 32 bit with:

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d

produces this x86 asm:

__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $16, %esp
    movaps  40(%ebp), %xmm0
    movaps  56(%ebp), %xmm1
    addps   8(%ebp), %xmm0
    addps   24(%ebp), %xmm1
    movups  %xmm1, 16(%eax)
    movups  %xmm0, (%eax)
    movl    %ebp, %esp
    popl    %ebp
    ret     $64

__Dmain:
    ...
    movaps  160(%esp), %xmm0
    movaps  176(%esp), %xmm1
    movaps  %xmm1, 48(%esp)
    movaps  %xmm0, 32(%esp)
    movaps  128(%esp), %xmm0
    movaps  144(%esp), %xmm1
    movaps  %xmm1, 16(%esp)
    movaps  %xmm0, (%esp)
    leal    96(%esp), %eax
    calll   __D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle
    subl    $64, %esp
    movss   96(%esp), %xmm0
    movss   100(%esp), %xmm1
    movss   104(%esp), %xmm2
    movss   108(%esp), %xmm3
    movss   112(%esp), %xmm4
    movss   116(%esp), %xmm5
    movss   120(%esp), %xmm6
    movss   124(%esp), %xmm7
    cvtss2sd %xmm7, %xmm7
    movsd   %xmm7, 60(%esp)
    cvtss2sd %xmm6, %xmm6
    movsd   %xmm6, 52(%esp)
    cvtss2sd %xmm5, %xmm5
    movsd   %xmm5, 44(%esp)
    cvtss2sd %xmm4, %xmm4
    movsd   %xmm4, 36(%esp)
    cvtss2sd %xmm3, %xmm3
    movsd   %xmm3, 28(%esp)
    cvtss2sd %xmm2, %xmm2
    movsd   %xmm2, 20(%esp)
    cvtss2sd %xmm1, %xmm1
    movsd   %xmm1, 12(%esp)
    cvtss2sd %xmm0, %xmm0
    movsd   %xmm0, 4(%esp)
    movl    $_.str3, (%esp)
    calll   ___mingw_printf
    xorl    %eax, %eax
    movl    %ebp, %esp
    popl    %ebp
    ret

As shown in that article, with a vector calling convention setting up the
arguments of addParticles needs only four movaps instructions (instead of
eight, plus the leal). With the vector calling convention the body of
addParticles also gets shorter, because all the needed operands are already
in xmm registers: it probably reduces to just two addps, a ret, and maybe
two movaps to put the result in the right output registers.

D is meant to be useful for people who write fast video games or other
numerical code, and both use plenty of SIMD code, so I think adding such an
optimization could be useful. But I can't estimate how much of an advantage
it would give; benchmarks are needed.

They write:

> Today on AMD64 target, passed by value vector arguments (such as
> __m128/__m256/) must be turned into a passed by address of a temporary
> buffer (i.e. $T1, $T2, $T3 in the figure above) allocated in caller's
> local stack as shown in the figure above.
> We have been receiving increasing concerns about this inefficiency in
> past years, especially from game, graphic, video/audio, and codec
> domains. A concrete example is MS XNA library in which passing vector
> arguments is a common pattern in many APIs of XNAMath library. The
> inefficiency will be intensified on upcoming AVX2/AVX3 and future
> processors with wider vector registers.

On the other hand small functions get inlined, and introducing a new
calling convention has a disadvantage, as commented by Iain Buclaw:

> I'd vote for not adding more fluff which makes ABI differences between
> compilers greater. But it certainly looks like it would be useful if you
> wish to save the time taken to copy the vector from XMM registers onto
> the stack and back again when passing values around.

Maybe such a vector calling convention will become more standard in the
future, as it seems a useful improvement.

--
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------