digitalmars.D - Compilation of a numerical kernel
- bearophile, Jun 27 2010
Recently I have seen some work from Don about floating point optimization in
DMD:
http://d.puremagic.com/issues/show_bug.cgi?id=4380
http://d.puremagic.com/issues/show_bug.cgi?id=4383
so maybe he is interested in this too. This test function is the inner loop of
a program and it's one of its hottest spots: it determines the performance of
the whole small program, so even though it's just three lines of code it needs
to be optimized well by the compiler (the D code can also be modified to unroll
the loop a few times).
// D code
double foo(double[] arr1, double[] arr2) {
    double diff = 0.0;
    for (int i; i < arr1.length; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}

void main() {}
D code compiled by DMD, optimized build:
L38: fld qword ptr [EDX*8][ECX]
fsub qword ptr [EDX*8][EBX]
inc EDX
cmp EDX,058h[ESP]
fstp qword ptr 014h[ESP]
fld qword ptr 014h[ESP]
fmul ST,ST(0)
fadd qword ptr 4[ESP]
fstp qword ptr 4[ESP]
jb L38
D code compiled by LDC, optimized build:
.LBB13_5:
movsd (%edi,%ecx,8), %xmm1
subsd (%eax,%ecx,8), %xmm1
incl %ecx
cmpl %esi, %ecx
mulsd %xmm1, %xmm1
addsd %xmm1, %xmm0
jne .LBB13_5
The asm produced by DMD is not efficient, and it's not just a matter of SSE
register usage.
I have translated the function to C to see how GCC compiles it, first without SSE:
// C code
double foo(double* arr1, double* arr2, int len) {
    double diff = 0.0;
    int i;
    for (i = 0; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
C code compiled with gcc 4.5 (32 bit):
L3:
fldl (%ecx,%eax,8)
fsubl (%ebx,%eax,8)
incl %eax
fmul %st(0), %st
cmpl %edx, %eax
faddp %st, %st(1)
jne L3
This is an example of how a compiler can compile it: the loop is unrolled once
and each SSE instruction works on two doubles (this is 64-bit code), so it is
equivalent to a 4x unroll:
Modified C code compiled with GCC (64 bit):
L3:
movapd (%rcx,%rax), %xmm1
subpd (%rdx,%rax), %xmm1
movapd %xmm1, %xmm0
mulpd %xmm1, %xmm0
addpd %xmm0, %xmm2
movapd 16(%rcx,%rax), %xmm0
subpd 16(%rdx,%rax), %xmm0
addq $32, %rax
mulpd %xmm1, %xmm0
cmpq %r8, %rax
addpd %xmm0, %xmm3
jne L3
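The post doesn't show the modified C source that produced this asm; as a sketch (my guess, with names of my own choosing), a manual 4x unroll with independent accumulators has roughly the right shape for the compiler to pair iterations into packed SSE operations:

```c
// Hypothetical manually unrolled 4x kernel with independent accumulators,
// so the compiler can keep partial sums in separate registers and use
// packed SSE math. Assumes len is a multiple of 4; a real version needs a
// cleanup loop for the remainder.
double foo4(double* arr1, double* arr2, int len) {
    double d0 = 0.0, d1 = 0.0, d2 = 0.0, d3 = 0.0;
    int i;
    for (i = 0; i < len; i += 4) {
        double a0 = arr1[i]     - arr2[i];
        double a1 = arr1[i + 1] - arr2[i + 1];
        double a2 = arr1[i + 2] - arr2[i + 2];
        double a3 = arr1[i + 3] - arr2[i + 3];
        d0 += a0 * a0;
        d1 += a1 * a1;
        d2 += a2 * a2;
        d3 += a3 * a3;
    }
    return (d0 + d1) + (d2 + d3);
}
```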
Cache prefetch instructions can't help much here, because the memory access
pattern is purely sequential.
Bye,
bearophile