
digitalmars.D - Compilation of a numerical kernel

Recently I have seen some work by Don on floating-point optimization in
DMD:
http://d.puremagic.com/issues/show_bug.cgi?id=4380
http://d.puremagic.com/issues/show_bug.cgi?id=4383

so maybe he is interested in this too. This test program is the inner loop of
a program and one of its hottest spots: it determines the performance of the
whole small program. So even though it's just three lines of code, it needs to
be optimized well by the compiler (the D code can also be modified to unroll
the loop a few times).



// D code
double foo(double[] arr1, double[] arr2) {
    double diff = 0.0;
    for (int i; i < arr1.length; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
void main() {}


D code compiled by DMD, optimized build:
L38:    fld qword ptr [EDX*8][ECX]
        fsub    qword ptr [EDX*8][EBX]
        inc EDX
        cmp EDX,058h[ESP]
        fstp    qword ptr 014h[ESP]
        fld qword ptr 014h[ESP]
        fmul    ST,ST(0)
        fadd    qword ptr 4[ESP]
        fstp    qword ptr 4[ESP]
        jb  L38


D code compiled by LDC, optimized build:
.LBB13_5:
    movsd   (%edi,%ecx,8), %xmm1
    subsd   (%eax,%ecx,8), %xmm1
    incl    %ecx
    cmpl    %esi, %ecx
    mulsd   %xmm1, %xmm1
    addsd   %xmm1, %xmm0
    jne .LBB13_5


The asm produced by DMD is not efficient, and it's not a matter of SSE
register usage: on every iteration it stores the difference and the running
sum to the stack and loads them back.
I have translated the code to C to see how GCC compiles it without SSE:

// C code
double foo(double* arr1, double* arr2, int len) {
    double diff = 0.0;
    int i;
    for (i = 0; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }

    return diff;
}


C code compiled with gcc 4.5 (32 bit):
L3:
    fldl    (%ecx,%eax,8)
    fsubl   (%ebx,%eax,8)
    incl    %eax
    fmul    %st(0), %st
    cmpl    %edx, %eax
    faddp   %st, %st(1)
    jne L3
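
The manual unrolling mentioned at the start can be sketched in C as follows
(the same change applies to the D version). This is only an illustration, not
code from the program: the name foo_unrolled and the choice of two
accumulators are my assumptions. Note that it sums in a different order than
the simple loop, so the result can differ in the last bits.

```c
#include <stddef.h>

/* Hypothetical name, for illustration only.
   Sum of squared differences, unrolled by two, with two independent
   accumulators so consecutive additions do not depend on each other. */
double foo_unrolled(const double *arr1, const double *arr2, size_t len) {
    double diff0 = 0.0, diff1 = 0.0;
    size_t i = 0;

    /* Main loop: two elements per iteration. */
    for (; i + 1 < len; i += 2) {
        double aux0 = arr1[i]     - arr2[i];
        double aux1 = arr1[i + 1] - arr2[i + 1];
        diff0 += aux0 * aux0;
        diff1 += aux1 * aux1;
    }

    /* Tail: at most one leftover element when len is odd. */
    if (i < len) {
        double aux = arr1[i] - arr2[i];
        diff0 += aux * aux;
    }

    return diff0 + diff1;
}
```

It is called like the original foo, e.g. foo_unrolled(a, b, n) with two
arrays of at least n doubles.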


This is an example of how a compiler can compile it: the loop is unrolled once
and each SSE instruction works on two doubles (this one is a 64-bit build), so
it is equivalent to a 4x unroll:

Modified C code compiled with GCC (64 bit):
L3:
    movapd  (%rcx,%rax), %xmm1
    subpd   (%rdx,%rax), %xmm1
    movapd  %xmm1, %xmm0
    mulpd   %xmm1, %xmm0
    addpd   %xmm0, %xmm2
    movapd  16(%rcx,%rax), %xmm0
    subpd   16(%rdx,%rax), %xmm0
    addq    $32, %rax
    mulpd   %xmm1, %xmm0
    cmpq    %r8, %rax
    addpd   %xmm0, %xmm3
    jne L3
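
The modified C source is not shown here. As a rough sketch of what the loop
above does, this is a hand-written version using SSE2 intrinsics. The name
foo_sse2 is hypothetical, and unlike the movapd loads in the listing it uses
unaligned loads, so it makes no assumption about array alignment. It falls
back to the plain loop when SSE2 is not available, and like any version with
several accumulators it sums in a different order than the scalar code.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical name, for illustration only.
   Sum of squared differences, four doubles per iteration:
   two packed operations, each on two doubles. */
double foo_sse2(const double *arr1, const double *arr2, size_t len) {
    double diff = 0.0;
    size_t i = 0;

#if defined(__SSE2__)
    /* Two packed accumulators, like xmm2 and xmm3 in the listing. */
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();

    for (; i + 4 <= len; i += 4) {
        __m128d d0 = _mm_sub_pd(_mm_loadu_pd(arr1 + i),
                                _mm_loadu_pd(arr2 + i));
        __m128d d1 = _mm_sub_pd(_mm_loadu_pd(arr1 + i + 2),
                                _mm_loadu_pd(arr2 + i + 2));
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(d0, d0));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(d1, d1));
    }

    /* Horizontal sum of the four partial sums. */
    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1));
    diff = lanes[0] + lanes[1];
#endif

    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
```

The two accumulators are kept separate until the end for the same reason as in
the compiler output: the two addpd chains do not depend on each other.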

Cache prefetch instructions can't help much here, because the memory access
pattern is a simple linear scan.

Bye,
bearophile
Jun 27 2010