digitalmars.D - Low hanging fruit for optimizing loops

Juan Manuel Cabo (118/118) Jun 07 2013 Given the recent surge in interest for performance, I dusted

Walter Bright (4/7) Jun 07 2013 It's great that you're doing this. You can track it down further by usin...

dennis luehring (4/11) Jun 07 2013 or even nicer :) :)
Juan Manuel Cabo (56/64) Jun 07 2013 Thanks!!

Walter Bright (2/4) Jun 08 2013 Thanks, this is great information.

dennis luehring (2/6) Jun 08 2013 sadly not for x64 :(

Walter Bright (2/10) Jun 08 2013 I don't understand your comment - the code example was for 64 bit code.

dennis luehring (3/14) Jun 08 2013 ida freeware is x86 only - so my "even nicer disassembly" just don't

Adam D. Ruppe (4/5) Jun 08 2013 You know what would rok? If it could output code that you could

Andrei Alexandrescu (3/8) Jun 08 2013 Wow!

Marco Leise (5/5) Jun 07 2013 GCC even cosiders cache line length and CPU cache size.
Walter Bright (2/5) Jun 08 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10301

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:

Given the recent surge in interest for performance, I dusted
off a small test that I made long ago and determined myself
to find the cause of the performance difference.

I tested the linear version of fibonacci both in DMD and
in C with GCC. Without optimization switches, I'm happy
to see that the D version is faster. But when using the
switches, the C version takes 30% less time.

I'm including the dissasembly here. The asm functions
are very very close to each other. Both loops are only
6 instructions long.

So I think it is a low hanging fruit because IMO the
speed difference is either because the GCC loop jump is
64bit aligned, or because of the order of the instructions
(there is an extra instruction between the loop increment
and the loop comparison, giving an opportunity for
parallelization).

OS:
    Kubuntu Linux 12.04  64bit

CPU:
    2.1GHz - AMD Athlon(tm) 64 X2 Dual Core Processor 4000+

Switches:
    dmd -O -inline -noboundscheck -release dtest.d
    gcc -O3 -o ctest ctest.c

Times:
    D:   1 sec, 430 ms, 207 μs, and 4 hnsecs
    C:   940 ms

D version:
---------

     import std.stdio;
     import std.datetime;

     int fibw(int n) { //Linear Fibonacci
         int a = 1;
         int b = 1;
         for (int i=2; i <= n; ++i) {
             int sum = a + b;
             a = b;
             b = sum;
         }
         return b;
     }

     void main() {
         auto start = Clock.currTime();
         int r = fibw(1000_000_000);
         auto elapsed = Clock.currTime() - start;
         writeln(r);
         writeln(elapsed);
     }


C Version:
---------

#include <stdio.h>
#include <time.h>

int fibw(int n) { //Linear Fibonacci
     int a = 1;
     int b = 1;
     int i;
     for (i=2; i <= n; ++i) {
         int sum = a + b;
         a = b;
         b = sum;
     }
     return b;
}

int main() {
     clock_t start = clock();
     int r = fibw(1000*1000*1000);
     clock_t elapsed = clock() - start;
     printf("%d\n", r);
     printf("%d ms\n", (int)(elapsed * 1000 / CLOCKS_PER_SEC));
     return 0;
}



D Version DISASM:
----------------
00000000004681d0 <_D5dtest4fibwFiZi>:
   4681d0:   55                 push   rbp
   4681d1:   48 8b ec           mov    rbp,rsp
   4681d4:   48 89 f9           mov    rcx,rdi
   4681d7:   be 01 00 00 00     mov    esi,0x1
   4681dc:   b8 01 00 00 00     mov    eax,0x1
   4681e1:   ba 02 00 00 00     mov    edx,0x2
   4681e6:   83 f9 02           cmp    ecx,0x2
   4681e9:   7c 0f              jl     4681fa 
<_D5dtest4fibwFiZi+0x2a>
                                ; LOOP JUMP --->
   4681eb:   8d 3c 06           lea    edi,[rsi+rax*1]
   4681ee:   48 89 c6           mov    rsi,rax
   4681f1:   48 89 f8           mov    rax,rdi
   4681f4:   ff c2              inc    edx
   4681f6:   39 ca              cmp    edx,ecx
   4681f8:   7e f1              jle    4681eb 
<_D5dtest4fibwFiZi+0x1b>
                                ; LOOP END
   4681fa:   5d                 pop    rbp
   4681fb:   c3                 ret


C Version DISASM:
----------------
0000000000400860 <fibw>:
   400860:   83 ff 01          cmp    edi,0x1
   400863:   b8 01 00 00 00    mov    eax,0x1
   400868:   7e 1c             jle    400886 <fibw+0x26>
   40086a:   ba 02 00 00 00    mov    edx,0x2
   40086f:   b9 01 00 00 00    mov    ecx,0x1
                               ; NOTICE THE nop (64bit 
alignment??):
   400874:   0f 1f 40 00       nop    DWORD PTR [rax+0x0]
                               ; LOOP JUMP -->
   400878:   8d 34 01          lea    esi,[rcx+rax*1]
   40087b:   83 c2 01          add    edx,0x1
   40087e:   89 c1             mov    ecx,eax    ; REORDERED cmp
   400880:   39 d7             cmp    edi,edx
   400882:   89 f0             mov    eax,esi
   400884:   7d f2             jge    400878 <fibw+0x18>
                               ; LOOP END
   400886:   f3 c3             repz ret
   400888:   90                nop
   400889:   90                nop



both files were dissasembled with:

        objdump -M intel -d

--jm

Jun 07 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
 Given the recent surge in interest for performance, I dusted
 off a small test that I made long ago and determined myself
 to find the cause of the performance difference.

It's great that you're doing this. You can track it down further by using
inline 
assembler and trying different instruction sequences.

Also, obj2asm gives nicer disassembly :-)

Jun 07 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 08.06.2013 07:11, schrieb Walter Bright:
 On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
 Given the recent surge in interest for performance, I dusted
 off a small test that I made long ago and determined myself
 to find the cause of the performance difference.

 It's great that you're doing this. You can track it down further by using
inline
 assembler and trying different instruction sequences.

 Also, obj2asm gives nicer disassembly :-)

or even nicer :) :)

ida pro freeware

http://out7.hex-rays.com/files/idafree50.exe

Jun 07 2013

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:

On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:
 On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
 Given the recent surge in interest for performance, I dusted
 off a small test that I made long ago and determined myself
 to find the cause of the performance difference.

 It's great that you're doing this. You can track it down 
 further by using inline assembler and trying different 
 instruction sequences.

 Also, obj2asm gives nicer disassembly :-)


Thanks!!

I now used inline assembler, and can confidently say
that the difference is because of the alignment.
    Changing the order of the cmp relative to the
increment didn't do anything.

Adding the right amount of 'nop' makes it run in

       957 ms, 921 μs, and 4 hnsecs

But if I overshoot it, or miss one, it goes back to

       1 sec, 438 ms, and 544 μs

Also, I couldn't use this instruction in D's asm{}

         0f 1f 40 00       nop    DWORD PTR [rax+0x0]

and obj2asm doesn't dissasemble it (it just puts "0f1f"
and gives incorrent asm for the next few instructions).

I'm now not entirely sure that aligning loop jumps would be
worthwhile though. They would have to be "leaf" loops
because any call made inside the loop would overshadow
the benefits (I was looping millons of times in my test).

Anyway, here is the new source:

     import std.stdio;
     import std.datetime;

     int fiba(int n) {
         asm {
             naked;
             push   RBP;
             mov    RBP,RSP;
             mov    RCX,RDI;
             mov    ESI,0x1;
             mov    EAX,0x1;
             mov    EDX,0x2;
             cmp    ECX,0x2;
             jl     EXIT_LOOP;
             nop;
             nop; nop; nop; nop;
             nop; nop; nop; nop;
             nop; nop; nop; nop;
         LOOP_START:
             lea    EDI,[RSI+RAX*1];
             mov    RSI,RAX;
             mov    RAX,RDI;
             inc    EDX;
             cmp    EDX,ECX;
             jle    LOOP_START;
         EXIT_LOOP:
             pop    RBP;
             ret;
         }
     }

     void main() {
         auto start = Clock.currTime();
         int r = fiba(1000_000_000);
         auto elapsed = Clock.currTime() - start;
         writeln(r);
         writeln(elapsed);
     }


--jm

Jun 07 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:
 I now used inline assembler, and can confidently say
 that the difference is because of the alignment.

Thanks, this is great information.

Jun 08 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 08.06.2013 09:28, schrieb Walter Bright:
 On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:
 I now used inline assembler, and can confidently say
 that the difference is because of the alignment.

 Thanks, this is great information.

sadly not for x64 :(

Jun 08 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 6/8/2013 12:37 AM, dennis luehring wrote:
 Am 08.06.2013 09:28, schrieb Walter Bright:
 On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:
 I now used inline assembler, and can confidently say
 that the difference is because of the alignment.

 Thanks, this is great information.

 sadly not for x64 :(

I don't understand your comment - the code example was for 64 bit code.

Jun 08 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 08.06.2013 19:26, schrieb Walter Bright:
 On 6/8/2013 12:37 AM, dennis luehring wrote:
 Am 08.06.2013 09:28, schrieb Walter Bright:
 On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:
 I now used inline assembler, and can confidently say
 that the difference is because of the alignment.

 Thanks, this is great information.

 sadly not for x64 :(

 I don't understand your comment - the code example was for 64 bit code.

ida freeware is x86 only - so my "even nicer disassembly" just don't 
work in this case

Jun 08 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:
 Also, obj2asm gives nicer disassembly :-)

You know what would rok? If it could output code that you could 
copy/paste directly into D's iasm without having to go back and 
add semicolons, change 0eh to 0x0e, etc.

Jun 08 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/8/13 1:53 PM, Adam D. Ruppe wrote:
 On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:
 Also, obj2asm gives nicer disassembly :-)

 You know what would rok? If it could output code that you could
 copy/paste directly into D's iasm without having to go back and add
 semicolons, change 0eh to 0x0e, etc.

Wow!

Andrei

Jun 08 2013

Marco Leise <Marco.Leise gmx.de> writes:

GCC even cosiders cache line length and CPU cache size.
This alignment improvement should consistently work on
x86/amd64.

-- 
Marco

Jun 07 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
 So I think it is a low hanging fruit because IMO the
 speed difference is either because the GCC loop jump is
 64bit aligned,

http://d.puremagic.com/issues/show_bug.cgi?id=10301

Jun 08 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Low hanging fruit for optimizing loops