
digitalmars.D - Performance updates

bearophile <bearophileHUGS lycos.com> writes:
In this post I show a few things I have found/collected in the last few weeks
related to the performance of code compiled with DMD.

I have added two of them to the list of slow things (as tiny benchmarks) I keep:
http://www.fantascienza.net/leonardo/js/slow_d.zip

------------------------

This is D code that just performs many integer divisions (by 7, known at
compile time):

import std.stdio: printf;

int divideBySeven(int x) {
    return x / 7;
}

void main() {
    int i = int.max;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
}


The same code in C:

#include <limits.h>
#include <stdio.h>

int divideBySeven(int x) {
    return x / 7;
}

int main() {
    int i = INT_MAX;
    int r;
    while (i--)
        r = divideBySeven(i);

    printf("%d\n", r);
    return 0;
}


I have compiled the C code with MinGW (based on GCC 4.2.1) with -O3 -s, and the D
code with DMD 1.035 with -O -release -inline; the CPU is a Core2 at 2 GHz.

Timing results:
intdiv:
  C:  5.04 s
  D: 39.32 s


The ASM generated by DMD for the function divideBySeven():

mov	ECX,7
cdq
idiv	ECX


The ASM generated by GCC for the function divideBySeven():

pushl	%ebp
movl	%esp, %ebp
movl	8(%ebp), %ecx
pushl	%ebx
movl	$-1840700269, %ebx
movl	%ebx, %eax
popl	%ebx
imull	%ecx
popl	%ebp
leal	(%edx,%ecx), %eax
sarl	$2, %eax
sarl	$31, %ecx
subl	%ecx, %eax

--------------------------

This is a scanning benchmark, in C:

#include <stdio.h>

int main() {
    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
    int i = 300000000;
    while (i--) {
        // 0.63 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}


Equivalent D code:

import std.stdio: printf;

int main() {
    int counter0, counter1, counter2, counter3;

    int i = 300_000_000;
    while (i--) { // 1.23 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}


Timings:
Scan (i = 300_000_000):
  C: 0.63 s
  D: 1.23 s

I can offer Asm code on request.

------------------------

The D1 docs strongly suggest using foreach whenever possible, avoiding the less
handy for(), so for almost a year I have assumed foreach is as fast as for();
but these two versions show a difference:

20.66 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1

My classless version, 25.91 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2

That second version uses the following code in the most critical spot of the
program:

void advance(TyPlanets bodies, double dt) {
    foreach(idx, ref b; bodies) {
        double bm = b.mass;
        foreach(ref b2; bodies[idx + 1 .. length]) {

Removing the foreach loops and replacing them with for loops like the following
gives a significant performance increase: the code runs in about 19.5 seconds
(estimated timing on the Shootout computer; I can give you exact timings on my
PC if you want):

static void advance(TyPlanets bodies, double dt) {
    for (int idx; idx < bodies.length; idx++) {
        auto b = &(bodies[idx]);
        double bm = b.mass;
        for (int j = idx + 1; j < bodies.length; j++) {
            auto b2 = &(bodies[j]);
            
(Note that I haven't just replaced foreach with for; I have also removed the
slicing bodies[idx+1..$], which slows down the code a little too. But I have
seen that even with just the slice removed, the foreach version is still
slower.)

------------------------

Looking at the assembly produced by the Intel compiler from C code that
contains many floating point operations, I can see how short and slick it is. I
have also read a few articles that show various ways to manage the floating
point stack as efficiently as possible. Some of those methods are slow or quite
slow at compile time, but they produce more efficient code.
Using those refined methods for all the FP operations of a large program may
lead to overly long compilation times. So profile-guided optimization may be
used to find the hot parts of the code, optimize the FP operations in them
especially well, and compile all the other FP parts with a faster method.

GCC also has the "hot" function attribute, to let programmers manually mark
which functions are most critical:
http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html

In D, in theory, something like this could be used:
optimize { ... }
(Like static if { ... }, it doesn't create a new scope.)

It would tell the compiler that a piece of code with many FP operations is
speed critical, so the compiler could compile it with one of the slower but
better floating point stack allocation algorithms, like the ones I have read
about.

Bye,
bearophile
Nov 13 2008
"Bill Baxter" <wbaxter gmail.com> writes:
On Thu, Nov 13, 2008 at 8:19 PM, bearophile <bearophileHUGS lycos.com> wrote:
 In this post I show few things I have found/collected in the last weeks
related to the performance of the code compiled with DMD.

[..]
*subliminal chant: L D C L D C L D C .....* --bb
Nov 13 2008
Moritz Warning <moritzwarning web.de> writes:
On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:

 In this post I show few things I have found/collected in the last weeks
 related to the performance of the code compiled with DMD.
 
[..]
 
 The D1 docs strongly suggest to use foreach every time it's possible,
 avoiding to use the less handy for(), so for almost a year I have
 assumed foreach is as fast as the for() but this two versions shows
 differences:
 
I have noticed the same problem with foreach.
A for loop is often faster. :/
I haven't tried ldc yet..
Nov 13 2008
bearophile <bearophileHUGS lycos.com> writes:
Moritz Warning:
 I have noticed the same problem with foreach.
 A for loop is often faster. :/
You generally notice a difference only in the innermost and most critical loops, so I think it's okay to use foreach() in 80-90% of the code of a program.
 I haven't tried ldc yet..
I haven't succeeded in compiling code with LDC yet. But I have tried the latest LLVM with the C code of the nbody benchmark of the Shootout, and GCC (MinGW) 4.2.1 produces code about twice as fast on my Core2. That's even worse than the D code compiled with DMD.

Bye,
bearophile
Nov 13 2008
"Bill Baxter" <wbaxter gmail.com> writes:
On Fri, Nov 14, 2008 at 7:59 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Moritz Warning:
 I have noticed the same problem with foreach.
 A for loop is often faster. :/
You generally notice a difference only in the most inner and critical loops, so I think it's okay to use foreach() in 80-90% of the code of a program.
 I haven't tried ldc yet..
I haven't succeed in compiling code with ldc yet. But I have tried the last LLVM with the C code of the nbody benchmark of the Shootout, and GCC (MinGW) 4.2.1 produces code about twice faster on my Core2. That's even quite worse than the D code compiled with DMD.
That's a bummer to hear. But the good thing about LLVM is that it's on an upward trend. With DMD, only Walter has the ability to work on backend codegen improvements, and we all know how much time he has available for such things. Eventually I believe LLVM will leave DMD in the dust, and probably GCC too.

--bb
Nov 13 2008
"Bill Baxter" <wbaxter gmail.com> writes:
On Fri, Nov 14, 2008 at 7:49 AM, Moritz Warning <moritzwarning web.de> wrote:
 On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:

 In this post I show few things I have found/collected in the last weeks
 related to the performance of the code compiled with DMD.
[..]
 The D1 docs strongly suggest to use foreach every time it's possible,
 avoiding to use the less handy for(), so for almost a year I have
 assumed foreach is as fast as the for() but this two versions shows
 differences:
I have noticed the same problem with foreach. A for loop is often faster. :/
One thing to be clear about when talking about the speed of foreach is whether you're using it on a built-in array or on a user type with an opApply. The former is expected to be as fast as for(); the latter is definitely not, because it has to call a delegate each time around the loop.

--bb
Nov 13 2008
Moritz Warning <moritzwarning web.de> writes:
On Fri, 14 Nov 2008 08:48:53 +0900, Bill Baxter wrote:

 On Fri, Nov 14, 2008 at 7:49 AM, Moritz Warning <moritzwarning web.de>
 wrote:
 On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:

 In this post I show few things I have found/collected in the last
 weeks related to the performance of the code compiled with DMD.
[..]
 The D1 docs strongly suggest to use foreach every time it's possible,
 avoiding to use the less handy for(), so for almost a year I have
 assumed foreach is as fast as the for() but this two versions shows
 differences:
I have noticed the same problem with foreach. A for loop is often faster. :/
One thing to be clear about when talking about the speed of foreach is whether you're using it on a built-in array or on a user type with an opApply. The former is expected to be as fast as for(), the latter is definitely not because it has to call a delegate each time around the loop.
Nobody would use "for"-loops with opApply in real world code. :P
Nov 13 2008
"Bill Baxter" <wbaxter gmail.com> writes:
On Fri, Nov 14, 2008 at 9:06 AM, Moritz Warning <moritzwarning web.de> wrote:
 On Fri, 14 Nov 2008 08:48:53 +0900, Bill Baxter wrote:

 On Fri, Nov 14, 2008 at 7:49 AM, Moritz Warning <moritzwarning web.de>
 wrote:
 On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:

 In this post I show few things I have found/collected in the last
 weeks related to the performance of the code compiled with DMD.
[..]
 The D1 docs strongly suggest to use foreach every time it's possible,
 avoiding to use the less handy for(), so for almost a year I have
 assumed foreach is as fast as the for() but this two versions shows
 differences:
I have noticed the same problem with foreach. A for loop is often faster. :/
One thing to be clear about when talking about the speed of foreach is whether you're using it on a built-in array or on a user type with an opApply. The former is expected to be as fast as for(), the latter is definitely not because it has to call a delegate each time around the loop.
Nobody would use "for"-loops with opApply in real world code. :P
Er, right, which is why using a for() loop is faster than foreach() in that case. Sorry if I wasn't clear about that.

--bb
Nov 13 2008