
digitalmars.D.learn - Loop optimization

reply kai <kai nospam.zzz> writes:
Hello,

I was evaluating using D for some numerical stuff. However I was surprised to
find that looping & array indexing was not very speedy compared to
alternatives (gcc et al). I was using the DMD2 compiler on mac and windows,
with -O -release. Here is a boiled down test case:

	void main (string[] args)
	{
		double [] foo = new double [cast(int)1e6];
		for (int i=0;i<1e3;i++)
		{
			for (int j=0;j<1e6-1;j++)
			{
				foo[j]=foo[j]+foo[j+1];
			}
		}
	}

Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for
your help.
May 13 2010
next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 14 May 2010 02:38:40 +0000, kai wrote:

 Hello,
 
 I was evaluating using D for some numerical stuff. However I was
 surprised to find that looping & array indexing was not very speedy
 compared to alternatives (gcc et al). I was using the DMD2 compiler on
 mac and windows, with -O -release. Here is a boiled down test case:
 
 	void main (string[] args)
 	{
 		double [] foo = new double [cast(int)1e6]; for (int 
i=0;i<1e3;i++)
 		{
 			for (int j=0;j<1e6-1;j++)
 			{
 				foo[j]=foo[j]+foo[j+1];
 			}
 		}
 	}
 
 Any ideas? Am I somehow not hitting a vital compiler optimization?
 Thanks for your help.
Two suggestions: 1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch. 2. Can you use vector operations? If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array. But if you are doing operations on separate arrays, then array operations will be *much* faster. http://www.digitalmars.com/d/2.0/arrays.html#array-operations As an example, compare the run time of the following code with the example you gave: void main () { double[] foo = new double [cast(int)1e6]; double[] slice1 = foo[0 .. 999_998]; double[] slice2 = foo[1 .. 999_999]; for (int i=0;i<1e3;i++) { // BAD, BAD, BAD. DON'T DO THIS even though // it's pretty awesome: slice1[] += slice2[]; } } Note that this is very bad code, since slice1 and slice2 are overlapping arrays, and there is no guarantee as to which order the array elements are computed -- it may even occur in parallel. It was just an example of the speed gains you may expect from designing your code with array operations in mind. -Lars
May 13 2010
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 14 May 2010 02:38:40 +0000, kai wrote:

 Hello,
 
 I was evaluating using D for some numerical stuff. However I was
 surprised to find that looping & array indexing was not very speedy
 compared to alternatives (gcc et al). I was using the DMD2 compiler on
 mac and windows, with -O -release. Here is a boiled down test case:
 
 	void main (string[] args)
 	{
 		double [] foo = new double [cast(int)1e6];
 		for (int i=0;i<1e3;i++)
 		{
 			for (int j=0;j<1e6-1;j++)
 			{
 				foo[j]=foo[j]+foo[j+1];
 			}
 		}
 	}
 
 Any ideas? Am I somehow not hitting a vital compiler optimization?
 Thanks for your help.
Two suggestions:

1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks
that you do not try to read/write beyond the end of an array, but you can
turn those checks off with said switch.

2. Can you use vector operations?  If the example you gave is
representative of your specific problem, then you can't, because you are
adding overlapping parts of the array.  But if you are doing operations
on separate arrays, then array operations will be *much* faster.

http://www.digitalmars.com/d/2.0/arrays.html#array-operations

As an example, compare the run time of the following code with the
example you gave:

    void main ()
    {
        double[] foo = new double [cast(int)1e6];
        double[] slice1 = foo[0 .. 999_998];
        double[] slice2 = foo[1 .. 999_999];

        for (int i=0;i<1e3;i++)
        {
            // BAD, BAD, BAD.  DON'T DO THIS even though
            // it's pretty awesome:
            slice1[] += slice2[];
        }
    }

Note that this is very bad code, since slice1 and slice2 are overlapping
arrays, and there is no guarantee as to which order the array elements
are computed -- it may even occur in parallel.  It was just an example of
the speed gains you may expect from designing your code with array
operations in mind.

-Lars
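For comparison, a minimal sketch of the well-defined case with *separate*
(non-overlapping) arrays -- illustration only, assuming a D2 compiler, with
made-up array names:

    // Element-wise array operations on distinct arrays: no overlap,
    // so the result is well defined and the loop can be vectorized.
    void main ()
    {
        auto a = new double[1_000_000];
        auto b = new double[1_000_000];
        auto c = new double[1_000_000];
        b[] = 1.0;          // initialize, so we don't operate on NaNs
        c[] = 2.0;

        a[] = b[] + c[];    // element-wise add into a
        a[] += b[];         // element-wise in-place add
    }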
May 13 2010
next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 14 May 2010 06:31:29 +0000, Lars T. Kyllingstad wrote:
     void main ()
     {
         double[] foo = new double [cast(int)1e6];
         double[] slice1 = foo[0 .. 999_998];
         double[] slice2 = foo[1 .. 999_999];
 
         for (int i=0;i<1e3;i++)
         {
             // BAD, BAD, BAD.  DON'T DO THIS even though
             // it's pretty awesome:
             slice1[] += slice2[];
         }
     }
Hmm.. something very strange is going on with the line breaking here.

-Lars
May 13 2010
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Fri, 14 May 2010 02:38:40 +0000, kai wrote:
 I was using the DMD2 compiler on
 mac and windows, with -O -release.
1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.
-release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release).

-Steve
May 14 2010
parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 14 May 2010 07:32:54 -0400, Steven Schveighoffer wrote:

 On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Fri, 14 May 2010 02:38:40 +0000, kai wrote:
 I was using the DMD2 compiler on
 mac and windows, with -O -release.
1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.
-release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release). -Steve
You are right, just checked it now. But it's strange, I thought the whole point of the -noboundscheck switch was that it would be independent of -release. But perhaps I remember wrongly (or perhaps Walter just hasn't gotten around to it yet).

Anyway, sorry for the misinformation.

-Lars
May 14 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
kai:

 I was evaluating using D for some numerical stuff.
For that evaluation you probably have to use the LDC compiler, which is able to optimize better.
 	void main (string[] args)
 	{
 		double [] foo = new double [cast(int)1e6];
 		for (int i=0;i<1e3;i++)
 		{
 			for (int j=0;j<1e6-1;j++)
 			{
 				foo[j]=foo[j]+foo[j+1];
 			}
 		}
 	}
Using floating point for indexes and lengths is not a good practice. In D large numbers are written like 1_000_000. Use -release too.
 Any ideas? Am I somehow not hitting a vital compiler optimization?
DMD compiler doesn't perform many optimizations, especially on floating point computations. But the bigger problem in your code is that you are performing operations on NaNs (that's the default initialization of FP values in D), and operations on NaNs are usually quite slower.

Your code in C:

#include "stdio.h"
#include "stdlib.h"
#define N 1000000

int main() {
    double *foo = calloc(N, sizeof(double)); // malloc suffices here
    int i, j;

    for (j = 0; j < N; j++)
        foo[j] = 1.0;

    for (i = 0; i < 1000; i++)
        for (j = 0; j < N-1; j++)
            foo[j] = foo[j] + foo[j + 1];

    printf("%f", foo[N-1]);
    return 0;
}

/*
gcc -O3 -s -Wall test.c -o test
Timings, outer loop=1_000 times: 7.72 s

------------------

gcc -Wall -O3 -fomit-frame-pointer -msse3 -march=native test.c -o test
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.69 s

Just the inner loop:

.L7:
    fldl    8(%edx)
    fadd    %st, %st(1)
    fxch    %st(1)
    fstpl   (%edx)
    addl    $8, %edx
    cmpl    %ecx, %edx
    jne     .L7
*/

--------------------

Your code in D1:

version (Tango)
    import tango.stdc.stdio: printf;
else
    import std.c.stdio: printf;

void main() {
    const int N = 1_000_000;
    double[] foo = new double[N];
    foo[] = 1.0;

    for (int i = 0; i < 1_000; i++)
        for (int j = 0; j < N-1; j++)
            foo[j] = foo[j] + foo[j + 1];

    printf("%f", foo[N-1]);
}

/*
dmd -O -release -inline test.d
(Not running on a VirtualBox)
Timings, outer loop=1_000 times: 9.35 s

Just the inner loop:

L34:
    fld     qword ptr 8[EDX*8][ECX]
    fadd    qword ptr [EDX*8][ECX]
    fstp    qword ptr [EDX*8][ECX]
    inc     EDX
    cmp     EDX,0F423Fh
    jb      L34

-----------------------

ldc -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.87 s

Just the inner loop:

.LBB1_2:
    movsd   (%eax,%ecx,8), %xmm0
    addsd   8(%eax,%ecx,8), %xmm0
    movsd   %xmm0, (%eax,%ecx,8)
    incl    %ecx
    cmpl    $999999, %ecx
    jne     .LBB1_2

-----------------------

ldc -unroll-allow-partial -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.75 s

Just the inner loop:

.LBB1_2:
    movsd   (%eax,%ecx,8), %xmm0
    addsd   8(%eax,%ecx,8), %xmm0
    movsd   %xmm0, (%eax,%ecx,8)
    movsd   8(%eax,%ecx,8), %xmm0
    addsd   16(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 8(%eax,%ecx,8)
    movsd   16(%eax,%ecx,8), %xmm0
    addsd   24(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 16(%eax,%ecx,8)
    movsd   24(%eax,%ecx,8), %xmm0
    addsd   32(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 24(%eax,%ecx,8)
    movsd   32(%eax,%ecx,8), %xmm0
    addsd   40(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 32(%eax,%ecx,8)
    movsd   40(%eax,%ecx,8), %xmm0
    addsd   48(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 40(%eax,%ecx,8)
    movsd   48(%eax,%ecx,8), %xmm0
    addsd   56(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 48(%eax,%ecx,8)
    movsd   56(%eax,%ecx,8), %xmm0
    addsd   64(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 56(%eax,%ecx,8)
    movsd   64(%eax,%ecx,8), %xmm0
    addsd   72(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 64(%eax,%ecx,8)
    addl    $9, %ecx
    cmpl    $999999, %ecx
    jne     .LBB1_2
*/

As you see, the code generated by ldc is about as good as the one generated by gcc. There are of course other ways to optimize this code...

Bye,
bearophile
May 14 2010
next sibling parent reply strtr <strtr spam.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 But the bigger problem in your code is that you are performing operations on
 NaNs (that's the default initialization of FP values in D), and operations on
 NaNs are usually quite slower.

I didn't know that. Is it the same for inf? I used it as a null for structs.
May 14 2010
parent reply Don <nospam nospam.com> writes:
strtr wrote:
 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 But the bigger problem in your code is that you are performing operations on
 NaNs (that's the default initialization of FP values in D), and operations on
 NaNs are usually quite slower.

 I didn't know that. Is it the same for inf?

Yes, nan and inf are usually the same speed. However, it's very CPU dependent, and even *within* a CPU! On Pentium 4, for example, for x87, nan is 200 times slower than a normal value (!), but on Pentium 4 SSE there's no speed difference at all between nan and normal. I think there's no speed difference on AMD, but I'm not sure. There's almost no documentation on it at all.
 I used it as a null for structs.
 
May 15 2010
parent reply strtr <strtr spam.com> writes:
== Quote from Don (nospam nospam.com)'s article
 strtr wrote:
 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 But the bigger problem in your code is that you are performing operations on
 NaNs (that's the default initialization of FP values in D), and operations on
 NaNs are usually quite slower.

 I didn't know that. Is it the same for inf?

 Yes, nan and inf are usually the same speed. However, it's very CPU dependent,
 and even *within* a CPU! On Pentium 4, for example, for x87, nan is 200 times
 slower than a normal value (!), but on Pentium 4 SSE there's no speed
 difference at all between nan and normal. I think there's no speed difference
 on AMD, but I'm not sure. There's almost no documentation on it at all.

Thanks! NaNs being slower I can understand, but inf might well be a value you want to use.
 I used it as a null for structs.
May 15 2010
parent Don <nospam nospam.com> writes:
strtr wrote:
 == Quote from Don (nospam nospam.com)'s article
 strtr wrote:
 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 But the bigger problem in your code is that you are performing operations on
 NaNs (that's the default initialization of FP values in D), and operations on
 NaNs are usually quite slower.

 I didn't know that. Is it the same for inf?

 Yes, nan and inf are usually the same speed. However, it's very CPU dependent,
 and even *within* a CPU! On Pentium 4, for example, for x87, nan is 200 times
 slower than a normal value (!), but on Pentium 4 SSE there's no speed
 difference at all between nan and normal. I think there's no speed difference
 on AMD, but I'm not sure. There's almost no documentation on it at all.

 Thanks! NaNs being slower I can understand, but inf might well be a value you
 want to use.

Yes. What's happened is that none of the popular programming languages support special IEEE values, so they're given very low priority by chip designers. In the Pentium 4 case, they're implemented entirely in microcode. A 200X slowdown is really significant.

However, the bit pattern for NaN is 0xFFFF..., which is the same as a negative integer, so an uninitialized floating-point variable has a quite high probability of being a NaN. I'm certain there's a lot of C programs out there which are inadvertently using NaNs.
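A minimal D2 sketch of that last point (std.stdio and std.math assumed; the
variable names are made up):

    import std.stdio, std.math;

    void main()
    {
        double d;                        // D default-initializes this to NaN
        writeln(isNaN(d));               // prints: true

        // An "all ones" bit pattern (a negative integer, reinterpreted) also
        // reads back as a NaN: sign 1, exponent all ones, nonzero mantissa.
        ulong bits = 0xFFFF_FFFF_FFFF_FFFFUL;
        double reinterpreted = *(cast(double*) &bits);
        writeln(isNaN(reinterpreted));   // prints: true
    }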
May 15 2010
prev sibling next sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 kai:
 Any ideas? Am I somehow not hitting a vital compiler optimization?
DMD compiler doesn't perform many optimizations, especially on floating point computations.
More precisely: In terms of optimizations performed, DMD isn't too far behind gcc. But it performs almost no optimization on floating point. Also, the inliner doesn't yet support the newer D features (this won't be hard to fix) and the scheduler is based on Pentium1.
May 15 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 bearophile wrote:
 kai:
 Any ideas? Am I somehow not hitting a vital compiler optimization?
DMD compiler doesn't perform many optimizations, especially on floating point computations.
More precisely: In terms of optimizations performed, DMD isn't too far behind gcc. But it performs almost no optimization on floating point. Also, the inliner doesn't yet support the newer D features (this won't be hard to fix) and the scheduler is based on Pentium1.
Have to be careful when talking about floating point optimizations. For example,

    x/c  =>  x * 1/c

is not done because of roundoff error. Also,

    0 * x  =>  0

is also not done because it is not a correct replacement if x is a NaN.
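A minimal D2 sketch of why the second rewrite would be wrong (std.stdio assumed):

    import std.stdio;

    void main()
    {
        double x = double.nan;
        writeln(0.0 * x);   // prints: nan -- folding 0*x to 0 would change the result
        double y = double.infinity;
        writeln(0.0 * y);   // prints: nan -- 0 * infinity is also NaN in IEEE 754
    }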
May 16 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 is not done because of roundoff error. Also,
     0 * x => 0
 is also not done because it is not a correct replacement if x is a NaN.
I have done a little experiment, compiling this D1 code with LDC:

import tango.stdc.stdio: printf;

void main(char[][] args) {
    double x = cast(double)args.length;
    double y = 0 * x;
    printf("%f\n", y);
}

I think the asm generated by ldc shows what you say:

ldc -O3 -release -inline -output-s test

_Dmain:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $32, %esp
    movsd   .LCPI1_0, %xmm0
    movd    8(%ebp), %xmm1
    orps    %xmm0, %xmm1
    subsd   %xmm0, %xmm1
    pxor    %xmm0, %xmm0
    mulsd   %xmm1, %xmm0
    movsd   %xmm0, 4(%esp)
    movl    $.str, (%esp)
    call    printf
    xorl    %eax, %eax
    movl    %ebp, %esp
    popl    %ebp
    ret     $8

So I have added an extra "unsafe floating point" optimization:

ldc -O3 -release -inline -enable-unsafe-fp-math -output-s test

_Dmain:
    subl    $12, %esp
    movl    $0, 8(%esp)
    movl    $0, 4(%esp)
    movl    $.str, (%esp)
    call    printf
    xorl    %eax, %eax
    addl    $12, %esp
    ret     $8

GCC has similar switches.

Bye,
bearophile
May 17 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 So I have added an extra "unsafe floating point" optimization:
 
 ldc -O3 -release -inline -enable-unsafe-fp-math -output-s test
In my view, such switches are bad news, because:

1. very few people understand the issues regarding wrong floating point optimizations

2. even those that do, are faced with a switch that doesn't really define what unsafe fp optimizations it is doing, so there's no way to tell how it affects their code

3. the behavior of such a switch may change over time, breaking one's carefully written code

4. most of those optimizations can be done by hand if you want to, meaning that then their behavior will be reliable, portable and correct for your application

5. in my experience with such switches, almost nobody uses them, and the few that do use them wrongly

6. they add clutter, complexity, confusion and errors to the documentation

7. they use it, their code doesn't work correctly, they blame the compiler/language and waste the time of the tech support people
May 17 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

In my view, such switches are bad news, because:<
The Intel compiler, Microsoft compiler, GCC and LLVM have a similar switch (fp:fast in the Microsoft compiler, -ffast-math on GCC, etc). So you might send your list of comments to the devs of each of those four compilers.

I have used the "unsafe fp" switch in LDC to run my small raytracers faster, with good results. So I use it now and then where max precision is not important and small errors are not going to ruin the output.

I have asked the LLVM head developer to improve this optimization on LLVM, because in my opinion it's not aggressive enough, to put LLVM on par with GCC. So LDC too will probably get better on this in future.

This unsafe optimization is off by default, so if you don't like it you can avoid it. Its presence in LDC has caused zero problems to me so far (because when I need safer/more precise results I don't use it).

 4. most of those optimizations can be done by hand if you want to, meaning that
 then their behavior will be reliable, portable and correct for your application

This is true for any optimization.

Bye,
bearophile
May 17 2010
parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 
 In my view, such switches are bad news, because:<
The Intel compiler, Microsoft compiler, GCC and LLVM have a similar switch (fp:fast in the Microsoft compiler, -ffast-math on GCC, etc). So you might send your list of comments to the devs of each of those four compilers.
If I agreed with everything other vendors did with their compilers, I wouldn't have built my own <g>.
May 18 2010
prev sibling parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 bearophile wrote:
 kai:
 Any ideas? Am I somehow not hitting a vital compiler optimization?
DMD compiler doesn't perform many optimizations, especially on floating point computations.
More precisely: In terms of optimizations performed, DMD isn't too far behind gcc. But it performs almost no optimization on floating point. Also, the inliner doesn't yet support the newer D features (this won't be hard to fix) and the scheduler is based on Pentium1.
 Have to be careful when talking about floating point optimizations. For
 example,

     x/c  =>  x * 1/c

 is not done because of roundoff error. Also,

     0 * x  =>  0

 is also not done because it is not a correct replacement if x is a NaN.

The most glaring limitation of the FP optimiser is that it seems to never keep values in the FP stack. So that it will often do:

    FSTP x
    FLD x

instead of

    FST x

Fixing this would probably give a speedup of ~20% on almost all FP code, and would unlock the path to further optimisation.
May 17 2010
parent BCS <none anon.com> writes:
Hello Don,

 The most glaring limitation of the FP optimiser is that it seems to
 never keep values in the FP stack. So that it will often do:
 FSTP x
 FLD x
 instead of FST x
 Fixing this would probably give a speedup of ~20% on almost all FP
 code, and would unlock the path to further optimisation.
Does DMD have the groundwork for doing FP peephole optimizations? That sounds like an easy one.

-- 
... <IXOYE><
May 17 2010
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 DMD compiler doesn't perform many optimizations,
This is simply false. DMD does an excellent job with integer and pointer operations. It does a so-so job with floating point. There are probably over a thousand optimizations at all levels that dmd does with integer and pointer code. Compare the generated code with and without -O. Even without -O, dmd does a long list of optimizations (such as common subexpression elimination).
May 16 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 This is simply false. DMD does an excellent job with integer and pointer 
 operations. It does a so-so job with floating point.
 There are probably over a thousand optimizations at all levels that dmd does 
 with integer and pointer code.
You are of course right, I understand your feelings; I must be more precise in my posts. Surely dmd performs numerous optimizations. What I meant was a comparison with other compilers, particularly ldc, and even then generic words about a generic comparison aren't useful. So I am sorry.

Bye,
bearophile
May 16 2010
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On 5/16/2010 4:15 PM, Walter Bright wrote:
 bearophile wrote:
 DMD compiler doesn't perform many optimizations,
This is simply false. DMD does an excellent job with integer and pointer operations. It does a so-so job with floating point. There are probably over a thousand optimizations at all levels that dmd does with integer and pointer code. Compare the generated code with and without -O. Even without -O, dmd does a long list of optimizations (such as common subexpression elimination).
While it's false that DMD doesn't do many optimizations, it's true that it's behind more modern compiler optimizers. I've been working to fix some of the grossly bad holes in dmd's inliner, which is one area that's just obviously lacking (see bug 2008). But gcc and ldc (and likely msvc, though I lack any direct knowledge) are simply a decade or so ahead.

It's not a criticism of dmd or a suggestion that the priorities are in the wrong place, just a point of fact. They've got larger teams of people and are spending significant time on just improving and adding optimizations.

Later,
Brad
May 16 2010
prev sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
On 05/17/2010 01:15 AM, Walter Bright wrote:
 bearophile wrote:
 DMD compiler doesn't perform many optimizations,
This is simply false. DMD does an excellent job with integer and pointer operations. It does a so-so job with floating point.
Interesting to note, relative to my earlier experience with D vs. C++ speed:

http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.learn&artnum=19567

I'll have to try and put together a no-floating-point bit of code to make a comparison.

Best wishes,

-- Joe
May 19 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 13 May 2010 22:38:40 -0400, kai <kai nospam.zzz> wrote:

 Hello,

 I was evaluating using D for some numerical stuff. However I was  
 surprised to
 find that looping & array indexing was not very speedy compared to
 alternatives (gcc et al). I was using the DMD2 compiler on mac and  
 windows,
 with -O -release. Here is a boiled down test case:

 	void main (string[] args)
 	{
 		double [] foo = new double [cast(int)1e6];
 		for (int i=0;i<1e3;i++)
 		{
 			for (int j=0;j<1e6-1;j++)
 			{
 				foo[j]=foo[j]+foo[j+1];
 			}
 		}
 	}

 Any ideas? Am I somehow not hitting a vital compiler optimization?  
 Thanks for
 your help.
I figured it out.  In D, the default value for doubles is nan, so you are adding countless scores of nan's, which is costly for some reason (not a big floating point guy, so I'm not sure about this).  In C/C++, the default value for doubles is 0.

BTW, without any initialization of the array, what are you expecting the code to do?  In the C++ version, I suspect you are simply adding a bunch of 0s together.

Equivalent D code which first initializes the array to 0s:

void main (string[] args)
{
    double [] foo = new double [cast(int)1e6];
    foo[] = 0; // probably want to change this to something more meaningful
    for (int i=0;i<cast(int)1e3;i++)
    {
        for (int j=0;j<cast(int)1e6-1;j++)
        {
            foo[j]+=foo[j+1];
        }
    }
}

On my PC, it runs almost exactly at the same speed as the C++ version.

-Steve
May 14 2010
next sibling parent reply kai <kai nospam.zzz> writes:
Thanks for the help all!

 2. Can you use vector operations?  If the example you gave is
 representative of your specific problem, then you can't because you are
 adding overlapping parts of the array.  But if you are doing operations
 on separate arrays, then array operations will be *much* faster.
Unfortunately, I don't think I will be able to. The actual code is computing norms of a sequence of points and then updating their values as needed (MLE smoothing/prediction).
 For that evaluation you probably have to use the LDC compiler, that is
 able to optimize better.
I was scared off by the warning that D 2.0 support is experimental. I realize D 2 itself is still non-production, but for academic interests industrial-strength isn't all that important if it usually works :).
 Using floating point for indexes and lengths is not a good practice.
 In D large numbers are written like 1_000_000. Use -release too.
Good to know, thanks (that's actually a great feature for scientists!).
 DMD compiler doesn't perform many optimizations, especially on floating
 point computations. But the bigger problem in your code is that you are
 performing operations on NaNs (that's the default initalization of FP
 values in D), and operations on NaNs are usually quite slower.
 in D, the default value for doubles is nan, so you are adding countless
 scores of nan's which is costly for some reason (not a big floating point
 guy, so I'm not sure about this).
Ah ha, that was it -- serves me right for trying to boil down a test case and failing miserably. I'll head back to my code now and try to find the real problem :-) At some point I obviously removed the initialization of the data.
May 14 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
kai:

 I was scared off by the warning that D 2.0 support is experimental.
LDC is D1 still, mostly :-( And at the moment it uses LLVM 2.6. LLVM 2.7 contains a new optimization that can improve that code some more.
 Good to know, thanks (thats actually a great feature for scientists!).
In theory D is a reasonable fit for numerical computations too, but there is a lot of work to do still, and some parts of D's design will need to be improved to help numerical code performance.

From my extensive tests, if you use it correctly, D1 code compiled with LDC can be about as efficient as C code compiled with GCC, or sometimes a little more efficient.

-------------

Steven Schveighoffer:
 In C/C++, the default value for doubles is 0.
I think in C and C++ the default value for doubles is "uninitialized" (that is anything).

Bye,
bearophile
May 14 2010
next sibling parent reply "Jérôme M. Berger" <jeberger free.fr> writes:
bearophile wrote:
 kai:

 I was scared off by the warning that D 2.0 support is experimental.

 LDC is D1 still, mostly :-( And at the moment it uses LLVM 2.6. LLVM 2.7
 contains a new optimization that can improve that code some more.

 Good to know, thanks (thats actually a great feature for scientists!).

 In theory D is a bit fit for numerical computations too, but there is lot
 of work to do still. And some parts of D design will need to be improved
 to help numerical code performance.

 From my extensive tests, if you use it correctly, D1 code compiled with
 LDC can be about as efficient as C code compiled with GCC or sometimes a
 little more efficient.

 -------------

 Steven Schveighoffer:
 In C/C++, the default value for doubles is 0.

 I think in C and C++ the default value for doubles is "uninitialized"
 (that is anything).

	That depends. In C/C++, the default value for any global variable is to
have all bits set to 0, whatever that means for the actual data type. The
default value for local variables and malloc/new memory is "whatever was in
this place in memory before", which can be anything. The default value for
calloc is to have all bits set to 0, as for global variables.

	In the OP code, the malloc will probably return memory that has never
been used before, therefore probably initialized to 0 too (OS dependent).

		Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 14 2010
parent reply div0 <div0 users.sourceforge.net> writes:

Jérôme M. Berger wrote:
 	That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 
No it's not, it's always uninitialized.

Visual studio will initialise memory & a function's stack segment with 0xcd, but only in debug builds. In release mode you get what was already there. That used to be the case with gcc (which used 0xdeadbeef) as well, unless they've changed it.

-- 
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
May 15 2010
parent reply "Jérôme M. Berger" <jeberger free.fr> writes:
div0 wrote:
 Jérôme M. Berger wrote:
 	That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type.

 No it's not, it's always uninitialized.

	According to the C89 standard and onwards it *must* be initialized to 0.
If it isn't, then your implementation isn't standard compliant (needless
to say, gcc, Visual, llvm, icc and dmc are all standard compliant, so you
won't have any difficulty checking).

 Visual studio will initialise memory & a functions stack segment with
 0xcd, but only in debug builds. In release mode you get what was already
 there. That used to be the case with gcc (which used 0xdeadbeef) as well
 unless they've changed it.

	This does not concern global variables. Therefore the second part of my
message applies, the part you didn't quote:

 The default value for local variables and malloc/new memory is
 "whatever was in this place in memory before" which can be anything.
 The default value for calloc is to have all bits to 0 as for global
 variables.

	I should have added that some compilers / standard libraries allow you
to have a default initialization value for debugging purposes.

		Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
parent reply div0 <div0 users.sourceforge.net> writes:

Jérôme M. Berger wrote:
 div0 wrote:
 Jérôme M. Berger wrote:
 	That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 
No it's not, it's always uninitialized.
According to the C89 standard and onwards it *must* be initialized to 0. If it isn't then your implementation isn't standard compliant (needless to say, gcc, Visual, llvm, icc and dmc are all standard compliant, so you won't have any difficulty checking).
Ah, I only do C++, where the standard is to not initialise. I didn't know the two specs had diverged like that.

-- 
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
May 16 2010
next sibling parent "Jouko Koski" <joukokoskispam101 netti.fi> writes:
"div0" <div0 users.sourceforge.net> wrote:
 Jérôme M. Berger wrote:
 That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type.
Ah, I only do C++, where the standard is to not initialise.
No, in C++ all *global or static* variables are zero-initialized. By default, stack variables are default-initialized, which means that doubles on the stack can have any value (they are uninitialized).

The C function calloc is required to fill the newly allocated memory with a zero bit pattern; malloc is not required to initialize anything. Fresh heap areas given by malloc may have a zero bit pattern, but one should really make no assumptions about this.

-- Jouko
May 16 2010
prev sibling parent "Jérôme M. Berger" <jeberger free.fr> writes:
div0 wrote:
 Jérôme M. Berger wrote:
 div0 wrote:
 Jérôme M. Berger wrote:
 	That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type.

 No it's not, it's always uninitialized.

 According to the C89 standard and onwards it *must* be initialized
 to 0. If it isn't then your implementation isn't standard compliant
 (needless to say, gcc, Visual, llvm, icc and dmc are all standard
 compliant, so you won't have any difficulty checking).

 Ah, I only do C++, where the standard is to not initialise. I didn't
 know the two specs had diverged like that.

	The specs haven't diverged, and C++ has mostly the same behaviour as C
where global variables are concerned. The only difference is that if the
global variable is a class with a constructor, then that constructor gets
called after the memory is zeroed out.

		Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 May 2010 12:40:52 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Steven Schveighoffer:
 In C/C++, the default value for doubles is 0.
 I think in C and C++ the default value for doubles is "uninitialized"
 (that is anything).

You are probably right.  All I did to figure this out was print out the first element of the array in my C++ version of kai's code.  So it may be arbitrarily set to 0.

-Steve
May 17 2010
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
Steven Schveighoffer wrote:

     double [] foo = new double [cast(int)1e6];
     foo[] = 0;
I've discovered that this is the equivalent of the last line above:

    foo = 0;

I don't see it in the spec. Is that an old or an unintended feature?

Ali
May 15 2010
next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Ali Çehreli <acehreli yahoo.com> wrote:

 Steven Schveighoffer wrote:

  >     double [] foo = new double [cast(int)1e6];
  >     foo[] = 0;

 I've discovered that this is the equivalent of the last line above:

    foo = 0;

 I don't see it in the spec. Is that an old or an unintended feature?

Looks unintended to me.  In fact (though that might be the C programmer in me doing the thinking), it looks to me like foo = null;. It might be related to the discussion in digitalmars.D "Is [] mandatory for array operations?".

-- 
Simen
May 15 2010
parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
Simen kjaeraas wrote:
 Ali Çehreli <acehreli yahoo.com> wrote:

 Steven Schveighoffer wrote:

  >     double [] foo = new double [cast(int)1e6];
  >     foo[] = 0;

 I've discovered that this is the equivalent of the last line above:

    foo = 0;

 I don't see it in the spec. Is that an old or an unintended feature?
I have to make a correction: It works with fixed-sized arrays. It does not work with the dynamic array initialization above.
 Looks unintended to me.  In fact (though that might be the
 C programmer in me doing the thinking), it looks to me like
 foo = null;. It might be related to the discussion in
 digitalmars.D "Is [] mandatory for array operations?".
Thanks, Ali
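A minimal sketch of the two spellings being discussed (illustration only,
assuming a D2 compiler; the variable names are made up):

    void main()
    {
        double[3] fixedArr;
        fixedArr[] = 0;   // explicit slice syntax: sets every element to 0
        fixedArr = 0;     // the bracket-less form discussed above; dmd accepts
                          // it for fixed-size arrays and it fills every element

        double[] dynArr = new double[3];
        dynArr[] = 0;     // fine: element-wise assignment
        // dynArr = 0;    // does not compile: a scalar cannot be assigned
                          // to a dynamic array
    }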
May 15 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Ali Çehreli:
 I don't see it in the spec. Is that an old or an unintended feature?
It's a compiler bug; don't use that bracket-less syntax in your programs. Don is fighting to fix such problems (and I have written several posts and bug reports on that stuff).

Bye,
bearophile
May 15 2010
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
kai wrote:

 Here is a boiled down test case:
 
 	void main (string[] args)
 	{
 		double [] foo = new double [cast(int)1e6];
 		for (int i=0;i<1e3;i++)
 		{
 			for (int j=0;j<1e6-1;j++)
 			{
 				foo[j]=foo[j]+foo[j+1];
 			}
 		}
 	}
 
 Any ideas?
    for (int j=0;j<1e6-1;j++)

The j<1e6-1 is a floating point operation. It should be redone as an int one:

    j<1_000_000-1
May 21 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 for (int j=0;j<1e6-1;j++)
 
 The j<1e6-1 is a floating point operation. It should be redone as an int one:
       j<1_000_000-1
The syntax "1e6" can represent an integer value of one million as perfectly and as precisely as "1_000_000", but traditionally in many languages the exponential syntax is used to represent floating point values only, I don't know why. If the OP wants a short syntax to represent one million, this syntax can be used in D2: foreach (j; 0 .. 10^^6) Bye, bearophile
May 22 2010