
digitalmars.D - DMD floating point performance.

reply Dave <Dave_member pathlink.com> writes:
Ok, this is trivial, but it probably fairly represents what many use floating point for day-to-day (simple math on simple data):

import std.stdio, std.date;

void main()
{
     d_time s = getUTCtime;
     double sum = 1.0, val = 1.000001;
     for(size_t i = 0; i < 10_000_000; i++)
     {
         sum += val;
         sum -= val;
         sum *= val;
         sum /= val;
     }
     d_time e = getUTCtime;
     writefln(sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

# gdc -O2 fp.d -o fp ; fp
1: 0.282 secs

# dmd -O fp.d ; fp
1: 0.683 secs

A difference of 2.5x?!

If you look at the DMD asm, the problem is that each operation is wrapped by a load/store. Why wouldn't val and sum be kept in fp registers inside the loop?

Another couple of examples:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=all
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all

I mean, I'd like to standardize on DMD: a) it doesn't come with any baggage on either Windows or Linux, and b) it has been more consistently maintained as of late. But things like this make the decision a lot tougher.

Thanks,

- Dave
Nov 11 2006
next sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
I'm not sure if it would change anything, but for a full release build, 
you'd also want -release and -inline.

L. 
Nov 12 2006
parent Dave <Dave_member pathlink.com> writes:
Lionello Lunesu wrote:
 I'm not sure if it would change anything, but for a full release build, 
 you'd also want -release and -inline.
 
 L. 
 
 

It wouldn't make a diff. for these tests, so I avoided the clutter.
Nov 12 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Dave wrote:
 If you look at the DMD asm, the problem is that each operation is 
 wrapped by a load/store.. Why wouldn't val and sum be kept in fp 
 registers inside the loop?

The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler. Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point variables. It's also problematical because every function call must empty the FPU stack anyway, and so a lot of register spill logic must be in place for it to be very useful. Not impossible, just a lot of tricky work. How does GDC do with this?
Nov 12 2006
parent reply Dave <Dave_member pathlink.com> writes:
Walter Bright wrote:
 Dave wrote:
 If you look at the DMD asm, the problem is that each operation is 
 wrapped by a load/store.. Why wouldn't val and sum be kept in fp 
 registers inside the loop?

The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler. Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point

I know this is simplistic, but could something like fxch be used to 'mimic' a register machine as vars are stored into registers? fxch is supposed to be very efficient.
 variables. It's also problematical because every function call must 
 empty the FPU stack anyway, and so a lot of register spill logic must be 
 in place for it to be very useful.

Could ffree greatly simplify things here?
 Not impossible, just a lot of tricky work. How does GDC do with this?

The original post was comparing DMD vs. GDC. Here's some code showing relative asm. output:

dmd     = 1: 0.673 secs
dmd asm = 1: 0.682 secs
opt asm = 1: 0.272 secs

;---

import std.stdio, std.date;

void main()
{
    d_time s = getUTCtime;
    double sum = fp;
    d_time e = getUTCtime;
    writefln("dmd = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_asm;
    e = getUTCtime;
    writefln("dmd asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_opt;
    e = getUTCtime;
    writefln("opt asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

double fp()
{
    double sum = 1.0, val = 0.000001;
    for(size_t i = 0; i < 10_000_000; i++)
    {
        sum += val;
        sum -= val;
        sum *= val;
        sum /= val;
    }
    return sum;
}

double _sum = 1.0, _val = 0.000001;

double fp_dmd_asm() // more or less
{
    asm
    {
        fld   qword ptr _sum[0];
        fstp  qword ptr -8[EBP];
        xor   EAX,EAX;
    L11:
        fld   qword ptr _val[0];
        fadd  qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fsubr qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fmul  qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fdivr qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        inc   EAX;
        cmp   EAX,10_000_000;
        jb    L11;
        fld   qword ptr -8[EBP];
        fstp  _sum[0];
    }
    return _sum;
}

double fp_dmd_opt()
{
    double sum = 1.0, val = 0.000001;
    asm
    {
        fld  qword ptr sum[0];
        fld  qword ptr val[0];
        fxch ST(1);
        xor  EAX,EAX;
    L1:
        fadd  ST, ST(1);
        fsubr ST, ST(1);
        fmul  ST, ST(1);
        fdivr ST, ST(1);
        inc   EAX;
        cmp   EAX,10_000_000;
        jb    L1;
        fstp  sum[0];
        ffree ST(1);
    }
    return sum;
}
Nov 12 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Dave wrote:
 I know this is simplistic, but could something like fxch be used to 
 'mimick' a register machine as vars are stored into registers? fxch is 
 supposed to be very efficient.

Yes, I just haven't done the work to figure it out.
Nov 12 2006
parent reply Don Clugston <dac nospam.com.au> writes:
Walter Bright wrote:
 Dave wrote:
 I know this is simplistic, but could something like fxch be used to 
 'mimick' a register machine as vars are stored into registers? fxch is 
 supposed to be very efficient.

Yes, I just haven't done the work to figure it out.

The definitive article on this is Agner Fog's optimisation manual at www.agner.org. It's fantastically difficult to do x87 perfectly (and this is probably the main appeal of SSE), but I'm sure there are massive gains to be had relatively cheaply.
Nov 12 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Don Clugston wrote:
 The definitive article on this is Agner Fog's optimisation manual at 
 www.agner.org. It's fantastically difficult to do x87 perfectly (and 
 this is probably the main appeal of SSE), but I'm sure there are massive 
 gains to be had relatively cheaply.

Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.
Nov 13 2006
parent Dave <Dave_member pathlink.com> writes:
Walter Bright wrote:
 Don Clugston wrote:
 The definitive article on this is Agner Fog's optimisation manual at 
 www.agner.org. It's fantastically difficult to do x87 perfectly (and 
 this is probably the main appeal of SSE), but I'm sure there are 
 massive gains to be had relatively cheaply.

Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.

Ran across this: http://www.smlnj.org/compiler-notes/x86-fp.ps
Nov 15 2006
prev sibling parent reply =?iso-8859-1?q?Knud_S=F8rensen?= <12tkvvb02 sneakemail.com> writes:
I think it is possible to systematically search
for cases where dmd doesn't make optimal code.

How this could be done is described at
http://www.oonumerics.org/lunar/

The idea is that one starts with the simplest program:

int main()
{
    return 0;
}

Then one deoptimizes it through one or more deoptimizing steps, for example:

int main()
{
    int i = 0;
    if (false)
    {
        int i = 10;
    }
    return i;
}

At last, one runs the deoptimized program through the compiler and
compares the result to the result of the starting program.

Now we just need someone to write a deoptimizer for D.
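As a rough illustration of the idea (not a real tool — the seed program, rewrite rules, and function names below are all invented for this sketch), a deoptimizer could be a small script that applies semantics-preserving rewrites to a seed program and emits the variants, so that the compiled output of each variant can be diffed against the seed's:

```python
# Minimal sketch of a "deoptimizer": apply semantics-preserving
# rewrites to a seed program, producing variants whose compiled
# output can be compared against the seed's. Everything here is
# illustrative, not a real tool.

SEED = """int main()
{
    return 0;
}
"""

def dead_branch(src):
    # Wrap a dead branch around the return, mirroring the
    # if (false) { int i = 10; } example above.
    return src.replace(
        "    return 0;",
        "    int i = 0;\n"
        "    if (false)\n"
        "    {\n"
        "        int i = 10;\n"
        "    }\n"
        "    return i;"
    )

def identity_arith(src):
    # return 0 -> return 0 + 0: another semantics-preserving step.
    return src.replace("return 0;", "return 0 + 0;")

REWRITES = [dead_branch, identity_arith]

def deoptimize(src, steps):
    # Apply a sequence of rewrite steps to the seed program.
    for step in steps:
        src = step(src)
    return src

if __name__ == "__main__":
    for step in REWRITES:
        print(deoptimize(SEED, [step]))
        # Next step would be: compile seed and variant,
        # then diff the emitted asm.
```

A real version would drive the compiler on each variant and flag any case where the deoptimized program produces worse code than the seed, since an ideal optimizer should undo every rewrite.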
Nov 13 2006
parent =?iso-8859-1?q?Knud_S=F8rensen?= <12tkvvb02 sneakemail.com> writes:
Does anyone know of tools which might be used to hack up a deoptimizer like
that???

It does not have to be a tool written in D, but it should be possible
to generate D code. Maybe a good genetic programming tool could be used, I
don't know.

Any suggestions??

 
On Mon, 13 Nov 2006 10:04:00 +0100, Knud Sørensen wrote:

 I think it is possible to systematically search
 for cases where dmd doesn't make optimal code.

 How this could be done is described at
 http://www.oonumerics.org/lunar/

 The idea is that one starts with the simplest program:

 int main()
 {
     return 0;
 }

 Then one deoptimizes it through one or more deoptimizing steps, for example:

 int main()
 {
     int i = 0;
     if (false)
     {
         int i = 10;
     }
     return i;
 }

 At last, one runs the deoptimized program through the compiler and
 compares the result to the result of the starting program.

 Now we just need someone to write a deoptimizer for D.

Nov 15 2006