
digitalmars.D - DMD floating point performance.

reply Dave <Dave_member pathlink.com> writes:
Ok, this is trivial, but it probably fairly represents what many use floating point for day-to-day (simple math on simple data):

import std.stdio, std.date;

void main()
{
     d_time s = getUTCtime;
     double sum = 1.0, val = 1.000001;
     for(size_t i = 0; i < 10_000_000; i++)
     {
         sum += val;
         sum -= val;
         sum *= val;
         sum /= val;
     }
     d_time e = getUTCtime;
     writefln(sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

# gdc -O2 fp.d -o fp ; fp
1: 0.282 secs

# dmd -O fp.d ; fp
1: 0.683 secs

A difference of 2.5x?!

If you look at the DMD asm, the problem is that each operation is wrapped by a load/store. Why wouldn't val and sum be kept in fp registers inside the loop?

Another couple of examples:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=all
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all

I mean, I'd like to standardize on DMD: a) it doesn't come with any baggage on either Windows or Linux, and b) it has been more consistently maintained as of late. But things like this make the decision a lot tougher.

Thanks,

- Dave
Nov 11 2006
next sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
I'm not sure if it would change anything, but for a full release build, 
you'd also want -release and -inline.

L. 
Nov 12 2006
parent Dave <Dave_member pathlink.com> writes:
Lionello Lunesu wrote:
 I'm not sure if it would change anything, but for a full release build, 
 you'd also want -release and -inline.
 
 L. 
 
 

It wouldn't make a diff. for these tests, so I avoided the clutter.
Nov 12 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Dave wrote:
 If you look at the DMD asm, the problem is that each operation is 
 wrapped by a load/store.. Why wouldn't val and sum be kept in fp 
 registers inside the loop?

The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler. Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point variables. It's also problematical because every function call must empty the FPU stack anyway, and so a lot of register spill logic must be in place for it to be very useful. Not impossible, just a lot of tricky work. How does GDC do with this?
Nov 12 2006
parent reply Dave <Dave_member pathlink.com> writes:
Walter Bright wrote:
 Dave wrote:
 If you look at the DMD asm, the problem is that each operation is 
 wrapped by a load/store.. Why wouldn't val and sum be kept in fp 
 registers inside the loop?

The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler. Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point

I know this is simplistic, but could something like fxch be used to 'mimic' a register machine as vars are stored into registers? fxch is supposed to be very efficient.
 variables. It's also problematical because every function call must 
 empty the FPU stack anyway, and so a lot of register spill logic must be 
 in place for it to be very useful.

Could ffree greatly simplify things here?
 Not impossible, just a lot of tricky work. How does GDC do with this?

The original post was comparing DMD vs. GDC. Here's some code showing relative asm. output:

dmd     = 1: 0.673 secs
dmd asm = 1: 0.682 secs
opt asm = 1: 0.272 secs

;---

import std.stdio, std.date;

void main()
{
    d_time s = getUTCtime;
    double sum = fp;
    d_time e = getUTCtime;
    writefln("dmd = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_asm;
    e = getUTCtime;
    writefln("dmd asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_opt;
    e = getUTCtime;
    writefln("opt asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

double fp()
{
    double sum = 1.0, val = 0.000001;
    for(size_t i = 0; i < 10_000_000; i++)
    {
        sum += val;
        sum -= val;
        sum *= val;
        sum /= val;
    }
    return sum;
}

double _sum = 1.0, _val = 0.000001;

double fp_dmd_asm() // more or less
{
    asm
    {
        fld   qword ptr _sum[0];
        fstp  qword ptr -8[EBP];
        xor   EAX,EAX;
    L11:
        fld   qword ptr _val[0];
        fadd  qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fsubr qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fmul  qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        fld   qword ptr _val[0];
        fdivr qword ptr -8[EBP];
        fstp  qword ptr -8[EBP];
        inc   EAX;
        cmp   EAX,10_000_000;
        jb    L11;
        fld   qword ptr -8[EBP];
        fstp  _sum[0];
    }
    return _sum;
}

double fp_dmd_opt()
{
    double sum = 1.0, val = 0.000001;
    asm
    {
        fld  qword ptr sum[0];
        fld  qword ptr val[0];
        fxch ST(1);
        xor  EAX,EAX;
    L1:
        fadd  ST, ST(1);
        fsubr ST, ST(1);
        fmul  ST, ST(1);
        fdivr ST, ST(1);
        inc   EAX;
        cmp   EAX,10_000_000;
        jb    L1;
        fstp  sum[0];
        ffree ST(1);
    }
    return sum;
}
Nov 12 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Dave wrote:
 I know this is simplistic, but could something like fxch be used to 
 'mimick' a register machine as vars are stored into registers? fxch is 
 supposed to be very efficient.

Yes, I just haven't done the work to figure it out.
Nov 12 2006
parent reply Don Clugston <dac nospam.com.au> writes:
Walter Bright wrote:
 Dave wrote:
 I know this is simplistic, but could something like fxch be used to 
 'mimick' a register machine as vars are stored into registers? fxch is 
 supposed to be very efficient.

Yes, I just haven't done the work to figure it out.

The definitive article on this is Agner Fog's optimisation manual at www.agner.org. It's fantastically difficult to do x87 perfectly (and this is probably the main appeal of SSE), but I'm sure there are massive gains to be had relatively cheaply.
Nov 12 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Don Clugston wrote:
 The definitive article on this is Agner Fog's optimisation manual at 
 www.agner.org. It's fantastically difficult to do x87 perfectly (and 
 this is probably the main appeal of SSE), but I'm sure there are massive 
 gains to be had relatively cheaply.

Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.
Nov 13 2006
parent Dave <Dave_member pathlink.com> writes:
Walter Bright wrote:
 Don Clugston wrote:
 The definitive article on this is Agner Fog's optimisation manual at 
 www.agner.org. It's fantastically difficult to do x87 perfectly (and 
 this is probably the main appeal of SSE), but I'm sure there are 
 massive gains to be had relatively cheaply.

Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.

Ran across this: http://www.smlnj.org/compiler-notes/x86-fp.ps
Nov 15 2006
prev sibling parent reply =?iso-8859-1?q?Knud_S=F8rensen?= <12tkvvb02 sneakemail.com> writes:
I think it is possible to systematically search
for cases where dmd doesn't make optimal code.

How this could be done is described at
http://www.oonumerics.org/lunar/

The idea is that one starts with the simplest program:

int main()
{
    return 0;
}

Then one deoptimizes it through one or more deoptimizing steps, for example:

int main()
{
    int i = 0;
    if (false)
    {
        int i = 10;
    }
    return i;
}

At last, one runs the deoptimized program through the compiler and
compares the result to the result of the starting program.

Now we just need someone to write a deoptimizer for D.
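As a rough illustration of the idea (not a real tool — the seed program, rewrite rules, and function names below are all invented for this sketch), a deoptimizer could be a small script that applies semantics-preserving rewrites to a seed program and emits the variants, so that the compiled output of each variant can be diffed against the seed's:

```python
# Minimal sketch of a "deoptimizer": apply semantics-preserving
# rewrites to a seed program, producing variants whose compiled
# output can be compared against the seed's. Everything here is
# illustrative, not a real tool.

SEED = """int main()
{
    return 0;
}
"""

def dead_branch(src):
    # Wrap a dead branch around the return, mirroring the
    # if (false) { int i = 10; } example above.
    return src.replace(
        "    return 0;",
        "    int i = 0;\n"
        "    if (false)\n"
        "    {\n"
        "        int i = 10;\n"
        "    }\n"
        "    return i;"
    )

def identity_arith(src):
    # return 0 -> return 0 + 0: another semantics-preserving step.
    return src.replace("return 0;", "return 0 + 0;")

REWRITES = [dead_branch, identity_arith]

def deoptimize(src, steps):
    # Apply a sequence of rewrite steps to the seed program.
    for step in steps:
        src = step(src)
    return src

if __name__ == "__main__":
    for step in REWRITES:
        print(deoptimize(SEED, [step]))
        # Next step would be: compile seed and variant,
        # then diff the emitted asm.
```

A real version would drive the compiler on each variant and flag any case where the deoptimized program produces worse code than the seed, since an ideal optimizer should undo every rewrite.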
Nov 13 2006
parent =?iso-8859-1?q?Knud_S=F8rensen?= <12tkvvb02 sneakemail.com> writes:
Does anyone know of tools which might be used to hack up a deoptimizer like
that???

It does not have to be a tool written in D, but it should be possible
to generate D code. Maybe a good genetic programming tool could be used, I
don't know.

Any suggestions??

 
On Mon, 13 Nov 2006 10:04:00 +0100, Knud Sørensen wrote:

 I think it is possible to systematically search
 for cases where dmd doesn't make optimal code.

 How this could be done is described at
 http://www.oonumerics.org/lunar/

 The idea is that one starts with the simplest program:

 int main()
 {
     return 0;
 }

 Then one deoptimizes it through one or more deoptimizing steps, for example:

 int main()
 {
     int i = 0;
     if (false)
     {
         int i = 10;
     }
     return i;
 }

 At last, one runs the deoptimized program through the compiler and
 compares the result to the result of the starting program.

 Now we just need someone to write a deoptimizer for D.

Nov 15 2006