www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Raytracing speed again, this time with gdc.

reply downs <default_357-line yahoo.de> writes:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

I've tried to translate the attached C++ program to D (D program also
attached).

Even after explicitly using the __builtin_sqrt (which correctly
generates a fsqrt instruction (shame on you, non-inlined std.math)), the
D code is significantly slower (12.9s for D vs 9s for C++).

Does anybody know why this is so?
 --downs, confused and saddened

PS: benchmark

gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
-ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm

real    0m13.448s
user    0m12.730s
sys     0m0.090s

real    0m10.128s
user    0m9.810s
sys     0m0.020s
Nov 07 2007
next sibling parent reply "dominik" <aha aha.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:fgskq5$1p5n$1 digitalmars.com...
  writefln("P5\n", n, " ", n, "\n255");
  for (int y=n-1; y>=0; --y)
    for (int x=0; x<n; ++x) {
      double g=0;
      for (int d=0; d<ss*ss; ++d) {
        auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, 
 n).unitise();
 g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s);
      }
      printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss)));
    }
 }

what kind of output is this? how do you view result?
Nov 07 2007
parent downs <default_357-line yahoo.de> writes:
dominik wrote:
 "downs" <default_357-line yahoo.de> wrote in message 
 news:fgskq5$1p5n$1 digitalmars.com...
  writefln("P5\n", n, " ", n, "\n255");
  for (int y=n-1; y>=0; --y)
    for (int x=0; x<n; ++x) {
      double g=0;
      for (int d=0; d<ss*ss; ++d) {
        auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, 
 n).unitise();
 g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s);
      }
      printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss)));
    }
 }

what kind of output is this? how do you view result?

viewer. http://en.wikipedia.org/wiki/Portable_pixmap --downs
Nov 07 2007
prev sibling next sibling parent reply Jascha Wetzel <firstname mainia.de> writes:
downs wrote:
 I've tried to translate the attached C++ program to D (D program also
 attached).
 
 Even after explicitly using the __builtin_sqrt (which correctly
 generates a fsqrt instruction (shame on you, non-inlined std.math)), the
 D code is significantly slower (12.9s for D vs 9s for C++).
 
 Does anybody know why this is so?
  --downs, confused and saddened
 
 PS: benchmark
 
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm
 
 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s
 
 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s
 

did you try to profile both versions? also try benchmarking without the IO.
Nov 08 2007
parent reply downs <default_357-line yahoo.de> writes:
Jascha Wetzel wrote:
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm

 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s

 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s

did you try to profile both versions? also try benchmarking without the IO.

Thanks for the input, but that's not it. The IO of 262K cannot possibly account for three seconds of difference. Just for the books, same result without the printf call. Here's the first three lines of each's profile run. D version: time seconds seconds calls s/call s/call name 77.27 5.10 5.10 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 15.15 6.10 1.00 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 4.55 6.40 0.30 1 0.30 6.56 _Dmain All: 5.25s C++ version: time seconds seconds calls ms/call ms/call name 75.18 4.12 4.12 6164146 0.00 0.00 Group::intersect(std::pair<double, Vec>&, Ray const&) const 19.34 5.18 1.06 38551282 0.00 0.00 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const 4.74 5.44 0.26 main All: 6.57s --downs
Nov 08 2007
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 D version:
  time   seconds   seconds    calls   s/call   s/call  name
  77.27      5.10     5.10  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  15.15      6.10     1.00 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
   4.55      6.40     0.30        1     0.30     6.56  _Dmain
 All: 5.25s

Okay, now that I look at it again, some of those values smack me as distinctly .. impossible. I'll redo the profiling. --downs
Nov 08 2007
parent reply downs <default_357-line yahoo.de> writes:
Here we go again.

D version.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 73.09      5.08     5.08  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
 19.57      6.44     1.36 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  3.74      6.70     0.26        1     0.26     6.88  _Dmain
  2.59      6.88     0.18  4194304     0.00     0.00  double
ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene)
  0.43      6.91     0.03                             struct gcx.Pool*
gcx.Gcx.findPool(void*, void*)
  0.14      6.92     0.01                             void*
gcx.GC.mallocNoSync(uint, void*)
  0.14      6.93     0.01                             uint
gcbits.GCBits.test(uint, void*)
  0.14      6.94     0.01                             _d_callfinalizer
  0.14      6.95     0.01                             _d_newclass
  0.00      6.95     0.00        1     0.00     0.00  void ray1._staticCtor1()
  0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
ray1.create(int, inout struct ray1.Vec, double)
  0.00      6.95     0.00        1     0.00     0.00  void ray1.__modinit()

C++ version.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 76.88      4.19     4.19  6164146     0.00     0.00 
Group::intersect(std::pair<double, Vec>&, Ray const&) const
 18.35      5.19     1.00 38551282     0.00     0.00 
Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
  3.30      5.37     0.18                             main
  1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec const&, Ray
const&, Scene const&)
  0.18      5.45     0.01        1    10.00    10.00  create(int, Vec const&,
double)
  0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
  0.00      5.45     0.00        1     0.00     0.00  global constructors keyed
to real
  0.00      5.45     0.00        1     0.00     0.00 
__static_initialization_and_destruction_0(int, int)
  0.00      5.45     0.00        1     0.00     0.00  Group::~Group()
Nov 08 2007
parent reply Jascha Wetzel <firstname mainia.de> writes:
downs wrote:
 Here we go again.
 
 D version.
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls   s/call   s/call  name
  73.09      5.08     5.08  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  19.57      6.44     1.36 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
   3.74      6.70     0.26        1     0.26     6.88  _Dmain
   2.59      6.88     0.18  4194304     0.00     0.00  double
ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene)
   0.43      6.91     0.03                             struct gcx.Pool*
gcx.Gcx.findPool(void*, void*)
   0.14      6.92     0.01                             void*
gcx.GC.mallocNoSync(uint, void*)
   0.14      6.93     0.01                             uint
gcbits.GCBits.test(uint, void*)
   0.14      6.94     0.01                             _d_callfinalizer
   0.14      6.95     0.01                             _d_newclass
   0.00      6.95     0.00        1     0.00     0.00  void ray1._staticCtor1()
   0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
ray1.create(int, inout struct ray1.Vec, double)
   0.00      6.95     0.00        1     0.00     0.00  void ray1.__modinit()
 
 C++ version.
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  76.88      4.19     4.19  6164146     0.00     0.00 
Group::intersect(std::pair<double, Vec>&, Ray const&) const
  18.35      5.19     1.00 38551282     0.00     0.00 
Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
   3.30      5.37     0.18                             main
   1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec const&,
Ray const&, Scene const&)
   0.18      5.45     0.01        1    10.00    10.00  create(int, Vec const&,
double)
   0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
   0.00      5.45     0.00        1     0.00     0.00  global constructors
keyed to real
   0.00      5.45     0.00        1     0.00     0.00 
__static_initialization_and_destruction_0(int, int)
   0.00      5.45     0.00        1     0.00     0.00  Group::~Group()

in Sphere.intersect, instead of hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise); try hit.first = lamdba; hit.second = ... it might not optimize away the creation on stack + copy. now i'd look at the asm code for ray_sphere in both versions, since that seems to where the time is burnt. also curious: why does the C++ version call Sphere::intersect 12 times more often than the D version?
Nov 08 2007
parent reply downs <default_357-line yahoo.de> writes:
Jascha Wetzel wrote:
 downs wrote:
 Here we go again.

 D version.

 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls   s/call   s/call  name
  73.09      5.08     5.08  6164146     0.00     0.00  void
 ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray,
 void*)
  19.57      6.44     1.36 38551270     0.00     0.00  void
 ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray,
 void*)
   3.74      6.70     0.26        1     0.26     6.88  _Dmain
   2.59      6.88     0.18  4194304     0.00     0.00  double
 ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class
 ray1.Scene)
   0.43      6.91     0.03                             struct gcx.Pool*
 gcx.Gcx.findPool(void*, void*)
   0.14      6.92     0.01                             void*
 gcx.GC.mallocNoSync(uint, void*)
   0.14      6.93     0.01                             uint
 gcbits.GCBits.test(uint, void*)
   0.14      6.94     0.01                             _d_callfinalizer
   0.14      6.95     0.01                             _d_newclass
   0.00      6.95     0.00        1     0.00     0.00  void
 ray1._staticCtor1()
   0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
 ray1.create(int, inout struct ray1.Vec, double)
   0.00      6.95     0.00        1     0.00     0.00  void
 ray1.__modinit()

 C++ version.

 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  76.88      4.19     4.19  6164146     0.00     0.00 
 Group::intersect(std::pair<double, Vec>&, Ray const&) const
  18.35      5.19     1.00 38551282     0.00     0.00 
 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
   3.30      5.37     0.18                             main
   1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec
 const&, Ray const&, Scene const&)
   0.18      5.45     0.01        1    10.00    10.00  create(int, Vec
 const&, double)
   0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
   0.00      5.45     0.00        1     0.00     0.00  global
 constructors keyed to real
   0.00      5.45     0.00        1     0.00     0.00 
 __static_initialization_and_destruction_0(int, int)
   0.00      5.45     0.00        1     0.00     0.00  Group::~Group()

in Sphere.intersect, instead of hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise); try hit.first = lamdba; hit.second = ... it might not optimize away the creation on stack + copy.

 now i'd look at the asm code for ray_sphere in both versions, since that
 seems to where the time is burnt.
 
 also curious: why does the C++ version call Sphere::intersect 12 times
 more often than the D version?

I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out. --downs
Nov 08 2007
parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Jascha Wetzel wrote:
 also curious: why does the C++ version call Sphere::intersect 12 times
 more often than the D version?

I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out. --downs

Okay, it's some weird floating point effect around this line: "disc=b*b - v.dot(v) + radius*radius;". It only happens in very specific conditions; x=430, y=379, d=11. So whatever it is, responsible for three seconds it is not. --downs
Nov 08 2007
prev sibling next sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
downs wrote:
 I've tried to translate the attached C++ program to D (D program also
 attached).
 
 Even after explicitly using the __builtin_sqrt (which correctly
 generates a fsqrt instruction (shame on you, non-inlined std.math)), the
 D code is significantly slower (12.9s for D vs 9s for C++).
 
 Does anybody know why this is so?
  --downs, confused and saddened
 
 PS: benchmark
 
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm
 
 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s
 
 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s
 

Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down. You could try sticking some =void's in your structs, like so: struct Vec { double x=void, y=void, z=void; Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); } Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); } Vec opMul(double a) { return Vec(x*a, y*a, z*a); } double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; } Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); } } struct Pair(T, U) { T first=void; U second=void; } typedef Pair!(double, Vec) Hit; struct Ray { Vec orig=void, dir=void; } --bb
Nov 08 2007
parent reply downs <default_357-line yahoo.de> writes:
Bill Baxter wrote:
 
 Could the difference be in part due to default initialization in D?
 Maybe all your rays and vecs are getting initialized first to NaN and
 then overwritten with the value you want, and that is slowing it down.
 
 You could try sticking some =void's in your structs, like so:
 
 struct Vec {
   double x=void, y=void, z=void;
   Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); }
   Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); }
   Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
   double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
   Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
 }
 
 struct Pair(T, U) { T first=void; U second=void; }
 typedef Pair!(double, Vec) Hit;
 
 struct Ray { Vec orig=void, dir=void; }
 
 
 --bb

If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation. Also, I tried that and it's not it. Sorry. Still, thanks for the idea! --downs
Nov 08 2007
parent "Dave" <Dave_member pathlink.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:fh0iem$2dhc$1 digitalmars.com...
 Bill Baxter wrote:
 Could the difference be in part due to default initialization in D?
 Maybe all your rays and vecs are getting initialized first to NaN and
 then overwritten with the value you want, and that is slowing it down.

 You could try sticking some =void's in your structs, like so:

 struct Vec {
   double x=void, y=void, z=void;
   Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, 
 z+other.z); }
   Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, 
 z-other.z); }
   Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
   double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
   Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
 }

 struct Pair(T, U) { T first=void; U second=void; }
 typedef Pair!(double, Vec) Hit;

 struct Ray { Vec orig=void, dir=void; }


 --bb

If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation. Also, I tried that and it's not it. Sorry. Still, thanks for the idea! --downs

Sorry if this has already been discussed, but if not: Have you checked to see if function inlining differences are the culprit? The D front-end doesn't inline functions with byref params, and for DMD at least the FE is responsible for all inlining. For GDC the front-end inlining might be turned off anyhow (I can't remember), and maybe GCC does all that in the intermediate or backend stages.. If so it would presumably be pretty close to the same for D and C++ but it might be worth a look anyhow.
Nov 10 2007
prev sibling parent "Saaa" <empty needmail.com> writes:
Any success? If you find out what it was could you post it?

The only things I can think of are a stack heap allocation, variable 
initialisation and better optimization differences...
Maybe making some smaller test cases.
Nov 09 2007