digitalmars.D.learn - Raytracing speed again, this time with gdc.

downs (17/17) Nov 07 2007 I've tried to translate the attached C++ program to D (D program also

dominik (3/15) Nov 07 2007 what kind of output is this? how do you view result?

downs (5/23) Nov 07 2007 PGM. Gwenview and Konqueror can display it, as well as any decent image

Jascha Wetzel (3/27) Nov 08 2007 did you try to profile both versions?

downs (17/31) Nov 08 2007 Thanks for the input, but that's not it.

downs (3/9) Nov 08 2007 Okay, now that I look at it again, some of those values smack me as dist...

downs (30/30) Nov 08 2007 Here we go again.

Jascha Wetzel (11/45) Nov 08 2007 in Sphere.intersect, instead of

downs (5/71) Nov 08 2007 I'd like to know that as well. Far as I know, the algorithm's the same.

downs (5/12) Nov 08 2007 Okay, it's some weird floating point effect around this line: "disc=b*b ...

Bill Baxter (17/41) Nov 08 2007 Could the difference be in part due to default initialization in D?

downs (7/30) Nov 08 2007 If you check the source, you'll see that practically all my structs are ...

Dave (10/44) Nov 10 2007 Sorry if this has already been discussed, but if not: Have you checked t...

Saaa (4/4) Nov 09 2007 Any success? If you find out what it was could you post it?

downs <default_357-line yahoo.de> writes:

I've tried to translate the attached C++ program to D (D program also
attached).

Even after explicitly using the __builtin_sqrt (which correctly
generates a fsqrt instruction (shame on you, non-inlined std.math)), the
D code is significantly slower (12.9s for D vs 9s for C++).

Does anybody know why this is so?
 --downs, confused and saddened

PS: benchmark

gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
-ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm

real    0m13.448s
user    0m12.730s
sys     0m0.090s

real    0m10.128s
user    0m9.810s
sys     0m0.020s

Nov 07 2007

"dominik" <aha aha.com> writes:

"downs" <default_357-line yahoo.de> wrote in message 
news:fgskq5$1p5n$1 digitalmars.com...
  writefln("P5\n", n, " ", n, "\n255");
  for (int y=n-1; y>=0; --y)
    for (int x=0; x<n; ++x) {
      double g=0;
      for (int d=0; d<ss*ss; ++d) {
        auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, 
 n).unitise();
 g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s);
      }
      printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss)));
    }
 }

what kind of output is this? how do you view result?

Nov 07 2007

downs <default_357-line yahoo.de> writes:

dominik wrote:
 "downs" <default_357-line yahoo.de> wrote in message 
 news:fgskq5$1p5n$1 digitalmars.com...
  writefln("P5\n", n, " ", n, "\n255");
  for (int y=n-1; y>=0; --y)
    for (int x=0; x<n; ++x) {
      double g=0;
      for (int d=0; d<ss*ss; ++d) {
        auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, 
 n).unitise();
 g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s);
      }
      printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss)));
    }
 }

 
 what kind of output is this? how do you view result? 
 
 

PGM. Gwenview and Konqueror can display it, as well as any decent image
viewer.

http://en.wikipedia.org/wiki/Portable_pixmap

 --downs

Nov 07 2007

Jascha Wetzel <firstname mainia.de> writes:

downs wrote:
 I've tried to translate the attached C++ program to D (D program also
 attached).
 
 Even after explicitly using the __builtin_sqrt (which correctly
 generates a fsqrt instruction (shame on you, non-inlined std.math)), the
 D code is significantly slower (12.9s for D vs 9s for C++).
 
 Does anybody know why this is so?
  --downs, confused and saddened
 
 PS: benchmark
 
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm
 
 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s
 
 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s
 

did you try to profile both versions?
also try benchmarking without the IO.

Nov 08 2007

downs <default_357-line yahoo.de> writes:

Jascha Wetzel wrote:
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm

 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s

 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s

 did you try to profile both versions?
 also try benchmarking without the IO.

Thanks for the input, but that's not it.

The IO of 262K cannot possibly account for three seconds of difference. Just
for the books, same result without the printf call.

Here's the first three lines of each's profile run.
D version:
 time   seconds   seconds    calls   s/call   s/call  name
 77.27      5.10     5.10  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
 15.15      6.10     1.00 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  4.55      6.40     0.30        1     0.30     6.56  _Dmain
All: 5.25s

C++ version:
 time   seconds   seconds    calls  ms/call  ms/call  name
 75.18      4.12     4.12  6164146     0.00     0.00 
Group::intersect(std::pair<double, Vec>&, Ray const&) const
 19.34      5.18     1.06 38551282     0.00     0.00 
Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
  4.74      5.44     0.26                             main
All: 6.57s


 --downs

Nov 08 2007

downs <default_357-line yahoo.de> writes:

downs wrote:
 D version:
  time   seconds   seconds    calls   s/call   s/call  name
  77.27      5.10     5.10  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  15.15      6.10     1.00 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
   4.55      6.40     0.30        1     0.30     6.56  _Dmain
 All: 5.25s

Okay, now that I look at it again, some of those values smack me as distinctly
.. impossible. I'll redo the profiling.
 --downs

Nov 08 2007

downs <default_357-line yahoo.de> writes:

Here we go again.

D version.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 73.09      5.08     5.08  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
 19.57      6.44     1.36 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  3.74      6.70     0.26        1     0.26     6.88  _Dmain
  2.59      6.88     0.18  4194304     0.00     0.00  double
ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene)
  0.43      6.91     0.03                             struct gcx.Pool*
gcx.Gcx.findPool(void*, void*)
  0.14      6.92     0.01                             void*
gcx.GC.mallocNoSync(uint, void*)
  0.14      6.93     0.01                             uint
gcbits.GCBits.test(uint, void*)
  0.14      6.94     0.01                             _d_callfinalizer
  0.14      6.95     0.01                             _d_newclass
  0.00      6.95     0.00        1     0.00     0.00  void ray1._staticCtor1()
  0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
ray1.create(int, inout struct ray1.Vec, double)
  0.00      6.95     0.00        1     0.00     0.00  void ray1.__modinit()

C++ version.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 76.88      4.19     4.19  6164146     0.00     0.00 
Group::intersect(std::pair<double, Vec>&, Ray const&) const
 18.35      5.19     1.00 38551282     0.00     0.00 
Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
  3.30      5.37     0.18                             main
  1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec const&, Ray
const&, Scene const&)
  0.18      5.45     0.01        1    10.00    10.00  create(int, Vec const&,
double)
  0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
  0.00      5.45     0.00        1     0.00     0.00  global constructors keyed
to real
  0.00      5.45     0.00        1     0.00     0.00 
__static_initialization_and_destruction_0(int, int)
  0.00      5.45     0.00        1     0.00     0.00  Group::~Group()

Nov 08 2007

Jascha Wetzel <firstname mainia.de> writes:

downs wrote:
 Here we go again.
 
 D version.
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls   s/call   s/call  name
  73.09      5.08     5.08  6164146     0.00     0.00  void
ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  19.57      6.44     1.36 38551270     0.00     0.00  void
ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
   3.74      6.70     0.26        1     0.26     6.88  _Dmain
   2.59      6.88     0.18  4194304     0.00     0.00  double
ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene)
   0.43      6.91     0.03                             struct gcx.Pool*
gcx.Gcx.findPool(void*, void*)
   0.14      6.92     0.01                             void*
gcx.GC.mallocNoSync(uint, void*)
   0.14      6.93     0.01                             uint
gcbits.GCBits.test(uint, void*)
   0.14      6.94     0.01                             _d_callfinalizer
   0.14      6.95     0.01                             _d_newclass
   0.00      6.95     0.00        1     0.00     0.00  void ray1._staticCtor1()
   0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
ray1.create(int, inout struct ray1.Vec, double)
   0.00      6.95     0.00        1     0.00     0.00  void ray1.__modinit()
 
 C++ version.
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  76.88      4.19     4.19  6164146     0.00     0.00 
Group::intersect(std::pair<double, Vec>&, Ray const&) const
  18.35      5.19     1.00 38551282     0.00     0.00 
Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
   3.30      5.37     0.18                             main
   1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec const&,
Ray const&, Scene const&)
   0.18      5.45     0.01        1    10.00    10.00  create(int, Vec const&,
double)
   0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
   0.00      5.45     0.00        1     0.00     0.00  global constructors
keyed to real
   0.00      5.45     0.00        1     0.00     0.00 
__static_initialization_and_destruction_0(int, int)
   0.00      5.45     0.00        1     0.00     0.00  Group::~Group()

in Sphere.intersect, instead of
hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise);
try
hit.first = lamdba;
hit.second = ...
it might not optimize away the creation on stack + copy.

now i'd look at the asm code for ray_sphere in both versions, since that 
seems to where the time is burnt.

also curious: why does the C++ version call Sphere::intersect 12 times 
more often than the D version?

Nov 08 2007

downs <default_357-line yahoo.de> writes:

Jascha Wetzel wrote:
 downs wrote:
 Here we go again.

 D version.

 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls   s/call   s/call  name
  73.09      5.08     5.08  6164146     0.00     0.00  void
 ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray,
 void*)
  19.57      6.44     1.36 38551270     0.00     0.00  void
 ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray,
 void*)
   3.74      6.70     0.26        1     0.26     6.88  _Dmain
   2.59      6.88     0.18  4194304     0.00     0.00  double
 ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class
 ray1.Scene)
   0.43      6.91     0.03                             struct gcx.Pool*
 gcx.Gcx.findPool(void*, void*)
   0.14      6.92     0.01                             void*
 gcx.GC.mallocNoSync(uint, void*)
   0.14      6.93     0.01                             uint
 gcbits.GCBits.test(uint, void*)
   0.14      6.94     0.01                             _d_callfinalizer
   0.14      6.95     0.01                             _d_newclass
   0.00      6.95     0.00        1     0.00     0.00  void
 ray1._staticCtor1()
   0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene
 ray1.create(int, inout struct ray1.Vec, double)
   0.00      6.95     0.00        1     0.00     0.00  void
 ray1.__modinit()

 C++ version.

 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  76.88      4.19     4.19  6164146     0.00     0.00 
 Group::intersect(std::pair<double, Vec>&, Ray const&) const
  18.35      5.19     1.00 38551282     0.00     0.00 
 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
   3.30      5.37     0.18                             main
   1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec
 const&, Ray const&, Scene const&)
   0.18      5.45     0.01        1    10.00    10.00  create(int, Vec
 const&, double)
   0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
   0.00      5.45     0.00        1     0.00     0.00  global
 constructors keyed to real
   0.00      5.45     0.00        1     0.00     0.00 
 __static_initialization_and_destruction_0(int, int)
   0.00      5.45     0.00        1     0.00     0.00  Group::~Group()

 
 in Sphere.intersect, instead of
 hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise);
 try
 hit.first = lamdba;
 hit.second = ...
 it might not optimize away the creation on stack + copy.
 

I just tried that. Negligible change.

 now i'd look at the asm code for ray_sphere in both versions, since that
 seems to where the time is burnt.
 
 also curious: why does the C++ version call Sphere::intersect 12 times
 more often than the D version?

I'd like to know that as well. Far as I know, the algorithm's the same.
I'll add debugging code to find out.
 --downs

Nov 08 2007

downs <default_357-line yahoo.de> writes:

downs wrote:
 Jascha Wetzel wrote:
 also curious: why does the C++ version call Sphere::intersect 12 times
 more often than the D version?

 
 I'd like to know that as well. Far as I know, the algorithm's the same.
 I'll add debugging code to find out.
  --downs

Okay, it's some weird floating point effect around this line: "disc=b*b -
v.dot(v) + radius*radius;".
It only happens in very specific conditions; x=430, y=379, d=11.
So whatever it is, responsible for three seconds it is not.
 --downs

Nov 08 2007

Bill Baxter <dnewsgroup billbaxter.com> writes:

downs wrote:
 I've tried to translate the attached C++ program to D (D program also
 attached).
 
 Even after explicitly using the __builtin_sqrt (which correctly
 generates a fsqrt instruction (shame on you, non-inlined std.math)), the
 D code is significantly slower (12.9s for D vs 9s for C++).
 
 Does anybody know why this is so?
  --downs, confused and saddened
 
 PS: benchmark
 
 gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d
 tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3
 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm
 
 real    0m13.448s
 user    0m12.730s
 sys     0m0.090s
 
 real    0m10.128s
 user    0m9.810s
 sys     0m0.020s
 

Could the difference be in part due to default initialization in D? 
Maybe all your rays and vecs are getting initialized first to NaN and 
then overwritten with the value you want, and that is slowing it down.

You could try sticking some =void's in your structs, like so:

struct Vec {
   double x=void, y=void, z=void;
   Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); }
   Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); }
   Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
   double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
   Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
}

struct Pair(T, U) { T first=void; U second=void; }
typedef Pair!(double, Vec) Hit;

struct Ray { Vec orig=void, dir=void; }


--bb

Nov 08 2007

downs <default_357-line yahoo.de> writes:

Bill Baxter wrote:
 
 Could the difference be in part due to default initialization in D?
 Maybe all your rays and vecs are getting initialized first to NaN and
 then overwritten with the value you want, and that is slowing it down.
 
 You could try sticking some =void's in your structs, like so:
 
 struct Vec {
   double x=void, y=void, z=void;
   Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); }
   Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); }
   Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
   double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
   Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
 }
 
 struct Pair(T, U) { T first=void; U second=void; }
 typedef Pair!(double, Vec) Hit;
 
 struct Ray { Vec orig=void, dir=void; }
 
 
 --bb

If you check the source, you'll see that practically all my structs are either
manually initialized with proper values or the result of a calculation. Also,
I tried that and it's not it.
Sorry.

Still, thanks for the idea!
 --downs

Nov 08 2007

"Dave" <Dave_member pathlink.com> writes:

"downs" <default_357-line yahoo.de> wrote in message 
news:fh0iem$2dhc$1 digitalmars.com...
 Bill Baxter wrote:
 Could the difference be in part due to default initialization in D?
 Maybe all your rays and vecs are getting initialized first to NaN and
 then overwritten with the value you want, and that is slowing it down.

 You could try sticking some =void's in your structs, like so:

 struct Vec {
   double x=void, y=void, z=void;
   Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, 
 z+other.z); }
   Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, 
 z-other.z); }
   Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
   double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
   Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
 }

 struct Pair(T, U) { T first=void; U second=void; }
 typedef Pair!(double, Vec) Hit;

 struct Ray { Vec orig=void, dir=void; }


 --bb

 If you check the source, you'll see that practically all my structs are 
 either
 manually initialized with proper values or the result of a calculation. 
 Also,
 I tried that and it's not it.
 Sorry.

 Still, thanks for the idea!
 --downs

Sorry if this has already been discussed, but if not: Have you checked to
see if function inlining differences are the culprit? The D front-end
doesn't inline functions with byref params, and for DMD at least the FE is
responsible for all inlining.

For GDC the front-end inlining might be turned off anyhow (I can't
remember), and maybe GCC does all that in the intermediate or backend
stages.. If so it would presumably be pretty close to the same for D and C++
but it might be worth a look anyhow.

Nov 10 2007

"Saaa" <empty needmail.com> writes:

Any success? If you find out what it was could you post it?

The only things I can think of are a stack heap allocation, variable 
initialisation and better optimization differences...
Maybe making some smaller test cases.

Nov 09 2007

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Raytracing speed again, this time with gdc.