digitalmars.D.learn - Raytracing speed again, this time with gdc.
- downs <default_357-line yahoo.de> Nov 07 2007
- "dominik" <aha aha.com> Nov 07 2007
- downs <default_357-line yahoo.de> Nov 07 2007
- Jascha Wetzel <firstname mainia.de> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- Jascha Wetzel <firstname mainia.de> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- Bill Baxter <dnewsgroup billbaxter.com> Nov 08 2007
- downs <default_357-line yahoo.de> Nov 08 2007
- "Dave" <Dave_member pathlink.com> Nov 10 2007
- "Saaa" <empty needmail.com> Nov 09 2007
Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit I've tried to translate the attached C++ program to D (D program also attached). Even after explicitly using the __builtin_sqrt (which correctly generates a fsqrt instruction (shame on you, non-inlined std.math)), the D code is significantly slower (12.9s for D vs 9s for C++). Does anybody know why this is so? --downs, confused and saddened PS: benchmark gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm real 0m13.448s user 0m12.730s sys 0m0.090s real 0m10.128s user 0m9.810s sys 0m0.020s
Nov 07 2007
"downs" <default_357-line yahoo.de> wrote in message news:fgskq5$1p5n$1 digitalmars.com...writefln("P5\n", n, " ", n, "\n255"); for (int y=n-1; y>=0; --y) for (int x=0; x<n; ++x) { double g=0; for (int d=0; d<ss*ss; ++d) { auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, n).unitise(); g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s); } printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss))); } }
what kind of output is this? how do you view result?
Nov 07 2007
dominik wrote:"downs" <default_357-line yahoo.de> wrote in message news:fgskq5$1p5n$1 digitalmars.com...writefln("P5\n", n, " ", n, "\n255"); for (int y=n-1; y>=0; --y) for (int x=0; x<n; ++x) { double g=0; for (int d=0; d<ss*ss; ++d) { auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, n).unitise(); g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s); } printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss))); } }
what kind of output is this? how do you view result?
viewer. http://en.wikipedia.org/wiki/Portable_pixmap --downs
Nov 07 2007
downs wrote:I've tried to translate the attached C++ program to D (D program also attached). Even after explicitly using the __builtin_sqrt (which correctly generates a fsqrt instruction (shame on you, non-inlined std.math)), the D code is significantly slower (12.9s for D vs 9s for C++). Does anybody know why this is so? --downs, confused and saddened PS: benchmark gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm real 0m13.448s user 0m12.730s sys 0m0.090s real 0m10.128s user 0m9.810s sys 0m0.020s
did you try to profile both versions? also try benchmarking without the IO.
Nov 08 2007
Jascha Wetzel wrote:gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm real 0m13.448s user 0m12.730s sys 0m0.090s real 0m10.128s user 0m9.810s sys 0m0.020s
did you try to profile both versions? also try benchmarking without the IO.
Thanks for the input, but that's not it. The IO of 262K cannot possibly account for three seconds of difference. Just for the books, same result without the printf call. Here's the first three lines of each's profile run. D version: time seconds seconds calls s/call s/call name 77.27 5.10 5.10 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 15.15 6.10 1.00 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 4.55 6.40 0.30 1 0.30 6.56 _Dmain All: 5.25s C++ version: time seconds seconds calls ms/call ms/call name 75.18 4.12 4.12 6164146 0.00 0.00 Group::intersect(std::pair<double, Vec>&, Ray const&) const 19.34 5.18 1.06 38551282 0.00 0.00 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const 4.74 5.44 0.26 main All: 6.57s --downs
Nov 08 2007
downs wrote:D version: time seconds seconds calls s/call s/call name 77.27 5.10 5.10 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 15.15 6.10 1.00 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 4.55 6.40 0.30 1 0.30 6.56 _Dmain All: 5.25s
Okay, now that I look at it again, some of those values smack me as distinctly .. impossible. I'll redo the profiling. --downs
Nov 08 2007
Here we go again. D version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 73.09 5.08 5.08 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 19.57 6.44 1.36 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 3.74 6.70 0.26 1 0.26 6.88 _Dmain 2.59 6.88 0.18 4194304 0.00 0.00 double ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene) 0.43 6.91 0.03 struct gcx.Pool* gcx.Gcx.findPool(void*, void*) 0.14 6.92 0.01 void* gcx.GC.mallocNoSync(uint, void*) 0.14 6.93 0.01 uint gcbits.GCBits.test(uint, void*) 0.14 6.94 0.01 _d_callfinalizer 0.14 6.95 0.01 _d_newclass 0.00 6.95 0.00 1 0.00 0.00 void ray1._staticCtor1() 0.00 6.95 0.00 1 0.00 0.00 class ray1.Scene ray1.create(int, inout struct ray1.Vec, double) 0.00 6.95 0.00 1 0.00 0.00 void ray1.__modinit() C++ version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 76.88 4.19 4.19 6164146 0.00 0.00 Group::intersect(std::pair<double, Vec>&, Ray const&) const 18.35 5.19 1.00 38551282 0.00 0.00 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const 3.30 5.37 0.18 main 1.28 5.44 0.07 4194304 0.00 0.00 ray_trace(Vec const&, Ray const&, Scene const&) 0.18 5.45 0.01 1 10.00 10.00 create(int, Vec const&, double) 0.00 5.45 0.00 87381 0.00 0.00 Sphere::~Sphere() 0.00 5.45 0.00 1 0.00 0.00 global constructors keyed to real 0.00 5.45 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 5.45 0.00 1 0.00 0.00 Group::~Group()
Nov 08 2007
downs wrote:Here we go again. D version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 73.09 5.08 5.08 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 19.57 6.44 1.36 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 3.74 6.70 0.26 1 0.26 6.88 _Dmain 2.59 6.88 0.18 4194304 0.00 0.00 double ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene) 0.43 6.91 0.03 struct gcx.Pool* gcx.Gcx.findPool(void*, void*) 0.14 6.92 0.01 void* gcx.GC.mallocNoSync(uint, void*) 0.14 6.93 0.01 uint gcbits.GCBits.test(uint, void*) 0.14 6.94 0.01 _d_callfinalizer 0.14 6.95 0.01 _d_newclass 0.00 6.95 0.00 1 0.00 0.00 void ray1._staticCtor1() 0.00 6.95 0.00 1 0.00 0.00 class ray1.Scene ray1.create(int, inout struct ray1.Vec, double) 0.00 6.95 0.00 1 0.00 0.00 void ray1.__modinit() C++ version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 76.88 4.19 4.19 6164146 0.00 0.00 Group::intersect(std::pair<double, Vec>&, Ray const&) const 18.35 5.19 1.00 38551282 0.00 0.00 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const 3.30 5.37 0.18 main 1.28 5.44 0.07 4194304 0.00 0.00 ray_trace(Vec const&, Ray const&, Scene const&) 0.18 5.45 0.01 1 10.00 10.00 create(int, Vec const&, double) 0.00 5.45 0.00 87381 0.00 0.00 Sphere::~Sphere() 0.00 5.45 0.00 1 0.00 0.00 global constructors keyed to real 0.00 5.45 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 5.45 0.00 1 0.00 0.00 Group::~Group()
in Sphere.intersect, instead of hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise); try hit.first = lamdba; hit.second = ... it might not optimize away the creation on stack + copy. now i'd look at the asm code for ray_sphere in both versions, since that seems to where the time is burnt. also curious: why does the C++ version call Sphere::intersect 12 times more often than the D version?
Nov 08 2007
Jascha Wetzel wrote:downs wrote:Here we go again. D version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 73.09 5.08 5.08 6164146 0.00 0.00 void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 19.57 6.44 1.36 38551270 0.00 0.00 void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*) 3.74 6.70 0.26 1 0.26 6.88 _Dmain 2.59 6.88 0.18 4194304 0.00 0.00 double ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene) 0.43 6.91 0.03 struct gcx.Pool* gcx.Gcx.findPool(void*, void*) 0.14 6.92 0.01 void* gcx.GC.mallocNoSync(uint, void*) 0.14 6.93 0.01 uint gcbits.GCBits.test(uint, void*) 0.14 6.94 0.01 _d_callfinalizer 0.14 6.95 0.01 _d_newclass 0.00 6.95 0.00 1 0.00 0.00 void ray1._staticCtor1() 0.00 6.95 0.00 1 0.00 0.00 class ray1.Scene ray1.create(int, inout struct ray1.Vec, double) 0.00 6.95 0.00 1 0.00 0.00 void ray1.__modinit() C++ version. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 76.88 4.19 4.19 6164146 0.00 0.00 Group::intersect(std::pair<double, Vec>&, Ray const&) const 18.35 5.19 1.00 38551282 0.00 0.00 Sphere::intersect(std::pair<double, Vec>&, Ray const&) const 3.30 5.37 0.18 main 1.28 5.44 0.07 4194304 0.00 0.00 ray_trace(Vec const&, Ray const&, Scene const&) 0.18 5.45 0.01 1 10.00 10.00 create(int, Vec const&, double) 0.00 5.45 0.00 87381 0.00 0.00 Sphere::~Sphere() 0.00 5.45 0.00 1 0.00 0.00 global constructors keyed to real 0.00 5.45 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 5.45 0.00 1 0.00 0.00 Group::~Group()
in Sphere.intersect, instead of hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise); try hit.first = lamdba; hit.second = ... it might not optimize away the creation on stack + copy.
now i'd look at the asm code for ray_sphere in both versions, since that seems to where the time is burnt. also curious: why does the C++ version call Sphere::intersect 12 times more often than the D version?
I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out. --downs
Nov 08 2007
downs wrote:Jascha Wetzel wrote:also curious: why does the C++ version call Sphere::intersect 12 times more often than the D version?
I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out. --downs
Okay, it's some weird floating point effect around this line: "disc=b*b - v.dot(v) + radius*radius;". It only happens in very specific conditions; x=430, y=379, d=11. So whatever it is, responsible for three seconds it is not. --downs
Nov 08 2007
downs wrote:I've tried to translate the attached C++ program to D (D program also attached). Even after explicitly using the __builtin_sqrt (which correctly generates a fsqrt instruction (shame on you, non-inlined std.math)), the D code is significantly slower (12.9s for D vs 9s for C++). Does anybody know why this is so? --downs, confused and saddened PS: benchmark gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm real 0m13.448s user 0m12.730s sys 0m0.090s real 0m10.128s user 0m9.810s sys 0m0.020s
Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down. You could try sticking some =void's in your structs, like so: struct Vec { double x=void, y=void, z=void; Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); } Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); } Vec opMul(double a) { return Vec(x*a, y*a, z*a); } double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; } Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); } } struct Pair(T, U) { T first=void; U second=void; } typedef Pair!(double, Vec) Hit; struct Ray { Vec orig=void, dir=void; } --bb
Nov 08 2007
Bill Baxter wrote:Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down. You could try sticking some =void's in your structs, like so: struct Vec { double x=void, y=void, z=void; Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); } Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); } Vec opMul(double a) { return Vec(x*a, y*a, z*a); } double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; } Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); } } struct Pair(T, U) { T first=void; U second=void; } typedef Pair!(double, Vec) Hit; struct Ray { Vec orig=void, dir=void; } --bb
If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation. Also, I tried that and it's not it. Sorry. Still, thanks for the idea! --downs
Nov 08 2007
"downs" <default_357-line yahoo.de> wrote in message news:fh0iem$2dhc$1 digitalmars.com...Bill Baxter wrote:Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down. You could try sticking some =void's in your structs, like so: struct Vec { double x=void, y=void, z=void; Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); } Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); } Vec opMul(double a) { return Vec(x*a, y*a, z*a); } double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; } Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); } } struct Pair(T, U) { T first=void; U second=void; } typedef Pair!(double, Vec) Hit; struct Ray { Vec orig=void, dir=void; } --bb
If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation. Also, I tried that and it's not it. Sorry. Still, thanks for the idea! --downs
Sorry if this has already been discussed, but if not: Have you checked to see if function inlining differences are the culprit? The D front-end doesn't inline functions with byref params, and for DMD at least the FE is responsible for all inlining. For GDC the front-end inlining might be turned off anyhow (I can't remember), and maybe GCC does all that in the intermediate or backend stages.. If so it would presumably be pretty close to the same for D and C++ but it might be worth a look anyhow.
Nov 10 2007
Any success? If you find out what it was could you post it? The only things I can think of are a stack heap allocation, variable initialisation and better optimization differences... Maybe making some smaller test cases.
Nov 09 2007









downs <default_357-line yahoo.de> 