
digitalmars.D.learn - D 50% slower than C++. What I'm doing wrong?

reply "ReneSac" <reneduani yahoo.com.br> writes:
I have this simple binary arithmetic coder in C++ by Mahoney, 
translated to D by Maffi. I added "nothrow", "final" and "pure" 
and "GC.disable" where possible, but that didn't make much 
difference. Adding "const" to Predictor.p() (as in the C++ 
version) gave 3% higher performance. Here are the two versions:

http://mattmahoney.net/dc/  <-- original zip

http://pastebin.com/55x9dT9C  <-- Original C++ version.
http://pastebin.com/TYT7XdwX  <-- Modified D translation.

The problem is that the D version is 50% slower:

test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

Lang| Comp  | Binary size | Time (lower is better)
C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s


The only difference I could see between the C++ and D versions is 
that C++ has hints to the compiler about which functions to 
inline, and I couldn't find anything similar in D. So I manually 
inlined the encode and decode functions:

http://pastebin.com/N4nuyVMh  - Manual inline

D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s

Still, the D version is slower. What makes this speed difference? 
Is there any way to side-step it?

Note that this simple C++ version can be made more than 2 times 
faster with algorithmic and I/O optimizations, (ab)using 
templates, etc. So I'm not asking for generic speed 
optimizations, but only for things that may make the D code 
"more equal" to the C++ code.
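For readers without the pastebins handy, the hot path under discussion looks roughly like this -- a sketch reconstructed from the expressions quoted elsewhere in the thread (ct, cxt and the 65534/512 constants); the exact pastebin code may differ:

```d
// Sketch of the fpaq0 Predictor in D with the qualifiers discussed
// above. Names (ct, cxt) and the update logic follow the expressions
// quoted elsewhere in this thread; treat the details as approximate.
final class Predictor
{
    private uint cxt = 1;        // current bit context
    private ushort[2][512] ct;   // 0/1 bit counts per context

    // 'const' mirrors the C++ version; since it reads member state,
    // this is only weakly pure (see the purity discussion below).
    final int p() const pure nothrow
    {
        return 4096 * (ct[cxt][1] + 1) / (ct[cxt][0] + ct[cxt][1] + 2);
    }

    final void update(int y) pure nothrow
    {
        if (++ct[cxt][y] > 65534)   // halve both counts on overflow
        {
            ct[cxt][0] >>= 1;
            ct[cxt][1] >>= 1;
        }
        if ((cxt += cxt + y) >= 512) // shift the bit into the context
            cxt = 1;
    }
}
```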
Apr 14 2012
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 14/04/12 21:05, ReneSac wrote:
 Lang| Comp | Binary size | Time (lower is better)
 C++ (g++) - 13kb - 2.42s (100%) -O3 -s
 D (DMD) - 230kb - 4.46s (184%) -O -release -inline
 D (GDC) - 1322kb - 3.69s (152%) -O3 -frelease -s

Try using extra optimizations for GDC. Actually, GDC has a "dmd-like" interface, gdmd, and

    gdmd -O -release -inline

corresponds to

    gdc -O3 -fweb -frelease -finline-functions

... so there may be some optimizations you were missing. (If you call gdmd with the -vdmd flag, it will tell you exactly what gdc statement it's using.)
 The only diference I could see between the C++ and D versions is that C++ has
 hints to the compiler about which functions to inline, and I could't find
 anything similar in D. So I manually inlined the encode and decode functions:

GDC has all the regular gcc optimization flags available IIRC. The ones on the GDC man page are just the ones specific to GDC.
 Still, the D version is slower. What makes this speed diference? Is there any
 way to side-step this?

In my (brief and limited) experience, GDC-produced executables tend to have a noticeable but minor gap compared to equivalent g++-compiled C++ code -- nothing on the order of 150%.

E.g. I have some simulation code which models a reputation system where users rate objects and are then in turn judged on the consistency of their ratings with the general consensus. A simulation with 1000 users and 1000 objects takes ~22s to run in C++, ~24s in D compiled with gdmd -O -release -inline. Scale that up to 2000 users and 1000 objects and it's 47s (C++) vs 53s (D). 2000 users and 2000 objects gives 1min 49s (C++) and 2min 4s (D).

So, it's a gap, but not one to write home about really, especially when you count that D is safer and (I think) easier/faster to program in.

It's true that DMD is much slower -- the GCC backend is much better at generating fast code. If I recall right, the DMD backend's encoding of floating point operations is considerably less sophisticated.
 Note that this simple C++ version can be made more than 2 times faster with
 algoritimical and io optimizations, (ab)using templates, etc. So I'm not asking
 for generic speed optimizations, but only things that may make the D code "more
 equal" to the C++ code.

I'm sure you can make various improvements to your D code in a similar way, and there are some things that improve in D when written in idiomatic "D style" as opposed to a more C++ish way of doing things (e.g. if you want to copy one vector to another, as happens in my code, write x[] = y[] instead of doing any kind of loop).

Best wishes,

    -- Joe
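For instance, the slice-copy idiom mentioned above looks like this (a minimal illustrative sketch, not taken from the coder's source):

```d
import std.stdio;

void main()
{
    double[] y = [1.5, 2.5, 3.5];
    auto x = new double[](y.length);

    x[] = y[];      // element-wise copy, no explicit loop needed
    x[] *= 2.0;     // array-wise operations use the same syntax

    writeln(x);     // prints [3, 5, 7]
}
```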
Apr 14 2012
prev sibling next sibling parent reply "q66" <quaker66 gmail.com> writes:
On Saturday, 14 April 2012 at 19:05:40 UTC, ReneSac wrote:
 I have this simple binary arithmetic coder in C++ by Mahoney 
 and translated to D by Maffi. [...]

 Still, the D version is slower. What makes this speed 
 difference? Is there any way to side-step this?

I wrote a version based on the C++ one: http://codepad.org/phpLP7cx

My commands used to compile:

    g++46 -O3 -s fpaq0.cpp -o fpaq0cpp
    dmd -O -release -inline -noboundscheck fpaq0.d

G++ 4.6, dmd 2.059. I did 5 tests for each:

    test.fpaq0 (34603008 bytes) -> test.bmp (34610367 bytes)

The C++ average result was 9.99 seconds (varying from 9.98 to 10.01).
The D average result was 12.00 seconds (varying from 11.98 to 12.01).

That means there is a 16.8 percent difference in performance that would be cleared out by usage of gdc (which I don't have around currently).
Apr 14 2012
parent reply Somedude <lovelydear mailmetrash.com> writes:
On 14/04/2012 21:53, q66 wrote:
 On Saturday, 14 April 2012 at 19:05:40 UTC, ReneSac wrote:
 [...]

I wrote a version based on the C++ one: http://codepad.org/phpLP7cx

My commands used to compile:

    g++46 -O3 -s fpaq0.cpp -o fpaq0cpp
    dmd -O -release -inline -noboundscheck fpaq0.d

G++ 4.6, dmd 2.059. I did 5 tests for each:

    test.fpaq0 (34603008 bytes) -> test.bmp (34610367 bytes)

The C++ average result was 9.99 seconds (varying from 9.98 to 10.01).
The D average result was 12.00 seconds (varying from 11.98 to 12.01).

That means there is a 16.8 percent difference in performance that would be cleared out by usage of gdc (which I don't have around currently).

The code is nearly identical (there is a slight difference in update(), where he accesses the array once more than you), but the main difference I see is the -noboundscheck compilation option on DMD.
Apr 14 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15.04.2012 12:29, Joseph Rushton Wakeling wrote:
 On 15/04/12 09:23, ReneSac wrote:
 What really amazes me is the difference between g++, DMD and GDC in
 size of
 the executable binary. 100 orders of magnitude!

world". I need to update there, now that I got DMD working. BTW, it is 2 orders of magnitude.

Ack. Yes, 2 orders of magnitude, 100 _times_ larger. That'll teach me to send comments to a mailing list at an hour of the morning so late that it can be called early. ;-)
 Sounds stupid as the C stuff should be fastest, but I've been surprised
 sometimes at how using idiomatic D formulations can improve things.

Well, it may indeed be faster, especially the IO that is dependent on things like buffering and so on. But for this I just wanted something as close to the C++ code as possible.

Fair dos. Seems worth trying the idiomatic alternatives before assuming the speed difference is always going to be so great, though.
 Yeah, I don't know. I just did just throw those qualifiers against the
 compiler, and saw what sticks. And I was testing the decode speed
 specially to see easily if the output was corrupted. But maybe it
 haven't corrupted because the compiler don't optimize based on "pure"
 yet... there was no speed difference too.. so...

I think it's because although there's mutability outside the function, those variables are still internal to the struct. E.g. if you try and compile this code:

    int k = 23;

    pure int twotimes(int a)
    {
        auto b = 2*a;
        auto c = 2*k;
        return b+c;
    }

... it'll fail, but if you try and compile,

    struct TwoTimes
    {
        int k = 23;

        pure int twotimes(int a)
        {
            auto b = 2*a;
            auto c = 2*k;
            return b+c;
        }
    }

Simple - it's called weakly pure (it can modify 'this', for instance). It's allowed so that you can do:

    pure int f()
    {
        TwoTimes t;
        return t.twotimes(1);
    }

which is obviously normal/strongly pure. Of course, the compiler knows what kind of purity each function has.
 ... the compiler accepts it. Whether that's because it's acceptably
 pure, or because the compiler just doesn't detect this case of impurity,
 is another matter. The int k is certainly mutable from outside the scope
 of the function, so AFAICS it _should_ be disallowed.

-- Dmitry Olshansky
Apr 15 2012
prev sibling next sibling parent reply Somedude <lovelydear mailmetrash.com> writes:
On 15/04/2012 09:23, ReneSac wrote:
 On Sunday, 15 April 2012 at 02:56:21 UTC, Joseph Rushton Wakeling wrote:
 On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton Wakeling
 wrote:
 GDC has all the regular gcc optimization flags available IIRC. The




I notice the 2D array is declared

    int ct[512][2]; // C++
    int ct[2][512]; // D

Any particular reason for this? Might it impede cache utilization?

As for the executable size, the D compiler links druntime and the standard library statically into the binary, while I believe g++ links them dynamically.
Apr 15 2012
parent reply Somedude <lovelydear mailmetrash.com> writes:
On 15/04/2012 23:33, Ashish Myles wrote:
 On Sun, Apr 15, 2012 at 5:16 PM, Somedude <lovelydear mailmetrash.com> wrote:
 On 15/04/2012 09:23, ReneSac wrote:
 On Sunday, 15 April 2012 at 02:56:21 UTC, Joseph Rushton Wakeling wrote:
 On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton Wakeling
 wrote:
 GDC has all the regular gcc optimization flags available IIRC. The




I notice the 2D array is declared int ct[512][2]; // C++ int ct[2][512]; // D

Not quite. It is declared

    int[2][512] ct;  // D syntax, not valid in C++

which is the same as

    int ct[512][2];  // C syntax, also valid in C++

That's because you look at it as (int[2])[512]: a 512-sized array of 2-sized arrays of ints.
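A small sketch of the equivalence, since D's array types read right to left (illustrative, not from the fpaq0 sources):

```d
void main()
{
    int[2][512] ct;                      // read as (int[2])[512]

    static assert(ct.length == 512);     // 512 outer elements...
    static assert(ct[0].length == 2);    // ...each of them an int[2]

    ct[511][1] = 42;                     // index order [outer][inner], as in C
}
```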

Oh right, sorry for this. It's a bit confusing.
Apr 15 2012
parent Somedude <lovelydear mailmetrash.com> writes:
On 15/04/2012 23:41, Somedude wrote:
 On 15/04/2012 23:33, Ashish Myles wrote:
 On Sun, Apr 15, 2012 at 5:16 PM, Somedude <lovelydear mailmetrash.com> wrote:

Oh right, sorry for this. It's a bit confusing.

Now, apart from comparing the generated asm, I don't see what else to check.
Apr 15 2012
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 04/15/2012 02:23 PM, Kevin Cox wrote:
 On Apr 15, 2012 4:30 AM, "Joseph Rushton Wakeling"
 <joseph.wakeling webdrake.net <mailto:joseph.wakeling webdrake.net>> wrote:
  > ... the compiler accepts it.  Whether that's because it's acceptably
 pure, or because the compiler just doesn't detect this case of impurity,
 is another matter.  The int k is certainly mutable from outside the
 scope of the function, so AFAICS it _should_ be disallowed.

 As far as I understand, pure includes the "hidden this parameter", so
 it is pure if, when you call it on the same structure with the same
 arguments, you always get the same results. Although that seems pretty
 useless from an optimization standpoint

It is useful, because the guarantees it gives still are quite strong. Besides, mutation of the receiver object can be explicitly disabled by marking the method const or immutable; there is no reason why 'pure' should imply 'const'.
 because the function can modify
 its own object between calls.

    S foo(int a) pure
    {
        S x;
        x.foo(a);
        x.bar(a);
        return x;
    }
Apr 15 2012
prev sibling next sibling parent "q66" <quaker66 gmail.com> writes:
Forgot to mention specs: Dualcore Athlon II X2 240 (2.8GHz), 4GB 
RAM, FreeBSD 9 x64, both compilers are 64bit.
Apr 14 2012
prev sibling next sibling parent "q66" <quaker66 gmail.com> writes:
On Saturday, 14 April 2012 at 20:58:01 UTC, Somedude wrote:
 On 14/04/2012 21:53, q66 wrote:
 On Saturday, 14 April 2012 at 19:05:40 UTC, ReneSac wrote:
 [...]

 I wrote a version http://codepad.org/phpLP7cx based on the C++ one. 
 [...] That means there is a 16.8 percent difference in performance 
 that would be cleared out by usage of gdc (which I don't have around 
 currently).

The code is nearly identical (there is a slight difference in update(), where he accesses the array once more than you), but the main difference I see is the -noboundscheck compilation option on DMD.

He also uses a class. And -noboundscheck should be automatically induced by -release.
Apr 14 2012
prev sibling next sibling parent "ReneSac" <reneduani yahoo.com.br> writes:
I tested the q66 version on my computer (Sandy Bridge @ 4.3GHz). 
Repeating the old timings here; the new results are marked 
as "D-v2":

test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

Lang| Comp  | Binary size | Time (lower is better)
C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s
D-v2 (DMD)  -     206kb   -  4.50s  (186%)   -O -release -inline
D-v2 (GDC)  -     852kb   -  3.65s  (151%)   -O3 -frelease -s

So, basically the same thing... Not using classes seems a little 
slower on DMD, and makes no difference on GDC. The "if (++ct[cxt][y] > 
65534)" made a very small but measurable difference (those .04s 
in GDC). The "if ((cxt += cxt + y) >= 512)" only made the code 
more complicated, with no speed benefit.

But the input file is also important. The file you tested seems 
to be an already compressed one, or something not very 
compressible. Here is a test with an incompressible file:

pnad9huff.fpaq0 (43443040 bytes) -> test-d.huff (43617049 bytes)

C++  (g++)  -      13kb   -  5.13   (100%)   -O3 -s
D-v2 (DMD)  -     206kb   -  8.03   (156%)   -O -release -inline
D-v2 (GDC)  -     852kb   -  7.09   (138%)   -O3 -frelease -s
D-inl(DMD)  -     228kb   -  6.93   (135%)   -O -release -inline
D-inl(GDC)  -    1318kb   -  6.86   (134%)   -O3 -frelease -s

The C++ advantage becomes smaller with this file. D-inl is my 
manual-inline version, with your small optimization to 
"Predictor.Update()".

On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton 
Wakeling wrote:
 GDC has all the regular gcc optimization flags available IIRC.  
 The ones on the GDC man page are just the ones specific to GDC.

the C++ source code. I saw some discussion about "@inline" but it seems not implemented (yet?). Well, that is not a priority for D anyway.

About compiler optimizations, -finline-functions and -fweb are part of -O3. I tried to compile with -noboundscheck, but it made no difference for DMD and GDC. It probably is part of -release, as q66 said.
Apr 14 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, April 15, 2012 03:51:59 ReneSac wrote:
 About compiler optimizations, -finline-functions and -fweb are
 part of -O3. I tried to compile with -no-bounds-check, but made
 no diference for DMD and GDC. It probably is part of -release as
 q66 said.

Not quite. -noboundscheck turns off _all_ array bounds checking, whereas -release turns it off in @system and @trusted functions, but not in @safe functions. But unless you've marked your code with @safe, or it uses templated functions which get inferred as @safe, all of your functions are going to be @system functions anyway, in which case it makes no difference.

- Jonathan M Davis
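In other words (a hedged sketch; the function names are made up for illustration):

```d
// dmd -release               : bounds check removed in sumSystem,
//                              but kept in sumSafe
// dmd -release -noboundscheck: bounds checks removed in both
int sumSystem(int[] a) @system
{
    int s = 0;
    foreach (i; 0 .. a.length)
        s += a[i];          // check removed by -release
    return s;
}

int sumSafe(int[] a) @safe
{
    int s = 0;
    foreach (i; 0 .. a.length)
        s += a[i];          // check survives -release, not -noboundscheck
    return s;
}
```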
Apr 14 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be automatically induced by
 -release.

Ahh, THAT probably explains why some of my numerical code is so markedly different in speed when compiled using DMD with or without the -release switch. It's a MAJOR difference -- between code taking say 5min to run, compared to half an hour or more.
Apr 14 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be automatically induced by
 -release.

... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?
Apr 14 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, April 15, 2012 04:21:09 Joseph Rushton Wakeling wrote:
 On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be automatically induced
 by
 -release.

... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?

In theory. If they don't override anything, then that signals to the compiler that they don't need to be virtual, in which case they _shouldn't_ be virtual, but that's up to the compiler to optimize, and I don't know how good it is about that right now. Certainly, if you had code like

    class C
    {
        final int foo() { return 42; }
    }

and benchmarking showed that it was the same speed as

    class C
    {
        int foo() { return 42; }
    }

when compiled with -O and -inline, then I'd submit a bug report (maybe an enhancement request?) on the compiler failing to make final functions non-virtual.

Also, if the function is doing enough work, then whether it's virtual or not really doesn't make any difference, because the function itself costs so much more than the extra cost of the virtual function call.

- Jonathan M Davis
Apr 14 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, April 14, 2012 19:31:40 Jonathan M Davis wrote:
 On Sunday, April 15, 2012 04:21:09 Joseph Rushton Wakeling wrote:
 On 14/04/12 23:03, q66 wrote:
 [...]

 ... but the methods are marked as final -- shouldn't that substantially
 reduce any speed hit from using class instead of struct?

In theory. [...] then I'd submit a bug report (maybe an enhancement request?) on the compiler failing to make final functions non-virtual.

Actually, if you try and benchmark it, make sure that the code can't know that the reference is exactly a C. In theory, the compiler could be smart enough to know in a case such as

    auto c = new C;
    auto a = c.foo();

that c is exactly a C and that therefore it can just inline the call to foo even if it's virtual. If c is set from another function, it can't do that, e.g.

    auto c = bar();
    auto a = c.foo();

The compiler _probably_ isn't that smart, but it might be, so you'd have to be careful about that.

- Jonathan M Davis
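A sketch of the kind of harness that avoids this (names are illustrative; timing code omitted):

```d
class C
{
    final int foo() { return 42; }  // the call we want to measure
}

// Obtaining the object through a separate function hides its exact
// dynamic type from simple local analysis, so the compiler can't
// trivially devirtualize or constant-fold the call in main.
C bar()
{
    return new C;
}

void main()
{
    import std.stdio : writeln;

    auto c = bar();
    long sum = 0;
    foreach (i; 0 .. 10_000_000)
        sum += c.foo();             // time this loop, final vs. non-final

    writeln(sum);                   // use the result so the loop isn't elided
}
```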
Apr 14 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
On Sunday, 15 April 2012 at 02:20:34 UTC, Joseph Rushton Wakeling 
wrote:
 On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be 
 automatically induced by
 -release.

Ahh, THAT probably explains why some of my numerical code is so markedly different in speed when compiled using DMD with or without the -release switch. It's a MAJOR difference -- between code taking say 5min to run, compared to half an hour or more.

I know this isn't what your post was about, but you really should compile numerical code with GDC instead of DMD if you care about performance. It generates much faster floating point code.
Apr 14 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 15/04/12 04:37, jerro wrote:
 I know this isn't what your post was about, but you really should compile
 numerical code with GDC instead of DMD if you care about performance. It
 generates much faster floating point code.

It's exactly what I do. :-)
Apr 14 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
 On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton Wakeling wrote:
 GDC has all the regular gcc optimization flags available IIRC. The ones on the
 GDC man page are just the ones specific to GDC.

code. I saw some discussion about " inline" but it seems not implemented (yet?). Well, that is not a priority for D anyway. About compiler optimizations, -finline-functions and -fweb are part of -O3. I tried to compile with -no-bounds-check, but made no diference for DMD and GDC. It probably is part of -release as q66 said.

Ah yes, you're right.

I do wonder if your seeming speed differences are magnified because the whole operation is only 2-4 seconds long: if your algorithm were operating over a longer timeframe I think you'd likely find the relative speed differences decrease. (I have a memory of a program that took ~0.004s with a C/C++ version and 1s with D, and the difference seemed to be just startup time for the D program.)

What really amazes me is the difference between g++, DMD and GDC in size of the executable binary. 100 orders of magnitude!

3 remarks about the D code. One is that much of it still seems very "C-ish"; I'd be interested to see how speed and executable size differ if things like the file opening, or the reading of characters, are done with more idiomatic D code. Sounds stupid as the C stuff should be fastest, but I've been surprised sometimes at how using idiomatic D formulations can improve things.

Second remark is more of a query -- can Predictor.p() and .update really be marked as pure? Their result for a given input actually varies depending on the current values of cxt and ct, which are modified outside of function scope.

Third remark -- again a query -- why the GC.disable ... ?
Apr 14 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
(I have a memory of a program that took
 ~0.004s with a C/C++ version and 1s with D, and the difference 
 seemed to be just startup time for the D program.)

I have never seen anything like that. Usually the minimal time to run a D program is something like:

    j@debian:~$ time ./hello
    Hello world!

    real    0m0.001s
    user    0m0.000s
    sys     0m0.000s
 What really amazes me is the difference between g++, DMD and 
 GDC in size of the executable binary.  100 orders of magnitude!

With GDC, these flags (for gdmd):

    -fdata-sections -ffunction-sections -L--gc-sections -L-l

help a lot if you want to reduce the size of the executable. Besides, this overhead is the standard library and runtime, and it won't be much larger in larger programs.
Apr 14 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
On Sunday, 15 April 2012 at 03:41:55 UTC, jerro wrote:
 [...]

 With GDC those flags(for gdmd): -fdata-sections -ffunction-sections
 -L--gc-sections -L-l help a lot if you want to reduce a size of
 executable. [...]

The last flag should be -L-s
Apr 14 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 15/04/12 05:41, jerro wrote:
 I have never seen anything like that. Usually the minimal time to run a D
 program is something like:

 j debian:~$ time ./hello
 Hello world!

 real 0m0.001s
 user 0m0.000s
 sys 0m0.000s

Yea, my experience too in general. I can't remember exactly what I was testing, but if it's what I think it was (and have just retested :-), the difference may have been less pronounced (maybe 0.080s for D compared to 0.004s for C++) and that would have been due to not enabling optimizations for D.

I have another pair of stupid D-vs.-C++ speed-test files where, with optimizations engaged, D beats C++: the dominant factor is lots of output to console, so I guess this is D's writeln() beating C++'s cout.
 What really amazes me is the difference between g++, DMD and GDC in size of
 the executable binary. 100 orders of magnitude!

With GDC those flags(for gdmd): -fdata-sections -ffunction-sections -L--gc-sections -L-l help a lot if you want to reduce a size of executable. Besides, this overhead is the standard library and runtime and won't be much larger in larger programs.

Ahh! I hadn't realized that the libphobos package on Ubuntu didn't install a compiled version of the library. (DMD does.)
Apr 14 2012
prev sibling next sibling parent "ReneSac" <reneduani yahoo.com.br> writes:
On Sunday, 15 April 2012 at 02:56:21 UTC, Joseph Rushton Wakeling 
wrote:
 [...]

Ah yes, you're right. I do wonder if your seeming speed differences are magnified because the whole operation is only 2-4 seconds long: if your algorithm were operating over a longer timeframe I think you'd likely find the relative speed differences decrease. (I have a memory of a program that took ~0.004s with a C/C++ version and 1s with D, and the difference seemed to be just startup time for the D program.)

1.2MB compressible:
    encode: C++: 0.11s (100%)   D-inl: 0.14s (127%)
    decode: C++: 0.12s (100%)   D-inl: 0.16s (133%)

~200MB compressible:
    encode: C++: 17.2s (100%)   D-inl: 21.5s (125%)
    decode: C++: 16.3s (100%)   D-inl: 24.5s (150%)

3.8GB, barely compressible:
    encode: C++: 412s  (100%)   D-inl: 512s  (124%)
 What really amazes me is the difference between g++, DMD and 
 GDC in size of the executable binary.  100 orders of magnitude!

world". I need to update there, now that I got DMD working. BTW, it is 2 orders of magnitude.
 3 remarks about the D code.  One is that much of it still seems 
 very "C-ish"; I'd be interested to see how speed and executable 
 size differ if things like the file opening, or the reading of 
 characters, are done with more idiomatic D code.
  Sounds stupid as the C stuff should be fastest, but I've been 
 surprised sometimes at how using idiomatic D formulations can 
 improve things.

Well, it may indeed be faster, especially the IO, which is dependent on things like buffering and so on. But for this I just wanted something as close to the C++ code as possible.
 Second remark is more of a query -- can Predictor.p() and 
 .update really be marked as pure?  Their result for a given 
 input actually varies depending on the current values of cxt 
 and ct, which are modified outside of function scope.

against the compiler, and saw what stuck. And I was testing the decode speed specially, to easily see if the output was corrupted. But maybe it wasn't corrupted because the compiler doesn't optimize based on "pure" yet... there was no speed difference either... so...
 Third remark -- again a query -- why the GC.disable ... ?

GC. It didn't make any speed difference either, and it is indeed a bad idea.
Apr 15 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 15/04/12 09:23, ReneSac wrote:
 What really amazes me is the difference between g++, DMD and GDC in size of
 the executable binary. 100 orders of magnitude!

to update there, now that I got DMD working. BTW, it is 2 orders of magnitude.

Ack. Yes, 2 orders of magnitude, 100 _times_ larger. That'll teach me to send comments to a mailing list at an hour of the morning so late that it can be called early. ;-)
 Sounds stupid as the C stuff should be fastest, but I've been surprised
 sometimes at how using idiomatic D formulations can improve things.

Well, it may indeed be faster, especially the IO, which is dependent on things like buffering and so on. But for this I just wanted something as close to the C++ code as possible.

Fair dos. Seems worth trying the idiomatic alternatives before assuming the speed difference is always going to be so great, though.
 Yeah, I don't know. I just did just throw those qualifiers against the
compiler,
 and saw what sticks. And I was testing the decode speed specially to see easily
 if the output was corrupted. But maybe it haven't corrupted because the
compiler
 don't optimize based on "pure" yet... there was no speed difference too.. so...

I think it's because although there's mutability outside the function, those variables are still internal to the struct. e.g. if you try and compile this code:

    int k = 23;

    pure int twotimes(int a)
    {
        auto b = 2*a;
        auto c = 2*k;
        return b+c;
    }

... it'll fail, but if you try and compile,

    struct TwoTimes
    {
        int k = 23;

        pure int twotimes(int a)
        {
            auto b = 2*a;
            auto c = 2*k;
            return b+c;
        }
    }

... the compiler accepts it. Whether that's because it's acceptably pure, or because the compiler just doesn't detect this case of impurity, is another matter. The int k is certainly mutable from outside the scope of the function, so AFAICS it _should_ be disallowed.
Apr 15 2012
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 15/04/12 10:29, Joseph Rushton Wakeling wrote:
 ... the compiler accepts it. Whether that's because it's acceptably pure, or
 because the compiler just doesn't detect this case of impurity, is another
 matter. The int k is certainly mutable from outside the scope of the function,
 so AFAICS it _should_ be disallowed.

Meant to say: outside the scope of the struct.
Apr 15 2012
prev sibling next sibling parent Kevin Cox <kevincox.ca gmail.com> writes:
On Apr 15, 2012 4:30 AM, "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> wrote:
 ... the compiler accepts it.  Whether that's because it's acceptably

 pure, or because the compiler just doesn't detect this case of impurity, is another matter. The int k is certainly mutable from outside the scope of the function, so AFAICS it _should_ be disallowed.

As far as I understand, pure includes the "hidden this parameter", so it is pure if, when you call it on the same structure with the same arguments, you always get the same results. Although that seems pretty useless from an optimization standpoint, because the function can modify its own object between calls.
Apr 15 2012
prev sibling next sibling parent Ashish Myles <marcianx gmail.com> writes:
On Sun, Apr 15, 2012 at 5:16 PM, Somedude <lovelydear mailmetrash.com> wrote:
 On 15/04/2012 09:23, ReneSac wrote:
 On Sunday, 15 April 2012 at 02:56:21 UTC, Joseph Rushton Wakeling wrote:
 On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton Wakeling
 wrote:
 GDC has all the regular gcc optimization flags available IIRC. The




I notice the 2D array is declared

int ct[512][2]; // C++
int ct[2][512]; // D

Not quite. It is declared

    int[2][512] ct; // D; not valid in C++

which is the same as

    int ct[512][2]; // also valid in C++

That's because you look at it as (int[2])[512]: a 512-sized array of 2-sized arrays of ints.
Apr 15 2012
prev sibling next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
Are you on linux/windows/mac?

On Saturday, 14 April 2012 at 19:05:40 UTC, ReneSac wrote:
 I have this simple binary arithmetic coder in C++ by Mahoney 
 and translated to D by Maffi. I added "notrow", "final" and 
 "pure"  and "GC.disable" where it was possible, but that didn't 
 made much difference. Adding "const" to the Predictor.p() (as 
 in the C++ version) gave 3% higher performance. Here the two 
 versions:

 http://mattmahoney.net/dc/  <-- original zip

 http://pastebin.com/55x9dT9C  <-- Original C++ version.
 http://pastebin.com/TYT7XdwX  <-- Modified D translation.

 The problem is that the D version is 50% slower:

 test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

 Lang| Comp  | Binary size | Time (lower is better)
 C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
 D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
 D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s


 The only diference I could see between the C++ and D versions 
 is that C++ has hints to the compiler about which functions to 
 inline, and I could't find anything similar in D. So I manually 
 inlined the encode and decode functions:

 http://pastebin.com/N4nuyVMh  - Manual inline

 D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
 D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s

 Still, the D version is slower. What makes this speed 
 diference? Is there any way to side-step this?

 Note that this simple C++ version can be made more than 2 times 
 faster with algoritimical and io optimizations, (ab)using 
 templates, etc. So I'm not asking for generic speed 
 optimizations, but only things that may make the D code "more 
 equal" to the C++ code.

Apr 16 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 04/17/2012 12:24 AM, ReneSac wrote:
 On Monday, 16 April 2012 at 07:28:25 UTC, Andrea Fontana wrote:
 Are you on linux/windows/mac?

Windows.

DMC runtime !
 My main question is now *WHY* D is slower than C++ in this program? The
 code is identical (even the same C functions)

No. They are not the same. The performance difference is probably explained by the dmc runtime vs. glibc difference, because your biased results are not reproducible on a linux system where glibc is used for both versions.
 in the
 performance-critical parts, I'm using the "same" compiler backend
 (gdc/g++), and D was supposed to be a fast compilable language.

s/was/is/
 Yet it is up to 50% slower.

This is a fallacy. Benchmarks can only compare implementations, not languages. Furthermore, it is usually the case that benchmarks that have surprising results don't measure what they intend to measure. Your program is apparently rather I/O bound.
 What is D doing more than C++ in this program, that accounts for the
 lost CPU cycles?
 Or what prevents the D program to be optimized to the
 C++ level? The D front-end?

The difference is likely because of differences in external C libraries.
Apr 16 2012
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 14 Apr 2012 22:31:40 -0400, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Sunday, April 15, 2012 04:21:09 Joseph Rushton Wakeling wrote:
 On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be automatically  

 by
 -release.

... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?

In theory. If they don't override anything, then that signals to the compiler that they don't need to be virtual, in which case, they _shouldn't_ be virtual, but that's up to the compiler to optimize, and I don't know how good it is about that right now.

You are misunderstanding something. Final functions can be in the vtable, and still not be called via the vtable. i.e.:

    class C
    {
        int foo() { return 1; }
    }

    class D : C
    {
        override final int foo() { return 2; }
    }

    void main()
    {
        auto d = new D;
        C c = d;
        assert(d.foo() == 2); // non-virtual, inline-able call.
        assert(c.foo() == 2); // virtual call
    }

Disclaimer -- I haven't examined any of the code being discussed or the issues contained in this thread. I just wanted to point out this misunderstanding.

-Steve
Apr 16 2012
prev sibling next sibling parent "ReneSac" <reneduani yahoo.com.br> writes:
On Monday, 16 April 2012 at 07:28:25 UTC, Andrea Fontana wrote:
 Are you on linux/windows/mac?

Windows.

My main question is now *WHY* D is slower than C++ in this program? The code is identical (even the same C functions) in the performance-critical parts, I'm using the "same" compiler backend (gdc/g++), and D was supposed to be a fast compilable language. Yet it is up to 50% slower.

What is D doing more than C++ in this program that accounts for the lost CPU cycles? Or what prevents the D program from being optimized to the C++ level? The D front-end?
Apr 16 2012
prev sibling next sibling parent "ReneSac" <reneduani yahoo.com.br> writes:
On Monday, 16 April 2012 at 22:58:08 UTC, Timon Gehr wrote:
 On 04/17/2012 12:24 AM, ReneSac wrote:
 Windows.

DMC runtime !

that both, g++ and GDC compiled binaries, use the mingw runtime, but I'm not sure either.
 No. They are not the same. The performance difference is 
 probably explained by the dmc runtime vs. glibc difference, 
 because your biased results are not reproducible on a linux 
 system where glibc is used for both versions.

 This is a fallacy. Benchmarks can only compare implementations, 
 not languages. Furthermore, it is usually the case that 
 benchmarks that have surprising results don't measure what they 
 intend to measure. Your program is apparently rather I/O bound.

it may be the GDC front-end that is the "bottleneck". And I don't think it is I/O bound. It is only around 10MB/s, whereas my HD can do ~100MB/s. Furthermore, on more compressible files, where the speed was higher, the difference between D and C++ was higher too. And if it is in fact I/O bound, then D is MORE than 50% slower than C++.
 The difference is likely because of differences in external C 
 libraries.

difference?
Apr 16 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
On Tuesday, 17 April 2012 at 01:30:30 UTC, ReneSac wrote:
 On Monday, 16 April 2012 at 22:58:08 UTC, Timon Gehr wrote:
 On 04/17/2012 12:24 AM, ReneSac wrote:
 Windows.

DMC runtime !

that both, g++ and GDC compiled binaries, use the mingw runtime, but I'm not sure also.
 No. They are not the same. The performance difference is 
 probably explained by the dmc runtime vs. glibc difference, 
 because your biased results are not reproducible on a linux 
 system where glibc is used for both versions.

 This is a fallacy. Benchmarks can only compare 
 implementations, not languages. Furthermore, it is usually the 
 case that benchmarks that have surprising results don't 
 measure what they intend to measure. Your program is 
 apparently rather I/O bound.

that it may be the GDC front-end that may be the "bottleneck". And I don't think it is I/O bound. It is only around 10MB/s, whereas my HD can do ~100MB/s. Furthermore, on files more compressible, where the speed was higher, the difference between D and C++ was higher too. And if is in fact I/O bound, then D is MORE than 50% slower than C++.
 The difference is likely because of differences in external C 
 libraries.

the difference?

Have you tried profiling it? On Windows you can use AMD CodeAnalyst for that, it works pretty well in my experience and it's free of charge.
Apr 16 2012
prev sibling next sibling parent Oleg Kuporosov <oleg.kuporosov gmail.com> writes:
  DMC = Digital Mars Compiler? Does Mingw/GDC uses that? I think that

sure also.

You're right: only DMD uses the DMC environment; GDC uses MinGW's.
 And I don't think it is I/O bound. It is only around 10MB/s, whereas my HD
 can do ~100MB/s. Furthermore, on files more compressible, where the speed
 was higher, the difference between D and C++ was higher too. And if is in
 fact I/O bound, then D is MORE than 50% slower than C++.

To minimize system load and I/O impact, run the same file in a loop, like 5-10 times; it will be located in the kernel cache.
  The difference is likely because of differences in external C libraries.

difference? Probably because the gdc backend does a worse job of optimizing D AST vs C++ AST. I've got the following results:

C:\D>echo off
FF=a.doc
C++ compress a.doc
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.36 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.36 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.33 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.34 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.34 s.
"C++ decompress"
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.51 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.51 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.50 s.
"D compress"
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.11 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
"D decompress"
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.22 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.25 s.
"Done"

So, what's up? I used the same backend too, but DMC (-o -6) vs DMD (-release -inline -O -noboundscheck). I don't know DMC's optimization flags, so the results might be better for it.

Let's try to compile with MS CL (-Ox):

C:\D>echo off
FF=a.doc
C++ compress a.doc
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.03 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.02 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.04 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.03 s.
a.doc (2694428 bytes) -> a.doc.cmp (1459227 bytes) in 1.01 s.
"C++ decompress"
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.08 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.06 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
a.doc.cmp (1459227 bytes) -> a.doc.cmp.or (2694428 bytes) in 1.07 s.
"D compress"
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.08 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
a.doc (2694428 bytes) -> a.doc.dmp (1459227 bytes) in 1.09 s.
"D decompress"
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.15 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.19 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
a.doc.dmp (1459227 bytes) -> a.doc.dmp.or (2694428 bytes) in 1.17 s.
"Done"

Much better for C++, but D is not so much worse, at about 1.1x C++ too.

What we see: different compiler, different story. We should not compare languages for performance, but compilers! There are so many differences in compilers and environments, and definitely C++ is much more mature for performance now, but D also has its own benefits (faster development/debugging and more reliable code).

Thanks,
Oleg.
PS. BTW, this code can still be optimized quite a lot.
Apr 16 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 14 Apr 2012 19:31:40 -0700
schrieb Jonathan M Davis <jmdavisProg gmx.com>:

 On Sunday, April 15, 2012 04:21:09 Joseph Rushton Wakeling wrote:
 On 14/04/12 23:03, q66 wrote:
 He also uses a class. And -noboundscheck should be automatically induced
 by
 -release.

... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?

In theory. If they don't override anything, then that signals to the compiler that they don't need to be virtual, in which case, they _shouldn't_ be virtual, but that's up to the compiler to optimize, and I don't know how good it is about that right now.

<cynicism>
May I point to this: http://d.puremagic.com/issues/show_bug.cgi?id=7865
</cynicism>

-- 
Marco
Apr 24 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 14 Apr 2012 21:05:36 +0200
schrieb "ReneSac" <reneduani yahoo.com.br>:

 I have this simple binary arithmetic coder in C++ by Mahoney and 
 translated to D by Maffi. I added "notrow", "final" and "pure"  
 and "GC.disable" where it was possible, but that didn't made much 
 difference. Adding "const" to the Predictor.p() (as in the C++ 
 version) gave 3% higher performance. Here the two versions:
 
 http://mattmahoney.net/dc/  <-- original zip
 
 http://pastebin.com/55x9dT9C  <-- Original C++ version.
 http://pastebin.com/TYT7XdwX  <-- Modified D translation.
 
 The problem is that the D version is 50% slower:
 
 test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)
 
 Lang| Comp  | Binary size | Time (lower is better)
 C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
 D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
 D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s
 
 
 The only diference I could see between the C++ and D versions is 
 that C++ has hints to the compiler about which functions to 
 inline, and I could't find anything similar in D. So I manually 
 inlined the encode and decode functions:
 
 http://pastebin.com/N4nuyVMh  - Manual inline
 
 D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
 D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s
 
 Still, the D version is slower. What makes this speed diference? 
 Is there any way to side-step this?
 
 Note that this simple C++ version can be made more than 2 times 
 faster with algoritimical and io optimizations, (ab)using 
 templates, etc. So I'm not asking for generic speed 
 optimizations, but only things that may make the D code "more 
 equal" to the C++ code.

I noticed the thread just now. I ported fast paq8 (fp8) to D, and with some careful D-ification and optimization it runs a bit faster than the original C program when compiled with the GCC on Linux x86_64, Core 2 Duo. As others said, the files are cached in RAM anyway if there is enough available, so you should not be bound by your hard drive speed. I don't know about this version of paq you ported the coder from, but I'll try to give you some hints on what I did to optimize the code.

- Time portions of your main(): is the time actually spent at start-up or in the compression?

- Use structs where classes don't make your code cleaner.

- Wherever you have large arrays that you don't need initialized to .init, write:

    int[<large number>] arr = void;
    double[<large number>] arr = void;

  This disables default initialization, which may help you in inner loops. Remember that C++ doesn't default-initialize at all, so this is an obvious way to lose performance against that language. Also keep in mind that the .init for floating point types is NaN:

    struct Foo { double[999999] bar; }

  is not a block of binary zeroes and hence cannot be stored in a .bss section in the executable, where it would not take any space at all.

    struct Foo { double[999999] bar = void; }

  on the contrary will not bloat your executable by 7.6 MB! Be cautious with:

    class Foo { double[999999] bar = void; }

  Classes' .init don't go into .bss either way. Another reason to use a struct where appropriate. (WARNING: Usage of .bss on Linux/MacOS is currently broken in the compiler front-end. You'll only see the effect on Windows.)

- Mahoney used an Array class in my version of paq, which allocates via calloc. Do this as well; you can't win otherwise. Read up a bit on calloc if you want. It generally 'allocates' a special zeroed-out memory page multiple times. No matter how much memory you ask for, it won't really allocate anything until you *write* to it, at which point new memory is allocated for you and the zero-page is copied into it. The D GC, on the other hand, allocates that memory and writes zeroes to it immediately. The effect is twofold: first, the calloc version will use much less RAM if the 'allocated' buffers aren't fully used (e.g. you compressed a small file); second, the D GC version is slowed down by writing zeroes to all that memory. At high compression levels, paq8 uses ~2 GB of memory that is calloc'ed. You should _not_ try to use GC memory for that.

- If there are data structures that are designed to fit into a CPU cache line (I had one of those in paq8), make sure they still have the correct size in your D version. "static assert(Foo.sizeof == 64);" helped me find a bug there that resulted from switching from C bitfields to the D version (which is a library solution in Phobos).

I hope that gives you some ideas what to look for. Good luck!

-- 
Marco
Apr 24 2012
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Marco Leise:

 I ported fast paq8 (fp8) to D, and with some careful 
 D-ification and optimization it runs a bit faster than the 
 original C program when compiled with the GCC on Linux x86_64, 
 Core 2 Duo.

I guess you mean GDC. With DMD, even if you are a good D programmer, it's not easy to beat that original C compressor :-) Do you have a link to your D version? Matt Mahoney is probably willing to put a link on his site to your D version.
 I don't know about this version of paq you ported the coder 
 from,

It was a very basic coder.
   The D GC on the other hand allocates that memory and writes 
 zeroes to it immediately.

Is this always done the first time the memory is allocated by the GC?
   The effect is two fold: First, the calloc version will use 
 much less RAM, if
   the 'allocated' buffers aren't fully used (e.g. you 
 compressed a small file).

On the other hand in D you may allocate the memory in a more conscious way.
 "static assert(Foo.sizeof == 64);" helped me find a bug there 
 that
   resulted from switching from C bitfields to the D version 
 (which is a library
   solution in Phobos).

The Phobos D bitfields aren't required to mimic C, but that's an interesting case. Maybe it's an interesting difference to take a look at. Do you have the code of the two C/D versions?

Bye,
bearophile
Apr 24 2012