
digitalmars.D - D speed compared to C++

reply Matthew Allen <matt.allen removeme.creativelifestyles.com> writes:
I am looking to use D for programming a high-speed vision application which was
previously written in C/C++. I have done some arbitrary speed tests and am
finding that C/C++ seems to be faster than D by a factor of about 3. I have
done some simple loop tests that increment a float value by some number, and
also some memory allocation/deallocation loops, and C/C++ seems to come out
on top each time. Is D meant to be faster or as fast as C/C++, and if so, how
can I optimize the code? I am using -inline, -O, and -release.

An example of a simple loop test I ran is as follows:

DWORD start = timeGetTime();
	int i,j,k;
	float dx=0;
    for(i=0; i<1000;i++)
        for(j=0; j<1000;j++)
            for(k=0; k<10; k++)
                {
                     dx++;
                }
    DWORD end = timeGetTime();

In C++ I used ints and doubles. The C++ came back with a time of 15ms, and D
came back with 45ms.
Mar 18 2008
next sibling parent reply BCS <BCS pathlink.com> writes:
Matthew Allen wrote:
 
 An example of a simple loop test I ran is as follows:
 
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();
 
 In C++ int and doubles. The C++ came back with a time of 15ms, and D came back
with 45ms.

First of all, what C++ compiler? The best for testing would be DMC, as that removes the back-end differences. Second, how many test runs was that over? Third, try it with doubles (64-bit reals) in both programs, as the different conversions might be making a difference.

Another thing that might mask some stuff is start-up time. Try running the test loops in another loop and spit out sequential times. I have seen large (2x - 3x) differences in the first run of a test vs. later runs. This would also avoid random variables like the test code spanning a page boundary in one case and not in the other.

If you have done these things already then I don't know what's happening. /My/ next step would be to start looking at the ASM, but then again I'm known to be crazy.
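For illustration, the "run the test loops in another loop and spit out sequential times" idea might look like the sketch below. It uses portable std::chrono instead of the Windows timeGetTime from the original post, and all function names are mine:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// One pass of the poster's loop; returning the result keeps the
// optimizer from deleting the (otherwise effect-free) loop entirely.
float one_pass() {
    float dx = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            for (int k = 0; k < 10; k++)
                dx++;
    return dx;
}

// Time several sequential passes, so first-run effects (cold caches,
// page faults, lazy runtime setup) show up as an outlier in pass 0.
std::vector<double> time_passes(int n) {
    std::vector<double> ms;
    for (int run = 0; run < n; ++run) {
        auto t0 = std::chrono::steady_clock::now();
        volatile float sink = one_pass();   // volatile: the result must be stored
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        double elapsed = std::chrono::duration<double, std::milli>(t1 - t0).count();
        ms.push_back(elapsed);
        std::printf("pass %d: %.2f ms\n", run, elapsed);
    }
    return ms;
}
```

If the first printed pass is consistently 2x-3x slower than the rest, the difference is warm-up, not code generation.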
Mar 18 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from BCS (BCS pathlink.com)'s article
 Matthew Allen wrote:
 An example of a simple loop test I ran is as follows:

 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();

 In C++ int and doubles. The C++ came back with a time of 15ms, and D came back
with 45ms.

that removes the back end differences. Second how many test runs was that over? third, try it with doubles (64bit reals) in both programs as the different conversions might be making a difference. Another thing that might mask some stuff is start up time. Try running the test loops in another loop and spit out sequential times. I have seen large (2x - 3x) differences in the first run of a test vs. later runs.

D apps also have more going on in the application initialization phase than C++ apps. For a real apples-to-apples comparison, you might want to consider using Tango with the "stub" GC plugged in. That just calls malloc/free and has no initialization cost, at the expense of no actual garbage collection. I'll have to check whether the stub GC compiles with the latest Tango--it's been a while since I used it.

Sean
Mar 18 2008
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Sean Kelly wrote:
 == Quote from BCS (BCS pathlink.com)'s article
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();



 
 D apps also have more going on in the application initialization phase than
C++ apps.  For a real
 apples-apples comparison, you might want to consider using Tango with the
"stub" GC plugged in.
 That just calls malloc/free and has no initialization cost, at the expense of
no actual garbage collection.
 I'll have to check whether the stub GC compiles with the latest Tango--it's
been a while since I used it.

How is the startup time relevant, when he appears to be measuring in-process?
Mar 18 2008
next sibling parent BCS <BCS pathlink.com> writes:
Frits van Bommel wrote:
 DWORD start = timeGetTime();
     int i,j,k;
     float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();



How is the startup time relevant, when he appears to be measuring in-process?

The GC time is not, but cache priming and such can make a difference. I have actually worked on code like the above and seen a consistent and significant drop in the second pass's time.
Mar 18 2008
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Frits van Bommel (fvbommel REMwOVExCAPSs.nl)'s article
 Sean Kelly wrote:
 == Quote from BCS (BCS pathlink.com)'s article
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();



 D apps also have more going on in the application initialization phase than
C++ apps.  For a real
 apples-apples comparison, you might want to consider using Tango with the
"stub" GC plugged in.
 That just calls malloc/free and has no initialization cost, at the expense of
no actual garbage


 I'll have to check whether the stub GC compiles with the latest Tango--it's
been a while since I


 How is the startup time relevant, when he appears to be measuring
 in-process?

It's not. I only mentioned it because BCS mentioned startup time.

Sean
Mar 18 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Matthew Allen (matt.allen removeme.creativelifestyles.com)'s
article
 I am looking to use D for programming a high speed vision application which
was previously written in C/C++. I have done some arbitrary speed tests and am
finding that C/C++ seems to be faster than D by a factor of about 3. I have
done some simple loop tests that increment a float value by some number and
also some memory allocation/deallocation loops and C/C++ seems to come out on
top each time. Is D meant to be faster or as fast as C/C++ and if so how can I
optimize the code. I am using -inline, -O, and -release.
 An example of a simple loop test I ran is as follows:
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();
 In C++ int and doubles. The C++ came back with a time of 15ms, and D came back
with 45ms.

Are these tests with DMD vs. DMC, or GDC vs. GCC? If you're using different compilers for the C++ and D tests then you're really testing the code generator and optimizer more than the language. D code generated by DMD, for example, is notoriously slow at floating point operations, while the same code is much faster with GDC. This is an artifact of the Digital Mars back-end rather than the language itself.

Sean
Mar 18 2008
prev sibling next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Matthew Allen wrote:
 I am looking to use D for programming a high speed vision application which
was previously written in C/C++. I have done some arbitary speed tests and am
finding that C/C++ seems to be faster than D by a magnitude of about 3 times. I
have done some simple loop tests that increment a float value by some number
and also some memory allocation/deallocation loops and C/C++ seems to come out
on top each time. Is D meant to be faster or as fast as C/C++ and if so how can
I optimize the code. I am using -inline, -O, and -release. 
 
 An example of a simple loop test I ran is as follows:
 
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();
 
 In C++ int and doubles. The C++ came back with a time of 15ms, and D came back
with 45ms.

That's not a useful benchmark. G++ completely optimizes away the loop, leaving you timing how fast an empty piece of code runs...

However, after adding 'printf("%d", dx)' the generated code for D and C++ is virtually identical, as are the timings. At least on my machine and with my compilers (gdc and g++ on 64-bit Ubuntu).

If you're seeing different results it may just be a difference between your C++ and your D compiler; especially if they're not g++ and gdc or dmc and dmd, i.e. if they don't share the same backend.
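As a sketch of the dead-code-elimination point: a loop whose result is never observed can legally be deleted by the optimizer, so a sink is needed. Two common ones are shown below (using %f, the correct printf specifier for a floating-point value, rather than the %d from the post); the function names are mine:

```cpp
#include <cstdio>

// The benchmark body: without a sink, an optimizer may remove this
// loop entirely and the "benchmark" times an empty block.
float timed_loop() {
    float dx = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            for (int k = 0; k < 10; k++)
                dx++;
    return dx;
}

// Sink #1: make the result observable via I/O. A float passed through
// varargs is promoted to double, so %f is the right specifier.
void report(float dx) {
    std::printf("%f\n", (double)dx);
}

// Sink #2: force a store through a volatile object.
void keep(float dx) {
    volatile float sink = dx;
    (void)sink;
}
```

Either sink is enough; printf additionally costs a call per report, which is why a cheap dummy function (as BCS suggests below) is the lighter-weight check.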
Mar 18 2008
next sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Frits van Bommel wrote:
 Matthew Allen wrote:
     float dx=0;


 However, after adding 'printf("%d", dx)' the generated code for D and 

Oops, that shouldn't be "%d", should it? Well, it doesn't matter because I just put that in to keep the compiler from completely optimizing out the loop but that does explain why I get such weird output :).
Mar 18 2008
prev sibling parent reply BCS <BCS pathlink.com> writes:
Frits van Bommel wrote:

 However, after adding 'printf("%d", dx)' the generated code for D and 
 C++ is virtually identical, as are the timings.

printf is kind of a heavyweight function. How does it compare with some dummy function?
Mar 18 2008
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
BCS wrote:
 Frits van Bommel wrote:
 
 However, after adding 'printf("%d", dx)' the generated code for D and 
 C++ is virtually identical, as are the timings.

printf is kinda a heavy weight function. how does it compare with some dummy function?

It doesn't seem to make any difference, unless you count executable size. (I passed 'dx' to a separately-compiled empty C function instead of printf)
Mar 18 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Matthew Allen wrote:
 I am looking to use D for programming a high speed vision application
 which was previously written in C/C++. I have done some arbitary
 speed tests and am finding that C/C++ seems to be faster than D by a
 magnitude of about 3 times. I have done some simple loop tests that
 increment a float value by some number and also some memory
 allocation/deallocation loops and C/C++ seems to come out on top each
 time. Is D meant to be faster or as fast as C/C++ and if so how can I
 optimize the code. I am using -inline, -O, and -release.
 
 An example of a simple loop test I ran is as follows:
 
 DWORD start = timeGetTime(); int i,j,k; float dx=0; for(i=0;
 i<1000;i++) for(j=0; j<1000;j++) for(k=0; k<10; k++) { dx++; } DWORD
 end = timeGetTime();
 
 In C++ int and doubles. The C++ came back with a time of 15ms, and D
 came back with 45ms.

Loop unrolling could be a big issue here. DMD doesn't do loop unrolling, but that is not a language issue at all, it's an optimizer issue. It's easy enough to check - get the assembler output of the loop from your compiler and post it here.
Mar 18 2008
parent reply Matthew Allen <mallen creativelifestyles.removeme.com> writes:
Walter Bright Wrote:

 Matthew Allen wrote:
 I am looking to use D for programming a high speed vision application
 which was previously written in C/C++. I have done some arbitary
 speed tests and am finding that C/C++ seems to be faster than D by a
 magnitude of about 3 times. I have done some simple loop tests that
 increment a float value by some number and also some memory
 allocation/deallocation loops and C/C++ seems to come out on top each
 time. Is D meant to be faster or as fast as C/C++ and if so how can I
 optimize the code. I am using -inline, -O, and -release.
 
 An example of a simple loop test I ran is as follows:
 
 DWORD start = timeGetTime(); int i,j,k; float dx=0; for(i=0;
 i<1000;i++) for(j=0; j<1000;j++) for(k=0; k<10; k++) { dx++; } DWORD
 end = timeGetTime();
 
 In C++ int and doubles. The C++ came back with a time of 15ms, and D
 came back with 45ms.

Loop unrolling could be a big issue here. DMD doesn't do loop unrolling, but that is not a language issue at all, it's an optimizer issue. It's easy enough to check - get the assembler output of the loop from your compiler and post it here.

Walter you are right. That was the issue. I added a function call into the loop and D came out a lot faster than C++. Thanks for a great language!!!
Mar 19 2008
next sibling parent "Koroskin Denis" <2korden+dmd gmail.com> writes:
On Thu, 20 Mar 2008 00:44:02 +0300, Matthew Allen
<mallen creativelifestyles.removeme.com> wrote:

 Walter Bright Wrote:

 Matthew Allen wrote:
 I am looking to use D for programming a high speed vision application
 which was previously written in C/C++. I have done some arbitrary
 speed tests and am finding that C/C++ seems to be faster than D by a
 factor of about 3. I have done some simple loop tests that
 increment a float value by some number and also some memory
 allocation/deallocation loops and C/C++ seems to come out on top each
 time. Is D meant to be faster or as fast as C/C++ and if so how can I
 optimize the code. I am using -inline, -O, and -release.

 An example of a simple loop test I ran is as follows:

 DWORD start = timeGetTime(); int i,j,k; float dx=0; for(i=0;
 i<1000;i++) for(j=0; j<1000;j++) for(k=0; k<10; k++) { dx++; } DWORD
 end = timeGetTime();

 In C++ int and doubles. The C++ came back with a time of 15ms, and D
 came back with 45ms.

 Loop unrolling could be a big issue here. DMD doesn't do loop unrolling,
 but that is not a language issue at all, it's an optimizer issue. It's
 easy enough to check - get the assembler output of the loop from your
 compiler and post it here.

 Walter you are right. That was the issue. I added a function call into
 the loop and D came out a lot faster than C++.

 Thanks for a great language!!!

Obviously, you cannot say that D integer increment is faster or slower than that of C++; it just doesn't depend on the language design. You could compare DMD to GCC, and that would make about as much sense as comparing GCC with, say, ICC or DMC. The difference is a matter of the optimization techniques used by the compilers, not the languages. Surely, higher-level language design has some impact on general performance, like the GC or constant-time array slicing, but not on low-level variable increments, loops, etc. General D performance will only increase as more vendors produce commercial D compilers.
Mar 20 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Matthew Allen wrote:
 Walter you are right. That was the issue. I added a function call
 into the loop and D came out a lot faster than C++.
 
 Thanks for a great language!!!

You're welcome. To compare D vs C++ as languages, rather than the optimizers or code generators, the most straightforward way is to benchmark dmd vs dmc, and gdc vs gcc.
Mar 21 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Matthew Allen, the benchmarks on the Shootout site are flawed, but they are
probably less flawed than your benchmarks, so I suggest you take a look at
them. You can download them and try them on your PC (for example using an
Intel compiler for C++, etc.).

Bye,
bearophile
Mar 18 2008
next sibling parent reply Dan <murpsoft hotmail.com> writes:
1) Using a float or double as an incrementor in a tight loop is a bad idea.
Most compilers optimize it out where possible; so do Agner Fog and Paul
Hsieh. They know why this is true better than I do.

2) Most compilers optimize stuff out if it's not directly affecting output or
external functions or arguments.  This is usually done on a per-function level.
 A better optimizer would do it for the whole program.

3) Startup for D is slower even for hello world because D statically links the
entirety of phobos and the GC even if you don't ever use them.  This equates to
about 80kb of bloat - so it's still dramatically better than Java or C#, but
still not "correct".

4) If the GC does a collection cycle, it'll bump the time complexity.  This
will happen pseudo-randomly.

~~~

If you really want to improve performance on C or C++, do it by profiling your
program, and optimize parts where it matters how fast you go.

- simplify
- remove unnecessary loops
- hoist stuff out of loops as much as possible
- iterate or recurse in ways that ease cache miss penalties
- iterate instead of recurse as much as possible
- reduce if/else if/else || && as much as possible
- multiply by inverse instead of divide where possible
- reduce calls to the OS where it's sensible

If you need to go further, learn assembler.  D's inline one ain't half bad.
You can do things in assembler that you can't do in HLLs.  Things like ror,
rcl, sete, cmovcc, prefetchnta, clever XMMX usage and such.

/me is looking forward to when XMMX has byte-array functionality.   Could
outperform all x86-32 string stuff by an order of magnitude.
Mar 18 2008
next sibling parent reply "Saaa" <empty needmail.com> writes:
 If you really want to improve performance on C or C++, do it by profiling 
 your program, and optimize parts where it matters how fast you go.

 - simplify
 - remove unnecessary loops
 - hoist stuff out of loops as much as possible
 - iterate or recurse in ways that ease cache miss penalties

 - iterate instead of recurse as much as possible
 - reduce if/else if/else || && as much as possible

else?
 - multiply by inverse instead of divide where possible
 - reduce calls to the OS where it's sensible

- don't allocate and then delete if you need a ~same amount of memory afterwards.
Mar 18 2008
next sibling parent Paul Findlay <r.lph50+d gmail.com> writes:
Saaa wrote:
 - reduce if/else if/else || && as much as possible

else?

So the processor's branch prediction can stay consistent, or so it doesn't have to guess. "The elimination of branching is an important concern with today's deeply pipelined processor architectures." Check out some of the stuff on: http://www.azillionmonkeys.com/qed/optimize.html - Paul
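As a sketch of what "eliminating branching" can mean in practice, here is a branchy sum next to a branchless rewrite that replaces the conditional jump with a mask. The function names are mine:

```cpp
#include <cstdint>

// Branchy version: the CPU must predict the comparison every iteration,
// and data-dependent branches are what predictors handle worst.
int64_t sum_if_even_branchy(const int* a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] % 2 == 0) s += a[i];
    return s;
}

// Branchless version: build an all-ones/all-zeros mask from the
// condition and AND it with the value, so the loop body contains no
// conditional jump at all.
int64_t sum_if_even_branchless(const int* a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; ++i) {
        int keep = -((a[i] & 1) == 0);   // -1 (all ones) if even, 0 if odd
        s += a[i] & keep;
    }
    return s;
}
```

Whether the branchless form actually wins depends on how predictable the data is; on well-predicted branches the branchy form can be just as fast.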
Mar 19 2008
prev sibling parent reply Georg Wrede <georg nospam.org> writes:
Saaa wrote:
 If you really want to improve performance on C or C++, do it by
 profiling your program, and optimize parts where it matters how
 fast you go.


To Saaa: it's pretty late here now, but I'll try to address some of this briefly.

In general (I have to start with this; and this post, as every post in a NG, should be for a broader audience than the one nominally replied to), improving performance only means one thing: have the computer do less to achieve the goal.
 - simplify


Any time you look at your code from two weeks ago, you'll probably find the same thing could be done more easily, with less code lines, with less effort for the computer.
 - remove unnecessary loops


For example, summing up the integers from 1 to 100 doesn't mean that you'd do the obvious loop. Thinking more about it gives Sum = 1+100 + 2+99 ... and further refining it gives 50 * (1+100) = 5050. (Courtesy of Gauss (http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss), ca. 1785, ca. 7 years old!)
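The loop and the Gauss closed form can be put side by side; a tiny sketch (names mine):

```cpp
// The obvious loop: n additions.
long sum_loop(int n) {
    long s = 0;
    for (int i = 1; i <= n; ++i)
        s += i;
    return s;
}

// Gauss's closed form: pair first with last, second with second-to-last,
// etc. -- n/2 pairs each summing to n+1, so n*(n+1)/2 in constant time.
long sum_gauss(int n) {
    return (long)n * (n + 1) / 2;
}
```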
 - hoist stuff out of loops as much as possible


for (int i = 0; i < 100; i++) { if (i % 3 == 0) sum += i; } converted to for (int i = 0; i < 100; i += 3) { sum += i; } // ok, stupid example, AND it's late here...
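A compilable version of this branch-removal idea, assuming the intent is summing the multiples of 3 below some limit (the original post admits its version was untested; function names are mine):

```cpp
// Test inside the loop: 100 iterations, 100 modulo operations, 100 branches.
int sum_multiples_of_3_branchy(int limit) {
    int sum = 0;
    for (int i = 0; i < limit; ++i)
        if (i % 3 == 0) sum += i;
    return sum;
}

// Test hoisted into the loop stride: every third i satisfies i % 3 == 0,
// so step by 3 directly -- a third of the iterations and no branch.
int sum_multiples_of_3_strided(int limit) {
    int sum = 0;
    for (int i = 0; i < limit; i += 3)
        sum += i;
    return sum;
}
```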
 - iterate or recurse in ways that ease cache miss penalties


Modern processors have caches ranging in size from hundreds of kilobytes to some megabytes, usually with a separate cache for instructions and another for data. Program code, for example, is fetched into the cache from memory around where the current instruction is located. Thus, the instructions following the current one are in the cache, which the processor can access much faster than real memory.

Now, if you have your code so that, say, a loop you're executing contains subroutine or function calls that actually reside far apart in memory, then it might be that the processor doesn't manage to keep all of them in the cache. One way to ease this is to see to it that the needed functions reside close to each other in RAM. One way of trying to do this is to have them next to each other in the source code. (Please, again, understand that it's late here, etc., so I'm not accurate here, more like conveying the general idea.)

The same goes for data. Suppose you're Chloe [in "24", the TV series], and the Boss gives you 25 minutes to filter out which of our 2 million suspects have made any of 100 million recent cellular calls. You'd have to write a D(!!!!) program to get it done, right? Now, matching 2 million in an array against 100 million in a stream (yes, you'd do it against a stream) would be the first tack. But 25 minutes wouldn't be enough. So, you write your program so that first the 2 million suspects are sorted in order of "suspectability", right? The extra time used for this is more than countered for when the actual routine is run, because now most of the "suspects" appear regularly in the comparisons, and therefore "stay in the cache". (Again, it's late here, but you get the idea.)
 - iterate instead of recurse as much as possible


Any textbook on programming and recursion shows you an example of doing the Fibonacci numbers. Just write the code as recursive and as iterative (both from the textbook) and time the results. The difference is appalling.
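The textbook pair looks like this; the recursive version does an exponential number of calls while the iterative one does a linear number of additions (names mine):

```cpp
#include <cstdint>

// Naive recursion: fib_rec(n) recomputes the same subproblems over and
// over, giving roughly phi^n calls (phi ~ 1.618).
uint64_t fib_rec(int n) {
    return n < 2 ? (uint64_t)n : fib_rec(n - 1) + fib_rec(n - 2);
}

// Iterative version: carry the last two values forward, n additions total.
uint64_t fib_iter(int n) {
    uint64_t a = 0, b = 1;
    for (int i = 0; i < n; ++i) {
        uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}
```

Already around n = 40 the recursive version takes seconds while the iterative one is instantaneous.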
 - reduce if/else if/else || && as much as possible

Is this because of the obvious `less calculations is better`, or something else?

Well, any single calculation that you do _within_ the loop is calculated as many times as the loop is run. In _most_ cases, what you'd have inside the loop at first thought can pretty easily be transported outside the loop. (Especially with D, where you can have compile-time things calculated automatically; but also with other languages, quite a lot of the stuff you'd initially have inside the loop can either be calculated once beforehand, or the whole thing converted to calculating other (simpler) things, with some function "rectifying" the end result either before the loop or after it. See the for-loop example above.)
 - multiply by inverse instead of divide where possible


Division is poison for computers. It's also poison for math coprocessors. (Hey, those of us that are old enough to have had to do division of large floating point numbers with pencil and paper at school, know intuitively that multiplication is a piece of cake, compared. Who here would venture to do 123455.2525 / 4.7110211 on only pencil and paper??) Now, if instead, one could choose to do 123455.2525 * 0.212268206568 (whether with paper-and-pencil, or with the math coprocessor), the result would be obtained _much_ faster. And with a _lot_ less head ache.
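When every element is divided by the same constant, the division can be done once, as a reciprocal, and the loop reduced to multiplies. A sketch (names mine; note that with floating point the two forms can differ in the last bits, so this is only valid when that tolerance is acceptable):

```cpp
// One divide per element: the expensive form.
void scale_div(double* a, int n, double d) {
    for (int i = 0; i < n; ++i)
        a[i] /= d;
}

// One divide total, hoisted out of the loop; the loop itself only
// multiplies, which is several times cheaper on most FPUs.
void scale_mul(double* a, int n, double d) {
    double inv = 1.0 / d;
    for (int i = 0; i < n; ++i)
        a[i] *= inv;
}
```

With divisors that are powers of two (as below) the reciprocal is exact and the results are bit-identical.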
 - reduce calls to the OS where it's sensible


(This is ridiculous, but I hope it gets the idea across.) Suppose an idiot, Ivan, has to write a program that sums up the sizes of all files on a hard disk. He might first create a function that lists all the files in a directory with separate OS queries, then edit it to be recursive, then make a list of all the files found with this method, then one by one ask the operating system for the sizes of these files.

Some operating systems have a single call that returns a list of the file names and (among other things) the file sizes and types. Wouldn't it be faster to use this, sum up the sizes, and then do the same for any of the entries that turn out to be directories?

(Comment about reducing calls to the OS: it's not as clear-cut as the above example might suggest. Not all of us use Windows, where "_any_" OS call is a "waste of time". So one should know, or seek wisdom in the OS docs (or even source code, where available), before making judgements on the efficacy of particular tactics.)
 - don't allocate and then delete is you need a ~same amount of memory
  afterwards.

(Ignoring the typo.) New-ing and delete-ing are expensive operations (time-wise). Now, especially if you know up front that you will need to do a lot of both, it might be smart to first consider (or have your app evaluate, or even guess) how many of these you might need at most at the same time. Then it would be smart to allocate an array containing, at that maximum count, whatever it is you need to allocate (be it integers, structs, objects, or whatever). Then you could write a function myNew(object_or_whatever) that, instead of allocating with new, would just look at the array and find the first empty slot in it. Similarly, with delete, you could have a function that frees the particular slot in this array.

While this doesn't sound much faster than using new and delete (because I'm too tired now to explain it properly), this is a tried and proven method of making the program run _much_ faster.

---
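A minimal sketch of that slot-array idea, as a fixed-size free-list pool: all memory is grabbed up front, and "allocating" is just popping a free slot index, so there is no new/delete in the hot path. The struct and its member names are illustrative, not anyone's actual API:

```cpp
#include <cstddef>
#include <vector>

// Fixed-capacity pool of doubles (could be any struct/object type).
struct Pool {
    std::vector<double> slots;       // pre-allocated storage, sized once
    std::vector<size_t> free_list;   // indices of currently unused slots

    explicit Pool(size_t n) : slots(n) {
        // Push indices in reverse so slot 0 is handed out first.
        for (size_t i = 0; i < n; ++i)
            free_list.push_back(n - 1 - i);
    }

    // The myNew() analogue: O(1), no heap allocation.
    double* acquire() {
        if (free_list.empty()) return nullptr;   // pool exhausted
        size_t i = free_list.back();
        free_list.pop_back();
        return &slots[i];
    }

    // The delete analogue: compute the slot index and mark it free.
    void release(double* p) {
        free_list.push_back((size_t)(p - slots.data()));
    }
};
```

The trade-off is a fixed upper bound chosen in advance; the win is that acquire/release are a couple of vector operations instead of a trip through the general-purpose allocator.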
 If you really want to improve performance on C or C++, do it by
 profiling your program, and optimize parts where it matters how
 fast you go.


(Quoted again, from the top of this post.) I'd rewrite the above quote, to: If you really want to improve performance, do it by profiling. And, improve only where the profiling shows you're slow. In any language.
Mar 25 2008
parent reply lutger <lutger.blijdestijn gmail.com> writes:
Georg Wrede wrote:
...
 Now, if you have your code so that, say, a loop you're executing,
 contains subroutine or function calls that actually reside far apart in
 memory, then it might be that the processor doesn't understand to keep
 all of them in the cache. One way to ease this is to see to it that the
 needed functions reside close to each other in ram. One way of trying to
 dot this is to have them next to each other in the source code. ...

Another way is to use the profiler built into dmd. It can generate a def file with the optimal link order gathered from empirical results: http://www.digitalmars.com/ctg/trace.html It's only for win32 however.
Mar 26 2008
parent bearophile <bearophileHUGS lycos.com> writes:
lutger:
 Another way is to use the profiler built into dmd. It can generate a def
 file with the optimal link order gathered from empirical results:
 http://www.digitalmars.com/ctg/trace.html 
 It's only for win32 however.

I think the DMD docs deserve to include such a thing too, then.

Bye,
bearophile
Mar 26 2008
prev sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Dan" <murpsoft hotmail.com> wrote in message 
news:frpnmc$24u0$1 digitalmars.com...

 3) Startup for D is slower even for hello world because D statically links 
 the entirety of phobos and the GC even if you don't ever use them.  This 
 equates to about 80kb of bloat - so it's still dramatically better than 
 Java or C#, but still not "correct".

If it linked in all of phobos, your programs would start at around 1MB. It doesn't link in all of phobos.
Mar 18 2008
prev sibling parent "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Wed, 19 Mar 2008 02:45:00 +0200, Dan <murpsoft hotmail.com> wrote:

 4) If the GC does a collection cycle, it'll bump the time complexity.  This
will happen pseudo-randomly.

Garbage collection can only happen on a memory allocation, turning off the GC will have no effect here (except, perhaps, the initialization time, which isn't what we're measuring here anyway). -- Best regards, Vladimir mailto:thecybershadow gmail.com
Mar 18 2008
prev sibling next sibling parent David Ferenczi <raggae ferenczi.net> writes:
Watch out with the release flag; I experienced strange behaviour.

See: http://d.puremagic.com/issues/show_bug.cgi?id=797
Mar 19 2008
prev sibling parent reply Matthew Allen <mallen removeme.creativelifestyles.com> writes:
Matthew Allen Wrote:

 I am looking to use D for programming a high speed vision application which
was previously written in C/C++. I have done some arbitary speed tests and am
finding that C/C++ seems to be faster than D by a magnitude of about 3 times. I
have done some simple loop tests that increment a float value by some number
and also some memory allocation/deallocation loops and C/C++ seems to come out
on top each time. Is D meant to be faster or as fast as C/C++ and if so how can
I optimize the code. I am using -inline, -O, and -release. 
 
 An example of a simple loop test I ran is as follows:
 
 DWORD start = timeGetTime();
 	int i,j,k;
 	float dx=0;
     for(i=0; i<1000;i++)
         for(j=0; j<1000;j++)
             for(k=0; k<10; k++)
                 {
                      dx++;
                 }
     DWORD end = timeGetTime();
 
 In C++ int and doubles. The C++ came back with a time of 15ms, and D came back
with 45ms.

I am testing DMD 1.0 against the MSVC6 compiler. On DMD I used -O and -inline; on MSVC I used -O2. Taking in the discussion, I tried a few more tests and found that D is faster in certain circumstances, so I guess that the speed is down to compiler optimization. Also of note is that these tests were run in GUI applications, not console applications. Running in simple console applications, D came out on top in all tests.

Here is a summary of what I tried. Each test was run 100 times, average given.

double Add(double a, double b) { return a + b; }

DWORD start = timeGetTime();
int i, j, k;
double dx = 0;
for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++)
        for (k = 0; k < 10; k++)
        {
            dx++;                 // test 1 - simple increment on dx
            dx = Add(i+0.5, j);   // test 2 - function to change dx
            dx += Add(i+0.5, j);  // test 3 - function increment on dx
        }
DWORD end = timeGetTime();

For test 1: DMD [42ms]  MSVC6 [15ms]
For test 2: DMD [9ms]   MSVC6 [100ms]
For test 3: DMD [42ms]  MSVC6 [109ms]
Mar 19 2008
parent lutger <lutger.blijdestijn gmail.com> writes:
For benchmarks that operate on a higher, language level see this comparison
of xml libraries, D comes on top with tango's implementation: 
http://dotnot.org/blog/index.php

Here are the slides of the presentation where the underlying ideas were
discussed:
http://s3.amazonaws.com/dconf2007/Kris_Array_Slicing.pdf
Mar 19 2008