digitalmars.D.learn - is D so slow?

reply baleog <maccarka yahoo.com> writes:
Hello
I wrote two almost identical test programs (matrix multiplication), one in C and
the other in D, and the D program was 15 times slower!
Was it my mistake or not?
Thank you

p.s. code:
void test (int n) {  
  float[] xs = new float[n*n];
  float[] ys = new float[n*n];
  for(int i = n-1; i>=0; --i) {
    xs[i] = 1.0;
  }
  for(int i = n-1; i>=0; --i) {
    ys[i] = 2.0;
  }
  float[] zs = new float[n*n];
  for (int i=0; i<n; ++i) {
    for (int j=0; j<n; ++j) {
      float s = 0.0;
      for (int k=0; k<n; ++k) {
        s = s + (xs[k + (i*n)] * ys[j + (k*n)]);
      }
      zs[j+ (i*n)] =  s;
    }
  }
  delete xs;
  delete ys;
  delete zs;
}
Jun 14 2008
next sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
This is a frequently asked question.

What compilers are you comparing?  They have different optimization 
backends.  It could be that the C compiler or compiler flags you are 
using simply perform better than the comparable D compiler and flags.

-[Unknown]


Jun 14 2008
parent reply baleog <maccarka yahoo.com> writes:
Unknown W. Brackets Wrote:
 What compilers are you comparing?

gcc-4.0 with the latest releases of gdc and dmd-2

 They have different optimization backends.  It could be that the C
 compiler or compiler flags you are using simply perform better than the
 comparable D compiler and flags.

But gdc uses the gcc backend, and the test programs (C and D) differ by only a couple of bytes. I used std.gc.disable() too, and nothing changed.
 
Jun 14 2008
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
What about switches?  Your program uses arrays; if you have array bounds 
checks enabled, that could easily account for the difference.

One way to see is to dump the assembly (I think there's a utility called 
dumpobj included with dmd) and compare.  Obviously, it's doing something 
differently - there's nothing intrinsically "slower" about the language 
for sure.

Also - keep in mind that gdc doesn't take advantage of all the 
optimizations that gcc is able to provide, at least at this time.  A 
couple of bytes can go a long long way if not optimized right.

-[Unknown]


Jun 14 2008
parent Jerry Quinn <jlquinn optonline.net> writes:
Unknown W. Brackets Wrote:

 What about switches?  Your program uses arrays; if you have array bounds 
 checks enabled, that could easily account for the difference.
 
 One way to see is to dump the assembly (I think there's a utility called 
 dumpobj included with dmd) and compare.  Obviously, it's doing something 
 differently - there's nothing intrinsically "slower" about the language 
 for sure.
 
 Also - keep in mind that gdc doesn't take advantage of all the 
 optimizations that gcc is able to provide, at least at this time.  A 
 couple of bytes can go a long long way if not optimized right.

There's another classic benchmark issue that you could be stumbling over. The sample code you posted throws away the results inside the function. GCC can detect that the results of the computations are not used, and optimize everything out of existence. That kind of difference could easily explain the speed difference you're seeing. If you're going to do this kind of micro-benchmark, you need to print the result of the computation or otherwise convince the compiler you need the result.
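
As a rough illustration of that advice, here is a minimal sketch in present-day D (not the 2008 toolchain used in this thread). The helper name `multiplyAndSum` and the size `n = 100` are made up for the example. The function returns a sum over the whole result matrix, and `main` prints it next to the elapsed time, so the multiplication cannot be discarded as unused work.

```d
import std.stdio : writefln;
import std.datetime.stopwatch : StopWatch, AutoStart;

// Hypothetical helper: multiply two n-by-n row-major matrices and return
// the sum of the result, so the caller gets a value that depends on every
// computed element.
float multiplyAndSum(const(float)[] xs, const(float)[] ys, int n)
{
    auto zs = new float[n * n];
    foreach (i; 0 .. n)
        foreach (j; 0 .. n)
        {
            float s = 0.0f;
            foreach (k; 0 .. n)
                s += xs[k + i * n] * ys[j + k * n];
            zs[j + i * n] = s;
        }

    float total = 0.0f;
    foreach (z; zs)
        total += z;
    return total;
}

void main()
{
    enum n = 100; // arbitrary size chosen for the sketch

    auto xs = new float[n * n];
    auto ys = new float[n * n];
    xs[] = 1.0f; // fill every element, not just the first n
    ys[] = 2.0f;

    auto sw = StopWatch(AutoStart.yes);
    immutable total = multiplyAndSum(xs, ys, n);
    sw.stop();

    // Printing the checksum makes the result observable and doubles as a
    // sanity check on the algorithm.
    writefln("checksum = %s, elapsed = %s ms", total, sw.peek.total!"msecs");
}
```

A checksum that comes out as `nan` would also have pointed straight at uninitialized input in the program being timed.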
Jun 15 2008
prev sibling next sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
What switches did you use to compile? Not much info you're giving ...

Tomas
Jun 15 2008
next sibling parent reply baleog <maccarka yahoo.com> writes:
Tomas Lindquist Olsen Wrote:
 
 What switches did you use to compile? Not much info you're giving ...

Ubuntu-6.06
dmd-2.0.14 - 40sec with n=500
  dmd -O -release -inline test.d
gdc-0.24 - 32sec
  gdmd -O -release test.d
and gcc-4.0.3 - 1.5sec
  gcc test.c

So gcc without optimization runs 20 times faster than gdc, but I can't find how to suppress array bounds checking.
 
 Tomas

Jun 15 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"baleog" <maccarka yahoo.com> wrote in message 
news:g32umu$11kq$1 digitalmars.com...
 Tomas Lindquist Olsen Wrote:

 What switches did you use to compile? Not much info you're giving ...

Ubuntu-6.06 dmd-2.0.14 - 40sec witth n=500 dmd -O -release -inline test.d gdc-0.24 - 32sec gdmd -O -release test.d and gcc-4.0.3 - 1.5sec gcc test.c so gcc without optimization runs 20 times faster than gdc but i can't find how to suppress array bound checking

Array bounds checking is off as long as you specify -release.

I don't know if your computer is just really, REALLY slow, but out of curiosity I tried running the D program on my computer. It completes in 1.2 seconds.

Also, using malloc/free vs. new/delete shouldn't matter much in this program, because you make all of three allocations, all before any loops. The GC is never going to be called during the program.
Jun 15 2008
parent reply "Dave" <Dave_member pathlink.com> writes:
"Jarrett Billingsley" <kb3ctd2 yahoo.com> wrote in message 
news:g336hl$10c8$1 digitalmars.com...
 "baleog" <maccarka yahoo.com> wrote in message 
 news:g32umu$11kq$1 digitalmars.com...
 Tomas Lindquist Olsen Wrote:

 What switches did you use to compile? Not much info you're giving ...

Ubuntu-6.06 dmd-2.0.14 - 40sec witth n=500 dmd -O -release -inline test.d gdc-0.24 - 32sec gdmd -O -release test.d and gcc-4.0.3 - 1.5sec gcc test.c so gcc without optimization runs 20 times faster than gdc but i can't find how to suppress array bound checking

Array bounds checking is off as long as you specify -release. I don't know if your computer is just really, REALLY slow, but out of curiosity I tried running the D program on my computer. It completes in 1.2 seconds. Also, using malloc/free vs. new/delete shouldn't much matter in this program, because you make all of three allocations, all before any loops. The GC is never going to be called during the program.

I agree, but nonetheless the malloc version runs much faster on my systems (both Linux/Windows, P4 and Core2, all compiled w/ -O -inline -release). The relative performance difference gets larger as n increases:

n            malloc        GC
100        0.094         0.328
200        0.140         1.859
300        0.203         6.094
400        0.312        14.141
500        0.547        27.625

import std.conv;

void main(string[] args)
{
    if(args.length > 1)
        test(toInt(args[1]));
    else
        printf("usage: mm nnn\n");
}

version(malloc)
{
    import std.c.stdlib;
}

void test(int n)
{
    version(malloc)
    {
        float* xs = cast(float*)malloc(n*n*float.sizeof);
        float* ys = cast(float*)malloc(n*n*float.sizeof);
    }
    else
    {
        float[] xs = new float[n*n];
        float[] ys = new float[n*n];
    }

    for(int i = n-1; i>=0; --i) {
        xs[i] = 1.0;
    }
    for(int i = n-1; i>=0; --i) {
        ys[i] = 2.0;
    }

    version(malloc)
    {
        float* zs = cast(float*)malloc(n*n*float.sizeof);
    }
    else
    {
        float[] zs = new float[n*n];
    }

    for (int i=0; i<n; ++i) {
        for (int j=0; j<n; ++j) {
            float s = 0.0;
            for (int k=0; k<n; ++k) {
                s = s + (xs[k + (i*n)] * ys[j + (k*n)]);
            }
            zs[j+ (i*n)] = s;
        }
    }

    version(malloc)
    {
        free(zs);
        free(ys);
        free(xs);
    }
    else
    {
        delete xs;
        delete ys;
        delete zs;
    }
}
Jun 15 2008
parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Dave" <Dave_member pathlink.com> wrote in message 
news:g34sja$2m1a$1 digitalmars.com...

 I agree, but nonetheless the malloc version runs much faster on my systems 
 (both Linux/Windows, P4 and Core2, all compiled w/ -O -inline -release). 
 The relative performance difference gets larger as n increases:

 n            malloc        GC
 100        0.094         0.328
 200        0.140         1.859
 300        0.203         6.094
 400        0.312        14.141
 500        0.547        27.625

I'm sorry, but using your code, I can't reproduce times anywhere near that. I'm on Windows, DMD, Athlon X2 64. Here are my results:

Phobos:
n      malloc      GC
------------------------
100    0.005206    0.005285
200    0.045083    0.045199
300    0.148954    0.148920
400    0.400136    0.404554
500    0.933754    1.076060

Tango:
n      malloc      GC
------------------------
100    0.005221    0.005298
200    0.045342    0.044910
300    0.150753    0.149157
400    0.402951    0.403343
500    0.946041    1.073466

Tested with both Tango and Phobos to be sure, and the times are not really any different between the two. The malloc and GC times don't really differ until n=500, and even then it's not by much.
Jun 16 2008
parent baleog <maccarka yahoo.com> writes:
Jarrett Billingsley Wrote:

 I'm sorry, but using your code, I can't reproduce times anywhere near that. 

Maybe it depends on the hardware? Perhaps the effectiveness of `new` depends on the hardware used.

My /proc/cpuinfo:
Intel Celeron 1.5GHz
flags: fpu, vme, de, tsk, msr, pae, mce, cx8, apic, sep, mtrr, pge, mca, cmov, pat, clflush, dts, acpi, mmx, fxsr, sse, sse2, ss, tm, pbe, nx
Jun 16 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
baleog:
 i can't find how to suppress array bound checking

With DMD you can disable array bounds checking using the -release compilation option.

Bye,
bearophile
Jun 15 2008
prev sibling parent reply baleog <maccarka yahoo.com> writes:
Thank you for your replies! I used malloc instead of new, and the run time was
about 1 sec.

p.s. i'm sorry for my terrible english

Jun 15 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
baleog:
I suggest you show the complete C and the complete D code.

   float[] xs = new float[n*n];

With a smarter use of gc.malloc you may avoid clearing items two times... (I presume the optimizer doesn't remove the first clearing). You can use this from the d.extra module of my libs:

import std.gc: gcmalloc = malloc, gcrealloc = realloc, hasNoPointers;

T[] NewVoidGCArray(T)(int n) {
    assert(n > 0, "NewVoidGCArray: n must be > 0.");
    auto pt = cast(T*)gcmalloc(n * T.sizeof);
    hasNoPointers(pt);
    return pt[0 .. n];
}

   for(int i = n-1; i>=0; --i) {
      xs[i] = 1.0;
   }

D arrays know this shorter and probably faster syntax:

xs[] = 1.0;

Bye,
bearophile
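
For readers on a current D toolchain, the two suggestions above (allocate without the default fill, then assign to the whole slice) can be sketched as follows. This is only an approximation of the idea: `std.gc` belongs to the 2008-era Phobos, so the sketch swaps in `std.array.uninitializedArray` from today's Phobos, and the helper name `filled` is invented for the example.

```d
import std.array : uninitializedArray;
import std.stdio : writeln;

// Hypothetical helper: allocate n floats without the default NaN fill,
// then set every element to `value` with one slice assignment.
float[] filled(size_t n, float value)
{
    auto a = uninitializedArray!(float[])(n);
    a[] = value; // writes all n elements, so nothing uninitialized is read
    return a;
}

void main()
{
    auto xs = filled(16, 1.0f);
    writeln(xs);
}
```

The slice assignment is what matters for correctness here. Skipping the default fill only saves one pass over the memory, and it is safe only because every element is written before it is read.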
Jun 15 2008
prev sibling parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-06-15 13:53:30 +0200, baleog <maccarka yahoo.com> said:

 Thank you for your replies! I used malloc instead of new and run time 
 was about 1sec

But you probably did not understand why... and it seems that neither did others around here...

Indeed it is a subtle pitfall in which it is easy to fall.

When you benchmark
1) print something depending on the result, like the sum of everything (it is not the main issue in this case, but doing it would have probably shown the problem), so you can also have at least a tiny chance to notice if your algorithm is wrong
2) NaNs
operations involving NaNs, depending on the IEEE compliance requested on the processor, can be 1000 times slower!!!!!!!!
D (very thoughtfully, as it makes spotting errors easier) initializes the floating point numbers with NaNs (unlike C).

-> your results follow

if you use malloc, the memory is not initialized with NaNs -> performance

manual malloc in this case is definitely not required

writing a benchmark can be subtle...
benchmarking correct code is easier...

Fawzi
Jun 16 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
ehm, sorry...
You do initialize everything...
ehm, never post without testing...

Fawzi
Jun 16 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
I tested... and well I was actually right (I should have trusted my gut feeling a little more...)
NaN is the culprit.

check your algorithm (you initialize, backwards for some strange reason) just part of the arrays...
putting
  xs[] = 1.0;
  ys[] = 2.0;
instead of your strange loops, solves everything...

Fawzi
Jun 16 2008
parent reply baleog <maccarka yahoo.com> writes:
Fawzi Mohamed Wrote:

 check your algorithm (you initialize, backwards for some strange 
 reason) just part of the arrays...
 putting
   xs[] = 1.0;
   ys[] = 2.0;
 instead of your strange loops, solves everything...
 

Jun 16 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-06-16 18:53:48 +0200, baleog <maccarka yahoo.com> said:

 Fawzi Mohamed Wrote:
 
 check your algorithm (you initialize, backwards for some strange
 reason) just part of the arrays...
 putting
 xs[] = 1.0;
 ys[] = 2.0;
 instead of your strange loops, solves everything...
 

initialization)?? did you mean that in this case i must use `mallloc` function

To quote myself:
 2) NaNs
 operations involving NaNs depending on the IEEE compliance requested on 
 the processor can be 1000 times slower!!!!!!!!
 D (very thoughtfully, as it makes spotting errors easier) initializes 
 the floating point numbers with NaNs (unlike C).

your loop

  for(int i = n-1; i>=0; --i) {
    xs[i] = 1.0;
  }

initializes only xs[0..n], but you also have xs[n..n*n], which keep their default initial value (the same is valid for ys).
In D by default this value is NaN (which is good, as it helps you to spot errors in code that you really use, not only benchmark).
When you use these values your program goes very slow if full IEEE compliance is requested from your processor (at least on my pc).

If you use malloc, the default initialization does not take place, the memory is normally either initialized to 0, or left uninitialized (with values that likely are not NaN).
So your program is fast with malloc, but in fact all this is due to a bug in the program that you are benchmarking, and using malloc is not the correct solution, the solution is to initialize all the values that you use.

Fawzi
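
The initialization bug described above is easy to make visible. The following is a minimal sketch in present-day D, with a made-up helper `countNaNs` and a small `n = 4` chosen for illustration. It repeats the posted loop on one array, uses slice assignment on another, and reports how many elements of each still hold the default NaN.

```d
import std.stdio : writefln;
import std.math : isNaN;
import std.algorithm.searching : count;

// Hypothetical helper: count the elements that still hold float's
// default initial value, NaN.
size_t countNaNs(const(float)[] a)
{
    return a.count!(x => isNaN(x));
}

void main()
{
    enum n = 4; // small size chosen for the sketch

    // Same pattern as the posted loop: n*n elements, only the first n set.
    auto partial = new float[n * n];
    for (int i = n - 1; i >= 0; --i)
        partial[i] = 1.0f;

    // Slice assignment touches every element.
    auto full = new float[n * n];
    full[] = 1.0f;

    writefln("NaNs after the loop:             %s of %s",
             countNaNs(partial), partial.length);
    writefln("NaNs after the slice assignment: %s of %s",
             countNaNs(full), full.length);
}
```

Any product or sum that involves one of the leftover NaN elements is itself NaN, so the matrix multiplication in the original test spends most of its inner loop on NaN arithmetic.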
Jun 16 2008
next sibling parent reply "Dave" <Dave_member pathlink.com> writes:
 
 If you use malloc, the default initialization does not take place, the 
 memory is normally either initialized to 0, or left uninitialized (with 
 values that likely are not NaN).
 So your program is fast with malloc, but in fact all this is due to a 
 bug in the program that you are benchmarking, and using malloc is not 
 the correct solution, the solution is to initialize all the values that 
 you use.
 
 Fawzi

Good catch...
Jun 16 2008
parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-06-17 03:23:54 +0200, "Dave" <Dave_member pathlink.com> said:

 
 Good catch...

thanks :)
Jun 17 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Fawzi Mohamed wrote:
 If you use malloc, the default initialization does not take place, the 
 memory is normally either initialized to 0, or left uninitialized (with 
 values that likely are not NaN).

If I remember right, malloc does no initialization; calloc initializes to 0.
Jun 16 2008
parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-06-17 04:13:14 +0200, Robert Fraser <fraserofthenight gmail.com> said:

 If I remember right, malloc does no initialization; calloc initializes to 0.

Indeed calloc is documented to always initialize to 0.
I think that by default when reusing memory malloc does not initialize it (but normally you can set environment variables to change this behaviour).
When getting the memory from the system, initialization might (and often will) take place so that a program cannot "sniff" the memory of other programs.
The thing is system dependent; malloc gives no guarantee with respect to any special behavior.

Fawzi
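
To make the contrast concrete, here is a small sketch in present-day D that calls the C allocator through `core.stdc.stdlib`. It only shows the two behaviours that are guaranteed: `calloc` hands back zero-filled memory, and a D `new float[...]` array starts out as NaN. The contents of a plain `malloc` block are indeterminate, so the sketch deliberately does not read one. The length `len = 8` is arbitrary.

```d
import core.stdc.stdlib : calloc, free;
import std.math : isNaN;
import std.stdio : writeln;

void main()
{
    enum len = 8; // arbitrary length chosen for the sketch

    // calloc zero-fills the block; all-zero bytes read back as 0.0f on
    // IEEE 754 platforms.
    auto p = cast(float*) calloc(len, float.sizeof);
    assert(p !is null, "calloc failed");
    scope (exit) free(p);
    float[] zeroed = p[0 .. len];

    // A GC-allocated float array starts out as float.init, which is NaN.
    auto viaNew = new float[len];
    assert(isNaN(viaNew[0]));

    writeln("calloc: ", zeroed);
    writeln("new:    ", viaNew);
}
```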
Jun 17 2008
prev sibling parent reply "Saaa" <empty needmail.com> writes:
baleog are you Marco? (same ip)
What kind of hardware do you have?
Because Marco also had some strange speed problems I couldn't replicate. 
Jun 15 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Saaa" <empty needmail.com> wrote in message 
news:g340sc$d1j$1 digitalmars.com...
 baleog are you Marco? (same ip)
 What kind of hardware do you have?
 Because Marco also had some strange speed problems I couldn't replicate.

They have the same IP because they both used the web interface. You'll notice that everyone who uses the web interface has the same IP.
Jun 15 2008
prev sibling parent reply baleog <maccarka yahoo.com> writes:
Saaa Wrote:

 baleog are you Marco? (same ip)

 What kind of hardware do you have?

 Because Marco also had some strange speed problems I couldn't replicate. 
 
 

Jun 16 2008
parent "Saaa" <empty needmail.com> writes:
Ok,
It just sounded like the same problem.
Fawzi seems to have the solution :) 
Jun 16 2008