
digitalmars.D - pi benchmark on ldc and dmd

reply Walter Bright <newshound2 digitalmars.com> writes:
http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

Anyone care to examine the assembler output and figure out why?
Aug 01 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 
 Anyone care to examine the assembler output and figure out why?
Do you mean code similar to this one, or code that uses std.bigint?
http://shootout.alioth.debian.org/debian/program.php?test=pidigits&lang=gdc&id=3

Bye,
bearophile
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 Do you mean code similar to this one, or code that uses std.bigint?
It was something that used bigint. I whipped it up myself earlier this morning, but left the code on my laptop. I'll post it when I have a chance. I ran obj2asm on it myself, but was a little short on time, so I haven't really analyzed it yet.
Aug 01 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:
 
 It was something that used bigint. I whipped it up myself earlier
 this morning, but left the code on my laptop. I'll post it when
 I have a chance.
OK. In such situations it's never enough to compare the D code compiled with DMD to the D code compiled with LDC. You also need a reference point, like a C version compiled with GCC (here using GMP bignums). Such reference points are necessary to anchor performance discussions to something.

Bye,
bearophile
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 In such situations it's never enough to compare the D code compiled
 with DMD to the D code compiled with LDC. You also need a reference
 point, like a C version compiled with GCC (here using GMP bignums).
 Such reference points are necessary to anchor performance
 discussions to something.
Actually, I don't think that would be relevant here. The thread started with someone saying the DMD backend is garbage and should be abandoned. I'm sick and tired of hearing people say that. The Digital Mars code has many, many advantages over the others*.

But, it was challenged specifically on the optimizer, so to check that out, I wanted all other things to be equal: same code, same front end, same computer, as close to the same runtime and library as possible with different compilers. The only difference should be the backend, so we can draw conclusions about it without other factors skewing the results.

So for this, I just wanted to compare the dmd backend to the ldc and gdc backends, and I didn't worry too much about absolute numbers or other languages.

(Actually, one of the reasons I picked the pi one was that after the embarrassing defeat in floating point, I was hoping dmd could score a second victory and I could follow up on that "prove it" post with satisfaction. Alas, the facts didn't work out that way. Though, I still do find dmd to beat g++ on a lot of real world code - things like slices actually make a sizable difference.)

But regardless, it was just about comparing backends, not doing language comparisons.

===

* To name a huge one: today was the first time I ever got ldc or gdc to actually work on my computer, and it took a long, long time to do it. I've tried in the past, and failed, so this was a triumph. Big success.

I was waiting over an hour just for gcc+gdc to compile! In the time it takes for gcc's configure script to run, you can make clean, build dmd, druntime, and phobos.

It's a huge hassle to get the code together too. I had to go to *four* different sites to get gdc's stuff together (like 80 MB of crap, compressed!), and two different ones to get even the ldc binary to work. Pain in my ASS.

And this is on Linux too. I pity the fool who tries to do this on Windows, knowing how so much linux software treats their Windows "ports".

I'd like to contrast to dmd: unzip and play with wild abandon.
Aug 01 2011
next sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Mon, Aug 1, 2011 at 8:38 PM, Adam D. Ruppe <destructionator gmail.com> wrote:

 Actually, I don't think that would be relevant here. [...]
 I'd like to contrast to dmd: unzip and play with wild abandon.
Yes, GDC takes forever and a half to build. That's true of anything in GCC, and it's just because they don't trust the native C compiler at all. LDC builds in under a half hour, even on my underpowered ARM SoC, so I don't see how you could be having trouble there.

As for Windows, Daniel Green (hopefully I'm remembering right) has been posting GDC binaries.

I do respect that DMD generates reasonably fast executables recklessly fast, but it also doesn't exist outside x86 and x86_64, and the debug symbols (at least on Linux) are just hilariously bad.

Now if I could just get GDC to pad structs correctly on ARM...
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
 LDC builds in under a half hour, even on my underpowered ARM SoC,
 so I don't see how you could be having trouble there.
Building dmd from the zip took 37 *seconds* for me just now, after running a make clean (this is on Linux).

gdc and ldc have their advantages, but they have disadvantages too. I think the people saying "abandon dmd" don't know the other side of the story.

Basically, I think the more compilers we have for D the better. gdc is good. ldc is good. And so is dmd. We shouldn't abandon any of them.
Aug 02 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
I think I have it: 64 bit registers. I got ldc to work
in 32 bit (didn't have that yesterday, so I was doing 64 bit only)
and compiled.

No difference in timing between ldc 32 bit and dmd 32 bit.
The disassembly isn't identical but the time is. (The disassembly
seems to mainly order things differently, but ldc has fewer jump
instructions too.)

Anyway.

In 64 bit, ldc gets a speedup over dmd. Looking at the asm
output, it looks like dmd doesn't use any of the new registers,
whereas ldc does. (dmd's 64 bit looks mostly like 32 bit code with
r instead of e.)


Here's the program. It's based on one of the Python ones.

====
import std.bigint;
import std.stdio;

alias BigInt number;

void main() {
	auto N = 10000;

	number i, k, ns;
	number k1 = 1;
	number n,a,d,t,u;
	n = 1;
	d = 1;
	while(1) {
		k += 1;
		t = n<<1;
		n *= k;
		a += t;
		k1 += 2;
		a *= k1;
		d *= k1;
		if(a >= n) {
			t = (n*3 +a)/d;
			u = (n*3 +a)%d;
			u += n;
			if(d > u) {
				ns = ns*10 + t;
				i += 1;
				if(i % 10 == 0) {
					debug writefln ("%010d\t:%d", ns, i);
					ns = 0;
				}
				if(i >= N) {
					break;
				}
				a -= d*t;
				a *= 10;
				n *= 10;
			}
		}
	}
}
=====
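For readers who want to run the same algorithm outside D: the loop above can be followed line for line in Python, whose ints are arbitrary-precision like BigInt. This is an illustrative sketch, not the exact Python shootout program the D code was ported from; the function name pidigits and the list-of-groups return value are inventions for this example:

```python
def pidigits(N):
    """Unbounded spigot for pi, mirroring the D loop above.

    Returns the digits packed into ten-digit integers, the same
    grouping the D code prints with writefln.
    """
    groups = []
    i = ns = k = 0
    k1 = 1
    n = d = 1
    a = 0
    while True:
        k += 1
        t = n << 1
        n *= k
        a += t
        k1 += 2
        a *= k1
        d *= k1
        if a >= n:
            # One divmod gives both the quotient t and remainder u
            # that the D code computes with separate / and %.
            t, u = divmod(n * 3 + a, d)
            u += n
            if d > u:
                ns = ns * 10 + t
                i += 1
                if i % 10 == 0:
                    groups.append(ns)
                    ns = 0
                if i >= N:
                    return groups
                a -= d * t
                a *= 10
                n *= 10

print(pidigits(10))  # [3141592653]
```

Running it for 10 digits yields the first ten digits of pi packed into one integer, matching what the D program's debug writefln would print.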

BigInt's calls aren't inlined, but that's a frontend issue. Let's
eliminate that by switching to long in that alias.

The result will be wrong, but that's beside the point for now. I
just want to see integer math. (this is why the writefln is debug
too)

With optimizations turned on, ldc again wins by the same ratio -
it runs in about 2/3 the time - and the code is much easier to look
at.


Let's see what's going on.


The relevant loop from DMD (64 bit):

===
L47:		inc	qword ptr -040h[RBP]
		mov	RAX,-028h[RBP]
		add	RAX,RAX
		mov	-010h[RBP],RAX
		mov	RAX,-040h[RBP]
		imul	RAX,-028h[RBP]
		mov	-028h[RBP],RAX
		mov	RAX,-010h[RBP]
		add	-020h[RBP],RAX
		add	qword ptr -030h[RBP],2
		mov	RAX,-030h[RBP]
		imul	RAX,-020h[RBP]
		mov	-020h[RBP],RAX
		mov	RAX,-030h[RBP]
		imul	RAX,-018h[RBP]
		mov	-018h[RBP],RAX
		mov	RAX,-020h[RBP]
		cmp	RAX,-028h[RBP]
		jl	L47
		mov	RAX,-028h[RBP]
		lea	RAX,[RAX*2][RAX]
		add	RAX,-020h[RBP]
		mov	-058h[RBP],RAX
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-010h[RBP],RAX
		mov	RAX,-058h[RBP]
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-8[RBP],RDX
		mov	RAX,-028h[RBP]
		add	-8[RBP],RAX
		mov	RAX,-018h[RBP]
		cmp	RAX,-8[RBP]
		jle	L47
		mov	RAX,-038h[RBP]
		lea	RAX,[RAX*4][RAX]
		add	RAX,RAX
		add	RAX,-010h[RBP]
		mov	-038h[RBP],RAX
		inc	qword ptr -048h[RBP]
		mov	RAX,-048h[RBP]
		mov	RCX,0Ah
		cqo
		idiv	RCX
		test	RDX,RDX
		jne	L109
		mov	qword ptr -038h[RBP],0
L109:		cmp	qword ptr -048h[RBP],02710h
		jge	L137
		mov	RAX,-018h[RBP]
		imul	RAX,-010h[RBP]
		sub	-020h[RBP],RAX
		imul	EAX,-020h[RBP],0Ah
		mov	-020h[RBP],RAX
		imul	EAX,-028h[RBP],0Ah
		mov	-028h[RBP],RAX
		jmp	  L47
===


and from ldc 64 bit:

====
L20:		add	RDI,2
		inc	RCX
		lea	R9,[R10*2][R9]
		imul	R9,RDI
		imul	R8,RDI
		imul	R10,RCX
		cmp	R9,R10
		jl	L20
		lea	RAX,[R10*2][R10]
		add	RAX,R9
		cqo
		idiv	R8
		add	RDX,R10
		cmp	R8,RDX
		jle	L20
		cmp	RSI,0270Fh
		jg	L73
		imul	RAX,R8
		sub	R9,RAX
		add	R9,R9
		lea	R9,[R9*4][R9]
		inc	RSI
		add	R10,R10
		lea	R10,[R10*4][R10]
		jmp short	L20
===


First thing that immediately pops out is the code is a lot shorter.
Second thing that jumps out is it looks like ldc makes better use
of the registers. Indeed, the shortness looks to be thanks to
the registers eliminating a lot of movs.



So I'm pretty sure the difference is caused by dmd not using the
new registers in x64. The other differences look trivial to my
eyes.
Aug 02 2011
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 12:49 PM, Adam D. Ruppe wrote:
 So I'm pretty sure the difference is caused by dmd not using the
 new registers in x64. The other differences look trivial to my
 eyes.
dmd does use all the registers on the x64, but it seems to not be enregistering here. I'll have a look see.
Aug 02 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 Here's the program. It's based on one of the Python ones.
The D code is about 2.8 times slower than the Haskell version, and it has a bug, shown here:

import std.stdio, std.bigint;
void main() {
    int x = 100;
    writefln("%010d", x);
    BigInt bx = x;
    writefln("%010d", bx);
}

Output:

0000000100
100

----------------------------

The Haskell code I've used:

-- Compile with: ghc --make -O3 -XBangPatterns -rtsopts pidigits_hs.hs
import System

 i%ds
  | i >= n = []
  | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j%t
  where k = i+10; j = min n k
        (h,t) | k > n = (take (n`mod`10) ds ++ replicate (k-n) " ",[])
              | True = splitAt 10 ds

  where k = j+1; t (n,a,d)=k&s; (q,r)=(n*3+a)`divMod`d
 j&(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1)

main = putStr.pidgits.read.head =<< getArgs

Bye,
bearophile
Aug 02 2011
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 02.08.2011, 22:35 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:


  i%ds
   | i >= n = []
   | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j%t
   where k = i+10; j = min n k
         (h,t) | k > n = (take (n`mod`10) ds ++ replicate (k-n) " ",[])
               | True = splitAt 10 ds


   where k = j+1; t (n,a,d)=k&s; (q,r)=(n*3+a)`divMod`d
  j&(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1)

 main = putStr.pidgits.read.head =<< getArgs
Is this Indonesian cast to ASCII? :p
Aug 02 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Marco Leise:

Is this Indonesian cast to ASCII? :p
I agree it's very bad looking; it isn't idiomatic Haskell code. But it contains nothing too strange (and the algorithm is the same one used in the D code). This is formatted a bit better, though I don't fully understand it yet:

import System (getArgs)

i % ds
  | i >= n = []
  | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j % t
  where
    k = i + 10
    j = min n k
    (h, t)
      | k > n = (take (n `mod` 10) ds ++ replicate (k - n) " ", [])
      | True = splitAt 10 ds

  where
    k = j + 1
    t (n, a, d) = k & s
    (q, r) = (n * 3 + a) `divMod` d

j & (n, a, d) = (n * j, (a + n * 2) * y, d * y)
  where y = (j * 2 + 1)

main = putStr . pidgits . read . head =<< getArgs

The Shootout site (where I have copied that code) ranks programs by performance and by compactness (measured with a low-performance compressor...), so there you see Haskell (and other language) programs that are sometimes too compact and often use clever tricks to increase their performance. You don't find those tricks in normal Haskell code (this specific program seems not to use strange tricks, but on the Haskell Wiki page about this problem, http://www.haskell.org/haskellwiki/Shootout/Pidigits , you can see several programs that are both longer and slower than this one).

The first working implementation of a C program is usually long but fast enough, while the first working implementation of a Haskell program is often short but not so fast; there are usually ways to speed the Haskell code up later. My experience of Haskell is limited, so when I write some Haskell my head usually hurts a bit :-)

The higher-level nature of Python lets me implement working algorithms that are more complex, so sometimes the code ends up faster than C code, where at a first implementation you often avoid very complex algorithms for fear of hard-to-find bugs, or just because they take too long to write. Haskell in theory allows you to implement complex algorithms in a short space, and safely. In practice I think it takes a lot of brainpower to do this. Haskell sometimes looks like a puzzle language to me (maybe I just need more self-training in functional programming).

Bye,
bearophile
Aug 02 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 12:49 PM, Adam D. Ruppe wrote:
 So I'm pretty sure the difference is caused by dmd not using the
 new registers in x64. The other differences look trivial to my
 eyes.
When I compile it, it uses the registers:

L2E:		inc	R11
		lea	R9D,[00h][RSI*2]
		mov	R9,R9
		mov	RCX,R11
		imul	RCX,RSI
		mov	RSI,RCX
		add	RDI,R9
		add	R8,2
		mov	RDX,R8
		imul	RDX,RDI
		mov	RDI,RDX
		mov	R10,R8
		imul	R10,RBX
		mov	RBX,R10
		cmp	RDI,RSI
		jl	L2E
		lea	RAX,[RCX*2][RCX]
		add	RAX,RDX
		mov	-8[RBP],RAX
		cqo
		idiv	R10
		mov	R9,RAX
		mov	R9,R9
		mov	RAX,-8[RBP]
		cqo
		idiv	R10
		mov	R12,RDX
		mov	R12,R12
		add	R12,RSI
		cmp	RBX,R12
		jle	L2E
		lea	R14,[R14*4][R14]
		add	R14,R14
		add	R14,R9
		mov	R14,R14
		inc	R13
		mov	RAX,R13
		mov	RCX,0Ah
		cqo
		idiv	RCX
		test	RDX,RDX
		jne	LBD
		xor	R14,R14
LBD:		cmp	R13,02710h
		jge	LE3
		mov	RDX,RBX
		imul	RDX,R9
		sub	RDI,RDX
		imul	R10D,RDI,0Ah
		mov	RDI,R10
		imul	R12D,RSI,0Ah
		mov	RSI,R12
		jmp	L2E

All I did with your example was replace BigInt with long.
Aug 02 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
Walter Bright wrote:
 All I did with your example was replace BigInt with long.
hmm.... this is my error, but might be a bug too.

Take that same program and add some inline asm to it:

void main() {
	asm { nop; }

	[... the rest is identical ...]
}

Now compile it and check the output. With the asm, I get the output I posted. If I cut it out, I get what you posted.

My error here is when I did the obj2asm the first time, I added an instruction inline so I could confirm quickly that I was in the right place in the file. (I cut that out later but forgot to rerun obj2asm.)
Aug 02 2011
parent reply Brad Roberts <braddr slice-2.puremagic.com> writes:
Ok.. I'm pretty sure that's a bug I discovered the other day in the 
initialization code of asm blocks.  I've already got a fix for it and will 
be sending a pull request shortly.

The asm semantic code calls the 32bit initialization code of the backend 
unconditionally, which is just wrong.

On Tue, 2 Aug 2011, Adam D. Ruppe wrote:

 Walter Bright wrote:
 All I did with your example was replace BigInt with long.
 hmm.... this is my error, but might be a bug too. Take that same
 program and add some inline asm to it. [...] With the asm, I get the
 output I posted. If I cut it out, I get what you posted.
Aug 02 2011
parent Brad Roberts <braddr slice-2.puremagic.com> writes:
https://github.com/D-Programming-Language/dmd/pull/287

Before pulling this, though, the current win32 compilation failure should 
be fixed to avoid compounding problems:

  https://github.com/D-Programming-Language/dmd/pull/288

Later,
Brad

On Tue, 2 Aug 2011, Brad Roberts wrote:

 Ok.. I'm pretty sure that's a bug I discovered the other day in the 
 initialization code of asm blocks.  I've already got a fix for it and will 
 be sending a pull request shortly.
 
 The asm semantic code calls the 32bit initialization code of the backend 
 unconditionally, which is just wrong.
 
 On Tue, 2 Aug 2011, Adam D. Ruppe wrote:
 
 Walter Bright wrote:
 All I did with your example was replace BigInt with long.
 hmm.... this is my error, but might be a bug too. Take that same
 program and add some inline asm to it. [...]
Aug 02 2011
prev sibling parent reply Trass3r <un known.com> writes:
Am 02.08.2011, 22:38 Uhr, schrieb Walter Bright  
<newshound2 digitalmars.com>:

 L2E:            inc     R11
                  lea     R9D,[00h][RSI*2]
                  mov     R9,R9
...
                  mov     R9,RAX
                  mov     R9,R9
...
                  mov     R12,RDX
                  mov     R12,R12
...
                  lea     R14,[R14*4][R14]
                  add     R14,R14
                  add     R14,R9
                  mov     R14,R14
...

Any reason for all those mov x,x's?
Aug 02 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 3:23 PM, Trass3r wrote:
 Any reason for all those mov x,x 's?
No. They'll get removed shortly.

I see three problems with dmd's codegen here:

1. those redundant moves
2. failing to merge a couple divides
3. replacing a mul with an add/lea

I'll see about taking care of them. (2) is the most likely culprit on the speed.
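For readers following along: problem (2) is visible in the listings above, where (n*3 + a)/d and (n*3 + a)%d each get their own cqo/idiv pair, even though a single idiv leaves the quotient in RAX and the remainder in RDX at the same time. A sketch of the same merge at the source level, in Python (the concrete values are just sample numbers, not from the thread):

```python
# Sample operand values; any n, a, d would do.
n, a, d = 240, 450, 945
x = n * 3 + a

t = x // d   # first division (one idiv)
u = x % d    # second division of the same operands (a second idiv)

# Merged: one divmod produces both results, like a single x86 idiv
# filling RAX (quotient) and RDX (remainder) in one instruction.
t2, u2 = divmod(x, d)

assert (t, u) == (t2, u2)
print(t2, u2)  # 1 225
```

The optimizer's job here is to recognize that the quotient and remainder share operands and keep the RDX result of the first idiv instead of dividing again.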
Aug 02 2011
prev sibling parent Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Tue, Aug 2, 2011 at 7:08 AM, Adam D. Ruppe <destructionator gmail.com> wrote:

 LDC builds in under a half hour, even on my underpowered ARM SoC,
 so I don't see how you could be having trouble there.
Building dmd from the zip took 37 *seconds* for me just now, after running a make clean (this is on Linux). gdc and ldc have their advantages, but they have disadvantages too. I think the people saying "abandon dmd" don't know the other side of the story. Basically, I think the more compilers we have for D the better. gdc is good. ldc is good. And so is dmd. We shouldn't abandon any of them.
For the record, I'm fine with the current arrangement and just playing devil's advocate here:

So far, you've outlined that GDC takes a while to build and that the build processes for GDC and LDC are inconvenient as the only disadvantages they have. LDC took about 3 minutes on a Linux VM on my laptop, and since it has proper incremental build support through CMake, I don't really see that qualifying as a disadvantage. The only people that really need to regularly build compilers are the folks that work on them, and that's why we have incremental builds.

Now, DMD does have speed on its side. It doesn't have debugging support (you have to jump through hoops on Windows, and Linux is just a joke), binary and object file compatibility (even GDC has more going for it on Windows than DMD does), platform compatibility (outside x86 and x86_64), name recognition (I'm a college student, and people look at me funny when I mention Digital Mars), shared library support, or acceptance in the Linux world.

The reason I use GDC for pretty much all my development is that it has all those things, and the reason I think it's worth playing devil's advocate and really considering the current situation is that GDC and LDC get all this for free by wiring up the DMD frontend to a different backend.

The current state of affairs is certainly maintainable, but I think it's worth some thought as to whether it would be better in the long run if we started officially supporting a more accepted backend. My example would be Go, which got all sorts of notice when gccgo became important enough to get into the GCC codebase.

I'm not saying DMD is terrible, because it isn't. I'm just saying that there are a lot of benefits to be had by developing a more mature compiler on top of GCC or LLVM, and that we should consider whether that's a goal we should be working more towards.
Aug 02 2011
prev sibling parent Trass3r <un known.com> writes:
Am 02.08.2011, 05:38 Uhr, schrieb Adam D. Ruppe  
<destructionator gmail.com>:
 I was waiting over an hour just for gcc+gdc to compile! In the
 time it takes for gcc's configure script to run, you can make
 clean, build dmd, druntime and phobos.
Make sure you disable bootstrapping. Compiling gdc works pleasantly fast for me. Try compiling it on Windows; that's what I call slow.
Aug 02 2011
prev sibling next sibling parent reply Jason House <jason.james.house gmail.com> writes:
The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.

Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 
 Anyone care to examine the assembler output and figure out why?
Aug 02 2011
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 5:00 AM, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
 -release". There may be extra flags that are required.
Often when I see benchmark results like this, I wait to see what the actual problem is before jumping to conclusions. I have a lot of experience with this :-)

The results could be any of:

1. wrong flags used (especially by inexperienced users)

2. the benchmark isn't measuring what it purports to be (an example might be it is actually measuring printf or malloc speed, not the generated code)

3. the benchmark is optimized for one particular compiler/language by someone very familiar with that compiler/language and it exploits a particular quirk of it

4. the compiler is hand optimized for a specific benchmark, and the great results disappear if anything in the source code changes (yes, this is dirty, and I've seen it done by big name compilers)

5. the different benchmarks are run on different computers

6. the memory layout could wind up arbitrarily different for the different compilers/languages, resulting in different performance due to memory caching etc.
Aug 02 2011
prev sibling parent reply KennyTM~ <kennytm gmail.com> writes:
On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.

 Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

 Anyone care to examine the assembler output and figure out why?
Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
Aug 02 2011
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On the flags: I did use them, but didn't write it all out and
tried to make them irrelevant (by avoiding functions and arrays).

But, if the same ones are passed to each compiler, it shouldn't
matter anyway... the idea is to get an apples to apples comparison
between the two D implementations, not to chase after a number itself.
Aug 02 2011
prev sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from KennyTM~ (kennytm gmail.com)'s article
 On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.
 Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

 Anyone care to examine the assembler output and figure out why?
Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
-Ofast sounds better. ;)
Aug 02 2011
parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Tue, Aug 2, 2011 at 1:31 PM, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 == Quote from KennyTM~ (kennytm gmail.com)'s article
 On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline
-noboundscheck -release". There may be extra flags that are required.
 Walter Bright Wrote:


http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 Anyone care to examine the assembler output and figure out why?
 Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
 -Ofast sounds better. ;)
-O9001 will make the Redditors happy.
Aug 02 2011
parent simendsjo <simendsjo gmail.com> writes:
On 02.08.2011 22:36, Andrew Wiley wrote:
 On Tue, Aug 2, 2011 at 1:31 PM, Iain Buclaw <ibuclaw ubuntu.com> wrote:

     == Quote from KennyTM~ (kennytm gmail.com)'s article
      > On Aug 2, 11 20:00, Jason House wrote:
      > > The post says they did "dmd -O". They did not mention "-inline
     -noboundscheck
     -release". There may be extra flags that are required.
      > >
      > > Walter Bright Wrote:
      > >
      > >>
     http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
      > >>
      > >> Anyone care to examine the assembler output and figure out why?
      > >
      > Let dmd have an '-O9999' flag as a synonym of '-O -inline
      > -noboundscheck -release' so people won't miss the extra flags in
      > benchmarks. [/joke]


 -O9001 will make the Redditors happy.


     -Ofast sounds better. ;)
How about replacing -w with -9001?
http://en.wikipedia.org/wiki/ISO_9001#Contents_of_ISO_9001
Aug 02 2011
prev sibling parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 02/08/2011 00:40, Walter Bright wrote:
 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98


 Anyone care to examine the assembler output and figure out why?
I was talking to David Nadlinger the other day, and there was some sort of codegen bug causing things to massively outperform dmd and clang with equivalent code - it's possible this is the cause, I don't know without looking though. He may be able to shed some light on it.

-- 
Robert
http://octarineparrot.com/
Aug 02 2011
parent David Nadlinger <see klickverbot.at> writes:
On 8/2/11 7:34 PM, Robert Clipsham wrote:
 On 02/08/2011 00:40, Walter Bright wrote:
 Anyone care to examine the assembler output and figure out why?
I was talking to David Nadlinger the other day, and there was some sort of codegen bug causing things to massively outperform dmd and clang with equivalent code - it's possible this is the cause, I don't know without looking though. He may be able to shed some light on it.
Nope, this turned out to be a bug in my program, where some memory chunk used as test input data was prematurely garbage collected (that only surfaced with aggressive compiler optimizations, which is why I suspected a compiler bug).

David
Aug 02 2011