
digitalmars.D - pi benchmark on ldc and dmd

reply Walter Bright <newshound2 digitalmars.com> writes:
http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

Anyone care to examine the assembler output and figure out why?
Aug 01 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 
 Anyone care to examine the assembler output and figure out why?
Do you mean code similar to this one, or code that uses std.bigint?
http://shootout.alioth.debian.org/debian/program.php?test=pidigits&lang=gdc&id=3

Bye,
bearophile
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 Do you mean code similar to this one, or code that uses std.bigint?
It was something that used bigint. I whipped it up myself earlier this morning, but left the code on my laptop. I'll post it when I have a chance. I ran obj2asm on it myself, but was a little short on time, so I haven't really analyzed it yet.
Aug 01 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:
 
 It was something that used bigint. I whipped it up myself earlier
 this morning, but left the code on my laptop. I'll post it when
 I have a chance.
OK. In such situations it's never enough to compare the D code compiled with DMD to the D code compiled with LDC. You also need a reference point, like a C version compiled with GCC (here using GMP bignums). Such reference points are necessary to anchor performance discussions to something.

Bye,
bearophile
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 In such situations it's never enough to compare the D code compiled
 with DMD to the D code compiled with LDC. You also need a reference
 point, like a C version compiled with GCC (here using GMP bignums).
 Such reference points are necessary to anchor performance
 discussions to something.
Actually, I don't think that would be relevant here. The thread started with someone saying the DMD backend is garbage and should be abandoned. I'm sick and tired of hearing people say that. The Digital Mars code has many, many advantages over the others*.

But, it was challenged specifically on the optimizer, so to check that out, I wanted all other things to be equal: same code, same front end, same computer, as close to the same runtime and library as possible with different compilers. The only difference should be the backend, so we can draw conclusions about it without other factors skewing the results.

So for this, I just wanted to compare the dmd backend to the ldc and gdc backends, and I didn't worry too much about absolute numbers or other languages.

(Actually, one of the reasons I picked the pi one was that after the embarrassing defeat in floating point, I was hoping dmd could score a second victory and I could follow up on that "prove it" post with satisfaction. Alas, the facts didn't work out that way. Though, I still do find dmd to beat g++ on a lot of real world code - things like slices actually make a sizable difference.)

But regardless, it was just about comparing backends, not doing language comparisons.

===

* To name a huge one: today was the first time I ever got ldc or gdc to actually work on my computer, and it took a long, long time to do it. I've tried in the past, and failed, so this was a triumph. Big success.

I was waiting over an hour just for gcc+gdc to compile! In the time it takes for gcc's configure script to run, you can make clean, build dmd, druntime, and phobos.

It's a huge hassle to get the code together too. I had to go to *four* different sites to get gdc's stuff together (like 80 MB of crap, compressed!), and two different ones to get even the ldc binary to work. Pain in my ASS.

And this is on Linux too. I pity the fool who tries to do this on Windows, knowing how so much linux software treats their Windows "ports".

I'd like to contrast to dmd: unzip and play with wild abandon.
Aug 01 2011
next sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Mon, Aug 1, 2011 at 8:38 PM, Adam D. Ruppe <destructionator gmail.com> wrote:

 Actually, I don't think that would be relevant here. [...]
 I'd like to contrast to dmd: unzip and play with wild abandon.
Yes, GDC takes forever and a half to build. That's true of anything in GCC, and it's just because they don't trust the native C compiler at all. LDC builds in under a half hour, even on my underpowered ARM SoC, so I don't see how you could be having trouble there.

As for Windows, Daniel Green (hopefully I'm remembering right) has been posting GDC binaries.

I do respect that DMD generates reasonably fast executables recklessly fast, but it also doesn't exist outside x86 and x86_64, and the debug symbols (at least on Linux) are just hilariously bad.

Now if I could just get GDC to pad structs correctly on ARM...
Aug 01 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
 LDC builds in under a half hour, even on my underpowered ARM SoC,
 so I don't see how you could be having trouble there.
Building dmd from the zip took 37 *seconds* for me just now, after running a make clean (this is on Linux).

gdc and ldc have their advantages, but they have disadvantages too. I think the people saying "abandon dmd" don't know the other side of the story.

Basically, I think the more compilers we have for D the better. gdc is good. ldc is good. And so is dmd. We shouldn't abandon any of them.
Aug 02 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
I think I have it: 64 bit registers. I got ldc to work
in 32 bit (didn't have that yesterday, so I was doing 64 bit only)
and compiled.

No difference in timing between ldc 32 bit and dmd 32 bit.
The disassembly isn't identical but the time is. (The disassembly
seems to mainly order things differently, but ldc has fewer jump
instructions too.)

Anyway.

In 64 bit, ldc gets a speedup over dmd. Looking at the asm
output, it looks like dmd doesn't use any of the new registers,
whereas ldc does. (dmd's 64 bit looks mostly like 32 bit code with
r instead of e.)


Here's the program. It's based on one of the Python ones.

====
import std.bigint;
import std.stdio;

alias BigInt number;

void main() {
	auto N = 10000;

	number i, k, ns;
	number k1 = 1;
	number n,a,d,t,u;
	n = 1;
	d = 1;
	while(1) {
		k += 1;
		t = n<<1;
		n *= k;
		a += t;
		k1 += 2;
		a *= k1;
		d *= k1;
		if(a >= n) {
			t = (n*3 +a)/d;
			u = (n*3 +a)%d;
			u += n;
			if(d > u) {
				ns = ns*10 + t;
				i += 1;
				if(i % 10 == 0) {
					debug writefln ("%010d\t:%d", ns, i);
					ns = 0;
				}
				if(i >= N) {
					break;
				}
				a -= d*t;
				a *= 10;
				n *= 10;
			}
		}
	}
}
=====
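For readers who want to run the same algorithm outside D: the loop above can be followed line for line in Python, whose ints are arbitrary-precision like BigInt. This is an illustrative sketch, not the exact Python shootout program the D code was ported from; the function name pidigits and the list-of-groups return value are inventions for this example:

```python
def pidigits(N):
    """Unbounded spigot for pi, mirroring the D loop above.

    Returns the digits packed into ten-digit integers, the same
    grouping the D code prints with writefln.
    """
    groups = []
    i = ns = k = 0
    k1 = 1
    n = d = 1
    a = 0
    while True:
        k += 1
        t = n << 1
        n *= k
        a += t
        k1 += 2
        a *= k1
        d *= k1
        if a >= n:
            # One divmod gives both the quotient t and remainder u
            # that the D code computes with separate / and %.
            t, u = divmod(n * 3 + a, d)
            u += n
            if d > u:
                ns = ns * 10 + t
                i += 1
                if i % 10 == 0:
                    groups.append(ns)
                    ns = 0
                if i >= N:
                    return groups
                a -= d * t
                a *= 10
                n *= 10

print(pidigits(10))  # [3141592653]
```

Running it for 10 digits yields the first ten digits of pi packed into one integer, matching what the D program's debug writefln would print.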

BigInt's calls aren't inlined, but that's a frontend issue. Let's
eliminate that by switching to long in that alias.

The result will be wrong, but that's beside the point for now. I
just want to see integer math. (this is why the writefln is debug
too)

With optimizations turned on, ldc again wins by the same ratio -
it runs in about 2/3 the time - and the code is much easier to look
at.


Let's see what's going on.


The relevant loop from DMD (64 bit):

===
L47:		inc	qword ptr -040h[RBP]
		mov	RAX,-028h[RBP]
		add	RAX,RAX
		mov	-010h[RBP],RAX
		mov	RAX,-040h[RBP]
		imul	RAX,-028h[RBP]
		mov	-028h[RBP],RAX
		mov	RAX,-010h[RBP]
		add	-020h[RBP],RAX
		add	qword ptr -030h[RBP],2
		mov	RAX,-030h[RBP]
		imul	RAX,-020h[RBP]
		mov	-020h[RBP],RAX
		mov	RAX,-030h[RBP]
		imul	RAX,-018h[RBP]
		mov	-018h[RBP],RAX
		mov	RAX,-020h[RBP]
		cmp	RAX,-028h[RBP]
		jl	L47
		mov	RAX,-028h[RBP]
		lea	RAX,[RAX*2][RAX]
		add	RAX,-020h[RBP]
		mov	-058h[RBP],RAX
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-010h[RBP],RAX
		mov	RAX,-058h[RBP]
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-8[RBP],RDX
		mov	RAX,-028h[RBP]
		add	-8[RBP],RAX
		mov	RAX,-018h[RBP]
		cmp	RAX,-8[RBP]
		jle	L47
		mov	RAX,-038h[RBP]
		lea	RAX,[RAX*4][RAX]
		add	RAX,RAX
		add	RAX,-010h[RBP]
		mov	-038h[RBP],RAX
		inc	qword ptr -048h[RBP]
		mov	RAX,-048h[RBP]
		mov	RCX,0Ah
		cqo
		idiv	RCX
		test	RDX,RDX
		jne	L109
		mov	qword ptr -038h[RBP],0
L109:		cmp	qword ptr -048h[RBP],02710h
		jge	L137
		mov	RAX,-018h[RBP]
		imul	RAX,-010h[RBP]
		sub	-020h[RBP],RAX
		imul	EAX,-020h[RBP],0Ah
		mov	-020h[RBP],RAX
		imul	EAX,-028h[RBP],0Ah
		mov	-028h[RBP],RAX
		jmp	  L47
===


and from ldc 64 bit:

====
L20:		add	RDI,2
		inc	RCX
		lea	R9,[R10*2][R9]
		imul	R9,RDI
		imul	R8,RDI
		imul	R10,RCX
		cmp	R9,R10
		jl	L20
		lea	RAX,[R10*2][R10]
		add	RAX,R9
		cqo
		idiv	R8
		add	RDX,R10
		cmp	R8,RDX
		jle	L20
		cmp	RSI,0270Fh
		jg	L73
		imul	RAX,R8
		sub	R9,RAX
		add	R9,R9
		lea	R9,[R9*4][R9]
		inc	RSI
		add	R10,R10
		lea	R10,[R10*4][R10]
		jmp short	L20
===


First thing that immediately pops out is the code is a lot shorter.
Second thing that jumps out is it looks like ldc makes better use
of the registers. Indeed, the shortness looks to be thanks to
the registers eliminating a lot of movs.



So I'm pretty sure the difference is caused by dmd not using the
new registers in x64. The other differences look trivial to my
eyes.
Aug 02 2011
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 12:49 PM, Adam D. Ruppe wrote:
 So I'm pretty sure the difference is caused by dmd not using the
 new registers in x64. The other differences look trivial to my
 eyes.
dmd does use all the registers on the x64, but it seems to not be enregistering here. I'll have a look see.
Aug 02 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 Here's the program. It's based on one of the Python ones.
The D code is about 2.8 times slower than the Haskell version, and it has a bug, shown here:

import std.stdio, std.bigint;
void main() {
    int x = 100;
    writefln("%010d", x);
    BigInt bx = x;
    writefln("%010d", bx);
}

Output:

0000000100
100

----------------------------

The Haskell code I've used:

-- Compile with: ghc --make -O3 -XBangPatterns -rtsopts pidigits_hs.hs
import System

 i%ds
  | i >= n = []
  | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j%t
  where k = i+10; j = min n k
        (h,t) | k > n = (take (n`mod`10) ds ++ replicate (k-n) " ",[])
              | True = splitAt 10 ds

  where k = j+1; t (n,a,d)=k&s; (q,r)=(n*3+a)`divMod`d
 j&(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1)

main = putStr.pidgits.read.head =<< getArgs

Bye,
bearophile
Aug 02 2011
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 02.08.2011, 22:35 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:


  i%ds
   | i >= n = []
   | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j%t
   where k = i+10; j = min n k
         (h,t) | k > n = (take (n`mod`10) ds ++ replicate (k-n) " ",[])
               | True = splitAt 10 ds


   where k = j+1; t (n,a,d)=k&s; (q,r)=(n*3+a)`divMod`d
  j&(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1)

 main = putStr.pidgits.read.head =<< getArgs
Is this Indonesian cast to ASCII? :p
Aug 02 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Marco Leise:

Is this Indonesian cast to ASCII? :p
I agree it's very bad looking; it isn't idiomatic Haskell code. But it contains nothing too strange (and the algorithm is the same one used in the D code). This is formatted a bit better, though I don't fully understand it yet:

import System (getArgs)

i % ds
  | i >= n = []
  | True = (concat h ++ "\t:" ++ show j ++ "\n") ++ j % t
  where
    k = i + 10
    j = min n k
    (h, t)
      | k > n = (take (n `mod` 10) ds ++ replicate (k - n) " ", [])
      | True = splitAt 10 ds

  where
    k = j + 1
    t (n, a, d) = k & s
    (q, r) = (n * 3 + a) `divMod` d

j & (n, a, d) = (n * j, (a + n * 2) * y, d * y)
  where y = (j * 2 + 1)

main = putStr . pidgits . read . head =<< getArgs

The Shootout site (where I have copied that code) ranks programs by performance and by compactness (measured with a low-performance compressor...), so there you see Haskell (and other language) programs that are sometimes too compact and often use clever tricks to increase their performance. You don't find those tricks in normal Haskell code (this specific program seems not to use strange tricks, but on the Haskell Wiki page about this problem, http://www.haskell.org/haskellwiki/Shootout/Pidigits , you can see several programs that are both longer and slower than this one).

The first working implementation of a C program is usually long but fast enough, while the first working implementation of a Haskell program is often short but not so fast; there are usually ways to speed the Haskell code up later. My experience of Haskell is limited, so when I write some Haskell my head usually hurts a bit :-)

The higher-level nature of Python lets me implement working algorithms that are more complex, so sometimes the code ends up faster than C code, where at a first implementation you often avoid very complex algorithms for fear of hard-to-find bugs, or just because they take too long to write. Haskell in theory allows you to implement complex algorithms in a short space, and safely. In practice I think it takes a lot of brainpower to do this. Haskell sometimes looks like a puzzle language to me (maybe I just need more self-training in functional programming).

Bye,
bearophile
Aug 02 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 12:49 PM, Adam D. Ruppe wrote:
 So I'm pretty sure the difference is caused by dmd not using the
 new registers in x64. The other differences look trivial to my
 eyes.
When I compile it, it uses the registers:

L2E:		inc	R11
		lea	R9D,[00h][RSI*2]
		mov	R9,R9
		mov	RCX,R11
		imul	RCX,RSI
		mov	RSI,RCX
		add	RDI,R9
		add	R8,2
		mov	RDX,R8
		imul	RDX,RDI
		mov	RDI,RDX
		mov	R10,R8
		imul	R10,RBX
		mov	RBX,R10
		cmp	RDI,RSI
		jl	L2E
		lea	RAX,[RCX*2][RCX]
		add	RAX,RDX
		mov	-8[RBP],RAX
		cqo
		idiv	R10
		mov	R9,RAX
		mov	R9,R9
		mov	RAX,-8[RBP]
		cqo
		idiv	R10
		mov	R12,RDX
		mov	R12,R12
		add	R12,RSI
		cmp	RBX,R12
		jle	L2E
		lea	R14,[R14*4][R14]
		add	R14,R14
		add	R14,R9
		mov	R14,R14
		inc	R13
		mov	RAX,R13
		mov	RCX,0Ah
		cqo
		idiv	RCX
		test	RDX,RDX
		jne	LBD
		xor	R14,R14
LBD:		cmp	R13,02710h
		jge	LE3
		mov	RDX,RBX
		imul	RDX,R9
		sub	RDI,RDX
		imul	R10D,RDI,0Ah
		mov	RDI,R10
		imul	R12D,RSI,0Ah
		mov	RSI,R12
		jmp	L2E

All I did with your example was replace BigInt with long.
Aug 02 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
Walter Bright wrote:
 All I did with your example was replace BigInt with long.
hmm.... this is my error, but might be a bug too.

Take that same program and add some inline asm to it:

void main() {
	asm { nop; }

	[... the rest is identical ...]
}

Now compile it and check the output. With the asm, I get the output I posted. If I cut it out, I get what you posted.

My error here is when I did the obj2asm the first time, I added an instruction inline so I could confirm quickly that I was in the right place in the file. (I cut that out later but forgot to rerun obj2asm.)
Aug 02 2011
parent reply Brad Roberts <braddr slice-2.puremagic.com> writes:
Ok.. I'm pretty sure that's a bug I discovered the other day in the 
initialization code of asm blocks.  I've already got a fix for it and will 
be sending a pull request shortly.

The asm semantic code calls the 32bit initialization code of the backend 
unconditionally, which is just wrong.

On Tue, 2 Aug 2011, Adam D. Ruppe wrote:

 Walter Bright wrote:
 All I did with your example was replace BigInt with long.
 hmm.... this is my error, but might be a bug too. Take that same
 program and add some inline asm to it. [...] With the asm, I get the
 output I posted. If I cut it out, I get what you posted.
Aug 02 2011
parent Brad Roberts <braddr slice-2.puremagic.com> writes:
https://github.com/D-Programming-Language/dmd/pull/287

Before pulling this, though, the current win32 compilation failure should 
be fixed to avoid compounding problems:

  https://github.com/D-Programming-Language/dmd/pull/288

Later,
Brad

On Tue, 2 Aug 2011, Brad Roberts wrote:

 Ok.. I'm pretty sure that's a bug I discovered the other day in the 
 initialization code of asm blocks.  I've already got a fix for it and will 
 be sending a pull request shortly.
 
 The asm semantic code calls the 32bit initialization code of the backend 
 unconditionally, which is just wrong.
 
 On Tue, 2 Aug 2011, Adam D. Ruppe wrote:
 
 Walter Bright wrote:
 All I did with your example was replace BigInt with long.
 hmm.... this is my error, but might be a bug too. Take that same
 program and add some inline asm to it. [...]
Aug 02 2011
prev sibling parent reply Trass3r <un known.com> writes:
Am 02.08.2011, 22:38 Uhr, schrieb Walter Bright  
<newshound2 digitalmars.com>:

 L2E:            inc     R11
                  lea     R9D,[00h][RSI*2]
                  mov     R9,R9
...
                  mov     R9,RAX
                  mov     R9,R9
...
                  mov     R12,RDX
                  mov     R12,R12
...
                  lea     R14,[R14*4][R14]
                  add     R14,R14
                  add     R14,R9
                  mov     R14,R14
...

Any reason for all those mov x,x's?
Aug 02 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 3:23 PM, Trass3r wrote:
 Any reason for all those mov x,x 's?
No. They'll get removed shortly.

I see three problems with dmd's codegen here:

1. those redundant moves
2. failing to merge a couple divides
3. replacing a mul with an add/lea

I'll see about taking care of them. (2) is the most likely culprit on the speed.
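For readers following along: problem (2) is visible in the listings above, where (n*3 + a)/d and (n*3 + a)%d each get their own cqo/idiv pair, even though a single idiv leaves the quotient in RAX and the remainder in RDX at the same time. A sketch of the same merge at the source level, in Python (the concrete values are just sample numbers, not from the thread):

```python
# Sample operand values; any n, a, d would do.
n, a, d = 240, 450, 945
x = n * 3 + a

t = x // d   # first division (one idiv)
u = x % d    # second division of the same operands (a second idiv)

# Merged: one divmod produces both results, like a single x86 idiv
# filling RAX (quotient) and RDX (remainder) in one instruction.
t2, u2 = divmod(x, d)

assert (t, u) == (t2, u2)
print(t2, u2)  # 1 225
```

The optimizer's job here is to recognize that the quotient and remainder share operands and keep the RDX result of the first idiv instead of dividing again.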
Aug 02 2011
prev sibling parent Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Tue, Aug 2, 2011 at 7:08 AM, Adam D. Ruppe <destructionator gmail.com> wrote:

 LDC builds in under a half hour, even on my underpowered ARM SoC,
 so I don't see how you could be having trouble there.
Building dmd from the zip took 37 *seconds* for me just now, after running a make clean (this is on Linux). gdc and ldc have their advantages, but they have disadvantages too. I think the people saying "abandon dmd" don't know the other side of the story. Basically, I think the more compilers we have for D the better. gdc is good. ldc is good. And so is dmd. We shouldn't abandon any of them.
For the record, I'm fine with the current arrangement and just playing devil's advocate here:

So far, you've outlined that GDC takes a while to build and that the build processes for GDC and LDC are inconvenient as the only disadvantages they have. LDC took about 3 minutes on a Linux VM on my laptop, and since it has proper incremental build support through CMake, I don't really see that qualifying as a disadvantage. The only people that really need to regularly build compilers are the folks that work on them, and that's why we have incremental builds.

Now, DMD does have speed on its side. It doesn't have debugging support (you have to jump through hoops on Windows, and Linux is just a joke), binary and object file compatibility (even GDC has more going for it on Windows than DMD does), platform compatibility (outside x86 and x86_64), name recognition (I'm a college student, and people look at me funny when I mention Digital Mars), shared library support, or acceptance in the Linux world.

The reason I use GDC for pretty much all my development is that it has all those things, and the reason I think it's worth playing devil's advocate and really considering the current situation is that GDC and LDC get all this for free by wiring up the DMD frontend to a different backend.

The current state of affairs is certainly maintainable, but I think it's worth some thought as to whether it would be better in the long run if we started officially supporting a more accepted backend. My example would be Go, which got all sorts of notice when gccgo became important enough to get into the GCC codebase.

I'm not saying DMD is terrible, because it isn't. I'm just saying that there are a lot of benefits to be had by developing a more mature compiler on top of GCC or LLVM, and that we should consider whether that's a goal we should be working more towards.
Aug 02 2011
prev sibling parent Trass3r <un known.com> writes:
Am 02.08.2011, 05:38 Uhr, schrieb Adam D. Ruppe  
<destructionator gmail.com>:
 I was waiting over an hour just for gcc+gdc to compile! In the
 time it takes for gcc's configure script to run, you can make
 clean, build dmd, druntime and phobos.
Make sure you disable bootstrapping. Compiling gdc works pleasantly fast for me. Try compiling it on Windows; that's what I call slow.
Aug 02 2011
prev sibling next sibling parent reply Jason House <jason.james.house gmail.com> writes:
The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.

Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 
 Anyone care to examine the assembler output and figure out why?
Aug 02 2011
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2011 5:00 AM, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
 -release". There may be extra flags that are required.
Often when I see benchmark results like this, I wait to see what the actual problem is before jumping to conclusions. I have a lot of experience with this :-)

The results could be any of:

1. wrong flags used (especially by inexperienced users)

2. the benchmark isn't measuring what it purports to be (an example might be it is actually measuring printf or malloc speed, not the generated code)

3. the benchmark is optimized for one particular compiler/language by someone very familiar with that compiler/language and it exploits a particular quirk of it

4. the compiler is hand optimized for a specific benchmark, and the great results disappear if anything in the source code changes (yes, this is dirty, and I've seen it done by big name compilers)

5. the different benchmarks are run on different computers

6. the memory layout could wind up arbitrarily different for the different compilers/languages, resulting in different performance due to memory caching etc.
Aug 02 2011
prev sibling parent reply KennyTM~ <kennytm gmail.com> writes:
On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.

 Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

 Anyone care to examine the assembler output and figure out why?
Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
Aug 02 2011
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On the flags: I did use them, but didn't write it all out and
tried to make them irrelevant (by avoiding functions and arrays).

But, if the same ones are passed to each compiler, it shouldn't
matter anyway... the idea is to get an apples to apples comparison
between the two D implementations, not to chase after a number itself.
Aug 02 2011
prev sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from KennyTM~ (kennytm gmail.com)'s article
 On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline -noboundscheck
-release". There may be extra flags that are required.
 Walter Bright Wrote:

 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98

 Anyone care to examine the assembler output and figure out why?
Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
-Ofast sounds better. ;)
Aug 02 2011
parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Tue, Aug 2, 2011 at 1:31 PM, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 == Quote from KennyTM~ (kennytm gmail.com)'s article
 On Aug 2, 11 20:00, Jason House wrote:
 The post says they did "dmd -O". They did not mention "-inline
-noboundscheck -release". There may be extra flags that are required.
 Walter Bright Wrote:


http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
 Anyone care to examine the assembler output and figure out why?
 Let dmd have an '-O9999' flag as a synonym of '-O -inline -noboundscheck -release' so people won't miss the extra flags in benchmarks. [/joke]
 -Ofast sounds better. ;)
-O9001 will make the Redditors happy.
Aug 02 2011
parent simendsjo <simendsjo gmail.com> writes:
On 02.08.2011 22:36, Andrew Wiley wrote:
 On Tue, Aug 2, 2011 at 1:31 PM, Iain Buclaw <ibuclaw ubuntu.com> wrote:

     == Quote from KennyTM~ (kennytm gmail.com)'s article
      > On Aug 2, 11 20:00, Jason House wrote:
      > > The post says they did "dmd -O". They did not mention "-inline
     -noboundscheck
     -release". There may be extra flags that are required.
      > >
      > > Walter Bright Wrote:
      > >
      > >>
     http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98
      > >>
      > >> Anyone care to examine the assembler output and figure out why?
      > >
      > Let dmd have an '-O9999' flag as a synonym of '-O -inline
      > -noboundscheck -release' so people won't miss the extra flags in
      > benchmarks. [/joke]


 -O9001 will make the Redditors happy.


     -Ofast sounds better. ;)
How about replacing -w with -9001?
http://en.wikipedia.org/wiki/ISO_9001#Contents_of_ISO_9001
Aug 02 2011
prev sibling parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 02/08/2011 00:40, Walter Bright wrote:
 http://www.reddit.com/r/programming/comments/j48tf/how_is_c_better_than_d/c29do98


 Anyone care to examine the assembler output and figure out why?
I was talking to David Nadlinger the other day, and there was some sort of codegen bug causing things to massively outperform dmd and clang with equivalent code - it's possible this is the cause, I don't know without looking though. He may be able to shed some light on it.

-- 
Robert
http://octarineparrot.com/
Aug 02 2011
parent David Nadlinger <see klickverbot.at> writes:
On 8/2/11 7:34 PM, Robert Clipsham wrote:
 On 02/08/2011 00:40, Walter Bright wrote:
 Anyone care to examine the assembler output and figure out why?
I was talking to David Nadlinger the other day, and there was some sort of codegen bug causing things to massively outperform dmd and clang with equivalent code - it's possible this is the cause, I don't know without looking though. He may be able to shed some light on it.
Nope, this turned out to be a bug in my program, where some memory chunk used as test input data was prematurely garbage collected (that only surfaced with aggressive compiler optimizations, which is why I suspected a compiler bug).

David
Aug 02 2011