digitalmars.D - It seems like DMD doesn't optimize tail recursion very well

Akhropov (59/59) Jun 09 2006 Looking through tests on http://shootout.alioth.debian.org/ website I no...

James Pelcis (64/64) Jun 09 2006 You seem to be right. Just to make a fairer test, I made both versions

sailormoontw (20/20) Jun 09 2006 It takes twice the time for D to do the same job...
pragma (17/21) Jun 09 2006 I can confirm. Using your source code, comparing DMC to DMD, I got:

Deewiant (2/16) Jun 09 2006 Looking at your numbers, D was actually precisely 313657318 clock cycles...

pragma (3/19) Jun 09 2006 Oh, my bad. Thanks for catching that. :)

AKhropov (21/45) Jun 09 2006 Did your turn maximum optimization on?

pragma (45/46) Jun 09 2006 As a matter of fact, I did not. Sure enough, with "-O -inline release"

Bruno Medeiros (5/56) Jun 11 2006 What are those NOPs for, if I may ask?

Jarrett Billingsley (5/6) Jun 11 2006 Usually for alignment, so that the instructions pair better in the pipel...

James Dunne (6/18) Jun 11 2006 A MOV reg, reg instruction is encoded as a single byte, so to pad it

pragma (3/7) Jun 12 2006 I inserted them manually into the routine. They're there so I could eas...

pragma (141/141) Jun 09 2006 Attached are the ASM dump of the Ack() function as outlined elsewhere in...

michal.minich gmail.com (11/152) Jun 09 2006 High level functions like ENTER LEVAVE are relic of old effort of trying...
Andrei Khropov (89/90) Jun 09 2006 It is more efficient compared to DMC only.

Gregor Richards (7/7) Jun 09 2006 $ gdmd ack.d -O -release -inline -ofack-gdc

Dave (21/28) Jun 09 2006 I get the following. If dmd is reverted back to v0.139, it does much

"Akhropov" <andkhropov nospam_mtu-net.ru> writes:

Looking through tests on http://shootout.alioth.debian.org/ website I noticed
that the only tests where DMD is inferior to GCC are where recursion is
seriously involved (binary trees and recursive). My own experiments with
ackermann function confirmed that:

C++:
---------------------------------------------------------------
#include <iostream>

#include <windows.h> // for GetTickCount


int Ack(int a, int b)
{
    if (a == 0)
        return b + 1;
    else if (b == 0)
        return Ack(a - 1, 1);
    else
        return Ack(a - 1, Ack(a, b - 1));
}

int main()
{
    DWORD start = GetTickCount();
    std::cout << Ack(3, 11) << std::endl;
    DWORD stop = GetTickCount();
    std::cout << stop-start << " ms elapsed.\n";
    
    return 0;
}
-------------------------------------------------------------

D:
-------------------------------------------------------------

private import std.perf, std.stdio;

int Ack(int a, int b)
{
    if (a == 0)
        return b + 1;
    else if (b == 0)
        return Ack(a - 1, 1);
    else
        return Ack(a - 1, Ack(a, b - 1));
}

void main()
{
    HighPerformanceCounter t = new HighPerformanceCounter();
    t.start();
    Ack(3, 11);
    t.stop();
    writefln("%d ms elapsed.", t.milliseconds());
}
---------------------------------------------------------------

I tried MSVC++ 8, MinGW GCC 3.4.2 and DMD 0.157

cl ack.cpp /Ox /Feack-cpp.exe
g++ ack.cpp -O99 -oack-gpp.exe
dmd ack.d -O -release -inline -ofack-d.exe


Here are the results (averaged over 10 times):

1) MS C++ : 830.5 ms
3) GCC    : 990.4 ms
5) DMD    : 2247.3 ms

Am I right? What could be done with this?

-- 
AKhropov

Jun 09 2006

James Pelcis <jpelcis gmail.com> writes:

You seem to be right.  Just to make a fairer test, I made both versions 
use the same timing code and did a DMC vs. DMD.  I'm not the best of 
coders though, so it might not be perfect.

------------------------------------------
C++:
------------------------------------------
#include <iostream.h>

int Ack(int a, int b)
{
     if (a == 0)
         return b + 1;
     else if (b == 0)
         return Ack(a - 1, 1);
     else
         return Ack(a - 1, Ack(a, b - 1));
}

long long getCount()
{
     asm
     {
         rdtsc;
         ret;
     }
}

int main()
{
     long long start = getCount();
     Ack(3, 11);
     long long stop = getCount();
     cout << stop-start << " clock cycles elapsed.\n";

     return 0;
}

------------------------------------------
D:
------------------------------------------
private import std.stdio;

int Ack(int a, int b)
{
     if (a == 0)
         return b + 1;
     else if (b == 0)
         return Ack(a - 1, 1);
     else
         return Ack(a - 1, Ack(a, b - 1));
}

long getCount()
{
	asm
	{
		rdtsc;
		ret;
	}
}

void main()
{
     long start = getCount();
     Ack(3, 11);
     long stop = getCount();
     writefln("%d clock cycles elapsed.", stop - start);
}

------------------------------------------

My results (averaged and rounded) were 5,575,010,530 for C++ and 
8,339,840,041 for D.  Since it was with the same linker, optimizer, 
etc., this is a bad sign.  Is the problem in the code or the compiler?

Jun 09 2006

sailormoontw <sailormoontw_member pathlink.com> writes:

It takes twice the time for D to do the same job...
but at least it's better than Ruby, Common Lisp, and Haskell
these 3 all failed...either stack overflow or halted.

Ruby:

def Ack(a,b)
return b + 1 if a == 0
return Ack(a-1, 1) if b == 0
Ack(a-1, Ack(a, b-1))
end 

Common Lisp:

(defun ack (a b)
(cond 
((equal a 0) (1+ b))
((equal b 0) (ack (1- a) 1))
(t (ack (1- a) (ack a (1- b))))))

Haskell:

ack :: Int -> Int -> Int
ack 0 b = b+1
ack a 0 = ack (a-1) 1
ack a b = ack (a-1) (ack a (b-1))

Jun 09 2006

pragma <pragma_member pathlink.com> writes:

In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...
My results (averaged and rounded) were 5,575,010,530 for C++ and 
8,339,840,041 for D.  Since it was with the same linker, optimizer, 
etc., this is a bad sign.  Is the problem in the code or the compiler?

I can confirm. Using your source code, comparing DMC to DMD, I got:

C:\home\pragma\src\test>ackcpp.exe
7982698959 clock cycles elapsed.

C:\home\pragma\src\test>ackd.exe
10881160506 clock cycles elapsed.

Now on a whim, I added "extern(C)" to the definition of Ack() like so:

 extern(C) int Ack(int a, int b){ /* ... */ }

And got this:

C:\home\pragma\src\test>ackcpp.exe
7371540535 clock cycles elapsed.

C:\home\pragma\src\test>ackd.exe
7057883217 clock cycles elapsed.

With that change D is still lagging, but by *significantly* less - so there's
another factor here.  So there is some kind of drawback to using the D callspec
with recursion.  My guess is that exploring the ASM using obj2asm would help
show why.

- EricAnderton at yahoo

Jun 09 2006

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

pragma wrote:
 Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 
 extern(C) int Ack(int a, int b){ /* ... */ }

 
 And got this:
 
 C:\home\pragma\src\test>ackcpp.exe
 7371540535 clock cycles elapsed.
 
 C:\home\pragma\src\test>ackd.exe
 7057883217 clock cycles elapsed.
 
 With that change D is still lagging, but by *significantly* less - so there's
 another factor here.

Looking at your numbers, D was actually precisely 313657318 clock cycles
_faster_.

Jun 09 2006

pragma <pragma_member pathlink.com> writes:

In article <e6c6o2$qba$1 digitaldaemon.com>, Deewiant says...
pragma wrote:
 Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 
 extern(C) int Ack(int a, int b){ /* ... */ }

 
 And got this:
 
 C:\home\pragma\src\test>ackcpp.exe
 7371540535 clock cycles elapsed.
 
 C:\home\pragma\src\test>ackd.exe
 7057883217 clock cycles elapsed.
 
 With that change D is still lagging, but by *significantly* less - so there's
 another factor here.

Looking at your numbers, D was actually precisely 313657318 clock cycles
_faster_.

Oh, my bad.  Thanks for catching that. :)

- EricAnderton at yahoo

Jun 09 2006

"AKhropov" <andkhropov nospam_mtu-net.ru> writes:

pragma wrote:

 In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...
 My results (averaged and rounded) were 5,575,010,530 for C++ and 
 8,339,840,041 for D.  Since it was with the same linker, optimizer, 
 etc., this is a bad sign.  Is the problem in the code or the compiler?

 
 I can confirm. Using your source code, comparing DMC to DMD, I got:
 
 C:\home\pragma\src\test>ackcpp.exe
 7982698959 clock cycles elapsed.
 
 C:\home\pragma\src\test>ackd.exe
 10881160506 clock cycles elapsed.
 
 Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 
 extern(C) int Ack(int a, int b){ /* ... */ }

 
 And got this:
 
 C:\home\pragma\src\test>ackcpp.exe
 7371540535 clock cycles elapsed.
 
 C:\home\pragma\src\test>ackd.exe
 7057883217 clock cycles elapsed.

Did your turn maximum optimization on?

Here are my results with James' code:

--- compiler flags: -----
dmc akk2.cpp -o -oakk2-cpp.exe
dmd akk2.d -O -release -inline -ofakk2-d.exe
-------------------------

1) both variants without "extern (C)"
 
C++: 4379147218 clock cycles elapsed.
D  : 3669672891 clock cycles elapsed.

2) both variants with "extern (C)" (well, "extern "C"" in C++ case)

C++: 4414772278 clock cycles elapsed.
D  : 4394787775 clock cycles elapsed.

So, turning on "extern (C)" actually made it slower and approximately the same
speed as C++.


So, I think the problem is in the back-end (which is the same for DMD and DMC
AFAIK ).

It would be interesting to hear from guys using GDC with GCC back-end.

-- 
AKhropov

Jun 09 2006

pragma <pragma_member pathlink.com> writes:

In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...
Did your turn maximum optimization on?

As a matter of fact, I did not.  Sure enough, with "-O -inline release"
specified, the enter/leave instructions dissapear and are replaced with their
push/pop/mov counterparts (see attached).  :)

- Eric

SUB_L00402010:
push	ebp
mov	ebp,esp
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402022
mov	eax,[ebp+0Ch]
inc	eax
pop	ebp
retn
;------------------------------------------------------------------------------
L00402022:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L00402039
push	00000001h
mov	eax,[ebp+08h]
dec	eax
push	eax
call	SUB_L00402010
add	esp,00000008h
pop	ebp
retn
;------------------------------------------------------------------------------
L00402039:
mov	eax,[ebp+0Ch]
dec	eax
push	eax
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	eax,[ebp+08h]
dec	eax
push	eax
call	SUB_L00402010
add	esp,00000008h
pop	ebp
retn

Jun 09 2006

Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:

pragma wrote:
 In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...
 Did your turn maximum optimization on?

 
 As a matter of fact, I did not.  Sure enough, with "-O -inline release"
 specified, the enter/leave instructions dissapear and are replaced with their
 push/pop/mov counterparts (see attached).  :)
 
 - Eric
 
 SUB_L00402010:
 push	ebp
 mov	ebp,esp
 nop
 nop
 nop
 cmp	dword ptr [ebp+08h],00000000h
 jnz	L00402022
 mov	eax,[ebp+0Ch]
 inc	eax
 pop	ebp
 retn
 ;------------------------------------------------------------------------------
 L00402022:
 cmp	dword ptr [ebp+0Ch],00000000h
 jnz	L00402039
 push	00000001h
 mov	eax,[ebp+08h]
 dec	eax
 push	eax
 call	SUB_L00402010
 add	esp,00000008h
 pop	ebp
 retn
 ;------------------------------------------------------------------------------
 L00402039:
 mov	eax,[ebp+0Ch]
 dec	eax
 push	eax
 push	[ebp+08h]
 call	SUB_L00402010
 add	esp,00000008h
 push	eax
 mov	eax,[ebp+08h]
 dec	eax
 push	eax
 call	SUB_L00402010
 add	esp,00000008h
 pop	ebp
 retn
 
 

What are those NOPs for, if I may ask?

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Jun 11 2006

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message 
news:e6i5n5$kok$2 digitaldaemon.com...

 What are those NOPs for, if I may ask?

Usually for alignment, so that the instructions pair better in the pipeline, 
though three nops in a row seems kind of odd, unless that's some kind of 
technique I've never seen before.

Jun 11 2006

James Dunne <james.jdunne gmail.com> writes:

Jarrett Billingsley wrote:
 "Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message 
 news:e6i5n5$kok$2 digitaldaemon.com...
 
 
What are those NOPs for, if I may ask?

 
 
 Usually for alignment, so that the instructions pair better in the pipeline, 
 though three nops in a row seems kind of odd, unless that's some kind of 
 technique I've never seen before. 
 
 

A MOV reg, reg instruction is encoded as a single byte, so to pad it 
gets three NOPs.  Correct me if mistaken!

-- 
Regards,
James Dunne

Jun 11 2006

pragma <pragma_member pathlink.com> writes:

In article <e6i5n5$kok$2 digitaldaemon.com>, Bruno Medeiros says...
What are those NOPs for, if I may ask?

I inserted them manually into the routine.  They're there so I could easily find
the routine in the dump. ;)

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Jun 12 2006

pragma <pragma_member pathlink.com> writes:

Attached are the ASM dump of the Ack() function as outlined elsewhere in this
thread.  

I was actually suprised that the extern(C) D version of the Ack() function was
byte-for-byte identical to the plain C/CPP version of the same code.  The plain
extern(D) version was the only run that created a different routine, and hence,
the wildly different timings.

To my eye, the codegen in D is not only shorter, but more efficent.  The only
thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for
maintaining up the function's stack frame.  While I have no solid evidence, I'd
be willing to bet that these two instructions are what's holding things up.

My timings are on an Intel chip.  Anyone out there with an AMD machine willing
to try this on?

- Eric

// dissasembler listing for ackd.exe 
// int Ack(int a, int b)

SUB_L00402010:
enter	0004h,00h
mov	[ebp-04h],eax
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402025
inc	eax
leave
retn	0004h
;------------------------------------------------------------------------------
L00402025:
cmp	dword ptr [ebp-04h],00000000h
jnz	L0040203E
mov	eax,[ebp+08h]
dec	eax
push	eax
mov	eax,00000001h
call	SUB_L00402010
leave
retn	0004h
;------------------------------------------------------------------------------
L0040203E:
mov	ecx,[ebp+08h]
lea	edx,[ecx-01h]
push	edx
push	ecx
mov	eax,[ebp-04h]
dec	eax
call	SUB_L00402010
call	SUB_L00402010
leave
retn	0004h

// dissasembler listing for ackd.exe 
// extern(C) int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

// dissasembler listing for ackcpp.exe 
// int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

Jun 09 2006

michal.minich gmail.com writes:

High level functions like ENTER LEVAVE are relic of old effort of trying to make
machine code / asm human programmable. In fact, the current processors are going
the opposite way, divigin instruction into micro-inctructions. There is better
alternate way by using push / mov / pop functions instead; and seems the
companies are not trying to optimize this high level instructions anymore. You
can see for yourself at:

http://webster.cs.ucr.edu/AoA/DOS/ch12/CH12-3.html#HEADING3-73
http://www.goof.com/pcg/doc/pentium-optxref.html
http://www.google.com/search?hl=en&lr=&q=enter+leave+push+pop+mov+instruction+cycles+pentium+intel+amd&btnG=Search

-M




In article <e6cauc$10g3$1 digitaldaemon.com>, pragma says...
Attached are the ASM dump of the Ack() function as outlined elsewhere in this
thread.  

I was actually suprised that the extern(C) D version of the Ack() function was
byte-for-byte identical to the plain C/CPP version of the same code.  The plain
extern(D) version was the only run that created a different routine, and hence,
the wildly different timings.

To my eye, the codegen in D is not only shorter, but more efficent.  The only
thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for
maintaining up the function's stack frame.  While I have no solid evidence, I'd
be willing to bet that these two instructions are what's holding things up.

My timings are on an Intel chip.  Anyone out there with an AMD machine willing
to try this on?

- Eric

// dissasembler listing for ackd.exe 
// int Ack(int a, int b)

SUB_L00402010:
enter	0004h,00h
mov	[ebp-04h],eax
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402025
inc	eax
leave
retn	0004h
;------------------------------------------------------------------------------
L00402025:
cmp	dword ptr [ebp-04h],00000000h
jnz	L0040203E
mov	eax,[ebp+08h]
dec	eax
push	eax
mov	eax,00000001h
call	SUB_L00402010
leave
retn	0004h
;------------------------------------------------------------------------------
L0040203E:
mov	ecx,[ebp+08h]
lea	edx,[ecx-01h]
push	edx
push	ecx
mov	eax,[ebp-04h]
dec	eax
call	SUB_L00402010
call	SUB_L00402010
leave
retn	0004h

// dissasembler listing for ackd.exe 
// extern(C) int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

// dissasembler listing for ackcpp.exe 
// int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

Jun 09 2006

"Andrei Khropov" <andkhropov nospam_mtu-net.ru> writes:

pragma wrote:


 To my eye, the codegen in D is not only shorter, but more efficent. 

It is more efficient compared to DMC only.

My point was that DMD codegen is substantially less efficient _in this
particular case_ than widespread C++ compilers like MSVC++ or GCC while being
on a par with them (or superior) in the other cases. It's even slower than MS


Here is the assembler output (ack and main) for my original program (output was
removed from the original program to make the listing clearer) in GCC 3.4.2 (in
MinGW)
 :

----------------------------------------------------------
__Z3Ackii:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	subl	$8, %esp
	movl	8(%ebp), %edx
	movl	12(%ebp), %eax
	.p2align 4,,15
L12:
	testl	%edx, %edx
	je	L7
L14:
	testl	%eax, %eax
	jne	L4
	decl	%edx
	movl	$1, %eax
	testl	%edx, %edx
	jne	L14
L7:
	addl	$8, %esp
	incl	%eax
	popl	%ebx
	popl	%ebp
	ret
	.p2align 4,,7
L4:
	movl	%edx, (%esp)
	leal	-1(%edx), %ebx
	decl	%eax
	movl	%eax, 4(%esp)
	call	__Z3Akkii
	movl	%ebx, %edx
	jmp	L12
	.def	___main;	.scl	2;	.type	32;	.endef
	.section .rdata,"dr"
LC0:
	.ascii " ms elapsed.\12\0"
	.text
	.align 2
	.p2align 4,,15
.globl _main
	.def	_main;	.scl	2;	.type	32;	.endef
_main:
	pushl	%ebp
	movl	$16, %eax
	movl	%esp, %ebp
	pushl	%ebx
	subl	$20, %esp
	andl	$-16, %esp
	call	__alloca
	call	___main
	call	_GetTickCount 0
	movl	$3, (%esp)
	movl	$10, %ecx
	movl	%eax, %ebx
	movl	%ecx, 4(%esp)
	call	__Z3Ackii
	movl	%eax, 4(%esp)
	movl	$2, (%esp)
	call	__Z3Ackii
	call	_GetTickCount 0
	movl	$__ZSt4cout, (%esp)
	subl	%ebx, %eax
	movl	%eax, 4(%esp)
	call	__ZNSolsEm
	movl	%eax, (%esp)
	movl	$LC0, %edx
	movl	%edx, 4(%esp)
	call	__ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
	movl	-4(%ebp), %ebx
	xorl	%eax, %eax
	leave
	ret 
----------------------------------------------------------

It seems like GCC was clever enough to somehow split the calculation in two
calls.

-- 
AKhropov

Jun 09 2006

Gregor Richards <Richards codu.org> writes:

$ gdmd ack.d -O -release -inline -ofack-gdc
$ ./ack-gdc
703 ms elapsed.
$ dmd ack.d -O -release -inline -ofack-dmd
gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm
$ ./ack-dmd
2373 ms elapsed.

Jun 09 2006

Dave <Dave_member pathlink.com> writes:

Gregor Richards wrote:
 $ gdmd ack.d -O -release -inline -ofack-gdc
 $ ./ack-gdc
 703 ms elapsed.
 $ dmd ack.d -O -release -inline -ofack-dmd
 gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm
 $ ./ack-dmd
 2373 ms elapsed.

I get the following. If dmd is reverted back to v0.139, it does much 
better, for whatever that is worth.

GDC v0.17 w/ 4.01 backend:



real    0m0.828s
user    0m0.824s
sys     0m0.000s

DMD v0.160:



real    0m2.628s
user    0m2.572s
sys     0m0.000s

DMD v0.139:



real    0m1.131s
user    0m1.124s
sys     0m0.004s

Jun 09 2006

D Programming

C/C++ Programming

Other

digitalmars.D - It seems like DMD doesn't optimize tail recursion very well