www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - It seems like DMD doesn't optimize tail recursion very well

reply "Akhropov" <andkhropov nospam_mtu-net.ru> writes:
Looking through tests on http://shootout.alioth.debian.org/ website I noticed
that the only tests where DMD is inferior to GCC are where recursion is
seriously involved (binary trees and recursive). My own experiments with
ackermann function confirmed that:

C++:
---------------------------------------------------------------
#include <iostream>

#include <windows.h> // for GetTickCount


int Ack(int a, int b)
{
    if (a == 0)
        return b + 1;
    else if (b == 0)
        return Ack(a - 1, 1);
    else
        return Ack(a - 1, Ack(a, b - 1));
}

int main()
{
    DWORD start = GetTickCount();
    std::cout << Ack(3, 11) << std::endl;
    DWORD stop = GetTickCount();
    std::cout << stop-start << " ms elapsed.\n";
    
    return 0;
}
-------------------------------------------------------------

D:
-------------------------------------------------------------

private import std.perf, std.stdio;

int Ack(int a, int b)
{
    if (a == 0)
        return b + 1;
    else if (b == 0)
        return Ack(a - 1, 1);
    else
        return Ack(a - 1, Ack(a, b - 1));
}

void main()
{
    HighPerformanceCounter t = new HighPerformanceCounter();
    t.start();
    Ack(3, 11);
    t.stop();
    writefln("%d ms elapsed.", t.milliseconds());
}
---------------------------------------------------------------

I tried MSVC++ 8, MinGW GCC 3.4.2 and DMD 0.157

cl ack.cpp /Ox /Feack-cpp.exe
g++ ack.cpp -O99 -oack-gpp.exe
dmd ack.d -O -release -inline -ofack-d.exe


Here are the results (averaged over 10 times):

1) MS C++ : 830.5 ms
3) GCC    : 990.4 ms
5) DMD    : 2247.3 ms

Am I right? What could be done with this?

-- 
AKhropov
Jun 09 2006
next sibling parent reply James Pelcis <jpelcis gmail.com> writes:
You seem to be right.  Just to make a fairer test, I made both versions 
use the same timing code and did a DMC vs. DMD.  I'm not the best of 
coders though, so it might not be perfect.

------------------------------------------
C++:
------------------------------------------
#include <iostream.h>

int Ack(int a, int b)
{
     if (a == 0)
         return b + 1;
     else if (b == 0)
         return Ack(a - 1, 1);
     else
         return Ack(a - 1, Ack(a, b - 1));
}

long long getCount()
{
     asm
     {
         rdtsc;
         ret;
     }
}

int main()
{
     long long start = getCount();
     Ack(3, 11);
     long long stop = getCount();
     cout << stop-start << " clock cycles elapsed.\n";

     return 0;
}

------------------------------------------
D:
------------------------------------------
private import std.stdio;

int Ack(int a, int b)
{
     if (a == 0)
         return b + 1;
     else if (b == 0)
         return Ack(a - 1, 1);
     else
         return Ack(a - 1, Ack(a, b - 1));
}

long getCount()
{
	asm
	{
		rdtsc;
		ret;
	}
}

void main()
{
     long start = getCount();
     Ack(3, 11);
     long stop = getCount();
     writefln("%d clock cycles elapsed.", stop - start);
}

------------------------------------------

My results (averaged and rounded) were 5,575,010,530 for C++ and 
8,339,840,041 for D.  Since it was with the same linker, optimizer, 
etc., this is a bad sign.  Is the problem in the code or the compiler?
Jun 09 2006
next sibling parent sailormoontw <sailormoontw_member pathlink.com> writes:
It takes twice the time for D to do the same job...
but at least it's better than Ruby, Common Lisp, and Haskell
these 3 all failed...either stack overflow or halted.

Ruby:

def Ack(a,b)
return b + 1 if a == 0
return Ack(a-1, 1) if b == 0
Ack(a-1, Ack(a, b-1))
end 

Common Lisp:

(defun ack (a b)
(cond 
((equal a 0) (1+ b))
((equal b 0) (ack (1- a) 1))
(t (ack (1- a) (ack a (1- b))))))

Haskell:

ack :: Int -> Int -> Int
ack 0 b = b+1
ack a 0 = ack (a-1) 1
ack a b = ack (a-1) (ack a (b-1))
Jun 09 2006
prev sibling parent reply pragma <pragma_member pathlink.com> writes:
In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...
My results (averaged and rounded) were 5,575,010,530 for C++ and 
8,339,840,041 for D.  Since it was with the same linker, optimizer, 
etc., this is a bad sign.  Is the problem in the code or the compiler?

I can confirm. Using your source code, comparing DMC to DMD, I got: C:\home\pragma\src\test>ackcpp.exe 7982698959 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 10881160506 clock cycles elapsed. Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 extern(C) int Ack(int a, int b){ /* ... */ }

And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here. So there is some kind of drawback to using the D callspec with recursion. My guess is that exploring the ASM using obj2asm would help show why. - EricAnderton at yahoo
Jun 09 2006
next sibling parent reply Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
pragma wrote:
 Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 
 extern(C) int Ack(int a, int b){ /* ... */ }

And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here.

Looking at your numbers, D was actually precisely 313657318 clock cycles _faster_.
Jun 09 2006
parent pragma <pragma_member pathlink.com> writes:
In article <e6c6o2$qba$1 digitaldaemon.com>, Deewiant says...
pragma wrote:
 Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 
 extern(C) int Ack(int a, int b){ /* ... */ }

And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here.

Looking at your numbers, D was actually precisely 313657318 clock cycles _faster_.

Oh, my bad. Thanks for catching that. :) - EricAnderton at yahoo
Jun 09 2006
prev sibling parent reply "AKhropov" <andkhropov nospam_mtu-net.ru> writes:
pragma wrote:

 In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...
 My results (averaged and rounded) were 5,575,010,530 for C++ and 
 8,339,840,041 for D.  Since it was with the same linker, optimizer, 
 etc., this is a bad sign.  Is the problem in the code or the compiler?

I can confirm. Using your source code, comparing DMC to DMD, I got: C:\home\pragma\src\test>ackcpp.exe 7982698959 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 10881160506 clock cycles elapsed. Now on a whim, I added "extern(C)" to the definition of Ack() like so:
 extern(C) int Ack(int a, int b){ /* ... */ }

And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed.

Did your turn maximum optimization on? Here are my results with James' code: --- compiler flags: ----- dmc akk2.cpp -o -oakk2-cpp.exe dmd akk2.d -O -release -inline -ofakk2-d.exe ------------------------- 1) both variants without "extern (C)" C++: 4379147218 clock cycles elapsed. D : 3669672891 clock cycles elapsed. 2) both variants with "extern (C)" (well, "extern "C"" in C++ case) C++: 4414772278 clock cycles elapsed. D : 4394787775 clock cycles elapsed. So, turning on "extern (C)" actually made it slower and approximately the same speed as C++. So, I think the problem is in the back-end (which is the same for DMD and DMC AFAIK ). It would be interesting to hear from guys using GDC with GCC back-end. -- AKhropov
Jun 09 2006
parent reply pragma <pragma_member pathlink.com> writes:
In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...
Did your turn maximum optimization on?

As a matter of fact, I did not. Sure enough, with "-O -inline release" specified, the enter/leave instructions dissapear and are replaced with their push/pop/mov counterparts (see attached). :) - Eric SUB_L00402010: push ebp mov ebp,esp nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402022 mov eax,[ebp+0Ch] inc eax pop ebp retn ;------------------------------------------------------------------------------ L00402022: cmp dword ptr [ebp+0Ch],00000000h jnz L00402039 push 00000001h mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn ;------------------------------------------------------------------------------ L00402039: mov eax,[ebp+0Ch] dec eax push eax push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn
Jun 09 2006
parent reply Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
pragma wrote:
 In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...
 Did your turn maximum optimization on?

As a matter of fact, I did not. Sure enough, with "-O -inline release" specified, the enter/leave instructions dissapear and are replaced with their push/pop/mov counterparts (see attached). :) - Eric SUB_L00402010: push ebp mov ebp,esp nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402022 mov eax,[ebp+0Ch] inc eax pop ebp retn ;------------------------------------------------------------------------------ L00402022: cmp dword ptr [ebp+0Ch],00000000h jnz L00402039 push 00000001h mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn ;------------------------------------------------------------------------------ L00402039: mov eax,[ebp+0Ch] dec eax push eax push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn

What are those NOPs for, if I may ask? -- Bruno Medeiros - CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jun 11 2006
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message 
news:e6i5n5$kok$2 digitaldaemon.com...

 What are those NOPs for, if I may ask?

Usually for alignment, so that the instructions pair better in the pipeline, though three nops in a row seems kind of odd, unless that's some kind of technique I've never seen before.
Jun 11 2006
parent James Dunne <james.jdunne gmail.com> writes:
Jarrett Billingsley wrote:
 "Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message 
 news:e6i5n5$kok$2 digitaldaemon.com...
 
 
What are those NOPs for, if I may ask?

Usually for alignment, so that the instructions pair better in the pipeline, though three nops in a row seems kind of odd, unless that's some kind of technique I've never seen before.

A MOV reg, reg instruction is encoded as a single byte, so to pad it gets three NOPs. Correct me if mistaken! -- Regards, James Dunne
Jun 11 2006
prev sibling parent pragma <pragma_member pathlink.com> writes:
In article <e6i5n5$kok$2 digitaldaemon.com>, Bruno Medeiros says...
What are those NOPs for, if I may ask?

I inserted them manually into the routine. They're there so I could easily find the routine in the dump. ;)
-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Jun 12 2006
prev sibling next sibling parent reply pragma <pragma_member pathlink.com> writes:
Attached are the ASM dump of the Ack() function as outlined elsewhere in this
thread.  

I was actually suprised that the extern(C) D version of the Ack() function was
byte-for-byte identical to the plain C/CPP version of the same code.  The plain
extern(D) version was the only run that created a different routine, and hence,
the wildly different timings.

To my eye, the codegen in D is not only shorter, but more efficent.  The only
thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for
maintaining up the function's stack frame.  While I have no solid evidence, I'd
be willing to bet that these two instructions are what's holding things up.

My timings are on an Intel chip.  Anyone out there with an AMD machine willing
to try this on?

- Eric

// dissasembler listing for ackd.exe 
// int Ack(int a, int b)

SUB_L00402010:
enter	0004h,00h
mov	[ebp-04h],eax
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402025
inc	eax
leave
retn	0004h
;------------------------------------------------------------------------------
L00402025:
cmp	dword ptr [ebp-04h],00000000h
jnz	L0040203E
mov	eax,[ebp+08h]
dec	eax
push	eax
mov	eax,00000001h
call	SUB_L00402010
leave
retn	0004h
;------------------------------------------------------------------------------
L0040203E:
mov	ecx,[ebp+08h]
lea	edx,[ecx-01h]
push	edx
push	ecx
mov	eax,[ebp-04h]
dec	eax
call	SUB_L00402010
call	SUB_L00402010
leave
retn	0004h

// dissasembler listing for ackd.exe 
// extern(C) int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

// dissasembler listing for ackcpp.exe 
// int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
Jun 09 2006
next sibling parent michal.minich gmail.com writes:
High level functions like ENTER LEVAVE are relic of old effort of trying to make
machine code / asm human programmable. In fact, the current processors are going
the opposite way, divigin instruction into micro-inctructions. There is better
alternate way by using push / mov / pop functions instead; and seems the
companies are not trying to optimize this high level instructions anymore. You
can see for yourself at:

http://webster.cs.ucr.edu/AoA/DOS/ch12/CH12-3.html#HEADING3-73
http://www.goof.com/pcg/doc/pentium-optxref.html
http://www.google.com/search?hl=en&lr=&q=enter+leave+push+pop+mov+instruction+cycles+pentium+intel+amd&btnG=Search

-M




In article <e6cauc$10g3$1 digitaldaemon.com>, pragma says...
Attached are the ASM dump of the Ack() function as outlined elsewhere in this
thread.  

I was actually suprised that the extern(C) D version of the Ack() function was
byte-for-byte identical to the plain C/CPP version of the same code.  The plain
extern(D) version was the only run that created a different routine, and hence,
the wildly different timings.

To my eye, the codegen in D is not only shorter, but more efficent.  The only
thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for
maintaining up the function's stack frame.  While I have no solid evidence, I'd
be willing to bet that these two instructions are what's holding things up.

My timings are on an Intel chip.  Anyone out there with an AMD machine willing
to try this on?

- Eric

// dissasembler listing for ackd.exe 
// int Ack(int a, int b)

SUB_L00402010:
enter	0004h,00h
mov	[ebp-04h],eax
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402025
inc	eax
leave
retn	0004h
;------------------------------------------------------------------------------
L00402025:
cmp	dword ptr [ebp-04h],00000000h
jnz	L0040203E
mov	eax,[ebp+08h]
dec	eax
push	eax
mov	eax,00000001h
call	SUB_L00402010
leave
retn	0004h
;------------------------------------------------------------------------------
L0040203E:
mov	ecx,[ebp+08h]
lea	edx,[ecx-01h]
push	edx
push	ecx
mov	eax,[ebp-04h]
dec	eax
call	SUB_L00402010
call	SUB_L00402010
leave
retn	0004h

// dissasembler listing for ackd.exe 
// extern(C) int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

// dissasembler listing for ackcpp.exe 
// int Ack(int a, int b)

SUB_L00402010:
push	ebp
mov	ebp,esp
push	ebx
nop
nop
nop
cmp	dword ptr [ebp+08h],00000000h
jnz	L00402024
mov	eax,[ebp+0Ch]
inc	eax
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L00402024:
cmp	dword ptr [ebp+0Ch],00000000h
jnz	L0040203C
push	00000001h
mov	ecx,[ebp+08h]
dec	ecx
push	ecx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn
;------------------------------------------------------------------------------
L0040203C:
mov	edx,[ebp+0Ch]
dec	edx
push	edx
push	[ebp+08h]
call	SUB_L00402010
add	esp,00000008h
push	eax
mov	ebx,[ebp+08h]
dec	ebx
push	ebx
call	SUB_L00402010
add	esp,00000008h
pop	ebx
pop	ebp
retn

Jun 09 2006
prev sibling parent "Andrei Khropov" <andkhropov nospam_mtu-net.ru> writes:
pragma wrote:


 To my eye, the codegen in D is not only shorter, but more efficent. 

It is more efficient compared to DMC only. My point was that DMD codegen is substantially less efficient _in this particular case_ than widespread C++ compilers like MSVC++ or GCC while being on a par with them (or superior) in the other cases. It's even slower than MS C# compiler ( about 1500 ms vs 2200 ms for DMD ). Here is the assembler output (ack and main) for my original program (output was removed from the original program to make the listing clearer) in GCC 3.4.2 (in MinGW) : ---------------------------------------------------------- __Z3Ackii: pushl %ebp movl %esp, %ebp pushl %ebx subl $8, %esp movl 8(%ebp), %edx movl 12(%ebp), %eax .p2align 4,,15 L12: testl %edx, %edx je L7 L14: testl %eax, %eax jne L4 decl %edx movl $1, %eax testl %edx, %edx jne L14 L7: addl $8, %esp incl %eax popl %ebx popl %ebp ret .p2align 4,,7 L4: movl %edx, (%esp) leal -1(%edx), %ebx decl %eax movl %eax, 4(%esp) call __Z3Akkii movl %ebx, %edx jmp L12 .def ___main; .scl 2; .type 32; .endef .section .rdata,"dr" LC0: .ascii " ms elapsed.\12\0" .text .align 2 .p2align 4,,15 .globl _main .def _main; .scl 2; .type 32; .endef _main: pushl %ebp movl $16, %eax movl %esp, %ebp pushl %ebx subl $20, %esp andl $-16, %esp call __alloca call ___main call _GetTickCount 0 movl $3, (%esp) movl $10, %ecx movl %eax, %ebx movl %ecx, 4(%esp) call __Z3Ackii movl %eax, 4(%esp) movl $2, (%esp) call __Z3Ackii call _GetTickCount 0 movl $__ZSt4cout, (%esp) subl %ebx, %eax movl %eax, 4(%esp) call __ZNSolsEm movl %eax, (%esp) movl $LC0, %edx movl %edx, 4(%esp) call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc movl -4(%ebp), %ebx xorl %eax, %eax leave ret ---------------------------------------------------------- It seems like GCC was clever enough to somehow split the calculation in two calls. -- AKhropov
Jun 09 2006
prev sibling parent reply Gregor Richards <Richards codu.org> writes:
$ gdmd ack.d -O -release -inline -ofack-gdc
$ ./ack-gdc
703 ms elapsed.
$ dmd ack.d -O -release -inline -ofack-dmd
gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm
$ ./ack-dmd
2373 ms elapsed.
Jun 09 2006
parent Dave <Dave_member pathlink.com> writes:
Gregor Richards wrote:
 $ gdmd ack.d -O -release -inline -ofack-gdc
 $ ./ack-gdc
 703 ms elapsed.
 $ dmd ack.d -O -release -inline -ofack-dmd
 gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm
 $ ./ack-dmd
 2373 ms elapsed.

I get the following. If dmd is reverted back to v0.139, it does much better, for whatever that is worth. GDC v0.17 w/ 4.01 backend: # gdmd -O -inline -release t.d # time t real 0m0.828s user 0m0.824s sys 0m0.000s DMD v0.160: # dmd -O -inline -release t.d # time t real 0m2.628s user 0m2.572s sys 0m0.000s DMD v0.139: # dmd -O -inline -release t.d # time t real 0m1.131s user 0m1.124s sys 0m0.004s
Jun 09 2006