www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - dmd simple loop disassembly - redundant instruction?

reply "Ivan Kazmenko" <gassa mail.ru> writes:
Hello,

I am studying the difference between x86 generated code of DMD 
and C/C++ compilers on Windows (simply put: why exactly, and by 
what margin, DMD-compiled D code is often slower than 
GCC-compiled C/C++ equivalent).

Now, I have this simple D program:

-----
immutable int MAX_N = 1_000_000;
void main () {
     int [MAX_N] a;
     foreach (i; 0..MAX_N)
         a[i] = i;
}
-----

(I know there's iota in std.range, and it turns out to be even 
slower - but that's a high level function, and I'm trying to 
understand the lower-level details now.)

The assembly (dmd -O -release -inline -noboundscheck, then 
obj2asm) has the following piece corresponding to the cycle:

-----
L2C:		mov	-03D0900h[EDX*4][EBP],EDX
		mov	ECX,EDX
		inc	EDX
		cmp	EDX,0F4240h
		jb	L2C
-----

Now, I am not exactly fluent in assembler, but the "mov ECX, EDX" 
seems unnecessary.  The ECX register is explicitly used three 
times in the whole program, and it looks like this instruction 
can at least be moved out of the loop, if not removed completely. 
  Is it indeed a bug, or there's some reason here?  And if the 
former, where do I report it - at http://d.puremagic.com/issues/, 
as with the front-end?

I didn't try GDC or LDC since I didn't find a clear instruction 
for using them under Win32.  If there is one, please kindly point 
me to it.  I found a few explanations for GDC, but had a hard 
time trying to figure out which is the most current one.

Note that the C++ version does the same with four instructions 
instead of five, as D version is expected to be if we remove the 
instruction in question.  Indeed, it goes like (code inside the 
loop):

-----
L3:
	movl	%eax, _a(,%eax,4)
	addl	$1, %eax
	cmpl	$1000000, %eax
	jne	L3
-----

The full assembly listings, and the source codes (D and C++), are 
here:
http://acm.math.spbu.ru/~gassa/dlang/simple_loop/

I've tried a few other versions as well.  Changing the loop to an 
explicit "for (int i = 0; i < MAX_N; i++)" (a2.d) does not affect 
the generated assembly.  Making the array dynamic (a3.d) leads to 
five instructions, all seemingly important.  A __gshared static 
array (a4.d) gives the same seemingly unneeded instruction but 
with EAX instead of ECX:

-----
L2:		mov	_D2a41aG1000000i[EDX*4],EDX
		mov	EAX,EDX
		inc	EDX
		cmp	EDX,0F4240h
		jb	L2
-----

Ivan Kazmenko.
Dec 25 2013
next sibling parent reply "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 25 December 2013 at 12:03:08 UTC, Ivan Kazmenko 
wrote:
 Now, I am not exactly fluent in assembler, but the "mov ECX, 
 EDX" seems unnecessary.  The ECX register is explicitly used 
 three times in the whole program, and it looks like this 
 instruction can at least be moved out of the loop, if not 
 removed completely.
  Is it indeed a bug, or there's some reason here?  And if the 
 former, where do I report it - at 
 http://d.puremagic.com/issues/, as with the front-end?
Did you try something like: for(immutable i; 0..MAX_N) a[i] = i; too? One thing to note is that, technically, i is a _copy_ of the iterated number. So things like for(i; 0..5) i++; have no effect (it will loop 5 times regardless). Indeed, in your case, this could be optimized out, but in general the extra instruction is technically correct. I don't know if making i immutable would change things, but it might give the compiler enough of a hint to do the correct optimization here.
Dec 25 2013
parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Wednesday, 25 December 2013 at 12:43:05 UTC, Chris Cain wrote:
 Did you try something like:

 for(immutable i; 0..MAX_N)
     a[i] = i;

 too? One thing to note is that, technically, i is a _copy_ of 
 the iterated number. So things like

 for(i; 0..5)
    i++;

 have no effect (it will loop 5 times regardless). Indeed, in 
 your case, this could be optimized out, but in general the 
 extra instruction is technically correct. I don't know if 
 making i immutable would change things, but it might give the 
 compiler enough of a hint to do the correct optimization here.
Thanks, that sounded reasonable. Still, in this particular case, the generated assembly remained the same.
Dec 25 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Ivan Kazmenko:

 I am studying the difference between x86 generated code of DMD 
 and C/C++ compilers on Windows (simply put: why exactly, and by 
 what margin, DMD-compiled D code is often slower than 
 GCC-compiled C/C++ equivalent).

 Now, I have this simple D program:

 -----
 immutable int MAX_N = 1_000_000;
 void main () {
     int [MAX_N] a;
     foreach (i; 0..MAX_N)
         a[i] = i;
 }
 -----

 (I know there's iota in std.range, and it turns out to be even 
 slower - but that's a high level function, and I'm trying to 
 understand the lower-level details now.)

 The assembly (dmd -O -release -inline -noboundscheck, then 
 obj2asm) has the following piece corresponding to the cycle:

 -----
 L2C:		mov	-03D0900h[EDX*4][EBP],EDX
 		mov	ECX,EDX
 		inc	EDX
 		cmp	EDX,0F4240h
 		jb	L2C
 -----
ldc2 optimizes the useless loop away: __Dmain: xorl %eax, %eax ret If I modify the code returning some value from the int main: return a[7]; ldc2 gives the loop code: LBB0_1: movl %eax, 12(%esp,%eax,4) incl %eax cmpl $1000000, %eax jne LBB0_1 If I use iota ldc2 copiles the loop to exactly the same asm: foreach (i; MAX_N.iota) Bye, bearophile
Dec 25 2013
parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Wednesday, 25 December 2013 at 14:51:11 UTC, bearophile wrote:
 ldc2 optimizes the useless loop away:

 __Dmain:
 	xorl	%eax, %eax
 	ret

 If I modify the code returning some value from the int main:
     return a[7];

 ldc2 gives the loop code:

 LBB0_1:
 	movl	%eax, 12(%esp,%eax,4)
 	incl	%eax
 	cmpl	$1000000, %eax
 	jne	LBB0_1

 If I use iota ldc2 copiles the loop to exactly the same asm:

 foreach (i; MAX_N.iota)
Glad to know that! But what about DMD? Anyone?.. If someone with better knowledge in assembly confirms the instruction is unnecessary, I'll file a bug report (at http://d.puremagic.com/issues/ I presume). Ivan Kazmenko.
Dec 25 2013
prev sibling parent reply Lionello Lunesu <lionello lunesu.remove.com> writes:
On 12/25/13, 20:03, Ivan Kazmenko wrote:
 -----
 L2C:        mov    -03D0900h[EDX*4][EBP],EDX
          mov    ECX,EDX
          inc    EDX
          cmp    EDX,0F4240h
          jb    L2C
 -----
You should have said that all instructions are redundant :) Looks like the array got optimized out, but then the optimizer stopped. The ECX likely refers to the 'i' loop variable. When the array write code got optimized out, the compile could have figured out that 'i' was in turn unused as well and remove it too. And then, the foreach, etc... You can file backend bugs on the same site.
Dec 26 2013
parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Thursday, 26 December 2013 at 08:08:09 UTC, Lionello Lunesu 
wrote:
 You should have said that all instructions are redundant :) 
 Looks like the array got optimized out, but then the optimizer 
 stopped. The ECX likely refers to the 'i' loop variable. When 
 the array write code got optimized out, the compile could have 
 figured out that 'i' was in turn unused as well and remove it 
 too. And then, the foreach, etc...
I added the "return a[7];" part from bearophile's suggestion, but that did not change anything in the loop code. So, my guess is that the array does not get optimized out at all.
 You can file backend bugs on the same site.
Thanks, issue created! https://d.puremagic.com/issues/show_bug.cgi?id=11821 Ivan Kazmenko.
Dec 26 2013