digitalmars.D.ldc - LDC 1.5-1.6 huge degradation of optimization

Igor Shirkalin <mathsoft inbox.ru> writes:

Hello!

I have found that LDC1.5-1.6 generate unoptimized code in 
contrast to LDC1.3-1.4 in some cases. I tried to extract the 
example and make it as short as possible. The goal is to get the 
compiled code with avx(2) instructions.

Here is the source of tst.d with comments to demonstrate the 
problem.

// tst.d

import ldc.attributes;

// ldc1.3-1.4 generate highly optimized code with avx2 instructions
// ldc1.5-1.6 generate the code without any vector instructions
// the command line: ldc2 tst.d -m32 -O3 -release -output-s

alias Arr = ubyte[16][20]; // 20 of 16-ubyte vectors

@target("avx2") @nogc pure
auto distance(ref const Arr t1, ref const Arr t2)
{
	int[20] res = void;
	int sum;
	foreach(t, ref r; res) {
		int sv=0;
		// the main cycle to be optimized with avx2 instructions
		foreach(i; 0 .. 16)
			sv += (t1[t][i]-t2[t][i])^^2;
		r = sv;
		// by uncommenting the following assignment the avx2
		// optimization is turned on in ldc 1.6
		// sum += sv;
	}
	return sum + res[10]; // return some dummy sum
}

/* ldc1.3 (avx2 instructions are used)
LBB0_1:
	vpmovzxbd	-8(%ecx), %ymm0
	vpmovzxbd	-8(%eax), %ymm1
	vpmovzxbd	(%eax), %ymm2
	addl	$16, %eax
	vpsubd	%ymm1, %ymm0, %ymm0
	vpmovzxbd	(%ecx), %ymm1
	addl	$16, %ecx
	vpmulld	%ymm0, %ymm0, %ymm0
	vpsubd	%ymm2, %ymm1, %ymm1
	vpmulld	%ymm1, %ymm1, %ymm1
	vpaddd	%ymm0, %ymm1, %ymm0
	vextracti128	$1, %ymm0, %xmm1
	vpaddd	%ymm1, %ymm0, %ymm0
	vpshufd	$78, %xmm0, %xmm1
	vpaddd	%ymm1, %ymm0, %ymm0
	vphaddd	%ymm0, %ymm0, %ymm0
	vmovd	%xmm0, (%esp,%edx,4)
	incl	%edx
	cmpl	$20, %edx
	jb	LBB0_1
*/

/* ldc1.6 (avx2 instructions aren't used)
LBB0_1:
	movl	%edx, (%esp)
	movzbl	-15(%ecx), %esi
	movzbl	-15(%eax), %edx
	movzbl	-14(%ecx), %edi
	subl	%edx, %esi
	movzbl	-14(%eax), %edx
	imull	%esi, %esi

	... ; skipped

	imull	%ebp, %ebp
	addl	%ebp, %esi
	movl	36(%esp), %ebp
	imull	%ebp, %ebp
	addl	%ebp, %esi
	movl	32(%esp), %ebp

	... ; skipped

	imull	%edx, %edx
	addl	%esi, %edx
	movl	(%esp), %esi
	movl	%edx, 48(%esp,%esi,4)
	movl	(%esp), %edx
	incl	%edx
	cmpl	$20, %edx
	jb	LBB0_1
*/

Nov 27 2017

kinke <noone nowhere.com> writes:

On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
 Hello!

 I have found that LDC1.5-1.6 generate unoptimized code in 
 contrast to LDC1.3-1.4 in some cases. I tried to extract the 
 example and make it as short as possible. The goal is to get 
 the compiled code with avx(2) instructions.

LDC 1.5 and 1.6 both come with LLVM 5.0.0, so it looks as if this 
is an LLVM regression.
This can be shown by compiling to unoptimized textual LLVM IR 
(that's the LLVM IR LDC generates, before LLVM optimizations) and 
comparing it across LDC versions. I did that for LDC 1.4 and 1.6, 
and the relevant IR is identical:

ldc2-1.4.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.4.ll
ldc2-1.6.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.6.ll
<compare files perf_1.4.ll and perf_1.6.ll>

Nov 27 2017

Igor Shirkalin <mathsoft inbox.ru> writes:

On Monday, 27 November 2017 at 13:21:04 UTC, kinke wrote:
 On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin 
 wrote:
 Hello!

 I have found that LDC1.5-1.6 generate unoptimized code in 
 contrast to LDC1.3-1.4 in some cases. I tried to extract the 
 example and make it as short as possible. The goal is to get 
 the compiled code with avx(2) instructions.

 LDC 1.5 and 1.6 both come with LLVM 5.0.0, so it looks as if 
 this is an LLVM regression.
 This can be shown by compiling to unoptimized textual LLVM IR 
 (that's the LLVM IR LDC generates, before LLVM optimizations) 
 and comparing it across LDC versions. I did that for LDC 1.4 
 and 1.6, and the relevant IR is identical:

 ldc2-1.4.0-win64-msvc\bin\ldc2 -release -output-ll perf.d 
 -of=perf_1.4.ll
 ldc2-1.6.0-win64-msvc\bin\ldc2 -release -output-ll perf.d 
 -of=perf_1.6.ll
 <compare files perf_1.4.ll and perf_1.6.ll>

It's now obvious that the cause of this regression is LLVM 5.0.0.
Does it mean it's not time to move to latest LLVM for the latest 
LDC?

Nov 30 2017

kinke <kinke libero.it> writes:

On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin 
wrote:
 It's now obvious the reason of this regression is LLVM 5.0.0
 Does it mean it's not time to move to latest LLVM for the 
 latest LDC?

5.0.0 *is* the latest released version. 5.0.1 is about to be 
released these days, but whether it'll fix this issue is 
uncertain. As is whether it's fixed in current LLVM master 
(6.0.0). LLVM is a huge piece of software, bugs and regressions 
are to be expected.

Nov 30 2017

Joakim <dlang joakim.fea.st> writes:

On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
 On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin 
 wrote:
 It's now obvious the reason of this regression is LLVM 5.0.0
 Does it mean it's not time to move to latest LLVM for the 
 latest LDC?

 5.0.0 *is* the latest released version. 5.0.1 is about to be 
 released these days, but whether it'll fix this issue is 
 uncertain. As is whether it's fixed in current LLVM master 
 (6.0.0). LLVM is a huge piece of software, bugs and regressions 
 are to be expected.

I too read it the way you did, that llvm needs to be updated 
forward, but I think he meant ldc should stick with 4.0.1 for now.

Nov 30 2017

Igor Shirkalin <mathsoft inbox.ru> writes:

On Thursday, 30 November 2017 at 22:58:10 UTC, Joakim wrote:
 On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
 On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin 
 wrote:
 It's now obvious the reason of this regression is LLVM 5.0.0
 Does it mean it's not time to move to latest LLVM for the 
 latest LDC?

 5.0.0 *is* the latest released version. 5.0.1 is about to be 
 released these days, but whether it'll fix this issue is 
 uncertain. As is whether it's fixed in current LLVM master 
 (6.0.0). LLVM is a huge piece of software, bugs and 
 regressions are to be expected.

 I too read it the way you did, that llvm needs to be updated 
 forward, but I think he meant ldc should stick with 4.0.1 for 
 now.

Right. Exactly what I meant. Sorry for my imperfect English.

Nov 30 2017

kinke <kinke libero.it> writes:

On Friday, 1 December 2017 at 04:30:52 UTC, Igor Shirkalin wrote:
 On Thursday, 30 November 2017 at 22:58:10 UTC, Joakim wrote:
 On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
 On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin 
 wrote:
 It's now obvious the reason of this regression is LLVM 5.0.0
 Does it mean it's not time to move to latest LLVM for the 
 latest LDC?

 5.0.0 *is* the latest released version. 5.0.1 is about to be 
 released these days, but whether it'll fix this issue is 
 uncertain. As is whether it's fixed in current LLVM master 
 (6.0.0). LLVM is a huge piece of software, bugs and 
 regressions are to be expected.

 I too read it the way you did, that llvm needs to be updated 
 forward, but I think he meant ldc should stick with 4.0.1 for 
 now.

 Right. Exactly what I meant. Excuse for some irrational English.

No worries. One performance regression is by no means enough to 
convince me to step back, especially since anyone is free to 
compile LDC themselves and use LLVM versions as old as 3.7 if they 
like.

Dec 01 2017

Johan Engelen <j j.nl> writes:

On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
 Hello!

 I have found that LDC1.5-1.6 generate unoptimized code in 
 contrast to LDC1.3-1.4 in some cases. I tried to extract the 
 example and make it as short as possible. The goal is to get 
 the compiled code with avx(2) instructions.

This will need a lot more investigation to figure out what is 
going wrong. It could be that the optimization pipeline set up by 
LDC needs to be adjusted for newer LLVM versions, or that extra 
annotations are needed.
Some notes:
- It's strange that adding the calculation of `sum` leads to an 
overall more optimized output with AVX2 instructions (good that 
you found out about that!).
- It would help if you find a C/C++ equivalent to show the 
problem to LLVM devs (gcc.godbolt.org has all relevant LLVM/Clang 
versions)
- The optimization is fragile also in LDC 1.4: manual unrolling 
of the inner loop somehow removes the AVX2 optimizations. 
https://godbolt.org/g/NF3eHf  Did I make a mistake?
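
For the C/C++ equivalent suggested above, a minimal sketch might look like the following. It is a plain port of tst.d and not the code attached to any bug report; the names `Arr`, `ROWS`, and `COLS` are assumptions introduced here. Compiling it with something like `clang -O3 -mavx2 -m32 -S dist.c` (file name hypothetical) under different Clang versions would be one way to check whether the same difference shows up outside LDC.

```c
#include <stdint.h>

enum { ROWS = 20, COLS = 16 };

/* Assumed C counterpart of `alias Arr = ubyte[16][20]` in tst.d. */
typedef struct { uint8_t v[ROWS][COLS]; } Arr;

/* Hypothetical port of the D function `distance`. */
int distance(const Arr *t1, const Arr *t2)
{
    int res[ROWS];   /* every element is written in the loop below */
    int sum = 0;     /* D default-initializes `int sum` to 0 */

    for (int t = 0; t < ROWS; ++t) {
        int sv = 0;
        /* inner loop that is expected to be vectorized */
        for (int i = 0; i < COLS; ++i) {
            int d = t1->v[t][i] - t2->v[t][i];
            sv += d * d;
        }
        res[t] = sv;
        /* sum += sv; */  /* the line that changed the result in LDC 1.6 */
    }
    return sum + res[10]; /* same dummy result as the D version */
}
```

The AVX2 target is left to the command line here instead of a function attribute, so the same file can also be built without `-mavx2` for comparison.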

-Johan

Nov 30 2017

Igor Shirkalin <mathsoft inbox.ru> writes:

On Thursday, 30 November 2017 at 08:56:31 UTC, Johan Engelen 
wrote:
 On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
 This will need a lot more investigation to figure out what is 
 going wrong. It could be that the optimization pipeline set up 
 by LDC needs to be adjusted for newer LLVM versions, or that 
 extra annotations are needed.

I'm almost sure that the problem is in new LLVM.

 Some notes:

 - It's strange that adding the calculation of `sum` leads to an 
 overall more optimized output with AVX2 instructions (good that 
 you found out about that!).

 - It would help if you find a C/C++ equivalent to show the 
 problem to LLVM devs (gcc.godbolt.org has all relevant 
 LLVM/Clang versions)

Yes, I have found it for clang: 
https://bugs.llvm.org/show_bug.cgi?id=35448

 - The optimization is fragile also in LDC 1.4: manual unrolling 
 of the inner loop somehow removes the AVX2 optimizations. 
 https://godbolt.org/g/NF3eHf  Did I make a mistake?

I've noticed that manual unrolling usually doesn't help to 
vectorize the code.


 -Johan

Nov 30 2017
