
digitalmars.D.ldc - Disappointing math performance compared to GDC

reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hello,

I have a machine learning library and I'm porting it from C++ to 
D right now.

There is a number crunching benchmark in it that does simple 
gradient descent learning on a small multilayer perceptron neural 
network. The core of the benchmark is a set of loops doing 
basic computations on numbers in float[] arrays (add, mul, exp, 
abs).

The reference is the C++ version compiled with Clang: 0.044 secs.

D results:

DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
LDC2 0.14 -O3 -release                         : 0.051 secs
GDC 4.9 -O3 -release                           : 0.031 secs

I think my benchmark code would hugely benefit from auto 
vectorization, so that might be the cause of the above results. 
I've found some vectorization compiler options for ldc2, but they 
seem to have no effect on performance whatsoever.
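For context, the hot loops are of this general shape (a hypothetical sketch, not the actual benchmark code; the name and signature are made up):

```d
// Hypothetical sketch of the kind of hot loop involved: elementwise
// float[] math that an auto-vectorizer could turn into packed SSE code.
float dot(const(float)[] a, const(float)[] b) nothrow
{
    float sum = 0.0f;
    foreach (i; 0 .. a.length)
        sum += a[i] * b[i];
    return sum;
}

void main()
{
    // 1*4 + 2*5 + 3*6 = 32, exact in single precision
    assert(dot([1.0f, 2.0f, 3.0f], [4.0f, 5.0f, 6.0f]) == 32.0f);
}
```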

Any suggestions?
Oct 08 2014
next sibling parent reply "Trass3r" <un known.com> writes:
Try with '-O3 -release -vectorize-slp-aggressive -g 
-pass-remarks-analysis="loop-vectorize|loop-unroll" 
-pass-remarks=loop-unroll'

Note that the D situation is a mess in general (correct me if I'm 
wrong):
* Never ever use std.math as you will get the insane 80-bit 
functions.
* core.math has some hacks to use llvm builtins but also mostly 
uses type real.
* core.stdc.math supports all types but uses suffixes and maps to 
C functions.
* core.stdc.tgmath gets rid of the suffixes at least. Best way 
imo to write code if you disregard auto-vectorization.
* you can also use ldc.intrinsics to kill portability. Hello C++.

And there's no fast-math yet:
https://github.com/ldc-developers/ldc/issues/722
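To illustrate the core.stdc.tgmath point, a small sketch (my addition, not from the thread; behavior assumes an x86/x86_64 druntime where the underlying C functions exist):

```d
// core.stdc.tgmath's exp is type-generic: for a float argument it maps to
// C's expf and stays in single precision, while std.math.exp operates on
// 80-bit real on x86.
import core.stdc.tgmath : exp; // resolves to expf/exp/expl by argument type
static import std.math;

void main()
{
    float x = 1.0f;
    float viaTgmath = exp(x);                   // single-precision expf
    float viaStd = cast(float) std.math.exp(x); // computed as real, cast back
    // Both approximate e; only precision and codegen differ.
    assert(viaTgmath > 2.718f && viaTgmath < 2.719f);
    assert(viaStd > 2.718f && viaStd < 2.719f);
}
```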
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 11:29:30 UTC, Trass3r wrote:
 Try with '-O3 -release -vectorize-slp-aggressive -g 
 -pass-remarks-analysis="loop-vectorize|loop-unroll" 
 -pass-remarks=loop-unroll'

 Note that the D situation is a mess in general (correct me if 
 I'm wrong):
 * Never ever use std.math as you will get the insane 80-bit 
 functions.
 * core.math has some hacks to use llvm builtins but also mostly 
 using type real.
 * core.stdc.math supports all types but uses suffixes and maps 
 to C functions.
 * core.stdc.tgmath gets rid of the suffixes at least. Best way 
 imo to write code if you disregard auto-vectorization.
 * you can also use ldc.intrinsics to kill portability. Hello 
 C++.

 And there's no fast-math yet:
 https://github.com/ldc-developers/ldc/issues/722
I get:

Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
Unknown command line argument '-pass-remarks=loop-unroll'
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
 Unknown command line argument 
 '-pass-remarks-analysis=loop-vectorize|loop-unroll'
 Unknown command line argument '-pass-remarks=loop-unroll'
They were added to llvm in April/May. -help-hidden lists all available options.
Oct 08 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 15:04:10 UTC, Trass3r wrote:
 Unknown command line argument 
 '-pass-remarks-analysis=loop-vectorize|loop-unroll'
 Unknown command line argument '-pass-remarks=loop-unroll'
They were added to llvm in April/May. -help-hidden lists all available options.
I can confirm that no pass-remarks options are listed. I'm using 0.14.0 from here: https://github.com/ldc-developers/ldc/releases/tag/v0.14.0
Oct 08 2014
prev sibling parent reply Russel Winder via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 08/10/14 12:29, Trass3r via digitalmars-d-ldc wrote:
[…]
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers?
 * core.math has some hacks to use llvm builtins but also mostly
 using type real. * core.stdc.math supports all types but uses
 suffixes and maps to C functions. * core.stdc.tgmath gets rid of
 the suffixes at least. Best way imo to write code if you disregard
 auto-vectorization. * you can also use ldc.intrinsics to kill
 portability. Hello C++.
 
 And there's no fast-math yet: 
 https://github.com/ldc-developers/ldc/issues/722
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

--
Russel.
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers.
Note that this is less of an issue for x86 code, which uses x87 by default. On x64, though, this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.
 * [...]
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result. And core.math is a big question mark to me. I don't know about gdc. Its runtime doesn't look much different.
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 15:32:46 UTC, Trass3r wrote:
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers.
Note that this is less of an issue for x86 code using x87 by default. On x64 though this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.
 * [...]
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result. And core.math is a big question mark to me. I don't know about gdc. Its runtime doesn't look much different.
Just for the record, my benchmark code doesn't use math libraries; I'm using logistic function approximations. That's why I thought the cause of my results has to be the lack of auto vectorization.
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
Just check it with '-output-ll' or '-output-s 
-x86-asm-syntax=intel' ;)
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 16:02:17 UTC, Trass3r wrote:
 Just check it with '-output-ll' or '-output-s 
 -x86-asm-syntax=intel' ;)
I'm not an ASM expert, but as far as I can see it does indeed use some SIMD registers and instructions. For example:

.LBB0_16:
	mov	rcx, qword ptr [rax]
	mov	rdi, rax
	call	qword ptr [rcx + 56]
	test	rax, rax
	jne	.LBB0_18
	movss	xmm1, dword ptr [rsp + 116]
	jmp	.LBB0_20
	.align	16, 0x90
.LBB0_18:
	mov	rcx, rbx
	imul	rcx, rax
	add	r12, rcx
	movss	xmm1, dword ptr [rsp + 116]
	.align	16, 0x90
.LBB0_19:
	movss	xmm0, dword ptr [rdx]
	mulss	xmm0, dword ptr [r12]
	addss	xmm1, xmm0
	add	rdx, 4
	add	r12, 4
	dec	rax
	jne	.LBB0_19
.LBB0_20:
	movss	dword ptr [rsp + 116], xmm1
	inc	r14
	cmp	r14, r15
	jne	.LBB0_12
.LBB0_21:
	mov	rax, qword ptr [rsp + 80]
	mov	rdi, qword ptr [rax]
	mov	rax, qword ptr [rdi]
	call	qword ptr [rax + 40]
	test	eax, eax
	mov	rbp, qword ptr [rsp + 104]
	jne	.LBB0_24
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers7sigmoidFNbffZf
	mov	rax, qword ptr [rsp + 64]
	movss	dword ptr [rax + 4*rbp], xmm0
	xor	edx, edx
	xor	ecx, ecx
	mov	r8d, _D11TypeInfo_Af6__initZ
	mov	rdi, qword ptr [rsp + 48]
	mov	rsi, qword ptr [rsp + 96]
	call	_adEq2
	test	eax, eax
	jne	.LBB0_27
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers12sigmoidDerivFNbffZf
	mov	rax, qword ptr [rsp + 96]
	jmp	.LBB0_26
	.align	16, 0x90
.LBB0_24:
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers6linearFNbffZf
	mov	rax, qword ptr [rsp + 64]
	movss	dword ptr [rax + 4*rbp], xmm0
	xor	edx, edx
	xor	ecx, ecx
	mov	r8d, _D11TypeInfo_Af6__initZ
	mov	rdi, qword ptr [rsp + 48]
	mov	rsi, qword ptr [rsp + 96]
	call	_adEq2
	test	eax, eax
	jne	.LBB0_27
	mov	rax, qword ptr [rsp + 96]
	movss	xmm0, dword ptr [rsp + 92]
.LBB0_26:
	movss	dword ptr [rax + 4*rbp], xmm0
.LBB0_27:
	inc	rbp
	add	rbx, 4
	cmp	rbp, qword ptr [rsp + 72]
	jne	.LBB0_9
.LBB0_28:
	mov	rax, qword ptr [rsp + 24]
	inc	rax
	cmp	rax, qword ptr [rsp + 8]
	mov	rbp, qword ptr [rsp + 16]
	jne	.LBB0_1
.LBB0_29:
	add	rsp, 120
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On second thought, I can see that the main problem is that my 
computation functions are not inlined. They are like this:

float sigmoid(float value, float alpha) nothrow
{
    return (value * alpha) / (1.0f + nfAbs(value * alpha)); // Elliot
}

float sigmoidDeriv(float value, float alpha) nothrow
{
    return alpha * 1.0f / ((1.0f + nfAbs(value * alpha)) * (1.0f + nfAbs(value * alpha))); // Elliot
}

float linear(float value, float alpha) nothrow
{
    return nfMin(nfMax(value * alpha, -alpha), alpha);
}

Why are those calls not inlined or vectorized?
Oct 08 2014
next sibling parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
Here are the abs/min/max functions:

float nfAbs(float num) nothrow
{
     return num < 0.0f ? -num : num;
}

float nfMax(float num1, float num2) nothrow
{
     return num1 < num2 ? num2 : num1;
}

float nfMin(float num1, float num2) nothrow
{
     return num2 < num1 ? num2 : num1;
}

These didn't get inlined either. Why?
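(A later-era aside, my addition: compiler versions newer than the ones in this thread accept pragma(inline, true), added around DMD front-end 2.067, to force such trivial helpers inline within a compilation unit. It does not by itself fix the cross-module case.)

```d
// Hedged sketch: pragma(inline, true) asks the compiler to inline the
// helper at every call site within the same compilation unit.
pragma(inline, true)
float nfAbs(float num) nothrow
{
    return num < 0.0f ? -num : num;
}

void main()
{
    assert(nfAbs(-3.5f) == 3.5f);
    assert(nfAbs(2.0f) == 2.0f);
}
```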
Oct 08 2014
prev sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 8 October 2014 at 16:26:02 UTC, Gabor Mezo wrote:
 Why those calls are not inlined?
They are likely in a different module than the code using them, right? Modules in D are supposed to be their own, separate compilation units, just like .cpp files in C++. Thus, by default no inlining across module boundaries will take place unless you use something like link-time optimization.

Now of course this is rather undesirable and a big problem for trivial helper functions. If you just compile a single executable, you can pass -singleobj to LDC to instruct it to generate only one object file, so that the optimization boundaries disappear (arguably, this should be the default).

Furthermore, both DMD and LDC actually attempt to work around this by also analyzing imported modules so that functions in them can be inlined. Unfortunately, the LDC implementation of this has been defunct since a couple of DMD frontend merges ago. Thus, not even simple cases as in your example are covered. I'm working on a reimplementation right now, hopefully to appear in master soon.

Cheers,
David
Oct 08 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hi David,

Thanks for trying to help me out.

Indeed, the helper functions reside in separate modules. They are 
@system functions. I'll try converting my helper function system to 
mixins then.
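One way that mixin conversion could look (a sketch of the approach, not the project's actual code): a mixin template injects the helper bodies directly into the importing module, so there is no cross-module call left to inline.

```d
// Sketch: mixing the helpers into the using module sidesteps the
// cross-module inlining problem entirely.
mixin template NfHelpers()
{
    float nfAbs(float num) nothrow        { return num < 0.0f ? -num : num; }
    float nfMax(float a, float b) nothrow { return a < b ? b : a; }
    float nfMin(float a, float b) nothrow { return b < a ? b : a; }
}

// In the consuming module:
mixin NfHelpers!();

void main()
{
    assert(nfAbs(-1.0f) == 1.0f);
    // clamp 0.5 to [-1, 1] stays 0.5
    assert(nfMin(nfMax(0.5f, -1.0f), 1.0f) == 0.5f);
}
```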
Oct 08 2014
prev sibling next sibling parent "Trass3r" <un known.com> writes:
 I'm not an ASM expert
'-output-ll' gives you llvm IR, a bit higher level.
 but as far as I can see it indeed use some SIMD registers and 
 instructions. For examlple:
 	movss	xmm0, dword ptr [rdx]
 	mulss	xmm0, dword ptr [r12]
 	addss	xmm1, xmm0
If you see a 'ps' suffix (packed single-precision) it's SIMD ;) Your helper functions are probably in a different module. Cross-module inlining is problematic currently.
Oct 08 2014
prev sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 8 October 2014 at 16:23:19 UTC, Gabor Mezo wrote:
 I'm not an ASM expert, but as far as I can see it indeed use 
 some SIMD registers and instructions.
On x86_64, scalar single and double precision math uses the SSE registers and instructions by default too. The relevant mnemonics (mostly) end with "ss", which stands for "scalar single". Vectorized code, on the other hand, would use e.g. the instructions ending in "ps", for "packed single" (multiple values in one SSE register). Your snippet has not actually been vectorized.

Assuming that the code you posted was from a hot loop, though, a much bigger problem is the many function calls.

David
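For comparison, explicitly packed code looks like this (a sketch using core.simd, my addition; it assumes an x86 target where float4 maps to an SSE register):

```d
// With core.simd's float4, one addition processes four floats at once and
// compiles to a packed "addps" instead of four scalar "addss" instructions.
import core.simd;

float4 addFour(float4 a, float4 b) nothrow
{
    return a + b;
}

void main()
{
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [5.0f, 6.0f, 7.0f, 8.0f];
    float4 c = addFour(a, b);
    assert(c.array[0] == 6.0f && c.array[3] == 12.0f);
}
```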
Oct 08 2014
prev sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
Hi,

On Wednesday, 8 October 2014 at 07:37:15 UTC, Gabor Mezo wrote:
 There is a number crunching benchmark in it that doing a simple 
 gradient descent learning on a small multilayer perceptron 
 neural network. The core of the benchmark is about some loops 
 doing basic computations on numbers in float[] arrays (add, 
 mul, exp, abs).
Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem? I'm currently working on a D compiler performance tracking project, so real-world test cases where one compiler does much better than another are interesting to me.

If the code is proprietary, would it be possible for me or another compiler dev to have a look at it, so we can determine the issues more quickly?
 DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
 LDC2 0.14 -O3 -release                         : 0.051 secs
Note that array bounds checks are still enabled for LDC here if your code is @safe.
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
 Would it be possible to publish the relevant parts of the code, 
 i.e. what is needed to reproduce the performance problem? I'm 
 currently working on a D compiler performance tracking project, 
 so real-world test-cases where one compiler does much better 
 than another are interesting to me.

 If the code is proprietary, would it be possible for or me 
 another compiler dev to have a look at the code, so we can 
 determine the issues more quickly?

 DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
 LDC2 0.14 -O3 -release                         : 0.051 secs
Of course. The code will be accessible on GitHub this week. This is an LGPL-licensed hobbyist project, not confidential. ;)
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Let me introduce my project for you guys.

There is the blog:

http://neuroflowblog.wordpress.com/

I started working on it almost 10 years ago. It was a C# project, 
and the productivity of the language allowed me to implement 
advanced machine learning algorithms like Realtime Recurrent 
Learning and Scaled Conjugate Gradient. Sadly, the performance was 
not that good, so I learned OpenCL. I implemented a provider 
model in my framework, so I became able to use managed and OpenCL 
implementations in the same system. But because my experimental 
code was implemented in C# and managed code was slow, my 
experiments went really slowly.

Then I decided to move my experimental layer to C++11, and my 
framework became pure native. Sadly, the productivity of C++ is poor 
compared to C#, so even though my experimental code was fast, my 
experiments became slower than with the managed version.

Then I decided to learn D, and the result is on GitHub (a DUB 
project):

https://github.com/unbornchikken/neuroflow-D

This is a console application; when run, the mentioned benchmark will start.

Please note, this is my first D code. There are constructs 
that seem to lead nowhere, but they will gain purpose when I 
port all of the planned functionality. Because there is a 
provider model to allow OpenCL and D (and whatever) based 
implementations in parallel, I wasn't able to avoid downcasting 
in my design. Because downcasting can hugely affect performance, I 
implemented some ugly but performant void * magic. Sorry for 
that. :) Conversion of the OpenCL implementation to D is still 
TODO. Recurrent learning implementations are not implemented 
right now.
Oct 09 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hey,

We have made progress. I've merged my computation code into a 
single module, and now the LDC build is as performant as the Clang 
one! The benchmark took around 0.044 secs. It's slower than the 
GDC version, but it is amazing that the D language can be as 
performant as C++ using the same compiler backend, so no magic 
allowed.

Results pushed in.
Oct 09 2014
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so no 
 magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
Oct 10 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Friday, 10 October 2014 at 12:13:49 UTC, John Colvin wrote:
 On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so no 
 magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
How do you do this with dub?
Oct 10 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Friday, 10 October 2014 at 15:08:21 UTC, Gabor Mezo wrote:
 On Friday, 10 October 2014 at 12:13:49 UTC, John Colvin wrote:
 On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so 
 no magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
How do you do this by using dub?
Ok, thanks, I've already figured it out.
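For the record, one way to pass the flag through dub (a sketch of the relevant fragment; it assumes dub's JSON package format with platform-suffixed dflags):

```json
{
    "dflags-ldc": ["-singleobj"]
}
```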
Oct 10 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
I just wanted to inform you guys that I optimized my code to
avoid casting entirely in hot paths. To be fair, I backported my
refinements to the C++ version. Now all builds run with roughly the
same performance (Clang, LDC, GDC).

All changes are pushed to the mentioned repo.

Thanks for your help, I'm satisfied. (And eagerly waiting for
2.066 compatible GDC and LDC releases. :))
Oct 13 2014
prev sibling parent "Fool" <fool dlang.org> writes:
On Wednesday, 8 October 2014 at 17:31:21 UTC, David Nadlinger 
wrote:
 [...]
 so real-world test-cases where one compiler does much better 
 than another are interesting to me.
I recently posted a test case [1] that originated from the discussion [2].

[1] http://forum.dlang.org/thread/fowvgokbjuxplvcskswg forum.dlang.org
[2] http://forum.dlang.org/thread/ls9dbk$jkq$1 digitalmars.com

Kind regards,
Fool
Oct 11 2014