
digitalmars.D.ldc - Disappointing math performance compared to GDC

reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hello,

I have a machine learning library and I'm porting it from C++ to 
D right now.

There is a number crunching benchmark in it that does simple 
gradient descent learning on a small multilayer perceptron neural 
network. The core of the benchmark is a set of loops doing 
basic computations on numbers in float[] arrays (add, mul, exp, 
abs).

The reference is the C++ version compiled with Clang: 0.044 secs.

D results:

DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
LDC2 0.14 -O3 -release                         : 0.051 secs
GDC 4.9 -O3 -release                           : 0.031 secs

I think my benchmark code would hugely benefit from auto 
vectorization, so that might be the cause of the above results. 
I've found some vectorization compiler options for ldc2, but they 
seem to have no effect on performance whatsoever.
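For context, the hot loops are of this general shape (a hypothetical sketch, not the actual benchmark code; the name and signature are made up):

```d
// Hypothetical sketch of the kind of hot loop involved: elementwise
// float[] math that an auto-vectorizer could turn into packed SSE code.
float dot(const(float)[] a, const(float)[] b) nothrow
{
    float sum = 0.0f;
    foreach (i; 0 .. a.length)
        sum += a[i] * b[i];
    return sum;
}

void main()
{
    // 1*4 + 2*5 + 3*6 = 32, exact in single precision
    assert(dot([1.0f, 2.0f, 3.0f], [4.0f, 5.0f, 6.0f]) == 32.0f);
}
```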

Any suggestions?
Oct 08 2014
next sibling parent reply "Trass3r" <un known.com> writes:
Try with '-O3 -release -vectorize-slp-aggressive -g 
-pass-remarks-analysis="loop-vectorize|loop-unroll" 
-pass-remarks=loop-unroll'

Note that the D situation is a mess in general (correct me if I'm 
wrong):
* Never ever use std.math as you will get the insane 80-bit 
functions.
* core.math has some hacks to use llvm builtins but also mostly 
uses type real.
* core.stdc.math supports all types but uses suffixes and maps to 
C functions.
* core.stdc.tgmath gets rid of the suffixes at least. Best way 
imo to write code if you disregard auto-vectorization.
* you can also use ldc.intrinsics to kill portability. Hello C++.

And there's no fast-math yet:
https://github.com/ldc-developers/ldc/issues/722
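To illustrate the core.stdc.tgmath point, a small sketch (my addition, not from the thread; behavior assumes an x86/x86_64 druntime where the underlying C functions exist):

```d
// core.stdc.tgmath's exp is type-generic: for a float argument it maps to
// C's expf and stays in single precision, while std.math.exp operates on
// 80-bit real on x86.
import core.stdc.tgmath : exp; // resolves to expf/exp/expl by argument type
static import std.math;

void main()
{
    float x = 1.0f;
    float viaTgmath = exp(x);                   // single-precision expf
    float viaStd = cast(float) std.math.exp(x); // computed as real, cast back
    // Both approximate e; only precision and codegen differ.
    assert(viaTgmath > 2.718f && viaTgmath < 2.719f);
    assert(viaStd > 2.718f && viaStd < 2.719f);
}
```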
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 11:29:30 UTC, Trass3r wrote:
 Try with '-O3 -release -vectorize-slp-aggressive -g 
 -pass-remarks-analysis="loop-vectorize|loop-unroll" 
 -pass-remarks=loop-unroll'

 Note that the D situation is a mess in general (correct me if 
 I'm wrong):
 * Never ever use std.math as you will get the insane 80-bit 
 functions.
 * core.math has some hacks to use llvm builtins but also mostly 
 using type real.
 * core.stdc.math supports all types but uses suffixes and maps 
 to C functions.
 * core.stdc.tgmath gets rid of the suffixes at least. Best way 
 imo to write code if you disregard auto-vectorization.
 * you can also use ldc.intrinsics to kill portability. Hello 
 C++.

 And there's no fast-math yet:
 https://github.com/ldc-developers/ldc/issues/722
I get:

Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
Unknown command line argument '-pass-remarks=loop-unroll'
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
 Unknown command line argument 
 '-pass-remarks-analysis=loop-vectorize|loop-unroll'
 Unknown command line argument '-pass-remarks=loop-unroll'
They were added to llvm in April/May. -help-hidden lists all available options.
Oct 08 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 15:04:10 UTC, Trass3r wrote:
 Unknown command line argument 
 '-pass-remarks-analysis=loop-vectorize|loop-unroll'
 Unknown command line argument '-pass-remarks=loop-unroll'
They were added to llvm in April/May. -help-hidden lists all available options.
I can confirm that no pass-remarks options are listed. I'm using 0.14.0 from here: https://github.com/ldc-developers/ldc/releases/tag/v0.14.0
Oct 08 2014
prev sibling parent reply Russel Winder via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 08/10/14 12:29, Trass3r via digitalmars-d-ldc wrote:
[…]
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers?
 * core.math has some hacks to use llvm builtins but also mostly
 using type real. * core.stdc.math supports all types but uses
 suffixes and maps to C functions. * core.stdc.tgmath gets rid of
 the suffixes at least. Best way imo to write code if you disregard
 auto-vectorization. * you can also use ldc.intrinsics to kill
 portability. Hello C++.
 
 And there's no fast-math yet: 
 https://github.com/ldc-developers/ldc/issues/722
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

--
Russel.
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers.
Note that this is less of an issue for x86 code, which uses x87 by default. On x64, though, this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.
 * [...]
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result. And core.math is a big question mark to me. I don't know about gdc. Its runtime doesn't look much different.
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 15:32:46 UTC, Trass3r wrote:
 * Never ever use std.math as you will get the insane 80-bit
 functions.
What can one use to avoid this, and use the 64-bit numbers.
Note that this is less of an issue for x86 code using x87 by default. On x64 though this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.
 * [...]
Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result. And core.math is a big question mark to me. I don't know about gdc. Its runtime doesn't look much different.
Just for the record, my benchmark code doesn't use math libraries; I'm using logistic function approximations. That's why I thought the cause of my results has to be the lack of auto vectorization.
Oct 08 2014
parent reply "Trass3r" <un known.com> writes:
Just check it with '-output-ll' or '-output-s 
-x86-asm-syntax=intel' ;)
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Wednesday, 8 October 2014 at 16:02:17 UTC, Trass3r wrote:
 Just check it with '-output-ll' or '-output-s 
 -x86-asm-syntax=intel' ;)
I'm not an ASM expert, but as far as I can see it does indeed use some SIMD registers and instructions. For example:

.LBB0_16:
	mov	rcx, qword ptr [rax]
	mov	rdi, rax
	call	qword ptr [rcx + 56]
	test	rax, rax
	jne	.LBB0_18
	movss	xmm1, dword ptr [rsp + 116]
	jmp	.LBB0_20
	.align	16, 0x90
.LBB0_18:
	mov	rcx, rbx
	imul	rcx, rax
	add	r12, rcx
	movss	xmm1, dword ptr [rsp + 116]
	.align	16, 0x90
.LBB0_19:
	movss	xmm0, dword ptr [rdx]
	mulss	xmm0, dword ptr [r12]
	addss	xmm1, xmm0
	add	rdx, 4
	add	r12, 4
	dec	rax
	jne	.LBB0_19
.LBB0_20:
	movss	dword ptr [rsp + 116], xmm1
	inc	r14
	cmp	r14, r15
	jne	.LBB0_12
.LBB0_21:
	mov	rax, qword ptr [rsp + 80]
	mov	rdi, qword ptr [rax]
	mov	rax, qword ptr [rdi]
	call	qword ptr [rax + 40]
	test	eax, eax
	mov	rbp, qword ptr [rsp + 104]
	jne	.LBB0_24
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers7sigmoidFNbffZf
	mov	rax, qword ptr [rsp + 64]
	movss	dword ptr [rax + 4*rbp], xmm0
	xor	edx, edx
	xor	ecx, ecx
	mov	r8d, _D11TypeInfo_Af6__initZ
	mov	rdi, qword ptr [rsp + 48]
	mov	rsi, qword ptr [rsp + 96]
	call	_adEq2
	test	eax, eax
	jne	.LBB0_27
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers12sigmoidDerivFNbffZf
	mov	rax, qword ptr [rsp + 96]
	jmp	.LBB0_26
	.align	16, 0x90
.LBB0_24:
	movss	xmm0, dword ptr [rsp + 92]
	movss	xmm1, dword ptr [rsp + 116]
	call	_D8nhelpers6linearFNbffZf
	mov	rax, qword ptr [rsp + 64]
	movss	dword ptr [rax + 4*rbp], xmm0
	xor	edx, edx
	xor	ecx, ecx
	mov	r8d, _D11TypeInfo_Af6__initZ
	mov	rdi, qword ptr [rsp + 48]
	mov	rsi, qword ptr [rsp + 96]
	call	_adEq2
	test	eax, eax
	jne	.LBB0_27
	mov	rax, qword ptr [rsp + 96]
	movss	xmm0, dword ptr [rsp + 92]
.LBB0_26:
	movss	dword ptr [rax + 4*rbp], xmm0
.LBB0_27:
	inc	rbp
	add	rbx, 4
	cmp	rbp, qword ptr [rsp + 72]
	jne	.LBB0_9
.LBB0_28:
	mov	rax, qword ptr [rsp + 24]
	inc	rax
	cmp	rax, qword ptr [rsp + 8]
	mov	rbp, qword ptr [rsp + 16]
	jne	.LBB0_1
.LBB0_29:
	add	rsp, 120
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On second thought, I can see that the main problem is that my 
computation functions are not inlined. They are like this:

float sigmoid(float value, float alpha) nothrow
{
    return (value * alpha) / (1.0f + nfAbs(value * alpha)); // Elliot
}

float sigmoidDeriv(float value, float alpha) nothrow
{
    return alpha * 1.0f / ((1.0f + nfAbs(value * alpha)) * (1.0f + nfAbs(value * alpha))); // Elliot
}

float linear(float value, float alpha) nothrow
{
    return nfMin(nfMax(value * alpha, -alpha), alpha);
}

Why are those calls not inlined or vectorized?
Oct 08 2014
next sibling parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
Here are the abs/min/max functions:

float nfAbs(float num) nothrow
{
     return num < 0.0f ? -num : num;
}

float nfMax(float num1, float num2) nothrow
{
     return num1 < num2 ? num2 : num1;
}

float nfMin(float num1, float num2) nothrow
{
     return num2 < num1 ? num2 : num1;
}

These didn't get inlined either. Why?
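(A later-era aside, my addition: compiler versions newer than the ones in this thread accept pragma(inline, true), added around DMD front-end 2.067, to force such trivial helpers inline within a compilation unit. It does not by itself fix the cross-module case.)

```d
// Hedged sketch: pragma(inline, true) asks the compiler to inline the
// helper at every call site within the same compilation unit.
pragma(inline, true)
float nfAbs(float num) nothrow
{
    return num < 0.0f ? -num : num;
}

void main()
{
    assert(nfAbs(-3.5f) == 3.5f);
    assert(nfAbs(2.0f) == 2.0f);
}
```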
Oct 08 2014
prev sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 8 October 2014 at 16:26:02 UTC, Gabor Mezo wrote:
 Why those calls are not inlined?
They are likely in a different module than the code using them, right? Modules in D are supposed to be their own, separate compilation units, just like .cpp files in C++. Thus, by default no inlining across module boundaries will take place unless you use something like link-time optimization.

Now of course this is rather undesirable and a big problem for trivial helper functions. If you just compile a single executable, you can pass -singleobj to LDC to instruct it to generate only one object file, so that the optimization boundaries disappear (arguably, this should be the default).

Furthermore, both DMD and LDC actually attempt to work around this by also analyzing imported modules so that functions in them can be inlined. Unfortunately, the LDC implementation of this has been defunct since a couple of DMD frontend merges ago. Thus, not even simple cases as in your example are covered. I'm working on a reimplementation right now, hopefully to appear in master soon.

Cheers,
David
Oct 08 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hi David,

Thanks for trying to help me out.

Indeed, the helper functions reside in separate modules. They are 
@system functions. I'll try converting my helper function system to 
mixins then.
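One way that mixin conversion could look (a sketch of the approach, not the project's actual code): a mixin template injects the helper bodies directly into the importing module, so there is no cross-module call left to inline.

```d
// Sketch: mixing the helpers into the using module sidesteps the
// cross-module inlining problem entirely.
mixin template NfHelpers()
{
    float nfAbs(float num) nothrow        { return num < 0.0f ? -num : num; }
    float nfMax(float a, float b) nothrow { return a < b ? b : a; }
    float nfMin(float a, float b) nothrow { return b < a ? b : a; }
}

// In the consuming module:
mixin NfHelpers!();

void main()
{
    assert(nfAbs(-1.0f) == 1.0f);
    // clamp 0.5 to [-1, 1] stays 0.5
    assert(nfMin(nfMax(0.5f, -1.0f), 1.0f) == 0.5f);
}
```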
Oct 08 2014
prev sibling next sibling parent "Trass3r" <un known.com> writes:
 I'm not an ASM expert
'-output-ll' gives you llvm IR, a bit higher level.
 but as far as I can see it indeed use some SIMD registers and 
 instructions. For examlple:
 	movss	xmm0, dword ptr [rdx]
 	mulss	xmm0, dword ptr [r12]
 	addss	xmm1, xmm0
If you see a 'ps' suffix (packed single-precision) it's SIMD ;) Your helper functions are probably in a different module. Cross-module inlining is problematic currently.
Oct 08 2014
prev sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 8 October 2014 at 16:23:19 UTC, Gabor Mezo wrote:
 I'm not an ASM expert, but as far as I can see it indeed use 
 some SIMD registers and instructions.
On x86_64, scalar single and double precision math uses the SSE registers and instructions by default too. The relevant mnemonics (mostly) end with "ss", which stands for "scalar single". Vectorized code, on the other hand, would use e.g. the instructions ending in "ps", for "packed single" (multiple values in one SSE register). Your snippet has not actually been vectorized.

Assuming that the code you posted was from a hot loop, though, a much bigger problem is the many function calls.

David
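For comparison, explicitly packed code looks like this (a sketch using core.simd, my addition; it assumes an x86 target where float4 maps to an SSE register):

```d
// With core.simd's float4, one addition processes four floats at once and
// compiles to a packed "addps" instead of four scalar "addss" instructions.
import core.simd;

float4 addFour(float4 a, float4 b) nothrow
{
    return a + b;
}

void main()
{
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [5.0f, 6.0f, 7.0f, 8.0f];
    float4 c = addFour(a, b);
    assert(c.array[0] == 6.0f && c.array[3] == 12.0f);
}
```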
Oct 08 2014
prev sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
Hi,

On Wednesday, 8 October 2014 at 07:37:15 UTC, Gabor Mezo wrote:
 There is a number crunching benchmark in it that doing a simple 
 gradient descent learning on a small multilayer perceptron 
 neural network. The core of the benchmark is about some loops 
 doing basic computations on numbers in float[] arrays (add, 
 mul, exp, abs).
Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem? I'm currently working on a D compiler performance tracking project, so real-world test cases where one compiler does much better than another are interesting to me.

If the code is proprietary, would it be possible for me or another compiler dev to have a look at it, so we can determine the issues more quickly?
 DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
 LDC2 0.14 -O3 -release                         : 0.051 secs
Note that array bounds checks are still enabled for LDC here if your code is @safe.
Oct 08 2014
next sibling parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
 Would it be possible to publish the relevant parts of the code, 
 i.e. what is needed to reproduce the performance problem? I'm 
 currently working on a D compiler performance tracking project, 
 so real-world test-cases where one compiler does much better 
 than another are interesting to me.

 If the code is proprietary, would it be possible for or me 
 another compiler dev to have a look at the code, so we can 
 determine the issues more quickly?

 DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
 LDC2 0.14 -O3 -release                         : 0.051 secs
Of course. The code will be accessible on GitHub this week. This is an LGPL-licensed hobbyist project, not confidential. ;)
Oct 08 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Let me introduce my project for you guys.

There is the blog:

http://neuroflowblog.wordpress.com/

I started working on it almost 10 years ago. It was a C# project, 
and the productivity of the language allowed me to implement 
advanced machine learning algorithms like Realtime Recurrent 
Learning and Scaled Conjugate Gradient. Sadly, the performance was 
not that good, so I learned OpenCL. I implemented a provider 
model in my framework, so I became able to use managed and OpenCL 
implementations in the same system. But because my experimental 
code was implemented in C# and managed code was slow, my 
experiments went really slowly.

Then I decided to move my experimental layer to C++11, and my 
framework became pure native. Sadly, the productivity of C++ is poor 
compared to C#, so even though my experimental code was fast, my 
experiments became slower than with the managed version.

Then I decided to learn D, and the result is on GitHub (a DUB 
project):

https://github.com/unbornchikken/neuroflow-D

This is a console application; when run, the mentioned benchmark will start.

Please note, this is my first D code. There are constructs 
that seem to lead nowhere, but they will gain purpose when I 
port all of the planned functionality. Because there is a 
provider model to allow OpenCL and D (and whatever) based 
implementations in parallel, I wasn't able to avoid downcasting 
in my design. Because downcasting can hugely affect performance, I 
implemented some ugly but performant void * magic. Sorry for 
that. :) Conversion of the OpenCL implementation to D is still 
TODO. Recurrent learning implementations are not implemented 
right now.
Oct 09 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
Hey,

We have made progress. I've merged my computation code into a 
single module, and now the LDC build is as performant as the Clang 
one! The benchmark took around 0.044 secs. It's slower than the 
GDC version, but it is amazing that the D language can be as 
performant as C++ using the same compiler backend, so no magic 
allowed.

Results pushed in.
Oct 09 2014
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so no 
 magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
Oct 10 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Friday, 10 October 2014 at 12:13:49 UTC, John Colvin wrote:
 On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so no 
 magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
How do you do this with dub?
Oct 10 2014
parent reply "Gabor Mezo" <gabor.mezo outlook.com> writes:
On Friday, 10 October 2014 at 15:08:21 UTC, Gabor Mezo wrote:
 On Friday, 10 October 2014 at 12:13:49 UTC, John Colvin wrote:
 On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
 Hey,

 We have made progress. I've merged my computation code into a 
 single module, and now the LDC build is as perfomant as the 
 Clang one! The benchmark took around 0.044 secs. It's slower 
 that the GDC version but it is amazing that D language can be 
 as performant as C++ by using the same compiler backend, so 
 no magic allowed.

 Results pushed in.
The -singleobj flag may give you that same performance boost without having to refactor the code.
How do you do this by using dub?
Ok, thanks, I've already figured it out.
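For the record, one way to pass the flag through dub (a sketch of the relevant fragment; it assumes dub's JSON package format with platform-suffixed dflags):

```json
{
    "dflags-ldc": ["-singleobj"]
}
```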
Oct 10 2014
parent "Gabor Mezo" <gabor.mezo outlook.com> writes:
I just wanted to inform you guys that I optimized my code to
avoid casting entirely in hot paths. To be fair, I backported my
refinements to the C++ version. Now all builds run with roughly the
same performance (Clang, LDC, GDC).

All changes are pushed to the mentioned repo.

Thanks for your help, I'm satisfied. (And eagerly waiting for
2.066 compatible GDC and LDC releases. :))
Oct 13 2014
prev sibling parent "Fool" <fool dlang.org> writes:
On Wednesday, 8 October 2014 at 17:31:21 UTC, David Nadlinger 
wrote:
 [...]
 so real-world test-cases where one compiler does much better 
 than another are interesting to me.
I recently posted a test case [1] that originated from the discussion [2].

[1] http://forum.dlang.org/thread/fowvgokbjuxplvcskswg forum.dlang.org
[2] http://forum.dlang.org/thread/ls9dbk$jkq$1 digitalmars.com

Kind regards,
Fool
Oct 11 2014