digitalmars.D.ldc - How to deal with inline asm functions in Phobos/druntime?

Johan Engelen (50/50) Apr 07 2015 Hi all,

Kai Nacke (19/22) Apr 07 2015 Hi Johan!

Johan Engelen (10/10) Apr 08 2015 Hi Kai,

Kai Nacke (42/52) Apr 08 2015 Hi Johan!

Johan Engelen (2/8) Apr 08 2015 Exactly!

David Nadlinger via digitalmars-d-ldc (11/13) Apr 08 2015 Speaking about performance: Haven't we switched to 64 bit reals on

Kai Nacke (10/24) Apr 08 2015 Yes, reals are 64bit on Win64. Using SSE can really help here.

Johan Engelen (21/21) Apr 08 2015 Thanks Kai!

Daniel Murphy (4/12) Apr 08 2015 I don't think it's so much about vectorizing as it is about avoiding the...

David Nadlinger via digitalmars-d-ldc (9/11) Apr 08 2015 Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by

Johan Engelen (4/18) Apr 08 2015 Ah, ok. Didn't realize.

Johan Engelen (5/5) Apr 12 2015 I wrote a non-asm ilogb, that actually runs quite a bit faster

Kai Nacke (5/10) Apr 13 2015 Nice! For LDC you can replace bsr with intrinsic llvm.ctlz.i#. It

"Johan Engelen" <j j.nl> writes:

Hi all,
   I've hit on a problem that I know how to fix, but I do not know 
how to properly do it. Thanks for your help.

A number of functions have inline assembly implementations in 
Phobos, e.g. std.math.ilogb(). I don't know why exactly they have 
asm implementations for Windows. The default path 
"core.stdc.math.ilogbl(x);" would be fine on Windows. The problem 
is that LDC would take that same assembly code, although it 
assumes DMD calling conventions. Also, _much_ better code is 
generated when ldc.llvmasm inline assembly code is used for 
non-naked asm functions: for D-style non-naked asm, the generated 
code contains huge function pro-/epilogues. The LDC 
implementation without assumptions about calling conventions for 
MSVC-compatible ilogb would be:

         import ldc.llvmasm;
         return __asm!int(
            `fldl     $1         ;
             fxam                ;
             fstsw   %AX         ;
             and     $$0x45, %AH ;
             cmp     $$0x40, %AH ;
             jz      Lzeronan    ;
             cmp     $$5, %AH    ;
             jz      Linfinity   ;
             cmp     $$1, %AH    ;
             jz      Lzeronan    ;
             fxtract             ;
             fstp    %ST(0)      ;
             fistpl  (%RSP)      ;
             mov     (%RSP), $0  ;
             jmp     Ldone       ;
         Lzeronan:
             mov     $$0x80000000, $0 ;
             fstp    %ST(0)           ;
             jmp     Ldone            ;

         Linfinity:
             mov     $$0x7FFFFFFF, $0 ;
             fstp    %ST(0)           ;

         Ldone: ;`, "=r,*m,~{ax},~{memory}", &x);

I think it'd be relatively straightforward to write the code such 
that it works for 80-bit and 64-bit reals.

My question: how do I fix our fork of Phobos? Do we just want to 
pass the call to core.stdc.math.ilogbl, and disregard the 
'optimized' inline asm? Or do we add "version(LDC) version 
(Win64)" or similar and add our own asm implementations?

It is fun to write these small asm blobs, but I am not sure how 
maintainable all this will be.

Confused :S

Thanks!
   Johan

Apr 07 2015

"Kai Nacke" <kai redstar.de> writes:

On Tuesday, 7 April 2015 at 18:15:58 UTC, Johan Engelen wrote:
 It is fun to write these small asm blobs, but I am not sure how 
 maintainable all this will be.

 Confused :S

Hi Johan!

The functions in Druntime/Phobos exist because the MSVC runtime 
does not support 80-bit reals, which are a must-have for DMD.

If you are going to rewrite assembly snippets then please use 
ldc.llvmasm because of all the advantages you have listed. Plus, 
ldc.llvmasm does not disable function inlining, which is also a big 
win.

Regarding std.math, there are already a lot of places with versions 
for LDC. One reason is that LLVM intrinsics exist. Another 
reason is that LDC supports PPC, ARM and MIPS, too. A final 
reason is that the DMD asm does not work. I prefer having 
rewritten assembly functions. But please note that there must 
always be a fallback to the core.stdc function because of the 
many supported architectures.

BTW: Many users of LDC want fast math functions. Having a better 
implementation of math functions is therefore desirable.

Regards,
Kai

Apr 07 2015

"Johan Engelen" <j j.nl> writes:

Hi Kai,
   OK!
I'd very much appreciate an example of how to add the code above 
to ilogb in std.math. The "version(...)" stuff quickly becomes a 
mess.
Could you take the code above and show me how you would put this 
into std.math?
I'm sorry if this is too much hand-holding, it is really very 
much appreciated.

-Johan

Apr 08 2015

"Kai Nacke" <kai redstar.de> writes:

On Wednesday, 8 April 2015 at 09:29:56 UTC, Johan Engelen wrote:
 Hi Kai,
   OK!
 I'd very much appreciate an example of how to add the code 
 above to ilogb in std.math. The "version(...)" stuff quickly 
 becomes a mess.
 Could you take the code above and show me how you would put 
 this into std.math?
 I'm sorry if this is too much hand-holding, it is really very 
 much appreciated.

 -Johan

Hi Johan!

The structure of ilogb() is:

int ilogb(real x)
{
     version(Win64_DMD_InlineAsm_X87)
         ....
     else version(CRuntime_Microsoft)
         ....
     else
         return core.stdc.math.ilogbl(x);
}

You can simply add your stuff at the top of the function, 
resulting in:

int ilogb(real x)
{
     version(LDC)
     {
         version(X86_64)
             ....
         else version(X86)
             ....
         else
             return core.stdc.math.ilogbl(x);
     }
     else
     version(Win64_DMD_InlineAsm_X87)
         ....
     else version(CRuntime_Microsoft)
         ....
     else
         return core.stdc.math.ilogbl(x);
}

Look at the 'misplaced' else: by adding the code this way, you 
do not touch the original implementation. This makes merging much 
easier. If you look closer at std.math, you see a lot of 'wrong' 
formatting/spacing where version(LDC) is involved. The goal here is 
to leave as much as possible of the original source untouched.
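
A compilable sketch of the layout described above. The name `myIlogb` is hypothetical, the argument is a `double` so that the C runtime's `ilogb` can serve as a stand-in in every branch, and the bodies are placeholders, not the real std.math code. The only point is where the `version(LDC)` block and the 'misplaced' else go:

```d
import core.stdc.math : ilogb; // C runtime version, used as the fallback

// Hypothetical wrapper, not a Phobos function.
int myIlogb(double x)
{
    version (LDC)
    {
        // LDC-specific branches are added at the top of the function.
        version (X86_64)
            return ilogb(x); // placeholder: an ldc.llvmasm version would go here
        else
            return ilogb(x); // fallback for PPC, ARM, MIPS, ...
    }
    else
    version (D_InlineAsm_X86_64) // stand-in for the untouched upstream code
        return ilogb(x);
    else
        return ilogb(x);
}
```

Because the upstream lines keep their original indentation, a later merge from upstream only has to deal with the inserted block at the top.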

I hope this answers your question. If not, then do not hesitate to 
ask again.

Regards,
Kai

Apr 08 2015

"Johan Engelen" <j j.nl> writes:

On Wednesday, 8 April 2015 at 11:09:48 UTC, Kai Nacke wrote:
 Look at the 'misplaced' else: Doing the addition in this way 
 you do not touch the original implementation. This makes 
 merging much easier. If you look closer at std.math, you see a 
 lot of 'wrong' formatting/spacing if version(LDC) is involved. 
 The goal here is to leave as much as possible of the original 
 source untouched.

Exactly!

Apr 08 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On Wed, Apr 8, 2015 at 8:31 AM, Kai Nacke via digitalmars-d-ldc
<digitalmars-d-ldc puremagic.com> wrote:
 BTW: Many users of LDC want fast math functions. Having a better
 implementation of math functions is therefore desirable.

Speaking about performance: Haven't we switched to 64 bit reals on
Win64 for the time being? If so, we'd probably want a version of ilogb
that uses SSE instead of the x87 FPU. This is probably (I'm rather
sure, but did not measure it) a much bigger performance problem than
any overly long general purpose register pushing/popping prologues
generated by non-naked asm functions.

ilogb() does not seem to be a commonly used function, though, so it
might not be worth the effort.

 — David

Apr 08 2015

"Kai Nacke" <kai redstar.de> writes:

On Wednesday, 8 April 2015 at 11:29:26 UTC, David Nadlinger wrote:
 Speaking about performance: Haven't we switched to 64 bit reals 
 on
 Win64 for the time being? If so, we'd probably want a version 
 of ilogb
 that uses SSE instead of the x87 FPU. This is probably (I'm 
 rather
 sure, but did not measure it) a much bigger performance problem 
 than
 any overly long general purpose register pushing/popping 
 prologues
 generated by non-naked asm functions.

Yes, reals are 64bit on Win64. Using SSE can really help here.
Speaking of performance: The only way to know that some piece of 
code is faster is to measure it. This can reveal surprising 
insights.... (Hint!)
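
One possible way to take such a measurement, as a rough sketch: it uses `benchmark` from `std.datetime.stopwatch` (the current Phobos location of that helper) to time the C runtime `ilogb` against `std.math.ilogb`. The names `viaCRuntime`, `viaPhobos`, `compareIlogb`, `input` and `sink` are made up for this example, and a microbenchmark like this is only indicative.

```d
import core.time : Duration;
import std.datetime.stopwatch : benchmark;
static import core.stdc.math;
static import std.math;

double input = 12345.678; // module-level, intended to discourage constant folding
int sink;                 // accumulates results so the calls have a visible effect

void viaCRuntime() { sink += core.stdc.math.ilogb(input); }
void viaPhobos()   { sink += std.math.ilogb(input); }

// Runs each candidate `runs` times and returns the total time for each.
Duration[2] compareIlogb(uint runs)
{
    return benchmark!(viaCRuntime, viaPhobos)(runs);
}
```

Calling `compareIlogb(1_000_000)` and printing both durations gives a first impression. Comparing builds with and without optimization is worthwhile, since inlining changes the picture considerably.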

 ilogb() does not seem to be a commonly used function, though, 
 so it
 might not be worth the effort.

There are a lot of other functions, too. E.g. the inline assembler 
in expi() is commented out because of a problem with the asm parser. 
I think Johan will find enough worthwhile targets.

Regards,
Kai

Apr 08 2015

"Johan Engelen" <j j.nl> writes:

Thanks Kai!

Well, first we should get LDC fully functional on Win64, no? :P
That was my main goal, really.

About SSE: I can't vectorize the code for this one function with 
one real as argument! I had done a brief search for what 
instructions are available on xmm regs (argument is passed 
through xmm0), but it is mostly simple arithmetic I think, not 
the kind of stuff that is used in the original druntime asm code.
(Btw, the pro-epilogues also consist of pushing/popping all XMM 
regs, quite a bit of data, but indeed no clue how slow/fast that 
is. Didn't measure a thing, but it just looked kind of wasteful :)

But thinking about David's comment a bit more, if I understand 
ilogb correctly, all it needs to do is output the exponent of the 
real as an int. For that, one doesn't need floating point 
operations at all. Just a bit of bit shifting and masking, and 
subtracting the floating point format's exponent bias value. I 
think we can just express that as normal D code, which can then 
be optimized / vectorized by LLVM ? That code could go upstream, 
with a few static ifs for all floating point formats supported.
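
A minimal sketch of that idea for a 64-bit IEEE 754 double only, not the code that later went upstream. The name `ilogbBits` is hypothetical, and the return values for zero, NaN and infinity simply mirror the constants in the asm version earlier in the thread (0x80000000 and 0x7FFFFFFF):

```d
import core.bitop : bsr;

// Hypothetical helper: binary exponent of a double using integer operations only.
int ilogbBits(double x)
{
    // Reinterpret the 64 bits of the double as an integer.
    ulong bits = *cast(ulong*) &x;
    // Bits 52..62 hold the biased exponent, bits 0..51 the mantissa.
    int biased = cast(int)((bits >> 52) & 0x7FF);
    ulong mantissa = bits & 0x000F_FFFF_FFFF_FFFF;

    if (biased == 0x7FF)              // infinity or NaN
        return mantissa == 0 ? int.max : int.min;
    if (biased == 0)
    {
        if (mantissa == 0)            // positive or negative zero
            return int.min;
        // Subnormal: the value is mantissa * 2^-1074.
        return bsr(mantissa) - 1074;
    }
    return biased - 1023;             // normal number: remove the exponent bias
}
```

An 80-bit real would need a different layout (15 exponent bits, explicit integer bit), which is where the static ifs per floating point format would come in.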

Again, I want to get LDC fully functional foremost, so all this 
is fun but distracting ;)

Apr 08 2015

"Daniel Murphy" <yebbliesnospam gmail.com> writes:

"Johan Engelen"  wrote in message 
news:hqpjetsgaeoqkfyqexka forum.dlang.org...

 About SSE: I can't vectorize the code for this one function with one real 
 as argument! I had done a brief search for what instructions are available 
 on xmm regs (argument is passed through xmm0), but it is mostly simple 
 arithmetic I think, not the kind of stuff that is used in the original 
 druntime asm code.
 (Btw, the pro-epilogues also consist of pushing/popping all XMM regs, 
 quite a bit of data, but indeed no clue how slow/fast that is. Didn't 
 measure a thing, but it just looked kind of wasteful :)

I don't think it's so much about vectorizing as it is about avoiding the x87 
FPU, which you can do when 80-bit precision is not needed.

Apr 08 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc wrote:
 I don't think it's so much about vectorizing as it is about avoiding 
 the x87 FPU, which you can do when 80-bit precision is not needed.

Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by 
default for single- and double-precision floating point operations. The 
x87 FPU is not particularly well-optimized on newer CPUs to begin with, 
and transferring data from the SSE registers to the FPU on function 
entry and then back again is quite costly too.

For example, this is what made us (all D compilers) look bad on that 
Perlin noise microbenchmark (the thread from a couple of months ago).

  — David

Apr 08 2015

"Johan Engelen" <j j.nl> writes:

On Wednesday, 8 April 2015 at 13:28:16 UTC, David Nadlinger wrote:
 On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc 
 wrote:
 I don't think it's so much about vectorizing as it is about 
 avoiding the x87 FPU, which you can do when 80-bit precision 
 is not needed.

 Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used 
 by default for single- and double-precision floating point 
 operations. The x87 FPU is not particularly well-optimized on 
 newer CPUs to begin with, and transferring data from the SSE 
 registers to the FPU on function entry and then back again is 
 quite costly too.

 For example, this is what made us (all D compilers) look bad on 
 that Perlin noise microbenchmark (the thread from a couple of 
 months ago).

Ah, ok. Didn't realize.

For future reference:
http://gruntthepeon.free.fr/ssemath

Apr 08 2015

"Johan Engelen" <j j.nl> writes:

I wrote a non-asm ilogb that actually runs quite a bit faster 
than what DMD or LDC use by default, and should also be much more 
portable.
See
https://github.com/D-Programming-Language/phobos/pull/3186

Apr 12 2015

"Kai Nacke" <kai redstar.de> writes:

On Sunday, 12 April 2015 at 22:35:20 UTC, Johan Engelen wrote:
 I wrote a non-asm ilogb that actually runs quite a bit faster 
 than what DMD or LDC use by default, and should also be much more 
 portable.
 See
 https://github.com/D-Programming-Language/phobos/pull/3186


Nice! For LDC you can replace bsr with intrinsic llvm.ctlz.i#. It 
is a template and also CTFE enabled.
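
A sketch of how swapping bsr for the llvm.ctlz intrinsic might look. It assumes the intrinsic is exposed as `llvm_ctlz(value, isZeroUndefined)` in `ldc.intrinsics`; check that module for the exact signature. `highestSetBit` is a hypothetical helper name, and other compilers fall back to `core.bitop.bsr`:

```d
import core.bitop : bsr;

// Hypothetical helper: index of the highest set bit of a non-zero 32-bit value.
int highestSetBit(uint v)
{
    version (LDC)
    {
        import ldc.intrinsics : llvm_ctlz;
        // ctlz counts leading zero bits, so the bit index is 31 minus that count.
        // The second argument states that the result for v == 0 may be undefined.
        return 31 - cast(int) llvm_ctlz(v, true);
    }
    else
    {
        return bsr(v);
    }
}
```

For a 64-bit mantissa the same pattern applies with `ulong` and 63 in place of 31.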

Regards,
Kai

Apr 13 2015
