
digitalmars.D.ldc - How to deal with inline asm functions in Phobos/druntime?

reply "Johan Engelen" <j j.nl> writes:
Hi all,
   I've hit on a problem that I know how to fix, but I do not know 
how to properly do it. Thanks for your help.

A number of functions have inline assembly implementations in 
Phobos, e.g. std.math.ilogb(). I don't know why exactly they have 
asm implementations for Windows. The default path 
"core.stdc.math.ilogbl(x);" would be fine on Windows. The problem 
is that LDC would take that same assembly code, although it 
assumes DMD calling conventions. Also, _much_ better code is 
generated when ldc.llvmasm inline assembly code is used for 
non-naked asm functions: for D-style naked asm, the generated 
code contains huge function pro-/epilogues. The LDC 
implementation without assumptions about calling conventions for 
MSVC-compatible ilogb would be:

         import ldc.llvmasm;
         return __asm!int(
            `fldl     $1         ;
             fxam                ;
             fstsw   %AX         ;
             and     $$0x45, %AH ;
             cmp     $$0x40, %AH ;
             jz      Lzeronan    ;
             cmp     $$5, %AH    ;
             jz      Linfinity   ;
             cmp     $$1, %AH    ;
             jz      Lzeronan    ;
             fxtract             ;
             fstp    %ST(0)      ;
             fistpl  (%RSP)      ;
             mov     (%RSP), $0  ;
             jmp     Ldone       ;
         Lzeronan:
             mov     $$0x80000000, $0 ;
             fstp    %ST(0)           ;
             jmp     Ldone            ;

         Linfinity:
             mov     $$0x7FFFFFFF, $0 ;
             fstp    %ST(0)           ;

         Ldone: ;`, "=r,*m,~{ax},~{memory}", &x);

I think it'd be relatively straightforward to write the code such 
that it works for 80-bit and 64-bit reals.
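
As a rough illustration of what supporting both formats could look like, the sketch below selects a code path at compile time from real.mant_dig. The function name realFormatName is made up for this example and is not part of Phobos or druntime; only the static-if ladder is the point.

```d
// Hypothetical helper: reports which floating-point format `real` uses
// on the current target. The same static-if ladder could select between
// an x87 code path and a double-precision one.
string realFormatName() pure nothrow @nogc @safe
{
    static if (real.mant_dig == 64)
        return "x87 80-bit extended";      // 64-bit significand
    else static if (real.mant_dig == 53)
        return "IEEE 754 double (64-bit)"; // real has the same format as double
    else
        return "other";                    // any other target-specific format
}
```

Because the branch is resolved at compile time, only the code for the target's actual real format ends up in the binary.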

My question: how do I fix our fork of Phobos? Do we just want to 
pass the call to core.stdc.math.ilogbl, and disregard the 
'optimized' inline asm? Or do we add "version(LDC) version 
(Win64)" or similar and add our own asm implementations?

It is fun to write these small asm blobs, but I am not sure how 
maintainable all this will be.

Confused :S

Thanks!
   Johan
Apr 07 2015
parent reply "Kai Nacke" <kai redstar.de> writes:
On Tuesday, 7 April 2015 at 18:15:58 UTC, Johan Engelen wrote:
 It is fun to write these small asm blobs, but I am not sure how 
 maintainable all this will be.

 Confused :S
Hi Johan!

The functions in Druntime/Phobos exist because the MSVC runtime does not support 80-bit reals, which are a must-have for DMD. If you are going to rewrite assembly snippets, then please use ldc.llvmasm because of all the advantages you have listed. Plus, ldc.llvmasm does not disable function inlining, which is also a big win.

Regarding std.math, there are already a lot of places with versions for LDC. One reason is that LLVM intrinsics exist. Another reason is that LDC supports PPC, ARM and MIPS, too. A last reason is that the DMD ASM does not work.

I prefer having rewritten assembly functions. But please note that there must always be a fallback to the core.stdc function because of the many supported architectures.

BTW: Many users of LDC want fast math functions. Having a better implementation of math functions is therefore desirable.

Regards,
Kai
Apr 07 2015
next sibling parent reply "Johan Engelen" <j j.nl> writes:
Hi Kai,
   OK!
I'd very much appreciate an example of how to add the code above 
to ilogb in std.math. The "version(...)" stuff quickly becomes a 
mess.
Could you take the code above and show me how you would put this 
into std.math?
I'm sorry if this is too much hand-holding, it is really very 
much appreciated.

-Johan
Apr 08 2015
parent reply "Kai Nacke" <kai redstar.de> writes:
On Wednesday, 8 April 2015 at 09:29:56 UTC, Johan Engelen wrote:
 Hi Kai,
   OK!
 I'd very much appreciate an example of how to add the code 
 above to ilogb in std.math. The "version(...)" stuff quickly 
 becomes a mess.
 Could you take the code above and show me how you would put 
 this into std.math?
 I'm sorry if this is too much hand-holding, it is really very 
 much appreciated.

 -Johan
Hi Johan!

The structure of ilogb() is:

    int ilogb(real x)
    {
        version(Win64_DMD_InlineAsm_X87)
            ....
        else version(CRuntime_Microsoft)
            ....
        else
            return core.stdc.math.ilogbl(x);
    }

You can simply add your stuff at the top of the function, resulting in:

    int ilogb(real x)
    {
        version(LDC)
        {
            version(X86_64)
                ....
            else version(X86_64)
                ....
            else
                return core.stdc.math.ilogbl(x);
        }
        else
        version(Win64_DMD_InlineAsm_X87)
            ....
        else version(CRuntime_Microsoft)
            ....
        else
            return core.stdc.math.ilogbl(x);
    }

Look at the 'misplaced' else: by doing the addition in this way, you do not touch the original implementation. This makes merging easier. If you look closer at std.math, you will see a lot of 'wrong' formatting/spacing wherever version(LDC) is involved. The goal here is to leave as much as possible of the original source untouched.

I hope this answers your question. If not, then do not hesitate to ask again.

Regards,
Kai
Apr 08 2015
parent "Johan Engelen" <j j.nl> writes:
On Wednesday, 8 April 2015 at 11:09:48 UTC, Kai Nacke wrote:
 Look at the 'misplaced' else: Doing the addition in this way 
 you do not touch the original implementation. This makes 
 merging more easier. If you look closer at std.math, you see a 
 lot of 'wrong' formatting/spacing if version(LDC) is involved. 
 The goal here is to leave as much as possible of the original 
 source untouched.
Exactly!
Apr 08 2015
prev sibling parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On Wed, Apr 8, 2015 at 8:31 AM, Kai Nacke via digitalmars-d-ldc
<digitalmars-d-ldc puremagic.com> wrote:
 BTW: Many users of LDC want fast math functions. Having a better
 implementation of math functions is therefore desirable.
Speaking about performance: Haven't we switched to 64 bit reals on Win64 for the time being? If so, we'd probably want a version of ilogb that uses SSE instead of the x87 FPU. This is probably (I'm rather sure, but did not measure it) a much bigger performance problem than any overly long general purpose register pushing/popping prologues generated by non-naked asm functions. ilogb() does not seem to be a commonly used function, though, so it might not be worth the effort. — David
Apr 08 2015
parent reply "Kai Nacke" <kai redstar.de> writes:
On Wednesday, 8 April 2015 at 11:29:26 UTC, David Nadlinger wrote:
 Speaking about performance: Haven't we switched to 64 bit reals 
 on
 Win64 for the time being? If so, we'd probably want a version 
 of ilogb
 that uses SSE instead of the x87 FPU. This is probably (I'm 
 rather
 sure, but did not measure it) a much bigger performance problem 
 than
 any overly long general purpose register pushing/popping 
 prologues
 generated by non-naked asm functions.
Yes, reals are 64-bit on Win64. Using SSE can really help here. Speaking of performance: The only way to know that some piece of code is faster is to measure it. This can reveal surprising insights.... (Hint!)
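
A minimal sketch of such a measurement, assuming a recent Phobos where std.datetime.stopwatch.benchmark is available (in 2015 the equivalent lived in std.datetime). The names input, sink, viaCRuntime, viaPhobos and timeBoth are invented for this example; the two functions being compared are the C runtime's ilogb and std.math.ilogb.

```d
import std.datetime.stopwatch : benchmark;
import core.time : Duration;
import cmath = core.stdc.math;
static import std.math;

// Module-level state so the optimizer cannot drop the calls entirely.
__gshared double input = 12345.678;
__gshared int sink;

// Candidate 1: forward to the C runtime.
void viaCRuntime() { sink += cmath.ilogb(input); }

// Candidate 2: whatever std.math.ilogb does on this compiler and target.
void viaPhobos() { sink += std.math.ilogb(input); }

// Runs each candidate `iterations` times and returns the two total durations,
// in the order the candidates are listed.
Duration[2] timeBoth(uint iterations)
{
    return benchmark!(viaCRuntime, viaPhobos)(iterations);
}
```

Calling timeBoth(10_000_000) and printing both durations gives a first impression. The numbers only mean something for an optimized build, and they will differ between compilers, targets and inputs.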
 ilogb() does not seem to be a commonly used function, though, 
 so it
 might not be worth the effort.
There are a lot of other functions, too. E.g. the inline assembler in expi() is commented out because of a problem with the asm parser. I think Johan will find enough worthwhile targets.

Regards,
Kai
Apr 08 2015
parent reply "Johan Engelen" <j j.nl> writes:
Thanks Kai!

Well, first we should get LDC fully functional on Win64, no? :P
That was my main goal, really.

About SSE: I can't vectorize the code for this one function with 
one real as argument! I had done a brief search for what 
instructions are available on xmm regs (argument is passed 
through xmm0), but it is mostly simple arithmetic I think, not 
the kind of stuff that is used in the original druntime asm code.
(Btw, the pro-epilogues also consist of pushing/popping all XMM 
regs, quite a bit of data, but indeed no clue how slow/fast that 
is. Didn't measure a thing, but it just looked kind of wasteful :)

But thinking about David's comment a bit more, if I understand 
ilogb correctly, all it needs to do is output the exponent of the 
real as an int. For that, one doesn't need floating point 
operations at all. Just a bit of bit shifting and masking, and 
subtracting the floating point format's exponent bias value. I 
think we can just express that as normal D code, which can then 
be optimized / vectorized by LLVM ? That code could go upstream, 
with a few static ifs for all floating point formats supported.
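
The idea in the paragraph above can be sketched for the 64-bit double format as follows. This is an illustration only and not the code from any Phobos pull request: the name ilogbSketch is made up, only double is handled, and the return values for zero, NaN and infinity simply mirror the 0x80000000 and 0x7FFFFFFF constants used in the asm earlier in this thread, which may differ from a given platform's FP_ILOGB0 and FP_ILOGBNAN.

```d
import core.bitop : bsr;

// Hypothetical sketch: extracts the unbiased exponent of an IEEE 754 double
// using only integer shifts, masks and a subtraction.
int ilogbSketch(double x) @trusted pure nothrow @nogc
{
    enum int bias = 1023;        // exponent bias of IEEE 754 binary64
    enum int mantissaBits = 52;  // explicitly stored fraction bits

    // Reinterpret the floating-point value as raw bits.
    immutable ulong bits = *cast(const(ulong)*) &x;
    // The 0x7FF mask keeps the 11 exponent bits and drops the sign bit.
    immutable int biasedExp = cast(int)((bits >> mantissaBits) & 0x7FF);
    immutable ulong mantissa = bits & ((1UL << mantissaBits) - 1);

    if (biasedExp == 0x7FF)      // all exponent bits set: infinity or NaN
        return mantissa == 0 ? int.max : int.min;

    if (biasedExp == 0)          // zero or subnormal
    {
        if (mantissa == 0)
            return int.min;      // zero
        // Subnormal: the value is mantissa * 2^-1074, so the exponent is
        // determined by the position of the highest set mantissa bit.
        return bsr(mantissa) - mantissaBits - bias + 1;
    }

    return biasedExp - bias;     // normal number
}
```

For example, ilogbSketch(8.0) yields 3 and ilogbSketch(0.5) yields -1. Other formats, such as the 80-bit x87 real with its explicit leading bit, would need their own constants and masks behind a static if.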

Again, I want to get LDC fully functional foremost, so all this 
is fun but distracting ;)
Apr 08 2015
parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Johan Engelen"  wrote in message 
news:hqpjetsgaeoqkfyqexka forum.dlang.org...

 About SSE: I can't vectorize the code for this one function with one real 
 as argument! I had done a brief search for what instructions are available 
 on xmm regs (argument is passed through xmm0), but it is mostly simple 
 arithmetic I think, not the kind of stuff that is used in the original 
 druntime asm code.
 (Btw, the pro-epilogues also consist of pushing/popping all XMM regs, 
 quite a bit of data, but indeed no clue how slow/fast that is. Didn't 
 measure a thing, but it just looked kind of wasteful :)
I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed.
Apr 08 2015
parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc wrote:
 I don't think it's so much about vectorizing as it is about avoiding 
 the x87 FPU, which you can do when 80-bit precision is not needed.
Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by default for single- and double-precision floating point operations. The x87 FPU is not particularly well-optimized on newer CPUs to begin with, and transferring data from the SSE registers to the FPU on function entry and then back again is quite costly too. For example, this is what made us (all D compilers) look bad on that Perlin noise microbenchmark (the thread from a couple of months ago). — David
Apr 08 2015
parent reply "Johan Engelen" <j j.nl> writes:
On Wednesday, 8 April 2015 at 13:28:16 UTC, David Nadlinger wrote:
 On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc 
 wrote:
 I don't think it's so much about vectorizing as it is about 
 avoiding the x87 FPU, which you can do when 80-bit precision 
 is not needed.
 Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used 
 by default for single- and double-precision floating point 
 operations. The x87 FPU is not particularly well-optimized on 
 newer CPUs to begin with, and transferring data from the SSE 
 registers to the FPU on function entry and then back again is 
 quite costly too. For example, this is what made us (all D 
 compilers) look bad on that Perlin noise microbenchmark (the 
 thread from a couple of months ago).
Ah, ok. Didn't realize. For future reference: http://gruntthepeon.free.fr/ssemath
Apr 08 2015
parent reply "Johan Engelen" <j j.nl> writes:
I wrote a non-asm ilogb, that actually runs quite a bit faster 
than what DMD or LDC do standard, and should also be much more 
portable.
See
https://github.com/D-Programming-Language/phobos/pull/3186
Apr 12 2015
parent "Kai Nacke" <kai redstar.de> writes:
On Sunday, 12 April 2015 at 22:35:20 UTC, Johan Engelen wrote:
 I wrote a non-asm ilogb, that actually runs quite a bit faster 
 than what DMD or LDC do standard, and should also be much more 
 portable.
 See
 https://github.com/D-Programming-Language/phobos/pull/3186
is a template and also CTFE enabled. Regards, Kai
Apr 13 2015