digitalmars.D.ldc - How to deal with inline asm functions in Phobos/druntime?
- Johan Engelen (50/50) Apr 07 2015 Hi all,
- Kai Nacke (19/22) Apr 07 2015 Hi Johan!
- Johan Engelen (10/10) Apr 08 2015 Hi Kai,
- Kai Nacke (42/52) Apr 08 2015 Hi Johan!
- Johan Engelen (2/8) Apr 08 2015 Exactly!
- David Nadlinger via digitalmars-d-ldc (11/13) Apr 08 2015 Speaking about performance: Haven't we switched to 64 bit reals on
- Kai Nacke (10/24) Apr 08 2015 Yes, reals are 64bit on Win64. Using SSE can really help here.
- Johan Engelen (21/21) Apr 08 2015 Thanks Kai!
- Daniel Murphy (4/12) Apr 08 2015 I don't think it's so much about vectorizing as it is about avoiding the...
- David Nadlinger via digitalmars-d-ldc (9/11) Apr 08 2015 Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by
- Johan Engelen (4/18) Apr 08 2015 Ah, ok. Didn't realize.
- Johan Engelen (5/5) Apr 12 2015 I wrote a non-asm ilogb, that actually runs quite a bit faster
- Kai Nacke (5/10) Apr 13 2015 Nice! For LDC you can replace bsr with intrinsic llvm.ctlz.i#. It
Hi all,

I've hit on a problem that I know how to fix, but I do not know how to do it properly. Thanks for your help.

A number of functions have inline assembly implementations in Phobos, e.g. std.math.ilogb(). I don't know exactly why they have asm implementations for Windows. The default path "core.stdc.math.ilogbl(x);" would be fine on Windows.

The problem is that LDC would take that same assembly code, although it assumes DMD calling conventions. Also, _much_ better code is generated when ldc.llvmasm inline assembly is used for non-naked asm functions: with D-style asm in a non-naked function, the generated code contains huge function pro-/epilogues.

The LDC implementation of an MSVC-compatible ilogb, without assumptions about calling conventions, would be:

    import ldc.llvmasm;
    return __asm!int(
    `fldl $1 ;
     fxam ;
     fstsw %AX ;
     and $$0x45, %AH ;
     cmp $$0x40, %AH ;
     jz Lzeronan ;
     cmp $$5, %AH ;
     jz Linfinity ;
     cmp $$1, %AH ;
     jz Lzeronan ;
     fxtract ;
     fstp %ST(0) ;
     fistpl (%RSP) ;
     mov (%RSP), $0 ;
     jmp Ldone ;
     Lzeronan:
     mov $$0x80000000, $0 ;
     fstp %ST(0) ;
     jmp Ldone ;
     Linfinity:
     mov $$0x7FFFFFFF, $0 ;
     fstp %ST(0) ;
     Ldone: ;`,
    "=r,*m,~{ax},~{memory}", &x);

I think it'd be relatively straightforward to write the code such that it works for 80-bit and 64-bit reals.

My question: how do I fix our fork of Phobos? Do we just want to pass the call to core.stdc.math.ilogbl, and disregard the 'optimized' inline asm? Or do we add "version(LDC) version (Win64)" or similar and add our own asm implementations?

It is fun to write these small asm blobs, but I am not sure how maintainable all this will be. Confused :S

Thanks!
Johan
Apr 07 2015
On Tuesday, 7 April 2015 at 18:15:58 UTC, Johan Engelen wrote:
> It is fun to write these small asm blobs, but I am not sure how maintainable all this will be. Confused :S

Hi Johan!

The functions in Druntime/Phobos exist because the MSVC runtime does not support 80-bit reals, which is a must-have for DMD.

If you are going to rewrite assembly snippets, then please use ldc.llvmasm because of all the advantages you have listed. Plus, ldc.llvmasm does not disable function inlining, which is also a big win.

Regarding std.math, there are already a lot of places with versions for LDC. One reason is that LLVM intrinsics exist. Another reason is that LDC supports PPC, ARM and MIPS, too. A final reason is that the DMD asm does not work.

I prefer having rewritten assembly functions. But please note that there must always be a fallback to the core.stdc function because of the many supported architectures.

BTW: Many users of LDC want fast math functions. Having a better implementation of math functions is therefore desirable.

Regards,
Kai
Apr 07 2015
Hi Kai, OK! I'd very much appreciate an example of how to add the code above to ilogb in std.math. The "version(...)" stuff quickly becomes a mess. Could you take the code above and show me how you would put this into std.math? I'm sorry if this is too much hand-holding, it is really very much appreciated. -Johan
Apr 08 2015
On Wednesday, 8 April 2015 at 09:29:56 UTC, Johan Engelen wrote:
> Hi Kai, OK! I'd very much appreciate an example of how to add the code above to ilogb in std.math. The "version(...)" stuff quickly becomes a mess. Could you take the code above and show me how you would put this into std.math? I'm sorry if this is too much hand-holding, it is really very much appreciated. -Johan

Hi Johan!

The structure of ilogb() is:

    int ilogb(real x)
    {
        version(Win64_DMD_InlineAsm_X87)
            ....
        else version(CRuntime_Microsoft)
            ....
        else
            return core.stdc.math.ilogbl(x);
    }

You can simply add your stuff at the top of the function, resulting in:

    int ilogb(real x)
    {
        version(LDC)
        {
            version(X86_64)
                ....
            else version(X86_64)
                ....
            else
                return core.stdc.math.ilogbl(x);
        }
        else
        version(Win64_DMD_InlineAsm_X87)
            ....
        else version(CRuntime_Microsoft)
            ....
        else
            return core.stdc.math.ilogbl(x);
    }

Look at the 'misplaced' else: by doing the addition this way, you do not touch the original implementation. This makes merging easier. If you look closer at std.math, you will see a lot of 'wrong' formatting/spacing wherever version(LDC) is involved. The goal here is to leave as much of the original source untouched as possible.

I hope this answers your question. If not, then do not hesitate to ask again.

Regards,
Kai
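
The layout described above can be turned into a minimal sketch that compiles on its own. This is only an illustration of the 'misplaced' else pattern, not the real std.math code: the name ilogbSketch is made up, and every branch simply forwards to core.stdc.math.ilogbl where the real function would contain asm or runtime-specific code. Win64_DMD_InlineAsm_X87 is normally defined inside std.math, so in this standalone sketch that branch is never taken.

```d
import core.stdc.math : ilogbl;

// Hypothetical stand-in for std.math.ilogb, showing only the version layout.
int ilogbSketch(real x)
{
    version (LDC)
    {
        version (X86_64)
            return ilogbl(x); // placeholder: an ldc.llvmasm version would go here
        else
            return ilogbl(x); // portable fallback for all other architectures
    }
    else // 'misplaced' else: everything below is the untouched upstream code
    version (Win64_DMD_InlineAsm_X87)
        return ilogbl(x); // placeholder for upstream's DMD x87 inline asm
    else version (CRuntime_Microsoft)
        return ilogbl(x); // placeholder for upstream's MSVC-specific branch
    else
        return ilogbl(x);
}
```

Because the LDC block is closed before the bare else, a merge from upstream only ever sees unchanged lines below it, which is the point of the odd formatting.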
Apr 08 2015
On Wednesday, 8 April 2015 at 11:09:48 UTC, Kai Nacke wrote:
> Look at the 'misplaced' else: by doing the addition this way, you do not touch the original implementation. This makes merging easier. If you look closer at std.math, you will see a lot of 'wrong' formatting/spacing wherever version(LDC) is involved. The goal here is to leave as much of the original source untouched as possible.

Exactly!
Apr 08 2015
On Wed, Apr 8, 2015 at 8:31 AM, Kai Nacke via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> wrote:
> BTW: Many users of LDC want fast math functions. Having a better implementation of math functions is therefore desirable.

Speaking about performance: Haven't we switched to 64 bit reals on Win64 for the time being? If so, we'd probably want a version of ilogb that uses SSE instead of the x87 FPU. This is probably (I'm rather sure, but did not measure it) a much bigger performance problem than any overly long general purpose register pushing/popping prologues generated by non-naked asm functions.

ilogb() does not seem to be a commonly used function, though, so it might not be worth the effort.

- David
Apr 08 2015
On Wednesday, 8 April 2015 at 11:29:26 UTC, David Nadlinger wrote:
> Speaking about performance: Haven't we switched to 64 bit reals on Win64 for the time being? If so, we'd probably want a version of ilogb that uses SSE instead of the x87 FPU. This is probably (I'm rather sure, but did not measure it) a much bigger performance problem than any overly long general purpose register pushing/popping prologues generated by non-naked asm functions.

Yes, reals are 64-bit on Win64. Using SSE can really help here. Speaking of performance: the only way to know that some piece of code is faster is to measure it. This can reveal surprising insights.... (Hint!)

> ilogb() does not seem to be a commonly used function, though, so it might not be worth the effort.

There are a lot of other functions, too. E.g. the inline assembler in expi() is commented out because of a problem with the asm parser. I think Johan will find enough worthwhile targets.

Regards,
Kai
Apr 08 2015
Thanks Kai!

Well, first we should get LDC fully functional on Win64, no? :P That was my main goal, really.

About SSE: I can't vectorize the code for this one function with one real as argument! I had done a brief search for what instructions are available on xmm regs (argument is passed through xmm0), but it is mostly simple arithmetic I think, not the kind of stuff that is used in the original druntime asm code. (Btw, the pro-epilogues also consist of pushing/popping all XMM regs, quite a bit of data, but indeed no clue how slow/fast that is. Didn't measure a thing, but it just looked kind of wasteful :)

But thinking about David's comment a bit more, if I understand ilogb correctly, all it needs to do is output the exponent of the real as an int. For that, one doesn't need floating point operations at all. Just a bit of bit shifting and masking, and subtracting the floating point format's exponent bias value. I think we can just express that as normal D code, which can then be optimized / vectorized by LLVM? That code could go upstream, with a few static ifs for all floating point formats supported.

Again, I want to get LDC fully functional foremost, so all this is fun but distracting ;)
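
The shift-and-mask idea can be sketched for the 64-bit double format roughly as follows. This is not the code from any Phobos pull request: ilogbBits is a hypothetical name, only IEEE 754 binary64 is handled, and the special-case results (int.min for zero and NaN, int.max for infinity) are assumed from the constants used in the asm earlier in the thread.

```d
import core.bitop : bsr;

/// Hypothetical helper: unbiased binary exponent of an IEEE 754 double.
int ilogbBits(double x)
{
    // Reinterpret the 64 bits of the double as an unsigned integer.
    immutable ulong bits = *cast(const(ulong)*) &x;
    immutable int biased = cast(int)((bits >> 52) & 0x7FF);    // 11 exponent bits
    immutable ulong mantissa = bits & 0x000F_FFFF_FFFF_FFFFUL; // 52 fraction bits

    if (biased == 0x7FF)                 // all ones: infinity or NaN
        return mantissa == 0 ? int.max : int.min;
    if (biased == 0)
    {
        if (mantissa == 0)               // +0.0 or -0.0
            return int.min;
        // Subnormal: the value is mantissa * 2^-1074, so locate its top bit.
        return bsr(mantissa) - 1074;
    }
    return biased - 1023;                // normal number: subtract the bias
}
```

No floating point instruction is executed here, so the argument never has to move between the SSE registers and the x87 FPU. Supporting 80-bit reals would need a separate branch, since that format has a 15-bit exponent and an explicit leading mantissa bit.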
Apr 08 2015
"Johan Engelen" wrote in message news:hqpjetsgaeoqkfyqexka forum.dlang.org...
> About SSE: I can't vectorize the code for this one function with one real as argument! I had done a brief search for what instructions are available on xmm regs (argument is passed through xmm0), but it is mostly simple arithmetic I think, not the kind of stuff that is used in the original druntime asm code. (Btw, the pro-epilogues also consist of pushing/popping all XMM regs, quite a bit of data, but indeed no clue how slow/fast that is. Didn't measure a thing, but it just looked kind of wasteful :)

I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed.
Apr 08 2015
On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc wrote:
> I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed.

Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by default for single- and double-precision floating point operations. The x87 FPU is not particularly well-optimized on newer CPUs to begin with, and transferring data from the SSE registers to the FPU on function entry and then back again is quite costly too.

For example, this is what made us (all D compilers) look bad on that Perlin noise microbenchmark (the thread from a couple of months ago).

- David
Apr 08 2015
On Wednesday, 8 April 2015 at 13:28:16 UTC, David Nadlinger wrote:
> On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc wrote:
>> I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed.
> Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by default for single- and double-precision floating point operations. The x87 FPU is not particularly well-optimized on newer CPUs to begin with, and transferring data from the SSE registers to the FPU on function entry and then back again is quite costly too. For example, this is what made us (all D compilers) look bad on that Perlin noise microbenchmark (the thread from a couple of months ago).

Ah, ok. Didn't realize.
For future reference: http://gruntthepeon.free.fr/ssemath
Apr 08 2015
I wrote a non-asm ilogb, that actually runs quite a bit faster than what DMD or LDC do standard, and should also be much more portable. See https://github.com/D-Programming-Language/phobos/pull/3186
Apr 12 2015
On Sunday, 12 April 2015 at 22:35:20 UTC, Johan Engelen wrote:
> I wrote a non-asm ilogb, that actually runs quite a bit faster than what DMD or LDC do standard, and should also be much more portable. See https://github.com/D-Programming-Language/phobos/pull/3186

Nice! For LDC you can replace bsr with intrinsic llvm.ctlz.i#. It is a template and also CTFE enabled.

Regards,
Kai
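
As a rough sketch of that suggestion, the bsr call could be wrapped so that LDC uses the count-leading-zeros intrinsic while other compilers keep core.bitop.bsr. The helper name highestSetBit is hypothetical, and the sketch assumes that ldc.intrinsics exposes the intrinsic as the template llvm_ctlz(value, isZeroUndefined); check the ldc.intrinsics module of your LDC version before relying on it.

```d
import core.bitop : bsr;
version (LDC) import ldc.intrinsics : llvm_ctlz; // assumed name and signature

/// Hypothetical helper: index of the highest set bit; v must be non-zero.
int highestSetBit(ulong v)
{
    version (LDC)
        // ctlz counts leading zero bits, so the top bit index is 63 - ctlz(v).
        // Passing true marks a zero input as undefined, matching bsr.
        return 63 - cast(int) llvm_ctlz(v, true);
    else
        return bsr(v);
}
```

Both branches give the same result for any non-zero input, so a portable ilogb could call such a helper in its subnormal path without further version blocks.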
Apr 13 2015