
digitalmars.D - x86 intrinsics for sale cheap

reply Cecil Ward <cecil cecilward.com> writes:
I have been working on simple wrappers around new(ish) x86 
instructions that are not otherwise accessible, along with 
replacement functions in straight D for machines where the 
instruction is not available. Currently they are for GDC only, as LDC 
doesn’t support some of the GCC inline asm features that I am 
relying on, namely named parameters in the asm with the %[name] syntax. 
Hopefully that will get fixed by the LDC maintainers, so that I 
will be able to work with either compiler. My routines need more 
testing and a vast amount of cleanup, so it’s early days.

Is that something that you would be interested in for the D 
runtime library (for GDC / LDC)? Unfortunately I haven’t 
attacked DMD yet, because it uses a different inline asm syntax 
and would mean a rewrite. That isn’t a problem though, because the 
DMD user gets the pure D replacement anyway thanks to conditional 
compilation.

If you are interested, then let me know. I do need help with testing, 
though, and some advice about unit tests.
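
A minimal sketch of the wrapper-plus-fallback shape described above, using pext as the example. It assumes GDC's extended asm with named %[name] operands as mentioned in the post; the names pext, pextFallback and the version identifier UseNativePext are made up for illustration (the builder would have to set UseNativePext, for example with -fversion=UseNativePext, only when the target CPU is known to have BMI2). Without that identifier, or on DMD/LDC, only the pure D path is compiled.

```d
// Hypothetical opt-in switch: nothing here detects BMI2 automatically.
version (GNU) version (X86_64) version (UseNativePext)
{
    version = NativePext;
}

/// Pure D replacement: gathers the bits of `src` selected by `mask`
/// into the low bits of the result.
uint pextFallback(uint src, uint mask) pure nothrow @nogc @safe
{
    uint result = 0;
    uint outBit = 1;
    while (mask != 0)
    {
        immutable uint lowest = mask & (~mask + 1); // lowest set bit of mask
        if (src & lowest)
            result |= outBit;
        outBit <<= 1;
        mask &= mask - 1; // clear that bit of the mask
    }
    return result;
}

/// Wrapper: native instruction where enabled, pure D everywhere else.
uint pext(uint src, uint mask) pure nothrow @nogc @trusted
{
    version (NativePext)
    {
        if (__ctfe)
            return pextFallback(src, mask); // asm cannot run at compile time
        uint result;
        asm pure nothrow @nogc
        {
            // AT&T operand order: mask, source, destination
            "pext %[mask], %[src], %[dst]"
            : [dst] "=r" (result)
            : [src] "r" (src), [mask] "r" (mask);
        }
        return result;
    }
    else
    {
        return pextFallback(src, mask);
    }
}
```

With this layout the caller only ever sees pext(); which body it gets is decided entirely by conditional compilation.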
May 31 2023
parent reply Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 15:56:45 UTC, Cecil Ward wrote:
 [...]
The instructions are those that were new with the Haswell microarchitecture, so that’s about ten years ago now. That means now is the time that these instructions become more usable for programmers worried about older machines, and there are the fallbacks too, as far as I have got with those.
May 31 2023
next sibling parent Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 15:58:53 UTC, Cecil Ward wrote:
 [...]
It’s been a project to help me learn D and to explore the code quality of these compilers. I wrote various assembly languages for a living when I was working some years back, although once C compilers reached sufficient quality of code generation we switched to C for x86 at work, and asm became much less of a thing, as it did for everyone.

I have also written a module that allows cached querying of the results of calls to cpuid, so that users can test for availability once only, getting all the checks done before main, and there is minimal overhead inside the real code, in loops or wherever. The module calls cpuid many times in a loop, with all the leaf and subfunction queries that you might be interested in. That needs more work to be selective, maybe, and I haven’t yet enumerated all of the possibilities, because there are potentially a lot of them, and possibly many that users are not interested in for their use case. So I could perhaps do with a bit of advice there.

Again, if this is something that might be of interest then let me know. Once again, it needs a lot of cleanup to make the code look pretty.
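
A rough sketch of the "query once before main" idea, not the module described above. It assumes druntime's core.cpuid and its hasLzcnt / hasPopcnt properties rather than hand-written cpuid calls; CpuFeatures, cachedFeatures and cpuFeatures are hypothetical names chosen for the illustration.

```d
import core.cpuid : hasLzcnt, hasPopcnt;

/// Snapshot of the feature flags the rest of the program cares about.
struct CpuFeatures
{
    bool lzcnt;
    bool popcnt;
}

// Filled in exactly once by the module constructor below.
private immutable CpuFeatures cachedFeatures;

// Module constructors run before main(), so hot loops only ever
// read an already-initialised flag.
shared static this()
{
    cachedFeatures = CpuFeatures(hasLzcnt, hasPopcnt);
}

/// Cheap accessor for use in real code.
CpuFeatures cpuFeatures() nothrow @nogc @safe
{
    return cachedFeatures;
}
```

A caller would then branch on cpuFeatures().lzcnt (or hoist that test out of the loop) instead of issuing cpuid itself.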
May 31 2023
prev sibling parent reply Johan <j j.nl> writes:
On Wednesday, 31 May 2023 at 15:58:53 UTC, Cecil Ward wrote:
 [...]
Are you aware of intel-intrinsics?
https://code.dlang.org/packages/intel-intrinsics

It sounds like you are duplicating the effort; better to team up with that project.

-Johan
May 31 2023
parent reply Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 16:07:55 UTC, Johan wrote:
 [...]
Yes, I am very aware, and was even thinking of using the same names. My goals are rather different though, and I don’t use the same non-standard __xmm256 type names (or whatever). Those Intel routines don’t have a fallback equivalent for machines where the instruction isn’t available, so there’s some Intel sales promotion in there, since you need a sufficiently new CPU or you get nothing. And I’m concentrating solely on D, not trying to write things in C, put another wrapper round that for D and then hope it all still inlines with zero-overhead parameter passing. Lastly, those Intel intrinsics are, I assume, unless I’m wrong, restricted to the Intel C/C++ compiler, and I’m GDC/LDC only. So quite a gulf there and I’m not solely trying to do the same thing. And it’s D first, and with zero overhead being a requirement.
May 31 2023
next sibling parent Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 So quite a gulf there and I’m not solely trying to do the same 
 thing. And it’s D first, and with zero overhead being a 
 requirement.
On a different topic: I’d like to develop similar things for AArch64, but that’s an instruction set that’s new to me, so a new learning curve. Do any of our members have ARM64 asm experience, and if so, would they recommend tutorials for experienced asm programmers, beyond what I can google for myself and the official ARM docs, of course? Any tips on starting out would be welcome too, as some stuff looks weird, such as the usage of the carry flag and the bizarre x / w register width conventions. (Aren’t we all ex-DEC on this? :-) with b w l q (o? dq?))
May 31 2023
prev sibling parent reply max haughton <maxhaton gmail.com> writes:
On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 16:07:55 UTC, Johan wrote:
 On Wednesday, 31 May 2023 at 15:58:53 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 15:56:45 UTC, Cecil Ward wrote:
 [...]
You and Johan might be talking past each other here. "intel-intrinsics" in this case refers to p0nce's implementation of Intel's intrinsics (names and semantics) in D. There is no dependency on any Intel software. There are some traps that he has worked around that you will bump into at some point, so I recommend looking closely at what he has done. A subset also works on Arm.
May 31 2023
parent reply Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
 On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 [...]
Ah, I was indeed misunderstanding. And no harm done as this was a D learning project until I started to think that I might be of some use to someone. Thanks for giving me that link !
May 31 2023
parent reply Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
 On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 [...]
Ah, just followed that link. No, that’s (solely?) SIMD, something I was aware of, so I’m not duplicating that as I haven’t gone near SIMD. The pext instruction is one that I attacked some time ago, and that would already be fine on ARM as there’s a pure D fallback, but maybe I can find some native ARM equivalent if I study AArch64. So no, this would be something new: non-SIMD insns for general use. The smallest instructions might be something like andn, if I can keep to zero overhead obviously, seeing as the benefit of the instruction is so tiny anyway. Mind you, I could have done with it for graphics bit-twiddling code.

Because I have zero familiarity with the tools, and am very unwell, I would just give the .d files with their inline asm and pure D code to someone experienced who is sufficiently motivated to help out. I wouldn’t be able to do anything on my own.

I would also like some help with some problems with unittest. To test that a native insn conforms to the spec, in respect of its mating up of register passing and the like, I would ideally want to use static asserts. Since I’m testing with x86 boxes on godbolt.org, if the compiler doesn’t mind doing CTFE with asm then all will be well. If I avoid the problem by using if ( __ctfe ) (or whatever), then I would not be testing against the native instruction but against the pure D replacement, thus defeating the whole point. That is a separate test, albeit one that very much needs doing anyway, but there I would compare the native instruction with the D replacement rather than comparing both against hand-calculated values. The problem with hand-calculated values is that you are just testing against your own understanding of the algorithm, testing yourself against your own ideas. That has some value for anti-regression testing later on, but that’s a different thing.
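
The testing split described above might be sketched like this. It is an illustration under assumptions, not the code being discussed: core.bitop.popcnt stands in for an inline-asm wrapper, and popcount / popcountFallback are hypothetical names. Note that __ctfe is tested with a plain if, since it is not usable in static if.

```d
import core.bitop : popcnt;

/// Pure D replacement, usable at compile time.
uint popcountFallback(uint x) pure nothrow @nogc @safe
{
    uint count = 0;
    while (x != 0)
    {
        x &= x - 1; // clear the lowest set bit
        ++count;
    }
    return count;
}

/// Wrapper: the fallback during CTFE, the "native" path at run time.
uint popcount(uint x) pure nothrow @nogc @safe
{
    if (__ctfe)
        return popcountFallback(x);
    return popcnt(x); // stand-in for a real inline-asm wrapper
}

// Compile-time check: this only ever exercises the pure D fallback.
static assert(popcount(0xF0F0) == 8);

// Run-time check (built with -unittest): native against fallback,
// with no hand-calculated expected values involved.
unittest
{
    foreach (v; [0u, 1u, 0x80000000u, 0xFFFFFFFFu, 0x12345678u])
        assert(popcnt(v) == popcountFallback(v));
}
```

So the static assert covers the fallback, the unittest covers the native path against the fallback, and neither can substitute for the other.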
May 31 2023
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
A concern here is that inline assembly is unlikely (if at all) to inline.

So you're going to have to be pretty careful that what you do is 
actually worth the function call, because if it isn't simd, it just 
might not be doing enough work to justify using inline assembly.

If you are able to get a backend to generate the instruction you want 
using regular D code, then you're good to go. As that'll inline.

My general recommendation here is to not worry about specific 
instructions unless you really _really_ need to (very tiny percentage of 
code fits this, almost to the point of not being worth considering).

Instead focus on making your D code communicate to the backend what you 
intend. Even if it doesn't do the job today, in 2 years time it could 
generate significantly better assembly.
May 31 2023
next sibling parent Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 [...]
Understood and agreed. I’m able to get functions to inline with no problems with GDC when there is inline asm code in them. As you say, without that, the overhead of a call can wipe out all of the benefit and it’s pointless. I’ve written test functions that call the instruction, and it all inlines perfectly, with no problem interfacing register usage in a very flexible manner, thanks to GCC/GDC’s superb design. LDC would perhaps be even better, were it not for the inline asm syntax wishlist item mentioned earlier, which means that with the current LDC I would have to rewrite all the asm to not use _named_ parameters within the asm body itself. That is something I’d love to fix myself within LDC, but I don’t remotely have the knowledge of compiler internals or the general expertise.

As for worrying about individual instructions, that isn’t my goal. It’s both a learning exercise and, possibly, a way to make the instructions available to anyone who decides that they want them. They are assumed to have enough experience to make that decision based on performance, but I will give them a zero-overhead solution (unless D prevents me from doing so).
May 31 2023
prev sibling parent Guillaume Piolat <first.last spam.org> writes:
On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 Instead focus on making your D code communicate to the backend 
 what you intend. Even if it doesn't do the job today, in 2 
 years time it could generate significantly better assembly.
For LDC, the smallest performance regression usually comes from some form of LDC's __ir_pure; however, it becomes slower to compile on large projects (up to 50ms, which is the cost of decoding a 1500x1500 JPEG ;) ).
https://github.com/ldc-developers/ldc/issues/4388

As a reminder of what intel-intrinsics does:
- implements the semantics of the Intel intrinsics, up to AVX (AVX2 is WIP)
- works on DMD x86/x86_64 + GDC x86_64 + LDC x86/x86_64/arm64/arm32
- supports a fallback for everything, even the SSE4.1 string instructions and rounding modes

Interestingly, if you use AVX intrinsics even without the AVX instructions enabled, you might sometimes be able to get a speedup thanks to the implicit loop unrolling.
Jun 01 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
 On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
Ah, just followed that link. No that’s (solely?) SIMD, something I was aware of and so I’m not duplicating that as I haven’t gone near SIMD. The pext instruction would be one instruction that I attacked some time ago, and that would already be fine with ARM as there’s a pure D fallback, but maybe I can find some native ARM equivalent if I study AArch64. So no, this would be something new. Non-SIMD insns for general use. The smallest instructions might be something like andn if I can keep to zero-overhead obviously, seeing as the benefit in the instruction is so tiny anyway. But mind you I could have done with it for graphics bit twiddling manipulation code.
If you tell LDC the right CPU target and to use optimization, i.e. "-O -mcpu=haswell", it will use the andn instruction:

    uint foo(uint a, uint b)
    {
        return a & (b ^ 0xFFFFFFFF);
    }

compiles to:

    uint example.foo(uint, uint):
        andn eax, edi, esi
        ret

So you will probably find the compiler is already doing what you want if you let it know it can target the right CPU architecture.

I've been writing asm for over 30 years, and the opportunities for beating modern compilers have gotten vanishingly small for pretty much everything except SIMD code. And tbh the differences between CPUs, i.e. different instruction latencies on different architectures, mean it's pretty much pointless to chase a few percent here or there, since there's a good chance it'll be a few percent the other way on a different CPU.
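
In the same spirit, a few more plain D idioms that an optimizing backend may be able to map onto single BMI1 instructions when the target CPU allows it. This is a sketch: the function names are made up, and whether blsr, blsi or andn is actually emitted depends on the compiler, version and flags, so it is worth checking the disassembly rather than assuming.

```d
/// Clears the lowest set bit; a candidate for BMI1 `blsr`.
uint clearLowestSetBit(uint x) pure nothrow @nogc @safe
{
    return x & (x - 1);
}

/// Isolates the lowest set bit; a candidate for BMI1 `blsi`.
uint isolateLowestSetBit(uint x) pure nothrow @nogc @safe
{
    return x & -x;
}

/// a AND NOT b; a candidate for `andn`.
uint andNot(uint a, uint b) pure nothrow @nogc @safe
{
    return a & ~b;
}
```

Because these are ordinary functions, they inline and run under CTFE like any other D code, and they still produce correct results on CPUs without BMI1.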
May 31 2023
parent reply Cecil Ward <cecil cecilward.com> writes:
On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
 [...]
I couldn’t agree more. I wrote asm full time for about five years at an operating systems outfit. But my aim was just to make these instructions available with zero overhead and then, if I can somehow work out how to do it, make them switch over to fallbacks in pure D _still with zero overhead for the test_, which I think is damn near impossible. When I originally thought about andn, that would be the ultimate challenge, because the benefit to be had is so very small that I would absolutely have to have zero overhead or it’s hopeless. So I wanted to see if I could get it to inline, checking the GDC and LDC compilers’ behaviour, but I haven’t been able to test for inlining of a call into an imported module from outside, from another .d file. I don’t have the tools right now, long story. But I will do something about that when I feel better; I am quite unwell right now.

As for your insight into LDC and andn. Damn, I missed that. Many thanks for your help there. It’s not the first time I’ve seen this kind of excellent performance. I haven’t been using LDC enough because I am stuffed by the lack of support for
May 31 2023
next sibling parent Guillaume Piolat <first.last spam.org> writes:
On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
 I've been writing asm for over 30 years, the opportunities for 
 beating modern compilers have gotten vanishingly small for 
 pretty much everything except for SIMD code. And tbh the 
 differences between CPUs, ie different instruction latency on 
 different architectures, means it's pretty much pointless to 
 chance few percent here or there, since there's a good chance 
 it'll be a few percent the other way on a different CPU.
I couldn’t agree more. I wrote asm full time for about five years at an operating systems outfit.
I'll join the party as an assembly lover :). There are vanishingly few places where it can make a big difference compared with communicating better with the backend. I spent two years of my life working only on codec optimization with the Intel C++ compiler, and in the end we had one bit of x86 assembly left; that one was using EFLAGS for multiple jumps from the same op. You can also sometimes win if your algorithm fits the exact register count, but with "register renaming" I'm not even sure. Often the assembly was better than the codegen, but the spilling code the compiler inserts before and after would make it worse, in addition to the lack of optimization. The big positive with asm is the build time, though!
Jun 01 2023
prev sibling parent claptrap <clap trap.com> writes:
On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:

 As for your insight into LDC and andn. Damn, I missed that. 
 Many thanks for your help there. It’s not the first time I’ve 
 seen this kind of excellent performance. I haven’t been using 
 LDC enough because I am stuffed by the lack of support for
You probably already know about it, but in case you don't, an easy way to see what the various D compilers are doing is to use https://d.godbolt.org/ It recompiles and updates the disassembly as you type.
Jun 01 2023