digitalmars.D - x86 intrinsics for sale cheap

Cecil Ward <cecil cecilward.com> writes:

I have been working on simple wrappers around new(ish) x86
instructions that are not otherwise accessible, along with
replacement functions in straight D for machines where the
instruction is not available. They are currently for GDC only, as
LDC doesn’t support some of the features of GCC inline asm that I
am relying on, namely named parameters in the asm with the
%[name] syntax. Hopefully that will get fixed by the LDC
maintainers, so that I will be able to work with either compiler.
My routines need more testing and a vast amount of cleanup, so
it’s early days.
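
To make the idea concrete, here is a minimal sketch of what one such wrapper could look like, using andn as the example. The function name `andNot` and the version identifier `HaveBMI1` are hypothetical and not taken from the actual routines; the asm branch assumes GDC on x86-64 with its default AT&T operand order, and is only compiled when `HaveBMI1` is set by hand on a machine that has BMI1. Without it, the pure D replacement is used.

```d
/// Computes ~a & b.
/// With the hypothetical version identifier HaveBMI1 (assumed to be set
/// only when building with GDC for an x86-64 CPU that has BMI1), this
/// uses the andn instruction through GCC-style extended asm with named
/// operands. Otherwise it falls back to straight D.
uint andNot(uint a, uint b) pure nothrow @nogc
{
    version (HaveBMI1)
    {
        uint r;
        // AT&T order: andn src2, src1, dst  =>  dst = ~src1 & src2
        asm pure nothrow @nogc
        {
            "andn %[src2], %[src1], %[dst]"
            : [dst] "=r" (r)
            : [src1] "r" (a), [src2] "rm" (b);
        }
        return r;
    }
    else
    {
        // Pure D replacement, selected by conditional compilation.
        return ~a & b;
    }
}
```

Because the choice is made at compile time, neither branch carries a runtime test; whether the asm version inlines is up to the compiler.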

Is that something that you would be interested in for the D
runtime library (for GDC / LDC)? I unfortunately haven’t
attacked DMD yet because that uses a different inline asm syntax,
and it would mean a rewrite. But that isn’t a problem, because the
DMD user gets the pure D replacement anyway thanks to conditional
compilation.

If you are interested, then let me know. I do need help testing 
though and some advice about unit tests.

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 15:56:45 UTC, Cecil Ward wrote:
 [...]

The instructions are those that were new with the Haswell
microarchitecture, so that’s what, ten years ago now? So now is
the time that these instructions become more usable for
programmers worried about older machines, and there are the
fallbacks too, as far as I have got with that.

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 15:58:53 UTC, Cecil Ward wrote:
 [...]

It’s been a project to help me learn D and explore the code
quality of these compilers. I wrote in various assembly languages
for a living when I was working some years back, although once C
compilers reached a sufficient quality of code generation we
switched to C for x86 at work and asm became much less of a
thing, as for everyone.

I have also written a module that allows cached querying of the
results of calls to cpuid, so that users can test for availability
once only, getting all the checks done before main so that there’s
minimal overhead inside the real code, in loops or wherever. The
module calls cpuid many times in a loop with all the leaf and
subfunction queries that you might be interested in. That needs
more work to be selective, maybe, and I haven’t yet enumerated
all of the possibilities, because there are potentially a lot of
them, and possibly many that users are not interested in for their
use case. So I could perhaps do with a bit of advice there. Again,
if this is something that might be of interest then let me know.
It needs a lot of cleanup, once again, to make the code look
pretty.
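
As a rough illustration of the "query once before main" idea, the sketch below caches two feature flags in a module constructor. It leans on druntime's existing `core.cpuid` properties rather than issuing cpuid itself, and the flag names `cpuHasPopcnt` and `cpuHasLzcnt` are made up for this example; the module described above is a separate design that may look quite different.

```d
import core.cpuid : hasLzcnt, hasPopcnt;

// Hypothetical cached feature flags. They are written exactly once,
// in the module constructor below, which runs before main().
immutable bool cpuHasPopcnt;
immutable bool cpuHasLzcnt;

shared static this()
{
    // core.cpuid performs the actual cpuid queries; this module only
    // copies the answers so that hot code can read a plain boolean.
    cpuHasPopcnt = hasPopcnt;
    cpuHasLzcnt = hasLzcnt;
}
```

Code in a loop can then test `cpuHasLzcnt` directly, though the branch on the flag is still a (small) runtime cost.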

May 31 2023

Johan <j j.nl> writes:

On Wednesday, 31 May 2023 at 15:58:53 UTC, Cecil Ward wrote:
 [...]

Are you aware of intel-intrinsics? 
https://code.dlang.org/packages/intel-intrinsics
It sounds like you are duplicating the effort; better to team up 
with that project.

-Johan

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 16:07:55 UTC, Johan wrote:
 [...]

 Are you aware of intel-intrinsics? 
 https://code.dlang.org/packages/intel-intrinsics
 It sounds like you are duplicating the effort; better to team 
 up with that project.

 -Johan

Yes, I am very aware, and was even thinking of using the same
names. My goals are rather different though, and I don’t use the
same non-standard __m256 type names (or whatever). Those Intel
routines don’t have a fallback equivalent for machines where the
instruction isn’t available, so there’s some Intel sales
promotion in there, since you need to have a sufficiently new CPU
or nothing.

And I’m concentrating solely on D, not trying to write things in
C, put another wrapper round that for D and then hope it all
still inlines with zero-overhead parameter passing.

Lastly, those Intel intrinsics are, I assume, unless I’m wrong,
restricted to the Intel C/C++ compiler. And I’m GDC/LDC only.

So there is quite a gulf there, and I’m not simply trying to do
the same thing. It’s D first, with zero overhead being a
requirement.

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 So quite a gulf there and I’m not solely trying to do the same 
 thing. And it’s D first, and with zero overhead being a 
 requirement.

On a different topic: I’d like to develop similar things for
AArch64, but that’s an instruction set that’s new to me, so a
new learning curve. Do any of our members have ARM64 asm
experience, and if so would they recommend tutorials for
experienced asm programmers, beyond what I can google for myself,
obviously, and the official ARM docs of course? Any tips on
starting out would be welcome too, as some stuff looks weird, such
as the usage of the carry flag and the bizarre x / w register
width conventions.
(Aren’t we all ex-DEC on this ? :-) with b w l q (o? dq? ) )

May 31 2023

max haughton <maxhaton gmail.com> writes:

On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 [...]

You and Johan might be talking past each other here.
"intel-intrinsics" in this case refers to p0nce's implementation
of Intel's intrinsics (names and semantics) in D. There is no
dependency on any Intel software. There are some traps that he
has worked around that you will bump into at some point, so I
recommend looking closely at what he has done. A subset also
works on Arm.

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
 On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
 [...]


Ah, I was indeed misunderstanding. And no harm done as this was a 
D learning project until I started to think that I might be of 
some use to someone. Thanks for giving me that link !

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
 [...]

Ah, just followed that link. No that’s (solely?) SIMD, something 
I was aware of and so I’m not duplicating that as I haven’t gone 
near SIMD. The pext instruction would be one instruction that I 
attacked some time ago, and that would already be fine with ARM 
as there’s a pure D fallback, but maybe I can find some native 
ARM equivalent if I study AArch64.
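
For readers who have not met pext: it gathers the bits of the source selected by a mask and packs them into the low end of the result. A pure D fallback can be sketched as below; the name `pextFallback` is hypothetical and this is one straightforward way to write it, not necessarily the routine referred to above.

```d
/// Software stand-in for the BMI2 pext instruction: collects the bits of
/// `src` at the positions where `mask` has a 1 and packs them, in order,
/// into the low bits of the result.
ulong pextFallback(ulong src, ulong mask) pure nothrow @nogc @safe
{
    ulong result = 0;
    for (ulong outBit = 1; mask != 0; outBit <<= 1)
    {
        immutable ulong lowest = mask & -mask; // isolate lowest set mask bit
        if (src & lowest)
            result |= outBit;
        mask &= mask - 1;                      // clear that mask bit
    }
    return result;
}

// Being plain D, the fallback can also be checked at compile time.
static assert(pextFallback(0b1011_0110, 0b1111_0000) == 0b1011);
```

The loop runs once per set bit in the mask, so it works on any target but is far slower than the single native instruction.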

So no, this would be something new: non-SIMD insns for general
use. The smallest instructions might be something like andn, if I
can keep to zero overhead obviously, seeing as the benefit in the
instruction is so tiny anyway. But mind you, I could have done
with it for graphics bit-twiddling code.

Because I have zero familiarity with the tools, and am very 
unwell, I would just give the .d files with their inline asm and 
pure D code to someone experienced who is sufficiently motivated 
to help out. I wouldn’t be able to do anything on my own.

I would also like some help with some problems with unittest. To
test that a native insn conforms to the spec, in respect of its
mating up of register passing and the like, I would ideally want
to use static asserts. Since I’m testing with x86 boxes on
godbolt.org, if the compiler doesn’t mind doing CTFE with asm
then all will be well. If I avoid the problem by using static if (
__ctfe ) (or whatever) then I would not be doing a test against
the native instruction but against the pure-D replacement, thus
defeating the whole point. That’s a separate test, albeit one
that very much needs doing anyway, but there I would compare the
native instruction with the D replacement rather than comparing
both against hand-calculated values. The problem with
hand-calculated values is that you are just testing against your
own understanding of the algorithm, testing your own self against
your own ideas. That does have some value for anti-regression
testing later on, but that’s a different thing.
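
A sketch of how the two kinds of test could be separated, using a trailing-zero count as the example. Two points it relies on: CTFE cannot interpret asm statements, and `__ctfe` is an ordinary runtime-style variable, so it is tested with a plain `if` rather than `static if`. All function names here are hypothetical, and `core.bitop.bsf` is used purely as a stand-in for a native asm wrapper so that the example builds with any compiler.

```d
import core.bitop : bsf;

/// Pure D replacement: number of trailing zero bits, 32 for zero.
uint tzcntFallback(uint x)
{
    if (x == 0)
        return 32;
    uint n = 0;
    while ((x & 1) == 0)
    {
        x >>= 1;
        ++n;
    }
    return n;
}

/// Stand-in for the native wrapper (assumption: the real one would use asm).
uint tzcntNative(uint x)
{
    return x == 0 ? 32 : cast(uint) bsf(x);
}

/// Public entry point: the fallback during CTFE, the native path at run time.
uint tzcnt(uint x)
{
    if (__ctfe)
        return tzcntFallback(x);
    else
        return tzcntNative(x);
}

// Compile-time checks: these only ever exercise the pure D replacement.
static assert(tzcnt(8) == 3);
static assert(tzcnt(0) == 32);

// Run-time check: this is where native and replacement are compared.
unittest
{
    foreach (x; [0u, 1u, 8u, 12_345u, 0x8000_0000u])
        assert(tzcntNative(x) == tzcntFallback(x));
}
```

So static asserts can pin down the replacement, while the native instruction itself can only be checked by a unittest that actually runs on the target.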

May 31 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

A concern here is that inline assembly is unlikely (if at all) to inline.

So you're going to have to be pretty careful that what you do is
actually worth the function call, because if it isn't SIMD, it just
might not be doing enough work to justify using inline assembly.

If you are able to get a backend to generate the instruction you want
using regular D code, then you're good to go, as that'll inline.

My general recommendation here is to not worry about specific
instructions unless you really _really_ need to (a very tiny percentage
of code fits this, almost to the point of not being worth considering).

Instead focus on making your D code communicate to the backend what you
intend. Even if it doesn't do the job today, in two years' time it could
generate significantly better assembly.
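
As an illustration of "communicating intent", the well-known bit tricks below are written in plain D. An optimizing backend that has been told the target supports BMI1 may recognise them and emit the instruction named in each comment, but that is up to the compiler and is not guaranteed; the function names are made up for this sketch.

```d
/// Isolates the lowest set bit (candidate for BLSI).
uint lowestSetBit(uint x) pure nothrow @nogc @safe
{
    return x & -x;
}

/// Clears the lowest set bit (candidate for BLSR).
uint clearLowestSetBit(uint x) pure nothrow @nogc @safe
{
    return x & (x - 1);
}

/// Mask of all bits up to and including the lowest set bit
/// (candidate for BLSMSK).
uint maskUpToLowestSetBit(uint x) pure nothrow @nogc @safe
{
    return x ^ (x - 1);
}
```

Since these are ordinary functions, they inline like any other D code and give the same results on targets without the instructions.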

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 [...]

Understood and agreed. I’m able to get functions to inline with
no problems with GDC when there is inline-asm code in them. As
you say, without that, the overhead of a call can wipe out all of
the benefit and it’s pointless. I’ve written test functions that
call the instruction and it all inlines perfectly, with no problem
interfacing register usage in a very flexible manner, thanks to
GCC/GDC’s superb design. And LDC would perhaps be even better,
were it not for the inline-asm syntax wishlist item mentioned
earlier, which means that with the current LDC I would have to
rewrite all the asm so as not to use _named_ parameters within
the asm body itself. That is something I’d love to fix myself
within LDC, but I don’t remotely have the knowledge of compiler
internals nor the general expertise.

As for worrying about individual instructions, that isn’t my
goal. It’s both a learning exercise and possibly a way to make
the instructions available to anyone who decides that they want
them. They are assumed to have enough experience to make that
decision based on performance, but I will give them a
zero-overhead solution (unless D prevents me from doing so).

May 31 2023

Guillaume Piolat <first.last spam.org> writes:

On Wednesday, 31 May 2023 at 17:44:21 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 Instead focus on making your D code communicate to the backend 
 what you intend. Even if it doesn't do the job today, in 2 
 years time it could generate significantly better assembly.

For LDC the fewest performance regressions usually come from any
form of LDC's __ir_pure; however, it becomes slower to compile on
large projects (up to 50ms, which is the cost of decoding a
1500x1500 JPEG ;) ).
https://github.com/ldc-developers/ldc/issues/4388

As a reminder of what intel-intrinsics does:
   - implements the semantics of the Intel intrinsics, up to AVX
     (AVX2 is WIP)
   - works on DMD x86/x86_64 + GDC x86_64 + LDC x86/x86_64/arm64/arm32
   - supports a fallback for everything, even the SSE4.2 string
     instructions and rounding modes

Interestingly, if you use AVX intrinsics even without the AVX
instructions enabled, you might sometimes be able to get a speedup
thanks to the implicit loop unrolling.

Jun 01 2023

claptrap <clap trap.com> writes:

On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
 [...]

 So no, this would be something new. Non-SIMD insns for general 
 use. The smallest instructions might be something like andn if 
 I can keep to zero-overhead obviously, seeing as the benefit in 
 the instruction is so tiny anyway. But mind you I could have 
 done with it for graphics bit twiddling manipulation code.

If you tell LDC the right CPU target, and to use optimization,
i.e.

"-O -mcpu=haswell"

It will use the andn instruction...

uint foo(uint a, uint b)
{
     return a & (b ^ 0xFFFFFFFF);
}

compiles to ---->

uint example.foo(uint, uint):
         andn    eax, edi, esi
         ret

So you will probably find the compiler is already doing what you
want if you let it know it can target the right CPU architecture.

I've been writing asm for over 30 years, and the opportunities for
beating modern compilers have gotten vanishingly small for pretty
much everything except SIMD code. And tbh the differences
between CPUs, i.e. different instruction latencies on different
architectures, mean it's pretty much pointless to chase a few
percent here or there, since there's a good chance it'll be a few
percent the other way on a different CPU.

May 31 2023

Cecil Ward <cecil cecilward.com> writes:

On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
 [...]

I couldn’t agree more. I wrote asm full time for about five years
at an operating systems outfit. But my aim was just to make these
instructions available with zero overhead and then, if I can
somehow work out how to do it, make them switch over to fallbacks
in pure D _still with zero overhead for the test_, which I think
is damn near impossible. And when I originally thought about
andn, that would be the ultimate challenge, because the benefit to
be had is so very small that I would absolutely have to have
zero overhead or it’s hopeless. So I wanted to see if I could get
it to inline, checking the GDC and LDC compilers’ behaviour, but
I haven’t been able to test for inlining of a call into an
imported module from outside, from another .d file. I don’t have
the tools right now, long story. But I will do something about
that when I feel better; I am quite unwell right now.
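
To show why a runtime switch to a fallback is hard to make free, here is a sketch of the usual alternative to testing a flag on every call: pick an implementation once at startup and call through a function pointer. The names are hypothetical, `core.bitop.popcnt` merely stands in for a native wrapper, and `core.cpuid.hasPopcnt` supplies the feature test.

```d
import core.bitop : popcnt;
import core.cpuid : hasPopcnt;

/// Pure D replacement: clears one set bit per iteration.
int popcountFallback(uint x)
{
    int n = 0;
    for (; x != 0; x &= x - 1)
        ++n;
    return n;
}

/// Stand-in for a native wrapper (assumption: the real one would use asm).
int popcountNativeStandIn(uint x)
{
    return popcnt(x);
}

/// Chosen once before main(); every later call goes through this pointer.
__gshared int function(uint) countBits;

shared static this()
{
    countBits = hasPopcnt ? &popcountNativeStandIn : &popcountFallback;
}
```

The per-call test is gone, but the indirect call cannot be inlined, so for something as small as andn the cost would likely exceed the benefit. That leaves compile-time selection as the only truly zero-overhead route.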

As for your insight into LDC and andn. Damn, I missed that. Many 
thanks for your help there. It’s not the first time I’ve seen 
this kind of excellent performance. I haven’t been using LDC 
enough because I am stuffed by the lack of support for

May 31 2023

Guillaume Piolat <first.last spam.org> writes:

On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
 I've been writing asm for over 30 years, and the opportunities
 for beating modern compilers have gotten vanishingly small for
 pretty much everything except SIMD code. And tbh the
 differences between CPUs, i.e. different instruction latencies
 on different architectures, mean it's pretty much pointless to
 chase a few percent here or there, since there's a good chance
 it'll be a few percent the other way on a different CPU.

 I couldn’t agree more. I wrote asm full time for about five 
 years at an operating systems outfit.

I'll join the party as an assembly lover :). There are vanishingly
few places where it can make a big difference vs better
communicating with the backend. I spent two years of my life
working only on codec optimization with the Intel C++ compiler,
and in the end we had one bit of x86 assembly left; that one was
using the EFLAGS for multiple jumps from the same op. You can
also sometimes win if your algorithm fits the exact register
count, but with "register renaming" I'm not even sure. Often the
assembly was better than the codegen, but the spilling code the
compiler inserts before and after would make it worse, in addition
to the lack of optimization. A big positive with asm is the build
time though!

Jun 01 2023

claptrap <clap trap.com> writes:

On Thursday, 1 June 2023 at 05:26:56 UTC, Cecil Ward wrote:
 On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:

 As for your insight into LDC and andn. Damn, I missed that. 
 Many thanks for your help there. It’s not the first time I’ve 
 seen this kind of excellent performance. I haven’t been using 
 LDC enough because I am stuffed by the lack of support for

You probably already know about it, but in case you don't, an easy
way to see what the various D compilers are doing is by using

https://d.godbolt.org/

It recompiles and updates the disassembly as you type.

Jun 01 2023
