digitalmars.D.learn - Optimizing for SIMD: best practices?(i.e. what features are allowed?)

z (12/38) Feb 25 2021 How does one optimize code to make full use of the CPU's SIMD

Guillaume Piolat (2/6) Feb 25 2021 https://code.dlang.org/packages/intel-intrinsics

Guillaume Piolat (11/17) Feb 26 2021 A bit of elaboration on why you might want to prefer
z (4/10) Mar 07 2021 I'd try to use it but the platform i'm building on requires AVX

Bruce Carneal (30/41) Mar 07 2021 The code below might be worth a try on your AVX512 machine.

tsbockman (46/51) Feb 25 2021 You need to use __vector(float[4]) instead of float[3] to tell
tsbockman (29/38) Feb 25 2021 The default value for float is float.nan. You need to explicitly

z (6/11) Mar 07 2021 I do not know why but the caller's performance nosedives whenever

tsbockman (21/34) Mar 07 2021 My way is definitely (slightly) better; something is going wrong

Bruce Carneal (44/82) Feb 25 2021 If you are developing for deployment to a platform that has a
z (26/215) Mar 07 2021 It seems that using static foreach with pointer parameters

tsbockman (49/56) Mar 07 2021 You're not compiling with AVX512 enabled. You would need to use

tsbockman (22/35) Mar 07 2021 Apparently the optimizer is too stupid to skip the redundant
tsbockman (41/46) Mar 07 2021 Oops, that's supposed to say `i; 1 .. 3`. Fixed:

z <z z.com> writes:

How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
   float distance;
   a[] -= b[];
   a[] *= a[];
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
   }
   return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real 7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

I've tried to experiment with dynamic arrays of float[3] but the 
output assembly seemed to be worse.[1](in short, it's calling 
internal D functions which use "vxxxss" instructions while 
performing many moves.)

Big thanks
[1] https://run.dlang.io/is/F3Xye3

Feb 25 2021

Guillaume Piolat <first.name spam.org> writes:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)

https://code.dlang.org/packages/intel-intrinsics

Feb 25 2021

Guillaume Piolat <first.name spam.org> writes:

On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)

 https://code.dlang.org/packages/intel-intrinsics

A bit of elaboration on why you might want to prefer 
intel-intrinsics:
- it supports all D compilers, including DMD 32-bit target
- targets arm32 and arm64 with same code (LDC only)
- core.simd just give you the basic operators, but not say, 
pmaddwd or any of the complex instructions. Some instructions 
need very specific work to get them.
- at least with LLVM, optimizers works reliably over subsequent 
versions of the compiler.

Feb 26 2021

z <z z.com> writes:

On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)

 https://code.dlang.org/packages/intel-intrinsics

I'd try to use it but the platform i'm building on requires AVX 
to get the most performance.

Mar 07 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
 On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
 wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)

 https://code.dlang.org/packages/intel-intrinsics

 I'd try to use it but the platform i'm building on requires AVX 
 to get the most performance.

The code below might be worth a try on your AVX512 machine.

Unless you're looking for a combined result, you might need to 
separate out the memory access overhead by running multiple 
passes over a "known optimal for L2" data set.

Also note that I compiled with -preview=in.  I don't know if that 
matters.



import std.math : sqrt;
enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];
pragma(inline, true)
     void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A 
b1, in A b2, in A b3)
{
     alias V = __vector(A);
     static V vsqrt(V v)
     {
         A a = cast(A) v;
         static foreach (i; 0 .. A.length)
             a[i] = sqrt(a[i]);
         return cast(V)a;
     }

     static V sd(in A a, in A b)
     {
         V v = cast(V) b - cast(V) a;
         return v * v;
     }

     auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
     a0[] = vsqrt(v)[];
}

Mar 07 2021

tsbockman <thomas.bockman gmail.com> writes:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
 To give some context, this is a sample of one of the functions 
 that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {


You need to use __vector(float[4]) instead of float[3] to tell 
the compiler to pack multiple elements per SIMD register. Right 
now your data lacks proper alignment for SIMD load/stores.

Beyond that, SIMD code is rather difficult to optimize. Code 
written in ignorance or in a rush is unlikely to be meaningfully 
faster than ordinary scalar code, unless the data flow is very 
simple. You will probably get a bigger speedup for less effort 
and pain by first minimizing heap allocations, maximizing 
locality of reference, minimizing indirections, and minimizing 
memory use. (And, of course, it should go without saying that 
choosing an asymptotically efficient high-level algorithm is more 
important than any micro-optimization for large data sets.) 
Nevertheless, if you are up to the challenge, SIMD can sometimes 
provide a final 2-3x speed boost.

Your algorithms will need to be designed to minimize mixing of 
data between SIMD channels, as this forces the generation of lots 
of extra instructions to swizzle the data, or worse to unpack and 
repack it. Something like a Cartesian dot product or cross 
product will benefit much less from SIMD than vector addition, 
for example. Sometimes the amount of swizzling can be greatly 
reduced with a little algebra, other times you might need to 
refactor an array of structures into a structure of arrays.

Per-element conditional branches are very bad, and often 
completely defeat the benefits of SIMD. For very short segments 
of code (like conditional assignment), replace them with a SIMD 
conditional move (vcmp and vblend). Bit-twiddling is your friend.

Finally, do not trust the compiler or the optimizer. People love 
to make the claim that "The Compiler" is always better than 
humans at micro-optimizations, but this is not at all the case 
for SIMD code with current systems. I have found even LLVM to 
produce quite bad SIMD code for complex algorithms, unless I 
carefully structure my code to make it as easy as possible for 
the optimizer to get to the final assembly I want. A sprinkling 
of manual assembly code (directly, or via a library) is also 
necessary to fill in certain instructions that the compiler 
doesn't know when to use at all.

Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler 
(supports D):
     https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
     https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction 
tables):
     https://agner.org/optimize/

Feb 25 2021

tsbockman <thomas.bockman gmail.com> writes:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {


Use __vector(float[4]), not float[3].

   float distance;


The default value for float is float.nan. You need to explicitly 
initialize it to 0.0f or something if you want this function to 
actually do anything useful.

   a[] -= b[];
   a[] *= a[];


With __vector types, this can be simplified (not optimized) to 
just:
     a -= b;
     a *= a;

   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller


(a * a) above is always positive for real numbers. You don't need 
the call to abs unless you're trying to guarantee that even nan 
values will have a clear sign bit.

Also, there is no point to adding the first component to zero, 
and copying element [0] from a SIMD register into a scalar is 
free, so this can become:

     float distance = a[0];
     static foreach(size_t i; 1 .. 3)
         distance += a[i];

   }
   return sqrt(distance);
}


Final assembly output (ldc 1.24.0 with -release -O3 
-preview=intpromote -preview=dip1000 -m64 -mcpu=haswell 
-fp-contract=fast -enable-cross-module-inlining):

     vsubps  xmm0, xmm1, xmm0
     vmulps  xmm0, xmm0, xmm0
     vmovshdup       xmm1, xmm0
     vaddss  xmm1, xmm0, xmm1
     vpermilpd       xmm0, xmm0, 1
     vaddss  xmm0, xmm0, xmm1
     vsqrtss xmm0, xmm0, xmm0
     ret

Feb 25 2021

z <z z.com> writes:

On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller


 (a * a) above is always positive for real numbers. You don't 
 need the call to abs unless you're trying to guarantee that 
 even nan values will have a clear sign bit.

I do not know why but the caller's performance nosedives whenever 
there is no .abs at this particular line.(there's a 3x 
difference, no joke.)
Same for assignment instead of addition, but with a 2x difference 
instead.

Mar 07 2021

tsbockman <thomas.bockman gmail.com> writes:

On Sunday, 7 March 2021 at 18:00:57 UTC, z wrote:
 On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller


 (a * a) above is always positive for real numbers. You don't 
 need the call to abs unless you're trying to guarantee that 
 even nan values will have a clear sign bit.

 I do not know why but the caller's performance nosedives

My way is definitely (slightly) better; something is going wrong 
in either the caller, or the optimizer. Show me the code for the 
caller and maybe I can figure it out.

 whenever there is no .abs at this particular line.(there's a 3x 
 difference, no joke.)

Perhaps the compiler is performing a value range propagation 
(VRP) based optimization in the caller, but isn't smart enough to 
figure out that the value is already always positive without the 
`abs` call? I've run into that specific problem before.

Alternatively, sometimes trivial changes to the code that 
*shouldn't* matter make the difference between hitting a smart 
path in the optimizer, and a dumb one. Automatic SIMD 
optimization is quite sensitive and temperamental.

Either way, the problem can be fixed by figuring out what 
optimization the compiler is doing when it knows that distance is 
positive, and just doing it manually.

 Same for assignment instead of addition, but with a 2x 
 difference instead.

Did you fix the nan bug I pointed out earlier? More generally, 
are you actually verifying the correctness of the results in any 
way for each alternative implementation? Because you can get big 
speedups sometimes from buggy code when the compiler realizes 
that some later logic error makes earlier code irrelevant, but 
that doesn't mean the buggy code is better...

Mar 07 2021

Bruce Carneal <bcarneal gmail.com> writes:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
 To give some context, this is a sample of one of the functions 
 that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
   float distance;
   a[] -= b[];
   a[] *= a[];
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
   }
   return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real 7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

 I've tried to experiment with dynamic arrays of float[3] but 
 the output assembly seemed to be worse.[1](in short, it's 
 calling internal D functions which use "vxxxss" instructions 
 while performing many moves.)

 Big thanks
 [1] https://run.dlang.io/is/F3Xye3

If you are developing for deployment to a platform that has a 
GPU, you might consider going SIMT (dcompute) rather than SIMD.  
SIMT is a lot easier on the eyes.  More importantly, if you're 
targetting an SoC or console, or have relatively chunky 
computations that allow you to work around the PCIe transit 
costs, the path is open to very large performance improvements.  
I've only been using dcompute for a week or so but so far it's 
been great.

If your algorithims are very branchy, or you decide to stick with 
multi-core/SIMD for any of a number of other good reasons, here 
are a few things I learned before decamping to dcompute land that 
may help:

   1)  LDC is pretty good at auto vectorization as you have 
probably observed.  Definitely worth a few iterations to try and 
get the vectorizer engaged.

   2)  LDC auto vectorization was good but explicit __vector 
programming is more predictable and was, at least for my tasks, 
much faster. I couldn't persuade the auto vectorizer to "do the 
right thing" throughout the hot path but perhaps you'll have 
better luck.

   3)  LDC does a good job of going between T[N] <==> 
__vector(T[N]) so using the static array types as your 
input/output types and the __vector types as your compute types 
works out well whenever you have to interface with an unaligned 
world. LDC issues unaligned vector loads/stores for casts or full 
array assigns: v = cast(VT)sa[];  or v[] = sa[];  These are quite 
good on modern CPUs.  To calibrate note that Ethan recently 
talked about a 10% gain he experienced using full alignment, 
IIRC, so there's that.

   4) LDC also does a good job of discovering SIMD equivalents 
given static foreach unrolled loops with explicit complie-time 
indexing of vector element operands.  You can use those along 
with pragma(inline, true) to develop your own "intrinsics" that 
supplement other libs.

   5) If you adopt the __vector approach you'll have to handle the 
partials manually. (array length % vec length != 0 indicates a 
partial or tail fragment).  If the classic (copying/padding) 
approaches to such fragmentation don't work for you I'd suggest 
using nested static functions that take ref T[N] inputs and 
outputs.  The main loops become very simple and the tail handling 
reduces to loading stack allocated T[N] variables explicitly, 
calling the static function, and unloading.

Good luck.

Feb 25 2021

z <z z.com> writes:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 ...

It seems that using static foreach with pointer parameters 
exclusively is the best way to "guide" LDC into optimizing 
code.(using arr1[] += arr2[] syntax resulted in worse performance 
for me.)
However, AVX512 support seems limited to being able to use the 16 
other YMM registers, rather than using the same code template but 
changed to use ZMM registers and double the offsets to take 
advantage of the new size.
Compiled with «-g -enable-unsafe-fp-math -enable-no-infs-fp-math 
-ffast-math -O -release -mcpu=skylake» :
__gshared simdf init = [0f,0f,0f,0f,0f,0f,0f,0f];
alias simdf = float[8]
extern(C)//with extern(D)(the default), the assembly output uses 
one register for two pointers.
void vEUCLIDpsptr_void(simdf* a0, simdf* a1, simdf* a2, simdf* 
a3, simdf* b1, simdf* b2, simdf* b3) {
	simdf amm0 = init;//returned simdf
	simdf amm1 = *a1;
	simdf amm2 = *a2;
	simdf amm3 = *a3;
	static foreach(size_t i; 0..simdlength) {
		//Needs to be interleaved like this, otherwise LDC generates 
worse code.
		amm1[i] -= (*b1)[i];
		amm1[i] *= amm1[i];
		amm2[i] -= (*b2)[i];
		amm2[i] *= amm2[i];
		amm3[i] -= (*b3)[i];
		amm3[i] *= amm3[i];
		amm0[i] += amm1[i];
		amm0[i] += amm2[i];
		amm0[i] += amm3[i];
		amm0[i] = sqrt(amm0[i]);
	}
	*a0 = amm0;
	return;
}

mov r10,qword ptr ss:[rsp+38]
mov r11,qword ptr ss:[rsp+30]
mov rax,qword ptr ss:[rsp+28]
vmovups ymm0,yword ptr ds:[rdx]
vmovups ymm1,yword ptr ds:[r8]
vsubps ymm0,ymm0,yword ptr ds:[rax]
vmovups ymm2,yword ptr ds:[r9]
vfmadd213ps ymm0,ymm0,yword ptr ds:[<_D12euclideandst4initG8f>]
vsubps ymm1,ymm1,yword ptr ds:[r11]
vfmadd213ps ymm1,ymm1,ymm0
vsubps ymm0,ymm2,yword ptr ds:[r10]
vfmadd213ps ymm0,ymm0,ymm1
vsqrtps ymm0,ymm0
vmovups yword ptr ds:[rcx],ymm0
vzeroupper ret

The speed difference is near 400% for the same amount of 
distances compared with the single distance function example.
However, the assembly generated isn't the fastest, for example 
removing vzeroupper and using the unused and known-zeroed YMM15 
register as a zero-filled register operand for the first 
vfmadd213ps instruction improves performance by 10%(70 vs 78ms 
for 256 million distances...)
The function can then be improved further to use pointer offsets 
and more registers, this is more efficient and results in a 500%~ 
improvement :
extern(C)
void vEUCLIDpsptr_void_40(simdf* a0, simdf* a1, simdf* a2, 
simdf* a3, simdf* b1, simdf* b2, simdf* b3) {
	simdf amm0 = init;
	simdf amm1 = *a1;
	simdf amm2 = *a2;
	simdf amm3 = *a3;
	simdf emm0 = init;
	simdf emm1 = amm1;
	simdf emm2 = amm2;//mirror AMM for positions
	simdf emm3 = amm3;
	simdf imm0 = init;
	simdf imm1 = emm1;
	simdf imm2 = emm2;
	simdf imm3 = emm3;
	simdf omm0 = init;
	simdf omm1 = emm1;
	simdf omm2 = emm2;
	simdf omm3 = emm3;
	simdf umm0 = init;
	simdf umm1 = omm1;
	simdf umm2 = omm2;
	simdf umm3 = omm3;
	//cascading assignment may not be the fastest way, especially 
compared to just loading from the pointer!

	static foreach(size_t i; 0..simdlength) {
		amm1[i] -= (b1[0])[i];
		amm1[i] *= amm1[i];
		amm0[i] += amm1[i];

		amm2[i] -= (b2[0])[i];
		amm2[i] *= amm2[i];
		amm0[i] += amm2[i];

		amm3[i] -= (b3[0])[i];
		amm3[i] *= amm3[i];
		amm0[i] += amm3[i];

		amm0[i] = sqrt(amm0[i]);
		//template
		emm1[i] -= (b1[1])[i];
		emm1[i] *= emm1[i];
		emm0[i] += emm1[i];

		emm2[i] -= (b2[1])[i];
		emm2[i] *= emm2[i];
		emm0[i] += emm2[i];

		emm3[i] -= (b3[1])[i];
		emm3[i] *= emm3[i];
		emm0[i] += emm3[i];

		emm0[i] = sqrt(emm0[i]);
		//
		imm1[i] -= (b1[2])[i];
		imm1[i] *= imm1[i];
		imm0[i] += imm1[i];

		imm2[i] -= (b2[2])[i];
		imm2[i] *= imm2[i];
		imm0[i] += imm2[i];

		imm3[i] -= (b3[2])[i];
		imm3[i] *= imm3[i];
		imm0[i] += imm3[i];

		imm0[i] = sqrt(imm0[i]);
		//
		omm1[i] -= (b1[3])[i];
		omm1[i] *= omm1[i];
		omm0[i] += omm1[i];

		omm2[i] -= (b2[3])[i];
		omm2[i] *= omm2[i];
		omm0[i] += omm2[i];

		omm3[i] -= (b3[3])[i];
		omm3[i] *= omm3[i];
		omm0[i] += omm3[i];

		omm0[i] = sqrt(omm0[i]);
		//
		umm1[i] -= (b1[4])[i];
		umm1[i] *= umm1[i];
		umm0[i] += umm1[i];

		umm2[i] -= (b2[4])[i];
		umm2[i] *= umm2[i];
		umm0[i] += umm2[i];

		umm3[i] -= (b3[4])[i];
		umm3[i] *= umm3[i];
		umm0[i] += umm3[i];

		umm0[i] = sqrt(umm0[i]);

	}

	//Sadly, writing to the arrays from within the static foreach 
causes LDC to not use SIMD at all
	*a0 = amm0;
	*(a0+1) = emm0;
	*(a0+2) = imm0;
	*(a0+3) = omm0;
	*(a0+4) = umm0;
	return;

}

sub rsp,38
vmovaps xmmword ptr ss:[rsp+20],xmm8
vmovaps xmmword ptr ss:[rsp+10],xmm7
vmovaps xmmword ptr ss:[rsp],xmm6
mov r10,qword ptr ss:[rsp+70]
mov r11,qword ptr ss:[rsp+68]
mov rax,qword ptr ss:[rsp+60]
vmovaps ymm1,yword ptr ds:[<_D12euclideandst4initG8f>]
vmovups ymm3,yword ptr ds:[rdx]
vmovups ymm2,yword ptr ds:[r8]
vmovups ymm0,yword ptr ds:[r9]
vsubps ymm4,ymm3,yword ptr ds:[rax]
vfmadd213ps ymm4,ymm4,ymm1
vsubps ymm5,ymm2,yword ptr ds:[r11]
vfmadd213ps ymm5,ymm5,ymm4
vsubps ymm4,ymm0,yword ptr ds:[r10]
vfmadd213ps ymm4,ymm4,ymm5
vsqrtps ymm4,ymm4
vsubps ymm5,ymm3,yword ptr ds:[rax+20]
vfmadd213ps ymm5,ymm5,ymm1
vsubps ymm6,ymm2,yword ptr ds:[r11+20]
vfmadd213ps ymm6,ymm6,ymm5
vsubps ymm5,ymm0,yword ptr ds:[r10+20]
vfmadd213ps ymm5,ymm5,ymm6
vsqrtps ymm5,ymm5
vsubps ymm6,ymm3,yword ptr ds:[rax+40]
vsubps ymm7,ymm2,yword ptr ds:[r11+40]
vfmadd213ps ymm6,ymm6,ymm1
vfmadd213ps ymm7,ymm7,ymm6
vsubps ymm6,ymm0,yword ptr ds:[r10+40]
vfmadd213ps ymm6,ymm6,ymm7
vsqrtps ymm6,ymm6
vsubps ymm7,ymm3,yword ptr ds:[rax+60]
vfmadd213ps ymm7,ymm7,ymm1
vsubps ymm8,ymm2,yword ptr ds:[r11+60]
vfmadd213ps ymm8,ymm8,ymm7
vsubps ymm7,ymm0,yword ptr ds:[r10+60]
vfmadd213ps ymm7,ymm7,ymm8
vsqrtps ymm7,ymm7
vsubps ymm3,ymm3,yword ptr ds:[rax+80]
vfmadd213ps ymm3,ymm3,ymm1
vsubps ymm1,ymm2,yword ptr ds:[r11+80]
vfmadd213ps ymm1,ymm1,ymm3
vsubps ymm0,ymm0,yword ptr ds:[r10+80]
vfmadd213ps ymm0,ymm0,ymm1
vsqrtps ymm0,ymm0
vmovups yword ptr ds:[rcx],ymm4
vmovups yword ptr ds:[rcx+20],ymm5
vmovups yword ptr ds:[rcx+40],ymm6
vmovups yword ptr ds:[rcx+60],ymm7
vmovups yword ptr ds:[rcx+80],ymm0
vmovaps xmm6,xmmword ptr ss:[rsp]
vmovaps xmm7,xmmword ptr ss:[rsp+10]
vmovaps xmm8,xmmword ptr ss:[rsp+20]
add rsp,38
vzeroupper ret

This is slightly more efficient(54 vs 78ms for the same amount of 
distances(256 million)) but still far from perfect, also i do not 
know what it is doing with XMM registers knowing i can safely 
remove the instructions with a debugger and it doesn't crash or 
corrupt data.(sometimes it was doing this for 0x100 bytes)

Mar 07 2021

tsbockman <thomas.bockman gmail.com> writes:

On Sunday, 7 March 2021 at 13:26:37 UTC, z wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 However, AVX512 support seems limited to being able to use the 
 16 other YMM registers, rather than using the same code 
 template but changed to use ZMM registers and double the 
 offsets to take advantage of the new size.
 Compiled with «-g -enable-unsafe-fp-math 
 -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :

You're not compiling with AVX512 enabled. You would need to use 
-mcpu=skylake-avx512.

However, LLVM's code generation for AVX512 seems to be pretty 
terrible still, so you'll need to either use some inline ASM, or 
stick with AVX2. Here's a structure of arrays style example:

import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) 
a, ref Repeat!(3, const(V)) b, out V result)
     if(is(V : __vector(float[length]), size_t length))
{
     Repeat!(3, V) diffSq = a;
     static foreach(i; 0 .. 3) {
         diffSq[i] -= b[i];
         diffSq[i] *= diffSq[i];
     }

     result = diffSq[0];
     static foreach(i; 0 .. 3)
         result += diffSq[i];

     version(LDC) { version(X86_64) {
         enum isSupportedPlatform = true;
         import ldc.llvmasm : __asm;
         result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
     } }
     static assert(isSupportedPlatform);
}

Resulting asm with is(V == __vector(float[16])):

.LCPI1_0:
         .long   0x7fc00000
pure nothrow  nogc void 
app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan
eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])),
ref const(__vector(float[16])), ref const(__vector(float[16])), ref
const(__vector(float[16])), ref const(__vector(float[16])), out
__vector(float[16])):
         mov     rax, qword ptr [rsp + 8]
         vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
         vmovaps zmmword ptr [rdi], zmm0
         vmovaps zmm0, zmmword ptr [rax]
         vmovaps zmm1, zmmword ptr [r9]
         vmovaps zmm2, zmmword ptr [r8]
         vsubps  zmm0, zmm0, zmmword ptr [rcx]
         vmulps  zmm0, zmm0, zmm0
         vsubps  zmm1, zmm1, zmmword ptr [rdx]
         vsubps  zmm2, zmm2, zmmword ptr [rsi]
         vaddps  zmm0, zmm0, zmm0
         vfmadd231ps     zmm0, zmm1, zmm1
         vfmadd231ps     zmm0, zmm2, zmm2
         vmovaps zmmword ptr [rdi], zmm0
         vsqrtps zmm0, zmm0
         vmovaps zmmword ptr [rdi], zmm0
         vzeroupper
         ret

Mar 07 2021

tsbockman <thomas.bockman gmail.com> writes:

On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
 import std.meta : Repeat;
 void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, 
 const(V)) a, ref Repeat!(3, const(V)) b, out V result)
     if(is(V : __vector(float[length]), size_t length))
 ...

 Resulting asm with is(V == __vector(float[16])):

 .LCPI1_0:
         .long   0x7fc00000
 pure nothrow  nogc void 
 app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan
eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])),
ref const(__vector(float[16])), ref const(__vector(float[16])), ref
const(__vector(float[16])), ref const(__vector(float[16])), out
__vector(float[16])):
         mov     rax, qword ptr [rsp + 8]
         vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
 ...

Apparently the optimizer is too stupid to skip the redundant 
float.nan broadcast when result is an `out` parameter, so just 
make it `ref V result` instead for better code gen:

pure nothrow  nogc void 
app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan
eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])),
ref const(__vector(float[16])), ref const(__vector(float[16])), ref
const(__vector(float[16])), ref const(__vector(float[16])), ref
__vector(float[16])):
         mov     rax, qword ptr [rsp + 8]
         vmovaps zmm0, zmmword ptr [rax]
         vmovaps zmm1, zmmword ptr [r9]
         vmovaps zmm2, zmmword ptr [r8]
         vsubps  zmm0, zmm0, zmmword ptr [rcx]
         vmulps  zmm0, zmm0, zmm0
         vsubps  zmm1, zmm1, zmmword ptr [rdx]
         vsubps  zmm2, zmm2, zmmword ptr [rsi]
         vaddps  zmm0, zmm0, zmm0
         vfmadd231ps     zmm0, zmm1, zmm1
         vfmadd231ps     zmm0, zmm2, zmm2
         vmovaps zmmword ptr [rdi], zmm0
         vsqrtps zmm0, zmm0
         vmovaps zmmword ptr [rdi], zmm0
         vzeroupper
         ret

Mar 07 2021

tsbockman <thomas.bockman gmail.com> writes:

On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
 ...
     result = diffSq[0];
     static foreach(i; 0 .. 3)
         result += diffSq[i];
 ...

Oops, that's supposed to say `i; 1 .. 3`. Fixed:

import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) 
a, ref Repeat!(3, const(V)) b, ref V result)
     if(is(V : __vector(float[length]), size_t length))
{
     Repeat!(3, V) diffSq = a;
     static foreach(i; 0 .. 3) {
         diffSq[i] -= b[i];
         diffSq[i] *= diffSq[i];
     }

     result = diffSq[0];
     static foreach(i; 1 .. 3)
         result += diffSq[i];

     version(LDC) { version(X86_64) {
         enum isSupportedPlatform = true;
         import ldc.llvmasm : __asm;
         result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
     } }
     static assert(isSupportedPlatform);
}

Fixed asm:

pure nothrow  nogc void 
app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan
eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])),
ref const(__vector(float[16])), ref const(__vector(float[16])), ref
const(__vector(float[16])), ref const(__vector(float[16])), ref
__vector(float[16])):
         mov     rax, qword ptr [rsp + 8]
         vmovaps zmm0, zmmword ptr [rax]
         vmovaps zmm1, zmmword ptr [r9]
         vmovaps zmm2, zmmword ptr [r8]
         vsubps  zmm0, zmm0, zmmword ptr [rcx]
         vsubps  zmm1, zmm1, zmmword ptr [rdx]
         vmulps  zmm1, zmm1, zmm1
         vsubps  zmm2, zmm2, zmmword ptr [rsi]
         vfmadd231ps     zmm1, zmm0, zmm0
         vfmadd231ps     zmm1, zmm2, zmm2
         vmovaps zmmword ptr [rdi], zmm1
         vsqrtps zmm0, zmm1
         vmovaps zmmword ptr [rdi], zmm0
         vzeroupper
         ret

(I really wish I could just edit my posts here...)

Mar 07 2021

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Optimizing for SIMD: best practices?(i.e. what features are allowed?)