www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Optimizing for SIMD: best practices?(i.e. what features are allowed?)

reply z <z z.com> writes:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
   float distance;
   a[] -= b[];
   a[] *= a[];
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
   }
   return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real 7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret
I've tried to experiment with dynamic arrays of float[3] but the output assembly seemed to be worse.[1](in short, it's calling internal D functions which use "vxxxss" instructions while performing many moves.) Big thanks [1] https://run.dlang.io/is/F3Xye3
Feb 25
next sibling parent reply Guillaume Piolat <first.name spam.org> writes:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
https://code.dlang.org/packages/intel-intrinsics
Feb 25
next sibling parent Guillaume Piolat <first.name spam.org> writes:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
https://code.dlang.org/packages/intel-intrinsics
A bit of elaboration on why you might want to prefer intel-intrinsics: - it supports all D compilers, including DMD 32-bit target - targets arm32 and arm64 with same code (LDC only) - core.simd just give you the basic operators, but not say, pmaddwd or any of the complex instructions. Some instructions need very specific work to get them. - at least with LLVM, optimizers works reliably over subsequent versions of the compiler.
Feb 26
prev sibling parent reply z <z z.com> writes:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
https://code.dlang.org/packages/intel-intrinsics
I'd try to use it but the platform i'm building on requires AVX to get the most performance.
Mar 07
parent Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
 On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
 wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
https://code.dlang.org/packages/intel-intrinsics
I'd try to use it but the platform i'm building on requires AVX to get the most performance.
The code below might be worth a try on your AVX512 machine. Unless you're looking for a combined result, you might need to separate out the memory access overhead by running multiple passes over a "known optimal for L2" data set. Also note that I compiled with -preview=in. I don't know if that matters. import std.math : sqrt; enum SIMDBits = 512; // 256 was tested, 512 was not alias A = float[SIMDBits / (float.sizeof * 8)]; pragma(inline, true) void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A b1, in A b2, in A b3) { alias V = __vector(A); static V vsqrt(V v) { A a = cast(A) v; static foreach (i; 0 .. A.length) a[i] = sqrt(a[i]); return cast(V)a; } static V sd(in A a, in A b) { V v = cast(V) b - cast(V) a; return v * v; } auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3); a0[] = vsqrt(v)[]; }
Mar 07
prev sibling next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
 To give some context, this is a sample of one of the functions 
 that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
You need to use __vector(float[4]) instead of float[3] to tell the compiler to pack multiple elements per SIMD register. Right now your data lacks proper alignment for SIMD load/stores. Beyond that, SIMD code is rather difficult to optimize. Code written in ignorance or in a rush is unlikely to be meaningfully faster than ordinary scalar code, unless the data flow is very simple. You will probably get a bigger speedup for less effort and pain by first minimizing heap allocations, maximizing locality of reference, minimizing indirections, and minimizing memory use. (And, of course, it should go without saying that choosing an asymptotically efficient high-level algorithm is more important than any micro-optimization for large data sets.) Nevertheless, if you are up to the challenge, SIMD can sometimes provide a final 2-3x speed boost. Your algorithms will need to be designed to minimize mixing of data between SIMD channels, as this forces the generation of lots of extra instructions to swizzle the data, or worse to unpack and repack it. Something like a Cartesian dot product or cross product will benefit much less from SIMD than vector addition, for example. Sometimes the amount of swizzling can be greatly reduced with a little algebra, other times you might need to refactor an array of structures into a structure of arrays. Per-element conditional branches are very bad, and often completely defeat the benefits of SIMD. For very short segments of code (like conditional assignment), replace them with a SIMD conditional move (vcmp and vblend). Bit-twiddling is your friend. Finally, do not trust the compiler or the optimizer. People love to make the claim that "The Compiler" is always better than humans at micro-optimizations, but this is not at all the case for SIMD code with current systems. I have found even LLVM to produce quite bad SIMD code for complex algorithms, unless I carefully structure my code to make it as easy as possible for the optimizer to get to the final assembly I want. A sprinkling of manual assembly code (directly, or via a library) is also necessary to fill in certain instructions that the compiler doesn't know when to use at all. Resources I have found very helpful: Matt Godbolt's Compiler Explorer online visual disassembler (supports D): https://godbolt.org/ Felix Cloutier's x86 and amd64 instruction reference: https://www.felixcloutier.com/x86/ Agner Fog's optimization guide (especially the instruction tables): https://agner.org/optimize/
Feb 25
prev sibling next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
Use __vector(float[4]), not float[3].
   float distance;
The default value for float is float.nan. You need to explicitly initialize it to 0.0f or something if you want this function to actually do anything useful.
   a[] -= b[];
   a[] *= a[];
With __vector types, this can be simplified (not optimized) to just: a -= b; a *= a;
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
(a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit. Also, there is no point to adding the first component to zero, and copying element [0] from a SIMD register into a scalar is free, so this can become: float distance = a[0]; static foreach(size_t i; 1 .. 3) distance += a[i];
   }
   return sqrt(distance);
}
Final assembly output (ldc 1.24.0 with -release -O3 -preview=intpromote -preview=dip1000 -m64 -mcpu=haswell -fp-contract=fast -enable-cross-module-inlining): vsubps xmm0, xmm1, xmm0 vmulps xmm0, xmm0, xmm0 vmovshdup xmm1, xmm0 vaddss xmm1, xmm0, xmm1 vpermilpd xmm0, xmm0, 1 vaddss xmm0, xmm0, xmm1 vsqrtss xmm0, xmm0, xmm0 ret
Feb 25
parent reply z <z z.com> writes:
On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
(a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit.
I do not know why but the caller's performance nosedives whenever there is no .abs at this particular line.(there's a 3x difference, no joke.) Same for assignment instead of addition, but with a 2x difference instead.
Mar 07
parent tsbockman <thomas.bockman gmail.com> writes:
On Sunday, 7 March 2021 at 18:00:57 UTC, z wrote:
 On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
(a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit.
I do not know why but the caller's performance nosedives
My way is definitely (slightly) better; something is going wrong in either the caller, or the optimizer. Show me the code for the caller and maybe I can figure it out.
 whenever there is no .abs at this particular line.(there's a 3x 
 difference, no joke.)
Perhaps the compiler is performing a value range propagation (VRP) based optimization in the caller, but isn't smart enough to figure out that the value is already always positive without the `abs` call? I've run into that specific problem before. Alternatively, sometimes trivial changes to the code that *shouldn't* matter make the difference between hitting a smart path in the optimizer, and a dumb one. Automatic SIMD optimization is quite sensitive and temperamental. Either way, the problem can be fixed by figuring out what optimization the compiler is doing when it knows that distance is positive, and just doing it manually.
 Same for assignment instead of addition, but with a 2x 
 difference instead.
Did you fix the nan bug I pointed out earlier? More generally, are you actually verifying the correctness of the results in any way for each alternative implementation? Because you can get big speedups sometimes from buggy code when the compiler realizes that some later logic error makes earlier code irrelevant, but that doesn't mean the buggy code is better...
Mar 07
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 How does one optimize code to make full use of the CPU's SIMD 
 capabilities?
 Is there any way to guarantee that "packed" versions of SIMD 
 instructions will be used?(e.g. vmulps, vsqrtps, etc...)
 To give some context, this is a sample of one of the functions 
 that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
   float distance;
   a[] -= b[];
   a[] *= a[];
   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
       distance += a[i].abs;//abs required by the caller
   }
   return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real 7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret
I've tried to experiment with dynamic arrays of float[3] but the output assembly seemed to be worse.[1](in short, it's calling internal D functions which use "vxxxss" instructions while performing many moves.) Big thanks [1] https://run.dlang.io/is/F3Xye3
If you are developing for deployment to a platform that has a GPU, you might consider going SIMT (dcompute) rather than SIMD. SIMT is a lot easier on the eyes. More importantly, if you're targetting an SoC or console, or have relatively chunky computations that allow you to work around the PCIe transit costs, the path is open to very large performance improvements. I've only been using dcompute for a week or so but so far it's been great. If your algorithims are very branchy, or you decide to stick with multi-core/SIMD for any of a number of other good reasons, here are a few things I learned before decamping to dcompute land that may help: 1) LDC is pretty good at auto vectorization as you have probably observed. Definitely worth a few iterations to try and get the vectorizer engaged. 2) LDC auto vectorization was good but explicit __vector programming is more predictable and was, at least for my tasks, much faster. I couldn't persuade the auto vectorizer to "do the right thing" throughout the hot path but perhaps you'll have better luck. 3) LDC does a good job of going between T[N] <==> __vector(T[N]) so using the static array types as your input/output types and the __vector types as your compute types works out well whenever you have to interface with an unaligned world. LDC issues unaligned vector loads/stores for casts or full array assigns: v = cast(VT)sa[]; or v[] = sa[]; These are quite good on modern CPUs. To calibrate note that Ethan recently talked about a 10% gain he experienced using full alignment, IIRC, so there's that. 4) LDC also does a good job of discovering SIMD equivalents given static foreach unrolled loops with explicit complie-time indexing of vector element operands. You can use those along with pragma(inline, true) to develop your own "intrinsics" that supplement other libs. 5) If you adopt the __vector approach you'll have to handle the partials manually. (array length % vec length != 0 indicates a partial or tail fragment). If the classic (copying/padding) approaches to such fragmentation don't work for you I'd suggest using nested static functions that take ref T[N] inputs and outputs. The main loops become very simple and the tail handling reduces to loading stack allocated T[N] variables explicitly, calling the static function, and unloading. Good luck.
Feb 25
prev sibling parent reply z <z z.com> writes:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 ...
It seems that using static foreach with pointer parameters exclusively is the best way to "guide" LDC into optimizing code.(using arr1[] += arr2[] syntax resulted in worse performance for me.) However, AVX512 support seems limited to being able to use the 16 other YMM registers, rather than using the same code template but changed to use ZMM registers and double the offsets to take advantage of the new size. Compiled with «-g -enable-unsafe-fp-math -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :
__gshared simdf init = [0f,0f,0f,0f,0f,0f,0f,0f];
alias simdf = float[8]
extern(C)//with extern(D)(the default), the assembly output uses 
one register for two pointers.
void vEUCLIDpsptr_void(simdf* a0, simdf* a1, simdf* a2, simdf* 
a3, simdf* b1, simdf* b2, simdf* b3) {
	simdf amm0 = init;//returned simdf
	simdf amm1 = *a1;
	simdf amm2 = *a2;
	simdf amm3 = *a3;
	static foreach(size_t i; 0..simdlength) {
		//Needs to be interleaved like this, otherwise LDC generates 
worse code.
		amm1[i] -= (*b1)[i];
		amm1[i] *= amm1[i];
		amm2[i] -= (*b2)[i];
		amm2[i] *= amm2[i];
		amm3[i] -= (*b3)[i];
		amm3[i] *= amm3[i];
		amm0[i] += amm1[i];
		amm0[i] += amm2[i];
		amm0[i] += amm3[i];
		amm0[i] = sqrt(amm0[i]);
	}
	*a0 = amm0;
	return;
}
mov r10,qword ptr ss:[rsp+38]
mov r11,qword ptr ss:[rsp+30]
mov rax,qword ptr ss:[rsp+28]
vmovups ymm0,yword ptr ds:[rdx]
vmovups ymm1,yword ptr ds:[r8]
vsubps ymm0,ymm0,yword ptr ds:[rax]
vmovups ymm2,yword ptr ds:[r9]
vfmadd213ps ymm0,ymm0,yword ptr ds:[<_D12euclideandst4initG8f>]
vsubps ymm1,ymm1,yword ptr ds:[r11]
vfmadd213ps ymm1,ymm1,ymm0
vsubps ymm0,ymm2,yword ptr ds:[r10]
vfmadd213ps ymm0,ymm0,ymm1
vsqrtps ymm0,ymm0
vmovups yword ptr ds:[rcx],ymm0
vzeroupper ret
The speed difference is near 400% for the same amount of distances compared with the single distance function example. However, the assembly generated isn't the fastest, for example removing vzeroupper and using the unused and known-zeroed YMM15 register as a zero-filled register operand for the first vfmadd213ps instruction improves performance by 10%(70 vs 78ms for 256 million distances...) The function can then be improved further to use pointer offsets and more registers, this is more efficient and results in a 500%~ improvement :
extern(C)
void vEUCLIDpsptr_void_40(simdf* a0, simdf* a1, simdf* a2, 
simdf* a3, simdf* b1, simdf* b2, simdf* b3) {
	simdf amm0 = init;
	simdf amm1 = *a1;
	simdf amm2 = *a2;
	simdf amm3 = *a3;
	simdf emm0 = init;
	simdf emm1 = amm1;
	simdf emm2 = amm2;//mirror AMM for positions
	simdf emm3 = amm3;
	simdf imm0 = init;
	simdf imm1 = emm1;
	simdf imm2 = emm2;
	simdf imm3 = emm3;
	simdf omm0 = init;
	simdf omm1 = emm1;
	simdf omm2 = emm2;
	simdf omm3 = emm3;
	simdf umm0 = init;
	simdf umm1 = omm1;
	simdf umm2 = omm2;
	simdf umm3 = omm3;
	//cascading assignment may not be the fastest way, especially 
compared to just loading from the pointer!

	static foreach(size_t i; 0..simdlength) {
		amm1[i] -= (b1[0])[i];
		amm1[i] *= amm1[i];
		amm0[i] += amm1[i];

		amm2[i] -= (b2[0])[i];
		amm2[i] *= amm2[i];
		amm0[i] += amm2[i];

		amm3[i] -= (b3[0])[i];
		amm3[i] *= amm3[i];
		amm0[i] += amm3[i];

		amm0[i] = sqrt(amm0[i]);
		//template
		emm1[i] -= (b1[1])[i];
		emm1[i] *= emm1[i];
		emm0[i] += emm1[i];

		emm2[i] -= (b2[1])[i];
		emm2[i] *= emm2[i];
		emm0[i] += emm2[i];

		emm3[i] -= (b3[1])[i];
		emm3[i] *= emm3[i];
		emm0[i] += emm3[i];

		emm0[i] = sqrt(emm0[i]);
		//
		imm1[i] -= (b1[2])[i];
		imm1[i] *= imm1[i];
		imm0[i] += imm1[i];

		imm2[i] -= (b2[2])[i];
		imm2[i] *= imm2[i];
		imm0[i] += imm2[i];

		imm3[i] -= (b3[2])[i];
		imm3[i] *= imm3[i];
		imm0[i] += imm3[i];

		imm0[i] = sqrt(imm0[i]);
		//
		omm1[i] -= (b1[3])[i];
		omm1[i] *= omm1[i];
		omm0[i] += omm1[i];

		omm2[i] -= (b2[3])[i];
		omm2[i] *= omm2[i];
		omm0[i] += omm2[i];

		omm3[i] -= (b3[3])[i];
		omm3[i] *= omm3[i];
		omm0[i] += omm3[i];

		omm0[i] = sqrt(omm0[i]);
		//
		umm1[i] -= (b1[4])[i];
		umm1[i] *= umm1[i];
		umm0[i] += umm1[i];

		umm2[i] -= (b2[4])[i];
		umm2[i] *= umm2[i];
		umm0[i] += umm2[i];

		umm3[i] -= (b3[4])[i];
		umm3[i] *= umm3[i];
		umm0[i] += umm3[i];

		umm0[i] = sqrt(umm0[i]);

	}

	//Sadly, writing to the arrays from within the static foreach 
causes LDC to not use SIMD at all
	*a0 = amm0;
	*(a0+1) = emm0;
	*(a0+2) = imm0;
	*(a0+3) = omm0;
	*(a0+4) = umm0;
	return;

}
sub rsp,38
vmovaps xmmword ptr ss:[rsp+20],xmm8
vmovaps xmmword ptr ss:[rsp+10],xmm7
vmovaps xmmword ptr ss:[rsp],xmm6
mov r10,qword ptr ss:[rsp+70]
mov r11,qword ptr ss:[rsp+68]
mov rax,qword ptr ss:[rsp+60]
vmovaps ymm1,yword ptr ds:[<_D12euclideandst4initG8f>]
vmovups ymm3,yword ptr ds:[rdx]
vmovups ymm2,yword ptr ds:[r8]
vmovups ymm0,yword ptr ds:[r9]
vsubps ymm4,ymm3,yword ptr ds:[rax]
vfmadd213ps ymm4,ymm4,ymm1
vsubps ymm5,ymm2,yword ptr ds:[r11]
vfmadd213ps ymm5,ymm5,ymm4
vsubps ymm4,ymm0,yword ptr ds:[r10]
vfmadd213ps ymm4,ymm4,ymm5
vsqrtps ymm4,ymm4
vsubps ymm5,ymm3,yword ptr ds:[rax+20]
vfmadd213ps ymm5,ymm5,ymm1
vsubps ymm6,ymm2,yword ptr ds:[r11+20]
vfmadd213ps ymm6,ymm6,ymm5
vsubps ymm5,ymm0,yword ptr ds:[r10+20]
vfmadd213ps ymm5,ymm5,ymm6
vsqrtps ymm5,ymm5
vsubps ymm6,ymm3,yword ptr ds:[rax+40]
vsubps ymm7,ymm2,yword ptr ds:[r11+40]
vfmadd213ps ymm6,ymm6,ymm1
vfmadd213ps ymm7,ymm7,ymm6
vsubps ymm6,ymm0,yword ptr ds:[r10+40]
vfmadd213ps ymm6,ymm6,ymm7
vsqrtps ymm6,ymm6
vsubps ymm7,ymm3,yword ptr ds:[rax+60]
vfmadd213ps ymm7,ymm7,ymm1
vsubps ymm8,ymm2,yword ptr ds:[r11+60]
vfmadd213ps ymm8,ymm8,ymm7
vsubps ymm7,ymm0,yword ptr ds:[r10+60]
vfmadd213ps ymm7,ymm7,ymm8
vsqrtps ymm7,ymm7
vsubps ymm3,ymm3,yword ptr ds:[rax+80]
vfmadd213ps ymm3,ymm3,ymm1
vsubps ymm1,ymm2,yword ptr ds:[r11+80]
vfmadd213ps ymm1,ymm1,ymm3
vsubps ymm0,ymm0,yword ptr ds:[r10+80]
vfmadd213ps ymm0,ymm0,ymm1
vsqrtps ymm0,ymm0
vmovups yword ptr ds:[rcx],ymm4
vmovups yword ptr ds:[rcx+20],ymm5
vmovups yword ptr ds:[rcx+40],ymm6
vmovups yword ptr ds:[rcx+60],ymm7
vmovups yword ptr ds:[rcx+80],ymm0
vmovaps xmm6,xmmword ptr ss:[rsp]
vmovaps xmm7,xmmword ptr ss:[rsp+10]
vmovaps xmm8,xmmword ptr ss:[rsp+20]
add rsp,38
vzeroupper ret
This is slightly more efficient(54 vs 78ms for the same amount of distances(256 million)) but still far from perfect, also i do not know what it is doing with XMM registers knowing i can safely remove the instructions with a debugger and it doesn't crash or corrupt data.(sometimes it was doing this for 0x100 bytes)
Mar 07
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Sunday, 7 March 2021 at 13:26:37 UTC, z wrote:
 On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
 However, AVX512 support seems limited to being able to use the 
 16 other YMM registers, rather than using the same code 
 template but changed to use ZMM registers and double the 
 offsets to take advantage of the new size.
 Compiled with «-g -enable-unsafe-fp-math 
 -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :
You're not compiling with AVX512 enabled. You would need to use -mcpu=skylake-avx512. However, LLVM's code generation for AVX512 seems to be pretty terrible still, so you'll need to either use some inline ASM, or stick with AVX2. Here's a structure of arrays style example: import std.meta : Repeat; void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result) if(is(V : __vector(float[length]), size_t length)) { Repeat!(3, V) diffSq = a; static foreach(i; 0 .. 3) { diffSq[i] -= b[i]; diffSq[i] *= diffSq[i]; } result = diffSq[0]; static foreach(i; 0 .. 3) result += diffSq[i]; version(LDC) { version(X86_64) { enum isSupportedPlatform = true; import ldc.llvmasm : __asm; result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result); } } static assert(isSupportedPlatform); } Resulting asm with is(V == __vector(float[16])): .LCPI1_0: .long 0x7fc00000 pure nothrow nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])): mov rax, qword ptr [rsp + 8] vbroadcastss zmm0, dword ptr [rip + .LCPI1_0] vmovaps zmmword ptr [rdi], zmm0 vmovaps zmm0, zmmword ptr [rax] vmovaps zmm1, zmmword ptr [r9] vmovaps zmm2, zmmword ptr [r8] vsubps zmm0, zmm0, zmmword ptr [rcx] vmulps zmm0, zmm0, zmm0 vsubps zmm1, zmm1, zmmword ptr [rdx] vsubps zmm2, zmm2, zmmword ptr [rsi] vaddps zmm0, zmm0, zmm0 vfmadd231ps zmm0, zmm1, zmm1 vfmadd231ps zmm0, zmm2, zmm2 vmovaps zmmword ptr [rdi], zmm0 vsqrtps zmm0, zmm0 vmovaps zmmword ptr [rdi], zmm0 vzeroupper ret
Mar 07
next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
 import std.meta : Repeat;
 void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, 
 const(V)) a, ref Repeat!(3, const(V)) b, out V result)
     if(is(V : __vector(float[length]), size_t length))
 ...

 Resulting asm with is(V == __vector(float[16])):

 .LCPI1_0:
         .long   0x7fc00000
 pure nothrow  nogc void 
 app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan
eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])),
ref const(__vector(float[16])), ref const(__vector(float[16])), ref
const(__vector(float[16])), ref const(__vector(float[16])), out
__vector(float[16])):
         mov     rax, qword ptr [rsp + 8]
         vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
 ...
Apparently the optimizer is too stupid to skip the redundant float.nan broadcast when result is an `out` parameter, so just make it `ref V result` instead for better code gen: pure nothrow nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])): mov rax, qword ptr [rsp + 8] vmovaps zmm0, zmmword ptr [rax] vmovaps zmm1, zmmword ptr [r9] vmovaps zmm2, zmmword ptr [r8] vsubps zmm0, zmm0, zmmword ptr [rcx] vmulps zmm0, zmm0, zmm0 vsubps zmm1, zmm1, zmmword ptr [rdx] vsubps zmm2, zmm2, zmmword ptr [rsi] vaddps zmm0, zmm0, zmm0 vfmadd231ps zmm0, zmm1, zmm1 vfmadd231ps zmm0, zmm2, zmm2 vmovaps zmmword ptr [rdi], zmm0 vsqrtps zmm0, zmm0 vmovaps zmmword ptr [rdi], zmm0 vzeroupper ret
Mar 07
prev sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
 ...
     result = diffSq[0];
     static foreach(i; 0 .. 3)
         result += diffSq[i];
 ...
Oops, that's supposed to say `i; 1 .. 3`. Fixed: import std.meta : Repeat; void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, ref V result) if(is(V : __vector(float[length]), size_t length)) { Repeat!(3, V) diffSq = a; static foreach(i; 0 .. 3) { diffSq[i] -= b[i]; diffSq[i] *= diffSq[i]; } result = diffSq[0]; static foreach(i; 1 .. 3) result += diffSq[i]; version(LDC) { version(X86_64) { enum isSupportedPlatform = true; import ldc.llvmasm : __asm; result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result); } } static assert(isSupportedPlatform); } Fixed asm: pure nothrow nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistan eFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])): mov rax, qword ptr [rsp + 8] vmovaps zmm0, zmmword ptr [rax] vmovaps zmm1, zmmword ptr [r9] vmovaps zmm2, zmmword ptr [r8] vsubps zmm0, zmm0, zmmword ptr [rcx] vsubps zmm1, zmm1, zmmword ptr [rdx] vmulps zmm1, zmm1, zmm1 vsubps zmm2, zmm2, zmmword ptr [rsi] vfmadd231ps zmm1, zmm0, zmm0 vfmadd231ps zmm1, zmm2, zmm2 vmovaps zmmword ptr [rdi], zmm1 vsqrtps zmm0, zmm1 vmovaps zmmword ptr [rdi], zmm0 vzeroupper ret (I really wish I could just edit my posts here...)
Mar 07