
digitalmars.D.ldc - auto vectorization of interleaves

Bruce Carneal <bcarneal gmail.com> writes:
With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28 
vectorizes the code below when T == ubyte but does not vectorize 
that code when T == ushort.

Intra-cache throughput testing on a 2.4GHz zen1 reveals:
   30GB/sec -- custom template function in vanilla D (no asm, no 
intrinsics)
   27GB/sec -- auto-vectorized ubyte
    6GB/sec -- non-vectorized ushort

I'll continue to use that custom code, so there's no particular 
urgency here, but if anyone on the LDC crew can, off the top of 
their head, shed some light on this I'd be interested.  My guess 
is that the cost/benefit function in play here does not take 
bandwidth into account at all.


void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
     foreach (i, ref dst; quads[])
     {
         dst[0] = s0[i];
         dst[1] = s1[i];
         dst[2] = s2[i];
         dst[3] = s3[i];
     }
}
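
For reference, the two instantiations were exercised along these 
lines (a minimal harness sketch; names and buffer sizes are 
illustrative, not my actual benchmark):

void main()
{
     enum N = 1 << 16;   // small enough to stay intra-cache
     auto s0 = new ushort[N], s1 = new ushort[N],
          s2 = new ushort[N], s3 = new ushort[N];
     auto quads = new ushort[4][](N);

     // T is inferred as ushort here; switch the element type to
     // ubyte to get the instantiation that does auto-vectorize
     interleave(s0.ptr, s1.ptr, s2.ptr, s3.ptr, quads);
}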
Jan 09 2022
Johan <j j.nl> writes:
On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
 With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28 
 vectorizes the code below when T == ubyte but does not 
 vectorize that code when T == ushort.

 Intra-cache throughput testing on a 2.4GHz zen1 reveals:
   30GB/sec -- custom template function in vanilla D (no asm, no 
 intrinsics)
   27GB/sec -- auto-vectorized ubyte
    6GB/sec -- non-vectorized ushort

 I'll continue to use that custom code, so there's no particular 
 urgency here, but if anyone on the LDC crew can, off the top of 
 their head, shed some light on this I'd be interested.  My guess 
 is that the cost/benefit function in play here does not take 
 bandwidth into account at all.


 void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
 {
     foreach (i, ref dst; quads[])
     {
         dst[0] = s0[i];
         dst[1] = s1[i];
         dst[2] = s2[i];
         dst[3] = s3[i];
     }
 }
Hi Bruce,

This could be due to a number of things. Most likely it's the 
possibility of pointer aliasing, but it could also be alignment 
assumptions.

Your message is unclear, though: what's the difference between 
the "custom template" and the other two? It would be clearest to 
just provide the code of all three, without templates. (If it is 
a template, then cross-module inlining is possible; is that 
causing a speed boost?)

You can look at the LLVM IR output (--output-ll) to understand 
better why/what is (not) happening inside the optimizer.

-Johan
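
P.S. For example, something like (the source file name is 
illustrative):

    ldc2 -O3 -mcpu=native --output-ll interleave.d

and then inspect the emitted interleave.ll.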
Jan 09 2022
Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
 On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
 ...
 Hi Bruce,

 This could be due to a number of things. Most likely it's the 
 possibility of pointer aliasing, but it could also be alignment 
 assumptions.
I don't think it's pointer aliasing, since the 10-line template 
function seen above was used for both the ubyte and ushort 
instantiations: the ubyte instantiation auto-vectorized nicely, 
the ushort instantiation did not.

Also, I don't think the unaligned vector load/store instructions 
have alignment restrictions. They are generated by LDC when you 
have something like:

    ushort[8]* sap = ...
    auto tmp = cast(__vector(ushort[8]))sap[0];  // turns into: vmovups ...
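
Fleshed out, a compilable version of that fragment might look 
like this (function and variable names are illustrative):

alias V = __vector(ushort[8]);

V load8(ushort* p)
{
     ushort[8]* sap = cast(ushort[8]*) p;
     return cast(V) sap[0];       // the vmovups load mentioned above
}

void store8(ushort* p, V v)
{
     ushort[8]* dap = cast(ushort[8]*) p;
     dap[0] = cast(ushort[8]) v;  // the store side is expected to lower the same way
}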
 Your message is unclear, though: what's the difference between 
 the "custom template" and the other two? It would be clearest to 
 just provide the code of all three, without templates. (If it is 
 a template, then cross-module inlining is possible; is that 
 causing a speed boost?)
The 10-liner above was responsible for the 27GB/sec ubyte 
performance and the 6GB/sec ushort performance. The "custom 
template", not shown, is a 35-LOC template function that I wrote 
to accomplish, at speed, what the 10-LOC template could not.

Note that the LDC auto-vectorized ubyte instantiation of that 
simple 10-liner is very good. I'm trying to understand why LDC 
does such a great job with T == ubyte, yet fails to vectorize at 
all when T == ushort.
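
The custom template isn't shown here, but to give a flavor of the 
explicit-vector style, here is an illustrative (not my actual) 
kernel that interleaves 8 quads per call, assuming ldc.simd's 
loadUnaligned/storeUnaligned/shufflevector:

void interleave8(ushort* s0, ushort* s1, ushort* s2, ushort* s3,
                 ushort[4]* quads)
{
     import ldc.simd : loadUnaligned, storeUnaligned, shufflevector;
     alias V = __vector(ushort[8]);

     V a = loadUnaligned!V(s0);
     V b = loadUnaligned!V(s1);
     V c = loadUnaligned!V(s2);
     V d = loadUnaligned!V(s3);

     // first zip: ac0 = [a0,c0,a1,c1,a2,c2,a3,c3], etc.
     V ac0 = shufflevector!(V, 0, 8, 1, 9, 2, 10, 3, 11)(a, c);
     V ac1 = shufflevector!(V, 4, 12, 5, 13, 6, 14, 7, 15)(a, c);
     V bd0 = shufflevector!(V, 0, 8, 1, 9, 2, 10, 3, 11)(b, d);
     V bd1 = shufflevector!(V, 4, 12, 5, 13, 6, 14, 7, 15)(b, d);

     // second zip yields [a0,b0,c0,d0, a1,b1,c1,d1], etc.
     auto p = cast(ushort*) quads;
     storeUnaligned!V(shufflevector!(V, 0, 8, 1, 9, 2, 10, 3, 11)(ac0, bd0), p);
     storeUnaligned!V(shufflevector!(V, 4, 12, 5, 13, 6, 14, 7, 15)(ac0, bd0), p + 8);
     storeUnaligned!V(shufflevector!(V, 0, 8, 1, 9, 2, 10, 3, 11)(ac1, bd1), p + 16);
     storeUnaligned!V(shufflevector!(V, 4, 12, 5, 13, 6, 14, 7, 15)(ac1, bd1), p + 24);
}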
 You can look at the LLVM IR output (--output-ll) to understand 
 better why/what is (not) happening inside the optimizer.

 -Johan
I took a look at the IR but didn't see an explanation; I may have 
missed something there... the output is a little noisy.

On a lark I tried the clang flags that enable vectorization 
diagnostics. Unsurprisingly :-) those did not work. If/when there 
is a clamoring for better auto-vectorization, enabling clang-style 
analysis/diagnostics might be cost effective. In the meantime, the 
workarounds D affords are not bad at all and will likely be 
preferred in known hot paths for their predictability anyway.

Thanks for the response, and thanks again for your very useful 
contributions to my everyday compiler.
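
(For the record, the clang flags in question are its 
optimization-remark options, something like:

    clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
          -Rpass-analysis=loop-vectorize foo.c

those are what did not work when passed through to ldc2.)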
Jan 09 2022
Johan <j j.nl> writes:
On Monday, 10 January 2022 at 03:04:22 UTC, Bruce Carneal wrote:
 On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
 On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
 ...
 Hi Bruce,

 This could be due to a number of things. Most likely it's the 
 possibility of pointer aliasing, but it could also be alignment 
 assumptions.
 I don't think it's pointer aliasing, since the 10-line template 
 function seen above was used for both the ubyte and ushort 
 instantiations: the ubyte instantiation auto-vectorized nicely, 
 the ushort instantiation did not.

 Also, I don't think the unaligned vector load/store instructions 
 have alignment restrictions. They are generated by LDC when you 
 have something like:

     ushort[8]* sap = ...
     auto tmp = cast(__vector(ushort[8]))sap[0];  // turns into: vmovups ...
The compiler complains about aliasing when optimizing:
https://d.godbolt.org/z/hnGj3G3zo

For example, the write to `dst[0]` may alias with `s1[i]`, so 
`s1[i]` needs to be reloaded. I think the problem gets worse with 
16-bit numbers because they may partially overlap (8 bits of 
dst[0] overlap with s1[i])? Just a guess as to why the lookup 
tables `.LCPI0_x` are generated...

-Johan
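
P.S. To make the hazard concrete: nothing stops a caller from 
handing the function overlapping memory, and such a call is 
perfectly legal, so the loads and stores must stay ordered. A 
hypothetical example:

// all four source pointers point into the quads storage itself,
// so every store to dst[0..3] can change what a later s0[i],
// s1[i], s2[i], or s3[i] load must observe
auto quads = new ushort[4][](64);
auto base = cast(ushort*) quads.ptr;
interleave(base + 1, base + 2, base + 3, base + 4, quads);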
Jan 10 2022
Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 10 January 2022 at 19:21:06 UTC, Johan wrote:
 On Monday, 10 January 2022 at 03:04:22 UTC, Bruce Carneal wrote:
 ...
 The compiler complains about aliasing when optimizing:
 https://d.godbolt.org/z/hnGj3G3zo

 For example, the write to `dst[0]` may alias with `s1[i]`, so 
 `s1[i]` needs to be reloaded. I think the problem gets worse with 
 16-bit numbers because they may partially overlap (8 bits of 
 dst[0] overlap with s1[i])? Just a guess as to why the lookup 
 tables `.LCPI0_x` are generated...

 -Johan
Thanks for the clarification, Johan. For some reason I was 
thinking FORTRAN rather than C wrt aliasing assumptions. LDC's 
restrict attribute looks useful.

That said, for my current work the performance predictability of 
explicit __vector forms has come to outweigh the convenience of 
auto-vectorization. The unittesting is more laborious, but the 
performance is easier to understand.

Thanks again for your answers and patience. I'm pretty sure I'll 
be using restrict in other projects.
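
If it helps anyone else reading along, my understanding is that 
the attribute is applied per parameter, something like this 
(untested sketch based on my reading of the LDC docs):

import ldc.attributes : restrict;

void interleave(T)(@restrict T* s0, @restrict T* s1,
                   @restrict T* s2, @restrict T* s3, T[4][] quads)
{
     foreach (i, ref dst; quads[])
     {
         // with the no-alias promise, the optimizer no longer has
         // to assume these stores can feed later source loads
         dst[0] = s0[i];
         dst[1] = s1[i];
         dst[2] = s2[i];
         dst[3] = s3[i];
     }
}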
Jan 10 2022