
digitalmars.D.ldc - LDC/GDC auto vectorization observations

For much of the code that I throw at it, LDC auto vectorization 
is comparable to GDC (both very good) but LDC lags in a few 
situations.  The link below shows one of those, a gather of the 
two primaries from a simple Bayer mosaic pixel:

https://godbolt.org/z/sfd8e4hqe

The throughput difference on my (aging) 2.4 GHz Zen 1 is 
21+ GB/sec/core for GDC vs ~9 GB/sec for LDC.  The expected (hoped 
for) auto vec loop starts at .L6 in the GDC output.  (I realize 
there are other ways to "skin" the Bayer pixel unpacking cat; 
this is just an example of where one approach breaks down.)
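
For readers who cannot open the link, here is a rough sketch of the kind of loop under discussion.  It is not the code behind the godbolt link: the RGGB layout, the 16-bit sample type, and the name gatherRB are all assumptions made for illustration.

```d
// Hypothetical sketch, not the code from the godbolt link.
// Assumes an RGGB mosaic stored as two rows of 16-bit samples:
//   row0: R G R G ...
//   row1: G B G B ...
// Writes the two primaries (red, blue) of each 2x2 cell to rb.
void gatherRB(const(ushort)[] row0, const(ushort)[] row1, ushort[] rb)
{
    assert(row0.length == row1.length && rb.length == row0.length);
    foreach (i; 0 .. row0.length / 2)
    {
        rb[2 * i]     = row0[2 * i];     // red: even column of the even row
        rb[2 * i + 1] = row1[2 * i + 1]; // blue: odd column of the odd row
    }
}
```

The strided reads are the part an auto vectorizer has to turn into shuffles or blends.  One way to compare the two compilers on a loop like this would be something like ldc2 -O3 -mcpu=znver1 against gdc -O3 -march=znver1; those flags are my guess at a reasonable setup, not necessarily what the link uses.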

Additionally I'll note that GDC can auto vectorize in the face of 
multiple outputs (low degree kernel fusion).  I've not seen LDC 
do that in the few cases that I've tried.
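
To make "multiple outputs" concrete, here is a minimal sketch of the loop shape I mean: one pass over the inputs feeding two output streams.  The kernel and the name sumAndDiff are hypothetical, and I am not claiming that either compiler does or does not vectorize this exact loop.

```d
// Hypothetical two-output kernel: a single pass over a and b
// produces both sum and diff, instead of two separate loops.
void sumAndDiff(const(float)[] a, const(float)[] b,
                float[] sum, float[] diff)
{
    assert(a.length == b.length);
    assert(sum.length == a.length && diff.length == a.length);
    foreach (i; 0 .. a.length)
    {
        sum[i]  = a[i] + b[i];  // first output stream
        diff[i] = a[i] - b[i];  // second output stream
    }
}
```

Fusing the two outputs into one loop means each input element is loaded once rather than twice, which is where the benefit comes from when the loop is memory bound.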

Both LDC and GDC auto vectorization are sufficiently advanced 
that I've been able to eliminate a good deal of __vector code and 
the attendant difficulties in target specialization, source code 
expansion, and testing.  D's __vector capabilities help out with 
the remainder.
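
As an illustration of the trade-off, here is the same multiply-add written both ways.  This is a sketch with made-up names (mulAdd4, mulAdd), and it assumes a target where core.simd provides float4.

```d
import core.simd;

// Explicit form: exactly four floats per call, tied to a 128-bit
// vector type.  Wider targets need another variant (and more tests).
float4 mulAdd4(float4 x, float4 y, float4 z)
{
    return x * y + z;
}

// Plain-loop form: any length, with the lane width left to the
// auto vectorizer and the target selected on the command line.
void mulAdd(const(float)[] x, const(float)[] y, const(float)[] z,
            float[] r)
{
    assert(x.length == r.length && y.length == r.length
           && z.length == r.length);
    foreach (i; 0 .. r.length)
        r[i] = x[i] * y[i] + z[i];
}
```

The loop version is the one that lets a single source serve several targets; the explicit version is what remains for the cases where auto vectorization falls short.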

Thanks again for providing a very useful tool.
Aug 08 2022