
digitalmars.D.ldc - 512 bit static array to vector

reply Bruce Carneal <bcarneal gmail.com> writes:
Here's a comparison between ldc and gdc converting static arrays 
to 512 bit vectors:
https://godbolt.org/z/8jxafh76W

A few observations:
1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) 
they're the same.
2) LDC emits worse code for the cleaner .array assignment than 
for the union hack.
3) LDC fabricates non-HW __vectors so is(someVector) has 
diminished CT utility.
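For reference, here's a minimal D sketch of the two conversion styles being compared (the names a2vArray/a2vUnion match the godbolt link; the exact element type used there may differ, and LDC accepts the 512-bit `__vector` type even when the target CPU lacks AVX-512):

```d
import core.simd;

// Hypothetical 512-bit vector; LDC fabricates this type even on non-AVX-512 targets.
alias V = __vector(byte[64]);

V a2vArray(ref byte[64] a)
{
    V v;
    v.array = a;        // the "cleaner" .array assignment
    return v;
}

V a2vUnion(ref byte[64] a)
{
    union U { byte[64] arr; V vec; }
    U u;
    u.arr = a;          // the union hack: type-pun through overlapping storage
    return u.vec;
}
```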

Is improving LLVM/LDC wrt any of the above relatively simple?
Jun 19 2022
parent reply kinke <noone nowhere.com> writes:
On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - the caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).
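Concretely, the sret convention means the caller allocates the result and passes its address instead of receiving the vector in a register. Sketched by hand in plain D (not the actual signature LDC emits, just an illustration of the lowering):

```d
import core.simd;

alias V = __vector(byte[64]);

// Source-level form: V a2v(ref byte[64] a);
// What an sret lowering amounts to - the result slot travels as a pointer:
void a2v_sret(V* result, ref byte[64] a)
{
    V v;
    v.array = a;
    *result = v;    // stored through the caller-provided pointer, not returned in zmm0
}
```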
 3) LDC fabricates non-HW __vectors so is(someVector) has 
 diminished CT utility.
I consider that useful, e.g., allowing one to use a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
Jun 19 2022
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
 On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
 Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.
An unexpected choice there, given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer cares about maximum width-sensitive performance across the widest range of machines.
 The biggest difference with gdc is an ABI difference - gdc 
 returning the vector directly in an AVX512 register, whereas 
 LDC returns it indirectly (sret return - caller passes a 
 pointer to its pre-allocated result). That's a limitation of 
 the frontend's 
 https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which
 supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting
 extended for broader vectors...).
So, IIUC, gdc and ldc are not interoperable currently but will be once the frontend is updated?
 3) LDC fabricates non-HW __vectors so is(someVector) has 
 diminished CT utility.
 I consider that useful, e.g., allowing one to use a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, others?). Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
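A library-level approximation of such a trait could lean on the predefined D_SIMD/D_AVX/D_AVX2 version identifiers today. A sketch only: the 512-bit tier and any multi-target awareness would need real compiler support, and the D_AVX512F identifier below is hypothetical, standing in for whatever the compiler would expose:

```d
// Hypothetical sketch of maxISAVectorLengthFor(T): element count of the
// widest HW vector of T the compilation target supports.
template maxISAVectorLengthFor(T)
{
    // D_AVX512F is NOT a predefined version identifier today - it marks
    // where a real 512-bit tier would slot in.
    version (D_AVX512F)   enum maxISAVectorLengthFor = 64 / T.sizeof;
    else version (D_AVX2) enum maxISAVectorLengthFor = 32 / T.sizeof;
    else version (D_SIMD) enum maxISAVectorLengthFor = 16 / T.sizeof;
    else                  enum maxISAVectorLengthFor = 1;
}
```

A compiler-provided trait would additionally have to answer correctly per target in multi-target builds, which no library version block can do.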
Jun 19 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
 On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
 Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.
Note that llvm/ldc chooses a 512-bit, 2-instruction ld/st sequence for a2vUnion given x86-64-v4 as the target, but goes for a 256-bit-wide, 4-instruction ld/st sequence in a2vArray. As you note, `-mattr=avx512bw` forces a2vArray into the 2-instruction form, but apparently some difference in the IR presented to LLVM enables the choice of the shorter sequence for a2vUnion in either case? Just curious. Thanks for taking a look and for highlighting the workaround (specify avx512bw explicitly).
Jun 19 2022