
digitalmars.D.ldc - 512 bit static array to vector

reply Bruce Carneal <bcarneal gmail.com> writes:
Here's a comparison between ldc and gdc converting static arrays 
to 512 bit vectors:
https://godbolt.org/z/8jxafh76W

A few observations:
1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) 
they're the same.
2) LDC emits worse code for the cleaner .array assignment than 
for the union hack.
3) LDC fabricates non-HW __vectors so is(someVector) has 
diminished CT utility.
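For reference, here's a minimal D sketch of the two conversion styles being compared (the names a2vArray/a2vUnion match the godbolt link; the exact element type used there may differ, and LDC accepts the 512-bit `__vector` type even when the target CPU lacks AVX-512):

```d
import core.simd;

// Hypothetical 512-bit vector; LDC fabricates this type even on non-AVX-512 targets.
alias V = __vector(byte[64]);

V a2vArray(ref byte[64] a)
{
    V v;
    v.array = a;        // the "cleaner" .array assignment
    return v;
}

V a2vUnion(ref byte[64] a)
{
    union U { byte[64] arr; V vec; }
    U u;
    u.arr = a;          // the union hack: type-pun through overlapping storage
    return u.vec;
}
```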

Is improving LLVM/LDC wrt any of the above relatively simple?
Jun 19 2022
parent reply kinke <noone nowhere.com> writes:
On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - the caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).
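Concretely, the sret convention means the caller allocates the result and passes its address instead of receiving the vector in a register. Sketched by hand in plain D (not the actual signature LDC emits, just an illustration of the lowering):

```d
import core.simd;

alias V = __vector(byte[64]);

// Source-level form: V a2v(ref byte[64] a);
// What an sret lowering amounts to - the result slot travels as a pointer:
void a2v_sret(V* result, ref byte[64] a)
{
    V v;
    v.array = a;
    *result = v;    // stored through the caller-provided pointer, not returned in zmm0
}
```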
 3) LDC fabricates non-HW __vectors so is(someVector) has 
 diminished CT utility.
I consider that useful, e.g., allowing one to use a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
Jun 19 2022
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
 On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
 Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.
An unexpected choice there, given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer cares about maximum width-sensitive performance across the widest range of machines.
 The biggest difference with gdc is an ABI difference - gdc 
 returning the vector directly in an AVX512 register, whereas 
 LDC returns it indirectly (sret return - caller passes a 
 pointer to its pre-allocated result). That's a limitation of 
 the frontend's 
 https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which
 supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting
 extended for broader vectors...).
So, IIUC, gdc and ldc are not interoperable currently but will be once the frontend is updated?
 3) LDC fabricates non-HW __vectors so is(someVector) has 
 diminished CT utility.
 I consider that useful, e.g., allowing one to use a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, others?). Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
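A library-level approximation of such a trait could lean on the predefined D_SIMD/D_AVX/D_AVX2 version identifiers today. A sketch only: the 512-bit tier and any multi-target awareness would need real compiler support, and the D_AVX512F identifier below is hypothetical, standing in for whatever the compiler would expose:

```d
// Hypothetical sketch of maxISAVectorLengthFor(T): element count of the
// widest HW vector of T the compilation target supports.
template maxISAVectorLengthFor(T)
{
    // D_AVX512F is NOT a predefined version identifier today - it marks
    // where a real 512-bit tier would slot in.
    version (D_AVX512F)   enum maxISAVectorLengthFor = 64 / T.sizeof;
    else version (D_AVX2) enum maxISAVectorLengthFor = 32 / T.sizeof;
    else version (D_SIMD) enum maxISAVectorLengthFor = 16 / T.sizeof;
    else                  enum maxISAVectorLengthFor = 1;
}
```

A compiler-provided trait would additionally have to answer correctly per target in multi-target builds, which no library version block can do.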
Jun 19 2022
prev sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
 On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
 1) LDC requires more instructions at 512 bits. At 256 
 (x86-64-v3) they're the same.
 Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.
Note that llvm/ldc chooses a 512-bit, 2-instruction ld/st sequence for a2vUnion given x86-64-v4 as the target, but goes for a 256-bit-wide, 4-instruction ld/st sequence in a2vArray. As you note, `-mattr=avx512bw` forces a2vArray into the 2-instruction form, but apparently some difference in the IR presented to LLVM enables the choice of the shorter sequence for a2vUnion in either case? Just curious. Thanks for taking a look and for highlighting the workaround (specify avx512bw explicitly).
Jun 19 2022