
digitalmars.D.ldc - toy windowing auto-vec miss

Bruce Carneal <bcarneal gmail.com> writes:
Here's a simple godbolt example of one of the areas in which gdc 
solidly outperforms ldc wrt auto-vectorization: simple but not 
trivial operand gather.
https://godbolt.org/z/ox1vvxd8s
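
(The listing itself is only behind that link; for context, the pattern at issue is roughly of this shape. An illustrative sketch only, not the actual godbolt source:)

void window3(const(float)* src, float* dst, size_t n)
{
    // each output reads a small neighborhood of the input: an operand
    // gather that is simple, but not trivial, for an auto-vectorizer
    if (n < 2) return;
    foreach (i; 1 .. n - 1)
        dst[i] = 0.25f * src[i - 1] + 0.5f * src[i] + 0.25f * src[i + 1];
}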


Compile-time, target-adaptive manual __vector-ization is an answer 
here if you have no access to SIMT, so this isn't a show stopper, 
but the code is less readable.
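
To give a sense of what that looks like, here is a minimal sketch (the kernel, the names, and the AVX/SSE width choice are illustrative, not taken from the godbolt example):

import core.simd;

// pick a float vector width from what the compiler reports for the target
version (D_AVX)
    alias F = __vector(float[8]);   // 256-bit when the target has AVX
else
    alias F = __vector(float[4]);   // 128-bit baseline
enum lanes = F.sizeof / float.sizeof;

void scale(float* p, size_t n, float k)
{
    F kv = k;                        // scalar is broadcast to all lanes
    size_t i = 0;
    for (; i + lanes <= n; i += lanes)
    {
        // assumes p + i is suitably aligned for F; real code would peel
        // the head or use unaligned loads
        F v = *cast(F*)(p + i);
        *cast(F*)(p + i) = v * kv;
    }
    for (; i < n; ++i)               // scalar tail
        p[i] *= k;
}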

I'm not sure what the data parallel future should look like wrt 
language/IR, but I'm pretty sure we can do better than praying 
that the auto vectorizer can dig patterns out of for loops, or 
throwing ourselves on the manual vectorization grenade, 
repeatedly.
Nov 06 2022
rikki cattermole <rikki cattermole.co.nz> writes:
This might be a bit naive, but ldc's output is about a quarter smaller 
and it uses significantly fewer jumps.

Is gdc actually faster?
Nov 07 2022
Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 7 November 2022 at 09:56:13 UTC, rikki cattermole 
wrote:
 This might be a bit naive, but ldc's output is about a quarter 
 smaller, it uses significantly less jumps.

 Is gdc actually faster?
If you have long enough inputs, yes. A vectorized version overcomes the instruction-stream overhead quickly, after which the performance advantage trends toward N:1.

As you imply, measurement trumps in-one's-head modelling. I'll measure and report on the exact toy code later today, but real-world code with the same "simple but not trivial" operand pattern, involving Bayer/CFA data, has been measured and the performance gap verified. For that code the workaround was manual __vector-ization and use of a shuffle intrinsic.
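
For the curious, the shape of that workaround was roughly as follows (a sketch only, assuming LDC's ldc.simd.shufflevector and a float-per-sample RGRG row; the real Bayer/CFA code is more involved):

version (LDC)
{
    import core.simd : float4;
    import ldc.simd : shufflevector;

    // pull the 4 green samples out of 8 interleaved RGRG samples
    float4 greensOfRGRG(float4 a, float4 b)
    {
        // a = [R0, G0, R1, G1], b = [R2, G2, R3, G3];
        // mask indices 0..3 address a, 4..7 address b
        return shufflevector!(float4, 1, 3, 5, 7)(a, b);
    }
}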
Nov 07 2022
Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
 Here's a simple godbolt example of one of the areas in which 
 gdc solidly outperforms ldc wrt auto-vectorization: simple but 
 not trivial operand gather
 https://godbolt.org/z/ox1vvxd8s


 Compile time target adaptive manual __vector-ization is an 
 answer here if you have no access to SIMT, so not a show 
 stopper, but the code is less readable.

 I'm not sure what the data parallel future should look like wrt 
 language/IR but I'm pretty sure we can do better than praying 
 that the auto vectorizer can dig patterns out of for loops, or 
 throwing ourselves on the manual vectorization grenade, 
 repeatedly.
My "grenade" phrasing above was fun to write but overly dramatic. Manual __vector-ization is more tedious than dangerous and D ldc/gdc give you quite a bit of help there including 1) __vector types 2) CT max vector length introspection. Also, auto vectorization *does* work nicely against simple/and-or conditioned inputs/outputs. I believe there is a lot more to be had in the programmer-friendly-data-parallelism department, perhaps involving a (major) pivot to MLIR, but I give my considered thanks to those involved in providing what is already the best option in that arena from my point of view. Introspection, __vector, auto-vec, dcompute, ... it's a potent toolkit.
Nov 07 2022
Johan <j j.nl> writes:
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
 Here's a simple godbolt example of one of the areas in which 
 gdc solidly outperforms ldc wrt auto-vectorization: simple but 
 not trivial operand gather
 https://godbolt.org/z/ox1vvxd8s
Don't have time to dive deeper, but I found that removing `@restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).

-Johan
Nov 07 2022
Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
 On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
 Here's a simple godbolt example of one of the areas in which 
 gdc solidly outperforms ldc wrt auto-vectorization: simple but 
 not trivial operand gather
 https://godbolt.org/z/ox1vvxd8s
 Don't have time to dive deeper but I found that: Removing `@restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in ASM). -Johan
That's very interesting. This is the first time I've heard of restrict making things worse wrt auto vectorization. From what I've seen in other experiments, restrict frequently provides a minor benefit (code size reduction) while occasionally enabling vectorization of otherwise complex dependency graphs.

Thanks for the heads up.
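
For readers without the godbolt tab open, the comparison is roughly between signatures like these (a sketch, assuming LDC's ldc.attributes.restrict UDA; the actual kernel is the one in the link):

version (LDC)
{
    import ldc.attributes : restrict;

    // with @restrict the compiler may assume dst and src never alias
    void scaleCopyRestrict(@restrict float* dst, @restrict const(float)* src,
                           size_t n, float k)
    {
        foreach (i; 0 .. n)
            dst[i] = k * src[i];
    }

    // the same loop with the UDA removed -- the variant that ended up
    // vectorized in this experiment
    void scaleCopyPlain(float* dst, const(float)* src, size_t n, float k)
    {
        foreach (i; 0 .. n)
            dst[i] = k * src[i];
    }
}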
Nov 07 2022
Johan <j j.nl> writes:
On Monday, 7 November 2022 at 18:14:44 UTC, Bruce Carneal wrote:
 On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
 On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal 
 wrote:
 Here's a simple godbolt example of one of the areas in which 
 gdc solidly outperforms ldc wrt auto-vectorization: simple 
 but not trivial operand gather
 https://godbolt.org/z/ox1vvxd8s
 Don't have time to dive deeper but I found that: Removing `@restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in ASM). -Johan
 That's very interesting. This is the first time I've heard of restrict making things worse wrt auto vectorization. From what I've seen in other experiments, restrict provides a minor benefit (code size reduction) frequently while occasionally enabling vectorization of otherwise complex dependency graphs.
Yeah, this is an LLVM bug.

If you're interested in digging around a bit further, you can look at how the individual optimization passes change the IR code: https://godbolt.org/z/e9nqPfeKn

The loop vectorization pass does nothing for the `@restrict` case. Note that the input to that pass is slightly different: the `@restrict` case has a more complex for.body.preheader and 3 phi nodes in the for.body (compared to 1 in the non-restrict case).

-Johan
Nov 07 2022