
digitalmars.D.learn - auto vectorization notes

reply Bruce Carneal <bcarneal gmail.com> writes:
When speeds are equivalent, or very close, I usually prefer auto 
vectorized code to explicit SIMD/__vector code as it's easier to 
read.  (on the downside you have to guard against compiler 
code-gen performance regressions)

One oddity I've noticed is that I sometimes need to use 
pragma(inline, false) in order to get ldc to "do the right 
thing". Apparently the compiler sees the costs/benefits 
differently in the standalone context.
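
For example, something like this minimal sketch (the function and 
the saxpy-ish loop are just illustrative, not from real code):

pragma(inline, false)  // counter-intuitively, this nudges ldc to
                       // cost the loop standalone and vectorize it
void scaleAdd(float[] dst, const(float)[] src, float k)
{
    foreach (i; 0 .. dst.length)
        dst[i] += k * src[i];
}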

More widely known techniques that have gotten people over the 
serial/SIMD hump include (a combined sketch follows the list):
  1) simplified indexing relationships
  2) known count inner loops (chunkify)
  3) static foreach blocks (manual inlining that the compiler 
"gathers")

I'd be interested to hear from others regarding their auto 
vectorization and __vector experiences.  What has worked and what 
hasn't worked in your performance sensitive dlang code?
Mar 23 2020
parent reply Crayo List <crayolist gmail.com> writes:
On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
 When speeds are equivalent, or very close, I usually prefer 
 auto vectorized code to explicit SIMD/__vector code as it's 
 easier to read.  (on the downside you have to guard against 
 compiler code-gen performance regressions)

 One oddity I've noticed is that I sometimes need to use 
 pragma(inline, false) in order to get ldc to "do the right 
 thing". Apparently the compiler sees the costs/benefits 
 differently in the standalone context.

 More widely known techniques that have gotten people over the 
 serial/SIMD hump include:
  1) simplified indexing relationships
  2) known count inner loops (chunkify)
  3) static foreach blocks (manual inlining that the compiler 
 "gathers")

 I'd be interested to hear from others regarding their auto 
 vectorization and __vector experiences.  What has worked and 
 what hasn't worked in your performance sensitive dlang code?
auto vectorization is bad because you never know if your code will get vectorized the next time you make some change to it and recompile. Just use: https://ispc.github.io/
Mar 27 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
 On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
 [snip]
 (on the downside you have to guard against compiler code-gen 
 performance regressions)
 auto vectorization is bad because you never know if your code will get vectorized the next time you make some change to it and recompile. Just use: https://ispc.github.io/
Yes, that's a downside: you have to measure your performance sensitive code if you change it *or* change compilers or targets. Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable.

I find SIMT code readability better than SIMD but a little worse than auto-vectorizable kernels. Hugely better performance though, for less effort than SIMD, if your platform supports it.

Is anyone actively using dcompute (SIMT enabler)? Unless I hear bad things I'll try it down the road, as neither going back to CUDA nor "forward" to the SycL-verse appeals.
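
For concreteness, the kind of explicit __vector code I'm contrasting 
with looks roughly like this (guarded because vector support is 
target dependent; the names are just illustrative):

static if (is(__vector(float[4])))
{
    alias float4 = __vector(float[4]);

    // dst[i] += k * src[i], four lanes per iteration
    void scaleAdd4(float4[] dst, const(float4)[] src, float k)
    {
        const float4 kv = k;  // scalar broadcast into all four lanes
        foreach (i; 0 .. dst.length)
            dst[i] += kv * src[i];
    }
}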
Mar 27 2020
parent reply Crayo List <crayolist gmail.com> writes:
On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:
 On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
 On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
 [snip]
 (on the downside you have to guard against compiler code-gen 
 performance regressions)
 auto vectorization is bad because you never know if your code will get vectorized the next time you make some change to it and recompile. Just use: https://ispc.github.io/
 Yes, that's a downside: you have to measure your performance sensitive code if you change it *or* change compilers or targets. Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable.
This is not true! The idea of ispc is to write portable code that will vectorize predictably based on the target CPU. The object file/binary is not portable, if that is what you meant. Also, I find it readable.
 I find SIMT code readability better than SIMD but a little 
 worse than auto-vectorizable kernels.  Hugely better 
 performance though for less effort than SIMD if your platform 
 supports it.
Again, I don't think this is true. Unless I am misunderstanding you, SIMT and SIMD are not mutually exclusive, and if you need performance then you must use both. Also, depending on the workload and processor, SIMD may be much more effective than SIMT.
Mar 28 2020
parent Bruce Carneal <bcarneal gmail.com> writes:
On Saturday, 28 March 2020 at 18:01:37 UTC, Crayo List wrote:
 On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:
 On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
 On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
 [snip]
 Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable.
 This is not true! The idea of ispc is to write portable code that will vectorize predictably based on the target CPU. The object file/binary is not portable, if that is what you meant. Also, I find it readable.
There are many waypoints on the readability <==> performance axis. If ispc works for you along that axis, great!
 I find SIMT code readability better than SIMD but a little 
 worse than auto-vectorizable kernels.  Hugely better 
 performance though for less effort than SIMD if your platform 
 supports it.
 Again, I don't think this is true. Unless I am misunderstanding you, SIMT and SIMD are not mutually exclusive, and if you need performance then you must use both. Also, depending on the workload and processor, SIMD may be much more effective than SIMT.
SIMD might become part of the solution under the hood for a number of reasons including: ease of deployment, programmer familiarity, PCIe xfer overheads, kernel launch overhead, memory subsystem suitability, existing code base issues, ...

SIMT works for me in high throughput situations where it's hard to "take a log" on the problem. SIMD, in auto-vectorizable or more explicit form, works in others. Combinations can be useful, but most of the work I've come in contact with splits pretty clearly along the memory bandwidth divide (SIMT on one side, SIMD/CPU on the other). Others need a plus-up in arithmetic horsepower. The more extreme the requirements, the more attractive SIMT appears. (hence my excitement about dcompute possibly expanding the dlang performance envelope with much less cognitive load than CUDA/OpenCL/SycL/...)

On the readability front, I find per-lane programming, even with the current thread-divergence caveats, to be easier to reason about wrt correctness and performance predictability than other approaches. Apparently your mileage does vary.

When you have chosen SIMD, whether ispc or other, over SIMT, what drove the decision? Performance? Ease of programming to reach a target speed?
Mar 28 2020