## digitalmars.D - Auto-Vectorization and array/vector operations

"Steven" <stevenwilson500 gmail.com> writes:
```I was trying to show someone how awesome Dlang was earlier, and
how the vector operations are expected to take advantage of the
CPU vector instructions, and was dumbstruck when dmd and gdc both
failed to auto-vectorize a simple case.  I've stripped it down to
the bare minimum and loaded the example on the interactive
compiler:
http://asm.dlang.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22JYWwDg9gTgLgBAY2gUwHQGdQBMDcAoPAMwBsIBDGAbQCYBWANgF05kAPM8Y5AQQAoTyVOkzhkANHAEUaDZgCMAlHgDeeOJNLThzBPnUB6fXCjJ0AV2Ix0cYADs45uenRrElZgF5R7uAFo4cu56xsgwZlD2ungAvgRSQrIs7JzIAEL8mgki4hqCMiKKKq7xAByUAMzUjABuZHBeCGToMBmCZZWMCmTBpRVV1XL1iE0tvR0Kcj2Z7f1R6q6GIeaW1nYOZnJgLurVCD5etT7%2BA0EE6iZhEcPNrVqyCrv40UAAA%3D%22%2C%22compiler%22%3A%22dmd2067%22%2C%22options%22%3A%22-O%20-release%20-inline%20-boundscheck%3Doff%22%7D%5D%7D

The reference documentation for arrays says:
Implementation note: many of the more common vector operations
are expected to take advantage of any vector math instructions
available on the target computer.

Does this mean that while compilers are expected to take
advantage of them, they currently do not, even when they have
proper alignment?  I haven't tried LDC yet, so maybe LDC does
perform auto-vectorization and I should attempt to use LDC if I
plan on using vector ops a lot?

import core.simd;

float[256] exampleA(float[256] a, float[256] b)
{
    float[256] c;
    // results in subss (scalar instruction)
    c[] = a[] - b[];
    return c;
}

float[256] exampleB(float[256] a, float[256] b)
{
    float8[32] va = cast(float8[32]) a;
    float8[32] vb = cast(float8[32]) b;
    float8[32] vc;

    // results in subps (vector instruction)
    vc[] = va[] - vb[];

    return cast(float[256]) vc;
}
```
Jul 15 2015
"jmh530" <john.michael.hall gmail.com> writes:
```On Wednesday, 15 July 2015 at 22:42:05 UTC, Steven wrote:
> I was trying to show someone how awesome Dlang was earlier, and
> how the vector operations are expected to take advantage of the
> CPU vector instructions, and was dumbstruck when dmd and gdc
> both failed to auto-vectorize a simple case.  I've stripped it
> down to the bare minimum and loaded the example on the
> interactive compiler:

I'm not sure how the compilers handle auto-vectorization, but I
found
http://dconf.org/2013/talks/evans_2.html
informative. It recommends not casting between float and SIMD
types.
```
Jul 15 2015
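One way to read the advice above about not casting between float and SIMD types is to keep the data in a SIMD type from the start. The following is only a minimal sketch, assuming an x86-64 target where `core.simd` provides `float4`; the function name `subtractLanes` is made up for illustration and is not taken from the talk.

```d
import core.simd;

// Hypothetical helper: the operands are stored as float4 all along, so no
// float[] <-> float4 cast is needed at the point of the arithmetic.
void subtractLanes(float4[] a, float4[] b, float4[] c)
{
    assert(a.length == b.length && b.length == c.length);
    foreach (i; 0 .. a.length)
        c[i] = a[i] - b[i]; // lane-wise subtraction on a whole float4
}

void main()
{
    float4 x = [4.0f, 5.0f, 6.0f, 7.0f];
    float4 y = [1.0f, 1.0f, 2.0f, 2.0f];
    float4[] a = [x];
    float4[] b = [y];
    auto c = new float4[1];

    subtractLanes(a, b, c);
    assert(c[0].array == [3.0f, 4.0f, 4.0f, 5.0f]);
}
```

Whether this beats the cast-based version from the original post would have to be checked in the generated assembly for each compiler.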
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Wednesday, 15 July 2015 at 22:42:05 UTC, Steven wrote:
> I was trying to show someone how awesome Dlang was earlier, and
> how the vector operations are expected to take advantage of the
> CPU vector instructions, and was dumbstruck when dmd and gdc
> both failed to auto-vectorize a simple case.  I've stripped it
> down to the bare minimum and loaded the example on the
> interactive compiler:

[...]

Not sure why DMD isn't using SIMD on the first one; I haven't
looked at that code in a while. Anyway, gdc vectorises both:
http://goo.gl/CzD15s and that's with the gcc 4.9 backend; it can
probably do better built against something more recent.
```
Jul 16 2015
Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On 16 July 2015 at 00:42, Steven via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
> I was trying to show someone how awesome Dlang was earlier, and how the
> vector operations are expected to take advantage of the CPU vector
> instructions, and was dumbstruck when dmd and gdc both failed to
> auto-vectorize a simple case.  I've stripped it down to the bare minimum and
> loaded the example on the interactive compiler:

> The reference documentation for arrays says:
> Implementation note: many of the more common vector operations are expected
> to take advantage of any vector math instructions available on the target
> computer.

DMD makes use of vector operations in the library, rather than in
the generated code.  So as long as you are doing array operations
using any of the supported types...

> Does this mean that while compilers are expected to take advantage of them,
> they currently do not, even when they have proper alignment?  I haven't
> tried LDC yet, so maybe LDC does perform auto-vectorization and I should
> attempt to use LDC if I plan on using vector ops a lot?

Auto-vectorization is deliberately strict in what triggers it to occur.

It is possible to give the compiler hints; however, I'm not sure that
this should be done by the code generator.

See, for example: http://goo.gl/iMBbRs

Regards
Iain
```
Jul 16 2015
"Casper Færgemand" <shorttail hotmail.com> writes:
```On Wednesday, 15 July 2015 at 22:42:05 UTC, Steven wrote:
> Does this mean that while compilers are expected to take
> advantage of them, they currently do not, even when they have
> proper alignment?  I haven't tried LDC yet, so maybe LDC does
> perform auto-vectorization and I should attempt to use LDC if I
> plan on using vector ops a lot?

If you want to use vector operations, YOU have to write the code
for it. Addition and multiplication seem like easy things to have
vectorized automatically, but it's complicated to do (I don't
know of any compiler that does a convincing and reliable job of
auto-vectorization) and likely it won't give you many of the
other useful vector operations.

Someone from a game engine company (Unreal?) held a nice talk
about how SIMD pervades their code base, including memory layout,
in order to get decent performance on the less trivial
operations, like dot product. I don't have a link though. =/

The main reason to write the SIMD code yourself is that you know
that it's going to work the way you want. There won't ever be a
case where you add one more variable to some structure and the
compiler decides it can no longer auto-vectorize a loop.
```
Jul 18 2015
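The point above about SIMD shaping memory layout can be illustrated with a dot product. This is a rough sketch only, not taken from the talk mentioned in the post: it assumes an x86-64 target where `core.simd` provides `float4`, and the `Vec3x4` type and `dot` function are hypothetical names. Storing four 3-D vectors "structure of arrays" style lets four dot products be computed with plain lane-wise multiplies and adds, with no shuffling across lanes.

```d
import core.simd;

// Hypothetical structure-of-arrays layout: lane i of x, y and z together
// form the i-th 3-D vector, so one Vec3x4 holds four vectors.
struct Vec3x4
{
    float4 x, y, z;
}

// Four dot products at once: lane i of the result is dot(a_i, b_i).
float4 dot(Vec3x4 a, Vec3x4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

void main()
{
    float4 ax = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 ay = [0.0f, 1.0f, 0.0f, 1.0f];
    float4 az = [2.0f, 2.0f, 2.0f, 2.0f];
    float4 bx = [1.0f, 1.0f, 1.0f, 1.0f];
    float4 by = [5.0f, 5.0f, 5.0f, 5.0f];
    float4 bz = [0.5f, 0.5f, 0.5f, 0.5f];

    float4 r = dot(Vec3x4(ax, ay, az), Vec3x4(bx, by, bz));
    assert(r.array == [2.0f, 8.0f, 4.0f, 10.0f]);
}
```

With the more familiar array-of-structures layout (one `x, y, z` triple per vector), the same computation needs a horizontal sum inside each vector, which is the kind of less trivial operation the post refers to.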