
digitalmars.D.learn - LLVM asm with constraints, and 2 operands

reply Guillaume Piolat <first.name domain.tld> writes:
Is anyone versed in LLVM inline asm?

I know how to generate SIMD unary op with:

     return __asm!int4("pmovsxwd $1,$0","=x,x",a);

but I struggle to generate 2-operands SIMD ops like:

     return __asm!int4("paddd $1,$0","=x,x",a, b);

If you know how to do it => https://d.godbolt.org/z/ccM38bfMT. It 
would probably help the build speed of SIMD-heavy code, and also 
-O0 performance.
Also, generating the right instruction is good, but it must resist 
optimization too, so proper LLVM constraints are needed. It would 
be really helpful if someone has understood the cryptic rules of 
LLVM assembly constraints.
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 11:42:24 UTC, Guillaume Piolat wrote:
 Is anyone versed in LLVM inline asm?

 I know how to generate SIMD unary op with:

     return __asm!int4("pmovsxwd $1,$0","=x,x",a);

 but I struggle to generate 2-operands SIMD ops like:

     return __asm!int4("paddd $1,$0","=x,x",a, b);

 If you know how to do it => https://d.godbolt.org/z/ccM38bfMT. 
 It would probably help the build speed of SIMD-heavy code, and 
 also -O0 performance.
 Also, generating the right instruction is good, but it must 
 resist optimization too, so proper LLVM constraints are needed. 
 It would be really helpful if someone has understood the 
 cryptic rules of LLVM assembly constraints.
Yeah, I can confirm it's awful. It took me hours to understand how to use it a bit (my PL has [an interface](https://styx-lang.gitlab.io/styx/primary_expressions.html#asmexpression) for LLVM asm).

You need to add an "x" to the constraint string:

    return __asm!int4("paddd $1,$0","=x,x,x",a, b);

- **=x** says "returns in whatever it has to"
- **x** (1) is the constraint for input `a`, which is passed as operand **$0**
- **x** (2) is the constraint for input `b`, which is passed as operand **$1**

So the thing to get is that the output constraint does not consume anything else; it is standalone.
Jul 18 2021
next sibling parent reply Guillaume Piolat <first.name domain.tld> writes:
On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 Yeah, I can confirm it's awful. It took me hours to understand 
 how to use it a bit (my PL has [an 
 interface](https://styx-lang.gitlab.io/styx/primary_expressions.html#asmexpression) 
 for LLVM asm)

 You need to add a "x" to the constraint string

     return __asm!int4("paddd $1,$0","=x,x,x",a, b);

 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed as 
 operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed as 
 operand **$1**

 So the thing to get is that the output constraint does not 
 consume anything else, it is standalone.
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack.

A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de

I need to check that more...
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 18:47:50 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
but in any case there's a bug.
Jul 18 2021
parent Guillaume Piolat <first.name domain.tld> writes:
On Sunday, 18 July 2021 at 18:48:47 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 18:47:50 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat 
 wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
but in any case there's a bug.
I checked, and thankfully it works when the enclosed function is inlined into an extern(C) function; it respects the extern(C) ABI.
Jul 18 2021
prev sibling parent reply kinke <noone nowhere.com> writes:
On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed as 
 operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed as 
 operand **$1**
$0 is actually the output operand, $1 is `a`, and $2 is `b`. The official docs are here, but IMO not very user-friendly: https://llvm.org/docs/LangRef.html#inline-assembler-expressions

I recommend using GDC/GCC inline asm instead, where you'll find more examples. For the given paddd example, I'd have gone with

```
int4 _mm_add_int4(int4 a, int4 b)
{
    asm { "paddd %1, %0" : "=*x" (a) : "x" (b); }
    // the above is equivalent to:
    // __asm!void("paddd $1, $0","=*x,x", &a, b);
    return a;
}
```

but the produced asm is rubbish (apparently an LLVM issue):

```
movaps  %xmm1, -24(%rsp)
paddd   %xmm0, %xmm0   // WTF?
movaps  %xmm0, -24(%rsp)
retq
```

What works reliably is a manual mov:

```
int4 _mm_add_int4(int4 a, int4 b)
{
    int4 r;
    asm { "paddd %1, %2; movdqa %2, %0" : "=x" (r) : "x" (a), "x" (b); }
    return r;
}
```

=>

```
paddd   %xmm1, %xmm0
movdqa  %xmm0, %xmm0   // useless but cannot be optimized away
retq
```

Note: inline asm syntax and resulting asm in AT&T syntax, *not* Intel syntax.
Jul 19 2021
next sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 What works reliably is a manual mov:
OK that's what I feared. It's very easy to get that wrong. Thankfully I haven't used __asm a lot.
Jul 19 2021
prev sibling next sibling parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 What works reliably is a manual mov:

 ```
 int4 _mm_add_int4(int4 a, int4 b)
 {
     int4 r;
     asm { "paddd %1, %2; movdqa %2, %0" : "=x" (r) : "x" (a), 
 "x" (b); }
     return r;
 }
 ```
This workaround is actually missing the clobber constraint for `%2`, which might be problematic after inlining.

You can also specify the registers explicitly like so (here exploiting ABI knowledge about `a` being passed in XMM1, and `b` in XMM0, for extern(D)):

```
int4 _mm_add_int4(int4 a, int4 b)
{
    asm { "paddd %1, %0" : "=xmm0" (b) : "xmm1" (a), "xmm0" (b); }
    return b;
}
```

=>

```
paddd   xmm0, xmm1
ret
```

But this will likely tamper with LLVM register allocation optimizations after inlining...
Jul 19 2021
next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
Jul 19 2021
parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 11:16:49 UTC, Tejas wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
Jul 19 2021
next sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 16:05:57 UTC, kinke wrote:
 Is LDC still compatible with GDC/GCC inline asm? I remember 
 Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
It went under my radar. Thanks for the tips in this thread.
Jul 19 2021
prev sibling parent Tejas <notrealemail gmail.com> writes:
On Monday, 19 July 2021 at 16:05:57 UTC, kinke wrote:
 On Monday, 19 July 2021 at 11:16:49 UTC, Tejas wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
'They' meant the LDC developers as a whole. Seems like I might have misunderstood what he was writing, if GCC-style asm support is so recent.
Jul 19 2021
prev sibling parent reply Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 This workaround is actually missing the clobber constraint for 
 `%2`, which might be problematic after inlining.
Another, unrelated issue with asm/__asm is that it doesn't follow consistent VEX encoding compared to normal compiler output. Sometimes you might want:

    paddq x, y

at other times:

    vpaddq x, y, z

but rarely both in the same program. So this can easily nullify any gain obtained, due to VEX transition costs (if they are still a thing).
Jul 19 2021
parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 16:44:35 UTC, Guillaume Piolat wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 This workaround is actually missing the clobber constraint for 
 `%2`, which might be problematic after inlining.
Another, unrelated issue with asm/__asm is that it doesn't follow consistent VEX encoding compared to normal compiler output. Sometimes you might want:

    paddq x, y

at other times:

    vpaddq x, y, z

but rarely both in the same program. So this can easily nullify any gain obtained, due to VEX transition costs (if they are still a thing).
You know that asm is to be avoided whenever possible, but unfortunately, AFAIK intel-intrinsics doesn't fit the usual 'don't worry, simply compile all your code with an appropriate -mattr/-mcpu option' recommendation, as it employs runtime detection of available CPU instructions.

I've just tried another option, but that doesn't play nice with inlining:

```
import core.simd;
import ldc.attributes;

@target("sse2") // use SSE2 for this function
int4 _mm_add_int4(int4 a, int4 b)
{
    return a + b; // perfect: paddd %xmm1, %xmm0
}

int4 wrapper(int4 a, int4 b)
{
    return _mm_add_int4(a, b);
}
```

Compiling with `-O -mtriple=i686-linux-gnu -mcpu=i686` (=> no SSE2 by default) shows that the inlined version inside `wrapper()` is the mega slow one, so the extra instructions aren't applied transitively, unfortunately.
Jul 19 2021
next sibling parent kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 17:20:21 UTC, kinke wrote:
 Compiling with `-O -mtriple=i686-linux-gnu -mcpu=i686` (=> no 
 SSE2 by default) shows that the inlined version inside 
 `wrapper()` is the mega slow one, so the extra instructions 
 aren't applied transitively unfortunately.
Erm, sorry, I should have looked more closely: it's not inlined, and the call seems extremely expensive too, with state pushing and popping going on, apparently to account for the different targets. Brrr, to be avoided at all costs for such tiny functions. :)
Jul 19 2021
prev sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 17:20:21 UTC, kinke wrote:
 You know that asm is to be avoided whenever possible, but 
 unfortunately, AFAIK intel-intrinsics doesn't fit the usual 
 'don't worry, simply compile all your code with an appropriate 
 -mattr/-mcpu option' recommendation, as it employs runtime 
 detection of available CPU instructions.
intel-intrinsics employs compile-time detection of CPU instructions. If they're not available, it will work anyway(tm) with alternate, slower paths (and indeed it needs the right -mattr, so this is the one worry you do get).

So I'm not using target("feature") right now; I'd figured it would be helpful for runtime dispatch, but that means littering the code with __traits(targetHasFeature).
Jul 19 2021
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed 
 as operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed 
 as operand **$1**
$0 is actually the output operand, $1 is `a`, and $2 is `b`.

[...]

Note: inline asm syntax and resulting asm in AT&T syntax, *not* Intel syntax.
Yeah, thanks for the clarification; I totally forgot about that. And what about the `extern(C)` issue? Does it make sense to be used when the parameters are int4?
Jul 19 2021
parent kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 11:39:02 UTC, Basile B. wrote:
 And what about the `extern(C)` issue ? Does it make sense to be 
 used when the parameters are int4 ?
The original inline asm was buggy and only 'worked' by accident (not using the 2nd input operand at all...) with extern(D) reversed parameters.

At least for Posix x64, the C calling convention is well-defined for vectors and equivalent to extern(D) (except for the latter's parameter reversal). Windows and 32-bit x86 are different; for Windows, extern(D) pays off, as LDC's ABI is similar to the MSVC++ __vectorcall calling convention (passing vectors in SIMD registers).
Jul 19 2021