
digitalmars.D.learn - LLVM asm with constraints, and 2 operands

reply Guillaume Piolat <first.name domain.tld> writes:
Is anyone versed in LLVM inline asm?

I know how to generate SIMD unary op with:

     return __asm!int4("pmovsxwd $1,$0","=x,x",a);

but I struggle to generate 2-operands SIMD ops like:

     return __asm!int4("paddd $1,$0","=x,x",a, b);

If you know how to do it => https://d.godbolt.org/z/ccM38bfMT. It 
would probably help the build speed of SIMD-heavy code, and also 
-O0 performance.
Also, generating the right instruction is good, but it must resist 
optimization too, so proper LLVM constraints are needed. It would 
be really helpful if someone has understood the cryptic rules of 
LLVM assembly constraints.
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 11:42:24 UTC, Guillaume Piolat wrote:
 Is anyone versed in LLVM inline asm?

 I know how to generate SIMD unary op with:

     return __asm!int4("pmovsxwd $1,$0","=x,x",a);

 but I struggle to generate 2-operands SIMD ops like:

     return __asm!int4("paddd $1,$0","=x,x",a, b);

 If you know how to do it => https://d.godbolt.org/z/ccM38bfMT. 
 It would probably help the build speed of SIMD-heavy code, and 
 also -O0 performance.
 Also, generating the right instruction is good, but it must 
 resist optimization too, so proper LLVM constraints are needed. 
 It would be really helpful if someone has understood the 
 cryptic rules of LLVM assembly constraints.
Yeah, I can confirm it's awful. It took me hours to understand how to use it a bit (my PL has [an interface](https://styx-lang.gitlab.io/styx/primary_expressions.html#asmexpression) for LLVM asm).

You need to add an "x" to the constraint string:

    return __asm!int4("paddd $1,$0","=x,x,x",a, b);

- **=x** says "returns in whatever it has to"
- **x** (1) is the constraint for input `a`, which is passed as operand **$0**
- **x** (2) is the constraint for input `b`, which is passed as operand **$1**

So the thing to get is that the output constraint does not consume anything else; it is standalone.
Jul 18 2021
next sibling parent reply Guillaume Piolat <first.name domain.tld> writes:
On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 Yeah, I can confirm it's awful. It took me hours to understand 
 how to use it a bit (my PL has [an 
 interface](https://styx-lang.gitlab.io/styx/primary_expressions.html#asmexpression) 
 for LLVM asm)

 You need to add a "x" to the constraint string

     return __asm!int4("paddd $1,$0","=x,x,x",a, b);

 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed as 
 operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed as 
 operand **$1**

 So the thing to get is that the output constraint does not 
 consume anything else, it is standalone.
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack.

A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de

I need to check that more...
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
Jul 18 2021
parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 18 July 2021 at 18:47:50 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
but in any case there's a bug.
Jul 18 2021
parent Guillaume Piolat <first.name domain.tld> writes:
On Sunday, 18 July 2021 at 18:48:47 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 18:47:50 UTC, Basile B. wrote:
 On Sunday, 18 July 2021 at 17:45:05 UTC, Guillaume Piolat 
 wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 [...]
Thanks. Indeed that seems to work even when inlined and optimized. Registers are spilled to the stack. A minor concern is what happens when the enclosing function is extern(C) => https://d.godbolt.org/z/s6dM3a3de I need to check that more...
I think this should be rejected just like when you use D arrays in the interface of an `extern(C)` func, as C has no equivalent of __vector (afaik).
but in any case there's a bug.
I checked, and thankfully it works when the enclosed function is inlined into an extern(C) function; it respects the extern(C) ABI.
Jul 18 2021
prev sibling parent reply kinke <noone nowhere.com> writes:
On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed as 
 operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed as 
 operand **$1**
$0 is actually the output operand, $1 is `a`, and $2 is `b`. The official docs are here, but IMO not very user-friendly: https://llvm.org/docs/LangRef.html#inline-assembler-expressions

I recommend using GDC/GCC inline asm instead, where you'll find more examples. For the given paddd example, I'd have gone with

```
int4 _mm_add_int4(int4 a, int4 b)
{
    asm { "paddd %1, %0" : "=*x" (a) : "x" (b); }
    // the above is equivalent to:
    // __asm!void("paddd $1, $0","=*x,x", &a, b);
    return a;
}
```

but the produced asm is rubbish (apparently an LLVM issue):

```
movaps  %xmm1, -24(%rsp)
paddd   %xmm0, %xmm0   // WTF?
movaps  %xmm0, -24(%rsp)
retq
```

What works reliably is a manual mov:

```
int4 _mm_add_int4(int4 a, int4 b)
{
    int4 r;
    asm { "paddd %1, %2; movdqa %2, %0" : "=x" (r) : "x" (a), "x" (b); }
    return r;
}
```

=>

```
paddd   %xmm1, %xmm0
movdqa  %xmm0, %xmm0   // useless but cannot be optimized away
retq
```

Note: inline asm syntax and resulting asm in AT&T syntax, *not* Intel syntax.
Jul 19 2021
next sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 What works reliably is a manual mov:
OK that's what I feared. It's very easy to get that wrong. Thankfully I haven't used __asm a lot.
Jul 19 2021
prev sibling next sibling parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 What works reliably is a manual mov:

 ```
 int4 _mm_add_int4(int4 a, int4 b)
 {
     int4 r;
     asm { "paddd %1, %2; movdqa %2, %0" : "=x" (r) : "x" (a), 
 "x" (b); }
     return r;
 }
 ```
This workaround is actually missing the clobber constraint for `%2`, which might be problematic after inlining.

You can also specify the registers explicitly like so (here exploiting ABI knowledge about `a` being passed in XMM1, and `b` in XMM0, for extern(D)):

```
int4 _mm_add_int4(int4 a, int4 b)
{
    asm { "paddd %1, %0" : "=xmm0" (b) : "xmm1" (a), "xmm0" (b); }
    return b;
}
```

=>

```
paddd   xmm0, xmm1
ret
```

But this will likely tamper with LLVM register allocation optimizations after inlining...
Jul 19 2021
next sibling parent reply Tejas <notrealemail gmail.com> writes:
On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
Jul 19 2021
parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 11:16:49 UTC, Tejas wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
Jul 19 2021
next sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 16:05:57 UTC, kinke wrote:
 Is LDC still compatible with GDC/GCC inline asm? I remember 
 Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
It went under my radar. Thanks for the tips in this thread.
Jul 19 2021
prev sibling parent Tejas <notrealemail gmail.com> writes:
On Monday, 19 July 2021 at 16:05:57 UTC, kinke wrote:
 On Monday, 19 July 2021 at 11:16:49 UTC, Tejas wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 On[snip]
Is LDC still compatible with GDC/GCC inline asm? I remember Johan saying they will break compatibility in the near future...
I'm not aware of any of that; who'd be 'they'? GCC breaking their syntax is IMO unimaginable. LDC supporting it (to some extent) is pretty recent; it was introduced with v1.21.
'They' meant the LDC developers as a whole. Seems like I might have misunderstood what he was writing, if GCC-style asm support is so recent.
Jul 19 2021
prev sibling parent reply Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 This workaround is actually missing the clobber constraint for 
 `%2`, which might be problematic after inlining.
Another, unrelated issue with asm/__asm is that it doesn't follow consistent VEX encoding compared to normal compiler output. Sometimes you might want:

    paddq x, y

at other times:

    vpaddq x, y, z

but rarely both in the same program. So this can easily nullify any gain obtained, due to VEX transition costs (if they are still a thing).
Jul 19 2021
parent reply kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 16:44:35 UTC, Guillaume Piolat wrote:
 On Monday, 19 July 2021 at 10:49:56 UTC, kinke wrote:
 This workaround is actually missing the clobber constraint for 
 `%2`, which might be problematic after inlining.
Another, unrelated issue with asm/__asm is that it doesn't follow consistent VEX encoding compared to normal compiler output. Sometimes you might want:

    paddq x, y

at other times:

    vpaddq x, y, z

but rarely both in the same program. So this can easily nullify any gain obtained, due to VEX transition costs (if they are still a thing).
You know that asm is to be avoided whenever possible, but unfortunately, AFAIK intel-intrinsics doesn't fit the usual 'don't worry, simply compile all your code with an appropriate -mattr/-mcpu option' recommendation, as it employs runtime detection of available CPU instructions.

I've just tried another option, but that doesn't play nice with inlining:

```
import core.simd;
import ldc.attributes;

@target("sse2") // use SSE2 for this function
int4 _mm_add_int4(int4 a, int4 b)
{
    return a + b; // perfect: paddd %xmm1, %xmm0
}

int4 wrapper(int4 a, int4 b)
{
    return _mm_add_int4(a, b);
}
```

Compiling with `-O -mtriple=i686-linux-gnu -mcpu=i686` (=> no SSE2 by default) shows that the inlined version inside `wrapper()` is the mega slow one, so the extra instructions aren't applied transitively, unfortunately.
Jul 19 2021
next sibling parent kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 17:20:21 UTC, kinke wrote:
 Compiling with `-O -mtriple=i686-linux-gnu -mcpu=i686` (=> no 
 SSE2 by default) shows that the inlined version inside 
 `wrapper()` is the mega slow one, so the extra instructions 
 aren't applied transitively unfortunately.
Erm, sorry, I should have looked more closely: it's not inlined, and the call seems extremely expensive too, with state pushing and popping going on, apparently to account for the different targets. Brrr, to be avoided at all costs for such tiny functions. :)
Jul 19 2021
prev sibling parent Guillaume Piolat <first.name domain.tld> writes:
On Monday, 19 July 2021 at 17:20:21 UTC, kinke wrote:
 You know that asm is to be avoided whenever possible, but 
 unfortunately, AFAIK intel-intrinsics doesn't fit the usual 
 'don't worry, simply compile all your code with an appropriate 
 -mattr/-mcpu option' recommendation, as it employs runtime 
 detection of available CPU instructions.
intel-intrinsics employs compile-time detection of CPU instructions. If they're not available, it will work anyway(tm) with alternate, slower paths (and indeed it needs the right -mattr, so this is the one worry you do get).

So I'm not using target("feature") right now; I'd figured it would be helpful for runtime dispatch, but that means littering the code with __traits(targetHasFeature).
Jul 19 2021
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 19 July 2021 at 10:21:58 UTC, kinke wrote:
 On Sunday, 18 July 2021 at 16:32:46 UTC, Basile B. wrote:
 - **=x** says "returns in whatever it has to"
 - **x** (1) is the constraint for input `a`, which is passed 
 as operand **$0**
 - **x** (2) is the constraint for input `b`, which is passed 
 as operand **$1**
$0 is actually the output operand, $1 is `a`, and $2 is `b`.

[...]

Note: inline asm syntax and resulting asm in AT&T syntax, *not* Intel syntax.
Yeah, thanks for the clarification; I totally forgot about that. And what about the `extern(C)` issue? Does it make sense to be used when the parameters are int4?
Jul 19 2021
parent kinke <noone nowhere.com> writes:
On Monday, 19 July 2021 at 11:39:02 UTC, Basile B. wrote:
 And what about the `extern(C)` issue ? Does it make sense to be 
 used when the parameters are int4 ?
The original inline asm was buggy and only 'worked' by accident (not using the 2nd input operand at all...) with extern(D) reversed parameters.

At least for Posix x64, the C calling convention is well-defined for vectors and equivalent to extern(D) (except for the latter's parameter reversal). Windows and 32-bit x86 are different; for Windows, extern(D) pays off, as LDC's ABI is similar to the MSVC++ __vectorcall calling convention (passing vectors in SIMD registers).
Jul 19 2021