D.gnu - Implementation of gcc SIMD builtins

Mike Farnsworth <mike.farnsworth gmail.com> writes:
Sorry to start a new thread on this, but I didn't want it to get lost in
the middle of the previous comments.  I have the start of a working
implementation in gdc giving access to the __builtin_ia32_* functions.
The way I did it so far manages to compile down to very tight SSE code.
That's the good part.

The bad part is that there are a couple of limitations that engender
slightly ugly code at the moment.

Issue 1:

In order for the compiler to actually recognize the builtins when you
call them, I had to define a set of custom types that represent gcc's
types with the vector_size attribute that get passed into the builtins.

I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't
set the alignment, and it automatically generates calls to _d_array_init
and _d_array_copy and such, rather than just staying in SSE
registers.  Let's just say the code was very non-optimal.

Instead, I create a struct declaration in gcc.builtins for all of the
types expected by those builtins, and I name them to match what they
would nominally contain: __v4sf would have 4 floats, __v32qi would have
32 bytes, __v2df would have 2 doubles, and so forth.  Each struct has
16-byte alignment and the correct size.  But here's the rub: they have
no fields in them.  I tried my darndest to add VarDeclarations to them,
but because the actual gcc tree type wasn't a struct, the compiler would
just ICE when instantiating any of those structs.

I'd like to fix this, so that you can literally access the contents of
the struct like it had a float[4], or a double[2], or a byte[32], or
whatever it should actually have; or instead it should give you an
overloaded [] operator for direct indexing.
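
For illustration only, direct indexing of that kind might look roughly
like the sketch below.  OpaqueV4SF and IndexedV4 are hypothetical names,
and OpaqueV4SF is just a 16-byte stand-in, not the real __v4sf from
gcc.builtins, so this should build with any D2 compiler.

```d
import std.stdio;

// Hypothetical stand-in for an opaque, field-less 16-byte builtin vector.
// It is NOT the real gcc.builtins.__v4sf.
struct OpaqueV4SF
{
    align(16) ubyte[16] raw;
}

// Hypothetical wrapper exposing the four float lanes through [].
struct IndexedV4
{
    OpaqueV4SF v;

    // Read lane i: vec[i]
    float opIndex(size_t i) const
    {
        assert(i < 4, "lane index out of range");
        return (cast(const(float)*) &v)[i];
    }

    // Write lane i: vec[i] = value
    void opIndexAssign(float value, size_t i)
    {
        assert(i < 4, "lane index out of range");
        (cast(float*) &v)[i] = value;
    }
}

void main()
{
    IndexedV4 vec;
    vec[0] = 1.0f;
    vec[1] = 2.0f;
    vec[2] = 3.0f;
    vec[3] = 0.0f;
    writefln("(%s, %s, %s, %s)", vec[0], vec[1], vec[2], vec[3]);
}
```

The asserts give a bounds check in debug builds and disappear with
-release, which is one way to keep the fast path free of checks.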

Issue 2:

The builtin structs I generate are *not* recognized by the frontend as
having support for +, -, *, /, etc like the gcc vector_size types
automatically do in C and C++.  I might be able to add those and have
them contain code to drop into the builtins.  For now, you *must* use
the builtin functions to perform operations on these types.  I'm
obviously aiming to use the builtin functions, myself (for now).
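
A rough sketch of what "operators that drop into the builtins" could
look like, under loud assumptions: V4 and fakeAddps are hypothetical,
and fakeAddps is a plain-D stand-in for a builtin such as
__builtin_ia32_addps so that the example compiles without gdc.

```d
import std.stdio;

// Hypothetical stand-in for a builtin like __builtin_ia32_addps.
float[4] fakeAddps(const float[4] a, const float[4] b)
{
    float[4] r;
    foreach (i; 0 .. 4)
        r[i] = a[i] + b[i];
    return r;
}

// Hypothetical vector type; a real one would hold the builtin struct.
struct V4
{
    float[4] lanes = [0.0f, 0.0f, 0.0f, 0.0f];

    // v1 + v2 forwards to the (stand-in) builtin.
    V4 opBinary(string op)(const V4 rhs) const
        if (op == "+")
    {
        return V4(fakeAddps(lanes, rhs.lanes));
    }
}

void main()
{
    auto a = V4([1.0f, 2.0f, 3.0f, 0.0f]);
    auto b = V4([1.0f, 1.0f, 1.0f, 0.0f]);
    auto c = a + b;
    writefln("Result: %s", c.lanes);
}
```

The other operators would follow the same pattern, each forwarding to
its own builtin.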

Quick example:

///// File VectorsMain.d /////

import gcc.builtins;
import mmintrins;
import std.stdio;

void main()
{
    __v4sf bv1;
    setvelem(bv1, 0, 1.0f);
    setvelem(bv1, 1, 2.0f);
    setvelem(bv1, 2, 3.0f);
    setvelem(bv1, 3, 0.0f);

    __v4sf bv2;
    setvelem(bv2, 0, 1.0f);
    setvelem(bv2, 1, 1.0f);
    setvelem(bv2, 2, 1.0f);
    setvelem(bv2, 3, 0.0f);

    __v4sf bv3 = _mm_add_ps(bv1, bv2);

    std.stdio.writefln("Result: (%s, %s, %s, %s)",
                       velem!float(bv3, 0),
                       velem!float(bv3, 1),
                       velem!float(bv3, 2),
                       velem!float(bv3, 3));
}

///// File mmintrins.d /////

module mmintrins;

import gcc.builtins;


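// Read element `elem` of a builtin vector by viewing its storage as T[].
// Note: no bounds check yet; `elem` must be a valid lane index.
T velem(T, VT)(VT vector, uint elem)
{
    return (cast(T*) &vector)[elem];
}

// Write `value` into element `elem` of a builtin vector (passed by ref).
void setvelem(T, VT)(ref VT vector, uint elem, T value)
{
    (cast(T*) &vector)[elem] = value;
}


// Thin wrapper named after the Intel intrinsic; templated so the body is
// visible to importing modules and can be inlined there.
//pragma(set_attribute, _mm_add_ps, always_inline, artificial);
T _mm_add_ps(T)(const(T) v1, const(T) v2)
{
    return __builtin_ia32_addps(v1, v2);
}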

///// End example /////

Note a few things: I made _mm_add_ps templated on vector type (I'll
constrain it eventually to appropriate types), and this solves a couple
of problems: cross-module inlining works as the other module gets the
whole definition, and you can technically addps types other than v4sf.
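
As a sketch of the constraint mentioned above (not the actual gdc
code): FakeV4SF, FakeV2DF and addPs are hypothetical stand-ins written
in plain D, and the template constraint restricts the wrapper to the
four-float type at compile time.

```d
import std.stdio;

// Hypothetical stand-ins for two builtin vector types.
struct FakeV4SF { align(16) float[4] lanes = [0.0f, 0.0f, 0.0f, 0.0f]; }
struct FakeV2DF { align(16) double[2] lanes = [0.0, 0.0]; }

// Still a template (so importing modules see the whole body), but the
// constraint only admits the four-float vector type.
T addPs(T)(const T v1, const T v2)
    if (is(T == FakeV4SF))
{
    T r;
    foreach (i, ref lane; r.lanes)
        lane = v1.lanes[i] + v2.lanes[i];
    return r;
}

void main()
{
    FakeV4SF a, b;
    a.lanes = [1.0f, 2.0f, 3.0f, 0.0f];
    b.lanes = [1.0f, 1.0f, 1.0f, 0.0f];

    auto c = addPs(a, b);
    writefln("Result: %s", c.lanes);

    // Other vector types are rejected at compile time.
    static assert(!__traits(compiles, addPs(FakeV2DF.init, FakeV2DF.init)));
}
```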

Note the velem and setvelem methods are just to add a pretty face on the
fact that the data of the struct is hidden, with no fields to access it.
More checks are needed (at least in debug mode), and there will be some
other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup
of vectors easier.  I'll admit that this part is a bit ugly, but it
works, and it generates excellent code.  I compared the actual assembly
generated to my own C++ code with the same intrinsics, and so far the D
side is keeping up.
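
A minimal sketch of what such setup helpers might look like, assuming a
hypothetical FakeV4SF stand-in type rather than the real builtin struct.
The names set1Ps and setPs are made up here; the argument order of setPs
mirrors Intel's _mm_set_ps, which takes the highest lane first.

```d
import std.stdio;

// Hypothetical stand-in for the builtin four-float vector type.
struct FakeV4SF { align(16) float[4] lanes = [0.0f, 0.0f, 0.0f, 0.0f]; }

// Broadcast one value to all four lanes (in the spirit of _mm_set1_ps).
FakeV4SF set1Ps(float x)
{
    FakeV4SF v;
    v.lanes[] = x;
    return v;
}

// Set all four lanes at once. As with Intel's _mm_set_ps, the arguments
// run from the highest lane down, so e0 ends up in lane 0.
FakeV4SF setPs(float e3, float e2, float e1, float e0)
{
    FakeV4SF v;
    v.lanes = [e0, e1, e2, e3];
    return v;
}

void main()
{
    auto ones = set1Ps(1.0f);
    auto v = setPs(0.0f, 3.0f, 2.0f, 1.0f);
    writefln("ones = %s, v = %s", ones.lanes, v.lanes);
}
```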

Please don't collectively throw up when you see this...fast vector ops
are kind of a big deal for me, so be gentle. =)  What do you all think?

-Mike
Feb 10 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't
 set the alignment, and it automatically generates calls to _d_array_init
 and _d_array_copy and such, rather than instead just staying in SSE
 registers.  Let's just say the code was very non-optimal.

I didn't hope for anything, I'm not the crazy one using them. =)
 Instead, I create a struct declaration in gcc.builtins for all of the
 types expected by those builtins, and I name them to match what they
 would nominally contain: __v4sf would have 4 floats, __v32qi would have
 32 bytes, __v2df would have 2 doubles, and so forth.  Each struct has
 16-byte alignment and the correct size.  But here's the rub: they have
 no fields in them.  I tried my darndest to add VarDeclarations to them,
 but the fact that the actual gcc tree type wasn't a struct, it would
 just ICE the compiler when instantiating any of those structs.
 I'd like to fix this, so that you can literally access the contents of
 the struct like it had a float[4], or a double[2], or a byte[32], or
 whatever it should actually have; or instead it should give you an
 overloaded [] operator for direct indexing.
 Issue 2:
 The builtin structs I generate are *not* recognized by the frontend as
 having support for +, -, *, /, etc like the gcc vector_size types
 automatically do in C and C++.  I might be able to add those and have
 them contain code to drop into the builtins.  For now, you *must* use
 the builtin functions to perform operations on these types.  I'm
 obviously aiming to use the builtin functions, myself (for now).

Actually, the more I think about it, the more I feel a user-defined union would be a better way to get around the shortcomings of gcc attribute support in gdc. And trying to use whatever builtins gcc has to offer won't get you very far anytime soon. There are one or two ICEs when using arithmetic operations (+,-,/,*,=) on typedef'd types with vector attributes assigned to them. These have mostly been fixed in my local tree (hopefully with kind error messages for invalid ops too), which will be pushed soon after the next dmd release merge.
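
A minimal sketch of the union idea, under stated assumptions: OpaqueV4SF is a hypothetical 16-byte stand-in for a field-less builtin vector struct (not the real gcc.builtins type), and V4 is a made-up name. The union overlays a float[4] view on the same storage, so the lanes can be read and written by field while the opaque member is what would be passed to a builtin.

```d
import std.stdio;

// Hypothetical stand-in for an opaque, field-less builtin vector struct.
struct OpaqueV4SF
{
    align(16) ubyte[16] raw;
}

// User-defined union: two views of the same 16 bytes.
union V4
{
    OpaqueV4SF vec; // what would be handed to the builtins
    float[4] f;     // the same storage viewed as four floats
}

static assert(V4.sizeof == 16);

void main()
{
    V4 a;
    a.f = [1.0f, 2.0f, 3.0f, 0.0f];

    // Copy through the opaque member, as a builtin call would see it.
    V4 b;
    b.vec = a.vec;

    writefln("(%s, %s, %s, %s)", b.f[0], b.f[1], b.f[2], b.f[3]);
}
```
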
 Please don't collectively throw up when you see this...fast vector ops
 are kindof a big deal for me, so be gentle. =)  What do you all think?
 -Mike

I think I'm gonna throw up... :~)
Feb 10 2011
Mike Farnsworth <mike.farnsworth gmail.com> writes:
Iain Buclaw Wrote:

Actually, the more I think about it, the more I feel a user-defined union would be a better way to get around the shortcomings of gcc attribute support in gdc. And trying to use whatever builtins gcc has to offer won't get you very far anytime soon. There are one or two ICEs when using arithmetic operations (+,-,/,*,=) on typedef'd types with vector attributes assigned to them. These have mostly been fixed in my local tree (hopefully with kind error messages for invalid ops too), which will be pushed soon after the next dmd release merge.

I tried to use the typedef'd types, but even with simple ops I got all sorts of weird compile errors as soon as I tried to pass them to the intrinsics. That is, with the typedef'd types I can get the +,-,*,/,... operators to work but not the builtin functions, while with my builtin structs I can get the builtin functions to work but not the operators. I'm hoping at some point I can get the best of both worlds, but I still feel pretty lost in all the gdc code (although I'm slowly learning).

I think I'm gonna throw up... :~)

Well, keep in mind that I still have a few things to do in the near term that should help:

1) Add at minimum an overloaded [] op to allow direct indexing into the vector structs, which should make the velem/setvelem crap extraneous.
2) Add a bunch more intrinsics wrappers that follow the Intel standard, so the utility of this goes up.
3) Add some example wrapper structs that define all of the relevant operators, dot/cross products, etc. These will (at least for typical usage) hide all of the ugliness and hopefully will compile down to very good code, while still being proper D types that are easy to understand how to use.

Hopefully then nobody will want to throw up anymore.

-Mike
Feb 10 2011