D.gnu - Support for gcc vector attributes, SIMD builtins

Mike Farnsworth <mike.farnsworth gmail.com> writes:

I built gdc from tip on Fedora 13 (x86-64) and started playing around
with creating a vector struct (x,y,z,w) to see what kind of optimization
the code generator did with it.  It was able to partially drop into SSE
registers and instructions, but not as well as I had hoped from writing
"regular" D code.

I poked through the builtins that get pulled into d-builtins.c /
d-builtins2.cc but I don't see anything that might be pulling in
definitions such as __builtin_ia32_* for SSE, for example.

How hard would it be to get some sort of vector attribute attached to a
type (or just plain introduce v4sf, __m128, or something like that) and
get those SIMD builtins available?

For the curious, here is how they are defined in, for example,
xmmintrin.h for gcc:

typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

typedef float __v4sf __attribute__ ((__vector_size__ (16)));

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
}

I'm game for making an attempt myself if someone can point me in the
right direction.  I'm a hardcore ray tracing / rendering guy, and
performance is of the utmost importance.  If I could write a ray tracer
in D that matches my C++ tracer for performance, I'd be ecstatic.

-Mike

Feb 01 2011
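
For anyone who wants to try the vector attribute quoted above outside of the header, here is a minimal C sketch. It assumes GCC (or Clang) and uses only the generic vector extension, not the __builtin_ia32_* functions; the names v4sf, v4_add and v4_sum are made up for illustration and are not the ones from xmmintrin.h.

```c
/* GCC/Clang vector extension: four packed floats in 16 bytes. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise add. With SSE enabled the compiler can usually lower
   this to a single addps, but that is up to the code generator. */
static inline v4sf v4_add(v4sf a, v4sf b)
{
    return a + b;
}

/* Hypothetical helper: sum of the four lanes, handy for checking results. */
static float v4_sum(v4sf v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Calling v4_add on {1, 2, 3, 4} and {5, 6, 7, 8} gives the lanes 6, 8, 10, 12. Whether the add really becomes one SSE instruction depends on the target flags and optimization level, so it is worth checking the disassembly.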

Daniel Gibson <metalcaedes gmail.com> writes:

On 01.02.2011 09:10, Mike Farnsworth wrote:
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.

 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.

 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?

 For the curious, here is how they are defined in, for example,
 xmmintrin.h for gcc:

 typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

 typedef float __v4sf __attribute__ ((__vector_size__ (16)));

 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
 __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
    return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
 }

 I'm game for making an attempt myself if someone can point me in the
 right direction.  I'm a hardcore ray tracing / rendering guy, and
 performance is of the utmost importance.  If I could write a ray tracer
 in D that matches my C++ tracer for performance, I'd be ecstatic.

 -Mike

I'm not sure if that'll help at all, but you may try something like
alias float[4] vec4; // or whatever type you're using
/Maybe/ SSE optimizations work better on arrays than on structs.
Of course, such a type isn't as handy because it'll be vec4[0] instead 
of vec4.x, but it may be worth a try..
If it helps (i.e. SSE is used better) you could go on trying to put that 
vector in a struct, have x, y, z, w as properties[1] that get/set the 
corresponding fields in the array and overload operators so they work 
directly on the array.

Cheers,
- Daniel

[1] http://digitalmars.com/d/2.0/property.html at the bottom of the 
page. I guess this will cause little/no overhead because of inlining.

Feb 01 2011
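
The layout Daniel describes (array storage, named access through small accessors, operators working on the array) can be sketched as follows. This is written in C rather than D to match the GCC snippets in the thread, so D properties and operator overloads become plain inline functions; all names here (vec4, vec4_x, vec4_add, ...) are hypothetical.

```c
/* Hypothetical wrapper: storage is an array, named access goes through
   tiny inline functions that the compiler can normally inline away. */
typedef struct {
    float f[4];
} vec4;

static inline float vec4_x(const vec4 *v) { return v->f[0]; }
static inline float vec4_w(const vec4 *v) { return v->f[3]; }
static inline void  vec4_set_x(vec4 *v, float x) { v->f[0] = x; }

/* The "operator" works directly on the underlying array. */
static inline vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}
```

Whether the loop ends up as packed SSE code is entirely up to the optimizer, which is exactly the uncertainty being discussed here.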

Mike Farnsworth <mike.farnsworth gmail.com> writes:

Daniel Gibson Wrote:
 I'm not sure if that'll help at all, but you may try something like
 alias float[4] vec4; // or whatever type you're using
 /Maybe/ SSE optimizations work better on arrays than on structs.
 Of course, such a type isn't as handy because it'll be vec4[0] instead 
 of vec4.x, but it may be worth a try..
 If it helps (i.e. SSE is used better) you could go on trying to put that 
 vector in a struct, have x, y, z, w as properties[1] that get/set the 
 corresponding fields in the array and overload operators so they work 
 directly on the array.

I actually tried making the actual data a float[4] vs float x, y, z, w, and
while it generates different code, neither one boiled down to the simpler SSE
instructions I had hoped for (and generally get out of gcc with my c++ classes,
especially if I use the SSE intrinsics in the *mmintrin.h headers).  I poked
with various bits of syntax to see if I could convince it, with no luck.

-Mike

Feb 01 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.
 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.
 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?
 For the curious, here is how they are defined in, for example,
 xmmintrin.h for gcc:
 typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
 typedef float __v4sf __attribute__ ((__vector_size__ (16)));
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
 __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
 }

Although GDC hashes out GCC builtins and attributes, most of it is very much
incomplete. For example, a D version (for GDC) of the code above would be
something like:


import gcc.builtins;

pragma(set_attribute, __m128, vector_size(16), may_alias);
pragma(set_attribute, __v4sf, vector_size(16));
pragma(set_attribute, _mm_add_ps, always_inline, artificial);

typedef float __m128;
typedef float __v4sf;

__m128 _mm_add_ps (__m128 __A, __m128 __B)
{
    return cast(__m128) __builtin_ia32_addps (cast(__v4sf)__A, cast(__v4sf)__B);
}



However, this doesn't work because

1) There is no 128bit float type in DMDFE (can be put in though, even if it is
just for internal use).
2) Vectors are not representable in DMDFE.

So __builtin_ia32_addps (and many other ia32 builtins) cannot be emitted to the
D environment.


Interestingly enough, this particular example actually ICEs the compiler. It
appears that while *explicit* casting is done in the code, DMDFE actually
*ignores* this, which is terrible on DMD's part...

That said, a workaround is to use array types:
typedef float[4] __m128;
typedef float[4] __v4sf;


All of which goes to show that pragma(set_attribute) is still far from
complete. Any ideas to improve it are welcome though. :)

Feb 01 2011

Jerry Quinn <jlquinn optonline.net> writes:

Iain Buclaw Wrote:

 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.
 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.
 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?


 
 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;
 
 
 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

The workaround actually looks like a cleaner way to define types for vector
intrinsics.  How hard would it be to export vector intrinsics so the API
expects float[4], for example?

Feb 01 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Jerry Quinn (jlquinn optonline.net)'s article
 Iain Buclaw Wrote:
 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.
 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.
 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;


 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

 The workaround actually looks like a cleaner way to define types for vector
 intrinsics.  How hard would it be to export vector intrinsics so the API
 expects float[4], for example?

I haven't given much thought to what the internal representation could be, but
I'd lean towards using unions in D code for use in the language, as that's
probably the most portable option.

For example, one of the older 'hello vectors' I know of:

import std.c.stdio;

pragma(set_attribute, __v4sf, vector_size(16));
typedef float __v4sf;

union f4vector
{
    __v4sf v;
    float[4] f;
}

int main()
{
    f4vector a, b, c;

    a.f = [1, 2, 3, 4];
    b.f = [5, 6, 7, 8];

    c.v = a.v + b.v;
    printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

    return 0;
}


Compile: gdc -c -g -msse hellovector.d
Dump Object: objdump -dS hellovector.o

And the output of the SIMD operation speaks for itself:

c.v = a.v + b.v;
  xorps  %xmm1,%xmm1
  movlps %gs:0x0,%xmm1
  movhps %gs:0x8,%xmm1
  xorps  %xmm0,%xmm0
  movlps %gs:0x0,%xmm0
  movhps %gs:0x8,%xmm0
  addps  %xmm1,%xmm0
  movlps %xmm0,%gs:0x0
  movhps %xmm0,%gs:0x8


Regards.
Iain

Feb 01 2011
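
For comparison, the same union trick in C, where the vector attribute comes from. This is only a sketch assuming GCC or Clang; the function name f4_add is made up, and the union mirrors the f4vector from the D example above.

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Same 16 bytes viewed either as one SIMD vector or as four floats. */
typedef union {
    v4sf  v;
    float f[4];
} f4vector;

/* Add through the vector view; read the result back through .f. */
static f4vector f4_add(f4vector a, f4vector b)
{
    f4vector c;
    c.v = a.v + b.v;
    return c;
}
```

Filling a.f with 1, 2, 3, 4 and b.f with 5, 6, 7, 8 and calling f4_add leaves 6, 8, 10, 12 in the result's .f member. GCC documents reading a union member other than the one last written as supported, which is what makes this view-switching usable in C.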

Mike Farnsworth <mike.farnsworth gmail.com> writes:

Iain Buclaw Wrote:

 == Quote from Jerry Quinn (jlquinn optonline.net)'s article
 Iain Buclaw Wrote:
 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.
 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.
 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;


 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

 The workaround actually looks like a cleaner way to define types for vector
 intrinsics.  How hard would it be to export vector intrinsics so the API
 expects float[4], for example?
 
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.
 
 For example, one of the older 'hello vectors' I know of:
 
 import std.c.stdio;
 
 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;
 
 union f4vector
 {
     __v4sf v;
     float[4] f;
 }
 
 int main()
 {
     f4vector a, b, c;
 
     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];
 
     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
 
     return 0;
 }
 
 
 Compile: gdc -c -g -msse hellovector.d
 Dump Object: objdump -dS hellovector.o
 
 And the output of the SIMD operation speaks for itself:
 
 c.v = a.v + b.v;
   xorps  %xmm1,%xmm1
   movlps %gs:0x0,%xmm1
   movhps %gs:0x8,%xmm1
   xorps  %xmm0,%xmm0
   movlps %gs:0x0,%xmm0
   movhps %gs:0x8,%xmm0
   addps  %xmm1,%xmm0
   movlps %xmm0,%gs:0x0
   movhps %xmm0,%gs:0x8
 
 
 Regards.
 Iain

Huh, that's actually pretty promising.  Hooray for gcc's vector ops. =)

I suppose I should still try to beat up on the __builtin_ia32_* stuff to make
sure that can work, but if the codegen already gets us that far then that's
pretty good.  With a little -O3 it might even clean up some of the extraneous
stuff, especially with a sequence of vector operations.  The intrinsics
will get us some of the more interesting things like movemasks, shuffles,
vector compares, etc.

As long as the union doesn't cause a bunch of load/store deadweight in the
generated code, this might work nicely.  However, I'll bet dmdfe doesn't
understand that __v4sf isn't really just a float, though...so at some point that
will need to be fixed so that there is not accidental slicing and invalid
array/structure sizes, etc.

-Mike

Feb 01 2011
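
To make the "more interesting things" concrete, here is a rough C sketch of a vector compare feeding a movemask, plus a shuffle, using the documented xmmintrin.h intrinsics. The wrapper names lanes_less_than and reverse_lanes are made up, and a scalar fallback is included on the assumption that the code may also be built for a target without SSE.

```c
#if defined(__SSE__)
#include <xmmintrin.h>

/* Bit i of the result is set when a[i] < b[i] (vector compare + movemask). */
static int lanes_less_than(const float a[4], const float b[4])
{
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Reverse lane order (x,y,z,w) -> (w,z,y,x) with one shuffle. */
static void reverse_lanes(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
#else
/* Scalar fallbacks with the same results for targets without SSE. */
static int lanes_less_than(const float a[4], const float b[4])
{
    int mask = 0;
    for (int i = 0; i < 4; ++i)
        if (a[i] < b[i])
            mask |= 1 << i;
    return mask;
}

static void reverse_lanes(const float in[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[3 - i];
}
#endif
```

With a = {1, 5, 3, 9} and b = {2, 4, 6, 8}, lanes 0 and 2 compare as less-than, so the mask is 5 (binary 0101). These are the operations that plain "a + b" style vector arithmetic cannot express, which is why getting at the builtins matters.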

Mike Farnsworth <mike.farnsworth gmail.com> writes:

On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.
 
 For example, one of the older 'hello vectors' I know of:
 
 import std.c.stdio;
 
 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;
 
 union f4vector
 {
     __v4sf v;
     float[4] f;
 }
 
 int main()
 {
     f4vector a, b, c;
 
     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];
 
     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
 
     return 0;
 }

I've been giving this a serious try, and while the above works, I can't
get any __builtin_... functions to actually work.  I've added support
for the VECTOR_TYPE tree code in the gcc_type_to_d_type(tree) function (in
d-builtins2.cc):

        case VECTOR_TYPE:
        {
            tree baseType = TREE_TYPE(t);
            d = gcc_type_to_d_type(baseType, printstuff);
            if (d)
                return d;
            break;
        }

This allows it to succeed in interpreting the SSE-related builtins in
gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
element type and convert that to a D type so as not to confuse the
frontend; this way it matches the typedef for __v4sf, so as long as we
use the union we won't lose data before we can pass it to a builtin.

I've verified (with a bunch of verbatim(...) calls) that the compiler
*is* pushing function declarations of things like __builtin_ia32_addps,
but I cannot for the life of me get my actual D code to see any of those
functions:

========
pragma(set_attribute, __v4sf, vector_size(16));
typedef float __v4sf;

union v4f
{
    __v4sf v;
    float[4] f;
}

import gcc.builtins;

pragma(set_attribute, _mm_add_ps, always_inline, artificial);

__v4sf _mm_add_ps(__v4sf __A, __v4sf __B)
{
    return __builtin_ia32_addps(__A, __B);
}
========

And I get:
../../Vectors.d:24: Error: undefined identifier __builtin_ia32_addps

If I explicitly prefix the call as
gcc.builtins.__builtin_ia32_addps(__A, __B)

I get:
../../Vectors.d:24: Error: undefined identifier module
builtins.__builtin_ia32_addps

Which doesn't make a whole lot of sense.

I thought there might be something wrong recognizing the argument types,
so I tried __isnanf and __isnan builtins as well, and...same failures.
I don't think any of the builtins besides the alias declarations are
working, honestly.  (__builtin_Clong, __builtin_Culong, etc do work, but
that's the only thing from gcc.builtins that I can access without errors).

Any hints?

-Mike

Feb 05 2011

Jacob Carlborg <doob me.com> writes:

On 2011-02-06 07:24, Mike Farnsworth wrote:
 On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.

 For example, one of the older 'hello vectors' I know of:

 import std.c.stdio;

 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;

 union f4vector
 {
      __v4sf v;
      float[4] f;
 }

 int main()
 {
      f4vector a, b, c;

      a.f = [1, 2, 3, 4];
      b.f = [5, 6, 7, 8];

      c.v = a.v + b.v;
      printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

      return 0;
 }

 I've been giving this a serious try, and while the above works, I can't
 get any __builtin_... functions to actually work.  I've added support
 for the VECTOR_TYPE tree code in gcc_type_to_d_type(tree) function (in
 d_builtins2.cc):

          case VECTOR_TYPE:
          {
              tree baseType = TREE_TYPE(t);
              d = gcc_type_to_d_type(baseType, printstuff);
              if (d)
                  return d;
              break;
          }

 This allows it to succeed in interpreting the SSE-related builtins in
 gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
 element type and convert that to a D type so as not to confuse the
 frontend; this way it matches the typedef for __v4sf, so as long as we
 use the union we won't lose data before we can pass it to a builtin.

 I've verified (with a bunch of verbatim(...) calls) that the compiler
 *is* pushing function declarations of things like __builtin_ia32_addps,
 but I cannot for the life of me get my actual D code to see any of those
 functions:

 ========
 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;

 union v4f
 {
      __v4sf v;
      float[4] f;
 }

 import gcc.builtins;

 pragma(set_attribute, _mm_add_ps, always_inline, artificial);

 __v4sf _mm_add_ps(__v4sf __A, __v4sf __B)
 {
      return __builtin_ia32_addps(__A, __B);
 }
 ========

 And I get:
 ../../Vectors.d:24: Error: undefined identifier __builtin_ia32_addps

 If I explicitly prefix the call as
 gcc.builtins.__builtin_ia32_addps(__A, __B)

 I get:
 ../../Vectors.d:24: Error: undefined identifier module
 builtins.__builtin_ia32_addps

 Which doesn't make a whole lot of sense.

 I thought there might be something wrong recognizing the argument types,
 so I tried __isnanf and __isnan builtins as well, and...same failures.
 I don't think any of the builtins besides the alias declarations are
 working, honestly.  (__builtin_Clong, __builtin_Culong, etc do work, but
 that's the only thing from gcc.builtins that I can access without errors).

 Any hints?

 -Mike

Don't know if it has anything to do with it, but you should wrap any
non-standard pragmas in a version block.

-- 
/Jacob Carlborg

Feb 06 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.

 For example, one of the older 'hello vectors' I know of:

 import std.c.stdio;

 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;

 union f4vector
 {
     __v4sf v;
     float[4] f;
 }

 int main()
 {
     f4vector a, b, c;

     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];

     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

     return 0;
 }

 I've been giving this a serious try, and while the above works, I can't
 get any __builtin_... functions to actually work.  I've added support
 for the VECTOR_TYPE tree code in gcc_type_to_d_type(tree) function (in
 d_builtins2.cc):
         case VECTOR_TYPE:
         {
             tree baseType = TREE_TYPE(t);
             d = gcc_type_to_d_type(baseType, printstuff);
             if (d)
                 return d;
             break;
         }
 This allows it to succeed in interpreting the SSE-related builtins in
 gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
 element type and convert that to a D type so as not to confuse the
 frontend; this way it matches the typedef for __v4sf, so as long as we
 use the union we won't lose data before we can pass it to a builtin.

Try:
        case VECTOR_TYPE:
        {
            tree basetype = TYPE_DEBUG_REPRESENTATION_TYPE(t);
            assert(TREE_CODE(basetype) == RECORD_TYPE);
            basetype = TREE_TYPE(TYPE_FIELDS(basetype));
            d = gcc_type_to_d_type(basetype);
            if (d)
            {
                d->ctype = t;
                return d;
            }
            break;
        }

That makes them static arrays, so you don't need a wacky union to use vector
functions.

  float[4] a = [1,2,3,4], b = [5,6,7,8], c;
  c = __builtin_ia32_addps(a,b);



Secondly, __builtin_ia32_addps requires SSE to be turned on. Compile with -msse.


Regards

Feb 06 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Iain Buclaw (ibuclaw ubuntu.com)'s article
 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.

 For example, one of the older 'hello vectors' I know of:

 import std.c.stdio;

 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;

 union f4vector
 {
     __v4sf v;
     float[4] f;
 }

 int main()
 {
     f4vector a, b, c;

     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];

     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

     return 0;
 }

 I've been giving this a serious try, and while the above works, I can't
 get any __builtin_... functions to actually work.  I've added support
 for the VECTOR_TYPE tree code in gcc_type_to_d_type(tree) function (in
 d_builtins2.cc):
         case VECTOR_TYPE:
         {
             tree baseType = TREE_TYPE(t);
             d = gcc_type_to_d_type(baseType, printstuff);
             if (d)
                 return d;
             break;
         }
 This allows it to succeed in interpreting the SSE-related builtins in
 gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
 element type and convert that to a D type so as not to confuse the
 frontend; this way it matches the typedef for __v4sf, so as long as we
 use the union we won't lose data before we can pass it to a builtin.

 Try:
         case VECTOR_TYPE:
         {
             tree basetype = TYPE_DEBUG_REPRESENTATION_TYPE(t);
             assert(TREE_CODE(basetype) == RECORD_TYPE);
             basetype = TREE_TYPE(TYPE_FIELDS(basetype));
             d = gcc_type_to_d_type(basetype);
             if (d)
             {
                 d->ctype = t;
                 return d;
             }
             break;
         }
 That makes them static arrays, so you needn't require a whacky union to use
 vector functions.

A better way actually:

        case VECTOR_TYPE:
        {
            d = gcc_type_to_d_type(TREE_TYPE(t));
            if (d)
            {
                d = new TypeSArray(d,
                        new IntegerExp(0, TYPE_VECTOR_SUBPARTS(t),
                            Type::tindex));
                d->ctype = t;
                return d;
            }
            break;
        }

Happy hacking! :)

Regards
Iain

Feb 06 2011
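
The mapping in the patch above (vector type becomes a static array of the element type with TYPE_VECTOR_SUBPARTS entries) rests on simple arithmetic: lane count equals vector bytes divided by element bytes. A small C sketch of that relationship, assuming GCC or Clang; the macro and typedef names are invented for illustration and are not part of GDC.

```c
#include <stddef.h>

typedef float  v4sf __attribute__((vector_size(16)));
typedef double v2df __attribute__((vector_size(16)));

/* Lane count = total vector bytes / element bytes. */
#define V4SF_LANES (sizeof(v4sf) / sizeof(float))
#define V2DF_LANES (sizeof(v2df) / sizeof(double))

/* A static array with the same element type and lane count
   occupies the same number of bytes as the vector type. */
typedef float v4sf_as_array[V4SF_LANES];
```

So a 16-byte float vector corresponds to float[4] and a 16-byte double vector to double[2]. Size matches, but alignment does not automatically: a plain float[4] is only guaranteed float alignment, which is one reason the frontend still has to carry the original GCC type along.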

Brad Roberts <braddr puremagic.com> writes:

On 2/6/2011 4:15 AM, Iain Buclaw wrote:
 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.

 For example, one of the older 'hello vectors' I know of:

 import std.c.stdio;

 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;

 union f4vector
 {
     __v4sf v;
     float[4] f;
 }

 int main()
 {
     f4vector a, b, c;

     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];

     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

     return 0;
 }

 I've been giving this a serious try, and while the above works, I can't
 get any __builtin_... functions to actually work.  I've added support
 for the VECTOR_TYPE tree code in gcc_type_to_d_type(tree) function (in
 d_builtins2.cc):
         case VECTOR_TYPE:
         {
             tree baseType = TREE_TYPE(t);
             d = gcc_type_to_d_type(baseType, printstuff);
             if (d)
                 return d;
             break;
         }
 This allows it to succeed in interpreting the SSE-related builtins in
 gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
 element type and convert that to a D type so as not to confuse the
 frontend; this way it matches the typedef for __v4sf, so as long as we
 use the union we won't lose data before we can pass it to a builtin.

 
 Try:
         case VECTOR_TYPE:
         {
             tree basetype = TYPE_DEBUG_REPRESENTATION_TYPE(t);
             assert(TREE_CODE(basetype) == RECORD_TYPE);
             basetype = TREE_TYPE(TYPE_FIELDS(basetype));
             d = gcc_type_to_d_type(basetype);
             if (d)
             {
                 d->ctype = t;
                 return d;
             }
             break;
         }
 
 That makes them static arrays, so you needn't require a whacky union to use
 vector functions.
 
   float[4] a = [1,2,3,4], b = [5,6,7,8], c;
   c = __builtin_ia32_addps(a,b);
 
 
 
 Secondly, __builtin_ia32_addps requires SSE turned on. Compile with -msse
 
 
 Regards

I'd be happy to have gcc finding vectorization opportunities, but there's no
need to add this sort of thing to the language.  This already has a hook to
call a library function:

float[4] a = [1,2,3,4], b = [5,6,7,8], c;
c[] = a[] + b[];

Feb 06 2011
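
As a rough picture of what sits behind such an array operation, here is a C sketch of an element-wise add routine. It is a hypothetical stand-in, not the actual D runtime function, and the name array_add_f32 is invented; the point is only that the work is a plain loop which an optimizer may or may not vectorize.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library routine behind c[] = a[] + b[].
   restrict tells the compiler the arrays do not overlap, which makes
   auto-vectorization of the loop easier. */
static void array_add_f32(float *restrict dst, const float *restrict a,
                          const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

For a fixed length of 4 the call overhead and the loop bookkeeping can easily cost more than the add itself unless the call is inlined, which is the trade-off discussed in the replies that follow.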

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Brad Roberts (braddr puremagic.com)'s article
 I'd be happy to have gcc finding vectorization opportunities, but there's no
 need to add this sort of thing to the language.  This already has a hook to
 call a library function:
 float[4] a = [1,2,3,4], b = [5,6,7,8], c;
 c[] = a[] + b[];

Aye, and 9 times out of 10 I would agree with this thinking also.

The advantage of hashing out GCC vector intrinsics to the D frontend, though,
is that the GCC backend has much more creative control over the codegen: it can
inline and optimise the intrinsics far better than it can optimise away the
overhead of an external library call.

Bearing in mind that DMD's array libraries are already extremely performant
anyway, I honestly don't see the harm if it makes the poignant speed freaks
happy.

Regards

Feb 06 2011

Mike Farnsworth <mike.farnsworth gmail.com> writes:

On 02/06/2011 02:58 PM, Iain Buclaw wrote:
 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I'd be happy to have gcc finding vectorization opportunities, but there's no
 need to add this sort of thing to the language.  This already has a hook to
 call a library function:
 float[4] a = [1,2,3,4], b = [5,6,7,8], c;
 c[] = a[] + b[];

 
 Aye, and 9 times out of 10 I would agree with this thinking also.
 
 The pros to hashing out GCC Vector intrinsics to the D frontend though are that
 the GCC backend has much more creative control over the codegen. Inlining and
 optimising the intrinsics in a far better way than optimising the overhead of
 an external library call.
 
 Bearing in mind that DMD's array libraries are already extremely performant
 anyway, I honestly don't see the harm if it makes the poignant speed freaks
 happy.
 
 Regards

Yes, and I am definitely a "speed freak", but I have good reason: in my
field, 3D rendering performance is extremely important.  If I can write
a few classes in D that can get me to very optimized SSE code on
x86(-64) for most vector/point/color operations, for example, that can
make or break my ability to get anyone to use my renderer *at all*.

In D I can easily do the version(GNU) thing to make the program 100%
cross-platform for the cases where I don't have the intrinsics.  I would
love it if the array-wise operations were able to automatically just
boil down to the intrinsics, but in order to make it fast enough they
must always be 16-byte aligned, pass float[4] by SSE register where
possible, etc, etc.  Someday, the compiler hopefully will just do that,
but it doesn't always do it today (or really at all, in my tests of just
the float[4] static arrays).

-Mike

Feb 06 2011
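
The two requirements mentioned above (a compiler-specific fast path with a portable fallback, and 16-byte alignment) can be sketched in C like this. The preprocessor check plays the role that a version block would play in D; the type and function names vec4a and vec4a_add are hypothetical.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical 4-float type; 16-byte alignment permits aligned SSE loads. */
typedef struct {
    _Alignas(16) float f[4];
} vec4a;

static vec4a vec4a_add(vec4a a, vec4a b)
{
    vec4a r;
#if defined(__SSE__)
    /* Aligned load/store is valid because f is 16-byte aligned. */
    _mm_store_ps(r.f, _mm_add_ps(_mm_load_ps(a.f), _mm_load_ps(b.f)));
#else
    /* Portable fallback for targets or compilers without SSE support. */
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
#endif
    return r;
}
```

Both paths give the same result, so callers never see which one was compiled in. Whether the struct is actually passed in an SSE register is decided by the platform ABI, not by this code.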

Brad Roberts <braddr puremagic.com> writes:

On 2/6/2011 2:58 PM, Iain Buclaw wrote:
 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I'd be happy to have gcc finding vectorization opportunities, but there's no
 need to add this sort of thing to the language.  This already has a hook to
 call a library function:
 float[4] a = [1,2,3,4], b = [5,6,7,8], c;
 c[] = a[] + b[];

 
 Aye, and 9 times out of 10 I would agree with this thinking also.
 
 The pros to hashing out GCC Vector intrinsics to the D frontend though are that
 the GCC backend has much more creative control over the codegen. Inlining and
 optimising the intrinsics in a far better way than optimising the overhead of
 an external library call.
 
 Bearing in mind that DMD's array libraries are already extremely performant
 anyway, I honestly don't see the harm if it makes the poignant speed freaks
 happy.
 
 Regards

The harm that I'd like to minimize (preferably avoid) is compiler-specific
language changes.  When GDC or LDC or DMD add little things to the language
that aren't supported by all three, the choice of compilers used to build a
chunk of code is reduced.  The result is a fragmentation of the language.

I agree that gcc's inliner and optimizers are way better than dmd's when it
comes to vectors (among other things), and would love to see those brought to
bear.

So, imho, try not to go any higher than the glue layer.

Later,
Brad

Feb 06 2011

Mike Farnsworth <mike.farnsworth gmail.com> writes:

Iain Buclaw Wrote:

 == Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I built gdc from tip on Fedora 13 (x86-64) and started playing around
 with creating a vector struct (x,y,z,w) to see what kind of optimization
 the code generator did with it.  It was able to partially drop into SSE
 registers and instructions, but not as well as I had hoped from writing
 "regular" D code.
 I poked through the builtins that get pulled into d-builtins.c /
 d-builtins2.cc but I don't see anything that might be pulling in
 definitions such as __builtin_ia32_* for SSE, for example.
 How hard would it be to get some sort of vector attribute attached to a
 type (or just plain introduce v4sf, __m128, or something like that) and
 get those SIMD builtins available?
 For the curious, here are how they are defined in, for example,
 xmmintrin.h for gcc:
 typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
 typedef float __v4sf __attribute__ ((__vector_size__ (16)));
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
 __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
 }

 
 Although GDC hashes out GCC builtins and attributes, most of it is very much
 incomplete. For example, a D version (for GDC) of the code above would be
 something like:
 
 
 import gcc.builtins;
 
 pragma(set_attribute, __m128, vector_size(16), may_alias);
 pragma(set_attribute, __v4sf, vector_size(16));
 pragma(set_attribute, _mm_add_ps, always_inline, artificial);
 
 typedef float __m128;
 typedef float __v4sf;
 
 __m128 _mm_add_ps (__m128 __A, __m128 __B)
 {
     return cast(__m128) __builtin_ia32_addps (cast(__v4sf)__A, cast(__v4sf)__B);
 }
 
 
 
 However, this doesn't work because
 
 1) There is no 128bit float type in DMDFE (can be put in though, even if it is
 just for internal use).
 2) Vectors are not representable in DMDFE.
 
 So __builtin_ia32_addps (and many other ia32 builtins) cannot be emitted to
 the D environment.

I figured this would be the case; the "typedef float whatever
__attribute((vector_size(16)))" stuff is already weird, so I don't expect dmdfe
to do the right thing with even similar syntax at all.

 Interestingly enough, this particular example actually ICEs the compiler. It
 appears that while *explicit* casting is done in the code, DMDFE actually
 *ignores* this, which is terrible on DMD's part...

Hah.  It's obvious dmdfe doesn't understand the builtin's signature
correctly, so I'll hold off on a bug report until I can figure out what kind of
signature that builtin had registered with dmdfe.

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;
 
 
 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

In my (not very abundant) spare time, I'll poke around the attribute stuff to
see if I can attach the vector_size(16) attribute to a float[4] array type.  I
know the __builtin_ia32_addps function, for example, takes a v4sf (__m128 is
just Intel's version that can change personalities at will; I feel no
inclination to keep it around, and instead go with more strictly defined types
and cast intrinsics).  If I can get that builtin to take a typedef'd float[4]
without a cast, perhaps dmdfe will not drop any data and the codegen will
happen properly.

Where do I look to see the attribute pragmas in gdc?  Where do I look to
potentially change the signature that dmdfe sees for the __builtin_ia32_*
functions?  If I can get a hand-coded signature to work, then we'll be in
business.

-Mike

Feb 01 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 Iain Buclaw Wrote:
 Interestingly enough, this particular example actually ICEs the compiler. It
 appears that while *explicit* casting is done in the code, DMDFE actually
 *ignores* this, which is terrible on DMD's part...

 Hah.  It's obvious dmdfe doesn't understand the builtin's signature
 correctly, so I'll hold off on a bug report until I can figure out what kind
 of signature that builtin had registered with dmdfe.

Actually, it appears it's much simpler than that.

IntA a;
IntB b;
a = cast(IntA)b;

Although explicit casts are required to not get errors, somewhere in the
semantic stage (I presume), the frontend decides no codegen is required to
perform the cast, so omits it.

Where this puts GDC (I think), is that the backend is told to perform a
convert/move where the to and from register types are different (due to
attributes applied to the type), and triggers an assert.

It's nothing too much to worry about, but maybe raise a bug (to remind me to
look at it in better depth sometime).

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;


 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

 In my (not very abundant) spare time, I'll poke around the attribute stuff to
 see if I can attach the vector_size(16) attribute to a float[4] array type.  I
 know the __builtin_ia32_addps function, for example, takes a v4sf (__m128 is
 just Intel's version that can change personalities at will; I feel no
 inclination to keep it around, and instead go with more strictly defined types
 and cast intrinsics).  If I can get that builtin to take a typedef'd float[4]
 without a cast, perhaps dmdfe will not drop any data and the codegen will
 happen properly.
 
 Where do I look to see the attribute pragmas in gdc?  Where do I look to
 potentially change the signature that dmdfe sees for the __builtin_ia32_*
 functions?  If I can get a hand-coded signature to work, then we'll be in
 business.
 -Mike

In d-builtins2.cc:
// Entry point for import gcc.builtins, builds GCC builtins in DMD AST on the fly.
d_gcc_magic_builtins_module()

// Convert GCC type to DMD type, builds functions as well as normal D types.
gcc_type_to_d_type()

In d-bi-attrs*.h are all handlers for supported attributes (you needn't touch
them, as they are all copied from c-common.c).

Regards
Iain

Feb 01 2011
