D.gnu - Support for gcc vector attributes, SIMD builtins

Mike Farnsworth <mike.farnsworth gmail.com> writes:
I built gdc from tip on Fedora 13 (x86-64) and started playing around
with creating a vector struct (x,y,z,w) to see what kind of optimization
the code generator did with it.  It was able to partially drop into SSE
registers and instructions, but not as well as I had hoped from writing
"regular" D code.

I poked through the builtins that get pulled into d-builtins.c /
d-builtins2.cc but I don't see anything that might be pulling in
definitions such as __builtin_ia32_* for SSE, for example.

How hard would it be to get some sort of vector attribute attached to a
type (or just plain introduce v4sf, __m128, or something like that) and
get those SIMD builtins available?

For the curious, here is how they are defined in, for example,
xmmintrin.h for gcc:

typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

typedef float __v4sf __attribute__ ((__vector_size__ (16)));

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
}

I'm game for making an attempt myself if someone can point me in the
right direction.  I'm a hardcore ray tracing / rendering guy, and
performance is of the utmost importance.  If I could write a ray tracer
in D that matches my C++ tracer for performance, I'd be ecstatic.

-Mike
Feb 01 2011
Daniel Gibson <metalcaedes gmail.com> writes:
I'm not sure if that'll help at all, but you may try something like

    alias float[4] vec4; // or whatever type you're using

/Maybe/ SSE optimizations work better on arrays than on structs.
Of course, such a type isn't as handy because it'll be vec4[0] instead of vec4.x, but it may be worth a try.

If it helps (i.e. SSE is used better) you could go on trying to put that vector in a struct, have x, y, z, w as properties[1] that get/set the corresponding fields in the array and overload operators so they work directly on the array.

Cheers,
- Daniel

[1] http://digitalmars.com/d/2.0/property.html at the bottom of the page.
I guess this will cause little/no overhead because of inlining.
Feb 01 2011
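
For readers who want to try the array-backed layout Daniel describes, here is a minimal sketch in C rather than D. The vec4 type, the vec4_get/vec4_x/... accessors and vec4_add are hypothetical names made up for this illustration; the accessors stand in for D properties and vec4_add stands in for an overloaded operator. Whether a given compiler turns the loop into SSE instructions is not guaranteed.

```c
#include <assert.h>

/* Hypothetical type for illustration: the four components live in one
   contiguous float array instead of four separate fields. */
typedef struct {
    float f[4];
} vec4;

/* Bounds-checked element access on the underlying array. */
static inline float vec4_get(const vec4 *v, int i)
{
    assert(i >= 0 && i < 4);
    return v->f[i];
}

/* Named accessors standing in for the x, y, z, w properties. */
static inline float vec4_x(const vec4 *v) { return vec4_get(v, 0); }
static inline float vec4_y(const vec4 *v) { return vec4_get(v, 1); }
static inline float vec4_z(const vec4 *v) { return vec4_get(v, 2); }
static inline float vec4_w(const vec4 *v) { return vec4_get(v, 3); }

/* Stand-in for an overloaded '+': operates directly on the array. */
static inline vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}
```

The accessors are small enough that an optimizing compiler will normally inline them, which matches Daniel's guess about overhead.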
Mike Farnsworth <mike.farnsworth gmail.com> writes:
Daniel Gibson Wrote:
 I'm not sure if that'll help at all, but you may try something like
 alias float[4] vec4; // or whatever type you're using
 /Maybe/ SSE optimizations work better on arrays than on structs.
 Of course, such a type isn't as handy because it'll be vec4[0] instead 
 of vec4.x, but it may be worth a try..
 If it helps (i.e. SSE is used better) you could go on trying to put that 
 vector in a struct, have x, y, z, w as properties[1] that get/set the 
 corresponding fields in the array and overload operators so they work 
 directly on the array.

I actually tried making the actual data a float[4] vs float x, y, z, w, and while it generates different code, neither one boiled down to the simpler SSE instructions I had hoped for (and generally get out of gcc with my C++ classes, especially if I use the SSE intrinsics in the *mmintrin.h headers). I poked with various bits of syntax to see if I could convince it, with no luck.

-Mike
Feb 01 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 For the curious, here are how they are defined in, for example,
 xmmintrin.h for gcc:
 typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
 typedef float __v4sf __attribute__ ((__vector_size__ (16)));
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
 __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
 }

Although GDC hashes out GCC builtins and attributes, most of it is very much incomplete. For example, a D version (for GDC) of the code above would be something like:

    import gcc.builtins;

    pragma(set_attribute, __m128, vector_size(16), may_alias);
    pragma(set_attribute, __v4sf, vector_size(16));
    pragma(set_attribute, _mm_add_ps, always_inline, artificial);

    typedef float __m128;
    typedef float __v4sf;

    __m128 _mm_add_ps (__m128 __A, __m128 __B)
    {
        return cast(__m128) __builtin_ia32_addps (cast(__v4sf)__A, cast(__v4sf)__B);
    }

However, this doesn't work because
1) There is no 128bit float type in DMDFE (can be put in though, even if it is just for internal use).
2) Vectors are not representable in DMDFE. So __builtin_ia32_addps (and many other ia32 builtins) cannot be emitted to the D environment.

Interestingly enough, this particular example actually ICEs the compiler. It appears that while *explicit* casting is done in the code, DMDFE actually *ignores* this, which is terrible on DMD's part...

Saying that, workaround is to use array types.

    typedef float[4] __m128;
    typedef float[4] __v4sf;

All the more reason to show you that pragma(attribute) is still very incomplete to use. Any ideas to improve it are welcome though. :)
Feb 01 2011
Jerry Quinn <jlquinn optonline.net> writes:
Iain Buclaw Wrote:

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;

 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

The workaround actually looks like a cleaner way to define types for vector intrinsics. How hard would it be to export vector intrinsics so the API expects float[4], for example?
Feb 01 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Jerry Quinn (jlquinn optonline.net)'s article
 The workaround actually looks like a cleaner way to define types for
 vector intrinsics.  How hard would it be to export vector intrinsics so
 the API expects float[4], for example?

I haven't given it much thought on how the internal representation could be, but I'd lean towards using unions in D code for usage in the language, as it's probably the most portable.

For example, one of the older 'hello vectors' I know of:

    import std.c.stdio;

    pragma(set_attribute, __v4sf, vector_size(16));
    typedef float __v4sf;

    union f4vector
    {
        __v4sf v;
        float[4] f;
    }

    int main()
    {
        f4vector a, b, c;

        a.f = [1, 2, 3, 4];
        b.f = [5, 6, 7, 8];

        c.v = a.v + b.v;
        printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

        return 0;
    }

Compile:

    gdc -c -g -msse hellovector.d

Dump Object:

    objdump -dS hellovector.o

And the output of the SIMD operation speaks for itself:

    c.v = a.v + b.v;
        xorps  %xmm1,%xmm1
        movlps %gs:0x0,%xmm1
        movhps %gs:0x8,%xmm1
        xorps  %xmm0,%xmm0
        movlps %gs:0x0,%xmm0
        movhps %gs:0x8,%xmm0
        addps  %xmm1,%xmm0
        movlps %xmm0,%gs:0x0
        movhps %xmm0,%gs:0x8

Regards.
Iain
Feb 01 2011
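
For comparison, the same 'hello vectors' idea can be sketched in C using the GCC/Clang vector_size extension. This is only an illustration under stated assumptions: the names v4sf, f4_add and f4_print are made up here, and the code assumes a compiler that supports the vector_size attribute and C11 static_assert.

```c
#include <assert.h>
#include <stdio.h>

/* GCC/Clang vector extension: four floats treated as one 16-byte value. */
typedef float v4sf __attribute__((vector_size(16)));

static_assert(sizeof(v4sf) == 4 * sizeof(float), "v4sf should hold 4 floats");

/* Union giving both a vector view and an element-wise view of the data. */
union f4vector {
    v4sf  v;
    float f[4];
};

/* Element-wise add through the vector view; with SSE enabled the
   compiler is free to emit a single addps for this expression. */
static inline union f4vector f4_add(union f4vector a, union f4vector b)
{
    union f4vector c;
    c.v = a.v + b.v;
    return c;
}

/* Print the four lanes through the array view. */
static inline void f4_print(union f4vector x)
{
    printf("%f, %f, %f, %f\n", x.f[0], x.f[1], x.f[2], x.f[3]);
}
```

Calling f4_print(f4_add(a, b)) with a.f = {1, 2, 3, 4} and b.f = {5, 6, 7, 8} prints 6.000000, 8.000000, 10.000000, 12.000000.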
Mike Farnsworth <mike.farnsworth gmail.com> writes:
Huh, that's actually pretty promising. Hooray for gcc's vector ops. =)

I suppose I should still try to beat up on the __builtin_ia32_* stuff to make sure that can work, but if the codegen already gets us that far then that's pretty good. With a little -O3 it might even clean up some of the extraneous stuff, especially with a sequence of vector operations. The intrinsics will get us some of the more interesting things like movemasks, shuffles, vector compares, etc. As long as the union doesn't cause a bunch of load/store deadweight in the generated code, this might work nicely.

However, I'll bet dmdfe doesn't understand that __v4sf isn't really just a float, so at some point that will need to be fixed so that there is no accidental slicing, invalid array/structure sizes, etc.

-Mike
Feb 01 2011
Mike Farnsworth <mike.farnsworth gmail.com> writes:
On 02/01/2011 10:38 AM, Iain Buclaw wrote:
 I haven't given it much thought on how internal representation could be, but
 I'd lean on using unions in D code for usage in the language. As its probably
 most portable.
 
 For example, one of the older 'hello vectors' I know of:
 
 import std.c.stdio;
 
 pragma(set_attribute, __v4sf, vector_size(16));
 typedef float __v4sf;
 
 union f4vector
 {
     __v4sf v;
     float[4] f;
 }
 
 int main()
 {
     f4vector a, b, c;
 
     a.f = [1, 2, 3, 4];
     b.f = [5, 6, 7, 8];
 
     c.v = a.v + b.v;
     printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
 
     return 0;
 }

I've been giving this a serious try, and while the above works, I can't get any __builtin_... functions to actually work. I've added support for the VECTOR_TYPE tree code in the gcc_type_to_d_type(tree) function (in d-builtins2.cc):

    case VECTOR_TYPE:
    {
        tree baseType = TREE_TYPE(t);
        d = gcc_type_to_d_type(baseType, printstuff);
        if (d)
            return d;
        break;
    }

This allows it to succeed in interpreting the SSE-related builtins in gcc_type_to_d_type(tree). Note that all it does is grab the base vector element type and convert that to a D type so as not to confuse the frontend; this way it matches the typedef for __v4sf, so as long as we use the union we won't lose data before we can pass it to a builtin.

I've verified (with a bunch of verbatim(...) calls) that the compiler *is* pushing function declarations of things like __builtin_ia32_addps, but I cannot for the life of me get my actual D code to see any of those functions:

========
pragma(set_attribute, __v4sf, vector_size(16));
typedef float __v4sf;

union v4f
{
    __v4sf v;
    float[4] f;
}

import gcc.builtins;

pragma(set_attribute, _mm_add_ps, always_inline, artificial);

__v4sf _mm_add_ps(__v4sf __A, __v4sf __B)
{
    return __builtin_ia32_addps(__A, __B);
}
========

And I get:

../../Vectors.d:24: Error: undefined identifier __builtin_ia32_addps

If I explicitly prefix the call as gcc.builtins.__builtin_ia32_addps(__A, __B) I get:

../../Vectors.d:24: Error: undefined identifier module builtins.__builtin_ia32_addps

Which doesn't make a whole lot of sense. I thought there might be something wrong recognizing the argument types, so I tried the __isnanf and __isnan builtins as well, and...same failures. I don't think any of the builtins besides the alias declarations are working, honestly. (__builtin_Clong, __builtin_Culong, etc. do work, but that's the only thing from gcc.builtins that I can access without errors.)

Any hints?

-Mike
Feb 05 2011
Jacob Carlborg <doob me.com> writes:
Don't know if it has anything to do with it, but you should wrap any non-standard pragmas in a version block.

--
/Jacob Carlborg
Feb 06 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 I've been giving this a serious try, and while the above works, I can't
 get any __builtin_... functions to actually work.  I've added support for
 the VECTOR_TYPE tree code in gcc_type_to_d_type(tree) function (in
 d-builtins2.cc):

     case VECTOR_TYPE:
     {
         tree baseType = TREE_TYPE(t);
         d = gcc_type_to_d_type(baseType, printstuff);
         if (d)
             return d;
         break;
     }

 This allows it to succeed in interpreting the SSE-related builtins in
 gcc_type_to_d_type(tree).  Note that all it does is grab the base vector
 element type and convert that to a D type so as not to confuse the
 frontend; this way it matches the typedef for __v4sf, so as long as we use
 the union we won't lose data before we can pass it to a builtin.

Try:

    case VECTOR_TYPE:
    {
        tree basetype = TYPE_DEBUG_REPRESENTATION_TYPE(t);
        assert(TREE_CODE(basetype) == RECORD_TYPE);
        basetype = TREE_TYPE(TYPE_FIELDS(basetype));
        d = gcc_type_to_d_type(basetype);
        if (d)
        {
            d->ctype = t;
            return d;
        }
        break;
    }

That makes them static arrays, so you needn't require a whacky union to use vector functions.

    float[4] a = [1,2,3,4], b = [5,6,7,8], c;
    c = __builtin_ia32_addps(a,b);

Secondly, __builtin_ia32_addps requires SSE turned on. Compile with -msse

Regards
Feb 06 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
A better way actually:

    case VECTOR_TYPE:
    {
        d = gcc_type_to_d_type(TREE_TYPE(t));
        if (d)
        {
            d = new TypeSArray(d, new IntegerExp(0, TYPE_VECTOR_SUBPARTS(t),
                                                 Type::tindex));
            d->ctype = t;
            return d;
        }
        break;
    }

Happy hacking! :)

Regards
Iain
Feb 06 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Brad Roberts (braddr puremagic.com)'s article
 I'd be happy to have gcc finding vectorization opportunities, but there's no
 need to add this sort of thing to the language.  This already has a hook to
 call a library function:
 float[4] a = [1,2,3,4], b = [5,6,7,8], c;
 c[] = a[] + b[];

Aye, and 9 times out of 10 I would agree with this thinking also. The pro of hashing out GCC vector intrinsics to the D frontend, though, is that the GCC backend has much more creative control over the codegen, inlining and optimising the intrinsics in a far better way than it can optimise the overhead of an external library call.

Bearing in mind that DMD's array libraries are already extremely performant anyway, I honestly don't see the harm if it makes the poignant speed freaks happy.

Regards
Feb 06 2011
Mike Farnsworth <mike.farnsworth gmail.com> writes:
Yes, and I am definitely a "speed freak", but I have good reason: in my field, 3D rendering performance is extremely important. If I can write a few classes in D that can get me to very optimized SSE code on x86(-64) for most vector/point/color operations, for example, that can make or break my ability to get anyone to use my renderer *at all*. In D I can easily do the version(GNU) thing to make the program 100% cross-platform for the cases where I don't have the intrinsics.

I would love it if the array-wise operations were able to automatically just boil down to the intrinsics, but in order to make it fast enough they must always be 16-byte aligned, pass float[4] by SSE register where possible, etc, etc. Someday, the compiler hopefully will just do that, but it doesn't always do it today (or really at all, in my tests of just the float[4] static arrays).

-Mike
Feb 06 2011
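
The pattern Mike describes (use the intrinsics where they exist, fall back to portable code elsewhere, and keep data 16-byte aligned) might look roughly like this in C. This is a sketch under stated assumptions, not code from the thread: add4_aligned is a hypothetical helper, and it relies on the __SSE__ macro that GCC and Clang define when SSE is enabled, playing the role of the version block in D.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_load_ps, _mm_add_ps, _mm_store_ps */
#endif

/* Hypothetical helper: adds four floats. All three pointers are assumed
   to be 16-byte aligned, which the aligned SSE load/store require. */
static inline void add4_aligned(const float *a, const float *b, float *out)
{
    assert(((uintptr_t)a % 16) == 0);
    assert(((uintptr_t)b % 16) == 0);
    assert(((uintptr_t)out % 16) == 0);
#if defined(__SSE__)
    /* SSE path: one aligned load per operand, one add, one aligned store. */
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
#else
    /* Portable fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Callers have to provide the alignment themselves, for example by declaring the arrays with _Alignas(16) in C11.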
Mike Farnsworth <mike.farnsworth gmail.com> writes:
Iain Buclaw Wrote:

 However, this doesn't work because
 1) There is no 128bit float type in DMDFE (can be put in though, even if it
 is just for internal use).
 2) Vectors are not representable in DMDFE. So __builtin_ia32_addps (and many
 other ia32 builtins) cannot be emitted to the D environment.

I figured this would be the case; the "typedef float whatever __attribute((vector_size(16)))" stuff is already weird, so I don't expect dmdfe to do the right thing with even similar syntax at all.
 Interestingly enough, this particular example actually ICEs the compiler. It
 appears that while *explicit* casting is done in the code, DMDFE actually
 *ignores* this, which is terrible on DMD's part...

Hah. It's obvious dmdfe doesn't understand the builtin's signature correctly, so I'll hold off on a bug report until I can figure out what kind of signature that builtin had registered with dmdfe.

 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;

 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

In my (not very abundant) spare time, I'll poke around the attribute stuff to see if I can attach the vector_size(16) attribute to a float[4] array type. I know the __builtin_ia32_addps function, for example, takes a v4sf (__m128 is just Intel's version that can change personalities at will; I feel no inclination to keep it around, and instead go with more strictly defined types and cast intrinsics). If I can get that builtin to take a typedef'd float[4] without a cast, perhaps dmdfe will not drop any data and the codegen will happen properly.

Where do I look to see the attribute pragmas in gdc? Where do I look to potentially change the signature that dmdfe sees for the __builtin_ia32_* functions? If I can get a hand-coded signature to work, then we'll be in business.

-Mike
Feb 01 2011
Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Mike Farnsworth (mike.farnsworth gmail.com)'s article
 Iain Buclaw Wrote:
 Interestingly enough, this particular example actually ICEs the compiler. It
 appears that while *explicit* casting is done in the code, DMDFE actually
 *ignores* this, which is terrible on DMD's part...

 Hah. It's obvious dmdfe doesn't understand the builtin's signature
 correctly, so I'll hold off on a bug report until I can figure out what kind
 of signature that builtin had registered with dmdfe.

Actually, it appears it's much simpler than that.

    IntA a;
    IntB b;
    a = cast(IntA)b;

Although explicit casts are required to not get errors, somewhere in the semantic stage (I presume), the frontend decides no codegen is required to perform the cast, so omits it. Where this puts GDC (I think), is that the backend is told to perform a convert/move where the to and from register types are different (due to attributes applied to the type), and triggers an assert.

It's nothing too much to worry about, but maybe raise a bug (to remind me to look at it in better depth sometime).
 Saying that, workaround is to use array types.
 typedef float[4] __m128;
 typedef float[4] __v4sf;

 All the more reason to show you that pragma(attribute) is still very
 incomplete to use. Any ideas to improve it are welcome though. :)

 In my (not very abundant) spare time, I'll poke around the attribute stuff
 to see if I can attach the vector_size(16) attribute to a float[4] array
 type.  I know the __builtin_ia32_addps function, for example, takes a v4sf
 (__m128 is just Intel's version that can change personalities at will; I
 feel no inclination to keep it around, and instead go with more strictly
 defined types and cast intrinsics).  If I can get that builtin to take a
 typedef'd float[4] without a cast, perhaps dmdfe will not drop any data and
 the codegen will happen properly.
 Where do I look to see the attribute pragmas in gdc?  Where do I look to
 potentially change the signature that dmdfe sees for the __builtin_ia32_*
 functions?  If I can get a hand-coded signature to work, then we'll be in
 business.
 -Mike

In d-builtins2.cc:

    // Entry point for import gcc.builtins, builds GCC builtins in DMD AST on the fly.
    d_gcc_magic_builtins_module()

    // Convert GCC type to DMD type, builds functions as well as normal D types.
    gcc_type_to_d_type()

In d-bi-attrs*.h are all handlers for supported attributes (you needn't touch them, as they are all copied from c-common.c).

Regards
Iain
Feb 01 2011
Brad Roberts <braddr puremagic.com> writes:
On 2/6/2011 4:15 AM, Iain Buclaw wrote:
 That makes them static arrays, so you needn't require a whacky union to use
 vector functions.

 float[4] a = [1,2,3,4], b = [5,6,7,8], c;
 c = __builtin_ia32_addps(a,b);

 Secondly, __builtin_ia32_addps requires SSE turned on. Compile with -msse

I'd be happy to have gcc finding vectorization opportunities, but there's no need to add this sort of thing to the language. This already has a hook to call a library function:

    float[4] a = [1,2,3,4], b = [5,6,7,8], c;
    c[] = a[] + b[];
Feb 06 2011
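
To make the array operation above concrete for readers who don't know D: c[] = a[] + b[] is an element-wise add over the whole array. A rough C counterpart, which leaves any vectorization to the optimizer as Brad suggests, could be sketched as follows. The function name add_arrays is made up for this example, and nothing here guarantees that a particular compiler will actually vectorize the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Rough C counterpart of the D array operation c[] = a[] + b[].
   'restrict' promises that the arrays do not overlap, which makes it
   easier for an optimizer to vectorize the loop on its own. */
static inline void add_arrays(float *restrict c, const float *restrict a,
                              const float *restrict b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The source stays free of compiler-specific types or builtins, which is the property Brad is arguing for; the cost is that the generated code depends entirely on what the optimizer manages to do.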
Brad Roberts <braddr puremagic.com> writes:
On 2/6/2011 2:58 PM, Iain Buclaw wrote:
 Aye, and 9 times out of 10 I would agree with this thinking also. The pro of
 hashing out GCC vector intrinsics to the D frontend, though, is that the GCC
 backend has much more creative control over the codegen, inlining and
 optimising the intrinsics in a far better way than it can optimise the
 overhead of an external library call.

 Bearing in mind that DMD's array libraries are already extremely performant
 anyway, I honestly don't see the harm if it makes the poignant speed freaks
 happy.

The harm that I'd like to minimize (preferably avoid) is compiler specific language changes. When GDC or LDC or DMD add little things to the language that aren't supported by all three, the choice of compilers used to build a chunk of code is reduced. The result is a fragmentation of the language.

I agree that gcc's inliner and optimizers are way better than dmd's when it comes to vectors (among other things), and would love to see those brought to bear. So, imho, try not to go any higher than the glue layer.

Later,
Brad
Feb 06 2011