
digitalmars.D - SIMD/intrinsics questions

reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Hey all,

The other day someone pointed me to Andrei's article in DDJ, and I dove
headlong into researching D and what it is capable of.  I had only seen it
referred to a few times with respect to template metaprogramming and that crazy
compile-time ray tracer, but I have to say I've been very impressed with what
I've seen, especially with D2.

A bit of background:  I work in the movie VFX industry, and worked in games
development previously, and I have my own ray tracer that I experiment with
(see http://renderspud.blogspot.com/ for info).  Back in college the

better version, and now I've slowly been converting it to C++ again with SSE
support (getting to the SOA ray packet form soon, I hope) so that it doesn't
suck speed-wise.  Anyway, long story short, SIMD is really important to me.

In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).

In the alternative, is it possible to support something along the lines of
gcc's vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));
typedef float v4sf __attribute__ ((vector_size (16)));

where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.

Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).

Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.

Thanks,
Mike Farnsworth
Nov 06 2009
next sibling parent reply Don <nospam nospam.com> writes:
Mike Farnsworth wrote:
 Hey all,
 
 The other day someone pointed me to Andrei's article in DDJ, and I dove
headlong into researching D and what it is capable of.  I had only seen it
referred to a few times with respect to template metaprogramming and that crazy
compile-time ray tracer, but I have to say I've been very impressed with what
I've seen, especially with D2.
 
 A bit of background:  I work in the movie VFX industry, and worked in games
development previously, and I have my own ray tracer that I experiment with
(see http://renderspud.blogspot.com/ for info).  Back in college the

better version, and now I've slowly been converting it to C++ again with SSE
support (getting to the SOA ray packet form soon, I hope) so that it doesn't
suck speed-wise.  Anyway, long story short, SIMD is really important to me.
 
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).
 
 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:
 
 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));
 
 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.
 
 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).
 
 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.
 
 Thanks,
 Mike Farnsworth
Hi Mike, welcome to D!

In the latest compiler release (i.e., this morning's!), fixed-length arrays have become value types. This is a big step: it means that (e.g.) float[4] can be returned from a function for the first time. On 32-bit we're a bit limited in SSE support (e.g., since *no* 32-bit AMD processors have SSE2) -- but it means that on 64-bit we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there have been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
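[A minimal sketch of the two features Don describes -- value-type static arrays and array operations -- assuming a D2 compiler from this release or later; the function and variable names are illustrative only:]

```d
// Static arrays are now value types, so float[4] can be returned by value,
// and slice expressions apply operators element-wise.
float[4] scaleAdd(float[4] y, float[4] z)
{
    float[4] x;
    x[] = y[] * 3 + z[];  // element-wise; a compiler may lower this to SIMD
    return x;             // returned as a value (a copy), not a pointer
}

void main()
{
    float[4] y = [1, 2, 3, 4];
    float[4] z = [10, 20, 30, 40];
    assert(scaleAdd(y, z) == [13f, 26, 39, 52]);
}
```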
Nov 06 2009
next sibling parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Don Wrote:

 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).
 
 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:
 
 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));
 
 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.
 
 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).
 
 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.
 
 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like:

    x[] = ((y[] % 5) ^ 2) + z[];

Would that also work? (Sorry, I should test it myself, but I'm at work and haven't had time to get D tools installed yet, so I'm flying blind.)

On another note, I'm aware that the latest gcc versions have pretty good SIMD auto-vectorization, so I assume that will eventually be in the cards for dmd. As for ldc, that is pretty much dependent on llvm itself, and that doesn't have auto-vectorization of code yet AFAIK.

Anyone familiar with ldc have any idea about getting optimized asm and/or SSE intrinsics to do the right thing? As soon as I have some time, I'll stop being lazy and actually go try some of this stuff out myself and see what the compiled asm looks like, but if anyone has already figured out the answers I can stay lazy.

If it comes down to me needing to create some x86 asm in structs to get some initial SSE-based vector types working, I'll do that and share with the class. I'm not amazing with that stuff, but it could serve as a poor man's stopgap until the compilers mature a bit in this regard.

-Mike
Nov 06 2009
parent reply Don <nospam nospam.com> writes:
Mike Farnsworth wrote:
 Don Wrote:
 
 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).

 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:

 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));

 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.

 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).

 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.

 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like: x[] = ((y[] % 5) ^ 2) + z[];
Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high-level language not to be able to express it.
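[One caveat worth noting for readers: ^ in D is bitwise XOR, not exponentiation, so the expression is well-defined but may not mean what a newcomer expects. A small sketch, assuming a D2 compiler that accepts the nested array operations as Don confirms:]

```d
void main()
{
    int[] x = new int[4];     // destination must already have the right length
    int[] y = [7, 12, 3, 9];
    int[] z = [1, 1, 1, 1];

    // %, ^ (bitwise XOR, not exponentiation), and + all apply element-wise:
    x[] = ((y[] % 5) ^ 2) + z[];

    // first element: 7 % 5 = 2, then 2 ^ 2 = 0, then 0 + 1 = 1
    assert(x == [1, 1, 2, 7]);
}
```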
 Would that also work?  (Sorry, I should test it myself, but I'm at work and
haven't had time to get D tools installed yet and so am flying blind.)
 
 On another note, I'm aware that the latest gcc versions have pretty good SIMD
auto-vectorization, so I assume that will eventually be in the cards for dmd. 
 As for ldc, that is pretty much dependent on llvm itself, and that doesn't have
auto-vectorization of code yet AFAIK.
 
 Anyone familiar with ldc have any idea about getting optimized asm and/or SSE
intrinsics to do the right thing?  As soon as I have some time, I'll stop being
lazy and actually go try some of this stuff out myself and see what the
compiled asm looks like, but if anyone has already figured out the answers I
can stay lazy.
 
 If it comes down to me needing to create some x86 asm in structs to get some
initial SSE-based vector types working, I'll do that and share with the class. 
I'm not amazing with that stuff, but it could serve as a poor-man's stopgap
until the compilers mature a bit in this regard.
Yes, lots of stuff that should work doesn't yet. The emphasis has been on getting the fundamentals solid. There's a lot of activity planned -- in fact, I'm improving the compiler support for operator overloading right now.
Nov 06 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Mike Farnsworth wrote:
 Don Wrote:

 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD 
 intrinsics?  I realize that I could write some asm blocks, but that 
 means each operation (vector add, sub, mul, dot product, etc.) would 
 need to probably include a prelude and postlude with loads and 
 stores.  I worry that this will not get optimized away (unless I 
 don't use 'naked'?).

 In the alternative, is it possible to support something along the 
 lines of gcc's vector extensions:

 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));

 where the compiler will automatically generate opAdd, etc. for those 
 types?  I'm not suggesting using gcc's syntax, of course, but you 
 get the idea.  It would provide a very easy way for the compiler to 
 prefer to keep 4-float vectors in SSE registers, pass them in 
 registers where appropriate in function calls, nuke lots of loads 
 and stores when inlining, etc.

 Having good, native SIMD support in D seems like a natural fit 
 (heck, it's got complex numbers built-in).

 Of course, there are some operations that the available SSE 
 intrinsics cover that the compiler can't expose via the typical 
 operators, so those still need to be supported somehow.  Does anyone 
 know if ldc or dmd has those, or if they'll optimize away SSE loads 
 and stores if I roll my own structs with asm blocks?  I saw from the 
 ldc source it had the usual llvm intrinsics, but as far as 
 hardware-specific codegen intrinsics I couldn't spot any.

 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like: x[] = ((y[] % 5) ^ 2) + z[];
Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high-level language not to be able to express it.
Mike, for more info on the supported operations you may want to refer to the Thermopylae excerpt:

http://erdani.com/d/thermopylae.pdf

Andrei
Nov 06 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Fri, Nov 6, 2009 at 11:29 AM, Don <nospam nospam.com> wrote:
 Hi Mike, Welcome to D!
 In the latest compiler release (ie, this morning!), fixed-length arrays have
 become value types. This is a big step: it means that (eg) float[4] can be
 returned from a function for the first time. On 32-bit, we're a bit limited
 in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this
 will mean that on 64 bit, we'll be able to define an ABI in which short
 static arrays are passed in SSE registers.

 Also, D has array operations.  If x, y, and z are int[4], then
 x[] = y[]*3 + z[];
 corresponds directly to SIMD operations. DMD doesn't do much with them yet
 (there's been so many language design issues that optimisation hasn't
 received much attention), but the language has definitely been planned with
 SIMD in mind.
But what about the question of direct support for SSE intrinsics? I don't see any in std.* but is there any reason not to, say, beef up std.intrinsics with such things? Is there any major hurdle to overcome? Seems like it would be useful to have. --bb
Nov 06 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Bill Baxter wrote:
 But what about the question of direct support for SSE intrinsics?
The following are directly supported:

    a[] = b[] + c[]
    a[] = b[] - c[]
    a[] = b[] + value
    a[] += value
    a[] += b[]
    a[] = b[] - value
    a[] = value - b[]
    a[] -= value
    a[] -= b[]
    a[] = b[] * value
    a[] = b[] * c[]
    a[] *= value
    a[] *= b[]
    a[] = b[] / value
    a[] /= value
    a[] -= b[] * value

CPU detection is done at runtime and picks which of none, mmx, sse, sse2 or amd3dnow instructions to use. Other operations are done using loop fusion.
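[Each line in Walter's list corresponds to a slice expression in ordinary D code; a short sketch exercising a few of them, with illustrative values only:]

```d
void main()
{
    auto a = new float[1024];
    auto b = new float[1024];
    auto c = new float[1024];
    b[] = 1.5f;               // fill with a scalar
    c[] = 2.0f;

    a[] = b[] + c[];          // a[] = b[] + c[]     -> every element 3.5
    a[] *= 2.0f;              // a[] *= value        -> 7.0
    a[] -= b[] * 0.5f;        // a[] -= b[] * value  -> 7.0 - 0.75 = 6.25

    assert(a[0] == 6.25f && a[1023] == 6.25f);
}
```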
Nov 06 2009
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Also, D has array operations.  If x, y, and z are int[4], then
 x[] = y[]*3 + z[];
 corresponds directly to SIMD operations. DMD doesn't do much with them 
 yet (there's been so many language design issues that optimisation 
 hasn't received much attention), but the language has definitely been 
 planned with SIMD in mind.
Many of the array operations do use the CPU vector operations.
Nov 06 2009
prev sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
Mike Farnsworth wrote:

...
 
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow.  Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks?  I saw from the ldc source it had the usual llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I couldn't
 spot any.
 
 Thanks,
 Mike Farnsworth
 
Have you seen this page?

http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

This is similar to gcc's extended inline asm expressions (gdc has them too). I'm not at all in the know about all this, but I think it will allow you to build something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline asm expressions work exactly, that would be great.
Nov 08 2009
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 08 Nov 2009 17:47:31 -0500, Lutger <lutger.blijdestijn gmail.com>  
wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow.  Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks?  I saw from the ldc source it had the usual  
 llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I  
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
Nov 08 2009
parent reply Michael Farnsworth <mike.farnsworth gmail.com> writes:
On 11/08/2009 06:35 PM, Robert Jacques wrote:
 On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
 <lutger.blijdestijn gmail.com> wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow. Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks? I saw from the ldc source it had the usual llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
I finally went and did a little homework, so sorry for the long reply that follows. I have been experimenting with both the ldc.llvmasm.__asm() function and with getting D's asm {} blocks to do what I want. So far, I have been able to get some SSE instructions in there, but I'm running into a few issues. For now I'm only using ldc, but I'll try out dmd eventually as well.

* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of me get it to inline the functions with the SSE asm statements.

* Overriding opAdd for a struct, I had a hard time getting it to not spit out what appears to me to be a lot of extra loading / stack code. In order to even get it to do what I wanted, I wrote it like this:

    Vector opAdd(Vector v)
    {
        Vector result = void;
        float* c0 = &c[0];
        float* vc0 = &v.c[0];
        float* rc0 = &result.c[0];
        asm
        {
            movaps XMM0,c0 ;
            movaps XMM1,vc0 ;
            addps XMM0,XMM1 ;
            movaps rc0,XMM0 ;
        }
        return result;
    }

And that ended up with the address-of code and stack stuff that isn't optimal.

* When I instead write a function like this:

    static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
    {
        asm
        {
            movaps XMM0,v1 ;
            movaps XMM1,v2 ;
            addps XMM0,XMM1 ;
            movaps result,XMM0 ;
        }
    }

where Vector is defined as:

    align(16) struct Vector
    {
    public:
        float[4] c;
    }

(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it inserted init code in there, probably because the compiler thought I hadn't actually touched the result, even though the assembly did its job. 'out' is a better contract description, so it'd be nice to know how to suppress that.)

With this I get fewer instructions in the function; but it still has an extraneous stack push/pop pair surrounding it, and it still won't inline for me where I call it. It's all of 8 instructions including the return, and any inlining scheme that thinks that merits a function call instead ought to be drug out and shot. =P

* I used __asm(T)(char[], char[], T) from ldc as well, but either I suck at getting LLVM to recognize my constraints, or ldc doesn't support SSE constraints yet; it just wouldn't take. I ended up going the D asm block route once I figured out how to give it addresses without taking the address of everything (using ref for struct arguments works great!).

So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D. But I will be happy to help out with patches/experiments to get to the goal of making D suitable for heavy SIMD calculations. I'm talking with the ldc guys about it, as LLVM should be able to make really good use of this stuff (especially intrinsics) once the frontend can hand it off suitably. I'm excited to work on a project like this, because if I get better at dealing with SIMD issues in the compiler I should be able to capitalize on it to make my math-heavy code even faster. Mmmm...speed...

-Mike
Nov 08 2009
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth  
<mike.farnsworth gmail.com> wrote:

 On 11/08/2009 06:35 PM, Robert Jacques wrote:
 On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
 <lutger.blijdestijn gmail.com> wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so  
 those
 still need to be supported somehow. Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my  
 own
 structs with asm blocks? I saw from the ldc source it had the usual  
 llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
 I finally went and did a little homework, so sorry for the long reply that
 follows. I have been experimenting with both the ldc.llvmasm.__asm()
 function, as well as getting D's asm {} to do what I want.

 ...
By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. D2 just changed fixed-size arrays to value types, which provide most of the functionality of a small vector struct. However, actual SSE optimization of these types is probably going to wait until x64 support, since a bunch of 32-bit chips don't support them.

P.S. For what it's worth, I do research which involves volumetric ray-tracing. I've always found memory to bottleneck computations. Also, why not look into CUDA/OpenCL/DirectCompute?
Nov 08 2009
parent reply Michael Farnsworth <mike.farnsworth gmail.com> writes:
On 11/08/2009 11:28 PM, Robert Jacques wrote:
 By design, D asm blocks are separated from the optimizer: no code
 motion, etc occurs. D2 just changed fixed sized arrays to value types,
 which provide most of the functionality of a small vector struct.
 However, actual SSE optimization of these types is probably going to
 wait until x64 support; since a bunch of 32-bit chips don't support them.

 P.S. For what it's worth, I do research which involves volumetric
 ray-tracing. I've always found memory to bottleneck computations. Also,
 why not look into CUDA/OpenCL/DirectCompute?
Yeah, I've discovered that having either the constraints-based __asm() from ldc or actual intrinsics probably makes optimization opportunities more frequent. But if it at least inlined the regular asm blocks for me, I'd be most of the way there.

The ldc guys tell me that they didn't include the llvm vector intrinsics already because they were going to need either a custom type in the frontend, or else the D2 fixed-size-arrays-as-value-types functionality. I might take a stab at some of that in ldc in the future to see if I can get it to work, but I'm not an expert in compilers by any stretch of the imagination.

-Mike

PS: As for trying CUDA/OpenCL/DirectCompute, I haven't gotten into it much for a few reasons:

* The standards and APIs are still evolving.

* I refuse to pigeon-hole myself into Windows (I'm typing this from a Fedora 11 box, and at work we're a linux shop doing movie VFX).

* Larrabee (yes, yes, semi-vaporware until Intel gets their crap together) will allow something much closer to standard CPU code. I really think that's the direction the GPU makers are heading in general, so why hobble myself with cruddy GPU memory/threading models to code around right now?

* GPUs keep changing, and every change brings with it subtle (and sometimes drastic) effects on your code's performance and results from card to card. It's a nightmare to maintain, and every project we've done trying to do production rendering stuff on GPU (even just relighting) has ended in tears and gnashing of teeth. Everyone eventually throws up their hands and goes back to optimized CPU rendering in the VFX industry (Pixar, ILM, and Tippett have all done that, just to name a few).

Good, solid general-purpose CPUs with caches, decently wide SIMD with scatter/gather, and plenty of hardware threads are the wave of the future. (Or was that the past? I can't remember.) GPUs are slowly converging back to that, except that currently they have a programmer-managed cache (texture mem), and they execute multiple threads concurrently over the same instructions in groups (warps, in CUDA-speak). They'll eventually add the 'feature' of a more automatically-managed cache, and better memory throughput when allowing warps to be smaller and more flexible. And then they'll look nearly identical to all the multi-core CPUs again.
Nov 09 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Michael Farnsworth wrote:
 The ldc guys tell me that they didn't 
 include the llvm vector intrinsics already because they were going to 
 need either a custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a stab at 
 some of that in ldc in the future to see if I can get it to work, but 
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Nov 09 2009
parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't 
 include the llvm vector intrinsics already because they were going to 
 need either a custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a stab at 
 some of that in ldc in the future to see if I can get it to work, but 
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra methods wrapping additional intrinsics, for example).

There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there, with no need to change the language or declare custom types (minus some alignment to help it along, perhaps). The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up in.

Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.

-Mike
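[To make the lowering Mike describes concrete, here is a hand-written sketch in C of roughly what a compiler could emit for the D array expression a[] = b[] * c. The intrinsic names are the standard SSE1 ones from xmmintrin.h; the loop structure and the function name scale4 are only illustrative.]

```c
#include <xmmintrin.h>  /* SSE1 intrinsics (GCC/Clang/MSVC on x86) */

/* A sketch of the code a compiler could emit for the D array
   expression a[] = b[] * c.  Processes 4 floats per iteration and
   assumes n is a multiple of 4; a real lowering would also emit a
   scalar tail loop and prefer aligned loads where it can prove
   alignment. */
void scale4(float *a, const float *b, float c, int n)
{
    __m128 vc = _mm_set1_ps(c);               /* broadcast the scalar */
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);      /* load 4 floats */
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
}
```

The struct-wrapper case Mike mentions would then reduce to the compiler seeing through the wrapper to the same underlying float array and emitting the same instructions.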
Nov 09 2009
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 Walter Bright Wrote:
 
 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't include the llvm vector
 intrinsics already because they were going to need either a
 custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a
 stab at some of that in ldc in the future to see if I can get it
 to work, but I'm not an expert in compilers by any stretch of the
 imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean?
Sure. Consider the code:

    for (int i = 0; i < 100; i++)
        array[i] = 0;

It takes a fair amount of work for a compiler to deduce "aha!" this code is intended to clear the array! The compiler then replaces the loop with:

    memset(array, 0, 100 * sizeof(array[0]));

In D, you can specify the array operation at a high level:

    array[0..100] = 0;

In other words, a language is supposed to represent high level concepts and the compiler breaks it down into low level ones supported by the machine. With vector operations, etc., the language supports only the low level operations and the compiler must reconstruct the high level operations supported by the machine. This inversion of roles is bizarre.
Nov 09 2009
prev sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Mon, Nov 9, 2009 at 1:56 PM, Mike Farnsworth
<mike.farnsworth gmail.com> wrote:
 Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't
 include the llvm vector intrinsics already because they were going to
 need either a custom type in the frontend, or else the D2
 fixed-size-arrays-as-value-types functionality.  I might take a stab at
 some of that in ldc in the future to see if I can get it to work, but
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
 Can you elaborate a bit on what you mean?  If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible?  It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra method wrapping additional intrinsics, for example).
 There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there with no need to change the language or declare custom types (minus some alignment to help it along, perhaps).  The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up.
 Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.

I think what he's saying is use array expressions like a[] = b[] + c[] and let the compiler take care of it, instead of trying to write SSE yourself. I haven't tried, but does this kind of thing turn into SSE and get inlined?

    struct Vec3
    {
        float v[3];
        void opAddAssign(ref Vec3 o) { this.v[] += o.v[]; }
    }

If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)

--bb
Nov 09 2009
parent reply Don <nospam nospam.com> writes:
Bill Baxter wrote:
 On Mon, Nov 9, 2009 at 1:56 PM, Mike Farnsworth
 <mike.farnsworth gmail.com> wrote:
 Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't
 include the llvm vector intrinsics already because they were going to
 need either a custom type in the frontend, or else the D2
 fixed-size-arrays-as-value-types functionality.  I might take a stab at
 some of that in ldc in the future to see if I can get it to work, but
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra method wrapping additional intrinsics, for example). There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there with no need to change the language or declare custom types (minus some alignment to help it along, perhaps). The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up. Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.
 I think what he's saying is use array expressions like a[] = b[] + c[] and let the compiler take care of it, instead of trying to write SSE yourself. I haven't tried, but does this kind of thing turn into SSE and get inlined?

     struct Vec3
     {
         float v[3];
         void opAddAssign(ref Vec3 o) { this.v[] += o.v[]; }
     }

 If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)
 --bb
The bad news: the DMD back-end is a state-of-the-art backend from the late 90's. Despite its age, its treatment of integer operations is, in general, still quite respectable. However, it _never_ generates SSE instructions. Ever. Array operations _are_ detected, but they become calls to library functions which use SSE if available. That's not bad for moderately large arrays -- 200 elements or so -- but of course it's completely non-optimal for short arrays.

The good news: now that static arrays are passed by value, introducing inline SSE support for short arrays suddenly makes a lot of sense -- there can be a big performance benefit for a small backend change, and it could be done without introducing SSE anywhere else. Most importantly, it doesn't require any auto-vectorisation support.
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
 However, it _never_ generates SSE 
 instructions. Ever. However, array operations _are_ detected, and they 
 become calls to library functions which use SSE if available. That's 
 not bad for moderately large arrays -- 200 elements or so -- but of 
 course it's completely non-optimal for short arrays.
 
 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, means the code will only run on modern chips. Whether or not this is a problem depends on your application.
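[A hedged sketch in C of the runtime switch Walter describes: detect the CPU once at startup, then call the tuned code through a function pointer thereafter. The function names are invented for illustration; __builtin_cpu_supports is a real GCC/Clang builtin on x86.]

```c
/* Baseline implementation, compiled with no special flags. */
static int sum_scalar(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stand-in for a variant compiled with SSE2 enabled; a real library
   would use intrinsics in its own translation unit here. */
static int sum_sse2(const int *a, int n)
{
    return sum_scalar(a, n);
}

static int (*sum)(const int *, int);   /* the dispatched entry point */

/* Run once at startup: pick the implementation the CPU can execute. */
void init_dispatch(void)
{
    sum = __builtin_cpu_supports("sse2") ? sum_sse2 : sum_scalar;
}
```

After init_dispatch(), every call goes through sum() with no further capability checks, which is why the per-call overhead only matters for very small operations.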
Nov 10 2009
next sibling parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
Yup. The only integer operation modern compilers still don't do well is -- array operations!
 However, it _never_ generates SSE instructions. Ever. However, array 
 operations _are_ detected, and they become calls to library 
 functions which use SSE if available. That's not bad for moderately 
 large arrays -- 200 elements or so -- but of course it's completely 
 non-optimal for short arrays.

 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, means the code will only run on modern chips. Whether or not this is a problem depends on your application.
I'd say it's not a problem to use MMX or even SSE1. It's really, really difficult to find a processor that doesn't support them. I've tried. I've really tried. I don't think many are still around: they all have motherboards which require really small hard disks that you can no longer buy. Certainly no-one is putting new software on them.

Earlier this year I had to install Windows 3.1 (!!!) on an ancient PC at work, to support an ancient but expensive bit of lab equipment. Even it was a Pentium II. Getting the spare parts for it was a nightmare*; we had to ship them from 600 km away. Hard disks just don't last that long.

SSE2 is a different story, since AMD never made a 32-bit CPU with SSE2.

*Actually it was more of a horror comedy. It was hard to take it seriously.
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I'd say it's not a problem to use MMX or even SSE1. It's really, really 
 difficult to find a processor that doesn't support them. I've tried. 
 I've really tried. I don't think many are still around: they all have 
 motherboards which require really small hard disks that you can no 
 longer buy. Certainly no-one is putting new software on them.
 Earlier this year I had to install Windows3.1 (!!!) on an ancient PC at 
 work, to support an ancient but expensive bit of lab equipment. Even it 
 was a Pentium II. Getting the spare parts for it was a nightmare*; we 
 had to ship them from 600km away. Hard disks just don't last that long.
I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away). 10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.
Nov 10 2009
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Tue, Nov 10, 2009 at 12:06:08PM -0800, Walter Bright wrote:
 I do have a working Pentium around here somewhere.
I actually still use my Pentium 1 computers. I have three of them: one works as a thin terminal to my newer computer; one is my secondary main computer (if/when my main computer decides to quit working and I'm waiting on replacement parts, I go back to the old box - it was my main from 1996 through to 2005! I also use it for the occasional multiplayer game); and the last one I still use to host some small, low-traffic websites.

I don't buy into the "omg must be bleeding edge or else" philosophy. I'll work these computers until their parts fail entirely! But, I'm surely in the minority. Heck, I still sometimes write 16 bit DOS code for those computers!

If DMD starts outputting fancier code, that's awesome for the 99% of cases where it is fine; I'd just request a compiler switch in there to turn it back to the old behaviour for the <1% of cases where we don't want it.

-- 
Adam D. Ruppe
http://arsdnet.net
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Adam D. Ruppe wrote:
 If DMD starts outputting fancier code, that's awesome for the 99% of cases
 where it is fine, I'd just request a compiler switch in there to turn it
 back to old behaviour for the <1% of cases where we don't want it.
Interestingly, dmd does a very good job of Pentium instruction scheduling. I thought that was hopelessly obsolete, although it didn't actually hurt anything, so no worries. But it turns out that the Intel Atom benefits a lot from Pentium style scheduling, and no other compiler seems to support that anymore!
Nov 10 2009
prev sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Walter Bright wrote:

 Don wrote:
 I'd say it's not a problem to use MMX or even SSE1. It's really, really
 difficult to find a processor that doesn't support them. I've tried.
 I've really tried. I don't think many are still around: they all have
 motherboards which require really small hard disks that you can no
 longer buy. Certainly no-one is putting new software on them.
 Earlier this year I had to install Windows3.1 (!!!) on an ancient PC at
 work, to support an ancient but expensive bit of lab equipment. Even it
 was a Pentium II. Getting the spare parts for it was a nightmare*; we
 had to ship them from 600km away. Hard disks just don't last that long.
I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away). 10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.
Until recently my stepdad still had his 8086 setup to interface with an old-school velotype keyboard. I even built a nasty old 5 1/4 inch floppy drive into his shiny dualcore rig, which he used to transfer plain text files between the two machines. It worked fine.
Nov 10 2009
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Modern compilers don't do much better. The point of diminishing returns 
 was clearly reached.
I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD. Today CPUs don't get faster and faster as in the past, so a 250% improvement coming just from the compiler is not something you want to ignore. And more optimizations for LLVM are planned (like auto-vectorization, better inlining of function pointers, better de-virtualization and inlining of virtual class methods, partial compilation, super compilation of tiny chunks of code, and quite a bit more).

Another thing to take into account is that today you often don't want to program in C-like languages but in higher-level ones: Python, Fortress, etc. Such higher level languages offer challenges to the optimizers that were not present in the past. For example, most optimizations done by the Just-In-Time compiler for Lua were not needed by a C compiler. Today there are many people that want to program in Lua or Python or JavaScript instead of C, so they need a quite more refined optimizer and compiler, like LuaJIT2 or Unladen Swallow or V8.

You also have new, smaller challenges created by multi-core CPUs and languages that are functional and immutable-based. That's why modern compilers are quickly improving today too, and we still need such improvements.

Bye,
bearophile
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 
 Modern compilers don't do much better. The point of diminishing
 returns was clearly reached.
I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD.
Have to be careful about benchmarks without looking at why. A few months ago, a benchmark was posted here purportedly showing that dmd was awful at integer math. Turns out, the problem was entirely in the long divide function, not the code generator at all. I rewrote the long divide helper function, and problem solved.
Nov 10 2009
prev sibling next sibling parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
 However, it _never_ generates SSE 
 instructions. Ever. However, array operations _are_ detected, and they 
 become to calls to library functions which use SSE if available. That's 
 not bad for moderately large arrays -- 200 elements or so -- but of 
 course it's completely non-optimal for short arrays.
 
 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.
For my purposes, runtime detection is probably out the window, unless the tests for it can happen infrequently enough to reduce the overhead. There are too many SSE variations to switch on them all, and they incrementally provide better and better functionality that I could make use of. I'd rather compile different executables for different hardware and distribute them all (e.g. detect the SSE version at compile time).

Really, high performance graphics is an exercise in getting tightly vectorized code to inline appropriately, eliminating as many loads and stores as possible, and then on top of that building algorithms that don't suck in runtime or memory/cache complexity. Often in computer graphics you end up distilling a huge number of operations down to SIMD instructions that are very highly threaded and have (hopefully) minimal I/O. If you introduce any extra overhead for getting to those SIMD instructions, you usually take a measurable throughput hit.

I'd like to see D give me a much better mix of high throughput + high coding productivity. As it stands, I've got high throughput + medium coding productivity in C++. I've started looking at some ldc code to lurch towards this goal, and if there is something I can look at in dmd2 itself to help out, I'd love to. Just point me where you think I ought to start.

-Mike
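[The compile-time detection Mike prefers is what C compilers already expose through predefined macros; a minimal sketch, assuming a GCC/Clang-style compiler where -msse/-msse2 (or any 64-bit x86 target) predefine __SSE__/__SSE2__. The helper name simd_level is invented for illustration.]

```c
/* Pick the widest instruction set the compiler was told to target.
   Building the same source once per target flag (-msse, -msse2, ...)
   yields one executable per hardware class, with zero runtime
   detection cost, as described above. */
#if defined(__SSE2__)
#  define SIMD_LEVEL "sse2"
#elif defined(__SSE__)
#  define SIMD_LEVEL "sse1"
#else
#  define SIMD_LEVEL "scalar"
#endif

const char *simd_level(void)
{
    return SIMD_LEVEL;
}
```

Each build's SIMD code paths can then be selected with the same #if tests, so the unsupported instructions never appear in that binary at all.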
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 For my purposes, runtime detection is probably out the window, unless
 the tests for it can happen infrequently enough to reduce the
 overhead.  There are too many SSE variations to switch on them all,
 and they incrementally provide better and better functionality that I
 could make use of.  I'd rather compile different executables for
 different hardware and distribute them all (e.g. detect the SSE
 version at compile time).  Really, high performance graphics is an
 exercise in getting tightly vectorized code to inline appropriately,
 eliminate as many loads and stores as possible, and then on top of
 that build algorithms that don't suck in runtime or memory/cache
 complexity.
The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.
Nov 10 2009
parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Mike Farnsworth wrote:
 For my purposes, runtime detection is probably out the window, unless
 the tests for it can happen infrequently enough to reduce the
 overhead.  There are too many SSE variations to switch on them all,
 and they incrementally provide better and better functionality that I
 could make use of.  I'd rather compile different executables for
 different hardware and distribute them all (e.g. detect the SSE
 version at compile time).  Really, high performance graphics is an
 exercise in getting tightly vectorized code to inline appropriately,
 eliminate as many loads and stores as possible, and then on top of
 that build algorithms that don't suck in runtime or memory/cache
 complexity.
The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.
Was it actually rewriting the executable code to call the alternate functions (e.g. an exe load-time decision: patch the code in memory, then run)? I thought that sort of thing would run into all sorts of runtime linker issues (read-only code pages in memory, shared libs that also need the rewriting, etc.), but then again, they do that with JIT compiling all the time. Does dmd already have some of this capability hanging around (but not used yet)?

-Mike
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 Was it actually rewriting the executable code to call the alternate
 functions (e.g. a exe load time decision, patch the code in memory,
 and then run)?  I thought that sort of thing would run into all sorts
 of runtime linker issues (ro code pages in memory, shared libs that
 also need the rewriting, etc.), but then again, they do that with JIT
 compiling all the time.
It's much simpler than that. Some C:

=========================================
void foo_with_FPU();
void foo_without_FPU();

void (*foo)();

void main()
{
    int has_fp = doesCPUhaveFPU();
    if (has_fp)
        foo = &foo_with_FPU;
    else
        foo = &foo_without_FPU;
    ... execute app ...
    (*foo)();
    ... execute more app ...
}
=========================================
#if WITH_FPU
#define FOO foo_with_FPU
#else
#define FOO foo_without_FPU
#endif

void FOO()
{
    ... do some floating point calculations ...
}
==========================================
dmc -DWITH_FPU -c foo.c -f -ofoo_with_fpu.obj
dmc -c foo.c -ofoo_without_fpu.obj
dmc app.obj foo_with_fpu.obj foo_without_fpu.obj
===========================================

Hope that makes it clearer. No runtime linking, no runtime compiling, no self-modifying code, etc.

A better way to do it is to put your FP code behind a class interface, then have derived classes implement them, compiled with different instruction set options. At runtime, decide which derived class to use.
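[The class-interface variant Walter mentions at the end might look like this in C++. All names here are invented for illustration; in a real build each derived class's translation unit would be compiled with different instruction-set flags, and the SSE variant would contain intrinsics.]

```cpp
#include <memory>

// Hypothetical interface: the FP kernels live behind virtual calls,
// with one derived class per instruction set.
struct VecMath {
    virtual ~VecMath() {}
    virtual float dot4(const float *a, const float *b) const = 0;
};

// Baseline, compiled with no special flags.
struct ScalarMath : VecMath {
    float dot4(const float *a, const float *b) const override {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }
};

// Stand-in for a variant compiled with SSE enabled; here it merely
// inherits the scalar implementation.
struct SseMath : ScalarMath {};

// Decide once at startup which implementation to use, based on a
// CPU-capability check done elsewhere.
std::unique_ptr<VecMath> makeVecMath(bool cpuHasSse) {
    if (cpuHasSse)
        return std::unique_ptr<VecMath>(new SseMath);
    return std::unique_ptr<VecMath>(new ScalarMath);
}
```

The trade-off versus the function-pointer scheme is a virtual call per kernel invocation, which is why both approaches work best when each call does a reasonably large batch of work.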
Nov 10 2009
prev sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Walter Bright wrote:
 
 ... To generate the code directly, assuming the existence of SSE,
 means the code will only run on modern chips. Whether or not this
 is a problem depends on your application.
If MMX/SSE/SSE2 optimizations are low-hanging fruit, I'd at least like to have an -sse (and maybe -sse2, -sse3, and -no-sse) switch for the compiler to determine whether the compiler emits those instructions or not.

I'm also wondering if a more ideal approach (and perhaps an additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary. That way feature detection doesn't happen while the program itself is running, and thus doesn't slow down the computations as they happen. Then passing -sse* would cause it to not emit the bootstrap, but instead just assume that the instructions will be available.
Nov 10 2009
parent Mike Farnsworth <mike.farnsworth gmail.com> writes:
Chad J Wrote:

 Walter Bright wrote:
 
 ... To generate the code directly, assuming the existence of SSE,
 means the code will only run on modern chips. Whether or not this
 is a problem depends on your application.
If MMX/SSE/SSE2 optimizations are low-lying fruit, I'd at least like to have an -sse (and maybe -sse2, -sse3, and -no-sse) switch for the compiler to determine whether the compiler emits those instructions or not. I'm also wondering if a more ideal approach (and perhaps additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary. That way feature detection doesn't happen while the program itself is running, and thus doesn't slow down the computations as they happen. Then passing -sse* would cause it to not emit the bootstrap, but instead just assume that the instructions will be available.
Incidentally, if you use LLVM to compile to their bitcode, you can at runtime do exactly this sort of thing based on the host hardware, selecting opt passes and having it run codegen based on your exact hardware. As long as using a given intrinsic falls through to the right glue code where it isn't supported, or else you let the compiler deduce where to use the fancier instructions (not as likely to happen), that works out nicely. -Mike
Nov 10 2009