
digitalmars.D - SIMD/intrinsics questions

reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Hey all,

The other day someone pointed me to Andrei's article in DDJ, and I dove
headlong into researching D and what it is capable of.  I had only seen it
referred to a few times with respect to template metaprogramming and that crazy
compile-time ray tracer, but I have to say I've been very impressed with what
I've seen, especially with D2.

A bit of background:  I work in the movie VFX industry, and worked in games
development previously, and I have my own ray tracer that I experiment with
(see http://renderspud.blogspot.com/ for info).  Back in college the

better version, and now I've slowly been converting it to C++ again with SSE
support (getting to the SOA ray packet form soon, I hope) so that it doesn't
suck speed-wise.  Anyway, long story short, SIMD is really important to me.

In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).

In the alternative, is it possible to support something along the lines of
gcc's vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));
typedef float v4sf __attribute__ ((vector_size (16)));

where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.

Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).

Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.

Thanks,
Mike Farnsworth
Nov 06 2009
next sibling parent reply Don <nospam nospam.com> writes:
Mike Farnsworth wrote:
 Hey all,
 
 The other day someone pointed me to Andrei's article in DDJ, and I dove
headlong into researching D and what it is capable of.  I had only seen it
referred to a few times with respect to template metaprogramming and that crazy
compile-time ray tracer, but I have to say I've been very impressed with what
I've seen, especially with D2.
 
 A bit of background:  I work in the movie VFX industry, and worked in games
development previously, and I have my own ray tracer that I experiment with
(see http://renderspud.blogspot.com/ for info).  Back in college the

better version, and now I've slowly been converting it to C++ again with SSE
support (getting to the SOA ray packet form soon, I hope) so that it doesn't
suck speed-wise.  Anyway, long story short, SIMD is really important to me.
 
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).
 
 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:
 
 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));
 
 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.
 
 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).
 
 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.
 
 Thanks,
 Mike Farnsworth
Hi Mike, welcome to D!

In the latest compiler release (i.e., this morning's!), fixed-length arrays have become value types. This is a big step: it means that (e.g.) float[4] can be returned from a function for the first time. On 32-bit we're a bit limited in SSE support (e.g., since *no* 32-bit AMD processors have SSE2) -- but it means that on 64-bit we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there have been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
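[A minimal sketch of the two features Don describes -- value-type static arrays and array operations -- assuming a D2 compiler from this release or later; the function and variable names are illustrative only:]

```d
// Static arrays are now value types, so float[4] can be returned by value,
// and slice expressions apply operators element-wise.
float[4] scaleAdd(float[4] y, float[4] z)
{
    float[4] x;
    x[] = y[] * 3 + z[];  // element-wise; a compiler may lower this to SIMD
    return x;             // returned as a value (a copy), not a pointer
}

void main()
{
    float[4] y = [1, 2, 3, 4];
    float[4] z = [10, 20, 30, 40];
    assert(scaleAdd(y, z) == [13f, 26, 39, 52]);
}
```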
Nov 06 2009
next sibling parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Don Wrote:

 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).
 
 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:
 
 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));
 
 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.
 
 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).
 
 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.
 
 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like:

    x[] = ((y[] % 5) ^ 2) + z[];

Would that also work? (Sorry, I should test it myself, but I'm at work and haven't had time to get D tools installed yet, so I'm flying blind.)

On another note, I'm aware that the latest gcc versions have pretty good SIMD auto-vectorization, so I assume that will eventually be in the cards for dmd. As for ldc, that is pretty much dependent on llvm itself, and that doesn't have auto-vectorization of code yet AFAIK.

Anyone familiar with ldc have any idea about getting optimized asm and/or SSE intrinsics to do the right thing? As soon as I have some time, I'll stop being lazy and actually go try some of this stuff out myself and see what the compiled asm looks like, but if anyone has already figured out the answers I can stay lazy.

If it comes down to me needing to create some x86 asm in structs to get some initial SSE-based vector types working, I'll do that and share with the class. I'm not amazing with that stuff, but it could serve as a poor man's stopgap until the compilers mature a bit in this regard.

-Mike
Nov 06 2009
parent reply Don <nospam nospam.com> writes:
Mike Farnsworth wrote:
 Don Wrote:
 
 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD intrinsics?  I
realize that I could write some asm blocks, but that means each operation
(vector add, sub, mul, dot product, etc.) would need to probably include a
prelude and postlude with loads and stores.  I worry that this will not get
optimized away (unless I don't use 'naked'?).

 In the alternative, is it possible to support something along the lines of
gcc's vector extensions:

 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));

 where the compiler will automatically generate opAdd, etc. for those types? 
I'm not suggesting using gcc's syntax, of course, but you get the idea.  It
would provide a very easy way for the compiler to prefer to keep 4-float
vectors in SSE registers, pass them in registers where appropriate in function
calls, nuke lots of loads and stores when inlining, etc.

 Having good, native SIMD support in D seems like a natural fit (heck, it's got
complex numbers built-in).

 Of course, there are some operations that the available SSE intrinsics cover
that the compiler can't expose via the typical operators, so those still need
to be supported somehow.  Does anyone know if ldc or dmd has those, or if
they'll optimize away SSE loads and stores if I roll my own structs with asm
blocks?  I saw from the ldc source it had the usual llvm intrinsics, but as far
as hardware-specific codegen intrinsics I couldn't spot any.

 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like: x[] = ((y[] % 5) ^ 2) + z[];
Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high-level language not to be able to express it.
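[One caveat worth noting for readers: ^ in D is bitwise XOR, not exponentiation, so the expression is well-defined but may not mean what a newcomer expects. A small sketch, assuming a D2 compiler that accepts the nested array operations as Don confirms:]

```d
void main()
{
    int[] x = new int[4];     // destination must already have the right length
    int[] y = [7, 12, 3, 9];
    int[] z = [1, 1, 1, 1];

    // %, ^ (bitwise XOR, not exponentiation), and + all apply element-wise:
    x[] = ((y[] % 5) ^ 2) + z[];

    // first element: 7 % 5 = 2, then 2 ^ 2 = 0, then 0 + 1 = 1
    assert(x == [1, 1, 2, 7]);
}
```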
 Would that also work?  (Sorry, I should test it myself, but I'm at work and
haven't had time to get D tools installed yet and so am flying blind.)
 
 On another note, I'm aware that the latest gcc versions have pretty good SIMD
auto-vectorization, so I assume that will eventually be in the cards for dmd. 
 As for ldc, that is pretty much dependent on llvm itself, and that doesn't have
auto-vectorization of code yet AFAIK.
 
 Anyone familiar with ldc have any idea about getting optimized asm and/or SSE
intrinsics to do the right thing?  As soon as I have some time, I'll stop being
lazy and actually go try some of this stuff out myself and see what the
compiled asm looks like, but if anyone has already figured out the answers I
can stay lazy.
 
 If it comes down to me needing to create some x86 asm in structs to get some
initial SSE-based vector types working, I'll do that and share with the class. 
I'm not amazing with that stuff, but it could serve as a poor-man's stopgap
until the compilers mature a bit in this regard.
Yes, lots of stuff that should work doesn't yet. The emphasis has been on getting the fundamentals solid. There's a lot of activity planned -- in fact, I'm improving the compiler support for operator overloading right now.
Nov 06 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Mike Farnsworth wrote:
 Don Wrote:

 Mike Farnsworth wrote:
 In dmd and ldc, is there any support for SSE or other SIMD 
 intrinsics?  I realize that I could write some asm blocks, but that 
 means each operation (vector add, sub, mul, dot product, etc.) would 
 need to probably include a prelude and postlude with loads and 
 stores.  I worry that this will not get optimized away (unless I 
 don't use 'naked'?).

 In the alternative, is it possible to support something along the 
 lines of gcc's vector extensions:

 typedef int v4si __attribute__ ((vector_size (16)));
 typedef float v4sf __attribute__ ((vector_size (16)));

 where the compiler will automatically generate opAdd, etc. for those 
 types?  I'm not suggesting using gcc's syntax, of course, but you 
 get the idea.  It would provide a very easy way for the compiler to 
 prefer to keep 4-float vectors in SSE registers, pass them in 
 registers where appropriate in function calls, nuke lots of loads 
 and stores when inlining, etc.

 Having good, native SIMD support in D seems like a natural fit 
 (heck, it's got complex numbers built-in).

 Of course, there are some operations that the available SSE 
 intrinsics cover that the compiler can't expose via the typical 
 operators, so those still need to be supported somehow.  Does anyone 
 know if ldc or dmd has those, or if they'll optimize away SSE loads 
 and stores if I roll my own structs with asm blocks?  I saw from the 
 ldc source it had the usual llvm intrinsics, but as far as 
 hardware-specific codegen intrinsics I couldn't spot any.

 Thanks,
 Mike Farnsworth
Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like: x[] = ((y[] % 5) ^ 2) + z[];
Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high-level language not to be able to express it.
Mike, for more info on the supported operations you may want to refer to the Thermopylae excerpt:

http://erdani.com/d/thermopylae.pdf

Andrei
Nov 06 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Fri, Nov 6, 2009 at 11:29 AM, Don <nospam nospam.com> wrote:
 Hi Mike, Welcome to D!
 In the latest compiler release (ie, this morning!), fixed-length arrays have
 become value types. This is a big step: it means that (eg) float[4] can be
 returned from a function for the first time. On 32-bit, we're a bit limited
 in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this
 will mean that on 64 bit, we'll be able to define an ABI in which short
 static arrays are passed in SSE registers.

 Also, D has array operations.  If x, y, and z are int[4], then
 x[] = y[]*3 + z[];
 corresponds directly to SIMD operations. DMD doesn't do much with them yet
 (there's been so many language design issues that optimisation hasn't
 received much attention), but the language has definitely been planned with
 SIMD in mind.
But what about the question of direct support for SSE intrinsics? I don't see any in std.* but is there any reason not to, say, beef up std.intrinsics with such things? Is there any major hurdle to overcome? Seems like it would be useful to have. --bb
Nov 06 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Bill Baxter wrote:
 But what about the question of direct support for SSE intrinsics?
The following are directly supported:

    a[] = b[] + c[]
    a[] = b[] - c[]
    a[] = b[] + value
    a[] += value
    a[] += b[]
    a[] = b[] - value
    a[] = value - b[]
    a[] -= value
    a[] -= b[]
    a[] = b[] * value
    a[] = b[] * c[]
    a[] *= value
    a[] *= b[]
    a[] = b[] / value
    a[] /= value
    a[] -= b[] * value

CPU detection is done at runtime and picks which of none, mmx, sse, sse2 or amd3dnow instructions to use. Other operations are done using loop fusion.
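[Each line in Walter's list corresponds to a slice expression in ordinary D code; a short sketch exercising a few of them, with illustrative values only:]

```d
void main()
{
    auto a = new float[1024];
    auto b = new float[1024];
    auto c = new float[1024];
    b[] = 1.5f;               // fill with a scalar
    c[] = 2.0f;

    a[] = b[] + c[];          // a[] = b[] + c[]     -> every element 3.5
    a[] *= 2.0f;              // a[] *= value        -> 7.0
    a[] -= b[] * 0.5f;        // a[] -= b[] * value  -> 7.0 - 0.75 = 6.25

    assert(a[0] == 6.25f && a[1023] == 6.25f);
}
```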
Nov 06 2009
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Also, D has array operations.  If x, y, and z are int[4], then
 x[] = y[]*3 + z[];
 corresponds directly to SIMD operations. DMD doesn't do much with them 
 yet (there's been so many language design issues that optimisation 
 hasn't received much attention), but the language has definitely been 
 planned with SIMD in mind.
Many of the array operations do use the CPU vector operations.
Nov 06 2009
prev sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
Mike Farnsworth wrote:

...
 
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow.  Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks?  I saw from the ldc source it had the usual llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I couldn't
 spot any.
 
 Thanks,
 Mike Farnsworth
 
Have you seen this page?

http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

This is similar to gcc's extended inline asm expressions (gdc has them too). I'm not at all in the know about all this, but I think it will allow you to build something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline asm expressions work exactly, that would be great.
Nov 08 2009
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 08 Nov 2009 17:47:31 -0500, Lutger <lutger.blijdestijn gmail.com>  
wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow.  Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks?  I saw from the ldc source it had the usual  
 llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I  
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
Nov 08 2009
parent reply Michael Farnsworth <mike.farnsworth gmail.com> writes:
On 11/08/2009 06:35 PM, Robert Jacques wrote:
 On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
 <lutger.blijdestijn gmail.com> wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so those
 still need to be supported somehow. Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my own
 structs with asm blocks? I saw from the ldc source it had the usual llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
I finally went and did a little homework, so sorry for the long reply that follows. I have been experimenting with both the ldc.llvmasm.__asm() function and with getting D's asm {} blocks to do what I want. So far, I have been able to get some SSE instructions in there, but I'm running into a few issues. For now I'm only using ldc, but I'll try out dmd eventually as well.

* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of me get it to inline the functions with the SSE asm statements.

* Overriding opAdd for a struct, I had a hard time getting it to not spit out what appears to me to be a lot of extra loading / stack code. In order to even get it to do what I wanted, I wrote it like this:

    Vector opAdd(Vector v)
    {
        Vector result = void;
        float* c0 = &c[0];
        float* vc0 = &v.c[0];
        float* rc0 = &result.c[0];
        asm
        {
            movaps XMM0,c0 ;
            movaps XMM1,vc0 ;
            addps XMM0,XMM1 ;
            movaps rc0,XMM0 ;
        }
        return result;
    }

And that ended up with the address-of code and stack stuff that isn't optimal.

* When I instead write a function like this:

    static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
    {
        asm
        {
            movaps XMM0,v1 ;
            movaps XMM1,v2 ;
            addps XMM0,XMM1 ;
            movaps result,XMM0 ;
        }
    }

where Vector is defined as:

    align(16) struct Vector
    {
    public:
        float[4] c;
    }

(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it inserted init code in there, probably because the compiler thought I hadn't actually touched the result, even though the assembly did its job. 'out' is a better contract description, so it'd be nice to know how to suppress that.)

With this I get fewer instructions in the function; but it still has an extraneous stack push/pop pair surrounding it, and it still won't inline for me where I call it. It's all of 8 instructions including the return, and any inlining scheme that thinks that merits a function call instead ought to be drug out and shot. =P

* I used __asm(T)(char[], char[], T) from ldc as well, but either I suck at getting LLVM to recognize my constraints, or ldc doesn't support SSE constraints yet; it just wouldn't take. I ended up going the D asm block route once I figured out how to give it addresses without taking the address of everything (using ref for struct arguments works great!).

So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D. But I will be happy to help out with patches/experiments to get to the goal of making D suitable for heavy SIMD calculations. I'm talking with the ldc guys about it, as LLVM should be able to make really good use of this stuff (especially intrinsics) once the frontend can hand it off suitably. I'm excited to work on a project like this, because if I get better at dealing with SIMD issues in the compiler I should be able to capitalize on it to make my math-heavy code even faster. Mmmm...speed...

-Mike
Nov 08 2009
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth  
<mike.farnsworth gmail.com> wrote:

 On 11/08/2009 06:35 PM, Robert Jacques wrote:
 On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
 <lutger.blijdestijn gmail.com> wrote:

 Mike Farnsworth wrote:

 ...
 Of course, there are some operations that the available SSE intrinsics
 cover that the compiler can't expose via the typical operators, so  
 those
 still need to be supported somehow. Does anyone know if ldc or dmd has
 those, or if they'll optimize away SSE loads and stores if I roll my  
 own
 structs with asm blocks? I saw from the ldc source it had the usual  
 llvm
 intrinsics, but as far as hardware-specific codegen intrinsics I
 couldn't
 spot any.

 Thanks,
 Mike Farnsworth
Have you seen this page? http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to built something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
 I finally went and did a little homework, so sorry for the long reply that
 follows. I have been experimenting with both the ldc.llvmasm.__asm()
 function, as well as getting D's asm {} to do what I want.

 ...
By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. D2 just changed fixed-size arrays to value types, which provide most of the functionality of a small vector struct. However, actual SSE optimization of these types is probably going to wait until x64 support, since a bunch of 32-bit chips don't support them.

P.S. For what it's worth, I do research which involves volumetric ray-tracing. I've always found memory to bottleneck computations. Also, why not look into CUDA/OpenCL/DirectCompute?
Nov 08 2009
parent reply Michael Farnsworth <mike.farnsworth gmail.com> writes:
On 11/08/2009 11:28 PM, Robert Jacques wrote:
 By design, D asm blocks are separated from the optimizer: no code
 motion, etc occurs. D2 just changed fixed sized arrays to value types,
 which provide most of the functionality of a small vector struct.
 However, actual SSE optimization of these types is probably going to
 wait until x64 support; since a bunch of 32-bit chips don't support them.

 P.S. For what it's worth, I do research which involves volumetric
 ray-tracing. I've always found memory to bottleneck computations. Also,
 why not look into CUDA/OpenCL/DirectCompute?
Yeah, I've discovered that having either the constraints-based __asm() from ldc or actual intrinsics probably makes optimization opportunities more frequent. But if it at least inlined the regular asm blocks for me, I'd be most of the way there.

The ldc guys tell me that they didn't include the llvm vector intrinsics already because they were going to need either a custom type in the frontend, or else the D2 fixed-size-arrays-as-value-types functionality. I might take a stab at some of that in ldc in the future to see if I can get it to work, but I'm not an expert in compilers by any stretch of the imagination.

-Mike

PS: As for trying CUDA/OpenCL/DirectCompute, I haven't gotten into it much for a few reasons:

* The standards and APIs are still evolving.

* I refuse to pigeon-hole myself into Windows (I'm typing this from a Fedora 11 box, and at work we're a linux shop doing movie VFX).

* Larrabee (yes, yes, semi-vaporware until Intel gets their crap together) will allow something much closer to standard CPU code. I really think that's the direction the GPU makers are heading in general, so why hobble myself with cruddy GPU memory/threading models to code around right now?

* GPUs keep changing, and every change brings with it subtle (and sometimes drastic) effects on your code's performance and results from card to card. It's a nightmare to maintain, and every project we've done trying to do production rendering stuff on GPU (even just relighting) has ended in tears and gnashing of teeth. Everyone eventually throws up their hands and goes back to optimized CPU rendering in the VFX industry (Pixar, ILM, and Tippett have all done that, just to name a few).

Good, solid general-purpose CPUs with caches, decently wide SIMD with scatter/gather, and plenty of hardware threads are the wave of the future. (Or was that the past? I can't remember.) GPUs are slowly converging back to that, except that currently they have a programmer-managed cache (texture mem), and they execute multiple threads concurrently over the same instructions in groups (warps, in CUDA-speak). They'll eventually add the 'feature' of a more automatically-managed cache, and better memory throughput when allowing warps to be smaller and more flexible. And then they'll look nearly identical to all the multi-core CPUs again.
Nov 09 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Michael Farnsworth wrote:
 The ldc guys tell me that they didn't 
 include the llvm vector intrinsics already because they were going to 
 need either a custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a stab at 
 some of that in ldc in the future to see if I can get it to work, but 
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Nov 09 2009
parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't 
 include the llvm vector intrinsics already because they were going to 
 need either a custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a stab at 
 some of that in ldc in the future to see if I can get it to work, but 
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra methods wrapping additional intrinsics, for example).

There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there, with no need to change the language or declare custom types (minus some alignment to help it along, perhaps). The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up in.

Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.

-Mike
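[To make the lowering Mike describes concrete, here is a hand-written sketch in C of roughly what a compiler could emit for the D array expression a[] = b[] * c. The intrinsic names are the standard SSE1 ones from xmmintrin.h; the loop structure and the function name scale4 are only illustrative.]

```c
#include <xmmintrin.h>  /* SSE1 intrinsics (GCC/Clang/MSVC on x86) */

/* A sketch of the code a compiler could emit for the D array
   expression a[] = b[] * c.  Processes 4 floats per iteration and
   assumes n is a multiple of 4; a real lowering would also emit a
   scalar tail loop and prefer aligned loads where it can prove
   alignment. */
void scale4(float *a, const float *b, float c, int n)
{
    __m128 vc = _mm_set1_ps(c);               /* broadcast the scalar */
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);      /* load 4 floats */
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
}
```

The struct-wrapper case Mike mentions would then reduce to the compiler seeing through the wrapper to the same underlying float array and emitting the same instructions.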
Nov 09 2009
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 Walter Bright Wrote:
 
 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't include the llvm vector
 intrinsics already because they were going to need either a
 custom type in the frontend, or else the D2 
 fixed-size-arrays-as-value-types functionality.  I might take a
 stab at some of that in ldc in the future to see if I can get it
 to work, but I'm not an expert in compilers by any stretch of the
 imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean?
Sure. Consider the code:

    for (int i = 0; i < 100; i++)
        array[i] = 0;

It takes a fair amount of work for a compiler to deduce "aha!" this code is intended to clear the array! The compiler then replaces the loop with:

    memset(array, 0, 100 * sizeof(array[0]));

In D, you can specify the array operation at a high level:

    array[0..100] = 0;

In other words, a language is supposed to represent high level concepts and the compiler breaks it down into low level ones supported by the machine. With vector operations, etc., the language supports only the low level operations and the compiler must reconstruct the high level operations supported by the machine. This inversion of roles is bizarre.
Nov 09 2009
prev sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Mon, Nov 9, 2009 at 1:56 PM, Mike Farnsworth
<mike.farnsworth gmail.com> wrote:
 Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't
 include the llvm vector intrinsics already because they were going to
 need either a custom type in the frontend, or else the D2
 fixed-size-arrays-as-value-types functionality.  I might take a stab at
 some of that in ldc in the future to see if I can get it to work, but
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
 Can you elaborate a bit on what you mean?  If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible?  It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra method wrapping additional intrinsics, for example).
 There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there with no need to change the language or declare custom types (minus some alignment to help it along, perhaps).  The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up.
 Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.

I think what he's saying is use array expressions like a[] = b[] + c[] and let the compiler take care of it, instead of trying to write SSE yourself. I haven't tried, but does this kind of thing turn into SSE and get inlined?

    struct Vec3
    {
        float v[3];
        void opAddAssign(ref Vec3 o) { this.v[] += o.v[]; }
    }

If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)

--bb
Nov 09 2009
parent reply Don <nospam nospam.com> writes:
Bill Baxter wrote:
 On Mon, Nov 9, 2009 at 1:56 PM, Mike Farnsworth
 <mike.farnsworth gmail.com> wrote:
 Walter Bright Wrote:

 Michael Farnsworth wrote:
 The ldc guys tell me that they didn't
 include the llvm vector intrinsics already because they were going to
 need either a custom type in the frontend, or else the D2
 fixed-size-arrays-as-value-types functionality.  I might take a stab at
 some of that in ldc in the future to see if I can get it to work, but
 I'm not an expert in compilers by any stretch of the imagination.
I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra method wrapping additional intrinsics, for example). There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there with no need to change the language or declare custom types (minus some alignment to help it along, perhaps). The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up. Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.
 I think what he's saying is use array expressions like a[] = b[] + c[] and let the compiler take care of it, instead of trying to write SSE yourself. I haven't tried, but does this kind of thing turn into SSE and get inlined?

     struct Vec3
     {
         float v[3];
         void opAddAssign(ref Vec3 o) { this.v[] += o.v[]; }
     }

 If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)
 --bb
The bad news: the DMD back-end is a state-of-the-art backend from the late 90's. Despite its age, its treatment of integer operations is, in general, still quite respectable. However, it _never_ generates SSE instructions. Ever. Array operations _are_ detected, but they become calls to library functions which use SSE if available. That's not bad for moderately large arrays -- 200 elements or so -- but of course it's completely non-optimal for short arrays.

The good news: now that static arrays are passed by value, introducing inline SSE support for short arrays suddenly makes a lot of sense -- there can be a big performance benefit for a small backend change, and it could be done without introducing SSE anywhere else. Most importantly, it doesn't require any auto-vectorisation support.
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
 However, it _never_ generates SSE 
 instructions. Ever. However, array operations _are_ detected, and they 
 become calls to library functions which use SSE if available. That's 
 not bad for moderately large arrays -- 200 elements or so -- but of 
 course it's completely non-optimal for short arrays.
 
 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, means the code will only run on modern chips. Whether or not this is a problem depends on your application.
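[A hedged sketch in C of the runtime switch Walter describes: detect the CPU once at startup, then call the tuned code through a function pointer thereafter. The function names are invented for illustration; __builtin_cpu_supports is a real GCC/Clang builtin on x86.]

```c
/* Baseline implementation, compiled with no special flags. */
static int sum_scalar(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stand-in for a variant compiled with SSE2 enabled; a real library
   would use intrinsics in its own translation unit here. */
static int sum_sse2(const int *a, int n)
{
    return sum_scalar(a, n);
}

static int (*sum)(const int *, int);   /* the dispatched entry point */

/* Run once at startup: pick the implementation the CPU can execute. */
void init_dispatch(void)
{
    sum = __builtin_cpu_supports("sse2") ? sum_sse2 : sum_scalar;
}
```

After init_dispatch(), every call goes through sum() with no further capability checks, which is why the per-call overhead only matters for very small operations.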
Nov 10 2009
next sibling parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
Yup. The only integer operation modern compilers still don't do well is -- array operations!
 However, it _never_ generates SSE instructions. Ever. However, array 
 operations _are_ detected, and they become calls to library 
 functions which use SSE if available. That's not bad for moderately 
 large arrays -- 200 elements or so -- but of course it's completely 
 non-optimal for short arrays.

 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, means the code will only run on modern chips. Whether or not this is a problem depends on your application.
I'd say it's not a problem to use MMX or even SSE1. It's really, really difficult to find a processor that doesn't support them. I've tried. I've really tried. I don't think many are still around: they all have motherboards which require really small hard disks that you can no longer buy. Certainly no-one is putting new software on them.

Earlier this year I had to install Windows 3.1 (!!!) on an ancient PC at work, to support an ancient but expensive bit of lab equipment. Even it was a Pentium II. Getting the spare parts for it was a nightmare*; we had to ship them from 600 km away. Hard disks just don't last that long.

SSE2 is a different story, since AMD never made a 32-bit CPU with SSE2.

*Actually it was more of a horror comedy. It was hard to take it seriously.
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I'd say it's not a problem to use MMX or even SSE1. It's really, really 
 difficult to find a processor that doesn't support them. I've tried. 
 I've really tried. I don't think many are still around: they all have 
 motherboards which require really small hard disks that you can no 
 longer buy. Certainly no-one is putting new software on them.
 Earlier this year I had to install Windows3.1 (!!!) on an ancient PC at 
 work, to support an ancient but expensive bit of lab equipment. Even it 
 was a Pentium II. Getting the spare parts for it was a nightmare*; we 
 had to ship them from 600km away. Hard disks just don't last that long.
I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away). 10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.
Nov 10 2009
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Tue, Nov 10, 2009 at 12:06:08PM -0800, Walter Bright wrote:
 I do have a working Pentium around here somewhere.
I actually still use my Pentium 1 computers. I have three of them: one works as a thin terminal to my newer computer; one is my secondary main computer (if/when my main computer decides to quit working and I'm waiting on replacement parts, I go back to the old box - it was my main from 1996 through to 2005! I also use it for the occasional multiplayer game); and the last one I still use to host some small, low-traffic websites.

I don't buy into the "omg must be bleeding edge or else" philosophy. I'll work these computers until their parts fail entirely! But, I'm surely in the minority. Heck, I still sometimes write 16 bit DOS code for those computers!

If DMD starts outputting fancier code, that's awesome for the 99% of cases where it is fine; I'd just request a compiler switch in there to turn it back to the old behaviour for the <1% of cases where we don't want it.

-- 
Adam D. Ruppe
http://arsdnet.net
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Adam D. Ruppe wrote:
 If DMD starts outputting fancier code, that's awesome for the 99% of cases
 where it is fine, I'd just request a compiler switch in there to turn it
 back to old behaviour for the <1% of cases where we don't want it.
Interestingly, dmd does a very good job of Pentium instruction scheduling. I thought that was hopelessly obsolete, although it didn't actually hurt anything, so no worries. But it turns out that the Intel Atom benefits a lot from Pentium style scheduling, and no other compiler seems to support that anymore!
Nov 10 2009
prev sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Walter Bright wrote:

 Don wrote:
 I'd say it's not a problem to use MMX or even SSE1. It's really, really
 difficult to find a processor that doesn't support them. I've tried.
 I've really tried. I don't think many are still around: they all have
 motherboards which require really small hard disks that you can no
 longer buy. Certainly no-one is putting new software on them.
 Earlier this year I had to install Windows3.1 (!!!) on an ancient PC at
 work, to support an ancient but expensive bit of lab equipment. Even it
 was a Pentium II. Getting the spare parts for it was a nightmare*; we
 had to ship them from 600km away. Hard disks just don't last that long.
I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away). 10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.
Until recently my stepdad still had his 8086 setup to interface with an old-school velotype keyboard. I even built a nasty old 5 1/4 inch floppy drive into his shiny dualcore rig, which he used to transfer plain text files between the two machines. It worked fine.
Nov 10 2009
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Modern compilers don't do much better. The point of diminishing returns 
 was clearly reached.
I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD. Today CPUs don't get faster and faster as in the past, so a 250% improvement coming just from the compiler is not something you want to ignore. And more optimizations for LLVM are planned (like auto-vectorization, better inlining of function pointers, better de-virtualization and inlining of virtual class methods, partial compilation, super compilation of tiny chunks of code, and quite a bit more).

Another thing to take into account is that today you often don't want to program in C-like languages but in higher-level ones: Python, Fortress, etc. Such higher level languages offer challenges to the optimizers that were not present in the past. For example, most optimizations done by the Just-In-Time compiler for Lua were not needed by a C compiler. Today there are many people that want to program in Lua or Python or JavaScript instead of C, so they need a quite more refined optimizer and compiler, like LuaJIT2 or Unladen Swallow or V8.

You also have new, smaller challenges created by multi-core CPUs and languages that are functional and immutable-based. That's why modern compilers are quickly improving today too, and we still need such improvements.

Bye,
bearophile
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 
 Modern compilers don't do much better. The point of diminishing
 returns was clearly reached.
I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD.
Have to be careful about benchmarks without looking at why. A few months ago, a benchmark was posted here purportedly showing that dmd was awful at integer math. Turns out, the problem was entirely in the long divide function, not the code generator at all. I rewrote the long divide helper function, and problem solved.
Nov 10 2009
prev sibling next sibling parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Don wrote:
 The bad news: The DMD back-end is a state-of-the-art backend from the 
 late 90's. Despite its age, its treatment of integer operations is, in 
 general, still quite respectable.
Modern compilers don't do much better. The point of diminishing returns was clearly reached.
 However, it _never_ generates SSE 
 instructions. Ever. However, array operations _are_ detected, and they 
 become to calls to library functions which use SSE if available. That's 
 not bad for moderately large arrays -- 200 elements or so -- but of 
 course it's completely non-optimal for short arrays.
 
 The good news: Now that static arrays are passed by value, introducing 
 inline SSE support for short arrays suddenly makes a lot of sense -- 
 there can be a big performance benefit for a small backend change; it 
 could be done without introducing SSE anywhere else. Most importantly, 
 it doesn't require any auto-vectorisation support.
What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.
For my purposes, runtime detection is probably out the window, unless the tests for it can happen infrequently enough to reduce the overhead. There are too many SSE variations to switch on them all, and they incrementally provide better and better functionality that I could make use of. I'd rather compile different executables for different hardware and distribute them all (e.g. detect the SSE version at compile time).

Really, high performance graphics is an exercise in getting tightly vectorized code to inline appropriately, eliminating as many loads and stores as possible, and then on top of that building algorithms that don't suck in runtime or memory/cache complexity. Often in computer graphics you end up distilling a huge number of operations down to SIMD instructions that are very highly threaded and have (hopefully) minimal I/O. If you introduce any extra overhead for getting to those SIMD instructions, you usually take a measurable throughput hit.

I'd like to see D give me a much better mix of high throughput + high coding productivity. As it stands, I've got high throughput + medium coding productivity in C++. I've started looking at some ldc code to lurch towards this goal, and if there is something I can look at in dmd2 itself to help out, I'd love to. Just point me where you think I ought to start.

-Mike
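[The compile-time detection Mike prefers is what C compilers already expose through predefined macros; a minimal sketch, assuming a GCC/Clang-style compiler where -msse/-msse2 (or any 64-bit x86 target) predefine __SSE__/__SSE2__. The helper name simd_level is invented for illustration.]

```c
/* Pick the widest instruction set the compiler was told to target.
   Building the same source once per target flag (-msse, -msse2, ...)
   yields one executable per hardware class, with zero runtime
   detection cost, as described above. */
#if defined(__SSE2__)
#  define SIMD_LEVEL "sse2"
#elif defined(__SSE__)
#  define SIMD_LEVEL "sse1"
#else
#  define SIMD_LEVEL "scalar"
#endif

const char *simd_level(void)
{
    return SIMD_LEVEL;
}
```

Each build's SIMD code paths can then be selected with the same #if tests, so the unsupported instructions never appear in that binary at all.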
Nov 10 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 For my purposes, runtime detection is probably out the window, unless
 the tests for it can happen infrequently enough to reduce the
 overhead.  There are too many SSE variations to switch on them all,
 and they incrementally provide better and better functionality that I
 could make use of.  I'd rather compile different executables for
 different hardware and distribute them all (e.g. detect the SSE
 version at compile time).  Really, high performance graphics is an
 exercise in getting tightly vectorized code to inline appropriately,
 eliminate as many loads and stores as possible, and then on top of
 that build algorithms that don't suck in runtime or memory/cache
 complexity.
The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.
Nov 10 2009
parent reply Mike Farnsworth <mike.farnsworth gmail.com> writes:
Walter Bright Wrote:

 Mike Farnsworth wrote:
 For my purposes, runtime detection is probably out the window, unless
 the tests for it can happen infrequently enough to reduce the
 overhead.  There are too many SSE variations to switch on them all,
 and they incrementally provide better and better functionality that I
 could make use of.  I'd rather compile different executables for
 different hardware and distribute them all (e.g. detect the SSE
 version at compile time).  Really, high performance graphics is an
 exercise in getting tightly vectorized code to inline appropriately,
 eliminate as many loads and stores as possible, and then on top of
 that build algorithms that don't suck in runtime or memory/cache
 complexity.
The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.
Was it actually rewriting the executable code to call the alternate functions (e.g. an exe load-time decision: patch the code in memory, then run)? I thought that sort of thing would run into all sorts of runtime linker issues (read-only code pages in memory, shared libs that also need the rewriting, etc.), but then again, they do that with JIT compiling all the time. Does dmd already have some of this capability hanging around (but not used yet)?

-Mike
Nov 10 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Mike Farnsworth wrote:
 Was it actually rewriting the executable code to call the alternate
 functions (e.g. a exe load time decision, patch the code in memory,
 and then run)?  I thought that sort of thing would run into all sorts
 of runtime linker issues (ro code pages in memory, shared libs that
 also need the rewriting, etc.), but then again, they do that with JIT
 compiling all the time.
It's much simpler than that. Some C:

=========================================
void foo_with_FPU();
void foo_without_FPU();

void (*foo)();

void main()
{
    int has_fp = doesCPUhaveFPU();
    if (has_fp)
        foo = &foo_with_FPU;
    else
        foo = &foo_without_FPU;
    ... execute app ...
    (*foo)();
    ... execute more app ...
}
=========================================
#if WITH_FPU
#define FOO foo_with_FPU
#else
#define FOO foo_without_FPU
#endif

void FOO()
{
    ... do some floating point calculations ...
}
==========================================
dmc -DWITH_FPU -c foo.c -f -ofoo_with_fpu.obj
dmc -c foo.c -ofoo_without_fpu.obj
dmc app.obj foo_with_fpu.obj foo_without_fpu.obj
===========================================

Hope that makes it clearer. No runtime linking, no runtime compiling, no self-modifying code, etc.

A better way to do it is to put your FP code behind a class interface, then have derived classes implement them, compiled with different instruction set options. At runtime, decide which derived class to use.
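[The class-interface variant Walter mentions at the end might look like this in C++. All names here are invented for illustration; in a real build each derived class's translation unit would be compiled with different instruction-set flags, and the SSE variant would contain intrinsics.]

```cpp
#include <memory>

// Hypothetical interface: the FP kernels live behind virtual calls,
// with one derived class per instruction set.
struct VecMath {
    virtual ~VecMath() {}
    virtual float dot4(const float *a, const float *b) const = 0;
};

// Baseline, compiled with no special flags.
struct ScalarMath : VecMath {
    float dot4(const float *a, const float *b) const override {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }
};

// Stand-in for a variant compiled with SSE enabled; here it merely
// inherits the scalar implementation.
struct SseMath : ScalarMath {};

// Decide once at startup which implementation to use, based on a
// CPU-capability check done elsewhere.
std::unique_ptr<VecMath> makeVecMath(bool cpuHasSse) {
    if (cpuHasSse)
        return std::unique_ptr<VecMath>(new SseMath);
    return std::unique_ptr<VecMath>(new ScalarMath);
}
```

The trade-off versus the function-pointer scheme is a virtual call per kernel invocation, which is why both approaches work best when each call does a reasonably large batch of work.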
Nov 10 2009
prev sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Walter Bright wrote:
 
 ... To generate the code directly, assuming the existence of SSE,
 means the code will only run on modern chips. Whether or not this
 is a problem depends on your application.
If MMX/SSE/SSE2 optimizations are low-hanging fruit, I'd at least like to have an -sse (and maybe -sse2, -sse3, and -no-sse) switch for the compiler to determine whether the compiler emits those instructions or not.

I'm also wondering if a more ideal approach (and perhaps an additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary. That way feature detection doesn't happen while the program itself is running, and thus doesn't slow down the computations as they happen. Then passing -sse* would cause it to not emit the bootstrap, but instead just assume that the instructions will be available.
Nov 10 2009
parent Mike Farnsworth <mike.farnsworth gmail.com> writes:
Chad J Wrote:

 Walter Bright wrote:
 
 ... To generate the code directly, assuming the existence of SSE,
 means the code will only run on modern chips. Whether or not this
 is a problem depends on your application.
If MMX/SSE/SSE2 optimizations are low-lying fruit, I'd at least like to have an -sse (and maybe -sse2, -sse3, and -no-sse) switch for the compiler to determine whether the compiler emits those instructions or not. I'm also wondering if a more ideal approach (and perhaps additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary. That way feature detection doesn't happen while the program itself is running, and thus doesn't slow down the computations as they happen. Then passing -sse* would cause it to not emit the bootstrap, but instead just assume that the instructions will be available.
Incidentally, if you use LLVM to compile to their bitcode, you can at runtime do exactly this sort of thing based on the host hardware, selecting opt passes and having it run codegen based on your exact hardware. As long as using a given intrinsic falls through to the right glue code where it isn't supported, or else you let the compiler deduce where to use the fancier instructions (not as likely to happen), that works out nicely. -Mike
Nov 10 2009