digitalmars.D - Vector Optimizations

Kyle Furlong (8/8) Jan 26 2006 Well my templated Vector is nearly complete! What remains is some

James Dunne (20/29) Jan 26 2006 I'm afraid with this sort of approach the best kind of SIMD optimization...

Kyle Furlong (8/40) Jan 26 2006 Basically I want to utilize all the capabilities of the processor for

James Dunne (33/78) Jan 27 2006 1) You must know the capabilities of the processor on which your library...

Kyle Furlong <kylefurlong gmail.com> writes:

Well my templated Vector is nearly complete! What remains is some 
complex algebra quirks, implementing fast algorithms for cross products 
for different dimension vectors, and optimizing the various operations.

In this vein, what is the best way to get SIMD, SSE, SSE2 etc. 
instructions emitted by the dmd compiler? Any other comments, links, or 
suggestions on my Vector struct or vector optimization in D are welcome.

P.S. - The sources are attached, but the latest Velocity source can be 
found at http://svn.dsource.org/projects/velocity/trunk/source/

Jan 26 2006

James Dunne <james.jdunne gmail.com> writes:

Kyle Furlong wrote:
 Well my templated Vector is nearly complete! What remains is some 
 complex algebra quirks, implementing fast algorithms for cross products 
 for different dimension vectors, and optimizing the various operations.
 
 In this vein, what is the best way to get SIMD, SSE, SSE2 etc. 
 instructions emitted by the dmd compiler? Any other comments, links, or 
 suggestions on my Vector struct or vector optimization in D are welcome.
 
 [snippy]

I'm afraid with this sort of approach the best kind of SIMD optimization 
you'd get would be single operations.  Unless the compiler can perform 
ridiculous amounts of optimization, you won't get *much* benefit.  You 
will get some, and it might be enough for you.  But obviously 
hand-written SIMD instructions in assembly language is going to yield 
far superior results.

True support for SIMD instructions should be a function of the compiler 
because it needs to know how to align the data in memory, what registers 
are available for use, and how it can streamline vectorized operations. 
  If the compiler doesn't optimize your code much and leaves the method 
calls in, you'll just have exactly that: basically calling a function to 
copy a vector into a SIMD register, executing the SIMD instruction, 
copying the vector out of the register into the stack, and returning 
that value to the caller somehow.  All in all, extremely suboptimal IMHO.

So, in summary, the OO methodolgy doesn't apply well to vectorization it 
seems.

-- 
Regards,
James Dunne

Jan 26 2006

Kyle Furlong <kylefurlong gmail.com> writes:

James Dunne wrote:
 Kyle Furlong wrote:
 Well my templated Vector is nearly complete! What remains is some 
 complex algebra quirks, implementing fast algorithms for cross 
 products for different dimension vectors, and optimizing the various 
 operations.

 In this vein, what is the best way to get SIMD, SSE, SSE2 etc. 
 instructions emitted by the dmd compiler? Any other comments, links, 
 or suggestions on my Vector struct or vector optimization in D are 
 welcome.

 [snippy]

 
 I'm afraid with this sort of approach the best kind of SIMD optimization 
 you'd get would be single operations.  Unless the compiler can perform 
 ridiculous amounts of optimization, you won't get *much* benefit.  You 
 will get some, and it might be enough for you.  But obviously 
 hand-written SIMD instructions in assembly language is going to yield 
 far superior results.
 
 True support for SIMD instructions should be a function of the compiler 
 because it needs to know how to align the data in memory, what registers 
 are available for use, and how it can streamline vectorized operations. 
  If the compiler doesn't optimize your code much and leaves the method 
 calls in, you'll just have exactly that: basically calling a function to 
 copy a vector into a SIMD register, executing the SIMD instruction, 
 copying the vector out of the register into the stack, and returning 
 that value to the caller somehow.  All in all, extremely suboptimal IMHO.
 
 So, in summary, the OO methodolgy doesn't apply well to vectorization it 
 seems.
 

Basically I want to utilize all the capabilities of the processor for 
each operation. For example, most of the operations atm are running a 
for loop accross all the elements of the wrapped array of whatevers and 
doing the operation. I'm asking how to optimize this to use the SIMD 
instructions of the various processors. I tried to unroll the loops 
using the duffs device walter put together with templates, but couldnt 
figure out how to get it to work with array indexing.

Jan 26 2006

James Dunne <james.jdunne gmail.com> writes:

Kyle Furlong wrote:
 James Dunne wrote:
 
 Kyle Furlong wrote:

 Well my templated Vector is nearly complete! What remains is some 
 complex algebra quirks, implementing fast algorithms for cross 
 products for different dimension vectors, and optimizing the various 
 operations.

 In this vein, what is the best way to get SIMD, SSE, SSE2 etc. 
 instructions emitted by the dmd compiler? Any other comments, links, 
 or suggestions on my Vector struct or vector optimization in D are 
 welcome.

 [snippy]


 I'm afraid with this sort of approach the best kind of SIMD 
 optimization you'd get would be single operations.  Unless the 
 compiler can perform ridiculous amounts of optimization, you won't get 
 *much* benefit.  You will get some, and it might be enough for you.  
 But obviously hand-written SIMD instructions in assembly language is 
 going to yield far superior results.

 True support for SIMD instructions should be a function of the 
 compiler because it needs to know how to align the data in memory, 
 what registers are available for use, and how it can streamline 
 vectorized operations.  If the compiler doesn't optimize your code 
 much and leaves the method calls in, you'll just have exactly that: 
 basically calling a function to copy a vector into a SIMD register, 
 executing the SIMD instruction, copying the vector out of the register 
 into the stack, and returning that value to the caller somehow.  All 
 in all, extremely suboptimal IMHO.

 So, in summary, the OO methodolgy doesn't apply well to vectorization 
 it seems.

 
 Basically I want to utilize all the capabilities of the processor for 
 each operation. For example, most of the operations atm are running a 
 for loop accross all the elements of the wrapped array of whatevers and 
 doing the operation. I'm asking how to optimize this to use the SIMD 
 instructions of the various processors. I tried to unroll the loops 
 using the duffs device walter put together with templates, but couldnt 
 figure out how to get it to work with array indexing.

1) You must know the capabilities of the processor on which your library 
will be running on.  This is possible to determine with the CPUID 
instruction on x86 machines.  You can check for flags which indicate 
support for SSE, SSE2, SSE3, MMX, 3Dnow!, etc.

2) You must guarantee that the data with which you are working is 
aligned in memory per the requirements of the specific SIMD capabilities 
you're targeting/using.  This can be done with custom class allocators. 
  Do structs have this ability too?

3) SIMD instruction sets usually have very fixed capabilities, such as 
working with either a vector of 4 32-bit floats, or 2 64-bit doubles. 
These hardware registers have fixed size limitations.  Continuing to 
allow for n-dimensional vectors via templates will have you writing so 
much special-case code for SIMD support that the templating support just 
won't be worth it.  Besides, most mathematics dealing with physical laws 
use vectors of dimension 4 or less.  Also, you might have to ditch the 
complex-number support.

If you really want SIMD support, there isn't any way around these 
requirements.

As per loop-unrolling, you can simply hand-unroll your loops.  The basic 
concept is to do as much work as you can within one loop iteration, and 
to simultaneously cut down on the number of loop iterations.  Loops mean 
branches, and branches are bad for pipelines.

In fact, I'm quite sure a generic templating approach to loop unrolling 
would be great!

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Jan 27 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Vector Optimizations