digitalmars.D - MMX/SSE/SIMD without using assembly

reply Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:
I am hoping to find a way to use SIMD instructions in my programs to 
make them faster without rolling out assembly code each and every time. 
  About a week ago I started working on some functions that would do MMX 
operations to arrays of data.  So far so good, but now I wonder if 
someone has done this already.  I couldn't find anything in dsource or 
wiki4D.  Maybe there is something like this in C that could easily be 
used in D?

Then I began thinking beyond MMX, since MMX is very old and apparently 
slated for removal in the new 64-bit x86 processors (in native 64-bit 
mode).  That brings me to SSE, SSE2, and maybe SSE3.  Now for these to 
be very fast, the data they operate on must be aligned to 16-byte 
boundaries in memory.  I found other posts on this forum where people 
had problems doing that in D, and they were either unresolved or there 
was no follow up post.  So I was wondering if there was a good (fast) 
way to make sure the data in arrays is aligned to a 16-byte boundary?

Also, I'm not sure if making a struct like so will work:

align(16) struct someName
{
	...
}

Does that determine how the elements are packed or does that ensure that 
the struct's address is a multiple of 16?
Jan 14 2006
next sibling parent reply Knud Sørensen <12tkvvb02 sneakemail.com> writes:
hi Chad.

You should take a look at the vectorization suggestion at
http://all-technology.com/eigenpolls/dwishlist/

Also do a search for vectorization in the news archive.

I think Walter is planning this for 2.0.

Knud

Jan 15 2006
parent reply Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:
Right now I am trying to do this by simply writing functions that do the
basics.  I have the following functions mostly working; they all use MMX
if available:

void padds( ubyte[] lvalue, ubyte[] rvalue )
void padds( ubyte[] lvalue, ubyte rvalue )
void padds( ushort[] lvalue, ushort[] rvalue )
void padds( ushort[] lvalue, ushort rvalue )
void padds( byte[] lvalue, byte[] rvalue )
void padds( byte[] lvalue, byte rvalue )
void padds( short[] lvalue, short[] rvalue )
void padds( short[] lvalue, short rvalue )
void padd( ubyte[] lvalue, ubyte[] rvalue )
void padd( ubyte[] lvalue, ubyte rvalue )
void padd( byte[] lvalue, byte[] rvalue )
void padd( byte[] lvalue, byte rvalue )

padds does saturated addition on the array lvalue, choosing signed or
unsigned saturation based on the element type.

padd does unsaturated addition (it wraps on overflow or underflow) on the
array lvalue.  Signedness doesn't matter here, so the signed function casts
to unsigned and calls the unsigned function.

If rvalue is an array, it adds every element of rvalue onto the
corresponding element in lvalue.  If rvalue is not an array, it adds rvalue
onto every element in lvalue.

I'm probably going to get rid of non-array rvalues and make it use a global
64-bit variable instead, so that you can do stuff like darken an image
without touching the alpha channel.

OK, so would an implementation like I was thinking of even be worth it, or
would it just be replaced in a year or so and not help much in the
meantime?
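For comparison, the semantics described above can be written as a plain
scalar baseline.  The following is a minimal sketch in C, not the MMX code
from this post: the names padds_u8, padds_i8 and padd_u8 and the
pointer-plus-length signatures are assumptions made for illustration.  It
shows what saturated and wrapping addition compute per element, which is
what the MMX instructions PADDUSB, PADDSB and PADDB do for eight bytes at
a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturated add: sums above 255 clamp to 255. */
void padds_u8(uint8_t *lvalue, const uint8_t *rvalue, size_t n)
{
    assert(n == 0 || (lvalue != NULL && rvalue != NULL));
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)lvalue[i] + rvalue[i];
        lvalue[i] = (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
    }
}

/* Signed saturated add: sums clamp to the range [-128, 127]. */
void padds_i8(int8_t *lvalue, const int8_t *rvalue, size_t n)
{
    assert(n == 0 || (lvalue != NULL && rvalue != NULL));
    for (size_t i = 0; i < n; ++i) {
        int sum = lvalue[i] + rvalue[i];
        if (sum > INT8_MAX) sum = INT8_MAX;
        if (sum < INT8_MIN) sum = INT8_MIN;
        lvalue[i] = (int8_t)sum;
    }
}

/* Wrapping add: the result is taken modulo 256, so signedness of the
   elements makes no difference to the bit pattern produced. */
void padd_u8(uint8_t *lvalue, const uint8_t *rvalue, size_t n)
{
    assert(n == 0 || (lvalue != NULL && rvalue != NULL));
    for (size_t i = 0; i < n; ++i)
        lvalue[i] = (uint8_t)(lvalue[i] + rvalue[i]);
}
```

A baseline like this is also handy for checking a SIMD version against
known-good results.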
Jan 16 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
I would consider dropping MMX support, since the AMD64 architecture
plans to deprecate it.  As of now, it can't be used in long-mode and is
considered legacy (similar to 3DNow! although it seems they don't want
to admit it).  Go for SSE2/3, 64-bit media, or 128-bit media
instructions instead - there're a lot of 'em.

Then again, if you're following Intel it's best to read up on it
yourself and ignore this post. =P
Jan 16 2006
parent Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:
 
Well, one reason why I'd like to do MMX is for legacy support.  I have an
AMD 2600+ and the utilities I run say it has no SSE2 support.  The CPU is
kinda old, but not THAT old, so I wouldn't be surprised if these types of
CPUs are around for another 5 years.

Anyhow, I am more worried about duplication of effort and whether people
would actually use this or not.
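Supporting older CPUs alongside newer ones usually means checking CPU
features at run time and picking a code path.  A minimal sketch in C,
assuming GCC or Clang on x86, where the __builtin_cpu_supports builtin is
available; the names cpu_has_mmx, cpu_has_sse2, pick_backend and
backend_name are made up for illustration, and on other compilers or
architectures the sketch simply reports the scalar path.

```c
#include <assert.h>

/* Assumption: __builtin_cpu_supports is only relied on for GCC/Clang
   when targeting x86; everywhere else we report "no SIMD". */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_SUPPORTS 1
#else
#define HAVE_CPU_SUPPORTS 0
#endif

enum backend { BACKEND_SCALAR, BACKEND_MMX, BACKEND_SSE2 };

int cpu_has_mmx(void)
{
#if HAVE_CPU_SUPPORTS
    return __builtin_cpu_supports("mmx") ? 1 : 0;
#else
    return 0;
#endif
}

int cpu_has_sse2(void)
{
#if HAVE_CPU_SUPPORTS
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Prefer the widest instruction set the CPU reports, and fall back to
   plain scalar code when neither is present. */
enum backend pick_backend(void)
{
    if (cpu_has_sse2()) return BACKEND_SSE2;
    if (cpu_has_mmx())  return BACKEND_MMX;
    return BACKEND_SCALAR;
}

const char *backend_name(enum backend b)
{
    switch (b) {
    case BACKEND_SSE2:   return "sse2";
    case BACKEND_MMX:    return "mmx";
    case BACKEND_SCALAR: return "scalar";
    }
    assert(!"unknown backend");
    return "scalar";
}
```

A library would typically call pick_backend once at start-up and store a
function pointer to the matching implementation, so an MMX-only machine
like the one described here still gets an accelerated path.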
Jan 16 2006
prev sibling parent reply mclysenk mtu.edu writes:
In article <dqcu4p$29j9$1 digitaldaemon.com>, Chad J says...
> I am hoping to find a way to use SIMD instructions in my programs to
> make them faster without rolling out assembly code each and every time.
> About a week ago I started working on some functions that would do MMX
> operations to arrays of data.  So far so good, but now I wonder if
> someone has done this already.  I couldn't find anything in dsource or
> wiki4D.  Maybe there is something like this in C that could easily be
> used in D?

That depends on what you want to do.  Usually SSE/MMX is exposed through
compiler intrinsics (see the Intel compiler and Visual Studio).  For
scientific calculations, there are libraries of preimplemented mathematical
routines using vector optimizations, like BLAS or LINPACK.  For games, most
programmers just roll their own using inline assembler or compiler
intrinsics.  Hobby projects do not usually need the extra speed from SSE
optimizations.

One thing I have proposed, as have many others, is vectorization at the
language level.  Such a feature would allow efficient and portable
algorithms that take full advantage of modern SIMD hardware, at very low
cost to the programmer.
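As an illustration of the compiler-intrinsic route, here is a minimal
sketch in C using the SSE2 intrinsics from <emmintrin.h>.  The function
name padds_u8_simd is hypothetical (borrowed from the padds functions
earlier in the thread).  The SSE2 path is only compiled when the compiler
defines __SSE2__, which is an assumption about the build: GCC and Clang
define it when SSE2 is enabled, and otherwise the scalar loop handles
every element.  Unaligned loads and stores are used so the sketch works
for any pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* dst[i] = saturate(dst[i] + src[i]) for unsigned bytes. */
void padds_u8_simd(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    /* Process 16 bytes per iteration with one saturating add. */
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(a, b));
    }
#endif

    /* Scalar tail (and the whole array when SSE2 is not compiled in). */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The intrinsic maps to a single instruction, so the loop body stays close
to what hand-written assembly would do while the compiler still handles
register allocation.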

> That brings me to SSE, SSE2, and maybe SSE3.  Now for these to
> be very fast, the data they operate on must be aligned to 16-byte
> boundaries in memory.  I found other posts on this forum where people
> had problems doing that in D, and they were either unresolved or there
> was no follow up post.  So I was wondering if there was a good (fast)
> way to make sure the data in arrays is aligned to a 16-byte boundary?

This is a problem, and to my knowledge, it is unresolved. What you could do is overallocate the memory by 15 bytes, then shift its starting address so that it always begins on the correct boundary. Ideally, the linker should be able to position static objects on the correct boundary, but I have no idea how to make it happen.
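The over-allocation trick can be sketched as follows.  This is a minimal
illustration in C on top of malloc, not a D solution: the aligned_buf
type and the alloc_aligned16 and free_aligned16 names are assumptions
made for the example.  The original pointer has to be kept, because that
is the one the allocator expects back.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void    *raw;   /* what malloc returned; this is what must be freed */
    uint8_t *data;  /* first 16-byte-aligned address inside the block */
} aligned_buf;

/* Allocate n usable bytes starting on a 16-byte boundary.
   Returns {NULL, NULL} on failure. */
aligned_buf alloc_aligned16(size_t n)
{
    aligned_buf buf = { NULL, NULL };

    if (n > SIZE_MAX - 15)      /* n + 15 would overflow */
        return buf;

    buf.raw = malloc(n + 15);   /* 15 spare bytes cover the worst case */
    if (buf.raw != NULL) {
        /* Assumes pointers convert to integers as flat addresses,
           which holds on the usual x86 platforms. */
        uintptr_t addr    = (uintptr_t)buf.raw;
        uintptr_t aligned = (addr + 15) & ~(uintptr_t)15;  /* round up */
        buf.data = (uint8_t *)buf.raw + (aligned - addr);
        assert(((uintptr_t)buf.data & 15) == 0);
    }
    return buf;
}

void free_aligned16(aligned_buf *buf)
{
    free(buf->raw);
    buf->raw  = NULL;
    buf->data = NULL;
}
```

The shift is at most 15 bytes, so the n bytes after data always fit inside
the n + 15 bytes that were allocated.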

> Also, I'm not sure if making a struct like so will work:
>
> align(16) struct someName
> {
> 	...
> }
>
> Does that determine how the elements are packed or does that ensure that
> the struct's address is a multiple of 16?

That won't do what you want.  Instead, each member of the struct will be
aligned on a 16-byte boundary relative to the address of the struct.
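The two readings of the question (member placement versus the address of
the whole object) can be told apart with offsetof and alignof.  A minimal
sketch in C11, for comparison only: it shows C's alignas, not D's align
attribute, and the struct names and the is_aligned16 helper are made up
for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Every member over-aligned: each field starts at a multiple of 16
   bytes from the start of the struct. */
struct padded_members {
    alignas(16) uint8_t a;
    alignas(16) uint8_t b;
};

/* Only the first member over-aligned: the members stay packed, but the
   struct as a whole now requires a 16-byte-aligned address. */
struct aligned_object {
    alignas(16) uint8_t a;
    uint8_t b;
};

static_assert(offsetof(struct padded_members, b) == 16,
              "members are spaced 16 bytes apart");
static_assert(offsetof(struct aligned_object, b) == 1,
              "members remain packed");
static_assert(alignof(struct aligned_object) == 16,
              "the object itself is 16-byte aligned");

/* Run-time check that an address is a multiple of 16. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

For SSE loads it is the second property, the alignment of the object's
address, that matters; spacing the members apart only wastes memory.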
Jan 21 2006
next sibling parent Robert.AtkinsonNO SPAMgmail.com writes:
align(X) is still broken in the compiler for anything greater than 2.  I'm
assuming that, in the grand scheme of things, there's a far more important
(and wider-reaching) feature list that Walter is working on before he fixes
this.

Jan 23 2006
prev sibling parent Brian Chapman <noreply example.com> writes:
On 2006-01-21 22:22:02 -0600, mclysenk mtu.edu said:
 
> That depends on what you want to do.  Usually SSE/MMX is implemented as a
> compiler intrinsic, (see the Intel compiler and Visual Studios).  For
> scientific calculations, there are libraries of preimplemented
> mathematical routines using vector optimizations; like BLAS or LINPACK.
> For games, most programmers just roll their own using inline assembler or
> compiler intrinsics. Hobby projects do not usually need the extra speed
> from SSE optimizations.
>
> One thing I have proposed as have many others, is vectorization at the
> language level.  Such a feature would allow efficient and portable
> algorithms that take full advantage of modern SIMD hardware, at very low
> cost to the programmer.

Personally, at the very least, I'd just like to have intrinsics.  If it's
going to be a big undertaking for some kind of vectorization vs. a simple
intrinsic interface, I'd rather take the latter and not wait another six
years and 150+ release iterations for the big 2-point-0.  Know what I mean?
Jan 26 2006