digitalmars.D - MMX/SSE/SIMD without using assembly

Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:

I am hoping to find a way to use SIMD instructions in my programs to 
make them faster without rolling out assembly code each and every time. 
About a week ago I started working on some functions that would do MMX 
operations to arrays of data.  So far so good, but now I wonder if 
someone has done this already.  I couldn't find anything in dsource or 
wiki4D.  Maybe there is something like this in C that could easily be 
used in D?

Then I began thinking beyond MMX, since MMX is very old and apparently 
slated for removal in the new 64-bit x86 processors (in native 64-bit 
mode).  That brings me to SSE, SSE2, and maybe SSE3.  Now for these to 
be very fast, the data they operate on must be aligned to 16-byte 
boundaries in memory.  I found other posts on this forum where people 
had problems doing that in D, and they were either unresolved or there 
was no follow-up post.  So I was wondering if there is a good (fast) 
way to make sure the data in arrays is aligned to a 16-byte boundary?

Also, I'm not sure if making a struct like so will work:

align(16) struct someName
{
	...
}

Does that determine how the elements are packed or does that ensure that 
the struct's address is a multiple of 16?

Jan 14 2006

Knud Sørensen <12tkvvb02 sneakemail.com> writes:

hi Chad.

You should take a look at the vectorization suggestion at:
http://all-technology.com/eigenpolls/dwishlist/

Also do a search for vectorization on the news archive.

I think Walter is planning this for 2.0.

Knud

On Sun, 15 Jan 2006 02:38:07 -0500, Chad J wrote:
> [...]

Jan 15 2006

Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:

Knud Sørensen wrote:
> hi Chad.
>
> You should take a look at the vectorization suggestion at:
> http://all-technology.com/eigenpolls/dwishlist/
>
> Also do a search for vectorization on the news archive.
>
> I think Walter is planning this for 2.0.
>
> Knud

Right now I am trying to do this by simply writing functions that do the 
basics.
I have the following functions mostly working; they all use MMX if 
available:

void padds( ubyte[] lvalue, ubyte[] rvalue )
void padds( ubyte[] lvalue, ubyte rvalue )
void padds( ushort[] lvalue, ushort[] rvalue )
void padds( ushort[] lvalue, ushort rvalue )
void padds( byte[] lvalue, byte[] rvalue )
void padds( byte[] lvalue, byte rvalue )
void padds( short[] lvalue, short[] rvalue )
void padds( short[] lvalue, short rvalue )
	
void padd( ubyte[] lvalue, ubyte[] rvalue )
void padd( ubyte[] lvalue, ubyte rvalue )
void padd( byte[] lvalue, byte[] rvalue )
void padd( byte[] lvalue, byte rvalue )

padds does saturated addition on the array lvalue, choosing signed or 
unsigned based on type.
padd does unsaturated addition (wraps on overflow or underflow) on the 
array lvalue.  Signedness doesn't matter, so the signed function casts to 
unsigned and calls the unsigned function.
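
For reference, the semantics described above can be written as plain scalar loops. This is only a sketch for the byte-sized cases: the names paddsRef and paddRef are hypothetical stand-ins, not the MMX-backed functions from this post, and no SIMD instructions are used.

```d
// Scalar reference versions (hypothetical names, no MMX involved).

/// Saturating add, unsigned: results clamp at ubyte.max.
void paddsRef(ubyte[] lvalue, const(ubyte)[] rvalue)
{
    assert(lvalue.length == rvalue.length);
    foreach (i, ref x; lvalue)
    {
        int sum = x + rvalue[i];          // ubyte operands promote to int
        x = cast(ubyte)(sum > ubyte.max ? ubyte.max : sum);
    }
}

/// Saturating add, signed: results clamp to [byte.min, byte.max].
void paddsRef(byte[] lvalue, const(byte)[] rvalue)
{
    assert(lvalue.length == rvalue.length);
    foreach (i, ref x; lvalue)
    {
        int sum = x + rvalue[i];
        if (sum > byte.max) sum = byte.max;
        if (sum < byte.min) sum = byte.min;
        x = cast(byte) sum;
    }
}

/// Wrapping add, unsigned: the carry out of each byte is discarded.
void paddRef(ubyte[] lvalue, const(ubyte)[] rvalue)
{
    assert(lvalue.length == rvalue.length);
    foreach (i, ref x; lvalue)
        x = cast(ubyte)(x + rvalue[i]);
}

/// Wrapping add, signed: same bit pattern as the unsigned case, so reuse it.
void paddRef(byte[] lvalue, const(byte)[] rvalue)
{
    paddRef(cast(ubyte[]) lvalue, cast(const(ubyte)[]) rvalue);
}

unittest
{
    ubyte[] a = [250, 10];
    ubyte[] b = [10, 10];
    paddsRef(a, b);
    assert(a[0] == 255 && a[1] == 20);   // 250 + 10 saturates at 255

    ubyte[] c = [250, 10];
    paddRef(c, b);
    assert(c[0] == 4 && c[1] == 20);     // 260 wraps around to 4

    byte[] s = [120, -120];
    byte[] t = [10, -10];
    paddsRef(s, t);
    assert(s[0] == 127 && s[1] == -128); // clamped at both ends
}
```

Loops like these are handy as a fallback when MMX is not available and as a check against the assembly versions.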

If rvalue is an array, it adds every element of rvalue onto the 
corresponding element in lvalue.  If rvalue is not an array, it adds 
rvalue onto every element in lvalue.  I'm probably going to get rid of 
non-array rvalues and make it use a global 64-bit variable instead, so 
that you can do stuff like darken an image without touching the alpha 
channel.
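
A rough scalar sketch of that 64-bit idea, assuming the 64-bit value is treated as eight byte lanes that repeat along the array (two RGBA pixels per pattern). The name paddsPattern and the pattern parameter are hypothetical; the post describes a global variable instead, and a subtracting variant for darkening would look the same with the clamp at zero.

```d
/// Saturating add of a repeating 8-byte pattern onto lvalue.
/// pattern plays the role of the 64-bit operand; a 0 lane leaves that
/// channel untouched (for example the alpha byte of each pixel).
void paddsPattern(ubyte[] lvalue, const ubyte[8] pattern)
{
    foreach (i, ref x; lvalue)
    {
        int sum = x + pattern[i % 8];
        x = cast(ubyte)(sum > ubyte.max ? ubyte.max : sum);
    }
}

unittest
{
    // Two RGBA pixels; add 16 to R, G and B, leave alpha alone.
    ubyte[] pixels = [100, 200, 250, 128,   0, 0, 0, 255];
    immutable ubyte[8] add = [16, 16, 16, 0,   16, 16, 16, 0];
    paddsPattern(pixels, add);

    ubyte[] expected = [116, 216, 255, 128,   16, 16, 16, 255];
    assert(pixels == expected);
}
```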

OK, so would an implementation like the one I was thinking of even be 
worth it, or would it just be replaced in a year or so and not help much 
in the meantime?

Jan 16 2006

James Dunne <james.jdunne gmail.com> writes:

Chad J wrote:
> [...]

I would consider dropping MMX support, since the AMD64 architecture 
plans to deprecate it.  As of now, it can't be used in long-mode and is 
considered legacy (similar to 3DNow! although it seems they don't want 
to admit it).  Go for SSE2/3, 64-bit media, or 128-bit media 
instructions instead - there're a lot of 'em.

Then again, if you're following Intel it's best to read up on it 
yourself and ignore this post. =P

Jan 16 2006

Chad J <gamerchad2.no-ip.org/email.txt LeftOfAtSignForEmail.meh> writes:

 
James Dunne wrote:
> I would consider dropping MMX support, since the AMD64 architecture 
> plans to deprecate it.  As of now, it can't be used in long-mode and is 
> considered legacy (similar to 3DNow! although it seems they don't want 
> to admit it).  Go for SSE2/3, 64-bit media, or 128-bit media 
> instructions instead - there're a lot of 'em.
>
> Then again, if you're following Intel it's best to read up on it 
> yourself and ignore this post. =P

Well, one reason why I'd like to do MMX is for legacy support.  I have 
an AMD 2600+ and the utilities I run say it has no SSE2 support.  The 
CPU is kinda old, but not THAT old, so I wouldn't be surprised if these 
types of CPUs are around for another 5 years.

Anyhow, I am more worried about duplication of effort and whether people 
would actually use this or not.

Jan 16 2006

mclysenk mtu.edu writes:

In article <dqcu4p$29j9$1 digitaldaemon.com>, Chad J says...
> I am hoping to find a way to use SIMD instructions in my programs to 
> make them faster without rolling out assembly code each and every time. 
> About a week ago I started working on some functions that would do MMX 
> operations to arrays of data.  So far so good, but now I wonder if 
> someone has done this already.  I couldn't find anything in dsource or 
> wiki4D.  Maybe there is something like this in C that could easily be 
> used in D?

That depends on what you want to do.  Usually SSE/MMX is exposed through
compiler intrinsics (see the Intel compiler and Visual Studio).  For scientific
calculations, there are libraries of pre-implemented mathematical routines that
use vector optimizations, like BLAS or LINPACK.  For games, most programmers just
roll their own using inline assembler or compiler intrinsics.  Hobby projects do
not usually need the extra speed from SSE optimizations.

One thing I have proposed, as have many others, is vectorization at the language
level.  Such a feature would allow efficient and portable algorithms that take
full advantage of modern SIMD hardware, at very low cost to the programmer.
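
As one illustration of what language-level vectorization can look like: D2 later gained array-wise operations, where a whole-array expression replaces the explicit loop. The sketch below only shows the syntax. The function names are hypothetical, and whether SIMD instructions are actually emitted depends on the compiler, runtime and target, which I have not verified here.

```d
/// Element-wise a += b, written as a D array operation (no explicit loop).
/// Both slices must have the same length.
void addInPlace(int[] a, const(int)[] b)
{
    a[] += b[];
}

/// Element-wise result = x * scale + y.
void scaleAdd(float[] result, const(float)[] x, const(float)[] y, float scale)
{
    result[] = x[] * scale + y[];
}

unittest
{
    int[] a = [1, 2, 3, 4];
    int[] b = [10, 20, 30, 40];
    addInPlace(a, b);
    int[] expectedInts = [11, 22, 33, 44];
    assert(a == expectedInts);

    float[] x = [1.0f, 2.0f];
    float[] y = [0.5f, 0.5f];
    auto r = new float[2];
    scaleAdd(r, x, y, 2.0f);
    assert(r[0] == 2.5f && r[1] == 4.5f);
}
```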

> That brings me to SSE, SSE2, and maybe SSE3.  Now for these to 
> be very fast, the data they operate on must be aligned to 16-byte 
> boundaries in memory.  I found other posts on this forum where people 
> had problems doing that in D, and they were either unresolved or there 
> was no follow up post.  So I was wondering if there was a good (fast) 
> way to make sure the data in arrays is aligned to a 16-byte boundary?

This is a problem, and to my knowledge it is unresolved.  What you could do is
overallocate the memory by 15 bytes, then round the starting address up to the
next 16-byte boundary, so that the data always begins on the correct boundary.
Ideally, the linker should be able to position static objects on the correct
boundary, but I have no idea how to make it happen.
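
The overallocation trick could look roughly like this. It is a minimal sketch: allocAligned16 is a hypothetical helper, and it assumes the garbage collector keeps a block alive through a pointer into its interior, which is my understanding of the default D GC.

```d
/// Hypothetical helper: returns n floats whose first element sits on a
/// 16-byte boundary, by over-allocating and skipping up to 15 bytes.
float[] allocAligned16(size_t n)
{
    enum size_t alignment = 16;

    // Ask for 15 extra bytes so an aligned address always exists in range.
    ubyte[] raw = new ubyte[n * float.sizeof + (alignment - 1)];

    // Distance from the start of the block to the next 16-byte boundary.
    immutable size_t addr = cast(size_t) raw.ptr;
    immutable size_t skip = (alignment - (addr % alignment)) % alignment;

    // Reinterpret the aligned window as floats (zero bytes, so all 0.0f).
    return cast(float[]) raw[skip .. skip + n * float.sizeof];
}

unittest
{
    float[] data = allocAligned16(10);
    assert(data.length == 10);
    assert((cast(size_t) data.ptr) % 16 == 0);
    data[] = 1.0f;                       // the slice is ordinary writable memory
    assert(data[9] == 1.0f);
}
```

The cost is at most 15 wasted bytes per allocation; the returned slice must not be resized in place, since appending may move it to an unaligned block.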

> Also, I'm not sure if making a struct like so will work:
>
> align(16) struct someName
> {
> 	...
> }
>
> Does that determine how the elements are packed or does that ensure that 
> the struct's address is a multiple of 16?

That won't do what you want.  Instead, each member of the struct will be aligned
on a 16-byte boundary relative to the address of the struct.
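
For comparison, here is how the two placements of align behave under the current D language specification as I understand it. This is a sketch for later compilers only; the 2006 compiler discussed in this thread may well behave as described above. Vec4 and Padded are hypothetical names.

```d
// align on the declaration: sets the alignment of the struct itself.
align(16) struct Vec4
{
    float[4] v;
}
static assert(Vec4.alignof == 16);

// align inside the body: sets the alignment of each following field,
// so every member starts on a 16-byte boundary relative to the struct.
struct Padded
{
    align(16):
    byte a;
    byte b;
}
static assert(Padded.a.offsetof == 0);
static assert(Padded.b.offsetof == 16);
```

The second form matches the per-member behaviour described in this post; the first is the one that matters for SSE operands.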

Jan 21 2006

Robert.AtkinsonNO SPAMgmail.com writes:

align(X) is still broken in the compiler for anything greater than 2.  I'm
assuming that, in the grand scheme of things, there's a list of far more
important (and wide-reaching) features that Walter is working on before he
fixes this.

In article <dqv19a$bjq$1 digitaldaemon.com>, mclysenk mtu.edu says...
> [...]

Jan 23 2006

Brian Chapman <noreply example.com> writes:

On 2006-01-21 22:22:02 -0600, mclysenk mtu.edu said:
 
> That depends on what you want to do.  Usually SSE/MMX is exposed through
> compiler intrinsics (see the Intel compiler and Visual Studio).  For scientific
> calculations, there are libraries of pre-implemented mathematical routines that
> use vector optimizations, like BLAS or LINPACK.  For games, most programmers just
> roll their own using inline assembler or compiler intrinsics.  Hobby projects do
> not usually need the extra speed from SSE optimizations.
>
> One thing I have proposed, as have many others, is vectorization at the language
> level.  Such a feature would allow efficient and portable algorithms that take
> full advantage of modern SIMD hardware, at very low cost to the programmer.


Personally, at the very least, I'd just like to have intrinsics. If 
it's going to be a big undertaking for some kind of vectorization vs. a 
simple intrinsic interface, I'd rather take the latter and not wait 
another six years and 150+ release iterations for the big 
2-point-0. Know what I mean?

Jan 26 2006
