digitalmars.D - primitive vector types

reply Mattias Holm <hannibal.holm gmail.com> writes:
Since (SIMD) vectors are so common and every reasonable system supports 
them in one way or the other (and scalar emulation of this is rather 
simple), why not have support for this in D directly?

Yes, the array operations are nice (and one of the main reasons for why 
I like D :) ), but have the problem that an array of floats must be 
aligned on float boundaries and not vector boundaries. In my mind 
vectors are a primitive data type that should be exposed by the 
programming language.

Something OpenCL-like:

	float4 vec;
	vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
	vec.xyzw = vec.wyxz; // permutation
	vec[i] = 1.0; // indexing

And then we can easily imagine some extra nice features to have with 
respect to operators:

	vec ^ vec2; // 3d cross product for float vectors, for int vectors xor

Has this been discussed before?

/ Mattias
Feb 19 2009
next sibling parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm <hannibal.holm gmail.com>  
wrote:

 Since (SIMD) vectors are so common and every reasonable system supports  
 them in one way or the other (and scalar emulation of this is rather  
 simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for why  
 I like D :) ), but have the problem that an array of floats must be  
 aligned on float boundaries and not vector boundaries. In my mind  
 vectors are a primitive data type that should be exposed by the  
 programming language.

 Something OpenCL-like:

 	float4 vec;
 	vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
 	vec.xyzw = vec.wyxz; // permutation
 	vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have with  
 respect to operators:

 	vec ^ vec2; // 3d cross product for float vectors, for int vectors xor

 Has this been discussed before?

 / Mattias

I don't see any reason why float4 can't be made a library type.
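
For illustration only (this is not code from the thread): a minimal sketch of what a library-level float4 could look like, assuming a current D2 compiler. The names Float4, componentIndex and isSwizzle are hypothetical, and align(16) only expresses the request; whether stack instances really end up 16-byte aligned depends on the compiler.

```d
// Maps a component letter to its index; returns -1 for anything else.
int componentIndex(char c) pure nothrow @nogc @safe
{
    switch (c)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default:  return -1;
    }
}

// True for four-letter names made only of x, y, z, w (e.g. "wyxz").
bool isSwizzle(string s) pure nothrow @nogc @safe
{
    if (s.length != 4) return false;
    foreach (c; s)
        if (componentIndex(c) < 0) return false;
    return true;
}

struct Float4
{
    align(16) float[4] data = [0, 0, 0, 0];

    // Indexing: vec[i]
    ref float opIndex(size_t i) { return data[i]; }

    // Permutation: vec.wyxz and friends, checked at compile time.
    Float4 opDispatch(string s)() const
        if (isSwizzle(s))
    {
        Float4 r;
        foreach (i, c; s)
            r.data[i] = data[componentIndex(c)];
        return r;
    }
}

void main()
{
    auto v = Float4([1f, 2f, 3f, 4f]);
    auto p = v.wyxz;              // permutation
    assert(p.data[] == [4f, 2f, 1f, 3f]);
    v[0] = 9f;                    // indexing
    assert(v.data[0] == 9f);
}
```

The OpenCL-style assignment form (vec.xyzw = ...) is left out of the sketch; it would need a setter overload of opDispatch.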
Feb 19 2009
next sibling parent Don <nospam nospam.com> writes:
Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm 
 <hannibal.holm gmail.com> wrote:
 
 Since (SIMD) vectors are so common and every reasonable system supports 
 them in one way or the other (and scalar emulation of this is rather 
 simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for 
 why I like D :) ), but have the problem that an array of floats must 
 be aligned on float boundaries and not vector boundaries. In my mind 
 vectors are a primitive data type that should be exposed by the 
 programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have with 
 respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int vectors 
 xor

 Has this been discussed before?

 / Mattias

I don't see any reason why float4 can't be made a library type.

Walter at one point suggested that float[4] should be specially recognized by the compiler -- it would always be aligned, and stored in an SSE register if possible. Ditto for float[3].
Feb 19 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm 
 <hannibal.holm gmail.com> wrote:
 
 Since (SIMD) vectors are so common and every reasonable system supports 
 them in one way or the other (and scalar emulation of this is rather 
 simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for 
 why I like D :) ), but have the problem that an array of floats must 
 be aligned on float boundaries and not vector boundaries. In my mind 
 vectors are a primitive data type that should be exposed by the 
 programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have with 
 respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int vectors 
 xor

 Has this been discussed before?

 / Mattias

I don't see any reason why float4 can't be made a library type.

Yah, I was thinking the same:

    struct float4
    {
        __align(16) float[4] data; // right syntax and value?
        alias data this;
    }

This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/.

Andrei
Feb 19 2009
next sibling parent Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Denis Koroskin wrote:
 I don't see any reason why float4 can't be made a library type.

Yah, I was thinking the same: struct float4 { __align(16) float[4] data; // right syntax and value? alias data this; } This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/.

I've looked for this functionality in std.bitmanip before. I never thought to look in std.matrix. Fixed size bit arrays seem like an obvious choice for SIMD optimization.
Feb 19 2009
prev sibling next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Andrei Alexandrescu wrote:
 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonable system supports
 them in one way or the other (and scalar emulation of this is rather
 simple), why not have support for this in D directly?

 [snip]

I don't see any reason why float4 can't be made a library type.

Yah, I was thinking the same: struct float4 { __align(16) float[4] data; // right syntax and value? alias data this; } This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/. Andrei

I remember implementing a vector struct [1] quite some time ago that had an SSE-accelerated path. There were three problems I had with it:

1. The alignment thing. Incidentally, I just did a quick check and don't see any notes in the changelog about __align(n) syntax. As I remember, there was no way to actually ensure the data was properly aligned. (There's "Data items in static data segment >= 16 bytes in size are now paragraph aligned." but that doesn't help when the vectors are on, say, the stack or in the heap.)

2. As soon as you use inline asm, you lose inlining. When the functions are as small as they are, this can be a bit of overhead. It gets worse when you realise that the CPU is spending most of its time running data back and forth between main memory and the XMM registers... Array operations help, but they don't cover everything.

3. There was a not insignificant performance difference for using byref passing on operators over byval passing. Of course, you can't ACTUALLY use byref because it completely breaks anything that uses a temporary expression as an argument.

In the end, I just dropped it to see how BLADE would turn out. I ended up coming to the conclusion that while we can do a float[4] vector in D and use SIMD to speed it up, there's not much point when BLADE is there.

Of course, BLADE is a little unwieldy to use what with that mixin malarky. Pity we didn't get AST macros... :P

Anyway, just my AUD$0.02.

  -- Daniel

[1] That struct was scary. It was one of those Vector!(type, size) jobbies, so it had multiple paths through functions, members that only existed for certain sizes, special-cased loop unrolling... don't even ASK about the matrix struct... :P
Feb 19 2009
next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Jarrett Billingsley wrote:
 On Thu, Feb 19, 2009 at 9:31 PM, Daniel Keep
 <daniel.keep.lists gmail.com> wrote:
 struct float4
 {
     __align(16) float[4] data; // right syntax and value?
     alias data this;
 }


 1. The alignment thing.  Incidentally, I just did a quick check and
 don't see any notes in the changelog about __align(n) syntax.  As I
 remember, there was no way to actually ensure the data was properly
 aligned.  (There's "Data items in static data segment >= 16 bytes in
 size are now paragraph aligned." but that doesn't help when the vectors
 are on, say, the stack or in the heap.)

It's just align(16). http://www.digitalmars.com/d/1.0/attribute.html#align

Yeah, except it doesn't do what you want in this case. For example:

    module alignment;

    import tango.io.Stdout;

    struct float4
    {
        align(16) float[4] data;
    }

    void main()
    {
        float4 a;
        byte b;
        float4 c;

        Stdout.format("a {0} & 0xF == {1}", &a, cast(size_t)&a & 0xF).newline;
        Stdout.format("c {0} & 0xF == {1}", &c, cast(size_t)&c & 0xF).newline;
    }

Output is:

    a 12fe68 & 0xF == 8
    c 12fe78 & 0xF == 8

The only way I found of guaranteeing alignment was to allocate all vectors on the heap, using a custom allocator to allocate an extra 15 bytes and then align the result appropriately. Obviously, for 16-byte vectors, this is completely unacceptable.

  -- Daniel
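
For readers unfamiliar with the over-allocation workaround described above, here is a rough sketch of the idea, assuming a current D2 compiler and the C runtime's malloc/free. The helper name allocAlignedFloat4 is made up for this example; it is not the allocator Daniel actually used.

```d
import core.stdc.stdlib : malloc, free;

// Hypothetical helper: returns a 16-byte-aligned block of 4 floats.
// `raw` receives the original pointer, which must later be passed to free().
float* allocAlignedFloat4(out void* raw)
{
    enum size_t alignment = 16;
    // Over-allocate by alignment - 1 (15) bytes so that an aligned
    // address is guaranteed to fall inside the block.
    raw = malloc(4 * float.sizeof + alignment - 1);
    if (raw is null)
        return null;
    // Round the address up to the next multiple of 16.
    auto addr = (cast(size_t) raw + alignment - 1) & ~(alignment - 1);
    return cast(float*) addr;
}

void main()
{
    void* raw;
    float* v = allocAlignedFloat4(raw);
    assert(v !is null);
    assert((cast(size_t) v & 0xF) == 0); // low four address bits are clear
    v[0 .. 4] = 1.0f;                    // usable as four floats
    free(raw);                           // free the original, not the aligned pointer
}
```

The cost is plain to see: up to 15 wasted bytes plus a second pointer to carry around for every 16-byte vector, which is why the post calls it unacceptable.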
Feb 19 2009
parent reply Don <nospam nospam.com> writes:
Daniel Keep wrote:
 
 Jarrett Billingsley wrote:
 On Thu, Feb 19, 2009 at 9:31 PM, Daniel Keep
 <daniel.keep.lists gmail.com> wrote:
 struct float4
 {
     __align(16) float[4] data; // right syntax and value?
     alias data this;
 }

don't see any notes in the changelog about __align(n) syntax. As I remember, there was no way to actually ensure the data was properly aligned. (There's "Data items in static data segment >= 16 bytes in size are now paragraph aligned." but that doesn't help when the vectors are on, say, the stack or in the heap.)

http://www.digitalmars.com/d/1.0/attribute.html#align

Yeah, except it doesn't do what you want in this case. For example: module alignment; import tango.io.Stdout; struct float4 { align(16) float[4] data; } void main() { float4 a; byte b; float4 c; Stdout.format("a {0} & 0xF == {1}", &a, cast(size_t)&a & 0xF).newline; Stdout.format("c {0} & 0xF == {1}", &c, cast(size_t)&c & 0xF).newline; } Output is: a 12fe68 & 0xF == 8 c 12fe78 & 0xF == 8 The only way I found of guaranteeing alignment was to allocate all vectors on the heap, using a custom allocator to allocate an extra 15 bytes and then align the result appropriately. Obviously, for 16-byte vectors, this is completely unacceptable. -- Daniel

http://d.puremagic.com/issues/show_bug.cgi?id=2278
Feb 19 2009
parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
 http://d.puremagic.com/issues/show_bug.cgi?id=2278

That bug is interesting. Didn't Walter have to change DMD for the Mac to make sure the stack is aligned to 16 bytes? Perhaps most of the work is done now? L.
Feb 20 2009
prev sibling parent downs <default_357-line yahoo.de> writes:
Daniel Keep wrote:
 Andrei Alexandrescu wrote:
 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonable system supports
 them in one way or the other (and scalar emulation of this is rather
 simple), why not have support for this in D directly?

 [snip]


struct float4 { __align(16) float[4] data; // right syntax and value? alias data this; } This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/. Andrei

I remember implementing a vector struct [1] quite some time ago that had an SSE-accelerated path.

We have that in dglut too :) It's optimized for the float[3]/float[4] case - all float[3]s add a hidden bogus member, then the arithmetic operations generate for loops which are very easy for a properly patched gdc to autovectorize :) No loss of inlining, which means no ref VS. val issues. Alignment is still a problem, but movups in the aligned case is just as fast as movaps, so I figure it doesn't matter that much. Autovec is sweet.
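
A sketch of the padded-float[3] idea, for illustration; this is not dglut's code, and the name Vec3 and its layout are assumptions. The arithmetic is a plain fixed-length loop, which is the shape auto-vectorizers tend to handle well, though nothing here guarantees that a given compiler will vectorize it.

```d
struct Vec3
{
    // x, y, z plus one hidden padding lane, so every loop runs over 4 floats.
    float[4] data = [0, 0, 0, 0];

    this(float x, float y, float z)
    {
        data[0] = x;
        data[1] = y;
        data[2] = z;
        data[3] = 0;
    }

    // Element-wise +, - and * generated from one template.
    Vec3 opBinary(string op)(Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        Vec3 r;
        foreach (i; 0 .. 4)
        {
            mixin("r.data[i] = data[i] " ~ op ~ " rhs.data[i];");
        }
        return r;
    }
}

void main()
{
    auto s = Vec3(1, 2, 3) + Vec3(4, 5, 6);
    assert(s.data[0 .. 3] == [5f, 7f, 9f]);
}
```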
Feb 20 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:05:34 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm 
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonable system 
 supports them in one way or the other (and scalar emulation of this 
 is rather simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for 
 why I like D :) ), but have the problem that an array of floats must 
 be aligned on float boundaries and not vector boundaries. In my mind 
 vectors are a primitive data type that should be exposed by the 
 programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have 
 with respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int 
 vectors xor

 Has this been discussed before?

 / Mattias


Yah, I was thinking the same: struct float4 { __align(16) float[4] data; // right syntax and value? alias data this; } This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/. Andrei

That would be great. If float4 gets its way into D, I'll share our blazing fast math code with the community (most common operations on vectors, matrices, quaternions etc). It is written entirely in SSE (intrinsics, not asm; there is a problem with inlining asm in D, IIRC. Can anyone elaborate on this?) and *very* fast. According to our benchmarks, that's the best we can squeeze out of the hardware.

I know LLVM has support for a *very* wide range of intrinsics: http://www.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/include/llvm/Intrinsics.gen

Hopefully they will get into LDC (and DMD *hint* Walter *hint*) very soon.

Put me down for that. What do I need to do? Andrei
Feb 19 2009
next sibling parent Mattias Holm <hannibal.holm gmail.com> writes:
Yeah, these are good statistics and they do point out that the vector
add/mul/permute stuff is used whenever vectors are in use.

Intrinsics are one thing; however, platform-independent stuff would be better.
AltiVec has a different syntax for the permute instructions than the SSE
shuffle instructions, so in my mind primitive vectors should support all the
basic operations of the base type such as +, -, * and / for float vectors.
Permutation should be supported with the OpenCL-like syntax (as it is easy to
remember) as I suggested.

Stuff like cross and dot products is up for libraries in my opinion. They could
be nice as operators for readability, but this is probably not worth the
hassle unless there is a way to override operators for standard types, which is
probably a bad idea anyway. Better would be proper Unicode support and a way to
define infix functions, so that it is possible to define a cross product function
with the proper cross product operator as its name on a library level.

/ Mattias
Feb 20 2009
prev sibling next sibling parent reply Christian Kamm <kamm-incasoftware remove-garbage.de> writes:
Denis Koroskin wrote:
 Convince Walter to add float4 type and some intrinsics to DMD (I'll post a
 list of those we use later), LDC will follow, I believe. There should be
 some type that would be treated specially. After all, intrinsics have
 function signatures and those should specify some concrete types.

Yes, LDC would follow. The main reason people can't use these intrinsics in LDC at the moment is that there's no type in D that maps to an LLVM vector type.
Feb 21 2009
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-02-21 04:34:07 -0500, Christian Kamm 
<kamm-incasoftware remove-garbage.de> said:

 Denis Koroskin wrote:
 Convince Walter to add float4 type and some intrinsics to DMD (I'll post a
 list of those we use later), LDC will follow, I believe. There should be
 some type that would be treated specially. After all, intrinsics have
 function signatures and those should specify some concrete types.

Yes, LDC would follow. The main reason people can't use these intrinsics in LDC at the moment is that there's no type in D that maps to an LLVM vector type.

Instead of introducing a new type, couldn't float[4] be the one mapped to a vector type? Why do we need a new type?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 21 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Michel Fortin:
 Instead of introducing a new type, couldn't float[4] be the one 
 mapped to a vector type? Why do we need a new type?

Alignment requirements, shuffling operations, scalar operations on just the first item of the vector, etc. It may be doable, and it may even be a nice idea, but it probably requires a lot of care.

Bye,
bearophile
Feb 21 2009
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
bearophile wrote:
 Michel Fortin:
 Instead of introducing a new type, couldn't float[4] be the one 
 mapped to a vector type? Why do we need a new type?

Alignment requirements, shuffling operations, scalar operations on just the first item of the vector, etc. It may be doable, and it may even be a nice idea, but it probably requires a lot of care. Bye, bearophile

Another advantage would be that you could specify in the ABI that this vector type should be passed to and returned from functions via the XMM registers. You could make a specific exception for float[4], but that just seems messy. -- Daniel
Feb 21 2009
next sibling parent reply Don <nospam nospam.com> writes:
Daniel Keep wrote:
 
 bearophile wrote:
 Michel Fortin:
 Instead of introducing a new type, couldn't float[4] be the one 
 mapped to a vector type? Why do we need a new type?

Bye, bearophile

Another advantage would be that you could specify in the ABI that this vector type should be passed to and returned from functions via the XMM registers. You could make a specific exception for float[4], but that just seems messy. -- Daniel

I don't think that's messy at all. I can't see much difference between special support for float[4] versus float4. It's better if the code can take advantage of hardware without specific support. Bear in mind that SSE/SSE2 is a temporary situation. AVX provides for much longer arrays of vectors; and it's extensible. You'd end up needing to keep adding on special types whenever a new CPU comes out.

Note that the fundamental concept which is missing from the C virtual machine is that all modern machines can efficiently perform operations on arrays of built-in types of length 2^n, for some small value of n. We need to get this into the language abstraction. Not follow C++ in hacking a few extra special types onto the old, deficient C model. And I think D is actually in a position to do this.

float[4] would be a greatly superior option if it could be done.
The key requirements are:
(1) need to specify that static arrays are passed by value.
(2) need to keep stack aligned to 16.
The good news is that both of these appear to be done on DMD2-Mac!
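
To make the float[4] option concrete, here is a small sketch using only the built-in array operations, assuming a current D2 compiler where static arrays are already passed by value (requirement (1) above). The function names scaled and sum4 are invented for the example, and whether the compiler emits SSE instructions for these operations is not something the code can promise.

```d
// Takes and returns float[4] by value.
float[4] scaled(float[4] v, float k)
{
    v[] *= k;          // array operation on the local copy only
    return v;
}

float[4] sum4(float[4] a, float[4] b)
{
    float[4] r;
    r[] = a[] + b[];   // element-wise add over all four lanes
    return r;
}

void main()
{
    float[4] a = [1, 2, 3, 4];
    float[4] b = [10, 20, 30, 40];

    auto c = sum4(a, b);
    assert(c[] == [11f, 22f, 33f, 44f]);

    auto d = scaled(a, 2);
    assert(d[] == [2f, 4f, 6f, 8f]);
    assert(a[] == [1f, 2f, 3f, 4f]); // the caller's array is untouched
}
```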
Feb 21 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

I agree with float[4] as a good choice. So are value semantics for T[n] implemented on the Mac?? Andrei
Feb 21 2009
next sibling parent reply Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Don wrote:
 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

I agree with float[4] as a good choice. So are value semantics for T[n] implemented on the Mac?? Andrei

as close as I thought. Bummer.
Feb 21 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Andrei Alexandrescu wrote:
 Don wrote:
 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

I agree with float[4] as a good choice. So are value semantics for T[n] implemented on the Mac?? Andrei

as close as I thought. Bummer.

Yah. Walter agrees that that's the right thing to do. The only thing that worries us is passing by-value large statically-sized vectors to template functions. But then gaming code wants to do exactly that. It's hard to figure where to draw the line. Imagine the error message "Hey, you're going a bit overboard by passing 512 bytes around on the stack".

Besides, we already do have a solution for pass-by-value vectors: Tuple!(T[N]). That would put the burden in the right place (on the programmer actively wanting pass-by-value). But then it's a shame that the built-in type T[N] is a weird exception that must be handled in all template code.

No idea what the right choice is. I'm just dumping whatever is buzzing around my head whenever I think of the issue.

Andrei
Feb 21 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Yah. Walter agrees that that's the right thing to do. The only thing that
 worries us is passing by-value large statically-sized vectors to template
 functions. But then gaming code wants to do exactly that. It's hard to
 figure where to draw the line. Imagine the error message "Hey, you're going
 a bit overboard by passing 512 bytes around on the stack".

Structs already work like this. In fact, the compiler will pass a struct in a register if it's 1, 2, or 4 bytes on x86. Having the compiler "magically" put float[4]s in SSE registers seems like a similar idea.

I agree.
 Besides, we already do have a solution for pass-by-value vectors:
 Tuple!(T[N]). That would put the burden in the right place (on the
 programmer actively wanting pass-by-value). But then it's a shame that the
 built-in type T[N] is a weird exception that must be handled in all template
 code.

Please make them value types. I, for one, am tired of dealing with their crap.

Ok, you just tipped the balance :o). I'm also realizing something. The scenario I'm most afraid of is something like:

    char[10000] humongous = "This is a humongous message. I will type here exactly 10000 characters. ... ";
    foreach (i; 1 .. 100_000_000) writeln(humongous);

But then there is a reason making this scenario rather scarce: for large static arrays, it's hard to keep the claimed length (10000) in sync with the actual length of the vectors. I used to think that's a language defect and suggested the syntax char[$] humongous = " ... " for it, such that the compiler infers the length from the initializer. But now I get to think that the defect actually discourages people from defining very large statically-sized arrays unwittingly.

With mixins and template techniques, very large static arrays can still be generated, but such advanced uses also have a nice feedback: those who know the language well enough to embark on such styles of coding will also likely understand the cautions needed in making them work well.

So, yes, it seems like it's a solid choice to make statically-sized arrays value types. Now we only need to convince Walter that implementation is "a simple matter of coding" :o).

Andrei
Feb 21 2009
next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
 Jarrett Billingsley wrote:
 Please make them value types.  I, for one, am tired of dealing with 
 their crap.

Ok, you just tipped the balance :o). I'm also realizing something. The scenario I'm most afraid of is something like: char[10000] humongous = "This is a humongous message. I will type here exactly 10000 characters. ... ";

This is less of an issue with the type of string literals being invariant(T)[] rather than invariant(T)[length]. Even less so if the default type for array literals were dynamic rather than static arrays.
Feb 21 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Christopher Wright wrote:
 Andrei Alexandrescu wrote:
 Jarrett Billingsley wrote:
 Please make them value types.  I, for one, am tired of dealing with 
 their crap.

Ok, you just tipped the balance :o). I'm also realizing something. The scenario I'm most afraid of is something like: char[10000] humongous = "This is a humongous message. I will type here exactly 10000 characters. ... ";

This is less of an issue with the type of string literals being invariant(T)[] rather than invariant(T)[length]. Even less so if the default type for array literals were dynamic rather than static arrays.

Exactly so. So with this other fix in the language there's even more push toward value semantics for T[N]. Andrei
Feb 21 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Sun, Feb 22, 2009 at 4:00 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Yah. Walter agrees that that's the right thing to do. The only thing that
 worries us is passing by-value large statically-sized vectors to template
 functions. But then gaming code wants to do exactly that.



I don't follow you. Wouldn't they rather pass such huge chunks of data by reference? --bb

What I'm saying (sorry for being unclear) is:

1. If we choose T[N] as a value type, the downside is that people may pass large arrays by value to e.g. template functions.

2. The upside is that gaming programmers DO want to pass short arrays of type T[N] by value.

The conundrum is that a type system can't say that T[N] has some semantics for N <= Nmax and some other semantics for N > Nmax. So we need to pick one, and probably picking the value semantics is the right thing to do.

Andrei
Feb 21 2009
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-02-21 15:11:15 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 The conundrum is that a type system can't say that T[N] has some 
 semantics for N <= Nmax and some other semantics for N > Nmax. So we 
 need to pick one, and probably picking the value semantics is the right 
 thing to do.

I think it is the right decision too. This way "static array" becomes the container type and "dynamic array" is the corresponding range type. Perhaps some concept renaming is in order for D2:

    static array  => array
    dynamic array => array range (or slice)

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 21 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Michel Fortin wrote:
 On 2009-02-21 15:11:15 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 The conundrum is that a type system can't say that T[N] has some 
 semantics for N <= Nmax and some other semantics for N > Nmax. So we 
 need to pick one, and probably picking the value semantics is the 
 right thing to do.

I think it is the right decision too. This way "static array" becomes the container type and "dynamic array" is the corresponding range type. Perhaps some concept renaming is in order for D2: static array => array dynamic array => array range (or slice)

Yah, and that would give a good model to follow for user-defined containers. Andrei
Feb 21 2009
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Sun, Feb 22, 2009 at 5:11 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Sun, Feb 22, 2009 at 4:00 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Yah. Walter agrees that that's the right thing to do. The only thing
 that
 worries us is passing by-value large statically-sized vectors to
 template
 functions. But then gaming code wants to do exactly that.



data by reference? --bb

1. If we choose T[N] as a value type, the downside is that people may pass large arrays by values to e.g. template functions. 2. The upside is that gaming programmers DO want to pass short arrays of type T[N] by value. The conundrum is that a type system can't say that T[N] has some semantics for N <= Nmax and some other semantics for N > Nmax. So we need to pick one, and probably picking the value semantics is the right thing to do.

Ok. And for large static arrays you can still explicitly use ref, right?

Yah, ref on the callee side, or the [] operator without any arguments on the caller side. Andrei
Feb 21 2009
parent Christopher Wright <dhasenan gmail.com> writes:
Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 5:47 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Yah, ref on the callee side, or the [] operator without any arguments on the
 caller side.

Wait, what? I wouldn't have expected anything but "ref type[n] foo" on the function parameter to pass byref. What are you saying here?

void foo(T)(T t) {}

int[5_000_000] i;
foo(i[]); // foo!(int[]) -- semi-by-reference
foo(i);   // foo!(int[5_000_000]) -- by value
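
The same distinction on a small array, as a checkable sketch. It assumes a compiler in which static arrays already have the value semantics argued for in this thread (current D2 compilers do); all function names are made up for the example.

```d
// True when the template parameter was deduced as the static array type.
bool deducedStatic(T)(T t)
{
    return is(T == int[4]);
}

void bump(int[4] a)         { a[0] = 99; } // by value: changes a copy
void bumpRef(ref int[4] a)  { a[0] = 99; } // ref on the callee side
void bumpSlice(int[] a)     { a[0] = 99; } // slice: refers to the caller's elements

void main()
{
    int[4] i = [1, 2, 3, 4];

    assert(deducedStatic(i));    // foo!(int[4])  -- by value
    assert(!deducedStatic(i[])); // foo!(int[])   -- semi-by-reference

    bump(i);
    assert(i[0] == 1);           // the original is untouched

    bumpSlice(i[]);              // [] on the caller side
    assert(i[0] == 99);

    i[0] = 1;
    bumpRef(i);
    assert(i[0] == 99);
}
```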
Feb 21 2009
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 The conundrum is that a type system can't say that T[N] has some 
 semantics for N <= Nmax and some other semantics for N > Nmax. So we 
 need to pick one, and probably picking the value semantics is the right 
 thing to do.

Well, I think the type system can be extended to manage that: the programmer may specify an optional compiler command line argument like:

    -ms 128

Now all structs/static arrays more than 128 bytes long are passed by reference :-)

Alternative solution, less extreme: instead of changing the value/ref pass semantics, when you add such optional command line argument the compiler gives you a compilation warning (or even error, if you want) everywhere you try to pass by value struct or static array more than 128 bytes long.

Bye,
bearophile
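
No such -ms switch exists as far as this thread shows; it is a proposal. As a rough library-level approximation of the "less extreme" variant, a template constraint can reject oversized by-value arguments at compile time. The names maxByValueBytes and byValue are hypothetical, and the sketch assumes a current D2 compiler.

```d
// Hypothetical limit, mirroring the "-ms 128" idea.
enum size_t maxByValueBytes = 128;

// Accepts only arguments small enough to be copied cheaply;
// anything larger fails to compile at the call site.
size_t byValue(T)(T value)
    if (T.sizeof <= maxByValueBytes)
{
    return T.sizeof;
}

void main()
{
    float[4] small;
    char[10_000] big;

    assert(byValue(small) == 16);

    // Passing the big array by value is rejected...
    static assert(!__traits(compiles, byValue(big)));
    // ...but passing a slice of it is fine.
    static assert(__traits(compiles, byValue(big[])));
}
```

Unlike a compiler switch, this only protects functions that opt in with the constraint.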
Feb 22 2009
prev sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Sat, Feb 21, 2009 at 6:37 PM, Christopher Wright <dhasenan gmail.com> wrote:
 Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 5:47 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Yah, ref on the callee side, or the [] operator without any arguments on
 the
 caller side.

Wait, what? I wouldn't have expected anything but "ref type[n] foo" on the function parameter to pass byref. What are you saying here?

void foo(T)(T t) {} int[5_000_000] i; foo(i[]); // foo!(int[]) -- semi-by-reference foo(i); // foo!(int[5_000_000]) -- by value

Oh. See, I don't usually template _everything_ so that didn't cross my mind ;)
Feb 21 2009
prev sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Sat, Feb 21, 2009 at 5:47 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Yah, ref on the callee side, or the [] operator without any arguments on the
 caller side.

Wait, what? I wouldn't have expected anything but "ref type[n] foo" on the function parameter to pass byref. What are you saying here?
Feb 21 2009
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Don:
 I don't think that's messy at all. I can't see much difference between 
 special support for float[4] versus float4. It's better if the code can 
 take advantage of hardware without specific support. Bear in mind that 
 SSE/SSE2 is a temporary situation. AVX provides for much longer arrays 
 of vectors; and it's extensible. You'd end up needing to keep adding on 
 special types whenever a new CPU comes out.
 
 Note that the fundamental concept which is missing from the C virtual 
 machine is that all modern machines can efficiently perform operations 
 on arrays of built-in types of length 2^n, for some small value of n.
 We need to get this into the language abstraction. Not follow C++ in 
 hacking a few extra special types onto the old, deficient C model. And I 
 think D is actually in a position to do this.
 
 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

I have quoted it all because I like the experience and ideas you bring to D. But how can operations like shuffling be expressed, or how do you map a sqrt over just the first item of such 4 floats, or over all of them, etc.? What syntax can be used?

Regarding the array operations already implemented, are there ways to force the compiler to inline such code/operations?

Bye,
bearophile
Feb 21 2009
prev sibling parent reply Mattias Holm <hannibal.holm gmail.com> writes:
On 2009-02-21 17:03:06 +0100, Don <nospam nospam.com> said:
 
 I don't think that's messy at all. I can't see much difference between 
 special support for float[4] versus float4. It's better if the code can 
 take advantage of hardware without specific support. Bear in mind that 
 SSE/SSE2 is a temporary situation. AVX provides for much longer arrays 
 of vectors; and it's extensible. You'd end up needing to keep adding on 
 special types whenever a new CPU comes out.
 
 Note that the fundamental concept which is missing from the C virtual 
 machine is that all modern machines can efficiently perform operations 
 on arrays of built-in types of length 2^n, for some small value of n.
 We need to get this into the language abstraction. Not follow C++ in 
 hacking a few extra special types onto the old, deficient C model. And 
 I think D is actually in a position to do this.
 
 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

Yes, float[4] would be ok, if some CPU-independent permutation support can be added. Would this be done with some intrinsic, or what? I very much like the OpenCL syntax for permutation, but I suppose that an intrinsic such as "float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int newPos2, int newPos3)" would work as well. Note that this should also work with double[2], byte[16], short[8] and int[4].

How would pass-by-value semantics be implemented without breaking compatibility? Would you have (yet another) type qualifier (noref used in the example above)?

In my opinion, vectors are a fundamental type, and there is a reason that arrays and vectors are kept separate in LLVM. The problem is exposing this to the programmer in the proper way. OpenCL does have a lot of nice things in it that might be worth considering.

But, yeah, if something is done for ensuring the alignment of power-of-2 vectors, the permutation support and the pass by value, then I would be fairly happy with that as well.

/ Mattias
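
A scalar reference version of such a permute can be sketched generically, so the same function covers float[4], double[2], short[8] and so on. This is only an illustration of the semantics, assuming a current D2 compiler with by-value static arrays; it is not an intrinsic, the indices are taken as 0-based, and there is no `noref` qualifier.

```d
// Scalar stand-in for the proposed permute intrinsic.
// order[k] is the 0-based source lane copied into result lane k.
T[N] permute(T, size_t N)(T[N] vec, size_t[] order...)
{
    assert(order.length == N, "need exactly one source index per lane");
    T[N] result;
    foreach (k, src; order)
        result[k] = vec[src]; // array bounds checking catches bad indices
    return result;
}
```

For example, `permute(v, 3, 2, 1, 0)` reverses a four-lane vector, and `permute(d, 1, 1)` broadcasts the second lane of a double[2].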
Feb 21 2009
next sibling parent reply Mattias Holm <hannibal.holm gmail.com> writes:
I think that the following would work reasonably well:

	allow the [] operator for arrays to take comma separated lists of indices.

So the OpenCL like statement:

	v.xyzw = v2.wzyx;

will be written as:

	v[0,1,2,3] = v2[3,2,1,0];

Would this be ok? This is a general extension of the array slicing, and 
it might be possible to permute with a combination of slices and 
indices like this (e.g. v[0..3, 6, 5, 7.. len]). Are permutation 
operations something that Walter would be willing to add?

As said by someone else in this thread, there needs to be a way to 
specify that static arrays are passed by value, so can the ref keyword 
be paired with the opposite "byval" or something similar?

And also, functions need to be able to return static arrays which is 
not possible at the moment.

Note that the support should be general and work with any array type 
(so that you can get YMM support whenever that makes it into the future 
chips).


/ Mattias
Feb 22 2009
next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Denis Koroskin wrote:
 On Sun, 22 Feb 2009 14:02:53 +0300, Mattias Holm 
 <hannibal.holm gmail.com> wrote:
 
 I think that the following would work reasonably well:

     allow the [] operator for arrays to take comma separated lists of 
 indices.

 So the OpenCL like statement:

     v.xyzw = v2.wzyx;

 will be written as:

     v[0,1,2,3] = v2[3,2,1,0];

How would you implement it for user-defined types?

	T[] opIndex(int[] indices) { ... }
	void opIndexAssign(int[] indices, T[] values) { ... }
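
A fuller sketch of that idea for a user-defined type follows. It assumes current D2 operator overloading rules, where `opIndexAssign` takes the assigned value first and the indices after it, and it uses typesafe variadic parameters so that `v[3, 2, 1, 0]` compiles as written. The `Vec` name is hypothetical.

```d
struct Vec(T, size_t N)
{
    T[N] data;

    // v[i, j, ...] gathers the named lanes into a new array.
    T[] opIndex(size_t[] indices...)
    {
        auto result = new T[](indices.length);
        foreach (k, i; indices)
            result[k] = data[i];
        return result;
    }

    // v[i, j, ...] = values scatters values into the named lanes.
    void opIndexAssign(T[] values, size_t[] indices...)
    {
        assert(values.length == indices.length, "one value per index");
        foreach (k, i; indices)
            data[i] = values[k];
    }
}
```

With this, `v[0, 1, 2, 3] = v2[3, 2, 1, 0];` works for the library type, at the cost of a heap-allocated temporary, which a real SIMD implementation would want to avoid.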
Feb 22 2009
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Mattias Holm:
 I think that the following would work reasonably well:
 	allow the [] operator for arrays to take comma separated lists of indices.
 So the OpenCL like statement:
 	v.xyzw = v2.wzyx;
 will be written as:
 	v[0,1,2,3] = v2[3,2,1,0];

Be careful, I think a syntax like:

	v[0, 1]
	v[0, 1, 2, 3]

is better left to index in 2D and 4D arrays. nD arrays can be quite important in a language like D. So the following may be better and more compatible with the future:

	v[0; 1; 2; 3] = v2[3; 2; 1; 0];

Or:

	v[(0,1,2,3)] = v2[(3,2,1,0)];

Or:

	v[[0,1,2,3]] = v2[[3,2,1,0]];

Bye,
bearophile
Feb 22 2009
parent Daniel Keep <daniel.keep.lists gmail.com> writes:
bearophile wrote:
 Mattias Holm:
 I think that the following would work reasonably well:
 	allow the [] operator for arrays to take comma separated lists of indices.
 So the OpenCL like statement:
 	v.xyzw = v2.wzyx;
 will be written as:
 	v[0,1,2,3] = v2[3,2,1,0];

Be careful, I think a syntax like:

	v[0, 1]
	v[0, 1, 2, 3]

is better left to index in 2D and 4D arrays. nD arrays can be quite important in a language like D. So the following may be better and more compatible with the future:

	v[0; 1; 2; 3] = v2[3; 2; 1; 0];

Or:

	v[(0,1,2,3)] = v2[(3,2,1,0)];

Or:

	v[[0,1,2,3]] = v2[[3,2,1,0]];

Bye,
bearophile

Swizzling a vector and multidimensional access are orthogonal on account of vectors having only one dimension [1].

	struct FloatTuple(size_t n); // exercise to the reader :P

	struct vec4f
	{
	    union
	    {
	        float[4] data;
	        struct { float w, x, y, z; }
	    }

	    float opIndex(size_t i) { return data[i]; }
	    float opIndexAssign(float v, size_t i) { return data[i] = v; }

	    FloatTuple!(2) opIndex(size_t i, size_t j)
	    {
	        return FloatTuple!(2)(data[i],data[j]);
	    }

	    FloatTuple!(2) opIndexAssign(FloatTuple!(2) vs, size_t i, size_t j)
	    {
	        data[i] = vs.data[0];
	        data[j] = vs.data[1];
	        return vs;
	    }

	    // and so on for 3 and 4 argument versions.
	}

Personally, I think something like this is a better idea:

	struct vec4f
	{
	    ...

	    FloatTuple!(PermSpec.length) perm(string PermSpec)()
	    {
	        FloatTuple!(PermSpec.length) vs;
	        foreach( i ; Range!(PermSpec.length) )
	            vs.data[i] = mixin(`this.`~PermSpec[i]);
	        return vs;
	    }
	}

	void main()
	{
	    vec4f a, b;
	    a = ...;
	    b = a.perm!("xyzw");
	}

-- Daniel

[1] By this I mean they're a 1D data type; vec4f represents a 4D vector but in terms of implementation, it has only one dimension: the index of data.
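
The perm!("...") idea can be sketched in a compilable form with features from later D2 compilers (`static foreach`, by-value static array returns). This is an assumption-laden variant, not the code above: it returns a plain float[n] instead of a FloatTuple, uses x, y, z, w lane order, and the names `Vec4f` and `lane` are hypothetical.

```d
struct Vec4f
{
    float[4] data;

    // Maps a component letter to its lane index.
    private static size_t lane(char c)
    {
        switch (c)
        {
            case 'x': return 0;
            case 'y': return 1;
            case 'z': return 2;
            case 'w': return 3;
            default: assert(0, "component must be one of x, y, z, w");
        }
    }

    // a.perm!"wzyx"() returns the lanes in the requested order.
    float[spec.length] perm(string spec)() const
    {
        float[spec.length] result;
        static foreach (i, c; spec)
        {{
            enum src = lane(c); // evaluated at compile time; bad letters fail the build
            result[i] = data[src];
        }}
        return result;
    }
}
```

Only the permutations that are actually spelled out in the program get instantiated, and repeats such as `perm!"xx"` come for free.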
Feb 22 2009
prev sibling parent Don <nospam nospam.com> writes:
Mattias Holm wrote:
 I think that the following would work reasonably well:
 
     allow the [] operator for arrays to take comma separated lists of 
 indices.

I've always believed that we need that syntax for multi-dimensional arrays. Swizzling ought to be possible on a multi-dimensional array, and I don't think it would be, with your proposal?
 
 So the OpenCL like statement:
 
     v.xyzw = v2.wzyx;
 
 will be written as:
 
     v[0,1,2,3] = v2[3,2,1,0];
 
 Would this be ok? This is a general extension of the array slicing, and 
 it might be possible to permute with a combination of slices and indices 
 like this (e.g. v[0..3, 6, 5, 7.. len]). Are permutation operations 
 something that Walter would be willing to add?
 
 As said by someone else in this thread, there needs to be a way to 
 specify that static arrays are passed by value, so can the ref keyword 
 be paired with the opposite "byval" or something similar?
 
 And also, functions need to be able to return static arrays which is not 
 possible at the moment.
 
 Note that the support should be general and work with any array type (so 
 that you can get YMM support whenever that makes it into the future chips).
 
 
 / Mattias
 

Feb 23 2009
prev sibling parent reply Don <nospam nospam.com> writes:
Mattias Holm wrote:
 On 2009-02-21 17:03:06 +0100, Don <nospam nospam.com> said:
 I don't think that's messy at all. I can't see much difference between 
 special support for float[4] versus float4. It's better if the code 
 can take advantage of hardware without specific support. Bear in mind 
 that SSE/SSE2 is a temporary situation. AVX provides for much longer 
 arrays of vectors; and it's extensible. You'd end up needing to keep 
 adding on special types whenever a new CPU comes out.

 Note that the fundamental concept which is missing from the C virtual 
 machine is that all modern machines can efficiently perform operations 
 on arrays of built-in types of length 2^n, for some small value of n.
 We need to get this into the language abstraction. Not follow C++ in 
 hacking a few extra special types onto the old, deficient C model. And 
 I think D is actually in a position to do this.

 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

Yes, float[4] would be ok, if some CPU independent permutation support can be added. Would this be with some intrinsic then or what? I very much like the OpenCL syntax for permutation, but I suppose that an intrinsic such as "float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int newPos2, int newPos3)" would work as well. Note that this should also work with double[2], byte[16], short[8] and int[4].

Note that if you had static arrays with value semantics, with proper alignment, then you could simply create

	module std.swizzle;

	float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int newPos3); /* intrinsic */

	float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
	float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }
	// etc

---
and your code would be:

	import std.swizzle;

	void main()
	{
	    float[4] t;
	    auto u = t.wzyx;
	}

I don't think this is terribly difficult once the value semantics are in place.
(Note that once you get beyond 4 members, the .xyzw syntax gives an explosion of functions; but I think it's workable at 4; 4! is only 24. Beyond that point, you'd probably require direct permute calls).
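
For experimenting with this scheme before any intrinsic exists, a scalar stand-in for permute is enough. The sketch below assumes a current D2 compiler (by-value static arrays, free functions callable with member syntax) and, unlike the snippet above, takes 0-based lane indices; that numbering is a choice made here, not something the thread settles.

```d
// Scalar stand-in for the intrinsic; p0..p3 are 0-based source lanes.
float[4] permute(float[4] v, int p0, int p1, int p2, int p3)
{
    return [v[p0], v[p1], v[p2], v[p3]];
}

// Named swizzles are thin wrappers, callable as t.wzyx.
float[4] wzyx(float[4] q) { return permute(q, 3, 2, 1, 0); }
float[4] xywz(float[4] q) { return permute(q, 0, 1, 3, 2); }
```

With these at module scope, `auto u = t.wzyx;` compiles for a `float[4] t` and yields the reversed vector.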
Feb 23 2009
parent reply Don <nospam nospam.com> writes:
Bill Baxter wrote:
 On Mon, Feb 23, 2009 at 5:18 PM, Don <nospam nospam.com> wrote:
 Mattias Holm wrote:
 On 2009-02-21 17:03:06 +0100, Don <nospam nospam.com> said:
 I don't think that's messy at all. I can't see much difference between
 special support for float[4] versus float4. It's better if the code can take
 advantage of hardware without specific support. Bear in mind that SSE/SSE2
 is a temporary situation. AVX provides for much longer arrays of vectors;
 and it's extensible. You'd end up needing to keep adding on special types
 whenever a new CPU comes out.

 Note that the fundamental concept which is missing from the C virtual
 machine is that all modern machines can efficiently perform operations on
 arrays of built-in types of length 2^n, for some small value of n.
 We need to get this into the language abstraction. Not follow C++ in
 hacking a few extra special types onto the old, deficient C model. And I
 think D is actually in a position to do this.

 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

Yes, float[4] would be ok, if some CPU independent permutation support can be added. Would this be with some intrinsic then or what? I very much like the OpenCL syntax for permutation, but I suppose that an intrinsic such as "float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int newPos2, int newPos3)" would work as well. Note that this should also work with double[2], byte[16], short[8] and int[4].

Note that if you had static arrays with value semantics, with proper alignment, then you could simply create

	module std.swizzle;

	float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int newPos3); /* intrinsic */

	float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
	float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }
	// etc

---
and your code would be:

	import std.swizzle;

	void main()
	{
	    float[4] t;
	    auto u = t.wzyx;
	}

I don't think this is terribly difficult once the value semantics are in place.
(Note that once you get beyond 4 members, the .xyzw syntax gives an explosion of functions; but I think it's workable at 4; 4! is only 24. Beyond that point, you'd probably require direct permute calls).

Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow repeats like .xxyy.

Yes. Is the syntax sugar actually needed for all the permutations? Even so, it's still only 256, which is probably still OK. I don't think a language change is required.

This scheme doesn't cover:
* shufp where the two sources are different
* haddpd, haddps [SSE3] { double[2] a, b; a[0]=a[0]+a[1]; a[1]=b[0]+b[1]; }
* non-temporal stores (although I think these are covered adequately by array operations)

and the byte/word operations:

* pack with saturation
* movmsk
* avg
* multiply and add.

So it looks to me as though with the minimal language changes, we could get almost complete SIMD support, with excellent syntax.
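
To make the first two uncovered cases concrete, here are scalar models of them. These are sketches of the semantics only, assuming by-value static arrays; the function names `hadd` and `shuffle2` are hypothetical, and nothing here maps to real instructions.

```d
// Horizontal add, as written in the list above:
// result[0] = a[0] + a[1], result[1] = b[0] + b[1].
double[2] hadd(double[2] a, double[2] b)
{
    return [a[0] + a[1], b[0] + b[1]];
}

// Two-source shuffle in the style of shufps: the low two result lanes
// are picked from a, the high two from b (0-based lane indices).
float[4] shuffle2(float[4] a, float[4] b, int a0, int a1, int b0, int b1)
{
    return [a[a0], a[a1], b[b0], b[b1]];
}
```

Neither fits a one-argument `.xyzw` style swizzle, because both read from two different source vectors.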
Feb 23 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Don:

 This scheme doesn't cover:

 and the byte/word operations:

 So it looks to me as though with the minimal language changes, we could 
 get almost complete SIMD support, with excellent syntax.

Do you consider things from the near future too?
http://en.wikipedia.org/wiki/SSE5
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Bye,
bearophile
Feb 23 2009
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Bill Baxter wrote:
 Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow
 repeats like .xxyy.

Yes. Is the syntax sugar actually needed for all the permutations? Even so, it's still only 256, which is probably still OK. I don't think a language change is required.

There's no need to ever enumerate all functions - they can be generated with templates and mixins rather easily.
 This scheme doesn't cover:
 * shufp  where the two sources are different
 * haddpd, haddps [SSE3] { double[2] a, b;  a[0]=a[0]+a[1]; 
 a[1]=b[0]+b[1]; }
 * non-temporal stores (although I think these are covered adequately by 
 array operations)

Well probably we can find ways to generate those too.
 and the byte/word operations:
 
 * pack with saturation
 * movmsk
 * avg
 * multiply and add.
 
 So it looks to me as though with the minimal language changes, we could 
 get almost complete SIMD support, with excellent syntax.
 

That sounds great. Andrei
Feb 23 2009
prev sibling parent reply Chad J <gamerchad __spam.is.bad__gmail.com> writes:
Don wrote:
 
 So it looks to me as though with the minimal language changes, we could
 get almost complete SIMD support, with excellent syntax.
 

	enum { x=0, y=1, z=2, w=3 }
	float[4] foo;
	foo[x] = 42;
	foo[y] = foo[x];
	// etc
	foo[] = [foo[y],foo[x],foo[y],foo[x]];

*grin*
Feb 23 2009
parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2009-02-23 14:48:50 +0100, Chad J <gamerchad __spam.is.bad__gmail.com> said:

 Don wrote:
 
 So it looks to me as though with the minimal language changes, we could
 get almost complete SIMD support, with excellent syntax.
 

	enum { x=0, y=1, z=2, w=3 }
	float[4] foo;
	foo[x] = 42;
	foo[y] = foo[x];
	// etc
	foo[] = [foo[y],foo[x],foo[y],foo[x]];

*grin*

Sorry to bump up this discussion, but I was away and then busy with other stuff... and coming back I had a lot of piled-up work... (also making tango work with the brand new dmd on mac ;) so I had missed it.

I think that an aligned vector would be very useful to have for small vectors. Intrinsic functions could be useful, but I agree with the "joke" made by Chad. What I really want to see is that the compiler uses them when it should (as I think downs did with his autovectorization patch for gdc): not too much clutter from special notation, but the normal notation, compiled efficiently when possible.

A special type is probably needed (for alignment reasons), but then that's about it from the language point of view (even that might be avoided with align, or maybe sometimes even without it, but maybe that is expecting too much from the compiler).

I would also like to stress again that there are also doubles (but indeed the gain in that case is much smaller).

Fawzi
Feb 27 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Sun, Feb 22, 2009 at 5:11 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Sun, Feb 22, 2009 at 4:00 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Yah. Walter agrees that that's the right thing to do. The only thing
 that
 worries us is passing by-value large statically-sized vectors to
 template
 functions. But then gaming code wants to do exactly that.



I don't follow you. Wouldn't they rather pass such huge chunks of data by reference? --bb

What I'm saying (sorry for being unclear) is:

1. If we choose T[N] as a value type, the downside is that people may pass large arrays by value to e.g. template functions.

2. The upside is that gaming programmers DO want to pass short arrays of type T[N] by value.

The conundrum is that a type system can't say that T[N] has some semantics for N <= Nmax and some other semantics for N > Nmax. So we need to pick one, and probably picking the value semantics is the right thing to do.

Ok. And for large static arrays you can still explicitly use ref, right? That's what I was confused about -- sounded like you were saying ref wouldn't even be possible. I don't think there are many cases where you don't know in advance if the static array you are expecting is huge or not. And you can always get creative with static if() in the cases where you don't. --bb
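
One way to "get creative" in the cases where the size is not known in advance is sketched below. It assumes a current D2 compiler and uses template constraints to do the job that static if() would do inside a single body; `maxByValue`, `isHugeStaticArray` and `howPassed` are hypothetical names, and the 128-byte threshold is borrowed from bearophile's suggestion purely as an example.

```d
enum size_t maxByValue = 128; // bytes; a hypothetical threshold

// True for static arrays that are too big to copy casually.
enum isHugeStaticArray(T) = __traits(isStaticArray, T) && T.sizeof > maxByValue;

// Chosen for everything except huge static arrays: takes a copy.
string howPassed(T)(T arr)
    if (!isHugeStaticArray!T)
{
    return "by value";
}

// Chosen for huge static arrays: binds by reference, no copy is made.
string howPassed(T)(ref T arr)
    if (isHugeStaticArray!T)
{
    return "by ref";
}
```

The caller writes `howPassed(arr)` in both cases; the size of the static array type decides which overload is instantiated.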
Feb 21 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Sun, Feb 22, 2009 at 4:00 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Jarrett Billingsley wrote:
 On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Yah. Walter agrees that that's the right thing to do. The only thing that
 worries us is passing by-value large statically-sized vectors to template
 functions. But then gaming code wants to do exactly that.



I don't follow you. Wouldn't they rather pass such huge chunks of data by reference? --bb
Feb 21 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Mon, Feb 23, 2009 at 10:24 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Don wrote:
 Bill Baxter wrote:
 Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow
 repeats like .xxyy.

Yes. Is the syntax sugar actually needed for all the permutations? Even so, it's still only 256, which is probably still OK. I don't think a language change is required.

There's no need to ever enumerate all functions - they can be generated with templates and mixins rather easily.

I think the issue is just that however you get them there it's a lot of exe bloat. The ideal would be something like a template that only got instantiated when used, like

	vec2 = vec1.swizzle!("xyzy");

... but with the syntax

	vec2 = vec1.xyzy;

and which can generate optimal assembly. Really, if .swizzle!("xyzy") can do that, though, that looks good enough to me. I would rather have that than 256 silly little functions that get instantiated no matter what.

--bb
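
Later D2 compilers gained an `opDispatch` member template that gets close to this: unknown member names are forwarded to a template, so `vec1.xyzy` instantiates exactly one function for that spelling and nothing else. The sketch below assumes such a compiler; `Vec4f`, `lane` and `isSwizzle` are hypothetical names, and whether the generated code is optimal is left to the optimiser.

```d
struct Vec4f
{
    float[4] data;

    // Lane index for a component letter, or -1 if it is not one of x, y, z, w.
    private static int lane(char c)
    {
        foreach (i, name; "xyzw")
            if (c == name) return cast(int) i;
        return -1;
    }

    // Accepts 1 to 4 letters, repeats allowed (all 4^4 four-letter forms).
    private static bool isSwizzle(string s)
    {
        if (s.length == 0 || s.length > 4) return false;
        foreach (c; s)
            if (lane(c) < 0) return false;
        return true;
    }

    // v.xyzy is rewritten to v.opDispatch!"xyzy"; instantiated only when used.
    auto opDispatch(string spec)() const
        if (isSwizzle(spec))
    {
        float[spec.length] r;
        static foreach (i, c; spec)
            r[i] = data[lane(c)];
        return r;
    }
}
```

Names that are not valid swizzles fail the constraint, so a typo such as `v.abc` is still a compile-time error.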
Feb 23 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Mon, Feb 23, 2009 at 5:18 PM, Don <nospam nospam.com> wrote:
 Mattias Holm wrote:
 On 2009-02-21 17:03:06 +0100, Don <nospam nospam.com> said:
 I don't think that's messy at all. I can't see much difference between
 special support for float[4] versus float4. It's better if the code can take
 advantage of hardware without specific support. Bear in mind that SSE/SSE2
 is a temporary situation. AVX provides for much longer arrays of vectors;
 and it's extensible. You'd end up needing to keep adding on special types
 whenever a new CPU comes out.

 Note that the fundamental concept which is missing from the C virtual
 machine is that all modern machines can efficiently perform operations on
 arrays of built-in types of length 2^n, for some small value of n.
 We need to get this into the language abstraction. Not follow C++ in
 hacking a few extra special types onto the old, deficient C model. And I
 think D is actually in a position to do this.

 float[4] would be a greatly superior option if it could be done.
 The key requirements are:
 (1) need to specify that static arrays are passed by value.
 (2) need to keep stack aligned to 16.
 The good news is that both of these appear to be done on DMD2-Mac!

Yes, float[4] would be ok, if some CPU independent permutation support can be added. Would this be with some intrinsic then or what? I very much like the OpenCL syntax for permutation, but I suppose that an intrinsic such as "float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int newPos2, int newPos3)" would work as well. Note that this should also work with double[2], byte[16], short[8] and int[4].

Note that if you had static arrays with value semantics, with proper alignment, then you could simply create

	module std.swizzle;

	float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int newPos3); /* intrinsic */

	float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
	float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }
	// etc

---
and your code would be:

	import std.swizzle;

	void main()
	{
	    float[4] t;
	    auto u = t.wzyx;
	}

I don't think this is terribly difficult once the value semantics are in place.
(Note that once you get beyond 4 members, the .xyzw syntax gives an explosion of functions; but I think it's workable at 4; 4! is only 24. Beyond that point, you'd probably require direct permute calls).

Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow repeats like .xxyy.

--bb
Feb 23 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Fri, Feb 20, 2009 at 4:25 AM, Mattias Holm <hannibal.holm gmail.com> wrote:
 Since (SIMD) vectors are so common and every reasonabe system support them
 in one way or the other (and scalar emulation of this is rather simple), why
 not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for why I
 like D :) ), but have the problem that an array of floats must be aligned on
 float boundaries and not vector boundaries. In my mind vectors are a
 primitive data type that should be exposed by the programming language.

To justify making them primitive types you need to show that they are widespread, and that there is some good reason that they cannot be implemented in a library. And even if they can't be implemented in a library right now, it could be that fixing that reason is better than making new built-in types. For instance I think there's issues now with getting the right stack alignment for structs. But that's something that needs to be fixed generally, not by making new built-in types that know how to do alignment right.
 Something OpenCL-like:

        float4 vec;
        vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
        vec.xyzw = vec.wyxz; // permutation
        vec[i] = 1.0; // indexing

 And then we can easily immagine some extra nice features to have with
 respect to operators:

        vec ^ vec2; // 3d cross product for float vectors, for int vectors
 xor

All of this is pretty much doable in a library. The only nonobvious one is vec.xyzw = vec.wyxz, which can be implemented using a bunch of CTFE auto-code generation (see some earlier version of H3r3tic's Hybrid library). And using xor for cross product sucks. It doesn't have the right precedence.
 Has this been discussed before?

I think it has and no one came up with a reason why these can't be library types. --bb
Feb 19 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter:
 I think it has and no one came up with a reason why these can't be
 library types.

Recently I have discussed this with the LDC main designer; my idea was to have something like:

	Vect!(double, 2) v1;
	Vect!(float, 4) v2;

And then make LDC use LLVM intrinsics for the operations on them. So I think with some intrinsics most of that can be just a library.

Bye,
bearophile
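
A library-only skeleton of such a type might look like the following. It is a sketch that assumes D2 operator overloading (`opBinary`) and leans on the built-in array operations instead of LLVM intrinsics; the `Vect` name comes from the post, the rest is an assumption.

```d
struct Vect(T, size_t N)
{
    T[N] data;

    // Element-wise +, -, * and / between two vectors of the same shape,
    // expressed through D's array operations.
    Vect opBinary(string op)(Vect rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        Vect result;
        result.data[] = mixin("data[] " ~ op ~ " rhs.data[]");
        return result;
    }
}
```

A compiler such as LDC could then special-case or inline these operations, while other compilers fall back to the generic array-operation code.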
Feb 19 2009
prev sibling parent Mattias Holm <hannibal.holm gmail.com> writes:
On 2009-02-19 21:04:06 +0100, Bill Baxter <wbaxter gmail.com> said:
 To justify making them primitive types you need to show that they are
 widespread, and that there is some good reason that they cannot be
 implemented in a library.  And even if they can't be implemented in a
 library right now, it could be that fixing that reason is better than
 making new built-in types.   For instance I think there's issues now
 with getting the right stack alignment for structs.  But that's
 something that needs to be fixed generally, not by making new built-in
 types that know how to do alignment right.

Firstly, they are widespread. They might be hampered by the fact that they have no direct support in many languages, however. There are reasons that you nowadays have standard vector syntax in GCC. It is a bit rough (it only supports basic operations like elementwise add, div, mul etc.) and does not support permutations. The cross product is only a nice-to-have operator for floats, and it is far from necessary; permutations are more painful though, as GCC at the moment forces you to write CPU-specific intrinsics for them.

Firstly there are the alignment issues, but as others said there is an alignment attribute in D as well. More problematic for a library implementation of these types, however, is calling conventions. Passing a vector into a function is very efficient: vectors can be passed in registers for both in and out values (if the ABI allows; I think the PPC and x86-64 calling conventions allow for this).

The problem with the GCC versions, however, is that due to how the vectors work in it, you have to resort to union hacks to get the scalars out of a vector:

	union {
		__m128 data;
		struct {
			float x, y, z, w;
		}
	}

Unfortunately, due to this, in order to make the compiler issue nice code, you have to work with the vector type only, and only bring out the union when inspecting the scalars in the vector, which unfortunately will happen quite often. (The union has ambiguous call-by-value semantics: should it be passed in one or four xmm regs, assuming x86-64 conventions?)

Also, if a vector is a struct you do get the alignment issues in some cases, but you may also have different call-by-value calling conventions (with x86-64 they would in principle be the same, but that is just coincidence).

I do agree that if it could be implemented as a library type, then it should be, but this doesn't work if you take calling conventions into account.

Also, I am sure that the optimiser could more easily make transformations on vector types than on structs that are being passed around, since the semantics of a primitive type are known by the compiler.

/ Mattias
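
For comparison, the same union trick written in D looks like this. It is only a sketch of the layout, with a plain float[4] standing in for the hardware vector type; the `Float4` name is hypothetical, and how such a struct is passed (registers or memory) still depends on the compiler and ABI, which is exactly the concern raised above.

```d
struct Float4
{
    union
    {
        float[4] data;                // whole-vector view
        struct { float x, y, z, w; }  // per-lane view of the same 16 bytes
    }
}

static assert(Float4.sizeof == 16);
```

Writing through `v.y` and reading back `v.data[1]` touch the same storage, so scalar access needs no extra copies at the source level.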
Feb 19 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 19 Feb 2009 23:05:34 +0300, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm  
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonabe system support  
 them in one way or the other (and scalar emulation of this is rather  
 simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for  
 why I like D :) ), but have the problem that an array of floats must  
 be aligned on float boundaries and not vector boundaries. In my mind  
 vectors are a primitive data type that should be exposed by the  
 programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily immagine some extra nice features to have with  
 respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int vectors  
 xor

 Has this been discussed before?

 / Mattias


Yah, I was thinking the same:

	struct float4
	{
	    __align(16) float[4] data; // right syntax and value?
	    alias data this;
	}

This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/.

Andrei

That would be great. If float4 gets its way into D, I'll share our blazing fast math code with the community (most common operations on vectors, matrices, quaternions etc). It is written entirely in SSE (intrinsics, not asm; there is a problem with inlining asm in D, IIRC. Can anyone elaborate on this?) and *very* fast. According to our benchmarks, that's the best we can squeeze out of the hardware.

I know LLVM has support for a *very* wide range of intrinsics: http://www.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/include/llvm/Intrinsics.gen

Hopefully they will get into LDC (and DMD *hint* Walter *hint*) very soon.
Feb 19 2009
prev sibling next sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Thu, Feb 19, 2009 at 9:31 PM, Daniel Keep
<daniel.keep.lists gmail.com> wrote:
 struct float4
 {
     __align(16) float[4] data; // right syntax and value?
     alias data this;
 }


 1. The alignment thing.  Incidentally, I just did a quick check and
 don't see any notes in the changelog about __align(n) syntax.  As I
 remember, there was no way to actually ensure the data was properly
 aligned.  (There's "Data items in static data segment >= 16 bytes in
 size are now paragraph aligned." but that doesn't help when the vectors
 are on, say, the stack or in the heap.)

It's just align(16). http://www.digitalmars.com/d/1.0/attribute.html#align
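
Putting the corrected attribute into the struct from the quote gives something like the sketch below. It assumes a D2 compiler with `alias this`; the type is renamed `Float4` here to avoid clashing with any built-in vector type, and the attribute is applied to both the struct and the field so the intent is explicit. Whether stack and heap instances really end up 16-byte aligned is still up to the compiler and runtime, as the quoted note points out.

```d
align(16) struct Float4
{
    align(16) float[4] data;
    alias data this; // lets a Float4 be indexed and sliced like a float[4]
}

static assert(Float4.alignof == 16);
static assert(Float4.sizeof == 16);
```

Thanks to `alias data this`, code such as `v[2] = 5.0f;` or `v[] = 1.0f;` works directly on a `Float4`.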
Feb 19 2009
prev sibling next sibling parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 20 Feb 2009 06:22:40 +0300, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:05:34 +0300, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> wrote:

 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm  
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonable system  
 support them in one way or the other (and scalar emulation of this  
 is rather simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for  
 why I like D :) ), but have the problem that an array of floats must  
 be aligned on float boundaries and not vector boundaries. In my mind  
 vectors are a primitive data type that should be exposed by the  
 programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have  
 with respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int  
 vectors xor

 Has this been discussed before?

 / Mattias


Yah, I was thinking the same:

struct float4
{
    __align(16) float[4] data; // right syntax and value?
    alias data this;
}

This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/.

Andrei

That would be great. If float4 makes its way into D, I'll share our blazing fast math code with the community (most common operations on vectors, matrices, quaternions etc). It is written entirely in SSE (intrinsics, not asm; there is a problem with inlining asm in D, IIRC. Can anyone elaborate on this?) and is *very* fast. According to our benchmarks, that's the best we can squeeze out of the hardware. I know LLVM has support for a *very* wide range of intrinsics:
http://www.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/include/llvm/Intrinsics.gen
Hopefully they will get into LDC (and DMD *hint* Walter *hint*) very soon.

Put me down for that. What do I need to do? Andrei

Convince Walter to add float4 type and some intrinsics to DMD (I'll post a list of those we use later), LDC will follow, I believe. There should be some type that would be treated specially. After all, intrinsics have function signatures and those should specify some concrete types.
Feb 19 2009
parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Sat, Feb 21, 2009 at 12:32 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Yah. Walter agrees that that's the right thing to do. The only thing that
 worries us is passing by-value large statically-sized vectors to template
 functions. But then gaming code wants to do exactly that. It's hard to
 figure where to draw the line. Imagine the error message "Hey, you're going
 a bit overboard by passing 512 bytes around on the stack".

Structs already work like this. In fact, the compiler will pass a struct in a register if it's 1, 2, or 4 bytes on x86. Having the compiler "magically" put float[4]s in SSE registers seems like a similar idea.
 Besides, we already do have a solution for pass-by-value vectors:
 Tuple!(T[N]). That would put the burden in the right place (on the
 programmer actively wanting pass-by-value). But then it's a shame that the
 built-in type T[N] is a weird exception that must be handled in all template
 code.

Please make them value types. I, for one, am tired of dealing with their crap.
Feb 21 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 20 Feb 2009 08:55:16 +0300, Denis Koroskin <2korden gmail.com>  
wrote:

 On Fri, 20 Feb 2009 06:22:40 +0300, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> wrote:

 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:05:34 +0300, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> wrote:

 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 22:25:04 +0300, Mattias Holm  
 <hannibal.holm gmail.com> wrote:

 Since (SIMD) vectors are so common and every reasonable system  
 support them in one way or the other (and scalar emulation of this  
 is rather simple), why not have support for this in D directly?

 Yes, the array operations are nice (and one of the main reasons for  
 why I like D :) ), but have the problem that an array of floats  
 must be aligned on float boundaries and not vector boundaries. In  
 my mind vectors are a primitive data type that should be exposed by  
 the programming language.

 Something OpenCL-like:

     float4 vec;
     vec.xyzw = {1.0,1.0, 1.0, 1.0}; // assignment
     vec.xyzw = vec.wyxz; // permutation
     vec[i] = 1.0; // indexing

 And then we can easily imagine some extra nice features to have  
 with respect to operators:

     vec ^ vec2; // 3d cross product for float vectors, for int  
 vectors xor

 Has this been discussed before?

 / Mattias


Yah, I was thinking the same:

struct float4
{
    __align(16) float[4] data; // right syntax and value?
    alias data this;
}

This looks like something that should go into std.matrix pronto. It even has value semantics even though fixed arrays don't :o/.

Andrei

That would be great. If float4 makes its way into D, I'll share our blazing fast math code with the community (most common operations on vectors, matrices, quaternions etc). It is written entirely in SSE (intrinsics, not asm; there is a problem with inlining asm in D, IIRC. Can anyone elaborate on this?) and is *very* fast. According to our benchmarks, that's the best we can squeeze out of the hardware. I know LLVM has support for a *very* wide range of intrinsics:
http://www.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/include/llvm/Intrinsics.gen
Hopefully they will get into LDC (and DMD *hint* Walter *hint*) very soon.

Put me down for that. What do I need to do? Andrei

Convince Walter to add float4 type and some intrinsics to DMD (I'll post a list of those we use later), LDC will follow, I believe. There should be some type that would be treated specially. After all, intrinsics have function signatures and those should specify some concrete types.

Here is some nice documentation on the MMX, SSE and SSE2 intrinsics:
http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.80).aspx

Here are some quick statistics on which intrinsics are used in our code and how many times. Note that the counts don't map directly to how many times each one is *actually* used in user code. This may give Walter some information about priorities (intrinsics that aren't often used may be given lower priority, for example).

Arithmetic Operations (Floating-Point SSE2 Intrinsics)
http://msdn.microsoft.com/en-us/library/708ya3be(VS.80).aspx

_mm_add_ss - 2
_mm_add_ps - 48
_mm_sub_ss - 4
_mm_sub_ps - 24
_mm_mul_ss - 2
_mm_mul_ps - 100
_mm_div_ss - 0
_mm_div_ps - 1
_mm_sqrt_ss - 0
_mm_sqrt_ps - 0
_mm_rcp_ss - 1
_mm_rcp_ps - 0
_mm_rsqrt_ss - 0
_mm_rsqrt_ps - 1
_mm_min_ss - 0
_mm_min_ps - 1
_mm_max_ss - 0
_mm_max_ps - 1

Store Operations (SSE)
http://msdn.microsoft.com/en-us/library/ybhzf6dk(VS.80).aspx

_mm_store_ss - 1
_mm_store1_ps - 0
_mm_store_ps1 - 0
_mm_store_ps - 0
_mm_storeu_ps - 0
_mm_storer_ps - 0
_mm_move_ss - 2

Set Operations (SSE)
http://msdn.microsoft.com/en-us/library/wbzwdy6a(VS.80).aspx

_mm_set_ss - 0
_mm_set1_ps - 0
_mm_set_ps1 - 19
_mm_set_ps - 45
_mm_setr_ps - 0
_mm_setzero_ps - 2

Logical Operations (SSE)
http://msdn.microsoft.com/en-us/library/9759as73(VS.80).aspx

_mm_and_ps - 2
_mm_andnot_ps - 0
_mm_or_ps - 0
_mm_xor_ps - 3

Miscellaneous Instructions That Use Streaming SIMD Extensions
http://msdn.microsoft.com/en-us/library/dzs626wx.aspx

_mm_shuffle_ps - 124
_mm_shuffle_pi16 - 0
_mm_unpackhi_ps - 0
_mm_unpacklo_ps - 0
_mm_loadh_pi - 0
_mm_storeh_pi - 0
_mm_movehl_ps - 0
_mm_movelh_ps - 0
_mm_loadl_pi - 0
_mm_storel_pi - 0
_mm_movemask_ps - 0
_mm_getcsr - 0
_mm_setcsr - 0
_mm_extract_si64 - 0
_mm_extracti_si64 - 0
_mm_insert_si64 - 0
_mm_inserti_si64 - 0

Comparison Intrinsics (SSE)
http://msdn.microsoft.com/en-us/library/w8kez9sf(VS.80).aspx

Not used

Conversion Operations (SSE)
http://msdn.microsoft.com/en-us/library/0d4dtzhb(VS.80).aspx

Not used

Macros

_MM_SHUFFLE - 100
#define _MM_SHUFFLE(fp3,fp2,fp1,fp0) (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)))
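
Since _mm_shuffle_ps and _MM_SHUFFLE top the list, here is a rough sketch of what a D-side counterpart of the macro could look like, together with a plain scalar emulation of the shuffle so the selector encoding can be checked without any intrinsics. The names mmShuffle and shufflePs are made up for this example; nothing here is an existing DMD or LDC intrinsic.

```d
// D counterpart of the C macro: packs four 2-bit lane selectors into one byte.
ubyte mmShuffle(int fp3, int fp2, int fp1, int fp0) pure nothrow @nogc @safe
{
    return cast(ubyte)((fp3 << 6) | (fp2 << 4) | (fp1 << 2) | fp0);
}

// Scalar emulation of what _mm_shuffle_ps(a, b, imm) computes:
// the two low result lanes are picked from a, the two high lanes from b.
float[4] shufflePs(float[4] a, float[4] b, ubyte imm) pure nothrow @nogc @safe
{
    float[4] r;
    r[0] = a[imm & 3];
    r[1] = a[(imm >> 2) & 3];
    r[2] = b[(imm >> 4) & 3];
    r[3] = b[(imm >> 6) & 3];
    return r;
}

unittest
{
    // Usable at compile time, like the macro: (3,2,1,0) is the identity selector.
    static assert(mmShuffle(3, 2, 1, 0) == 0xE4);

    float[4] v = [1f, 2f, 3f, 4f];
    auto r = shufflePs(v, v, mmShuffle(0, 1, 2, 3)); // reverses the lanes
    assert(r[0] == 4f && r[1] == 3f && r[2] == 2f && r[3] == 1f);
}
```

Because mmShuffle is an ordinary pure function, it can be evaluated at compile time, which matters if the selector has to end up as an immediate operand.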
Feb 20 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 22 Feb 2009 14:02:53 +0300, Mattias Holm <hannibal.holm gmail.com>
wrote:

 I think that the following would work reasonably well:

 	allow the [] operator for arrays to take comma separated lists of  
 indices.

 So the OpenCL like statement:

 	v.xyzw = v2.wzyx;

 will be written as:

 	v[0,1,2,3] = v2[3,2,1,0];

 Would this be ok? This is a general extension of array slicing, and  
 it might be possible to permute with a combination of slices and indices  
 like this (i.e. v[0..3, 6, 5, 7..len]). Are permutation operations  
 something that Walter would be willing to add?

 As said by someone else in this thread, there needs to be a way to  
 specify that static arrays are passed by value, so can the ref keyword  
 be paired with the opposite "byval" or something similar?

 And also, functions need to be able to return static arrays which is not  
 possible at the moment.

 Note that the support should be general and work with any array type (so  
 that you can get YMM support whenever that makes it into the future  
 chips).


 / Mattias

How would you implement it for user-defined types?
Feb 22 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 22 Feb 2009 16:51:10 +0300, Christopher Wright <dhasenan gmail.com>
wrote:

 Denis Koroskin wrote:
 On Sun, 22 Feb 2009 14:02:53 +0300, Mattias Holm  
 <hannibal.holm gmail.com> wrote:

 I think that the following would work reasonably well:

     allow the [] operator for arrays to take comma separated lists of  
 indices.

 So the OpenCL like statement:

     v.xyzw = v2.wzyx;

 will be written as:

     v[0,1,2,3] = v2[3,2,1,0];


T[] opIndex(int[] indices) { ... }
void opIndexAssign(int[] indices, T[] values) { ... }

How about ranges included - v[0..3, 6, 5, 7..len] ?
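
For the comma-separated case on its own, something along these lines should work on a user-defined type. This is a sketch assuming a current D2 compiler and typesafe variadic parameters; Vec is a made-up name. Note that D's opIndexAssign takes the assigned value first and the indices after it, and that this does not cover the mixed range form such as v[0..3, 6].

```d
// Hypothetical fixed-size vector type used only for illustration.
struct Vec(T, size_t N)
{
    T[N] data;

    // v[3, 2, 1, 0] gathers the listed elements into a new array.
    T[] opIndex(size_t[] indices...)
    {
        auto result = new T[indices.length];
        foreach (k, i; indices)
            result[k] = data[i];
        return result;
    }

    // v[0, 1, 2, 3] = values scatters values into the listed elements.
    void opIndexAssign(const(T)[] values, size_t[] indices...)
    {
        assert(values.length == indices.length);
        foreach (k, i; indices)
            data[i] = values[k];
    }
}

unittest
{
    auto v = Vec!(float, 4)([1f, 2f, 3f, 4f]);
    v[0, 1, 2, 3] = v[3, 2, 1, 0]; // in-place reversal
    assert(v.data[0] == 4f && v.data[3] == 1f);
}
```

The right-hand side is gathered into a temporary before anything is written, so permuting a vector into itself is safe, at the cost of an allocation that a built-in permutation would not need.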
Feb 22 2009
prev sibling parent "Joel C. Salomon" <joelcsalomon gmail.com> writes:
Mattias Holm wrote:
 And then we can easily imagine some extra nice features to have with
 respect to operators:
 
     vec ^ vec2; // 3d cross product for float vectors, for int vectors xor
 
 Has this been discussed before?

Given that the wedge product is a defined operation on vectors (a^b is a bivector in the common plane of a & b with area |a||b|sin θ)—related to, but very distinct from, the cross product—I’d call this a BAD operator overload. (B.T.W., I am starting work on a Geometric Algebra library which will implement all these products.) —Joel Salomon
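
To make the distinction concrete, here is a small sketch with the two products as named functions returning different types, rather than sharing one operator. Vec3, Bivector3, cross and wedge are illustrative names only and are not taken from any existing library. In 3D the component values coincide, but the wedge product yields an oriented plane element, not a vector.

```d
struct Vec3 { double x, y, z; }

// Bivector with components on the e12, e23 and e31 basis planes.
struct Bivector3 { double e12, e23, e31; }

// Cross product: a vector perpendicular to both a and b.
Vec3 cross(Vec3 a, Vec3 b) pure nothrow @nogc @safe
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}

// Wedge product: an oriented area element in the plane spanned by a and b.
Bivector3 wedge(Vec3 a, Vec3 b) pure nothrow @nogc @safe
{
    return Bivector3(a.x * b.y - a.y * b.x,
                     a.y * b.z - a.z * b.y,
                     a.z * b.x - a.x * b.z);
}

unittest
{
    auto ex = Vec3(1, 0, 0), ey = Vec3(0, 1, 0);
    assert(cross(ex, ey).z == 1);   // points along z
    assert(wedge(ex, ey).e12 == 1); // lies in the xy plane
}
```

With distinct return types, passing a bivector where a vector is expected becomes a compile error, which is the kind of mistake a shared ^ overload would hide.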
Feb 22 2009