
digitalmars.D - SIMD support...

reply Manu <turkeyman gmail.com> writes:
So I've been hassling about this for a while now, and Walter asked me to
pitch an email detailing a minimal implementation with some initial
thoughts.

The first thing I'd like to say is that a lot of people seem to have this
idea that float[4] should be specialised as a candidate for simd
optimisations somehow. It's obviously been discussed, and this general
opinion seems to be shared by a good few people here.
I've had a whole bunch of rants why I think this is wrong in other threads,
so I won't repeat them here... and that said, I'll attempt to detail an
approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's
all.


-- What shall we call it? --

Doesn't really matter... open to suggestions.
VisualC calls it __m128, XBox360 calls it __vector4, and GCC calls it 'vector
float' (a name I particularly hate: it specifies no size, and it tries to
associate the type with a specific element type).

I like v128, or something like that. I'll use that for the sake of this
document. I think it is preferable to float4 for a few reasons:
 * v128 says what the register intends to be, a general purpose 128bit
register that may be used for a variety of simd operations that aren't
necessarily type bound.
 * float4 implies it is a specific 4 component float type, which is not
what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4,
(u)short8, etc should also exist, and it also stands to reason that one
might expect math operators and such to be defined...

I suggest initial language definition and implementation of something like
v128, and then types like float4, (u)int4, etc, may be implemented in the
std library with complex behaviour like casting mechanics, and basic math
operators...


-- Alignment --

This type needs to be 16byte aligned. Unaligned loads/stores are very
expensive, and also tend to produce extremely costly LHS hazards on most
architectures when accessing vectors in arrays. If they are not aligned,
they are useless... honestly.

** Does this cause problems with class allocation? Are/can classes be
allocated to an alignment as inherited from an aligned member? ... If not,
this might be the bulk of the work.

There is one other problem I know of that is only of concern on x86.
In the C ABI, passing 16byte ALIGNED vectors by value is a problem,
since x86 ALWAYS uses the stack to pass arguments, and has no way to align
the stack.
I wonder if D can get creative with its ABI here, passing vectors in
registers, even though that's not conventional on x86... the C ABI was
invented long before these hardware features.
In lieu of that, x86 would (sadly) need to silently pass by const ref...
and also do this in the case of register overflow.

Every other architecture (including x64) is fine, since all other
architectures pass in regs, and can align the stack as needed when
overflowing the regs (since stack management is manual and not performed
with special opcodes).


-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the
compiler allocating SIMD registers, managing assignments, loads, and
stores, and allows passing to/from functions BY VALUE in registers.
I.e. the only valid operations would be:
  v128 myVec = someStruct.vecMember; // and vice versa...
  v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy
functions that copy 16 bytes at a time... ;)
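The fast-memcpy idea can be sketched today in C with SSE2 intrinsics (the function name and the aligned/multiple-of-16 preconditions are mine, standing in for what an aligned v128 type would guarantee):

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes, 16 at a time. Assumes dst and src are 16-byte aligned
   and n is a multiple of 16 -- the same preconditions an aligned v128
   type would give you for free. */
static void copy16(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i)); /* aligned movdqa pair */
}
```

Note the aligned load/store forms: with unaligned pointers this would fault, which is precisely why the alignment guarantee matters.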


-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or
architecture intrinsics to do useful stuff. This would be using the
hardware totally raw, which is an important feature to have, but I imagine
most of the good stuff would come from libraries built on top of this.


-- Literal assignment --

This is a hairy one. Endian issues appear in 2 layers here...
Firstly, if you consider the vector to be 4 int's, the ints themselves may
be little or big endian, but in addition, the outer layer (ie. the order of
x,y,z,w) may also be in reverse order on some architectures... This makes a
single 128bit hex literal hard to apply.
I'll have a dig and try and confirm this, but I have a suspicion that VMX
defines its components reverse to other architectures... (Note: not usually
a problem in C, because vector code is sooo non-standard in C that this is
ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order
can suit)
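The inner layer of the problem can be observed directly on x86; a small C sketch (x86/little-endian specific, and the function name is mine -- the VMX ordering question above has to be checked on that hardware):

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>
#include <string.h>

/* On x86 (little-endian), the lane written as element 0 by
   _mm_setr_epi32 occupies the lowest-addressed 4 bytes, and within
   that lane the least significant byte comes first. Those are the
   two ordering layers described above; other ISAs may differ in
   either layer. */
static void lane_bytes(uint8_t out[16])
{
    __m128i v = _mm_setr_epi32(0x01234567, 0, 0, 0);
    memcpy(out, &v, 16);
}
```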

For the primitive v128 type, I generally like the idea of using a huge
128bit hex literal.
  v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to
use syntax like this:
  v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not be linearly applicable to all hardware. If
the order of the components match the endian, then it is fine...
I suspect VMX orders the components reverse to match the fact the values
are big endian, which would be good, but I need to check. And if not...
then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important; it's
common to generate bit masks in these registers which are type-independent.
I also guess libraries will still need to leverage this primitive assignment
functionality to implement their more complex literal expressions.
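As a rough illustration in C with SSE2 intrinsics (abs4 is an invented name): a type-independent bit mask built from raw integer bits and applied to floats, the kind of typeless literal use described above:

```c
#include <emmintrin.h> /* SSE2 */

/* Absolute value of 4 packed floats by masking off the sign bits.
   The mask is expressed as raw integer bits and then reinterpreted
   as floats -- no value conversion takes place. */
static __m128 abs4(__m128 v)
{
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_and_ps(v, mask);
}
```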


-- Libraries --

With this type, we can write some useful standard libraries. For a start,
we can consider adding float4, int4, etc, and make them more intelligent...
they would have basic maths operators defined, and probably implement type
conversion when casting between types.

  int4 intVec = floatVec; // performs a type conversion from float to int, or vice versa... (perhaps we make this require an explicit cast?)

  v128 vec = floatVec; // implicit cast to the raw type is always possible, and does no type conversion, just a reinterpret
  int4 intVec = vec; // conversely, the primitive type would implicitly assign to other types
  int4 intVec = cast(v128)floatVec; // piping through the primitive v128 allows one to easily perform a reinterpret between vector types, rather than the usual type conversion
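The conversion-vs-reinterpret distinction can be sketched with SSE2 intrinsics in C (function names are illustrative): `_mm_cvtps_epi32` converts values, while `_mm_castps_si128` merely relabels the bits, which is the role the raw v128 plays above:

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Value conversion: rounds each float to the nearest int32. */
static __m128i convert4(__m128 f) { return _mm_cvtps_epi32(f); }

/* Reinterpret: same 128 bits, just viewed as integers. */
static __m128i reinterpret4(__m128 f) { return _mm_castps_si128(f); }
```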

There are also a truckload of other operations that would be fleshed out.
For instance, strongly typed literal assignment, vector comparisons that
can be used with if() (usually these allow you to test if ALL components,
or if ANY components meet a given condition). Conventional logic operators
can't be neatly applied to vectors. You need to do something like this:
  if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
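A sketch of how such all/any predicates reduce to hardware on SSE (C with intrinsics; the function names mirror the hypothetical std.simd calls above, not a real library):

```c
#include <xmmintrin.h> /* SSE */

/* movmskps packs the sign bit of each comparison lane into the low
   4 bits of a scalar; "all" and "any" then reduce to comparing that
   mask against 0xF and 0 respectively. */
static int allGreater(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpgt_ps(a, b)) == 0xF;
}

static int anyLessOrEqual(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmple_ps(a, b)) != 0;
}
```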

We can discuss the libraries at a later date, but it's possible that you
might also want to make some advanced functions in the library that are
only supported on particular architectures, std.simd.sse...,
std.simd.vmx..., etc. which may be version()-ed.


-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various
behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should be default behaviour?
I'll bet the D community will opt for strict NaNs and throw by default... but
it is actually VERY common to disable hardware exceptions when working with
SIMD code:
  * often precision is less important than speed when using SIMD, and some
SIMD units perform faster when these features are disabled.
  * most SIMD algorithms (at least in performance oriented code) are
designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
other error condition.
  * realtime physics tends to suffer error creep and freaky random
explosions, and you can't have those crashing the program :) .. they're not
really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some
mechanism to change the SIMD unit flags for realtime use... A runtime
function? Perhaps a compiler switch (C does this sort of thing a lot)?

It's also worth noting that there are numerous SIMD units out there that
DON'T follow strict ieee float rules, and don't support NaNs or hardware
exceptions at all... others may simply set a divide-by-zero flag, but not
actually trigger a hardware exception, requiring you to explicitly check
the flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is
unsupported on such platforms? What are the implications of this?
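On SSE specifically, this flag-only behaviour can be observed through the MXCSR control/status register; a small C sketch (assuming the default MXCSR, where all exceptions are masked):

```c
#include <xmmintrin.h> /* SSE: MXCSR access */

/* With exceptions masked (the power-on default), dividing by zero
   does not trap: it produces +inf and sets a sticky status flag
   that you must poll explicitly. */
static int divzero_sets_flag_only(void)
{
    volatile float one = 1.0f, zero = 0.0f; /* defeat constant folding */
    float out[4];
    _mm_setcsr(_mm_getcsr() & ~_MM_EXCEPT_MASK); /* clear sticky flags */
    _mm_storeu_ps(out, _mm_div_ps(_mm_set1_ps(one), _mm_set1_ps(zero)));
    one = out[0];                                /* keep the divide live */
    return (_mm_getcsr() & _MM_EXCEPT_DIV_ZERO) != 0;
}
```

Unmasking the exception bits in MXCSR is how a runtime switch to strict, trapping behaviour would be implemented on this unit.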


-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256
type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to
int, or double is to float. They are different types with different
register allocation and addressing semantics, and deserve a distinct type.
As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512bit (4x4 matrix) registers...
same story; implement a primitive type, then using intrinsics, we can build
interesting types in libraries.

We may also consider a v64 type, which would map to the older MMX registers on
x86... there are also other architectures with 64-bit 'vector' registers
(the Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc...
Same general concept, but only 64 bits wide.


-- Conclusion --

I think that's about it for a start. I don't think it's particularly a lot
of work; the potential trouble points are 16-byte alignment, literal
expression, and potential issues relating to language guarantees of
exception/error conditions...
Go on, tear it apart!

Discuss...
Jan 05 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 The first thing I'd like to say is that a lot of people seem to have this idea
 that float[4] should be specialised as a candidate for simd optimisations
 somehow. It's obviously been discussed, and this general opinion seems to be
 shared by a good few people here.
 I've had a whole bunch of rants why I think this is wrong in other threads, so
I
 won't repeat them here...

If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.
Jan 05 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 04:12, Walter Bright <newshound2 digitalmars.com> wrote:

 If you could cut&paste them here, I would find it most helpful. I have
 some ideas on making that work, but I need to know everything wrong with it
 first.

On 5 January 2012 11:02, Manu <turkeyman gmail.com> wrote:
 On 5 January 2012 02:42, bearophile <bearophileHUGS lycos.com> wrote:

 Think about future CPU evolution with SIMD registers 128, then 256, then
 512, then 1024 bits long. In theory a good compiler is able to use them
 with no changes in the D code that uses vector operations.

These are all fundamentally different types, like int and long, float and double... and I certainly want a keyword to identify each of them. Even if the compiler is trying to make auto vector optimisations, you can't deny programmers explicit control of the hardware when they want/need it.

Look at x86 compilers: they've been TRYING to perform automatic SSE optimisations for 10 years, with basically no success... do you really think you can do better than all that work by Microsoft and GCC? In my experience, I've even run into a lot of VC's auto-SSE-ed code that is SLOWER than the original float code. Let's not even mention architectures that receive much less love than x86, and are arguably more important (ARM: slower, simpler processors with more demand to perform well, and not waste power).

... Vector ops and SIMD ops are different things. float[4] (or more
 realistically, float[3]) should NOT be a candidate for automatic SIMD
 implementation, likewise, simd_type should not have its components
 individually accessible. These are operations the hardware can not actually
 perform. So no syntax to worry about, just a type.


 I think the good Hara will be able to implement those syntax fixes in a
 matter of just one day or very few days if a consensus is reached about
 what actually is to be fixed in D vector ops syntax.

 Instead of discussing about *adding* something (register intrinsics) I
 suggest to discuss about what to fix about the *already present* vector op
 syntax. This is not a request to just you Manu, but to this whole newsgroup.

And I think this is exactly the wrong approach. A vector is NOT an array of 4 (actually, usually 3) floats. It should not appear as one. This is an overly complicated and ultimately wrong way to engage this hardware. Imagine the complexity in the compiler to try and force float[4] operations into vector arithmetic vs adding a 'v128' type which actually does what people want anyway... What about when you actually WANT a float[4] array, and NOT a vector?

SIMD units are not float units, and they should not appear like an aggregation of float units. They have:
 * Different error semantics, exception handling rules, sometimes different precision...
 * Special alignment rules.
 * Special literal expression/assignment.
 * You can NOT access individual components at will.
 * May be reinterpreted at any time as float[1], float[4], double[2], short[8], char[16], etc... (up to the architecture intrinsics)
 * Can not be involved in conventional comparison logic (an array of floats would make you think they could)
 *** Can NOT interact with the regular 'float' unit... Vectors as an array of floats certainly suggests that you can interact with scalar floats...

I will use architecture intrinsics to operate on these regs, and put that nice and neatly behind a hardware vector type with version()'s for each architecture, and an API with a whole lot of sugar to make them nice and friendly to use.

My argument is that even IF the compiler some day attempts to make vector optimisations to float[4] arrays, the raw hardware should be exposed first, allowing programmers to use it directly. This starts with a language-defined (platform-independent) v128 type.

... Other rants have been on IRC.
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to pitch
 an email detailing a minimal implementation with some initial thoughts.

Another question: Is this worth doing for 32 bit code? Or is anyone doing this doing it for 64 bit only? The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 32 bit code is inefficient for everything else.
Jan 05 2012
parent Manu <turkeyman gmail.com> writes:
 The reason I ask is because 64 bit is 16 byte aligned, but aligning the
 stack in 32 bit code is inefficient for everything else.

Note: you only need to align the stack when a vector is actually stored on it by value. Probably very rare, more rare than you think.
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 -- Alignment --

 This type needs to be 16byte aligned. Unaligned loads/stores are very
expensive,
 and also tend to produce extremely costly LHS hazards on most architectures
when
 accessing vectors in arrays. If they are not aligned, they are useless...
honestly.

 ** Does this cause problems with class allocation? Are/can classes be allocated
 to an alignment as inherited from an aligned member? ... If not, this might be
 the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
Jan 05 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 04:16, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:

 -- Alignment --

 This type needs to be 16byte aligned. Unaligned loads/stores are very
 expensive,
 and also tend to produce extremely costly LHS hazards on most
 architectures when
 accessing vectors in arrays. If they are not aligned, they are useless...
 honestly.

 ** Does this cause problems with class allocation? Are/can classes be
 allocated
 to an alignment as inherited from an aligned member? ... If not, this
 might be
 the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.

It's important for all implementations of simd units, x32, x64, and others. As said, if aligning the x32 stack is too much trouble, I suggest silently passing by const ref on x86.

Are you talking about parameter passing, or local variable assignment on the stack? For parameter passing, I understand the x32 problems with aligning the arguments (I think it's possible to work around though), but there should be no problem with aligning the stack for allocating local variables.
Jan 05 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 6:25 PM, Manu wrote:
 Are you talking about for parameter passing, or for local variable assignment
on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
arguments
 (I think it's possible to work around though), but there should be no problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
Jan 05 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 05:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:

 Are you talking about for parameter passing, or for local variable
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
 arguments
 (I think it's possible to work around though), but there should be no
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

Perhaps I misunderstand, I can't see the problem? In the function preamble, you just align it... something like:

    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

    ... function

    mov esp, reg ; restore the stack pointer
    ret 0
Jan 05 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 7:42 PM, Manu wrote:
 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

 ... function

    mov esp, reg ; restore the stack pointer
    ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.
Jan 05 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 07:22:55 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 7:42 PM, Manu wrote:
 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

 ... function

    mov esp, reg ; restore the stack pointer
    ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.

Aah, I knew there was something that wouldn't work. One could possibly change from RBP-relative addressing to RSP-relative addressing for the inner variables, but that would fail with alloca. So this won't work without a second frame register, does it?

Manu: Instead of using the RegionAllocator you could write an aligning allocator using alloca memory. This will be about the closest you get to that magic compiler alignment.
Jan 06 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 7:42 PM, Manu wrote:

 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
   mov reg, esp ; take a backup of the stack pointer
   and esp, -16 ; align it

 ... function

   mov esp, reg ; restore the stack pointer
   ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.

Hehe, true, but not insurmountable. Scheduling of parameter pops before you perform the alignment may solve that straight up, or else don't align esp itself; store the vector to the stack through some other aligned reg copied from esp...

I just wrote some test functions using __m128 in VisualC; it seems to do something in between the simplicity of my initial suggestion and my refined ideas above :) If you have VisualC, check out what it does; it's very simple, looks pretty good, and I'm sure it's optimal (MS have enough R&D money to assure this). I can paste some disassemblies if you don't have VC...
Jan 06 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Manu:
 
 I can paste some disassemblies if you don't have VC...

Pasting it is useful for all other people reading this thread too, like me. Bye, bearophile
Jan 06 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 6:05 AM, Manu wrote:
 On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/5/2012 7:42 PM, Manu wrote:

         Perhaps I misunderstand, I can't see the problem?
         In the function preamble, you just align it... something like:
            mov reg, esp ; take a backup of the stack pointer
            and esp, -16 ; align it

         ... function

            mov esp, reg ; restore the stack pointer
            ret 0


     And now you cannot access the function's parameters anymore, because the
     stack offset for them is now variable rather than fixed.


 Hehe, true, but not insurmountable. Scheduling of parameter pops before you
 perform the alignment may solve that straight up, or else don't align esp its
 self; store the vector to the stack through some other aligned reg copied from
 esp...

 I just wrote some test functions using __m128 in VisualC, it seems to do
 something in between the simplicity of my initial suggestion, and my refined
 ideas one above :)
 If you have VisualC, check out what it does, it's very simple, looks pretty
 good, and I'm sure it's optimal (MS have enough R&D money to assure this)

 I can paste some disassemblies if you don't have VC...

I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewriting:

    v128 v;
    v = x;

with:

    v128 v;        // goes in aligned stack
    v128 *pv = &v; // pv is in regular stack
    *pv = x;

but there are still complexities with it, like spilling aligned temps to the stack.
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:53, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 6:05 AM, Manu wrote:

 On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 **digitalmars.com <newshound2 digitalmars.com>>>
 wrote:

    On 1/5/2012 7:42 PM, Manu wrote:

        Perhaps I misunderstand, I can't see the problem?
        In the function preamble, you just align it... something like:
           mov reg, esp ; take a backup of the stack pointer
           and esp, -16 ; align it

        ... function

           mov esp, reg ; restore the stack pointer
           ret 0


    And now you cannot access the function's parameters anymore, because
 the
    stack offset for them is now variable rather than fixed.


 Hehe, true, but not insurmountable. Scheduling of parameter pops before
 you
 perform the alignment may solve that straight up, or else don't align esp
 its
 self; store the vector to the stack through some other aligned reg copied
 from
 esp...

 I just wrote some test functions using __m128 in VisualC, it seems to do
 something in between the simplicity of my initial suggestion, and my
 refined
 ideas one above :)
 If you have VisualC, check out what it does, it's very simple, looks
 pretty
 good, and I'm sure it's optimal (MS have enough R&D money to assure this)

 I can paste some disassemblies if you don't have VC...

I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewriting:

    v128 v;
    v = x;

with:

    v128 v;        // goes in aligned stack
    v128 *pv = &v; // pv is in regular stack
    *pv = x;

but there are still complexities with it, like spilling aligned temps to the stack.

I think we should take this conversation to IRC, or a separate thread? I'll generate some examples from VC for you in various situations. If you can write me a short list of trouble cases as you see them, I'll make sure to address them specifically... Have you tested the code that GCC produces? I'm sure it'll be identical to VC...

That said, how do you currently support ANY aligned type? I thought align(n) was a defined keyword in D?
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:08 AM, Manu wrote:
 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical to
VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.
 That said, how do you currently support ANY aligned type? I thought align(n)
was
 a defined keyword in D?

Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less. The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:34, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 11:08 AM, Manu wrote:

 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you
 can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical
 to VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.

...I'm using DMD on Windows... x32. So this isn't ideal ;) Although with this change, Iain should be able to expose the vector types in GDC, and I can work from there, and hopefully even build an ARM/PPC toolchain to experiment with the library in a cross-platform environment.

 That said, how do you currently support ANY aligned type? I thought
 align(n) was a defined keyword in D?

Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less. The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.

... this sounds bad. Shall I start another thread? ;)

So you're saying it's impossible to align a stack based buffer to, say, 128 bytes...? This is another fairly important daily requirement of mine (that I assumed was currently supported). Aligning buffers to cache lines is common, and is required for many optimisations.

Hopefully the work you do to support 16-byte alignment on x86 will also support arbitrary alignment of any buffer... Will arbitrary alignment be supported on x64? What about GCC? Will/does it support arbitrary alignment?
Jan 06 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:53 AM, Manu wrote:
 ... this sounds bad. Shall I start another thread? ;)
 So you're saying it's impossible to align a stack based buffer to, say, 128
 bytes... ?

No, it's not impossible. Here's what you can do now:

    char[128+127] buf;
    char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);

and now pbuf points to 128 bytes, aligned, on the stack.
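The same over-allocate-and-mask trick, generalised and testable in C (align_up is my name for it; the alignment must be a power of two):

```c
#include <stdint.h>

/* Round p up to the next 'align'-byte boundary (align must be a
   power of two). Over-allocating by align-1 bytes guarantees the
   rounded pointer still lies inside the buffer. */
static char *align_up(char *p, uintptr_t align)
{
    return (char *)(((uintptr_t)p + (align - 1)) & ~(align - 1));
}
```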
 Hopefully the work you do to support 16byte alignment on x86 will also support
 arbitrary alignment of any buffer...
 Will arbitrary alignment be supported on x64?

Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.
Jan 06 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 21:16:40 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/6/2012 11:53 AM, Manu wrote:
 ... this sounds bad. Shall I start another thread? ;)
 So you're saying it's impossible to align a stack based buffer to, say,  
 128
 bytes... ?

No, it's not impossible. Here's what you can do now:

    char[128+127] buf;
    char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);

and now pbuf points to 128 bytes, aligned, on the stack.
 Hopefully the work you do to support 16byte alignment on x86 will also  
 support
 arbitrary alignment of any buffer...
 Will arbitrary alignment be supported on x64?

Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.

Only recently (4.6 I think).
Jan 06 2012
prev sibling parent FeepingCreature <default_357-line yahoo.de> writes:
On 01/06/12 21:16, Walter Bright wrote:
 Aligning to non-powers of 2 will never work. As for other alignments, they
only will work if the underlying storage is aligned to that or greater.
Otherwise, you'll have to resort to the method outlined above.
 
 
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.

GCC keeps the stack 16-byte aligned by default.
Jan 07 2012
prev sibling parent reply "Trass3r" <un known.com> writes:
On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC 
 toolchain to experiment with the library in a cross platform 
 environment.

On Windoze? You're a masochist ^^
Jan 06 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Trass3r wrote:
 On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

On Windoze? You're a masochist ^^

Windows 8 will support ARM. I hope that D will too.
Jan 11 2012
parent reply Danni Coy <danni.coy gmail.com> writes:
I was rather under the impression that only the new HTML5 API would be available
under Windows 8 ARM - that they were doing an iOS walled-garden type thing with
it - if true, this could make things difficult...

On Wed, Jan 11, 2012 at 9:42 PM, Piotr Szturmaj <bncrbme jadamspam.pl>wrote:

 Trass3r wrote:

 On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:

 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

On Windoze? You're a masochist ^^

Windows 8 will support ARM. I hope that D will too.

Jan 11 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Danni Coy wrote:
 I was rather under that only the new html5 api would be available under
 windows 8 arm - that they were doing a iOS walled garden type thing with
 it - if true this could make things difficult...

http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom

"[...] And you have your choice of world-class development tools and languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64 and ARM. This is an extremely important point: If you go and build your Metro style app in JavaScript and HTML, in C# or in XAML, that app will just run when there's ARM hardware available. So, you don't have to worry about that. Just write your application in HTML5, JavaScript and C# and XAML and your application runs across all the hardware that Windows 8 supports. (Applause.) And if you want to write native code, we're going to help you do that as well and make it so that you can cross-compile into the other platforms as well. So, full platform support with these Metro style applications."

It means Win8 ARM will be limited to Metro apps only, but you will be able to choose HTML/CSS/JS, .NET or native code.
Jan 11 2012
parent Alex Rønne Petersen <xtzgzorex gmail.com> writes:
On 11-01-2012 13:23, Piotr Szturmaj wrote:
 Danni Coy wrote:
 I was rather under that only the new html5 api would be available under
 windows 8 arm - that they were doing a iOS walled garden type thing with
 it - if true this could make things difficult...

http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom

"[...] And you have your choice of world-class development tools and languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64 and ARM. This is an extremely important point: If you go and build your Metro style app in JavaScript and HTML, in C# or in XAML, that app will just run when there's ARM hardware available. So, you don't have to worry about that. Just write your application in HTML5, JavaScript and C# and XAML and your application runs across all the hardware that Windows 8 supports. (Applause.) And if you want to write native code, we're going to help you do that as well and make it so that you can cross-compile into the other platforms as well. So, full platform support with these Metro style applications."

It means Win8 ARM will be limited to Metro apps only, but you will be able to choose HTML/CSS/JS, .NET or native code.

If they have ported the Common Language Runtime to ARM, I doubt they would put some arbitrary limitation on what apps can run on that hardware. All things considered, AArch32/64 are coming soon. Besides, Windows running on ARM is not a new thing; see Windows Mobile and Windows Phone 7. By now, their ARM support should be as good as their x86 support. - Alex
Jan 11 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/06/12 20:53, Manu wrote:
 What about GCC? Will/does it support arbitrary alignment?

For sane "arbitrary" values (ie powers of two) it looks like this:

--------------------------------
import std.stdio;

struct S {
   align(65536) ubyte[64] bs;
   alias bs this;
}

pragma(attribute, noinline) void f(ref S s) { s[2] = 42; }

void main(string[] argv) {
   S s = void;
   f(s);
   writeln(s.ptr);
}
---------------------------------

turns into:

---------------------------------
 804ae40:  55                    push   %ebp
 804ae41:  89 e5                 mov    %esp,%ebp
 804ae43:  66 bc 00 00           mov    $0x0,%sp
 804ae47:  81 ec 00 00 01 00     sub    $0x10000,%esp
 804ae4d:  89 e0                 mov    %esp,%eax
 804ae4f:  e8 2c 0e 00 00        call   804bc80 <void align.f(ref align.S)>
 804ae54:  89 e0                 mov    %esp,%eax
 804ae56:  e8 c5 ff ff ff        call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae5b:  31 c0                 xor    %eax,%eax
 804ae5d:  c9                    leave
 804ae5e:  c3                    ret
 804ae5f:  90                    nop
---------------------------------

specifying a more sane alignment of 64 gives:

---------------------------------
0804ae40 <_Dmain>:
 804ae40:  55                    push   %ebp
 804ae41:  89 e5                 mov    %esp,%ebp
 804ae43:  83 e4 c0              and    $0xffffffc0,%esp
 804ae46:  83 ec 40              sub    $0x40,%esp
 804ae49:  89 e0                 mov    %esp,%eax
 804ae4b:  e8 30 0e 00 00        call   804bc80 <void align.f(ref align.S)>
 804ae50:  89 e0                 mov    %esp,%eax
 804ae52:  e8 c9 ff ff ff        call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae57:  31 c0                 xor    %eax,%eax
 804ae59:  c9                    leave
 804ae5a:  c3                    ret
---------------------------------
Jan 06 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 21:34, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/6/2012 11:08 AM, Manu wrote:
 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you
 can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical
 to VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.

...I'm using DMD on windows... x32. So this isn't ideal ;) Although with this change, Iain should be able to expose the vector types in GDC, and I can work from there, and hopefully even build an ARM/PPC toolchain to experiment with the library in a cross platform environment.

And will also allow me to tap into many vector intrinsics that gcc offers too, via the gcc.builtins module. :)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector

types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?
Jan 06 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 6 January 2012 22:37, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector
 types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform
 environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?

For backend intrinsics, they are all functions that map to asm instructions of the same name, e.g. __builtin_ia32_addps.

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 00:47, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 6 January 2012 22:37, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector
 types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform
 environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?

For backend intrinsics, they are all functions that map to asm instructions of the same name, ie: __builtin_ia32_addps.

Ah yeah, perfect.. obviously we need all of those for this vector type to be of any use at all ;)
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 05:42, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 05:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:

 Are you talking about for parameter passing, or for local variable
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
 arguments
 (I think it's possible to work around though), but there should be no
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

Perhaps I misunderstand, I can't see the problem? In the function preamble, you just align it... something like:

  mov reg, esp ; take a backup of the stack pointer
  and esp, -16 ; align it

  ... function

  mov esp, reg ; restore the stack pointer
  ret 0

That said, most of the time values used in smaller functions will only ever exist in regs, and won't ever be written to the stack... in this case, there's no problem.
Jan 05 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 04:22:41 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:
 Are you talking about for parameter passing, or for local variable  
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the  
 arguments
 (I think it's possible to work around though), but there should be no  
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

extending

  push RBP;
  mov RBP, RSP;
  sub RSP, localStackSize;

to

  push RBP;
  // new
  mov RAX, RSP;
  and RAX, localAlignMask;
  sub RSP, RAX;
  // new
  mov RBP, RSP;
  sub RSP, localStackSize;

should do the trick. This would require using the biggest align attribute of all stack variables for localAlignMask. Also, align needs to be rounded up to a power of 2 if it isn't one already.

------------
RBP +  0   int a;
RBP +  4   int b;
           24 byte padding
RBP + 32   align(32) struct float8 { float[8] v; } s;
------------
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make it possible to write tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16
Jan 06 2012
next sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.

 2. Even trying to do something with + is fraught with peril, as integer adds
 with SIMD can be saturated or unsaturated.

 3. Trying to build all the details about how each of the various adds and
 other ops work into the compiler/optimizer is a large undertaking. D would
 have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly
 extensible for new instructions, and for which a library can be layered over
 for a more presentable interface. A half-formed idea of mine is, taking a
 cue from yours:

 Declare one new basic type:

    __v128

 which represents the 16 byte aligned 128 bit vector type. The only
 operations defined to work on it would be construction and assignment. The
 __ prefix signals that it is non-portable.

 Then, have:

    import core.simd;

 which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86.
 (Other architectures can provide an implementation of it that simulates its
 operation, but I doubt that it would be worth anyone's while to use that.)

 The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

 For:

    z = simdop(PFADD, x, y);

 the compiler would generate:

    MOV z,x
    PFADD z,y

Would this tie SIMD support directly to x86/x86_64, or would it be possible to also support NEON on ARM (also 128 bit SIMD, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html )? (Obviously not for DMD, but if the syntax wasn't directly tied to x86/64, GDC and LDC could support this)

It seems like using a standard naming convention instead of directly referencing instructions could let the underlying SIMD instructions vary across platforms, but I don't know enough about the technologies to say whether NEON's capabilities match SSE closely enough that they could be handled the same way.
Jan 06 2012
parent reply a <a a.com> writes:
 Would this tie SIMD support directly to x86/x86_64, or would it
 possible to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
 ) ?
 (Obviously not for DMD, but if the syntax wasn't directly tied to
 x86/64, GDC and LDC could support this)
 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions
 vary across platforms, but I don't know enough about the technologies
 to say whether NEON's capabilities match SSE closely enough that they
 could be handled the same way.

For NEON you would need at least a function with a signature: __v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3); since many NEON instructions operate on three registers.
Jan 06 2012
parent a <a a.com> writes:
a Wrote:

 Would this tie SIMD support directly to x86/x86_64, or would it
 possible to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
 ) ?
 (Obviously not for DMD, but if the syntax wasn't directly tied to
 x86/64, GDC and LDC could support this)
 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions
 vary across platforms, but I don't know enough about the technologies
 to say whether NEON's capabilities match SSE closely enough that they
 could be handled the same way.

For NEON you would need at least a function with a signature: __v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3); since many NEON instructions operate on three registers.

Disregard that, I wasn't paying attention to the return type. What Walter proposed can already handle three operand NEON instructions.
Jan 06 2012
prev sibling next sibling parent reply a <a a.com> writes:
Walter Bright Wrote:

 which provides two functions:
 
     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

You would also need functions that take an immediate too to support instructions such as shufps.
 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 
 packed doubles. One problem with making it typed is it'll add 10 more types to 
 the base compiler, instead of one. Maybe we should just bite the bullet and do 
 the types:
 
      __vdouble2
      __vfloat4
      __vlong2
      __vulong2
      __vint4
      __vuint4
      __vshort8
      __vushort8
      __vbyte16
      __vubyte16

I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code and the vector registers are typeless, so why shouldn't vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 12:16, a <a a.com> wrote:

 Walter Bright Wrote:

 which provides two functions:

     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

You would also need functions that take an immediate too to support instructions such as shufps.
 One caveat is it is typeless; a __v128 could be used as 4 packed ints or

2
 packed doubles. One problem with making it typed is it'll add 10 more

types to
 the base compiler, instead of one. Maybe we should just bite the bullet

and do
 the types:

      __vdouble2
      __vfloat4
      __vlong2
      __vulong2
      __vint4
      __vuint4
      __vshort8
      __vushort8
      __vbyte16
      __vubyte16

I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code and the vector registers are typeless, so why shouldn't vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).

Hooray! I think we're on exactly the same page. That's refreshing :)

I think this __simdop( op, v1, v2, etc ) api is a bit of a bad idea... there are too many permutations of arguments. I know some PPC functions that receive FIVE arguments (2-3 regs, and 2-3 literals)..

Why not just expose the opcodes as intrinsic functions directly, for instance (maybe in std.simd.sse)?

  __v128 __sse_mul_ss( __v128 v1, __v128 v2 );
  __v128 __sse_mul_ps( __v128 v1, __v128 v2 );
  __v128 __sse_madd_epi16( __v128 v1, __v128 v2, __v128 v3 ); // <- some have more args
  __v128 __sse_shuffle_ps( __v128 v1, __v128 v2, immutable int i ); // <- some need literal ints
  etc...

This works best for other architectures too I think, they expose their own set of intrinsics, and some have rather different parameter layouts. VMX for instance (perhaps in std.simd.vmx?):

  __v128 __vmx_vmsum4fp( __v128 v1, __v128 v2, __v128 v3 );
  __v128 __vmx_vpermwi( __v128 v1, immutable int i ); // <-- needs a literal
  __v128 __vmx_vrlimi( __v128 v1, __v128 v2, immutable int mask, immutable int rot ); // <-- you really don't want to add your enum style function for all these prototypes?
  etc...

I have seen at least these argument lists:

  ( v1 )
  ( v1, v2 )
  ( v1, v2, v3 )
  ( v1, immutable int )
  ( v1, v2, immutable int )
  ( v1, v2, immutable int, immutable int )
Jan 06 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
Hi,

just bringing into the discussion how Mono does it.

http://tirania.org/blog/archive/2008/Nov-03.html

Also have a look at pages 44-53 from the presentation slides.

--
Paulo


"Walter Bright"  wrote in message news:je6c7j$2ct0$1 digitalmars.com...

On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to 
 pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 10:43, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:

 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16

Sounds good to me. Though I think __v128 should definitely be typeless, allowing all those other types to be implemented in libraries. Why wouldn't you leave that volume of work to libraries? All those types and related complications shouldn't be code in the language.

There's a reason Microsoft chose to only expose __m128 as an intrinsic. The rest you build yourself.

Also, the LIBRARIES for typed vectors can(/will) attempt to support multiple architectures using version()s behind the scenes.
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 11:04, Andrew Wiley <wiley.andrew.j gmail.com> wrote:

 On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.

 2. Even trying to do something with + is fraught with peril, as integer adds
 with SIMD can be saturated or unsaturated.

 3. Trying to build all the details about how each of the various adds and
 other ops work into the compiler/optimizer is a large undertaking. D would
 have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly
 extensible for new instructions, and for which a library can be layered over
 for a more presentable interface. A half-formed idea of mine is, taking a
 cue from yours:

 Declare one new basic type:

    __v128

 which represents the 16 byte aligned 128 bit vector type. The only
 operations defined to work on it would be construction and assignment. The
 __ prefix signals that it is non-portable.

 Then, have:

    import core.simd;

 which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86.
 (Other architectures can provide an implementation of it that simulates its
 operation, but I doubt that it would be worth anyone's while to use that.)

 The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

 For:

    z = simdop(PFADD, x, y);

 the compiler would generate:

    MOV z,x
    PFADD z,y

 Would this tie SIMD support directly to x86/x86_64, or would it be possible
 to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html )?
 (Obviously not for DMD, but if the syntax wasn't directly tied to x86/64,
 GDC and LDC could support this)

 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions vary
 across platforms, but I don't know enough about the technologies to say
 whether NEON's capabilities match SSE closely enough that they could be
 handled the same way.

The underlying architectures are too different to try and map opcodes across architectures. __v128 should map to each architecture's native SIMD type, allowing the compiler to express the hardware, but the opcodes would come from the architecture-specific opcodes available in each compiler.

As I keep suggesting, LIBRARIES would be created to supply the types like float4, int4, etc, which may also use version() liberally behind the scenes to support all architectures, allowing a common and efficient API for all architectures at this level.
Jan 06 2012
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 
 packed doubles. One problem with making it typed is it'll add 10 more types to 
 the base compiler, instead of one. Maybe we should just bite the bullet and do 
 the types:

What are the disadvantages of making it typeless? If it is typeless how do you tell it to perform a 4 float sum instead of a 2 double sum?

Is this low level layer able to support AVX and AVX2 3-way comparison instructions too, and the fused multiplication-add instruction?

---------------

For Manu: LDC compiler has this too:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 14:54, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or

2
 packed doubles. One problem with making it typed is it'll add 10 more

types to
 the base compiler, instead of one. Maybe we should just bite the bullet

and do
 the types:

What are the disadvantages of making it typeless? If it is typeless how do you tell it to perform a 4 float sum instead of a 2 double sum? Is this low level layer able to support AVX and AVX2 3-way comparison instructions too, and the fused multiplication-add instruction?

I don't believe there are any. I can see only advantages to implementing the typed versions in libraries.

To make it perform float4 math, or double2 math, you either write the pseudo assembly you want directly, or, more realistically, you use the __float4 type supplied in the standard library, which will already associate all the float4 related functionality, and try and map it across various architectures as efficiently as possible.

AVX needs a __v256 type in addition to the __v128 type already discussed. This should be trivial to add in addition to __v128. Again, the libraries take care of presenting a nice API to the users.

The comparisons and m-sum you mention are just opcodes like any other that may be used on the raw type, and will be wrapped up nicely in the strongly typed libraries.
Jan 06 2012
parent reply bearophile <bearophileHUGS lycos.com> writes:
Manu:

 To make it perform float4 math, or double2 match, you either write the
 pseudo assembly you want directly, but more realistically, you use the
 __float4 type supplied in the standard library, which will already
 associate all the float4 related functionality, and try and map it across
 various architectures as efficiently as possible.

I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops? __float4[10] a, b, c; c[] = a[] + b[]; Bye, bearophile
Jan 06 2012
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
 I see. While you design, you need to think about the other features of D :-)
 Is it possible to mix CPU SIMD with D vector ops?
 
 __float4[10] a, b, c;
 c[] = a[] + b[];

And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than true SIMD operations written manually) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So a good D compiler should compile them efficiently too. Bye, bearophile
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 16:12, bearophile <bearophileHUGS lycos.com> wrote:

 I see. While you design, you need to think about the other features of D
 :-) Is it possible to mix CPU SIMD with D vector ops?

  __float4[10] a, b, c;
  c[] = a[] + b[];

And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than true SIMD operations written manually) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So a good D compiler should compile them efficiently too.

I'm not clear what you mean. Are you talking about D vectors of hardware vectors, as in your example above? (no problem, see my last post) Or are you talking about programmers who will prefer to use float[4] instead of __float4? (this is what I think you're getting at?)...

Users who prefer to use float[4] are welcome to do so, but I think you are mistaken when you assume this will be 'simpler to use'. The rules for what they can/can't do efficiently with a float[4] are extremely restrictive, and it's also very unclear if/when they are violating said rules. It will almost always be faster to let the float unit do all the work in this case... Perhaps the compiler COULD apply some SIMD optimisations in very specific cases, but this would require some serious sophistication from the compiler to detect.

Some likely problems:
* float[4] is not aligned; performing unaligned loads/stores will require a long sequence of carefully pipelined vector code just to break even on that cost. If the sequence of ops is short, it will be faster to keep it in the FPU.
* float[4] allows component-wise access. This produces transfers of data between the FPU and the SIMD unit, which may again negate the advantage of using SIMD opcodes over the FPU directly.
* Loading a vectorised float[4] with floats calculated/stored on the FPU produces the same hazards as above. SIMD regs should not be loaded with data taken from the FPU if possible.
* How do you express logic and comparisons? Chances are people will write arbitrary component-wise comparisons. This requires flushing the values out from the SIMD regs back to the FPU for comparisons, again negating any advantage of SIMD calculation.

The hazard I refer to almost universally is that of swapping data between register types. This is a slow process, and it breaks any possibility for efficient pipelining.
The FPU pipelines nicely:

  float[4] x;
  x[] += 1.0;

This will result in 4 sequential adds to different registers; there are no data dependencies, so this will pipeline beautifully, one cycle after another. It is probably only 3 cycles longer than a SIMD add, plus a small cost for the extra opcodes in the instruction stream.

Any time you need to swap register type, the pipeline is broken. Imagine something seemingly harmless, and totally logical, like this:

  float[4] hardwareVec; // compiler allows use of a hardware vector for float[4]
  hardwareVec[1] = groundHeight; // we want to set Y explicitly, seems reasonable, perhaps we're snapping a position to a ground plane or something...

This may be achieved in some way that looks something like this:
* groundHeight must be stored to the stack
* flush pipeline (wait for the data to arrive) (potentially a long time)
* UNALIGNED load from stack into a vector register (this may require an additional operation to rotate the vector into the proper position after loading on some architectures)
* flush pipeline (wait for data to arrive)
* the loaded float needs to be merged with the existing vector; this can be done in a variety of ways:
  - use a permute operation [only some architectures support arbitrary permutes; VMX is best] (one opcode, but requires pre-loading a separate permute control register to describe the appropriate merge; this load may be expensive, and the data must be available)
  - use a series of shifts (requires 2 shifts for X or W, 3 shifts for Y or Z); doesn't require any additional loads from memory, but each of the shifts is a dependent operation, and the pipeline must be flushed between them
  - use a mask and OR the 2 vectors together (applying masks to both the source and target vectors can be pipelined in parallel, and only the final OR requires flushing the pipeline...)
  - [note: none of these options is ideal, and each may be preferable based on context in different situations]
* done

Congratulations, you've now set the Y component. At the cost of an LHS through memory, potentially other loads from memory, and 5-10 flushes of the pipeline, summing hundreds, maybe thousands, of wasted CPU cycles. In this same amount of wasted time, you could have done a LOT of work with the FPU directly.

Process of the same operation using just the FPU:
* FPU stores groundHeight (already in an FPU reg) to &hardwareVec[1]
* done

And if the value is an intermediate and never needs to be stored on the stack, there's a chance the operation will be eliminated entirely, since the value is already in a float reg, ready for use in the next operation :)

I think the take-away I'm trying to illustrate here is: SIMD work and scalar work do NOT mix... any syntax that allows it is a mistake. Users won't understand all the details and implications of the seemingly trivial operations they perform, and shouldn't need to.

Auto-vectorisation of float[4] would be some amazingly sophisticated code, and very temperamental. If the compiler detects it can make some optimisation, great, but it will not be reliable from a user point of view, and it won't be clear what to change to make the compiler do a better job. It also still implies policy problems, ie, should float[4] be special-cased to be aligned(16) when no other array requires this? What about all the different types? How do you cast between them? What are the expected results?

I think it's best to forget about float[4] as a candidate for reliable auto-vectorisation. Perhaps there's an opportunity for some nice little compiler bonuses, but it should not be the language's window into efficient use of the hardware. Anyone using float[4] should accept that they are working with the FPU, and they probably won't suffer much for it.
If they want/need aggressive SIMD optimisation, then they need to use the appropriate API, and understand, at least a little bit, how the hardware works... Ideally the well-defined SIMD API will make it easiest to do the right thing, and they won't need to know all these hardware details to make good use of it.
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 16:06, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

 To make it perform float4 math, or double2 math, you either write the
 pseudo assembly you want directly, but more realistically, you use the
 __float4 type supplied in the standard library, which will already
 associate all the float4 related functionality, and try and map it across
 various architectures as efficiently as possible.

I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops?

  __float4[10] a, b, c;
  c[] = a[] + b[];

I don't see any issue with this. An array of vectors makes perfect sense, and I see no reason why arrays/slices/etc of hardware vectors should be any sort of problem. This particular expression should be just as efficient as if it were an array of flat floats, especially if the compiler unrolls it.

D's array/slice syntax is something I'm very excited about, actually, in conjunction with hardware vectors. I could do some really elegant geometry processing with slices from vertex streams.
Jan 06 2012
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]
 I don't see any issue with this. An array of vectors makes perfect sense,
 and I see no reason why arrays/slices/etc of hardware vectors should be any
 sort of problem.
 This particular expression should be just as efficient as if it were an
 array of flat floats, especially if the compiler unrolls it.

 D's array/slice syntax is something I'm very excited about actually in
 conjunction with hardware vectors. I could do some really elegant geometry
 processing with slices from vertex streams.

Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick". As I understand it, the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture. This immediately raises these questions in my mind:

1. Should x86 specific things be reified in the D language? Given that ARM and other architectures are increasingly more important than x86, D should not tie itself to x86.

2. Is there a way of doing something in D so that GPGPU can be described? Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA addicts) or OpenCL (for Apple addicts and others). It would be good if D could just take over this market by being able to manage GPU kernels easily. The risk is that PyCUDA and PyOpenCL beat D to market leadership.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jan 06 2012
parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
From what I see in HPC conference papers and webcasts, I think it might be
already too late for D in those scenarios.

"Russel Winder"  wrote in message 
news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]

Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
addicts) or OpenCL (for Apple addicts and others).  It would be good if
D could just take over this market by being able to manage GPU kernels
easily.  The risk is that PyCUDA and PyOpenCL beat D to market
leadership.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: 
sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder 
Jan 06 2012
parent reply Russel Winder <russel russel.org.uk> writes:
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
 From what I see in HPC conference papers and webcasts, I think it might be
 already too late for D
 in those scenarios.

Indeed, for core HPC that is true: if you aren't using Fortran, C, C++, and Python you are not in the game. The point is that HPC is really about using computers that cost a significant proportion of the USA national debt.

My thinking is that with Intel especially looking to use the Moore's Law transistor-count mountain to put heterogeneous many-core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip, the programming languages used by the majority of programmers, not just those playing with multi-billion dollar kit, will have to be able to deal with heterogeneous models of computation. The current model of separate compilation and loading of CPU code and GPGPU kernel is a hack to get things working in a world where tool chains are still about building 1970s single-threaded code.

This represents an opportunity for non-C and C++ languages. Python is beginning to take a stab at trying to deal with all this. D would be another good candidate. Java cannot be in this game without some serious updating of the JVM semantics -- an issue we debated a bit on this list a short time ago, so no need to rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having it provide a better development experience for these heterogeneous systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL is the industry standard now). I will though be buying one for myself in the next couple of weeks.
 "Russel Winder"  wrote in message=20
 news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]
 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

--
Russel.
Jan 06 2012
next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
Please don't start a flame war on this, I am just expressing an opinion.

I think that for heterogeneous computing we are better off with a language that supports functional programming concepts.

From what I have seen in papers, many imperative languages have the issue that they are too tied to the old homogeneous computing model we had on the desktop. That is the main reason why C and C++ start to look like Frankenstein languages, with all the extensions companies are adding to them to support the new models.

Functional languages have the advantage that their hardware model is more abstract, and as such can be mapped more easily to heterogeneous hardware. This is also an area where VM-based languages might have some kind of advantage, but I am not sure.

Now, D actually has quite a few tools to explore functional concepts, so I guess it could take off in this area if enough HPC people got some interest in it.

Regarding CUDA, you will surely know this better than me. I read somewhere that in most research institutes people only care about CUDA, not OpenCL, because of it being older than OpenCL, the C++ support, the available tools, and NVIDIA cards' performance when compared with ATI in this area. But I don't have any experience here, so I don't know how much of this is true.

--
Paulo


"Russel Winder"  wrote in message 
news:mailman.109.1325864213.16222.digitalmars-d puremagic.com...
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
 From what I see in HPC conferences papers and webcasts, I think it might 
 be
 already too late for D
 in those scenarios.

Indeed, for core HPC that is true: if you aren't using Fortran, C, C++, and Python you are not in the game. The point is that HPC is really about using computers that cost a significant proportion of the USA national debt.

My thinking is that with Intel especially looking to use the Moore's Law transistor-count mountain to put heterogeneous many-core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip, the programming languages used by the majority of programmers, not just those playing with multi-billion dollar kit, will have to be able to deal with heterogeneous models of computation. The current model of separate compilation and loading of CPU code and GPGPU kernel is a hack to get things working in a world where tool chains are still about building 1970s single-threaded code.

This represents an opportunity for non-C and C++ languages. Python is beginning to take a stab at trying to deal with all this. D would be another good candidate. Java cannot be in this game without some serious updating of the JVM semantics -- an issue we debated a bit on this list a short time ago, so no need to rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having it provide a better development experience for these heterogeneous systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL is the industry standard now). I will though be buying one for myself in the next couple of weeks.
 "Russel Winder"  wrote in message
 news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]

 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

--
Russel.
Jan 06 2012
parent "Froglegs" <lugtug gmail.com> writes:
  That CUDA is used more is probably true; OpenCL is fugly C and 
no fun.

Microsoft's upcoming C++ AMP looks interesting as it lets you 
write GPU and CPU code in C++.  The spec is open so hopefully it 
becomes common to implement it in other C++ compilers.

SSE intrinsics in C++ are pretty essential for getting great 
performance, so I do think D needs something like this.  A 
problem with intrinsics in C++ has been poor support from 
compilers, often performing little or no optimization and just 
blindly issuing instructions as you listed them, causing all 
kinds of extra loads and stores.

  Visual Studio is actually one of the worst C++ compilers for 
intrinsics; ICC is likely the best.

So even if D does add these new intrinsic functions, it would need 
to actually optimize around them to produce reasonably fast code.

  I agree that the v128 type should be typeless; it is typeless on 
the hardware, and this makes it easier to mix and match instructions.
Jan 06 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 7:36 AM, Russel Winder wrote:
 It just strikes me as an opportunity to get D front and centre by having
 it provide a better development experience for these heterogeneous
 systems that are coming.

At the moment, I have no idea what such support might look like :-(
Jan 06 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 17:01, Russel Winder <russel russel.org.uk> wrote:

 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]
 I don't see any issue with this. An array of vectors makes perfect sense,
 and I see no reason why arrays/slices/etc of hardware vectors should be

any
 sort of problem.
 This particular expression should be just as efficient as if it were an
 array of flat floats, especially if the compiler unrolls it.

 D's array/slice syntax is something I'm very excited about actually in
 conjunction with hardware vectors. I could do some really elegant

geometry
 processing with slices from vertex streams.

Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick". As I understand it currently the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture.

No, I'm talking specifically about NOT making the type x86/SSE specific. Hence all my ramblings about a 'generalised'/typeless v128 type which can be used to express 128bit SIMD hardware of any architecture.
 This immediately raises the
 questions in my mind:

 1.  Should x86 specific things be reified in the D language.  Given that
 ARM and other architectures are increasingly more important than x86, D
 should not tie itself to x86.

The opcode intrinsics you use to interact with the generalised type will be architecture-specific, but this isn't the end point of my proposal. The next step is to produce libraries which will use version() heavily behind the API to collate different architectures into nice, user-friendly vector types. Sadly, vector units across architectures are too different to expose useful vector types cleanly in the language directly, so libraries will do this, making use of compiler-defined architecture intrinsics behind lots of version() statements.
 2.  Is there a way of doing something in D so that GPGPU can be
 described?

I think this will map neatly to GPGPU. The vector types proposed will apply to that hardware just fine. This is a much bigger question though; the real problems are:
* How do you compile/produce code that will run on the GPU? (do we have a D->Cg compiler?)
* How do you express the threading/concurrency aspects of GPGPU usage? (this is way outside the scope of vector arithmetic)
* How do you express the data sources available to GPUs? Constant files, etc... (it seems D actually has reasonably good language expressions for this sort of thing)

 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

As said, I think these questions are way outside the scope of SIMD vector libraries ;) Although this is a fundamental piece of the puzzle, since GPGPU is no use without SIMD type expression... but I think everything we've discussed here so far will map perfectly to GPGPU.
Jan 06 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/6/2012 9:44 AM, Manu wrote:
 On 6 January 2012 17:01, Russel Winder <russel russel.org.uk
 <mailto:russel russel.org.uk>> wrote:
 As said, I think these questions are way outside the scope of SIMD
 vector libraries ;)
 Although this is a fundamental piece of the puzzle, since GPGPU is no
 use without SIMD type expression... but I think everything we've
 discussed here so far will map perfectly to GPGPU.

I don't think you are in any danger, as the GPGPU instructions are more flexible than the CPU SIMD counterparts. GPU hardware natively works with float2 and float3 extremely well. GPUs have VLIW instructions that can effectively add a huge number of instruction modifiers to their instructions (things like built-in saturates to the 0..1 range on variable argument _reads_, arbitrary swizzle on read and write, write masks that leave partial data untouched, etc, all in one clock). The CPU SIMD stuff is simplistic by comparison.

A good bang for the buck would be to have some basic set of operators (* / + - < > == != <= >= and especially ? (the ternary operator)), and versions of 'any' and 'all' from HLSL for dynamic branching, that can work at the very least for integer, float, and double types. Bit shifting is useful (esp. manipulating floats for transcendental functions, or working with half FP16 types, requires a lot of it), but should be restricted to integer types. Having dedicated signed and unsigned right shifts would be pretty nice too (since about 95% of my right shifts end up needing to be of the zero-extended variety, even though I had to cast to 'vector integers').
Jan 14 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me  
 to pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.
 2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

 Declare one new basic type:

     __v128

 which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

 Then, have:

     import core.simd;

 which provides two functions:

     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.) The operators would be an enum listing of the SIMD opcodes, PFACC, PFADD, PFCMPEQ, etc. For:

     z = simdop(PFADD, x, y);

 the compiler would generate:

     MOV z,x
     PFADD z,y

 The code generator knows enough about these instructions to do register assignments reasonably optimally.

 What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one.
 Maybe we should just bite the bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrappers. Still simdop would not be typesafe.

As much as this proposal presents a viable solution, why not spend the time to extend inline asm?

    void foo()
    {
        __v128 a = loadss(1.0f);
        __v128 b = loadss(1.0f);
        a = addss(a, b);
    }

    __v128 loadss(float v)
    {
        __v128 res; // allocates register
        asm { movss res, v[RBP]; }
        return res; // return in XMM1 but inlineable return assignment
    }

    __v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
    {
        __v128 res = a;
        // asm prolog, allocates registers for every __v128 used within the asm
        asm { addss res, b; }
        // asm epilog, possibly restore spilled registers
        return res;
    }

What would be needed?
 - Implement the asm allocation logic.
 - Functions containing asm statements should participate in inlining.
 - Determining the inline cost of asm statements.

When being used with typedefs for __vubyte16 et al., this would allow a really clean and simple library implementation of intrinsics.
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 13:56:58 +0100, Martin Nowak <dawg dawgfoto.de> wrote:

 [...]

 When being used with typedefs for __vubyte16 et al., this would allow a
 really clean and simple library implementation of intrinsics.

Also, addss is a pure function, which could be important for optimizing out certain calls. Maybe we should allow asm to be attributed with pure.
Jan 06 2012
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 14:56, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <
 newshound2 digitalmars.com> wrote:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or
 2 packed doubles. One problem with making it typed is it'll add 10 more
 types to the base compiler, instead of one. Maybe we should just bite the
 bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrapper. Still simdop would not be typesafe.

I think they should be well-defined structs with lots of type safety and sensible methods, not just a typedef of the typeless primitive.
 As much as this proposal presents a viable solution,
 why not spending the time to extend inline asm.

I think there are too many risky problems with the inline assembler (as raised in my discussion about supporting pseudo registers in inline asm blocks):

 * No way to allow the compiler to assign registers (pseudo registers).
 * Assembly blocks present problems for the optimiser; it's not reliable that it can optimise around an inline asm block. How bad will it be when trying to optimise around 100 small inlined functions, each containing its own inline asm block?
 * D's inline assembly syntax has to be carefully translated to GCC's inline asm format when using GCC, and this needs to be done PER-ARCHITECTURE, which Iain should not be expected to do for all the obscure architectures GCC supports.
 What would be needed?
  - Implement the asm allocation logic.
  - Functions containing asm statements should participate in inlining.
  - Determining inline cost of asm statements.

I raised these points in my other thread; these are all far more complicated problems, I think, than exposing opcode intrinsics would be. Opcode intrinsics are almost certainly the way to go.

 When being used with typedefs for __vubyte16 et.al. this would
 allow a really clean and simple library implementation of intrinsics.

The type safety you're imagining here might actually be annoying when working with the raw type and opcodes. Consider this common situation and the code that will be built around it:

    __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; // pack some other useful data in W

If vec were strongly typed, I would now need to start casting all over the place to use various float and uint opcodes on this value?

I think it's correct when using SIMD at the raw level to express the type as it is: typeless. SIMD regs are in fact typeless regs; they only gain a concept of type the moment you perform an opcode on them, and only for the duration of that opcode.

You will get your strong type safety when you make use of the float4 types which will be created in the libs.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 5:44 AM, Manu wrote:
 The type safety you're imagining here might actually be annoying when working
 with the raw type and opcodes..
 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); // pack
some
 other useful data in W
 If vec were strongly typed, I would now need to start casting all over the
place
 to use various float and uint opcodes on this value?
 I think it's correct when using SIMD at the raw level to express the type as it
 is, typeless... SIMD regs are infact typeless regs, they only gain concept of
 type the moment you perform an opcode on it, and only for the duration of that
 opcode.

 You will get your strong type safety when you make use of the float4 types
which
 will be created in the libs.

Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language; we paint the fiction of a type onto it, and that works very well.

To me, the advantages of making the SIMD types typed are:

1. The language does typechecking; for example, trying to add a vector of 4 floats to 16 bytes would be (and should be) an error.

2. Some of the SIMD operations do map nicely onto the operators, so one could write:

    a = b + c + -d;

and the correct SIMD opcodes will be generated based on the types. I think that would be one hell of a lot nicer than using function syntax. Of course, this will only be for those SIMD ops that do map; for the rest you're stuck with the functions.

3. A lot of the SIMD opcodes have 10 variants, one for each of the 10 types. The user would only need to remember the operation, not the variants, and let the usual overloading rules apply.

And, of course, casting would be allowed and would be zero cost.

I've been thinking about this a lot since last night, and I think that since the back end already supports XMM registers, most of the hard work is done, and doing it this way would fit in well. (At least for 64 bit code, where the alignment issue is solved, but that's an orthogonal issue.)
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 5:44 AM, Manu wrote:

 The type safety you're imagining here might actually be annoying when
 working
 with the raw type and opcodes..
 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //
 pack some

Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language, we paint the fiction of a type onto it, and that works very well.

Damn it, I thought we already reached agreement, why are you having second thoughts? :)

Your analogy to EAX is not really valid. EAX may hold some data that is not an int, but it is incapable of performing a float operation on that data. SIMD registers are capable of performing operations of any type at any time on any register; I think this is the key distinction that justifies them as inherently 'typeless' registers.
 To me, the advantage of making the SIMD types typed are:

 1. the language does typechecking, for example, trying to add a vector of
 4 floats to 16 bytes would be (and should be) an error.

The language WILL do that checking as soon as we create the strongly typed libraries. And people will use those libraries; they'll never touch the primitive type.

The primitive type however must not inhibit the programmer from being able to perform any operation that the hardware is technically capable of. The primitive type will be used behind the scenes for building said libraries... nobody will use it in front-end code. It's not really a useful type; it doesn't do anything. It just allows the ABI and register semantics to be expressed in the language.
 2. Some of the SIMD operations do map nicely onto the operators, so one
 could write:

   a = b + c + -d;

This is not even true, as you said yourself in a previous post. SIMD int ops may wrap, or saturate... which is it? Don't try and express this at the language level. Let the libraries do it, and if they fail, or are revealed to be poorly defined, they can be updated/changed.

 3. A lot of the SIMD opcodes have 10 variants, one for each of the 10
 types. The user would only need to remember the operation, not the
 variants, and let the usual overloading rules apply.

Correct, and they will be hidden behind the api of their strongly typed library counterparts. The user will never need to be aware of the opcodes, or their variants.
 And, of course, casting would be allowed and would be zero cost.

Zero cost? You're suggesting all casts would be reinterprets? Surely:

    float4 fVec = (float4)intVec;

should perform a type conversion?

Again, this is detail that can/should be discussed when implementing the standard library; leave this sort of problem out of the language.

Your earlier email detailing your simple API with an enum of opcodes sounded fine... whatever's easiest really. The hard part will be implementing the alignment, and the literal syntax.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:41 AM, Manu wrote:
 On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/6/2012 5:44 AM, Manu wrote:

         The type safety you're imagining here might actually be annoying when
         working
         with the raw type and opcodes..
         Consider this common situation and the code that will be built around
it:
         __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //
         pack some


     Consider an analogy with the EAX register. It's untyped. But we don't find
     it convenient to make it untyped in a high level language, we paint the
     fiction of a type onto it, and that works very well.


 Damn it, I though we already reached agreement, why are you having second
 thoughts? :)
 Your analogy to EAX is not really valid. EAX may hold some data that is not an
 int, but it is incapable of performing a float operation on that data.
 SIMD registers are capable of performing operations of any type at any time to
 any register, I think this is the key distinction that justifies them as
 inherently 'typeless' registers.

I strongly disagree with this. EAX can be (and is) at various times used as byte, ubyte, short, ushort, int, uint, pointer, and yes, even floats! Anything that fits in it, actually. It is typeless. The types used on it are a fiction perpetrated by the language, but a very very useful fiction.
     To me, the advantage of making the SIMD types typed are:

     1. the language does typechecking, for example, trying to add a vector of 4
     floats to 16 bytes would be (and should be) an error.


 The language WILL do that checking as soon as we create the strongly typed
 libraries. And people will use those libraries, they'll never touch the
 primitive type.

I'm not so sure this will work out satisfactorily.
 The primitive type however must not inhibit the programmer from being able to
 perform any operation that the hardware is technically capable of.
 The primitive type will be used behind the scenes for building said
libraries...
 nobody will use it in front-end code. It's not really a useful type, it doesn't
 do anything. It just allows the ABI and register semantics to be expressed in
 the language.

     2. Some of the SIMD operations do map nicely onto the operators, so one
     could write:

        a = b + c + -d;


 This is not even true, as you said yourself in a previous post.
 SIMD int ops may wrap, or saturate... which is it?

It would only be for those ops that actually do map onto the D operators. (This is already done by the library implementation of the array arithmetic operations.) The saturated int ops would not be usable this way.
 Don't try and express this at the language level. Let the libraries do it, and
 if they fail, or are revealed to be poorly defined, they can be
updated/changed.

Doing it as a library type pretty much prevents certain optimizations, for example, the fused operations, from being expressed using infix operators.
     3. A lot of the SIMD opcodes have 10 variants, one for each of the 10
types.
     The user would only need to remember the operation, not the variants, and
     let the usual overloading rules apply.


 Correct, and they will be hidden behind the api of their strongly typed library
 counterparts. The user will never need to be aware of the opcodes, or their
 variants.

     And, of course, casting would be allowed and would be zero cost.


 Zero cost? You're suggesting all casts would be reinterprets? Surely: float4
 fVec = (float4)intVec; should perform a type conversion?
 Again, this is detail that can/should be discussed when implementing the
 standard library, leave this sort of problem out of the language.

Painting a new type onto a value (i.e. a reinterpret cast) does have zero runtime cost. I don't think it's a real problem - we do it all the time when, for example, we want to retype an int as a uint:

    int i;
    uint u = cast(uint)i;
 Your earlier email detailing your simple API with an enum of opcodes sounded
 fine... whatever's easiest really. The hard part will be implementing the
 alignment, and the literal syntax.

Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 22:36, Walter Bright <newshound2 digitalmars.com> wrote:

    To me, the advantage of making the SIMD types typed are:
    1. the language does typechecking, for example, trying to add a vector
 of 4
    floats to 16 bytes would be (and should be) an error.


 The language WILL do that checking as soon as we create the strongly typed
 libraries. And people will use those libraries, they'll never touch the
 primitive type.

I'm not so sure this will work out satisfactorily.

How so, can you support this theory?
    2. Some of the SIMD operations do map nicely onto the operators, so one
    could write:

       a = b + c + -d;


 This is not even true, as you said yourself in a previous post.
 SIMD int ops may wrap, or saturate... which is it?

It would only be for those ops that actually do map onto the D operators. (This is already done by the library implementation of the array arithmetic operations.) The saturated int ops would not be usable this way.

But why are you against adding this stuff in the library? It's contrary to the general sentiment around here, where people like putting stuff in libraries where possible. It's less committing, and allows alternative implementations if desired.

 Don't try and express this at the language level. Let the libraries do it,
 and
 if they fail, or are revealed to be poorly defined, they can be
 updated/changed.

Doing it as a library type pretty much prevents certain optimizations, for example, the fused operations, from being expressed using infix operators.

You're talking about MADD? I was going to make a separate suggestion regarding that actually.

Multiply-add is a common concept, often available to FPUs as well (with no way to express it)... I was going to suggest an opMultiplyAdd() operator, which you could have the language call if it detects a conforming arrangement of * and + operators on a type. This would allow operator access to madd in library vectors too.

 And, of course, casting would be allowed and would be zero cost.
 Zero cost? You're suggesting all casts would be reinterprets? Surely:
 float4
 fVec = (float4)intVec; should perform a type conversion?
 Again, this is detail that can/should be discussed when implementing the
 standard library, leave this sort of problem out of the language.

Painting a new type (i.e. reinterpret casts) do have zero runtime cost to them. I don't think it's a real problem - we do it all the time when, for example, we want to retype an int as a uint: int i; uint u = cast(uint)i;

Yeah sure, but I don't think that's fundamentally correct. If you're drifting towards typing these things in the language, then you should also start considering cast mechanics... and that's a larger topic of debate. I don't really think "float4 floatVec = (float4)intVec;" should be a reinterpret... surely, as a high level type, this should perform a type conversion?

I'm afraid this has become a lot more complicated than it needs to be. Can you illustrate your current thoughts/plan, to have it summarised in one place? Has it drifted from what you said last night?
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:32 PM, Manu wrote:
 On 6 January 2012 22:36, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

             To me, the advantage of making the SIMD types typed are:

             1. the language does typechecking, for example, trying to add a
         vector of 4
             floats to 16 bytes would be (and should be) an error.


         The language WILL do that checking as soon as we create the strongly
typed
         libraries. And people will use those libraries, they'll never touch the
         primitive type.


     I'm not so sure this will work out satisfactorily.


 How so, can you support this theory?

For one thing, the compiler has a very hard time optimizing library implemented types. It's why int, float, etc., are native types. We've come a long way with library types, but there are limits.
             2. Some of the SIMD operations do map nicely onto the operators,
so one
             could write:

                a = b + c + -d;


         This is not even true, as you said yourself in a previous post.
         SIMD int ops may wrap, or saturate... which is it?


     It would only be for those ops that actually do map onto the D operators.
     (This is already done by the library implementation of the array arithmetic
     operations.) The saturated int ops would not be usable this way.


 But why are you against adding this stuff in the library? It's contrary to the
 general sentiment around here where people like putting stuff in libraries
where
 possible. It's less committing, and allows alternative implementations if
desired.

         Don't try and express this at the language level. Let the libraries do
         it, and
         if they fail, or are revealed to be poorly defined, they can be
         updated/changed.


     Doing it as a library type pretty much prevents certain optimizations, for
     example, the fused operations, from being expressed using infix operators.


 You're talking about MADD? I was going to make a separate suggestion regarding
 that actually.
 Multiply-add is a common concept, often available to FPU's aswell (and no way
to
 express it)... I was going to suggest an opMultiplyAdd() operator, which you
 could have the language call if it detects a conforming arrangement of * and +
 operators on a type. This would allow operator access to madd in library
vectors
 too.

Detecting a "conforming arrangement" is how native types work! Once you wed the compiler logic to a particular library implementation, it combines the worst aspects of a native type with the worst aspects of a library type.
 Yeah sure, but I don't think that's fundamentally correct, if you're drifting
 towards typing these things in the language, then you should also start
 considering cast mechanics... and that's a larger topic of debate.
 I don't really think "float4 floatVec = (float4)intVec;" should be a
 reinterpret... surely, as a high level type, this should perform a type
conversion?

That's a good point.
 I'm afraid this is become a lot more complicated than it needs to be.
 Can you illustrate your current thoughts/plan, to have it summarised in one
 place.

Support the 10 vector types as basic types, support them with the arithmetic infix operators, and use intrinsics for the rest of the operations. I believe this scheme:

1. will look better in code, and will be easier to use
2. will allow for better error detection and more comprehensible error messages when things are misused
3. will generate better code
4. shouldn't be hard to implement, as I already did most of the work when I did the SIMD support for float and double.
 Has it drifted from what you said last night?

Yes.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 01:52, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:32 PM, Manu wrote:

 Yeah sure, but I don't think that's fundamentally correct, if you're
 drifting

towards typing these things in the language, then you should also start
 considering cast mechanics... and that's a larger topic of debate.
 I don't really think "float4 floatVec = (float4)intVec;" should be a
 reinterpret... surely, as a high level type, this should perform a type
 conversion?

That's a good point.

.. oh god, what have I done. :/

 I'm afraid this is become a lot more complicated than it needs to be.
 Can you illustrate your current thoughts/plan, to have it summarised in
 one
 place.

 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

 Has it drifted from what you said last night?

Yes.

Okay, I'm very worried at this point. Please don't just do this... There are so many details and gotchas in what you suggest. I couldn't feel comfortable short of reading a thorough proposal. Come on IRC? This requires involved conversation.

I'm sure you realise how much more work this is... Why would you commit to this right off the bat? Why not produce the simple primitive type, and allow me the opportunity to try it with the libraries before polluting the language itself with a massive volume of stuff?

I'm genuinely concerned that once you add this to the language, it's done, and it'll be stuck there like lots of other debatable features... we can tweak the library implementation as we gain experience with usage of the feature.

MS also agree that the primitive __m128 is the right approach. I'm not basing my opinion on their judgement at all, I independently conclude it is the right approach, but it's encouraging that they agree... and perhaps they're a more respectable authority than me and my opinion :)

What I proposed in the OP is the simplest, most non-destructive initial implementation in the language. I think there is the least opportunity for making a mistake/wrong decision in my initial proposal, and it can be extended with what you're suggesting in time, after we have the opportunity to prove that it's correct. We can test and prove the rest with libraries before committing to implement it in the language...
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 4:12 PM, Manu wrote:
 Come on IRC? This requires involved conversation.

I'm on skype.
 I'm sure you realise how much more work this is...

Actually, not that much. Surprising, no? <g> I think I already did the hard stuff by supporting SIMD for float/double.
 Why would you commit to this right off the bat? Why not produce the simple
 primitive type, and allow me the opportunity to try it with the libraries
before
 polluting the language its self with a massive volume of stuff...
 I'm genuinely concerned that once you add this to the language, it's done, and
 it'll be stuck there like lots of other debatable features... we can tweak the
 library implementation as we gain experience with usage of the feature.

If it doesn't work, we can back it out. I'm willing to add it as an experimental feature because I don't see it breaking any existing code.
 MS also agree that the primitive __m128 is the right approach. I'm not basing
my
 opinion on their judgement at all, I independently conclude it is the right
 approach, but it's encouraging that they agree... and perhaps they're a more
 respectable authority than me and my opinion :)

Can you show me a typical example of how it looks in action in source code?
 What I proposed in the OP is the simplest, most non-destructive initial
 implementation in the language. I think there is the lest opportunity for
making
 a mistake/wrong decision in my initial proposal, and it can be extended with
 what you're suggesting in time after we have the opportunity to prove that it's
 correct. We can test and prove the rest with libraries before committing to
 implement it in the language...

I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:52, Walter Bright <newshound2 digitalmars.com> wrote:

 MS also agree that the primitive __m128 is the right approach. I'm not
 basing my

opinion on their judgement at all, I independently conclude it is the right
 approach, but it's encouraging that they agree... and perhaps they're a
 more
 respectable authority than me and my opinion :)

Can you show me a typical example of how it looks in action in source code?

Not without breaking NDAs... but maybe I will anyway, I'll dig some stuff up...
 What I proposed in the OP is the simplest, most non-destructive initial
 implementation in the language. I think there is the lest opportunity for
 making
 a mistake/wrong decision in my initial proposal, and it can be extended
 with
 what you're suggesting in time after we have the opportunity to prove
 that it's
 correct. We can test and prove the rest with libraries before committing
 to
 implement it in the language...

I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.

Symbolic debugger support, eh... now that is a compelling argument! :)

Okay, I'm prepared to reconsider... but I'm still apprehensive.

I'm manuevans on skype, on there now if you want to add me...
Jan 06 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/6/12 5:52 PM, Walter Bright wrote:
 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

I think it would be great to try avoiding the barbarism of adding 10 built-in types and a bunch of built-ins. Historically, D has erred heavily on the side of building things into the compiler. Consider the complex numbers affair, in tow with crackpot science arguments on why they're a must. It's great that that embarrassment is behind us.

Also consider how the hard-coding of associative arrays in an awkward interface inside the runtime has stifled efficient implementations, progress, and innovation in that area. Still a lot of work is needed there, too, to essentially undo a bad decision.

Let's not repeat history. Months later we'll look at the hecatomb of types and builtins we dumped into the language and we'll be like, what were we /thinking/?

Adding built-in types and functions is giving up on good design, judgment, and using what we have creatively. It may mean failure to understand and use the expressive power of the language, and worse, compensating by adding even more poorly designed artifacts to it.

I would very strongly suggest we reconsider the tactics of it all. Yes, it's great to have SIMD support in the language. No, I don't think adding a wheelbarrow to the language is the right way.

Thanks,

Andrei
Jan 06 2012
parent reply Don <nospam nospam.com> writes:
On 07.01.2012 04:18, Andrei Alexandrescu wrote:
 On 1/6/12 5:52 PM, Walter Bright wrote:
 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

I think it would be great to try avoiding the barbarism of adding 10 built-in types and a bunch of built-ins. Historically, D has erred heavily on the side of building in the compiler. Consider the the complex numbers affair, in tow with crackpot science arguments on why they're a must. It's great that embarrassment is behind us.

 Also consider how the hard-coding of associative arrays in
 an awkward interface inside the runtime has stifled efficient
 implementations, progress, and innovation in that area. Still a lot of
 work needed there, too, to essentially undo a bad decision.

Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction!

Moving AAs from a built-in to a library type has been an unmitigated disaster from the implementation side. And it has so far brought us *nothing* in return. Not "hardly anything", but *NOTHING*. I don't even have any idea of what good could possibly come from it. Note that you CANNOT have multiple implementations on a given platform, or you'll get linker errors! So I think there is more pain to come from it.

It seems to have been motivated by religious reasons and nothing more. Why should anyone believe the same argument again?
Jan 07 2012
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 7 January 2012 at 16:10:32 UTC, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest 
 possible terms. I would have mentioned AAs as a very strong 
 argument in the opposite direction!

Amen. AAs are *still* broken from this change. If you take a look at my cgi.d, you'll find this:

    // Referencing this gigantic typeid seems to remind the compiler
    // to actually put the symbol in the object file. I guess the immutable
    // assoc array array isn't actually included in druntime
    void hackAroundLinkerError() {
        writeln(typeid(const(immutable(char)[][])[immutable(char)[]]));
        writeln(typeid(immutable(char)[][][immutable(char)[]]));
        writeln(typeid(Cgi.UploadedFile[immutable(char)[]]));
        writeln(typeid(immutable(Cgi.UploadedFile)[immutable(char)[]]));
        writeln(typeid(immutable(char[])[immutable(char)[]]));
        // this is getting kinda ridiculous btw. Moving assoc arrays
        // to the library is the pain that keeps on coming.

        // eh this broke the build on the work server
        // writeln(typeid(immutable(char)[][immutable(string[])]));
        writeln(typeid(immutable(string[])[immutable(char)[]]));
    }

It was never a problem before... but if I take that otherwise useless function out, it still randomly breaks my builds to this day.
Jan 07 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 10:19 AM, Adam D. Ruppe wrote:
 On Saturday, 7 January 2012 at 16:10:32 UTC, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest possible
 terms. I would have mentioned AAs as a very strong argument in the
 opposite direction!

Amen. AAs are *still* broken from this change.

Well they are broken because the change has not been carried to completion.

I think that baking AAs into the compiler is poor programming language design. There are absolutely no ifs and buts about it, and the matter is obvious enough to me and sufficiently internalized to cause me difficulty arguing it.

Andrei
Jan 07 2012
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 7 January 2012 at 16:55:00 UTC, Andrei Alexandrescu 
wrote:
 Well they are broken because the change has not been carried to 
 completion.

Here's my position: if we get a library implementation that works better than the compiler implementation, let's do it. If they are equal in use, or the library is only worse in minor syntax or some other trivial matter, let's do the library, since that's nicer indeed.

But if the library one doesn't work as well as the compiler implementation, whether due to design, bugs, or any other practical consideration, let's not break things until the library impl catches up. If that's going to take several years, we have to consider the benefit of having it now rather than later too.
Jan 07 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 10:10 AM, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest possible
 terms. I would have mentioned AAs as a very strong argument in the
 opposite direction!

 Moving AAs from a built-in to a library type has been an unmitigated
 disaster from the implementation side. And it has so far brought us
 *nothing* in return. Not "hardly anything", but *NOTHING*.

It would be premature to conclude from this, as the conversion is incomplete.
 I don't even
 have any idea of what good could possibly come from it.

Using static calls for hashing and comparisons instead of indirect calls comes to mind.

Andrei
Jan 07 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/07/12 17:10, Don wrote:
 On 07.01.2012 04:18, Andrei Alexandrescu wrote:
 Also consider how the hard-coding of associative arrays in
 an awkward interface inside the runtime has stifled efficient
 implementations, progress, and innovation in that area. Still a lot of
 work needed there, too, to essentially undo a bad decision.

Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction! Moving AAs from a built-in to a library type has been an unmitigated disaster from the implementation side. And it has so far brought us *nothing* in return. Not "hardly anything", but *NOTHING*. I don't even have any idea of what good could possibly come from it. Note that you CANNOT have multiple implementations on a given platform, or you'll get linker errors! So I think there is more pain to come from it. It seems to have been motivated by religious reasons and nothing more. Why should anyone believe the same argument again?

Reminded me of this: "static immutable string[string] aa = [ "a": "b" ];" isn't currently possible (AA literals are non-const expressions); could this work w/o compiler support?.. artur
Jan 07 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 8:10 AM, Don wrote:
 Moving AAs from a built-in to a library type has been an unmitigated disaster
 from the implementation side. And it has so far brought us *nothing* in return.
 Not "hardly anything", but *NOTHING*. I don't even have any idea of what good
 could possibly come from it. Note that you CANNOT have multiple implementations
 on a given platform, or you'll get linker errors! So I think there is more pain
 to come from it.
 It seems to have been motivated by religious reasons and nothing more.
 Why should anyone believe the same argument again?

Having a pluggable interface so the implementation can be changed is all right, as long as the binary API does not change. If the binary API changes, then of course, two different libraries cannot be linked together. I strongly oppose any changes which would lead to a balkanization of D libraries. (Consider the disaster C++ has had forever with everyone inventing their own string type. That ensured zero interoperability between C++ libraries, a situation that persists even for 10 years after C++ finally acquired a standard string library.)
Jan 07 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 12:48 PM, Walter Bright wrote:
 On 1/7/2012 8:10 AM, Don wrote:
 Moving AAs from a built-in to a library type has been an unmitigated
 disaster
 from the implementation side. And it has so far brought us *nothing*
 in return.
 Not "hardly anything", but *NOTHING*. I don't even have any idea of
 what good
 could possibly come from it. Note that you CANNOT have multiple
 implementations
 on a given platform, or you'll get linker errors! So I think there is
 more pain
 to come from it.
 It seems to have been motivated by religious reasons and nothing more.
 Why should anyone believe the same argument again?

Having a pluggable interface so the implementation can be changed is all right, as long as the binary API does not change. If the binary API changes, then of course, two different libraries cannot be linked together. I strongly oppose any changes which would lead to a balkanization of D libraries.

In my opinion this statement is thoroughly wrong and backwards. I also think it reflects a misunderstanding of what my stance is. Allow me to clarify how I see the situation.

Currently built-in hash table use generates special-cased calls to non-template functions implemented surreptitiously in druntime. The underlying theory, also sustained by the statement quoted above, is that we are interested in supporting linking together object files and libraries BUILT WITH DISTINCT MAJOR RELEASES OF DRUNTIME. There is zero interest for that. ZERO. No language even attempts to do so. Runtimes that are not compatible with their previous versions are common, frequent, and well understood as an issue.

In an ideal world, built-in hash tables should work in a very simple manner. The compiler lowers all special hashtable syntax - in a manner that's MINIMAL, SIMPLE, and CLEAR - into D code that resolves to use of object.di (not some random user-defined library!). From then on, druntime code takes over. It could choose to use templates, dynamic type info, whatever. It's NOT the concern of the compiler. The compiler has NO BUSINESS taking library code and hardwiring it in for no good reason.

This setup allows static and dynamic linking of libraries, as long as the runtimes they were built with are compatible. This is expected, by design, and a good thing.
 (Consider the disaster C++ has had forever with everyone inventing their
 own string type. That ensured zero interoperability between C++
 libraries, a situation that persists even for 10 years after C++ finally
 acquired a standard string library.)

It is exactly this kind of canned statement and prejudice that we must avoid. It unfairly singles out C++ when there also exist incompatible libraries in C, Java, Python, you name it. Also, the last time the claim that everyone invented their own string type could have been credibly aired was around 2004.

What's built inside the compiler is like axioms in math, and what's library is like theorems supported by the axioms. A good language, just like a good mathematical system, has few axioms and many theorems. That means the system is coherent and expressive. Hardwiring stuff in the language definition is almost always a failure of the expressive power of the language. Sometimes it's fine to just admit it and hardwire inside the compiler e.g. the prior knowledge that "+" on int does modulo addition. But most always it's NOT, and definitely not in the context of a complex data structure like a hash table. I also think that adding a hecatomb of built-in types and functions has smells, though to a good extent I concede to the necessity of it.

We should start from what the user wants to accomplish. Then figure how to express that within the language. And only lastly, when needed, change the language to mandate lowering constructs to the MINIMUM EXTENT POSSIBLE into constructs that can be handled within the existing language. This approach has been immensely successful virtually whenever we applied it: foreach for ranges (though there's work left to do there), operator overloading, and too little with hashes. Lately I see a sort of getting lazy and skipping the second pass entirely. Need something? Yeah, what the hell, we'll put it in the language.

I am a bit worried about the increasing radicalization of the discussion here, but recent statements come in frontal collision with my core principles, which I think stand on solid evidential ground. I am appealing for building consensus and staying principled instead of reaching for the cheap solution.
If we do the latter, it's quite likely we'll regret it later. Andrei
Jan 07 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 1:28 PM, Andrei Alexandrescu wrote:
 Having a pluggable interface so the implementation can be changed is all
 right, as long as the binary API does not change.
 If the binary API changes, then of course, two different libraries
 cannot be linked together. I strongly oppose any changes which would
 lead to a balkanization of D libraries.

In my opinion this statement is thoroughly wrong and backwards. I also think it reflects a misunderstanding of what my stance is. Allow me to clarify how I see the situation. Currently built-in hash table use generates special-cased calls to non-template functions implemented surreptitiously in druntime. The underlying theory, also sustained by the statement quoted above, is that we are interested in supporting linking together object files and libraries BUILT WITH DISTINCT MAJOR RELEASES OF DRUNTIME. There is zero interest for that. ZERO. No language even attempts to do so. Runtimes that are not compatible with their previous versions are common, frequent, and well understood as an issue.

We've agreed on this before; perhaps I misstated it here, but I am not talking about changing druntime. I'm talking about someone providing their own hash table implementation that has a different binary API than the one in druntime, such that code from their library cannot be linked with any other code that uses the regular hashtable. A different implementation of hashtable would be fine, as long as it is binary compatible. We did this when we switched from a binary tree collision resolution to a linear one, and the switchover went without a hitch because it did not require even a recompile of existing binaries.
 In an ideal world, built-in hash tables should work in a very simple manner.
The
 compiler lowers all special hashtable syntax - in a manner that's MINIMAL,
 SIMPLE, and CLEAR - into D code that resolves to use of object.di (not some
 random user-defined library!). From then on, druntime code takes over. It could
 choose to use templates, dynamic type info, whatever. It's NOT the concern of
 the compiler. The compiler has NO BUSINESS taking library code and hardwiring
it
 in for no good reason.

That was already true of the hashtables - it's just that the interface to them was through a set of fixed function calls, rather than a template interface. To the compiler, the hashtables were a completely opaque void*. The compiler had zero knowledge of how they actually were implemented inside the runtime. Changing it to a template implementation enables a more efficient interface, as inlining, etc., can be done instead of the slow opApply() interface. The downside of that is it becomes a bit perilous, as the binary API is not so flexible anymore.
 (Consider the disaster C++ has had forever with everyone inventing their
 own string type. That ensured zero interoperability between C++
 libraries, a situation that persists even for 10 years after C++ finally
 acquired a standard string library.)

It is exactly this kind of canned statement and prejudice that we must avoid. It unfairly singles out C++ when there also exist incompatible libraries in C, Java, Python, you name it.

Of course, but strings are a fundamental data type, and so it was worse with C++. I don't agree that my opinion on it is prejudicial or unfair, because many times I was stuck with having to deal with the issues of trying to glue together disparate code that had differing string classes. Often, it was the only incompatibility, but it permeated the library interfaces.
 Also, the last time the claim that everyone invented their own string type
 could have been credibly aired was around 2004.

Sure, people rarely (never?) do their own C++ string classes anymore, but that old code and those old libraries are still around, and are actively maintained. http://msdn.microsoft.com/en-us/library/ms174288.aspx Notice that's for Visual Studio C++ 2010. The string problem was a mistake I was determined not to make with D. I have agreed with you and still agree with the notion of using lowering instead of custom code. Also, keep in mind that the hashtable design was done long before D even had templates. It was "lowered" to what D had at the time - function calls and opApply.
 What's built inside the compiler is like axioms in math, and what's library is
 like theorems supported by the axioms. A good language, just like a good
 mathematical system, has few axioms and many theorems. That means the system is
 coherent and expressive. Hardwiring stuff in the language definition is almost
 always a failure of the expressive power of the language.

True.
 Sometimes it's fine to
 just admit it and hardwire inside the compiler e.g. the prior knowledge that
"+"
 on int does modulo addition.

Right, I understand that the abstraction abilities of D are not good enough to produce a credible 'int' type, or 'float', etc., hence they are wired in.
 But most always it's NOT, and definitely not in the
 context of a complex data structure like a hash table. I also think that adding
 a hecatomb of built-in types and functions has smells, though to a good extent
I
 concede to the necessity of it.

I want to reiterate that I don't think there is a way with the current compiler technology to make a library SIMD type that will perform as well as a builtin one, and those who use SIMD tend to be extremely demanding of performance. (One could make a semantic equivalent, but not a performance equivalent.)
 We should start from what the user wants to accomplish. Then figure how to
 express that within the language. And only lastly, when needed, change the
 language to mandate lowering constructs to the MINIMUM EXTENT POSSIBLE into
 constructs that can be handled within the existing language. This approach has
 been immensely successful virtually whenever we applied it: foreach for ranges
 (though there's work left to do there), operator overloading, and too little
 with hashes. Lately I see a sort of getting lazy and skipping the second pass
 entirely. Need something? Yeah, what the hell, we'll put it in the language.

I don't think that is entirely fair in regards to the SIMD stuff. It reminds me of after I spent a couple years at Caltech, where every class was essentially a math class. My sister asked me for help with her high school trig homework, and I just glanced at it and wrote down all the answers. She said she was supposed to show the steps involved, but to me I was so used to doing it there was only one step. So while it may seem I'm skipping steps with the SIMD, I have been thinking about it for years off and on, and I have a fair experience with what needs to be done to generate good code.
 I am a bit worried about the increasing radicalization of the discussion here,
 but recent statements come in frontal collision with my core principles, which
I
 think stand on solid evidential ground. I am appealing for building consensus
 and staying principled instead of reaching for the cheap solution. If we do the
 latter, it's quite likely we'll regret it later.


 Andrei

Jan 07 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 I don't think that is entirely fair in regards to the SIMD stuff. It reminds
me 
 of after I spent a couple years at Caltech, where every class was essentially
a 
 math class.

I think that in several (but not all) fields of science and technology your limits are often determined by how much (in depth, and especially in how much variety) mathematics you know :-) Unfortunately, a lot of people don't seem able or willing to learn it... Bye, bearophile
Jan 07 2012
prev sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 12:14 AM, Walter Bright wrote:
 On 1/7/2012 1:28 PM, Andrei Alexandrescu wrote:
 But most always it's NOT, and definitely not in the
 context of a complex data structure like a hash table. I also think
 that adding
 a hecatomb of built-in types and functions has smells, though to a
 good extent I
 concede to the necessity of it.

I want to reiterate that I don't think there is a way with the current compiler technology to make a library SIMD type that will perform as well as a builtin one, and those who use SIMD tend to be extremely demanding of performance.

Considering that the entire purpose of SIMD is performance, I think the demand is reasonable :-)
Jan 07 2012
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 What's built inside the compiler is like axioms in math, and what's
 library is like theorems supported by the axioms. A good language, just
 like a good mathematical system, has few axioms and many theorems. That
 means the system is coherent and expressive. Hardwiring stuff in the
 language definition is almost always a failure of the expressive power
 of the language.

Yes, but when it comes to register allocation and platform specific instruction selection, that really is the job of the compiler. It is not something that can be done in a library (without rewriting the compiler in the language, which defeats the purpose of having a language in the first place). I agree that the language should add the minimum number of features to support what we want, although in this case (due to how platform-specific the solutions are) I think it simply requires a lot of work in the compiler.
 We should start from what the user wants to accomplish. Then figure how
 to express that within the language. And only lastly, when needed,
 change the language to mandate lowering constructs to the MINIMUM EXTENT
 POSSIBLE into constructs that can be handled within the existing
 language.

I agree. Essentially, we need at least:

- Some type (or types) that map directly to SIMD registers.
- The type must be separate from static arrays (aligned or not).
- Automatic register allocation, just like other primitive types.
- Automatic instruction scheduling.
- Ability to specify what instructions to use.

I agree with Manu that we should just have a single type like __m128 in MSVC. The other types and their conversions should be solvable in a library with something like strong typedefs. As the *sole* reason for this enhancement is performance, the compiler absolutely must have all the information it needs to produce optimal code.
 I am a bit worried about the increasing radicalization of the discussion
 here, but recent statements come in frontal collision with my core
 principles, which I think stand on solid evidential ground. I am
 appealing for building consensus and staying principled instead of
 reaching for the cheap solution. If we do the latter, it's quite likely
 we'll regret it later.

We also need to be pragmatic. There is no point defining a perfect, modular, clean solution to the problem if it is going to take years to realize. In years, the problem may not exist anymore. This is especially true when it comes to hardware issues like the one we are discussing here.
Jan 07 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 8 January 2012 02:54, Peter Alexander <peter.alexander.au gmail.com>wrote:

 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a library
 with something like strong typedefs.

Walter put in a reasonable effort to sway me to his side of the fence last night. I'm still not entirely sold that implementation inside the language is necessary to achieve these details, but I don't have enough background info to argue, and I'm not the one that has to maintain the code :)

Here are some points we discussed... how do we do these (efficiently) in a library?

** Literal syntax.. and constant folding:

Constants and literals also need to be aligned. If we use array syntax to express literals, this will be a problem.

  int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];

Any constant expressions need to be simplified at compile time: int4 vec = [ 6,8,10,12 ];
Perhaps this is possible with CTFE? Or will it be automatic if you express literals as if they were arrays?

** Expression interpretation/simplification:

  float4 v = -b + a;

Obviously, this should be simplified to 'a - b'.

  float4 v = a*b + c;

This should use a multiply-accumulate opcode on most architectures: FMADDPS v, a, b, c

** Typed debug info

In a debugger it's nice to inspect variables in their supposed type. Can probably use unions to do this... probably wouldn't be as nice though.

** God knows what other optimisations

  float4 v = [ 0,0,0,0 ]; // XOR v
  etc...

I don't know what amount of this is achievable with libraries, but Walter seems to think this will all work much better in the language... I'm inclined to trust his judgement.
Jan 07 2012
next sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 1:32 AM, Manu wrote:
 On 8 January 2012 02:54, Peter Alexander <peter.alexander.au gmail.com
 <mailto:peter.alexander.au gmail.com>> wrote:

     I agree with Manu that we should just have a single type like __m128
     in MSVC. The other types and their conversions should be solvable in
     a library with something like strong typedefs.


 Walter put in a reasonable effort to sway me to his side of the fence
 last night. I'm still not entirely sold that implementation inside the
 language is necessary to achieve these details, but I don't have enough
 background info to argue, and I'm not the one that has to maintain the
 code :)

 Here are some points we discussed... how do we do these (efficiently) in
 a library?

Just to be clear, it was only the types and conversions that I thought would be suitable for a library. Operations, along with their optimisations are best for compiler.
 ** Literal syntax.. and constant folding:

 Constants and literals also need to be aligned. If we use array syntax
 to express literals, this will be a problem.

   int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];

 Any constant expressions need to be simplified at compile time: int4 vec
 = [ 6,8,10,12 ];
 Perhaps this is possible with CTFE? Or will it be automatic if you
 express literals as if they were arrays?

You could use array syntax for vector literals, as long as they are stored directly into vector variables. e.g.

  immutable int4 a = [1, 2, 3, 4];
  immutable int4 b = [5, 6, 7, 8];
  int4 v = a + b;

Constant folding can be done by the compiler, although I don't think this is a priority.
 ** Expression interpretation/simplification:

   float4 v = -b + a;

 Obviously, this should be simplified to 'a - b'.

   float4 v = a*b + c;

 This should use a multiply-accumulate opcode on most architectures:
 FMADDPS v, a, b, c

The compiler should make these decisions, just like it does with int/float etc. In some cases these kinds of simplifications can affect the result due to numeric issues. You can use expression templates for this sort of thing as well, but they are a horrible mess, so I don't think I'd like to see them.
 ** Typed debug info

 In a debugger it's nice to inspect variables in their supposed type.
 Can probably use unions to do this... probably wouldn't be as nice though.

Good point. I'm not an expert on this, but I suspect that a union would be good enough?
 ** God knows what other optimisations

 float4 v = [ 0,0,0,0 ]; // XOR v
 etc...

Again, I think you could use expression templates for this, but it's so much simpler to leave this optimisation to the compiler. Even if the compiler doesn't do it, it's not difficult to do it manually when you really need it:

  float4 v = void;
  asm { pxor v, v; }

Honestly, I'm not too bothered with these types of optimisations. As long as the compiler does the register allocation and instruction scheduling for me, I would be 99% happy because those things are the most tedious when trying to write structured code. I can easily enough change (-b + a) to (a - b) if that's faster, or insert specific instructions for generating vector constants, or do constant folding manually. Of course, it would be nice if the compiler did them, but that's just icing on the cake. The meat of the problem is register allocation.
 I don't know what amount of this is achievable with libraries, but
 Walter seems to think this will all work much better in the language...
 I'm inclined to trust his judgement.

I agree.
Jan 07 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 5:32 PM, Manu wrote:
 Here are some points we discussed... how do we do these (efficiently) in a
library?

Another issue - matching the name mangling and parameter passing/return conventions of how other C/C++ compilers deal with vector types. That is currently not doable with a library type.
Jan 07 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 4:54 PM, Peter Alexander wrote:
 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.
Jan 07 2012
parent reply Manu <turkeyman gmail.com> writes:
On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/7/2012 4:54 PM, Peter Alexander wrote:

 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?
Jan 07 2012
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 1:48 AM, Manu wrote:
 On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/7/2012 4:54 PM, Peter Alexander wrote:

         I think it simply requires a lot of work in the compiler.


     Not that much work. Most of it segues nicely into the previous work
     I did supporting the XMM floating point code gen.


 What is this previous work you speak of? Is there already XMM stuff in
 there somewhere?

On 64-bit, floats are stored in XMM registers (just as single scalars). I don't think it does any vectorization yet though. It does mean that the register allocation of those registers is already complete though.
Jan 07 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 6:32 PM, Peter Alexander wrote:
 On 64-bit, floats are stored in XMM registers (just as single scalars).

Yes.
 I don't think it does any vectorization yet though.

Right. It doesn't do that.
 It does mean that the register
 allocation of those registers is already complete though.

Yup. Does a nice job of it, too :-)
Jan 07 2012
prev sibling parent reply "a" <a a.com> writes:
On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:
 On 8 January 2012 03:44, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 On 1/7/2012 4:54 PM, Peter Alexander wrote:

 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?

DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM registers and instructions that work on them (addss, addsd, mulsd...) for scalar floating point operations.
Jan 08 2012
parent Manu <turkeyman gmail.com> writes:
On 8 January 2012 11:56, a <a a.com> wrote:

 On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:

 On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com>
 wrote:

  On 1/7/2012 4:54 PM, Peter Alexander wrote:
  I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?

DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM registers and instructions that work on them (addss, addsd, mulsd...) for scalar floating point operations.

Yeah of course! >_< I forgot that they did that in x64 (I never work with x64), but I recall thinking that was the single most awesome change to the architecture! :)
Jan 08 2012
prev sibling parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
MS has three types, __m128, __m128i and __m128d  (float, int, double)

Six if you count AVX's 256 forms.

On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.

Jan 14 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme is, given the following:

  __m128i v;
  v += 2;

it can't tell what to do. With D,

  int4 v;
  v += 2;

it's clear (add 2 to each of the 4 ints).
Jan 14 2012
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/15/2012 12:09 AM, Walter Bright wrote:
 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme, is given the following: __m128i v; v += 2; Can't tell what to do. With D, int4 v; v += 2; it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can start having sensible operator overloads, and so you can write code that is readable.

  if (any4(a > b))
  {
      // do stuff
  }

is way way way better than (pseudocode)

  if (__movemask_ps(_mm_gt_ps(a, b)) != 0)
  {
  }

and (if the ternary operator were overridable in C++)

  float4 foo = (a > b) ? c : d;

would be better than

  float4 mask = _mm_gt_ps(a, b);
  float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps(mask, d));
Jan 14 2012
parent reply Manu <turkeyman gmail.com> writes:
On 15 January 2012 08:16, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:

 On 1/15/2012 12:09 AM, Walter Bright wrote:

 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:

 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:

 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme, is given the following: __m128i v; v += 2; Can't tell what to do. With D, int4 v; v += 2; it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can start having sensible operator overloads, and so you can write code that is readable. if (any4(a > b)) { // do stuff } is way way way better than (pseudocode) if (__movemask_ps(_mm_gt_ps(a, b)) == 0x0F) { } and (if the ternary operator was overrideable in C++) float4 foo = (a > b) ? c : d; would be better than float4 mask = _mm_gt_ps(a, b); float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps_(mask, d));

Yep, it's coming... baby steps :) Walter: I told you games devs would be all over this! :P
Jan 15 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 15.01.2012, 11:45 Uhr, schrieb Manu <turkeyman gmail.com>:

 On 15 January 2012 08:16, Sean Cavanaugh <WorksOnMyMachine gmail.com>  
 wrote:

 On 1/15/2012 12:09 AM, Walter Bright wrote:

 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:

 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:

 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128  
 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme is that, given the following:

    __m128i v;
    v += 2;

it can't tell what to do. With D:

    int4 v;
    v += 2;

it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can have sensible operator overloads and write code that is readable.

    if (any4(a > b))
    {
        // do stuff
    }

is way way way better than (pseudocode)

    if (__movemask_ps(_mm_gt_ps(a, b)) == 0x0F)
    {
    }

and (if the ternary operator were overridable in C++)

    float4 foo = (a > b) ? c : d;

would be better than

    float4 mask = _mm_gt_ps(a, b);
    float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps_(mask, d));

Yep, it's coming... baby steps :) Walter: I told you games devs would be all over this! :P

And even compression algorithms. I found one written in C that uses external .asm files, compiled into object files with NASM for use on the linker command line. They contain some MMX/SSE code depending on the processor you plan to use. The author claims that the MMX versions of the 'outsourced' routines run 8x faster. I didn't verify this, but the idea that these instructions become part of the language and easy to use for regular programmers like me (and not just console game developers) is exciting. I bet there are more programs that could benefit from SSE than is obvious, or code that could be rewritten in a way that lets multiple data sets be processed simultaneously.
Jan 16 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 5:06 AM, Marco Leise wrote:
 I bet there are more programs that
 could benefit from SSE than is obvious, or code that could be rewritten in a way
 that lets multiple data sets be processed simultaneously.

I think there's quite a bit more; it's just that using SIMD instructions has historically been so clumsy that few take advantage. For example, a memchr operation could be dramatically sped up with SIMD, which has implications for regex.
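As a concrete illustration of the memchr point, here is a minimal SSE2 sketch in C. The `memchr_sse2` name is hypothetical (this is not the libc implementation); alignment handling and per-platform tuning are deliberately omitted:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sketch of an SSE2-accelerated memchr: compare 16 bytes at a time,
 * then use the compare mask to locate the first match. Illustrative
 * only; real implementations also tune for alignment and page ends. */
const void *memchr_sse2(const void *s, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    __m128i needle = _mm_set1_epi8((char)c);       /* broadcast target byte */

    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                                  /* some byte matched */
            return p + __builtin_ctz(mask);        /* index of first match */
        p += 16;
        n -= 16;
    }
    for (; n; --n, ++p)                            /* scalar tail */
        if (*p == (unsigned char)c)
            return p;
    return NULL;
}
```

The same single-compare-per-16-bytes structure is what makes a SIMD regex literal scan fast.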
Jan 16 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com> wrote:

 1. the language does typechecking, for example, trying to add a vector of
 4 floats to 16 bytes would be (and should be) an error.

I want to sell you on the point that primitive SIMD registers are truly typeless (which I thought you had already agreed with). :) Here are some examples of tight interaction between int/float, and of operating ON floats with int operations... Naturally the examples I present will be wrapped as useful functions in libraries, but the primitive type shouldn't make this more annoying by enforcing pointless type safety errors like you seem to be suggesting.

In computer graphics it's common to work with float16s, a type not supported by SIMD units. Pack/unpack code involves detailed float/int interaction. You might take a register of floats, mask the exponent, and then perform integer arithmetic on the exponent to shift it into the float16 exponent range... then you mask the bottom of the mantissa and shift the bits into place. Unpacking is the same process in reverse.

There are other tricks with the float sign bits: making everything negative by OR-ing 1s into the top bits, or gathering the signs using various techniques... useful for identifying the cell in a quad-tree, for instance. Integer manipulation of floats is surprisingly common.
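The float16 pack described above boils down to mask-and-shift integer arithmetic on the float's bits. Here is a scalar C sketch of the idea (simplified: rounding, NaN/Inf and denormals are ignored, in the spirit of the post's note that games code disables those); each mask and shift maps one-to-one onto SIMD integer ops applied to a whole register of floats:

```c
#include <stdint.h>
#include <string.h>

/* float -> float16 packing via integer manipulation of the float's
 * bits. Simplified sketch: truncates the mantissa, flushes small
 * values to zero, clamps large ones to infinity. */
uint16_t pack_float16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                 /* reinterpret, not convert */

    uint32_t sign     = (bits >> 16) & 0x8000;      /* sign into half position */
    int32_t  exponent = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; /* re-bias */
    uint32_t mantissa = (bits >> 13) & 0x03FF;      /* top 10 mantissa bits */

    if (exponent <= 0)  return (uint16_t)sign;             /* flush to zero */
    if (exponent >= 31) return (uint16_t)(sign | 0x7C00);  /* clamp to Inf */
    return (uint16_t)(sign | ((uint32_t)exponent << 10) | mantissa);
}
```

On a SIMD unit the `memcpy` disappears entirely: the register already holds the bits, and the question is only whether the type system lets you apply integer shifts to them.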
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 12:45 PM, Manu wrote:
 Here are some examples of tight interacting between int/float, and interacting
 ON floats with int operations...
 Naturally the examples I present will be wrapped as useful functions in
 libraries, but the primitive type shouldn't try and make this more annoying by
 trying to enforce pointless type safety errors like you seem to be suggesting.

I am suggesting it, no doubt about it!
 In computer graphics it's common to work with float16's, a type not supported
by
 simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them into
place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by or-ing in
 1's into the top bits. or you can gather the signs using various techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type.

I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX: sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocally state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.
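A minimal C sketch of what "a few reinterpret casts" looks like for the sign-bit trick mentioned above. The `force_negative` helper is hypothetical; `memcpy` is standing in for a reinterpreting cast, which is C's well-defined way to view float bits as an integer:

```c
#include <stdint.h>
#include <string.h>

/* Force the sign bit on: the "make everything negative" trick, with
 * the reinterpret step spelled out explicitly instead of shifting or
 * OR-ing a float directly. */
float force_negative(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret float as uint32 */
    bits |= 0x80000000u;             /* OR a 1 into the sign bit */
    memcpy(&f, &bits, sizeof f);     /* reinterpret back to float */
    return f;
}
```

The argument here is that the two `memcpy` lines document intent; the counter-argument in the thread is that with vectors this boilerplate appears on nearly every line.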
Jan 06 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 I've worked a lot with large assembler programs. As you know, EAX has no type. 
 The assembler code would constantly shift the type of things that were in EAX, 
 sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating
a 
 pointer as an int, etc. I can unequivocably state that this typeless approach
is 
 confusing, buggy, hard to untangle, and ultimately a freedom that is not 
 justifiable.

There is even some desire for a typed assembly. It's not easy to design and implement, but it seems able to avoid some bugs: http://www.cs.cornell.edu/talc/papers.html Bye, bearophile
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:00, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 12:45 PM, Manu wrote:

 In computer graphics it's common to work with float16's, a type not
 supported by

simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then
 perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them
 into place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by
 or-ing in
 1's into the top bits. or you can gather the signs using various
 techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type. I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX, sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocably state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.

To be clear, I'm not opposing strongly typed vector types... that's my primary goal too. But they're not as simple as I think you believe.

From experience: Microsoft provides __m128, but GCC does what you're proposing (although I get the feeling it's not a 'proposal' anymore). GCC uses 'vector float', 'vector int', 'vector unsigned short', etc... I hate writing vector code the GCC way; it's really ugly. The lines tend to become dominated by casts, and it's all for nothing, since it all gets wrapped up behind a library anyway.

Secondly, you're introducing confusion. A cast from float4 to int4... does it reinterpret, or does it type convert? In GCC it reinterprets, but what do you actually expect? And regardless of what you expect, what do you actually WANT most of the time? I'm sure you'll agree that the expected/'proper' thing would be a type conversion (and I know you're into 'proper'-ness), but in practice you almost always want to reinterpret. This inevitably leads to ugly reinterpret syntax all over the place. If it were a typeless vector reg type, it all goes away.

Despite all this worry and effort, NOBODY will ever use these strongly typed (but still primitive) types of yours. They will need to be extended with bunches of methods, which means wrapping them up in libraries anyway to add all the higher level functionality... so what's the point? The only reason people will use them is to wrap them up in a library of their own, at which point I promise you they'll be just as annoyed as me by the typing and the need for casts all over the place to pass them into basic intrinsics.

But if you're insistent on doing this, can you detail the proposal...
 * What types will exist?
 * How will each one cast/interact?
 * What about error conditions/exceptions? How do I control these... on a per-type basis?
 * What about CTFE? Will you add understanding for every operation supported by each type? This is easily handled in a library...
 * How will you assign literals? How can you assign a typeless literal (a single 128bit value, used primarily for masks)?
 * What operators will be supported... and what will they do?
 * Will you extend support for 64bit and 256bit vector types? That's a whole bundle more types again...

I really feel this is polluting the language. ...Is this whole thing just so you can support MADD? If so, there are others to worry about too...
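The reinterpret-vs-convert ambiguity raised above can be demonstrated directly with GCC/Clang vector extensions, which are roughly the scheme under discussion: a plain cast between same-sized vector types reinterprets the bits, while an element-wise conversion needs `__builtin_convertvector` (Clang, GCC 9+). The `casts_agree` helper is illustrative only:

```c
#include <stdint.h>

/* GCC/Clang vector extensions: two 16-byte vector types. */
typedef float   float4 __attribute__((vector_size(16)));
typedef int32_t int4   __attribute__((vector_size(16)));

int casts_agree(void)
{
    float4 f = {1.5f, 2.5f, 3.5f, 4.5f};

    int4 reint = (int4)f;                          /* reinterpret: raw bit patterns */
    int4 conv  = __builtin_convertvector(f, int4); /* convert: truncate each lane */

    /* lane 0: 1.5f has bit pattern 0x3FC00000; converting 1.5f gives 1 */
    return reint[0] == 0x3FC00000 && conv[0] == 1 && conv[3] == 4;
}
```

Two very different operations hide behind "cast float4 to int4", which is exactly the confusion the post complains about.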
Jan 06 2012
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 7 January 2012 00:38, Manu <turkeyman gmail.com> wrote:
 On 7 January 2012 02:00, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/6/2012 12:45 PM, Manu wrote:
 In computer graphics it's common to work with float16's, a type not
 supported by

 simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then
 perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them
 into place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by
 or-ing in
 1's into the top bits. or you can gather the signs using various
 techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type. I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX, sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocably state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.

To be clear, I'm not opposing strongly typing vector types... that's my primary goal too. But they're not as simple I think you believe. From experience, microsoft provices __m128, but GCC does what you're proposing (although I get the feeling it's not a 'proposal' anymore). GCC uses 'vector float', 'vector int', 'vector unsigned short', etc... I hate writing vector code the GCC way, it's really ugly. The lines tend to become dominated by casts, and it's all for nothing, since it all gets wrapped up behind a library anyway. Secondly, you're introducing confusion. A cast from float4 to int4... does it reinterpret, or does it type convert? In GCC it reinterprets, but what do you actually expect? and regardless of what you expect, what do you actually WANT most of the time... I'm sure you'll agree that the expected/'proper' thing would be a type conversion (and I know you're into 'proper'-ness), but in practise you almost always want to reinterpret. This inevitably leads to ugly reinterpret syntax all over the place. If it were a typeless vector reg type, it all goes away.

FYI, vector conversion in GCC is roughly equivalent to the idiom *(float4 *)&X; in C. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 14:44:53 +0100, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 14:56, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <
 newshound2 digitalmars.com> wrote:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints  
 or
 2 packed doubles. One problem with making it typed is it'll add 10 more
 types to the base compiler, instead of one. Maybe we should just bite  
 the
 bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrapper. Still simdop would not be typesafe.

I think they should be well-defined structs with lots of type safety and sensible methods, not just a typedef of the typeless primitive.
 As much as this proposal presents a viable solution,
 why not spending the time to extend inline asm.

I think there are too many risky problems with the inline assembler (as raised in my discussion about supporting pseudo registers in inline asm blocks):
 * No way to allow the compiler to assign registers (pseudo registers)

That's what I propose he should do. IMHO it's a huge improvement when register variables can be used directly in asm:

    int a, b;
    __vec128 c;
    asm (a, b, c)
    {
        mov EAX, a;
        add b, EAX;
        movps XMM1, c;
        mulps c, XMM1;
    }

The compiler has enough knowledge to do this, and it's the common basic block spilling scheme that is used here.

There is another benefit. Consider the following:

    __vec128 addps(__vec128 a, __vec128 b) pure
    {
        __vec128 res = a;

        if (__ctfe)
        {
            foreach(i; 0 .. 4)
                res[i] += b[i];
        }
        else
        {
            asm (b, res)
            {
                addps res, b;
            }
        }
        return res;
    }
   * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be when
 trying to optimise around 100 small inlined functions each containing its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics. The only implementation issue could be that lots of inlined asm snippets make plenty of basic blocks, which could slow down certain compiler algorithms.
   * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

??? This would be needed for opcodes as well. Your initial goal was to directly influence code gen down to the instruction level; how should that be achieved without platform-specific extensions? Quite the contrary: with ops and asm he will need two hack paths into gcc's codegen. What I see here is that we can do much good for the inline assembler while achieving the same goal. With intrinsics, on the other hand, we're adding a very specialized maintenance burden.
 What would be needed?
  - Implement the asm allocation logic.
  - Functions containing asm statements should participate in inlining.
  - Determining inline cost of asm statements.

I raised these points in my other thread; these are all far more complicated problems, I think, than exposing opcode intrinsics would be. Opcode intrinsics are almost certainly the way to go. When being used with typedefs for __vubyte16 et al. this would
 allow a really clean and simple library implementation of intrinsics.

The type safety you're imagining here might actually be annoying when working with the raw type and opcodes. Consider this common situation and the code that will be built around it: __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also, mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed internally by the CPU.
 pack
 some other useful data in W
 If vec were strongly typed, I would now need to start casting all over  
 the
 place to use various float and uint opcodes on this value?
 I think it's correct when using SIMD at the raw level to express the type
 as it is, typeless... SIMD regs are infact typeless regs, they only gain
 concept of type the moment you perform an opcode on it, and only for the
 duration of that opcode.

 You will get your strong type safety when you make use of the float4  
 types
 which will be created in the libs.

Jan 06 2012
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 1/6/2012 12:43 AM, Walter Bright wrote:
 Declare one new basic type:
 
     __v128
 
 which represents the 16 byte aligned 128 bit vector type. The only operations
defined to work on it would be
 construction and assignment. The __ prefix signals that it is non-portable.
 
 Then, have:
 
    import core.simd;
 
 which provides two functions:
 
    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

How is making __v128 a builtin type better than defining it as:

    align(16) struct __v128
    {
        ubyte[16] data;
    }
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 10:25 AM, Brad Roberts wrote:
 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
      ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.
Jan 06 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/6/12 1:11 PM, Walter Bright wrote:
 On 1/6/2012 10:25 AM, Brad Roberts wrote:
 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
 ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.

If it's possible, then it would be great to express the new constructs within the existing language (optionally by leaving it to the implementation to strengthen guarantees of certain constructs). I very warmly recommend avoiding defining things in the language and compiler wherever the same is possible within a library (however non-portable). Confining features to the language/compiler drastically reduces the number of people that can work on them. Andrei
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 23:23, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org>wrote:

 On 1/6/12 1:11 PM, Walter Bright wrote:

 On 1/6/2012 10:25 AM, Brad Roberts wrote:

 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
 ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.

If it's possible, then it would be great to express the new constructs within the existing language (optionally by leaving it to the implementation to strengthen guarantees of certain constructs).

Now you're at odds with Walter's new take on it... He seems to have changed his mind and decided library implementation of the complex/strict types is a bad idea now?
 I very warmly recommend avoiding defining things in the language and
 compiler wherever the same is possible within a library (however
 non-portable). Confining features to the language/compiler drastically
 reduces the number of people that can work on them.

Aye, and my proposal requests only the minimum support required from the language, allowing libraries to do the rest. For some reason Walter seems to have done a bit of a 180 in the last few hours ;)
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:46 PM, Manu wrote:
 For some reason Walter seems to have done a bit of a 180 in the last few hours
;)

It must be the drugs!
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 01:34, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:46 PM, Manu wrote:

 For some reason Walter seems to have done a bit of a 180 in the last few
 hours ;)

It must be the drugs!

That's what I was starting to suspect too! :P
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:

 There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
    __vec128 res = a;

    if (__ctfe)
    {
        foreach(i; 0 .. 4)
           res[i] += b[i];
    }
    else
    {
        asm (b, res)
        {
            addps res, b;
        }
    }
    return res;

 }

You don't need to use inline ASM to be able to do this, it will work the same with intrinsics. I've detailed numerous problems with using inline asm, and complications with extending the inline assembler to support this. * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be when
 trying to optimise around 100 small inlined functions each containing its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics.

Most compilers can't reschedule code around inline asm blocks. There are a lot of reasons for this; Google can help you. The main reason is that a COMPILER doesn't attempt to understand the assembly it's being asked to insert inline. The information that it may use for optimisation is never present, so it can't do its job.
 The only implementation issue could be that lots of inlined asm snippets
 make plenty basic blocks which could slow down certain compiler algorithms.

Same problem as above. The compiler would need to understand enough about assembly to perform optimisation on the assembly itself to clean this up. Using intrinsics, all the register allocation, load/store code, etc., is in the regular realm of compiling the language, and the code generation and optimisation all work as usual.

 * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

  ???

This would be needed for opcodes as well. You initial goal was to directly influence code gen up to instruction level, how should that be achieved without platform specific extension. Quite contrary with ops and asm he will need two hack paths into gcc's codegen.

 What I see here is that we can do much good things to the inline
 assembler while achieving the same goal.
 With intrinsics on the other hand we're adding a very specialized
 maintenance burden.

You need to understand how the inline assembler works in GCC to understand the problems with this. GCC basically receives a string containing assembly code. It does not attempt to understand it; it just pastes it into the .s file verbatim. This means you can support any architecture without any additional work... you just type the appropriate architecture's asm in your program and it's fine... but now if we want to perform pseudo-register assignment, or parameter substitution, we need a front end that parses the D asm expressions and generates a valid asm string for GCC. It can't generate that string without detailed knowledge of the architecture it's targeting, and it's not feasible to implement that support for all the architectures GCC supports.

Even after all that, it's still not ideal. Inline asm reduces the ability of the compiler to perform many optimisations.

 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed CPU internally.

It's a very good idea: I am saving memory, and also saving memory accesses. This leads back to the point in my OP where I said that most games programmers turn NaN, Den, and FP exceptions off. As I've also raised before, most vectors are actually float[3]s; W is usually ignored and contains rubbish. It's conventional to stash some 32bit value in the W to fill the otherwise wasted space, and also get the load for free alongside the position. The typical program flow, in this case:
 * The colour will be copied out into a separate register, where it will be reinterpreted as a uint and have an unpack process applied to it.
 * XYZ will then be used to perform maths, ignoring W, which will continue to accumulate rubbish values... it doesn't matter, all FP exceptions and such are disabled.
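A C/SSE2 sketch of the extraction step described above. The `extract_packed_colour` helper is hypothetical; the point is that `_mm_castps_si128` is the "reinterpret, don't convert" operation, so the colour bits never pass through float arithmetic:

```c
#include <emmintrin.h>
#include <stdint.h>

/* {x, y, z, packedColour} loaded as one 128-bit value: pull the raw
 * 32 bits out of the W lane as an integer, leaving the float lanes
 * untouched for XYZ maths. */
uint32_t extract_packed_colour(__m128 v)
{
    __m128i vi = _mm_castps_si128(v);  /* reinterpret lanes as integers, no conversion */
    __m128i w  = _mm_shuffle_epi32(vi, _MM_SHUFFLE(3, 3, 3, 3)); /* W lane to lane 0 */
    return (uint32_t)_mm_cvtsi128_si32(w);
}
```

Because the extraction stays in the integer domain, it works even when the packed colour's bit pattern happens to be a denormal or NaN as a float.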
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:25, Brad Roberts <braddr puremagic.com> wrote:

 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
    ubyte[16] data;
 }

Where in that code is the compiler informed that your structure should occupy a SIMD registers, and apply SIMD ABI conventions?
Jan 06 2012
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 1/6/2012 11:06 AM, Manu wrote:
 On 6 January 2012 20:25, Brad Roberts <braddr puremagic.com
<mailto:braddr puremagic.com>> wrote:
 
     How is making __v128 a builtin type better than defining it as:
 
     align(16) struct __v128
     {
        ubyte[16] data;
     }
 
 
 Where in that code is the compiler informed that your structure should occupy
a SIMD registers, and apply SIMD ABI
 conventions?

Good point; those rules would need to be added. I'd argue that it's not unreasonable to allow any properly aligned and sized type to occupy those registers, though that's likely not optimal for cases that won't actually use the operations that modify them. However, as a counter example, it'd be a lot easier to write a memcpy routine that uses them without having to resort to asm code under this theoretical model.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a memcpy routine
that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy. Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.
Jan 06 2012
next sibling parent Brad Roberts <braddr puremagic.com> writes:
On Fri, 6 Jan 2012, Walter Bright wrote:

 On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a memcpy routine
 that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy. Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.

Oh, I completely agree. Intel has people who work on that as their primary job. There's a constant trickle of changes going into glibc's mem{cpy,cmp}-type routines to specialize for each of the ever-evolving set of platforms out there. No way should that effort be duplicated. All I was pondering was how much cleaner much of that could be if it were expressed in higher level representations. But you'd still wind up playing serious tweaking and validation games that would largely, if not completely, invalidate the utility of being expressed in higher level forms. Probably. Later, Brad
Jan 06 2012
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:
 On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a 
 memcpy routine that uses them
 without having to resort to asm code under this theoretical 
 model.

I would seriously argue that individuals not attempt to write their own memcpy.

Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions, however they are GPL.
 Why? Because the C one has had probably thousands of 
 programmers looking at it for the last 30 years. You're not 
 going to spend 5 minutes, or even 5 days, and make it faster.

This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 03:46, Vladimir Panteleev <vladimir thecybershadow.net>wrote:

 On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:

 On 1/6/2012 11:16 AM, Brad Roberts wrote:

 However, a counter example, it'd be a lot easier to write a memcpy
 routine that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy.

Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions, however they are GPL. Why? Because the C one has had probably thousands of programmers looking
 at it for the last 30 years. You're not going to spend 5 minutes, or even 5
 days, and make it faster.

This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.

I've never seen a memcpy on any console system I've ever worked on that takes advantage of its large registers... writing a fast memcpy is usually one of the first things we do when we get a new platform ;)
Jan 06 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/6/2012 7:58 PM, Manu wrote:
 On 7 January 2012 03:46, Vladimir Panteleev <vladimir thecybershadow.net> wrote:

 I've never seen a memcpy on any console system I've ever worked on that
 takes advantage of its large registers... writing a fast memcpy is
 usually one of the first things we do when we get a new platform ;)

Plus, memcpy is optimized for reading and writing cached virtual memory, so you need several other variants to write to write-combined or uncached memory efficiently, and whatnot.
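As a hypothetical illustration of this point (not code from the thread): the cached and write-combined cases differ mainly in which store instruction is used. This sketch assumes SSE2, a 16-byte-aligned destination, and a size that is a multiple of 16; function names are invented.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Copy via regular (cache-allocating) stores: good for memory that will
// be read again soon.
void copy_cached(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(d + i, _mm_loadu_si128(s + i));
}

// Copy via non-temporal (streaming) stores: bypasses the cache, which is
// what you want for write-combined or frame-buffer-like memory.
void copy_streaming(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  // make the streaming stores globally visible
}
```

A real implementation would additionally handle unaligned heads/tails and small sizes; the point here is only that one memcpy cannot serve both memory types well.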
Jan 14 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:

 There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
    __vec128 res = a;

    if (__ctfe)
    {
        foreach(i; 0 .. 4)
           res[i] += b[i];
    }
    else
    {
        asm (res, b)
        {
            addps res, b;
        }
    }
    return res;

 }

You don't need to use inline ASM to be able to do this, it will work the same with intrinsics. I've detailed numerous problems with using inline asm, and complications with extending the inline assembler to support this.

Don't get me wrong here. The idea is to find out whether intrinsics can be built with the help of inlineable asm functions. The ctfe support is one good reason to go with a library solution.
  * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be  
 when
 trying to optimise around 100 small inlined functions each containing  
 its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics.

 Most compilers can't reschedule code around inline asm blocks. There are
 a lot of reasons for this, google can help you. The main reason is that a
 COMPILER doesn't attempt to understand the assembly it's being asked to
 insert inline. The information that it may use for optimisation is never
 present, so it can't do its job.

It doesn't have to understand the assembly. Wrapping these in functions creates an IR expression with inputs and outputs. Declaring them as pure gives the compiler free hands to apply whatever optimizations it does normally on an IR tree: common subexpression elimination, removing dead expressions...


 The only implementation issue could be that lots of inlined asm snippets
 make plenty basic blocks which could slow down certain compiler  
 algorithms.

Same problem as above. The compiler would need to understand enough about assembly to perform optimisation on the assembly itself to clean this up. Using intrinsics, all the register allocation, load/store code, etc., is in the regular realm of compiling the language, and the code generation and optimisation will all work as usual.

There is no informational difference between the intrinsic

__m128 _mm_add_ps(__m128 a, __m128 b);

and an inline assembler version

__m128 _mm_add_ps(__m128 a, __m128 b) { asm { addps a, b; } }
  * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

  ???

This would be needed for opcodes as well. Your initial goal was to directly influence code generation down to the instruction level; how should that be achieved without platform-specific extensions? Quite the contrary: with ops and asm he will need two hack paths into gcc's codegen.

 What I see here is that we can do much good things to the inline
 assembler while achieving the same goal.
 With intrinsics on the other hand we're adding a very specialized
 maintenance burden.

You need to understand how the inline assembler works in GCC to understand the problems with this. GCC basically receives a string containing assembly code. It does not attempt to understand it, it just pastes it in the .s file verbatim. This means you can support any architecture without any additional work... you just type the appropriate architecture's asm in your program and it's fine... but now if we want to perform pseudo-register assignment, or parameter substitution, we need a front end that parses the D asm expressions and generates a valid asm string for GCC. It can't generate that string without detailed knowledge of the architecture it's targeting, and it's not feasible to implement that support for all the architectures GCC supports.

So the argument here is that intrinsics in D can be mapped more easily to existing intrinsics in GCC? I do understand that this will be pretty difficult for GDC to implement. Reminds me that Walter has stated several times how much better an internal assembler can integrate with the language.
 Even after all that, It's still not ideal.. Inline asm reduces the  
 ability
 of the compiler to perform many optimisations.

 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also, mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed internally in the CPU.

It's a very good idea: I am saving memory, and also saving memory accesses. This leads back to the point in my OP where I said that most games programmers turn NaN, Den, and FP exceptions off. As I've also raised before, most vectors are actually float[3]'s; W is usually ignored and contains rubbish. It's conventional to stash some 32bit value in the W to fill the otherwise wasted space, and also get the load for free alongside the position. The typical program flow, in this case:
 * the colour will be copied out into a separate register where it will be reinterpreted as a uint, and have an unpack process applied to it.
 * XYZ will then be used to perform maths, ignoring W, which will continue to accumulate rubbish values... it doesn't matter, all FP exceptions and such are disabled.

Putting the uint to the front slot would make your life simpler then, only MOVD, no unpacking :).
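The float-XYZ-plus-packed-colour layout described above can be sketched with C/C++ SSE intrinsics (illustrative only; the struct and function names are invented, SSE2 assumed). The colour bits ride along in the fourth lane and are only ever reinterpreted as an integer, never operated on as a float:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// One 16-byte load brings in position and colour together; the W lane is
// never used as a float, so its (possibly denormal) bit pattern is inert.
struct PosColour {
    alignas(16) float data[4];  // x, y, z, then raw RGBA8 bits in data[3]
};

inline __m128 load_pos_colour(const PosColour& pc) {
    return _mm_load_ps(pc.data);
}

// Pull the packed colour back out of lane 3 as an integer, without ever
// treating those bits as a float value.
inline std::uint32_t extract_colour(__m128 v) {
    __m128i bits = _mm_castps_si128(v);   // free bit-for-bit reinterpret
    bits = _mm_srli_si128(bits, 12);      // shift lane 3 down to lane 0
    return (std::uint32_t)_mm_cvtsi128_si32(bits);
}
```

Note that the cast intrinsic costs no instructions; only the shift/extract to unpack the colour is real work, which matches the "copy out and unpack" flow described above.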
Jan 06 2012
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 22:40, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman gmail.com> wrote:

  On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:
  There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
   __vec128 res = a;

   if (__ctfe)
   {
       foreach(i; 0 .. 4)
          res[i] += b[i];
   }
   else
   {
   asm (res, b)
       {
           addps res, b;
       }
   }
   return res;

 }

 You don't need to use inline ASM to be able to do this, it will work the
 same with intrinsics. I've detailed numerous problems with using inline
 asm, and complications with extending the inline assembler to support
 this.

 Don't get me wrong here. The idea is to find out whether intrinsics can
 be built with the help of inlineable asm functions. The ctfe support is
 one good reason to go with a library solution.

/agree, this is a nice argument to support putting it in libraries.
 Most compilers can't reschedule code around inline asm blocks. There are a
 lot of reasons for this, google can help you.
 The main reason is that a COMPILER doesn't attempt to understand the
 assembly it's being asked to insert inline. The information that it may
 use

 It doesn't have to understand the assembly. Wrapping these in functions
 creates an IR expression with inputs and outputs. Declaring them as pure
 gives the compiler free hands to apply whatever optimizations it does
 normally on an IR tree: common subexpression elimination, removing dead
 expressions...

These functions shouldn't be functions... if they're not all inlined, then the implementation is broken. Once you inline all these micro asm blocks, 100 small asm blocks inlined in a single function, you're giving the optimiser a very hard time.
 Same problem as above. The compiler would need to understand enough about
 assembly to perform optimisation on the assembly its self to clean this
 up.
 Using intrinsics, all the register allocation, load/store code, etc, is
 all
 in the regular realm of compiling the language, and the code generation
 and
 optimisation will all work as usual.

 There is no informational difference between the intrinsic
 __m128 _mm_add_ps(__m128 a, __m128 b); and an inline assembler version

There is actually. To the compiler, the intrinsic is a normal function, with some hook in the code generator to produce the appropriate opcode when it's performing actual code generation. On most compilers, the inline asm on the other hand is unknown to the compiler; the optimiser can't do much anymore, because it doesn't know what the inline asm has done, and the code generator just goes and pastes your asm code inline where you told it to. It doesn't know if you've written to aliased variables, called functions, etc., so it can no longer safely rearrange code around the inline asm block, which means it's not free to pipeline the code efficiently.

 So the argument here is that intrinsics in D can be mapped more easily to
 existing intrinsics in GCC? I do understand that this will be pretty
 difficult for GDC to implement. Reminds me that Walter has stated several
 times how much better an internal assembler can integrate with the
 language.

Basically yes.
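For comparison, this is roughly what the intrinsic route looks like in C/C++ (a sketch, not code from the thread; `add4` and `twice` are invented names). The compiler sees ordinary functions, so it is free to inline them, eliminate the common subexpression, and schedule the resulting addps instructions:

```cpp
#include <xmmintrin.h>  // SSE

// To the compiler this is just a normal function; the intrinsic lowers to
// a single addps during code generation, so the optimiser can inline it,
// CSE it, and reschedule it freely.
inline __m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

// c and d are identical expressions; with intrinsics the compiler may
// compute the sum once, something it cannot safely do across opaque
// inline-asm blocks.
inline __m128 twice(__m128 a, __m128 b) {
    __m128 c = add4(a, b);
    __m128 d = add4(a, b);
    return _mm_add_ps(c, d);
}
```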
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:43 PM, Manu wrote:
 There is actually. To the compiler, the intrinsic is a normal function, with
 some hook in the code generator to produce the appropriate opcode when it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the
compiler,
 the optimiser can't do much anymore, because it doesn't know what the inline
asm
 has done, and the code generator just goes and pastes your asm code inline
where
 you told it to. It doesn't know if you've written to aliased variables, called
 functions, etc.. it can no longer safely rearrange code around the inline asm
 block.. which means it's not free to pipeline the code efficiently.

And, in fact, the compiler should not try to optimize inline assembler. The IA is there so that the programmer can hand tweak things without the compiler defeating his attempts.

For example, suppose the compiler schedules instructions for processor X. The programmer writes inline asm to schedule for Y, because the compiler doesn't specifically support Y. The compiler goes ahead and reschedules it for X. Arggh!

What dmd does do with the inline assembler is it keeps track of which registers are read/written, so that effective register allocation can be done for the non-asm code.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:06, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:43 PM, Manu wrote:

 There is actually. To the compiler, the intrinsic is a normal function,
 with
 some hook in the code generator to produce the appropriate opcode when
 it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the
 compiler,
 the optimiser can't do much anymore, because it doesn't know what the
 inline asm
 has done, and the code generator just goes and pastes your asm code
 inline where
 you told it to. It doesn't know if you've written to aliased variables,
 called
 functions, etc.. it can no longer safely rearrange code around the inline
 asm
 block.. which means it's not free to pipeline the code efficiently.

 And, in fact, the compiler should not try to optimize inline assembler.
 The IA is there so that the programmer can hand tweak things without the
 compiler defeating his attempts. For example, suppose the compiler
 schedules instructions for processor X. The programmer writes inline asm
 to schedule for Y, because the compiler doesn't specifically support Y.
 The compiler goes ahead and reschedules it for X. Arggh! What dmd does do
 with the inline assembler is it keeps track of which registers are
 read/written, so that effective register allocation can be done for the
 non-asm code.

And I agree this is exactly correct for the IA... and also why intrinsics must be used to do this work, not IA.
Jan 06 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 4:15 PM, Manu wrote:
 And I agree this is exactly correct for the IA... and also why intrinsics must
 be used to do this work, not IA.

Yup.
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sat, 07 Jan 2012 01:06:21 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:43 PM, Manu wrote:
 There is actually. To the compiler, the intrinsic is a normal function,  
 with
 some hook in the code generator to produce the appropriate opcode when  
 it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the  
 compiler,
 the optimiser can't do much anymore, because it doesn't know what the  
 inline asm
 has done, and the code generator just goes and pastes your asm code  
 inline where
 you told it to. It doesn't know if you've written to aliased variables,  
 called
 functions, etc.. it can no longer safely rearrange code around the  
 inline asm
 block.. which means it's not free to pipeline the code efficiently.

 And, in fact, the compiler should not try to optimize inline assembler.
 The IA is there so that the programmer can hand tweak things without the
 compiler defeating his attempts. For example, suppose the compiler
 schedules instructions for processor X. The programmer writes inline asm
 to schedule for Y, because the compiler doesn't specifically support Y.
 The compiler goes ahead and reschedules it for X. Arggh!

Yes, but that's not what I meant. Consider:

__v128 a = load(1), b = load(2);
__v128 c = add(a, b);
__v128 d = add(a, b);

A valid optimization could be:

__v128 b = load(2);
__v128 a = load(1);
__v128 tmp = add(a, b);
__v128 d = tmp;
__v128 c = tmp;

__v128 load(int v) pure
{
    __v128 res;
    asm (res, v)
    {
        MOVD res, v;
        SHUF res, 0x0000;
    }
    return res;
}

__v128 add(__v128 a, __v128 b) pure
{
    __v128 res = a;
    asm (res, b)
    {
        ADD res, b;
    }
    return res;
}

The compiler might drop evaluation of d and just use the comsub of c. It might also evaluate d before c. The important point is to mark those functions as having no side effects, which can be checked if instructions are classified. Thus the compiler can do all kinds of optimizations at the expression level. After inlining it would look like this:

__v128 b;
asm (b) { MOV b, 2; }
__v128 a;
asm (a) { MOV a, 1; }
__v128 tmp;
asm (a, b, tmp)
{
    MOV tmp, a;
    ADD tmp, b;
}
__v128 c;
asm (c, tmp) { MOV c, tmp; }
__v128 d;
asm (d, tmp) { MOV d, tmp; }

Then it will do the usual register assignment, except that variables must be assigned a register for the asm blocks they are used in. This effectively achieves the same as writing it with intrinsics. It also greatly improves the composition of inline asm.
 What dmd does do with the inline assembler is it keeps track of which  
 registers are read/written, so that effective register allocation can be  
 done for the non-asm code.

Which is why the compiler should be the one to allocate pseudo-registers.
Jan 06 2012
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/07/12 04:27, Martin Nowak wrote:
 __v128 add(__v128 a, __v128 b) pure
 {
     __v128 res = a;
     asm (res, b)
     {
         ADD res, b;
     }
     return res;
 }

 This is effectively achieves the same as writing this with intrinsics.
 It also greatly improves the composition of inline asm.

What it also does is allow mixing "ordinary" asm with the SIMD instructions. People will do that, because it's easier this way (less typing), and then the result is practically unportable, because every compiler would now have to fully understand and support that one asm variant.

If you do "__v128 __simd_add(__v128 a, __v128 b)" instead, you don't lose anything; in fact it could be internally implemented with your asm(). But now the "real" asm code is separate from the more generic (and sometimes even portable) simd ops: the compiler does not need to understand asm() to be able to use it. It can still do every optimization as with the raw asm, and possibly more, as it knows exactly what's going on. The explicit pure annotations are not needed. It has more freedom to choose better scheduling, ordering, sometimes instruction selection (if there's more than one alternative) and even various code transformations. Even CTFE works. Consider the case when a lot of your above add()-like functions are inlined into another one, which will be a common pattern: you don't want any false dependencies. (If you do care about exact instruction scheduling you're writing asm, not D, so for that case asm() is a better choice.)

I wrote "__v128 __simd_add(__v128 a, __v128 b)" above, but that was just to keep things simple. What you actually want is "vfloat4 __simd_add(vfloat4 a, vfloat4 b)" etc. I.e. strongly typed. Whether this needs to go into the compiler itself depends on only one thing: whether it can be done efficiently in a library. Efficiently in this case means "zero-cost" or "free". Having different static types (in addition to the untyped __v(64|128|256) ones) gives you not only security (you don't accidentally end up operating on the wrong data/format because you forgot about some version() combination etc), but also allows things like overloading. Then you can write more generic code, which works with all available formats. And eg changing the precision used by some app module involves only changing a few declarations plus data entry/exit points, not modifying every single SIMD instruction. Untyped __v128 only really works for memcpy() type functions; other than that it is mainly useful for conversions and passing data etc, the cases where you don't care about the content in transit.
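The strongly-typed wrapper idea can be sketched in C++ over the raw intrinsic types (a sketch only; the `vfloat4`/`vdouble2` names are invented for illustration). Overloading then selects the right instruction per format:

```cpp
#include <emmintrin.h>  // SSE2

// Thin typed wrappers over the untyped 128-bit register. The wrapper
// costs nothing at runtime but stops you from accidentally mixing
// formats.
struct vfloat4  { __m128  v; };
struct vdouble2 { __m128d v; };

// Overloading picks addps or addpd from the static type, so generic code
// written against operator+ works for both formats unchanged.
inline vfloat4 operator+(vfloat4 a, vfloat4 b) {
    return { _mm_add_ps(a.v, b.v) };
}
inline vdouble2 operator+(vdouble2 a, vdouble2 b) {
    return { _mm_add_pd(a.v, b.v) };
}
```

Changing a module's precision then means changing a few declarations, as argued above, rather than touching every individual SIMD operation.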
 What dmd does do with the inline assembler is it keeps track of which
registers are read/written, so that effective register allocation can be done
for the non-asm code.

 Which is why the compiler should be the one to allocate pseudo-registers.

Yep. artur
Jan 07 2012
prev sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
simdop will need more overloads, e.g. some
instructions need immediate bytes.
z = simdop(SHUFPS, x, y, 0);

How about this:
__v128 simdop(T...)(SIMD op, T args);
Jan 08 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 5:02 PM, Martin Nowak wrote:
 simdop will need more overloads, e.g. some
 instructions need immediate bytes.
 z = simdop(SHUFPS, x, y, 0);

 How about this:
 __v128 simdop(T...)(SIMD op, T args);

These don't make a lot of sense to return as value, e.g.

__v128 a, b;
a = simdop(movhlps, b); // ???

movhlps moves the top 64 bits of b into the bottom 64 bits of a. Can't be done as an expression like this. Would make more sense to just write the instructions like they appear in asm:

simdop(movhlps, a, b);
simdop(addps, a, b);

etc. The difference between this and inline asm would be:

1. Registers are automatically allocated.
2. Loads/stores are inserted when we spill to stack.
3. Instructions can be scheduled and optimised by the compiler.

We could then extend this with user-defined types:

struct float4
{
    union
    {
        __v128 v;
        float[4] for_debugging;
    }
    float4 opBinary(string op:"+")(float4 rhs) forceinline
    {
        __v128 result = v;
        simdop(addps, result, rhs);
        return float4(result);
    }
}

We'd need a strong guarantee of inlining and removal of redundant loads/stores though for this to work well. We'd also need a guarantee that float4's would get the same treatment as __v128 (as it is the only element).
Jan 08 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 8 January 2012 19:56, Peter Alexander <peter.alexander.au gmail.com>wrote:

 These don't make a lot of sense to return as value, e.g.

 __v128 a, b;
 a = simdop(movhlps, b); // ???

 movhlps moves the top 64-bits of b into the bottom 64-bits of a. Can't be
 done as an expression like this.

The conventional way is to write it like this:

r = simdop(movhlps, a, b);

This allows you to chain the functions together, i.e. passing the result on as an arg.
Jan 08 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 08 Jan 2012 18:56:04 +0100, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 On 8/01/12 5:02 PM, Martin Nowak wrote:
 simdop will need more overloads, e.g. some
 instructions need immediate bytes.
 z = simdop(SHUFPS, x, y, 0);

 How about this:
 __v128 simdop(T...)(SIMD op, T args);

 These don't make a lot of sense to return as value, e.g.

 __v128 a, b;
 a = simdop(movhlps, b); // ???

 movhlps moves the top 64 bits of b into the bottom 64 bits of a. Can't be
 done as an expression like this. Would make more sense to just write the
 instructions like they appear in asm:

 simdop(movhlps, a, b);
 simdop(addps, a, b);

 etc.

Yeah, also thought of this. Having a copy as default would require eliminating them again.
 The difference between this and inline asm would be:

 1. Registers are automatically allocated.

See asm pseudo-registers.
 2. Loads/stores are inserted when we spill to stack.

There are sequencing points before and after asm blocks.
 3. Instructions can be scheduled and optimised by the compiler.

Optimization can be done on IR level. Scheduling is done after all code is emitted.
Jan 08 2012
prev sibling next sibling parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
On 06.01.2012 02:42, Manu wrote:
 I like v128, or something like that. I'll use that for the sake of this
 document. I think it is preferable to float4 for a few reasons...

I do not agree at all. That way, the type loses all semantic information. This is not only breaking with C/C++/D philosophy but actually *hides* an essential hardware detail on Intel SSE: an SSE register is 128 bit, but the processor actually cares about the semantics of the content. There are different commands for loading two doubles, four singles or integers to a register. They all load the same 128 bits from memory into the same register. Anyhow, the specs warn about a performance penalty when loading a register as one type and then using it as another. I do not know the internals of the processor, but my understanding is that the CPU splits the floats into mantissa, exponent and sign already at the moment of loading, and has to drop that information when you reinterpret the bit pattern stored in the register.

A type v128 would not provide the necessary information for the compiler to produce the correct mov statements. There definitely must be a float4 and a double2 type to express these semantics. For integers, I am not quite sure. I believe that integer SSE commands can be mixed more freely, so a single 128bit type would be sufficient.

Considering these hardware details of the SSE architecture alone, I fear that portable low-level support for SIMD is very hard to achieve. If you want to offer access to the raw power of each architecture, it might be simpler to have machine-specific language extensions for SIMD and leave the portability to a wrapper library with a common front-end and various back-ends for the different architectures.
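This distinction shows up directly in the C/C++ intrinsic headers, where the single-, double- and integer views of an XMM register are distinct types and reinterpretation must be spelled out explicitly (a sketch, SSE2 assumed; function names invented):

```cpp
#include <emmintrin.h>  // SSE2

// __m128 (4 x float), __m128d (2 x double) and __m128i (integers) are the
// same 128-bit register underneath, but the type system keeps them apart
// so the compiler can emit the right moves (movaps/movapd/movdqa).
inline __m128d sum_as_doubles(__m128d a, __m128d b) {
    return _mm_add_pd(a, b);     // addpd: operates on the double2 view
}

inline __m128 reinterpret_bits(__m128d d) {
    // A cast, not a conversion: zero instructions, it just tells the
    // compiler (and the reader) that the bits will now be used as floats.
    return _mm_castpd_ps(d);
}

inline __m128 convert_values(__m128d d) {
    // An actual value conversion: cvtpd2ps rounds the two doubles to
    // floats and zeroes the upper two lanes.
    return _mm_cvtpd_ps(d);
}
```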
Jan 12 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/12/2012 12:13 PM, Norbert Nemec wrote:
 A type v128 would not provide the necessary information for the compiler to
 produce the correct mov statements.

 There definitely must be a float4 and a double2 type to express these
semantics.
 For integers, I am not quite sure. I believe that integer SSE commands can be
 mixed more so a single 128bit type would be sufficient.

 Considering these hardware details of the SSE architecture alone, I fear that
 portable low-level support for SIMD is very hard to achieve. If you want to
 offer access to the raw power of each architecture, it might be simpler to have
 machine-specific language extensions for SIMD and leave the portability for a
 wrapper library with a common front-end and various back-ends for the different
 architectures.

That's what we're doing for D's SIMD support. Although the syntax will support any vector type, the semantics will constrain it to what works for the target hardware.

Manu has convinced me that to emulate vector types that don't have hardware support is a very bad idea, because then naive users will assume they'll be getting hardware performance, but in reality will have truly execrable performance.

Note that gcc does do the emulation for unsupported ops (like some of the multiplies). Take a gander at the code generated: instead of one instruction, it's a page of them. I think this will be an unwelcome surprise to the performance minded vector programmer.

Note that explicit emulation will be possible, using D's general purpose vector syntax:

a[] = b[] + c[];
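A concrete instance of this kind of emulation (a sketch, not code from the thread): baseline SSE2 has no packed 32-bit integer multiply (pmulld only arrived with SSE4.1), so the operation has to be assembled from pmuludq and shuffles, which is the sort of expansion a compiler performs silently when it emulates an op the hardware lacks:

```cpp
#include <emmintrin.h>  // SSE2 only

// 4 x 32-bit multiply emulated on SSE2: pmuludq produces 64-bit products
// of the even lanes, so two multiplies plus shuffles are needed where
// SSE4.1 would use a single pmulld.
inline __m128i mul32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                   // lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));   // lanes 1 and 3
    // Repack the low 32 bits of each 64-bit product into one vector.
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}
```

Five instructions instead of one, and this is a mild case; other missing ops expand far worse, which is the "page of them" being described.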
Jan 12 2012
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 12/01/12 8:13 PM, Norbert Nemec wrote:
 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

You are right, but don't forget that the same is true for instructions already in the language. For example, (1 << x) is a very slow operation on PPUs (it's micro-coded). It's simply not possible to be portable and achieve maximum performance for all language features, not just vectors. Algorithms must be tuned for specific architectures in version statements. However, you can get a decent baseline by providing the lowest common denominator in functionality. This v128 type (or whatever it will be called) does that.
Jan 12 2012
parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
On 12.01.2012 23:10, Peter Alexander wrote:
 On 12/01/12 8:13 PM, Norbert Nemec wrote:
 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

 You are right, but don't forget that the same is true for instructions
 already in the language. For example, (1 << x) is a very slow operation
 on PPUs (it's micro-coded). It's simply not possible to be portable and
 achieve maximum performance for any language features, not just vectors.
 Algorithms must be tuned for specific architectures in version
 statements. However, you can get a decent baseline by providing the
 lowest common denominator in functionality. This v128 type (or whatever
 it will be called) does that.

Actually, my essential message is: The single v128 is too simplistic for the SSE architecture. You actually need different types because the compiler needs to know what type is stored in any given register to be able to move it around.
Jan 12 2012
parent reply Manu <turkeyman gmail.com> writes:
On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:

 On 12.01.2012 23:10, Peter Alexander wrote:

 On 12/01/12 8:13 PM, Norbert Nemec wrote:

 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

 You are right, but don't forget that the same is true for instructions
 already in the language. For example, (1 << x) is a very slow operation
 on PPUs (it's micro-coded). It's simply not possible to be portable and
 achieve maximum performance for any language features, not just vectors.
 Algorithms must be tuned for specific architectures in version
 statements. However, you can get a decent baseline by providing the
 lowest common denominator in functionality. This v128 type (or whatever
 it will be called) does that.

 Actually, my essential message is: The single v128 is too simplistic for
 the SSE architecture. You actually need different types because the
 compiler needs to know what type is stored in any given register to be
 able to move it around.

This has already been concluded some days back; the language has a suite of types, just like GCC.
Jan 13 2012
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/13/2012 7:38 AM, Manu wrote:
 On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:


 This has already been concluded some days back; the language has a suite
 of types, just like GCC.

So I would definitely like to help out on the SIMD stuff in some way, as I have a lot of experience using SIMD math to speed up the games I work on. I've got a vectorized set of transcendental functions (currently in the form of MSVC++ intrinsics) for float and double that would be a good start if anyone is interested. Beyond that I just want to help 'make it right' because it's a topic I care a lot about, and is my personal biggest gripe with the language at the moment. I also have experience with VMX, as the two are not exactly the same; it definitely would help to avoid making the code too Intel-centric (though typically the VMX is the more flexible design, as it can do dynamic shuffling based on the contents of the vector registers etc)
Jan 14 2012
parent reply Manu <turkeyman gmail.com> writes:
On 15 January 2012 09:20, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:

 On 1/13/2012 7:38 AM, Manu wrote:

 On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:


 This has already been concluded some days back; the language has a suite
 of types, just like GCC.

 So I would definitely like to help out on the SIMD stuff in some way, as
 I have a lot of experience using SIMD math to speed up the games I work
 on. I've got a vectorized set of transcendental functions (currently in
 the form of MSVC++ intrinsics) for float and double that would be a good
 start if anyone is interested. Beyond that I just want to help 'make it
 right' because it's a topic I care a lot about, and is my personal
 biggest gripe with the language at the moment. I also have experience
 with VMX, as the two are not exactly the same; it definitely would help
 to avoid making the code too Intel-centric (though typically the VMX is
 the more flexible design, as it can do dynamic shuffling based on the
 contents of the vector registers etc)

I too have a long history with VMX, CELL SPU, ARM's VFP/NEON, and others (PSP's VFPU, PS2's VU, SH4), and SSE of course, and with writing the efficient libraries that take all this hardware into consideration. We should compare notes, are you on IRC? :)
Jan 15 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 3:02 AM, Manu wrote:
 On 15 January 2012 09:20, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:
     I also have experience with VMX; as the two are not exactly the same, it
     definitely would help to avoid making the code too Intel-centric (though
     typically VMX is the more flexible design, as it can do dynamic shuffling
     based on the contents of the vector registers etc.)


 I too have a long history with VMX, CELL SPU, ARM's VFP/NEON, and others
 (PSP's VFPU, PS2's VU, SH4), and SSE of course, and with writing efficient
 libraries that take all hardware into consideration. We should compare
 notes, are you on IRC? :)

A nice vector math library for D that puts us competitive will be a nice addition to Phobos.
Jan 15 2012
parent reply JoeCoder <dnewsgroup2 yage3d.net> writes:
On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library (https://bitbucket.org/dav1d/gl3n) might be something good to build on. It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) C++ library, which emulates GLSL vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.
Jan 15 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.
Jan 15 2012
parent reply suicide <suicide xited.de> writes:
Here is mine:
http://suicide.zoadian.de/ext/math/geometry/vector.d
I haven't tested (not even compiled) it yet. It needs polishing, but I have not much time to work on it atm. But you may use it as you wish ;) Any suggestions/improvements are welcome.

Greetings,
Felix



On 16.01.2012 04:00, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.

Jan 16 2012
parent reply "F i L" <witte2008 gmail.com> writes:
On Monday, 16 January 2012 at 17:57:38 UTC, suicide wrote:
 Here is mine:
 http://suicide.zoadian.de/ext/math/geometry/vector.d
 i haven't tested (not even compiled) it yet. It needs 
 polishing, but i have not much time to work on it atm. But you 
 may use it as you wish ;)
 Any suggestions/improvement is welcome.

 Greetings,
 Felix



 Am 16.01.2012, 04:00 Uhr, schrieb Walter Bright 
 <newshound2 digitalmars.com>:

 On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive 
 will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.


Nice start, though it has quite a few issues.

1. for (i; 0 .. D) needs to be: foreach (i; 0 .. D)
2. assert(r != 0) should be done in a contract
3. 'Vector!(D, T)' can be internally used as just 'Vector'
4. instead of making opAdd, opSub, opMul, etc., use opBinary and mixins
5. don't pass vectors as 'ref' unless they are going to be modified
6. for performance, don't pass all values through 'real'

    auto opBinary(string op, U)(U r)
        if (U.sizeof <= T.sizeof && isImplicitlyConvertible!(T, U))
    in { assert(r != 0); }
    body
    {
        Vector nvec = this;
        foreach (i; 0 .. D)
            mixin("nvec.vec[i] " ~ op ~ "= r;");
        return nvec;
    }

    auto opBinary(string op, U)(U r)
        if (U.sizeof > T.sizeof && isImplicitlyConvertible!(U, T))
    in { assert(r != 0); }
    body
    {
        Vector nvec = this;
        foreach (i; 0 .. D)
            mixin("nvec.vec[i] " ~ op ~ "= cast(T) r;");
        return nvec;
    }

    . . . . .

    auto opBinary(string op, V, U)(Vector!(V, U) vec)
        if (U.sizeof <= T.sizeof && isImplicitlyConvertible!(U, T))
    in
    {
        foreach (i; 0 .. V)
            assert(vec.vec[i] != 0);
    }
    body
    {
        Vector nvec = this;
        static if (D <= V)
        {
            foreach (i; 0 .. D)
                mixin("nvec.vec[i] " ~ op ~ "= vec.vec[i];");
        }
        else
        {
            foreach (i; 0 .. V)
                mixin("nvec.vec[i] " ~ op ~ "= vec.vec[i];");
        }
        return nvec;
    }

    // etc...

Something along those lines. Also, make sure you can't create a vector of zero or one length (struct Vector(D, T) if (D >= 2) { ... }). Plus, none of your Vector(D, T) instances will compile because you forgot the '!' mark: Vector!(D, T)
Jan 16 2012
parent reply "F i L" <witte2008 gmail.com> writes:
Whoops! opBinary should be opOpAssign in my examples.
Jan 16 2012
parent "F i L" <witte2008 gmail.com> writes:
On Monday, 16 January 2012 at 19:27:16 UTC, F i L wrote:
 Whoops! opBinary should be opOpAssign in my examples.

wait... no, opBinary was the right one... I'm confusing myself here :D
Jan 16 2012
prev sibling parent reply David <d dav1d.de> writes:
On 16.01.2012 03:54, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into Phobos is a good idea. Why does Phobos, the std. lib, need a vector lib? I haven't seen any other language with something like gl3n in the std. lib. Also, I used my own PEP-8 / C (K&R with spaces) style; it would be a real pain changing this to the Phobos style. One more point is that it's not just a vector lib: it also does matrix and quaternion math, interpolation, and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long: gl3n will have core.simd support soon.
Jan 16 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 1:26 PM, David wrote:
 PS:// I already talked with Manu about this topic, and I don't wait too long,
 gl3n will have core.simd support soon.

Awesome. I was hoping that adding simd support would have an "enabling" effect for a lot of great library improvements!
Jan 16 2012
prev sibling next sibling parent reply Simen Kjærås <simen.kjaras gmail.com> writes:
On Mon, 16 Jan 2012 22:26:55 +0100, David <d dav1d.de> wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a  
 nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib?

To make it all the easier for those who want to make games in D?
 I haven't seen any other language with something like gl3n in the std.  
 lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a  
 real pain changing this to the Phobos style. One more point is, that  
 it's not just a Vector-lib, it also does Matrix-, Quaternion-math,  
 interpolation and implements some other useful mathematical functions  
 (as found in GLSL).
 Of course I am open to a discussion.

IMO, we should have vectors, matrices, quaternions and all those other neat things easily accessible in the language (dual quaternions? Are they used in games?)
 PS:// I already talked with Manu about this topic, and I don't wait too  
 long, gl3n will have core.simd support soon.

Looking forward to it.
Jan 16 2012
next sibling parent reply Danni Coy <danni.coy gmail.com> writes:
 (dual quaternions? Are they
 used in games?)

 yes

Jan 16 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/16/2012 7:21 PM, Danni Coy wrote:
     (dual quaternions? Are they
     used in games?)

 yes

While the GPU tends to do this particular step of the work, the answer in general is 'definitely'. One of the most immediate applications of dual quats was to improve the image quality of joints on characters that twist and rotate at the same time (shoulder blades, wrists, etc), at a minimal increase (or in some cases equivalent) computational cost over older methods. http://isg.cs.tcd.ie/projects/DualQuaternions/
Jan 17 2012
prev sibling parent reply David <d dav1d.de> writes:
On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.
 IMO, we should have vectors, matrices, quaternions and all those other
 neat things easily accessible in the language (dual quaternions? Are they
 used in games?)

The question is: what's the aim of Phobos? Including everything? Including the basics, so you can implement nearly everything on top of it (similar to Python's)? Or being as minimalistic as possible? I get the impression Phobos tries to include everything possible; yeah, I am pointing at curl (wtf, a C lib, not even a D implementation; also you need libcurl now to get Phobos compiled (correct me if I am wrong here)). IMHO a std. lib should include just the basics, so you can build on top of it. Including gl3n into Phobos would be a real honor for me, but is the goal of Phobos to include everything and ship a fat lib (lol, reminds me a bit of Boost) with dmd, or even call it the power of D, an overloaded std. lib?
Jan 17 2012
parent reply Simen Kjærås <simen.kjaras gmail.com> writes:
On Tue, 17 Jan 2012 13:52:01 +0100, David <d dav1d.de> wrote:

 On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.

Or any one of a 100 other places, with incompatible implementations? Vectors and matrices are low enough level that people generally won't need to write their own to match *their* exact use case. That makes them prime stdlib material.
Jan 17 2012
parent Danni Coy <danni.coy gmail.com> writes:
+1

On Wed, Jan 18, 2012 at 4:35 AM, Simen Kjærås <simen.kjaras gmail.com> wrote:

 On Tue, 17 Jan 2012 13:52:01 +0100, David <d dav1d.de> wrote:

  On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.

Or any one of a 100 other places, with incompatible implementations? Vectors and matrices are low enough level that people generally won't need to write their own to match *their* exact use case. That makes them prime stdlib material.

Jan 17 2012
prev sibling next sibling parent reply JoeCoder <dnewsgroup2 yage3d.net> writes:
On 1/16/2012 4:26 PM, David wrote:
 Why does phobos, the std. lib, need a vector-lib? I haven't seen any
 other language with something like gl3n in the std.

I guess this depends on the goals for phobos. Is it minimal, or batteries included? As for not seeing it in other languages, I don't think there are very many low-level, high-performance languages that take a batteries-included approach to the standard library. I'd argue that a std.math3d would be used just as much, if not more, than std.complex.
Jan 16 2012
parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 22:32, JoeCoder <dnewsgroup2 yage3d.net> wrote:
 On 1/16/2012 4:26 PM, David wrote:
 Why does phobos, the std. lib, need a vector-lib? I haven't seen any
 other language with something like gl3n in the std.

 I guess this depends on the goals for phobos.  Is it minimal, or batteries
 included?  As for not seeing it in other languages, I don't think there's
 very many low-level, high performance languages that take a
 batteries-included approach to the standard library.

 I'd argue that a std.math3d would be used just as much, if not more, than
 std.complex.

Since when did people use std.complex? :~)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
prev sibling parent reply Kiith-Sa <42 theanswer.com> writes:
David wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).
Jan 16 2012
parent reply David <d dav1d.de> writes:
On 17.01.2012 05:31, Kiith-Sa wrote:
 David wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).

AABBs are also planned for gl3n.
Jan 17 2012
parent Manu <turkeyman gmail.com> writes:
On 17 January 2012 14:43, David <d dav1d.de> wrote:

  On 17.01.2012 05:31, Kiith-Sa wrote:

  David wrote:
  On 16.01.2012 03:54, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:

 A nice vector math library for D that puts us competitive will be a
 nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitly possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS:// I already talked with Manu about this topic, and I don't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).

 AABBs are also planned for gl3n.

Yeah, I probably wouldn't put anything that high-level in a standard library; everyone will want a slightly different flavour. I think linear algebra with vectors, matrices, and quats is about the fair extent of a std lib. That stuff is pretty undebatable, but beyond that it starts getting very subjective or context-specific. Better left for higher-level libraries that may also integrate with renderers/physics systems/etc.
Jan 17 2012
prev sibling parent reply Mehrdad <wfunction hotmail.com> writes:
In case this is at all helpful...
[see attached]
Jan 14 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2012 2:11 AM, Mehrdad wrote:
 In case this is at all helpful...
 [see attached]

Hope you like the new simd compiler stuff.
Jan 14 2012