
digitalmars.D - SIMD support...

reply Manu <turkeyman gmail.com> writes:
So I've been hassling about this for a while now, and Walter asked me to
pitch an email detailing a minimal implementation with some initial
thoughts.

The first thing I'd like to say is that a lot of people seem to have this
idea that float[4] should be specialised as a candidate for simd
optimisations somehow. It's obviously been discussed, and this general
opinion seems to be shared by a good few people here.
I've had a whole bunch of rants why I think this is wrong in other threads,
so I won't repeat them here... and that said, I'll attempt to detail an
approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's
all.


-- What shall we call it? --

Doesn't really matter... open to suggestions.
VisualC calls it __m128, XBox360 calls it __vector4, and GCC calls it 'vector
float' (a name I particularly hate: it specifies no size, and it tries to
associate the type with a specific element type).

I like v128, or something like that. I'll use that for the sake of this
document. I think it is preferable to float4 for a few reasons:
 * v128 says what the register intends to be, a general purpose 128bit
register that may be used for a variety of simd operations that aren't
necessarily type bound.
 * float4 implies it is a specific 4 component float type, which is not
what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4,
(u)short8, etc should also exist, and it also stands to reason that one
might expect math operators and such to be defined...

I suggest initial language definition and implementation of something like
v128, and then types like float4, (u)int4, etc, may be implemented in the
std library with complex behaviour like casting mechanics, and basic math
operators...


-- Alignment --

This type needs to be 16byte aligned. Unaligned loads/stores are very
expensive, and also tend to produce extremely costly LHS hazards on most
architectures when accessing vectors in arrays. If they are not aligned,
they are useless... honestly.

** Does this cause problems with class allocation? Are/can classes be
allocated to an alignment as inherited from an aligned member? ... If not,
this might be the bulk of the work.

There is one other problem I know of that is only of concern on x86.
In the C ABI, passing 16byte ALIGNED vectors by value is a problem,
since x86 ALWAYS uses the stack to pass arguments, and has no way to align
the stack.
I wonder if D can get creative with its ABI here, passing vectors in
registers, even though that's not conventional on x86... the C ABI was
invented long before these hardware features.
In lieu of that, x86 would (sadly) need to silently pass by const ref...
and also do this in the case of register overflow.

Every other architecture (including x64) is fine, since all other
architectures pass in regs, and can align the stack as needed when
overflowing the regs (since stack management is manual and not performed
with special opcodes).


-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the
compiler allocating SIMD registers, managing assignments, loads, and
stores, and allows passing to/from functions BY VALUE in registers.
I.e. the only valid operations would be:
  v128 myVec = someStruct.vecMember; // and vice versa...
  v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy
functions that copy 16 bytes at a time... ;)
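The fast-memcpy idea can be sketched today in C with SSE2 intrinsics (the function name and the aligned/multiple-of-16 preconditions are mine, standing in for what an aligned v128 type would guarantee):

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes, 16 at a time. Assumes dst and src are 16-byte aligned
   and n is a multiple of 16 -- the same preconditions an aligned v128
   type would give you for free. */
static void copy16(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i)); /* aligned movdqa pair */
}
```

Note the aligned load/store forms: with unaligned pointers this would fault, which is precisely why the alignment guarantee matters.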


-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or
architecture intrinsics to do useful stuff. This would be using the
hardware totally raw, which is an important feature to have, but I imagine
most of the good stuff would come from libraries built on top of this.


-- Literal assignment --

This is a hairy one. Endian issues appear in 2 layers here...
Firstly, if you consider the vector to be 4 int's, the ints themselves may
be little or big endian, but in addition, the outer layer (ie. the order of
x,y,z,w) may also be in reverse order on some architectures... This makes a
single 128bit hex literal hard to apply.
I'll have a dig and try and confirm this, but I have a suspicion that VMX
defines its components reverse to other architectures... (Note: not usually
a problem in C, because vector code is sooo non-standard in C that this is
ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order
can suit)
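The inner layer of the problem can be observed directly on x86; a small C sketch (x86/little-endian specific, and the function name is mine -- the VMX ordering question above has to be checked on that hardware):

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>
#include <string.h>

/* On x86 (little-endian), the lane written as element 0 by
   _mm_setr_epi32 occupies the lowest-addressed 4 bytes, and within
   that lane the least significant byte comes first. Those are the
   two ordering layers described above; other ISAs may differ in
   either layer. */
static void lane_bytes(uint8_t out[16])
{
    __m128i v = _mm_setr_epi32(0x01234567, 0, 0, 0);
    memcpy(out, &v, 16);
}
```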

For the primitive v128 type, I generally like the idea of using a huge
128bit hex literal.
  v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to
use syntax like this:
  v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not be linearly applicable to all hardware. If
the order of the components match the endian, then it is fine...
I suspect VMX orders the components reverse to match the fact the values
are big endian, which would be good, but I need to check. And if not...
then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important; it's
common to generate bit masks in these registers which are type-independent.
I also guess libraries will still need to leverage this primitive assignment
functionality to implement their more complex literal expressions.
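As a rough illustration in C with SSE2 intrinsics (abs4 is an invented name): a type-independent bit mask built from raw integer bits and applied to floats, the kind of typeless literal use described above:

```c
#include <emmintrin.h> /* SSE2 */

/* Absolute value of 4 packed floats by masking off the sign bits.
   The mask is expressed as raw integer bits and then reinterpreted
   as floats -- no value conversion takes place. */
static __m128 abs4(__m128 v)
{
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_and_ps(v, mask);
}
```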


-- Libraries --

With this type, we can write some useful standard libraries. For a start,
we can consider adding float4, int4, etc, and make them more intelligent...
they would have basic maths operators defined, and probably implement type
conversion when casting between types.

  int4 intVec = floatVec; // performs a type conversion from float to int, or vice versa... (perhaps we make this require an explicit cast?)

  v128 vec = floatVec; // implicit cast to the raw type is always possible, and does no type conversion, just a reinterpret
  int4 intVec = vec; // conversely, the primitive type would implicitly assign to other types
  int4 intVec = cast(v128)floatVec; // piping through the primitive v128 allows one to easily perform a reinterpret between vector types, rather than the usual type conversion
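The conversion-vs-reinterpret distinction can be sketched with SSE2 intrinsics in C (function names are illustrative): `_mm_cvtps_epi32` converts values, while `_mm_castps_si128` merely relabels the bits, which is the role the raw v128 plays above:

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Value conversion: rounds each float to the nearest int32. */
static __m128i convert4(__m128 f) { return _mm_cvtps_epi32(f); }

/* Reinterpret: same 128 bits, just viewed as integers. */
static __m128i reinterpret4(__m128 f) { return _mm_castps_si128(f); }
```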

There are also a truckload of other operations that would be fleshed out.
For instance, strongly typed literal assignment, vector comparisons that
can be used with if() (usually these allow you to test if ALL components,
or if ANY components meet a given condition). Conventional logic operators
can't be neatly applied to vectors. You need to do something like this:
  if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
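A sketch of how such all/any predicates reduce to hardware on SSE (C with intrinsics; the function names mirror the hypothetical std.simd calls above, not a real library):

```c
#include <xmmintrin.h> /* SSE */

/* movmskps packs the sign bit of each comparison lane into the low
   4 bits of a scalar; "all" and "any" then reduce to comparing that
   mask against 0xF and 0 respectively. */
static int allGreater(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpgt_ps(a, b)) == 0xF;
}

static int anyLessOrEqual(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmple_ps(a, b)) != 0;
}
```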

We can discuss the libraries at a later date, but it's possible that you
might also want to make some advanced functions in the library that are
only supported on particular architectures, std.simd.sse...,
std.simd.vmx..., etc. which may be version()-ed.


-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various
behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should be default behaviour?
I'll bet the D community will opt for strict NaNs and throw by default... but
it is actually VERY common to disable hardware exceptions when working with
SIMD code:
  * often precision is less important than speed when using SIMD, and some
SIMD units perform faster when these features are disabled.
  * most SIMD algorithms (at least in performance oriented code) are
designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
other error condition.
  * realtime physics tends to suffer error creep and freaky random
explosions, and you can't have those crashing the program :) .. they're not
really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some
mechanism to change the SIMD unit flags for realtime use... A runtime
function? Perhaps a compiler switch (C does this sort of thing a lot)?

It's also worth noting that there are numerous SIMD units out there that
DON'T follow strict ieee float rules, and don't support NaNs or hardware
exceptions at all... others may simply set a divide-by-zero flag, but not
actually trigger a hardware exception, requiring you to explicitly check
the flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is
unsupported on such platforms? What are the implications of this?
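On SSE specifically, this flag-only behaviour can be observed through the MXCSR control/status register; a small C sketch (assuming the default MXCSR, where all exceptions are masked):

```c
#include <xmmintrin.h> /* SSE: MXCSR access */

/* With exceptions masked (the power-on default), dividing by zero
   does not trap: it produces +inf and sets a sticky status flag
   that you must poll explicitly. */
static int divzero_sets_flag_only(void)
{
    volatile float one = 1.0f, zero = 0.0f; /* defeat constant folding */
    float out[4];
    _mm_setcsr(_mm_getcsr() & ~_MM_EXCEPT_MASK); /* clear sticky flags */
    _mm_storeu_ps(out, _mm_div_ps(_mm_set1_ps(one), _mm_set1_ps(zero)));
    one = out[0];                                /* keep the divide live */
    return (_mm_getcsr() & _MM_EXCEPT_DIV_ZERO) != 0;
}
```

Unmasking the exception bits in MXCSR is how a runtime switch to strict, trapping behaviour would be implemented on this unit.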


-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256
type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to
int, or double is to float. They are different types with different
register allocation and addressing semantics, and deserve a distinct type.
As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512bit (4x4 matrix) registers...
same story; implement a primitive type, then using intrinsics, we can build
interesting types in libraries.

We may also consider a v64 type, which would map to the older MMX registers on
x86... there are also other architectures with 64-bit 'vector' registers
(the Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc...
Same general concept, but only 64 bits wide.


-- Conclusion --

I think that's about it for a start. I don't think it's particularly a lot
of work; the potential trouble points are 16-byte alignment, literal
expression, and potential issues relating to language guarantees of
exception/error conditions...
Go on, tear it apart!

Discuss...
Jan 05 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 The first thing I'd like to say is that a lot of people seem to have this idea
 that float[4] should be specialised as a candidate for simd optimisations
 somehow. It's obviously been discussed, and this general opinion seems to be
 shared by a good few people here.
 I've had a whole bunch of rants why I think this is wrong in other threads, so
I
 won't repeat them here...

If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.
Jan 05 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 04:12, Walter Bright <newshound2 digitalmars.com> wrote:

 If you could cut&paste them here, I would find it most helpful. I have
 some ideas on making that work, but I need to know everything wrong with it
 first.

On 5 January 2012 11:02, Manu <turkeyman gmail.com> wrote:
 On 5 January 2012 02:42, bearophile <bearophileHUGS lycos.com> wrote:

 Think about future CPU evolution with SIMD registers 128, then 256, then
 512, then 1024 bits long. In theory a good compiler is able to use them
 with no changes in the D code that uses vector operations.

These are all fundamentally different types, like int and long, float and double... and I certainly want a keyword to identify each of them. Even if the compiler is trying to make auto vector optimisations, you can't deny programmers explicit control of the hardware when they want/need it.

Look at x86 compilers: they've been TRYING to perform automatic SSE optimisations for 10 years, with basically no success... do you really think you can do better than all that work by Microsoft and GCC? In my experience, I've even run into a lot of VC's auto-SSE-ed code that is SLOWER than the original float code. Let's not even mention architectures that receive much less love than x86, and are arguably more important (ARM: slower, simpler processors with more demand to perform well, and not waste power).

... Vector ops and SIMD ops are different things. float[4] (or more
 realistically, float[3]) should NOT be a candidate for automatic SIMD
 implementation, likewise, simd_type should not have its components
 individually accessible. These are operations the hardware can not actually
 perform. So no syntax to worry about, just a type.


 I think the good Hara will be able to implement those syntax fixes in a
 matter of just one day or very few days if a consensus is reached about
 what actually is to be fixed in D vector ops syntax.

 Instead of discussing about *adding* something (register intrinsics) I
 suggest to discuss about what to fix about the *already present* vector op
 syntax. This is not a request to just you Manu, but to this whole newsgroup.

And I think this is exactly the wrong approach. A vector is NOT an array of 4 (actually, usually 3) floats. It should not appear as one. This is an overly complicated and ultimately wrong way to engage this hardware. Imagine the complexity in the compiler to try and force float[4] operations into vector arithmetic vs adding a 'v128' type which actually does what people want anyway... What about when you actually WANT a float[4] array, and NOT a vector?

SIMD units are not float units, and they should not appear like an aggregation of float units. They have:
 * Different error semantics, exception handling rules, sometimes different precision...
 * Special alignment rules.
 * Special literal expression/assignment.
 * You can NOT access individual components at will.
 * May be reinterpreted at any time as float[1], float[4], double[2], short[8], char[16], etc... (up to the architecture intrinsics)
 * Can not be involved in conventional comparison logic (an array of floats would make you think they could)
 *** Can NOT interact with the regular 'float' unit... Vectors as an array of floats certainly suggests that you can interact with scalar floats...

I will use architecture intrinsics to operate on these regs, and put that nice and neatly behind a hardware vector type with version()'s for each architecture, and an API with a whole lot of sugar to make them nice and friendly to use.

My argument is that even IF the compiler some day attempts to make vector optimisations to float[4] arrays, the raw hardware should be exposed first, allowing programmers to use it directly. This starts with a language-defined (platform-independent) v128 type.

... Other rants have been on IRC.
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to pitch
 an email detailing a minimal implementation with some initial thoughts.

Another question: Is this worth doing for 32 bit code? Or is anyone doing this doing it for 64 bit only? The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 32 bit code is inefficient for everything else.
Jan 05 2012
parent Manu <turkeyman gmail.com> writes:
 The reason I ask is because 64 bit is 16 byte aligned, but aligning the
 stack in 32 bit code is inefficient for everything else.

Note: you only need to align the stack when a vector is actually stored on it by value. Probably very rare, more rare than you think.
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 -- Alignment --

 This type needs to be 16byte aligned. Unaligned loads/stores are very
expensive,
 and also tend to produce extremely costly LHS hazards on most architectures
when
 accessing vectors in arrays. If they are not aligned, they are useless...
honestly.

 ** Does this cause problems with class allocation? Are/can classes be allocated
 to an alignment as inherited from an aligned member? ... If not, this might be
 the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
Jan 05 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 04:16, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:

 -- Alignment --

 This type needs to be 16byte aligned. Unaligned loads/stores are very
 expensive,
 and also tend to produce extremely costly LHS hazards on most
 architectures when
 accessing vectors in arrays. If they are not aligned, they are useless...
 honestly.

 ** Does this cause problems with class allocation? Are/can classes be
 allocated
 to an alignment as inherited from an aligned member? ... If not, this
 might be
 the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.

It's important for all implementations of simd units, x32, x64, and others. As said, if aligning the x32 stack is too much trouble, I suggest silently passing by const ref on x86.

Are you talking about parameter passing, or local variable assignment on the stack? For parameter passing, I understand the x32 problems with aligning the arguments (I think it's possible to work around though), but there should be no problem with aligning the stack for allocating local variables.
Jan 05 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 6:25 PM, Manu wrote:
 Are you talking about for parameter passing, or for local variable assignment
on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
arguments
 (I think it's possible to work around though), but there should be no problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
Jan 05 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 05:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:

 Are you talking about for parameter passing, or for local variable
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
 arguments
 (I think it's possible to work around though), but there should be no
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

Perhaps I misunderstand, I can't see the problem? In the function preamble, you just align it... something like:

    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

    ... function

    mov esp, reg ; restore the stack pointer
    ret 0
Jan 05 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 7:42 PM, Manu wrote:
 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

 ... function

    mov esp, reg ; restore the stack pointer
    ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.
Jan 05 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 07:22:55 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 7:42 PM, Manu wrote:
 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
    mov reg, esp ; take a backup of the stack pointer
    and esp, -16 ; align it

 ... function

    mov esp, reg ; restore the stack pointer
    ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.

Aah, I knew there was something that wouldn't work. One could possibly change from RBP-relative addressing to RSP-relative addressing for the inner variables, but that would fail with alloca. So this won't work without a second frame register, does it?

Manu: Instead of using the RegionAllocator you could write an aligning allocator using alloca memory. This will be about the closest you get to that magic compiler alignment.
Jan 06 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 7:42 PM, Manu wrote:

 Perhaps I misunderstand, I can't see the problem?
 In the function preamble, you just align it... something like:
   mov reg, esp ; take a backup of the stack pointer
   and esp, -16 ; align it

 ... function

   mov esp, reg ; restore the stack pointer
   ret 0

And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.

Hehe, true, but not insurmountable. Scheduling of parameter pops before you perform the alignment may solve that straight up, or else don't align esp itself; store the vector to the stack through some other aligned reg copied from esp...

I just wrote some test functions using __m128 in VisualC; it seems to do something in between the simplicity of my initial suggestion and my refined ideas above :) If you have VisualC, check out what it does; it's very simple, looks pretty good, and I'm sure it's optimal (MS have enough R&D money to assure this). I can paste some disassemblies if you don't have VC...
Jan 06 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Manu:
 
 I can paste some disassemblies if you don't have VC...

Pasting it is useful for all other people reading this thread too, like me. Bye, bearophile
Jan 06 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 6:05 AM, Manu wrote:
 On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/5/2012 7:42 PM, Manu wrote:

         Perhaps I misunderstand, I can't see the problem?
         In the function preamble, you just align it... something like:
            mov reg, esp ; take a backup of the stack pointer
            and esp, -16 ; align it

         ... function

            mov esp, reg ; restore the stack pointer
            ret 0


     And now you cannot access the function's parameters anymore, because the
     stack offset for them is now variable rather than fixed.


 Hehe, true, but not insurmountable. Scheduling of parameter pops before you
 perform the alignment may solve that straight up, or else don't align esp its
 self; store the vector to the stack through some other aligned reg copied from
 esp...

 I just wrote some test functions using __m128 in VisualC, it seems to do
 something in between the simplicity of my initial suggestion, and my refined
 ideas one above :)
 If you have VisualC, check out what it does, it's very simple, looks pretty
 good, and I'm sure it's optimal (MS have enough R&D money to assure this)

 I can paste some disassemblies if you don't have VC...

I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewriting:

    v128 v;
    v = x;

with:

    v128 v;        // goes in aligned stack
    v128 *pv = &v; // pv is in regular stack
    *pv = x;

but there are still complexities with it, like spilling aligned temps to the stack.
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:53, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 6:05 AM, Manu wrote:

 On 6 January 2012 08:22, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 **digitalmars.com <newshound2 digitalmars.com>>>
 wrote:

    On 1/5/2012 7:42 PM, Manu wrote:

        Perhaps I misunderstand, I can't see the problem?
        In the function preamble, you just align it... something like:
           mov reg, esp ; take a backup of the stack pointer
           and esp, -16 ; align it

        ... function

           mov esp, reg ; restore the stack pointer
           ret 0


    And now you cannot access the function's parameters anymore, because
 the
    stack offset for them is now variable rather than fixed.


 Hehe, true, but not insurmountable. Scheduling of parameter pops before
 you
 perform the alignment may solve that straight up, or else don't align esp
 its
 self; store the vector to the stack through some other aligned reg copied
 from
 esp...

 I just wrote some test functions using __m128 in VisualC, it seems to do
 something in between the simplicity of my initial suggestion, and my
 refined
 ideas one above :)
 If you have VisualC, check out what it does, it's very simple, looks
 pretty
 good, and I'm sure it's optimal (MS have enough R&D money to assure this)

 I can paste some disassemblies if you don't have VC...

I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewriting:

    v128 v;
    v = x;

with:

    v128 v;        // goes in aligned stack
    v128 *pv = &v; // pv is in regular stack
    *pv = x;

but there are still complexities with it, like spilling aligned temps to the stack.

I think we should take this conversation to IRC, or a separate thread? I'll generate some examples from VC for you in various situations. If you can write me a short list of trouble cases as you see them, I'll make sure to address them specifically... Have you tested the code that GCC produces? I'm sure it'll be identical to VC...

That said, how do you currently support ANY aligned type? I thought align(n) was a defined keyword in D?
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:08 AM, Manu wrote:
 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical to
VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.
 That said, how do you currently support ANY aligned type? I thought align(n)
was
 a defined keyword in D?

Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less. The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:34, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 11:08 AM, Manu wrote:

 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you
 can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical
 to VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.

...I'm using DMD on Windows... x32. So this isn't ideal ;) Although with this change, Iain should be able to expose the vector types in GDC, and I can work from there, and hopefully even build an ARM/PPC toolchain to experiment with the library in a cross-platform environment.

 That said, how do you currently support ANY aligned type? I thought
 align(n) was a defined keyword in D?

Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less. The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.

... this sounds bad. Shall I start another thread? ;)

So you're saying it's impossible to align a stack based buffer to, say, 128 bytes...? This is another fairly important daily requirement of mine (that I assumed was currently supported). Aligning buffers to cache lines is common, and is required for many optimisations.

Hopefully the work you do to support 16-byte alignment on x86 will also support arbitrary alignment of any buffer... Will arbitrary alignment be supported on x64? What about GCC? Will/does it support arbitrary alignment?
Jan 06 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:53 AM, Manu wrote:
 ... this sounds bad. Shall I start another thread? ;)
 So you're saying it's impossible to align a stack based buffer to, say, 128
 bytes... ?

No, it's not impossible. Here's what you can do now:

    char[128+127] buf;
    char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);

and now pbuf points to 128 bytes, aligned, on the stack.
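The same over-allocate-and-mask trick, generalised and testable in C (align_up is my name for it; the alignment must be a power of two):

```c
#include <stdint.h>

/* Round p up to the next 'align'-byte boundary (align must be a
   power of two). Over-allocating by align-1 bytes guarantees the
   rounded pointer still lies inside the buffer. */
static char *align_up(char *p, uintptr_t align)
{
    return (char *)(((uintptr_t)p + (align - 1)) & ~(align - 1));
}
```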
 Hopefully the work you do to support 16byte alignment on x86 will also support
 arbitrary alignment of any buffer...
 Will arbitrary alignment be supported on x64?

Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.
Jan 06 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 21:16:40 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/6/2012 11:53 AM, Manu wrote:
 ... this sounds bad. Shall I start another thread? ;)
 So you're saying it's impossible to align a stack based buffer to, say,  
 128
 bytes... ?

No, it's not impossible. Here's what you can do now:

    char[128+127] buf;
    char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);

and now pbuf points to 128 bytes, aligned, on the stack.
 Hopefully the work you do to support 16byte alignment on x86 will also  
 support
 arbitrary alignment of any buffer...
 Will arbitrary alignment be supported on x64?

Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.

Only recently (4.6 I think).
Jan 06 2012
prev sibling parent FeepingCreature <default_357-line yahoo.de> writes:
On 01/06/12 21:16, Walter Bright wrote:
 Aligning to non-powers of 2 will never work. As for other alignments, they
only will work if the underlying storage is aligned to that or greater.
Otherwise, you'll have to resort to the method outlined above.
 
 
 What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.

GCC keeps the stack 16-byte aligned by default.
Jan 07 2012
prev sibling parent reply "Trass3r" <un known.com> writes:
On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC 
 toolchain to experiment with the library in a cross platform 
 environment.

On Windoze? You're a masochist ^^
Jan 06 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Trass3r wrote:
 On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

On Windoze? You're a masochist ^^

Windows 8 will support ARM. I hope that D will too.
Jan 11 2012
parent reply Danni Coy <danni.coy gmail.com> writes:
I was rather under the impression that only the new HTML5 API would be available
under Windows 8 ARM - that they were doing an iOS walled-garden type thing with
it - if true, this could make things difficult...

On Wed, Jan 11, 2012 at 9:42 PM, Piotr Szturmaj <bncrbme jadamspam.pl>wrote:

 Trass3r wrote:

 On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:

 Iain should be able to expose the vector types in GDC,
 and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

On Windoze? You're a masochist ^^

Windows 8 will support ARM. I hope that D will too.

Jan 11 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Danni Coy wrote:
 I was rather under that only the new html5 api would be available under
 windows 8 arm - that they were doing a iOS walled garden type thing with
 it - if true this could make things difficult...

http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom

"[...] And you have your choice of world-class development tools and languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64 and ARM. This is an extremely important point: If you go and build your Metro style app in JavaScript and HTML, in C# or in XAML, that app will just run when there's ARM hardware available. So, you don't have to worry about that. Just write your application in HTML5, JavaScript and C# and XAML and your application runs across all the hardware that Windows 8 supports. (Applause.) And if you want to write native code, we're going to help you do that as well and make it so that you can cross-compile into the other platforms as well. So, full platform support with these Metro style applications."

It means Win8 ARM will be limited to Metro apps only, but you will be able to choose HTML/CSS/JS, .NET or native code.
Jan 11 2012
parent Alex Rønne Petersen <xtzgzorex gmail.com> writes:
On 11-01-2012 13:23, Piotr Szturmaj wrote:
 Danni Coy wrote:
 I was rather under that only the new html5 api would be available under
 windows 8 arm - that they were doing a iOS walled garden type thing with
 it - if true this could make things difficult...

http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom

"[...] And you have your choice of world-class development tools and languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64 and ARM. This is an extremely important point: If you go and build your Metro style app in JavaScript and HTML, in C# or in XAML, that app will just run when there's ARM hardware available. So, you don't have to worry about that. Just write your application in HTML5, JavaScript and C# and XAML and your application runs across all the hardware that Windows 8 supports. (Applause.) And if you want to write native code, we're going to help you do that as well and make it so that you can cross-compile into the other platforms as well. So, full platform support with these Metro style applications."

It means Win8 ARM will be limited to Metro apps only, but you will be able to choose HTML/CSS/JS, .NET or native code.

If they have ported the Common Language Runtime to ARM, I doubt they would put some arbitrary limitation on what apps can run on that hardware. All things considered, AArch32/64 are coming soon. Besides, Windows running on ARM is not a new thing; see Windows Mobile and Windows Phone 7. By now, their ARM support should be as good as their x86 support. - Alex
Jan 11 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/06/12 20:53, Manu wrote:
 What about GCC? Will/does it support arbitrary alignment?

For sane "arbitrary" values (ie powers of two) it looks like this:

--------------------------------
import std.stdio;

struct S {
   align(65536) ubyte[64] bs;
   alias bs this;
}

pragma(attribute, noinline) void f(ref S s) { s[2] = 42; }

void main(string[] argv) {
   S s = void;
   f(s);
   writeln(s.ptr);
}
---------------------------------

turns into:

---------------------------------
 804ae40:  55                    push   %ebp
 804ae41:  89 e5                 mov    %esp,%ebp
 804ae43:  66 bc 00 00           mov    $0x0,%sp
 804ae47:  81 ec 00 00 01 00     sub    $0x10000,%esp
 804ae4d:  89 e0                 mov    %esp,%eax
 804ae4f:  e8 2c 0e 00 00        call   804bc80 <void align.f(ref align.S)>
 804ae54:  89 e0                 mov    %esp,%eax
 804ae56:  e8 c5 ff ff ff        call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae5b:  31 c0                 xor    %eax,%eax
 804ae5d:  c9                    leave
 804ae5e:  c3                    ret
 804ae5f:  90                    nop
---------------------------------

specifying a more sane alignment of 64 gives:

---------------------------------
0804ae40 <_Dmain>:
 804ae40:  55                    push   %ebp
 804ae41:  89 e5                 mov    %esp,%ebp
 804ae43:  83 e4 c0              and    $0xffffffc0,%esp
 804ae46:  83 ec 40              sub    $0x40,%esp
 804ae49:  89 e0                 mov    %esp,%eax
 804ae4b:  e8 30 0e 00 00        call   804bc80 <void align.f(ref align.S)>
 804ae50:  89 e0                 mov    %esp,%eax
 804ae52:  e8 c9 ff ff ff        call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae57:  31 c0                 xor    %eax,%eax
 804ae59:  c9                    leave
 804ae5a:  c3                    ret
---------------------------------
Jan 06 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 21:34, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/6/2012 11:08 AM, Manu wrote:
 I think we should take this conversation to IRC, or a separate thread?
 I'll generate some examples from VC for you in various situations. If you
 can
 write me a short list of trouble cases as you see them, I'll make sure to
 address them specifically...
 Have you tested the code that GCC produces? I'm sure it'll be identical
 to VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.

...I'm using DMD on windows... x32. So this isn't ideal ;) Although with this change, Iain should be able to expose the vector types in GDC, and I can work from there, and hopefully even build an ARM/PPC toolchain to experiment with the library in a cross platform environment.

And will also allow me to tap into many vector intrinsics that gcc offers too, via the gcc.builtins module. :)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector

types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?
Jan 06 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 6 January 2012 22:37, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector
 types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform
 environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?

For backend intrinsics, they are all functions that map to asm instructions of the same name, e.g. __builtin_ia32_addps.

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 00:47, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 6 January 2012 22:37, Manu <turkeyman gmail.com> wrote:
 On 6 January 2012 23:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 6 January 2012 19:53, Manu <turkeyman gmail.com> wrote:
 ...I'm using DMD on windows... x32. So this isn't ideal ;)
 Although with this change, Iain should be able to expose the vector
 types in
 GDC, and I can work from there, and hopefully even build an ARM/PPC
 toolchain to experiment with the library in a cross platform
 environment.

And will also allow me to tap into many vector intrinsics that gcc offers too via the gcc.builtins; module. :)

Huzzah! ... Like what?

For backend intrinsics, they are all functions that map to asm instructions of the same name, ie: __builtin_ia32_addps.

Ah yeah, perfect.. obviously we need all of those for this vector type to be of any use at all ;)
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 05:42, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 05:22, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:

 Are you talking about for parameter passing, or for local variable
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the
 arguments
 (I think it's possible to work around though), but there should be no
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

Perhaps I misunderstand, I can't see the problem? In the function preamble, you just align it... something like:

  mov reg, esp ; take a backup of the stack pointer
  and esp, -16 ; align it

  ... function

  mov esp, reg ; restore the stack pointer
  ret 0

That said, most of the time values used in smaller functions will only ever exist in regs, and won't ever be written to the stack... in this case, there's no problem.
Jan 05 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 04:22:41 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 6:25 PM, Manu wrote:
 Are you talking about for parameter passing, or for local variable  
 assignment on
 the stack?
 For parameter passing, I understand the x32 problems with aligning the  
 arguments
 (I think it's possible to work around though), but there should be no  
 problem
 with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.

extending

  push RBP;
  mov RBP, RSP;
  sub RSP, localStackSize;

to

  push RBP;
  // new
  mov RAX, RSP;
  and RAX, localAlignMask;
  sub RSP, RAX;
  // new
  mov RBP, RSP;
  sub RSP, localStackSize;

should do the trick. This would require using the biggest align attribute of all stack variables for localAlignMask. Also, align needs to be rounded up to a power of 2 if it isn't one already.

------------
RBP +  0   int a;
RBP +  4   int b;
           24 byte padding
RBP + 32   align(32) struct float8 { float[8] v; } s;
------------
Jan 05 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make it possible to write tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16
Jan 06 2012
next sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.

 2. Even trying to do something with + is fraught with peril, as integer adds
 with SIMD can be saturated or unsaturated.

 3. Trying to build all the details about how each of the various adds and
 other ops work into the compiler/optimizer is a large undertaking. D would
 have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly
 extensible for new instructions, and for which a library can be layered over
 for a more presentable interface. A half-formed idea of mine is, taking a
 cue from yours:

 Declare one new basic type:

    __v128

 which represents the 16 byte aligned 128 bit vector type. The only
 operations defined to work on it would be construction and assignment. The
 __ prefix signals that it is non-portable.

 Then, have:

    import core.simd;

 which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86.
 (Other architectures can provide an implementation of it that simulates its
 operation, but I doubt that it would be worth anyone's while to use that.)

 The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

 For:

    z = simdop(PFADD, x, y);

 the compiler would generate:

    MOV z,x
    PFADD z,y

Would this tie SIMD support directly to x86/x86_64, or would it be possible to also support NEON on ARM (also 128 bit SIMD, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html )? (Obviously not for DMD, but if the syntax wasn't directly tied to x86/64, GDC and LDC could support this)

It seems like using a standard naming convention instead of directly referencing instructions could let the underlying SIMD instructions vary across platforms, but I don't know enough about the technologies to say whether NEON's capabilities match SSE closely enough that they could be handled the same way.
Jan 06 2012
parent reply a <a a.com> writes:
 Would this tie SIMD support directly to x86/x86_64, or would it
 possible to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
 ) ?
 (Obviously not for DMD, but if the syntax wasn't directly tied to
 x86/64, GDC and LDC could support this)
 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions
 vary across platforms, but I don't know enough about the technologies
 to say whether NEON's capabilities match SSE closely enough that they
 could be handled the same way.

For NEON you would need at least a function with a signature: __v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3); since many NEON instructions operate on three registers.
Jan 06 2012
parent a <a a.com> writes:
a Wrote:

 Would this tie SIMD support directly to x86/x86_64, or would it
 possible to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
 ) ?
 (Obviously not for DMD, but if the syntax wasn't directly tied to
 x86/64, GDC and LDC could support this)
 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions
 vary across platforms, but I don't know enough about the technologies
 to say whether NEON's capabilities match SSE closely enough that they
 could be handled the same way.

For NEON you would need at least a function with a signature: __v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3); since many NEON instructions operate on three registers.

Disregard that, I wasn't paying attention to the return type. What Walter proposed can already handle three operand NEON instructions.
Jan 06 2012
prev sibling next sibling parent reply a <a a.com> writes:
Walter Bright Wrote:

 which provides two functions:
 
     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

You would also need functions that take an immediate too to support instructions such as shufps.
 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 
 packed doubles. One problem with making it typed is it'll add 10 more types to 
 the base compiler, instead of one. Maybe we should just bite the bullet and do 
 the types:
 
      __vdouble2
      __vfloat4
      __vlong2
      __vulong2
      __vint4
      __vuint4
      __vshort8
      __vushort8
      __vbyte16
      __vubyte16

I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code and the vector registers are typeless, so why shouldn't vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 12:16, a <a a.com> wrote:

 Walter Bright Wrote:

 which provides two functions:

     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

You would also need functions that take an immediate too to support instructions such as shufps.
 One caveat is it is typeless; a __v128 could be used as 4 packed ints or

2
 packed doubles. One problem with making it typed is it'll add 10 more

types to
 the base compiler, instead of one. Maybe we should just bite the bullet

and do
 the types:

      __vdouble2
      __vfloat4
      __vlong2
      __vulong2
      __vint4
      __vuint4
      __vshort8
      __vushort8
      __vbyte16
      __vubyte16

I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code and the vector registers are typeless, so why shouldn't vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).

Hooray! I think we're on exactly the same page. That's refreshing :)

I think this __simdop( op, v1, v2, etc ) api is a bit of a bad idea... there are too many permutations of arguments. I know some PPC functions that receive FIVE arguments (2-3 regs, and 2-3 literals)..

Why not just expose the opcodes as intrinsic functions directly, for instance (maybe in std.simd.sse)?

  __v128 __sse_mul_ss( __v128 v1, __v128 v2 );
  __v128 __sse_mul_ps( __v128 v1, __v128 v2 );
  __v128 __sse_madd_epi16( __v128 v1, __v128 v2, __v128 v3 ); // <- some have more args
  __v128 __sse_shuffle_ps( __v128 v1, __v128 v2, immutable int i ); // <- some need literal ints
  etc...

This works best for other architectures too I think, they expose their own set of intrinsics, and some have rather different parameter layouts. VMX for instance (perhaps in std.simd.vmx?):

  __v128 __vmx_vmsum4fp( __v128 v1, __v128 v2, __v128 v3 );
  __v128 __vmx_vpermwi( __v128 v1, immutable int i ); // <-- needs a literal
  __v128 __vmx_vrlimi( __v128 v1, __v128 v2, immutable int mask, immutable int rot ); // <-- you really don't want to add your enum style function for all these prototypes?
  etc...

I have seen at least these argument lists:

  ( v1 )
  ( v1, v2 )
  ( v1, v2, v3 )
  ( v1, immutable int )
  ( v1, v2, immutable int )
  ( v1, v2, immutable int, immutable int )
Jan 06 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
Hi,

just bringing into the discussion how Mono does it.

http://tirania.org/blog/archive/2008/Nov-03.html

Also have a look at pages 44-53 from the presentation slides.

--
Paulo


"Walter Bright"  wrote in message news:je6c7j$2ct0$1 digitalmars.com...

On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to 
 pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 10:43, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:

 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

Takeaways:

1. SIMD behavior is going to be very machine specific.

2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.

3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

Declare one new basic type:

    __v128

which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)

The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

For:

    z = simdop(PFADD, x, y);

the compiler would generate:

    MOV z,x
    PFADD z,y

The code generator knows enough about these instructions to do register assignments reasonably optimally.

What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

    __vdouble2
    __vfloat4
    __vlong2
    __vulong2
    __vint4
    __vuint4
    __vshort8
    __vushort8
    __vbyte16
    __vubyte16

Sounds good to me. Though I think __v128 should definitely be typeless, allowing all those other types to be implemented in libraries. Why wouldn't you leave that volume of work to libraries? All those types and related complications shouldn't be code in the language.

There's a reason Microsoft chose to only expose __m128 as an intrinsic. The rest you build yourself.

Also, the LIBRARIES for typed vectors can(/will) attempt to support multiple architectures using version()s behind the scenes.
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 11:04, Andrew Wiley <wiley.andrew.j gmail.com> wrote:

 On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me to
 pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.

 2. Even trying to do something with + is fraught with peril, as integer adds
 with SIMD can be saturated or unsaturated.

 3. Trying to build all the details about how each of the various adds and
 other ops work into the compiler/optimizer is a large undertaking. D would
 have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly
 extensible for new instructions, and for which a library can be layered over
 for a more presentable interface. A half-formed idea of mine is, taking a
 cue from yours:

 Declare one new basic type:

    __v128

 which represents the 16 byte aligned 128 bit vector type. The only
 operations defined to work on it would be construction and assignment. The
 __ prefix signals that it is non-portable.

 Then, have:

    import core.simd;

 which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86.
 (Other architectures can provide an implementation of it that simulates its
 operation, but I doubt that it would be worth anyone's while to use that.)

 The operators would be an enum listing of the SIMD opcodes,

    PFACC, PFADD, PFCMPEQ, etc.

 For:

    z = simdop(PFADD, x, y);

 the compiler would generate:

    MOV z,x
    PFADD z,y

 Would this tie SIMD support directly to x86/x86_64, or would it be possible
 to also support NEON on ARM (also 128 bit SIMD, see
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html )?
 (Obviously not for DMD, but if the syntax wasn't directly tied to x86/64,
 GDC and LDC could support this)

 It seems like using a standard naming convention instead of directly
 referencing instructions could let the underlying SIMD instructions vary
 across platforms, but I don't know enough about the technologies to say
 whether NEON's capabilities match SSE closely enough that they could be
 handled the same way.

The underlying architectures are too different to try and map opcodes across architectures. __v128 should map to each architecture's native SIMD type, allowing the compiler to express the hardware, but the opcodes would come from the architecture-specific opcodes available in each compiler.

As I keep suggesting, LIBRARIES would be created to supply the types like float4, int4, etc, which may also use version() liberally behind the scenes to support all architectures, allowing a common and efficient API for all architectures at this level.
Jan 06 2012
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 
 packed doubles. One problem with making it typed is it'll add 10 more types to 
 the base compiler, instead of one. Maybe we should just bite the bullet and do 
 the types:

What are the disadvantages of making it typeless? If it is typeless how do you tell it to perform a 4 float sum instead of a 2 double sum?

Is this low level layer able to support AVX and AVX2 3-way comparison instructions too, and the fused multiplication-add instruction?

---------------

For Manu: LDC compiler has this too:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 14:54, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or

2
 packed doubles. One problem with making it typed is it'll add 10 more

types to
 the base compiler, instead of one. Maybe we should just bite the bullet

and do
 the types:

What are the disadvantages of making it typeless? If it is typeless how do you tell it to perform a 4 float sum instead of a 2 double sum? Is this low level layer able to support AVX and AVX2 3-way comparison instructions too, and the fused multiplication-add instruction?

I don't believe there are any. I can see only advantages to implementing the typed versions in libraries.

To make it perform float4 math, or double2 math, you either write the pseudo assembly you want directly, or, more realistically, you use the __float4 type supplied in the standard library, which will already associate all the float4 related functionality, and try and map it across various architectures as efficiently as possible.

AVX needs a __v256 type in addition to the __v128 type already discussed. This should be trivial to add in addition to __v128. Again, the libraries take care of presenting a nice API to the users.

The comparisons and m-sum you mention are just opcodes like any other that may be used on the raw type, and will be wrapped up nicely in the strongly typed libraries.
Jan 06 2012
parent reply bearophile <bearophileHUGS lycos.com> writes:
Manu:

 To make it perform float4 math, or double2 match, you either write the
 pseudo assembly you want directly, but more realistically, you use the
 __float4 type supplied in the standard library, which will already
 associate all the float4 related functionality, and try and map it across
 various architectures as efficiently as possible.

I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops? __float4[10] a, b, c; c[] = a[] + b[]; Bye, bearophile
Jan 06 2012
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
 I see. While you design, you need to think about the other features of D :-)
 Is it possible to mix CPU SIMD with D vector ops?
 
 __float4[10] a, b, c;
 c[] = a[] + b[];

And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than true SIMD operations written manually) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So a good D compiler should compile them efficiently too. Bye, bearophile
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 16:12, bearophile <bearophileHUGS lycos.com> wrote:

 I see. While you design, you need to think about the other features of D
 :-) Is it possible to mix CPU SIMD with D vector ops?

  __float4[10] a, b, c;
  c[] = a[] + b[];

And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than true SIMD operations written manually) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So a good D compiler should compile them efficiently too.

I'm not clear what you mean. Are you talking about D vectors of hardware vectors, as in your example above? (no problem, see my last post) Or are you talking about programmers who will prefer to use float[4] instead of __float4? (this is what I think you're getting at?)...

Users who prefer to use float[4] are welcome to do so, but I think you are mistaken when you assume this will be 'simpler to use'. The rules for what they can/can't do efficiently with a float[4] are extremely restrictive, and it's also very unclear if/when they are violating said rules. It will almost always be faster to let the float unit do all the work in this case... Perhaps the compiler COULD apply some SIMD optimisations in very specific cases, but this would require some serious sophistication from the compiler to detect.

Some likely problems:
* float[4] is not aligned; performing unaligned loads/stores will require a long sequence of carefully pipelined vector code just to break even on that cost. If the sequence of ops is short, it will be faster to keep it in the FPU.
* float[4] allows component-wise access. This produces transfers of data between the FPU and the SIMD unit, which may again negate the advantage of using SIMD opcodes over the FPU directly.
* Loading a vectorised float[4] with floats calculated/stored on the FPU produces the same hazards as above. SIMD regs should not be loaded with data taken from the FPU if possible.
* How do you express logic and comparisons? Chances are people will write arbitrary component-wise comparisons. This requires flushing the values out from the SIMD regs back to the FPU for comparisons, again negating any advantage of SIMD calculation.

The hazard I refer to almost universally is that of swapping data between register types. This is a slow process, and it breaks any possibility for efficient pipelining.
The FPU pipelines nicely:

  float[4] x;
  x[] += 1.0;

This will result in 4 sequential adds to different registers; there are no data dependencies, so this will pipeline beautifully, one cycle after another. It is probably only 3 cycles longer than a SIMD add, plus a small cost for the extra opcodes in the instruction stream.

Any time you need to swap register type, the pipeline is broken. Imagine something seemingly harmless, and totally logical, like this:

  float[4] hardwareVec; // compiler allows use of a hardware vector for float[4]
  hardwareVec[1] = groundHeight; // we want to set Y explicitly, seems reasonable, perhaps we're snapping a position to a ground plane or something...

This may be achieved in some way that looks something like this:
* groundHeight must be stored to the stack
* flush pipeline (wait for the data to arrive) (potentially a long time)
* UNALIGNED load from stack into a vector register (this may require an additional operation to rotate the vector into the proper position after loading on some architectures)
* flush pipeline (wait for data to arrive)
* the loaded float needs to be merged with the existing vector; this can be done in a variety of ways:
  - use a permute operation [only some architectures support arbitrary permutes; VMX is best] (one opcode, but requires pre-loading a separate permute control register to describe the appropriate merge; this load may be expensive, and the data must be available)
  - use a series of shifts (requires 2 shifts for X or W, 3 shifts for Y or Z); doesn't require any additional loads from memory, but each of the shifts is a dependent operation, and the pipeline must be flushed between them
  - use a mask and OR the 2 vectors together (applying masks to both the source and target vectors can be pipelined in parallel, and only the final OR requires flushing the pipeline...)
  - [note: none of these options is ideal, and each may be preferable based on context in different situations]
* done

Congratulations, you've now set the Y component. At the cost of an LHS through memory, potentially other loads from memory, and 5-10 flushes of the pipeline, summing hundreds, maybe thousands, of wasted CPU cycles. In this same amount of wasted time, you could have done a LOT of work with the FPU directly.

Process of the same operation using just the FPU:
* FPU stores groundHeight (already in an FPU reg) to &hardwareVec[1]
* done

And if the value is an intermediate and never needs to be stored on the stack, there's a chance the operation will be eliminated entirely, since the value is already in a float reg, ready for use in the next operation :)

I think the take-away I'm trying to illustrate here is: SIMD work and scalar work do NOT mix... any syntax that allows it is a mistake. Users won't understand all the details and implications of the seemingly trivial operations they perform, and shouldn't need to.

Auto-vectorisation of float[4] would be some amazingly sophisticated code, and very temperamental. If the compiler detects it can make some optimisation, great, but it will not be reliable from a user point of view, and it won't be clear what to change to make the compiler do a better job. It also still implies policy problems, ie, should float[4] be special-cased to be aligned(16) when no other array requires this? What about all the different types? How do you cast between them? What are the expected results?

I think it's best to forget about float[4] as a candidate for reliable auto-vectorisation. Perhaps there's an opportunity for some nice little compiler bonuses, but it should not be the language's window into efficient use of the hardware. Anyone using float[4] should accept that they are working with the FPU, and they probably won't suffer much for it.
If they want/need aggressive SIMD optimisation, then they need to use the appropriate API, and understand, at least a little bit, how the hardware works... Ideally the well-defined SIMD API will make it easiest to do the right thing, and they won't need to know all these hardware details to make good use of it.
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 16:06, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

 To make it perform float4 math, or double2 math, you either write the
 pseudo assembly you want directly, but more realistically, you use the
 __float4 type supplied in the standard library, which will already
 associate all the float4 related functionality, and try and map it across
 various architectures as efficiently as possible.

I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops?

  __float4[10] a, b, c;
  c[] = a[] + b[];

I don't see any issue with this. An array of vectors makes perfect sense, and I see no reason why arrays/slices/etc of hardware vectors should be any sort of problem. This particular expression should be just as efficient as if it were an array of flat floats, especially if the compiler unrolls it.

D's array/slice syntax is something I'm very excited about, actually, in conjunction with hardware vectors. I could do some really elegant geometry processing with slices from vertex streams.
Jan 06 2012
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]
 I don't see any issue with this. An array of vectors makes perfect sense,
 and I see no reason why arrays/slices/etc of hardware vectors should be any
 sort of problem.
 This particular expression should be just as efficient as if it were an
 array of flat floats, especially if the compiler unrolls it.

 D's array/slice syntax is something I'm very excited about actually in
 conjunction with hardware vectors. I could do some really elegant geometry
 processing with slices from vertex streams.

Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick". As I understand it, the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture. This immediately raises these questions in my mind:

1. Should x86 specific things be reified in the D language? Given that ARM and other architectures are increasingly more important than x86, D should not tie itself to x86.

2. Is there a way of doing something in D so that GPGPU can be described? Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA addicts) or OpenCL (for Apple addicts and others). It would be good if D could just take over this market by being able to manage GPU kernels easily. The risk is that PyCUDA and PyOpenCL beat D to market leadership.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jan 06 2012
parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
From what I see in HPC conference papers and webcasts, I think it might be
already too late for D in those scenarios.

"Russel Winder"  wrote in message 
news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]

Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
addicts) or OpenCL (for Apple addicts and others).  It would be good if
D could just take over this market by being able to manage GPU kernels
easily.  The risk is that PyCUDA and PyOpenCL beat D to market
leadership.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: 
sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder 
Jan 06 2012
parent reply Russel Winder <russel russel.org.uk> writes:
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
 From what I see in HPC conference papers and webcasts, I think it might be
 already too late for D
 in those scenarios.

Indeed, for core HPC that is true: if you aren't using Fortran, C, C++, and Python you are not in the game. The point is that HPC is really about using computers that cost a significant proportion of the USA national debt.

My thinking is that with Intel especially looking to use the Moore's Law transistor-count mountain to put heterogeneous many-core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip, the programming languages used by the majority of programmers, not just those playing with multi-billion dollar kit, will have to be able to deal with heterogeneous models of computation. The current model of separate compilation and loading of CPU code and GPGPU kernel is a hack to get things working in a world where tool chains are still about building 1970s single-threaded code.

This represents an opportunity for non-C and C++ languages. Python is beginning to take a stab at trying to deal with all this. D would be another good candidate. Java cannot be in this game without some serious updating of the JVM semantics -- an issue we debated a bit on this list a short time ago, so no need to rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having it provide a better development experience for these heterogeneous systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL is the industry standard now). I will though be buying one for myself in the next couple of weeks.
 "Russel Winder"  wrote in message=20
 news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]
 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

--
Russel.
Jan 06 2012
next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
Please don't start a flame war on this, I am just expressing an opinion.

I think that for heterogeneous computing we are better off with a language that supports functional programming concepts.

From what I have seen in papers, many imperative languages have the issue that they are too tied to the old homogeneous computing model we had on the desktop. That is the main reason why C and C++ start to look like Frankenstein languages, with all the extensions companies are adding to them to support the new models.

Functional languages have the advantage that their hardware model is more abstract, and as such can be mapped more easily to heterogeneous hardware. This is also an area where VM-based languages might have some kind of advantage, but I am not sure.

Now, D actually has quite a few tools to explore functional concepts, so I guess it could take off in this area if enough HPC people got some interest in it.

Regarding CUDA, you will surely know this better than me. I read somewhere that in most research institutes people only care about CUDA, not OpenCL, because of it being older than OpenCL, the C++ support, the available tools, and NVIDIA cards' performance when compared with ATI in this area. But I don't have any experience here, so I don't know how much of this is true.

--
Paulo


"Russel Winder"  wrote in message 
news:mailman.109.1325864213.16222.digitalmars-d puremagic.com...
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
 From what I see in HPC conferences papers and webcasts, I think it might 
 be
 already too late for D
 in those scenarios.

Indeed, for core HPC that is true: if you aren't using Fortran, C, C++, and Python you are not in the game. The point is that HPC is really about using computers that cost a significant proportion of the USA national debt.

My thinking is that with Intel especially looking to use the Moore's Law transistor-count mountain to put heterogeneous many-core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip, the programming languages used by the majority of programmers, not just those playing with multi-billion dollar kit, will have to be able to deal with heterogeneous models of computation. The current model of separate compilation and loading of CPU code and GPGPU kernel is a hack to get things working in a world where tool chains are still about building 1970s single-threaded code.

This represents an opportunity for non-C and C++ languages. Python is beginning to take a stab at trying to deal with all this. D would be another good candidate. Java cannot be in this game without some serious updating of the JVM semantics -- an issue we debated a bit on this list a short time ago, so no need to rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having it provide a better development experience for these heterogeneous systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL is the industry standard now). I will though be buying one for myself in the next couple of weeks.
 "Russel Winder"  wrote in message
 news:mailman.107.1325862128.16222.digitalmars-d puremagic.com...
 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]

 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

--
Russel.
Jan 06 2012
parent "Froglegs" <lugtug gmail.com> writes:
  That CUDA is used more is probably true; OpenCL is fugly C and 
no fun.

Microsoft's upcoming C++ AMP looks interesting as it lets you 
write GPU and CPU code in C++.  The spec is open so hopefully it 
becomes common to implement it in other C++ compilers.

SSE intrinsics in C++ are pretty essential for getting great 
performance, so I do think D needs something like this.  A 
problem with intrinsics in C++ has been poor support from 
compilers, often performing little or no optimization and just 
blindly issuing instructions as you listed them, causing all 
kinds of extra loads and stores.

  Visual Studio is actually one of the worst C++ compilers for 
intrinsics; ICC is likely the best.

So even if D does add these new intrinsic functions, it would need 
to actually optimize around them to produce reasonably fast code.

  I agree that the v128 type should be typeless; it is typeless on 
the hardware, and this makes it easier to mix and match instructions.
Jan 06 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 7:36 AM, Russel Winder wrote:
 It just strikes me as an opportunity to get D front and centre by having
 it provide a better development experience for these heterogeneous
 systems that are coming.

At the moment, I have no idea what such support might look like :-(
Jan 06 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 17:01, Russel Winder <russel russel.org.uk> wrote:

 On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
 [...]
 I don't see any issue with this. An array of vectors makes perfect sense,
 and I see no reason why arrays/slices/etc of hardware vectors should be

any
 sort of problem.
 This particular expression should be just as efficient as if it were an
 array of flat floats, especially if the compiler unrolls it.

 D's array/slice syntax is something I'm very excited about actually in
 conjunction with hardware vectors. I could do some really elegant

geometry
 processing with slices from vertex streams.

Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick". As I understand it currently the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture.

No, I'm talking specifically about NOT making the type x86/SSE specific. Hence all my ramblings about a 'generalised'/typeless v128 type which can be used to express 128bit SIMD hardware of any architecture.
 This immediately raises the
 questions in my mind:

 1.  Should x86 specific things be reified in the D language.  Given that
 ARM and other architectures are increasingly more important than x86, D
 should not tie itself to x86.

The opcode intrinsics you use to interact with the generalised type will be architecture-specific, but this isn't the end point of my proposal. The next step is to produce libraries which will use version() heavily behind the API to collate different architectures into nice, user-friendly vector types. Sadly, vector units across architectures are too different to expose useful vector types cleanly in the language directly, so libraries will do this, making use of compiler-defined architecture intrinsics behind lots of version() statements.
 2.  Is there a way of doing something in D so that GPGPU can be
 described?

I think this will map neatly to GPGPU. The vector types proposed will apply to that hardware just fine. This is a much bigger question though; the real problems are:
* How do you compile/produce code that will run on the GPU? (do we have a D->Cg compiler?)
* How do you express the threading/concurrency aspects of GPGPU usage? (this is way outside the scope of vector arithmetic)
* How do you express the data sources available to GPUs? Constant files, etc... (it seems D actually has reasonably good language expressions for this sort of thing)

 Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
 addicts) or OpenCL (for Apple addicts and others).  It would be good if
 D could just take over this market by being able to manage GPU kernels
 easily.  The risk is that PyCUDA and PyOpenCL beat D to market
 leadership.

As said, I think these questions are way outside the scope of SIMD vector libraries ;) Although this is a fundamental piece of the puzzle, since GPGPU is no use without SIMD type expression... but I think everything we've discussed here so far will map perfectly to GPGPU.
Jan 06 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/6/2012 9:44 AM, Manu wrote:
 On 6 January 2012 17:01, Russel Winder <russel russel.org.uk
 <mailto:russel russel.org.uk>> wrote:
 As said, I think these questions are way outside the scope of SIMD
 vector libraries ;)
 Although this is a fundamental piece of the puzzle, since GPGPU is no
 use without SIMD type expression... but I think everything we've
 discussed here so far will map perfectly to GPGPU.

I don't think you are in any danger, as the GPGPU instructions are more flexible than the CPU SIMD counterparts. GPU hardware natively works with float2 and float3 extremely well. GPUs have VLIW instructions that can effectively add a huge number of instruction modifiers to their instructions (things like built-in saturates to the 0..1 range on variable argument _reads_, arbitrary swizzle on read and write, write masks that leave partial data untouched, etc, all in one clock). The CPU SIMD stuff is simplistic by comparison.

A good bang for the buck would be to have some basic set of operators (* / + - < > == != <= >= and especially ? (the ternary operator)), and versions of 'any' and 'all' from HLSL for dynamic branching, that can work at the very least for integer, float, and double types. Bit shifting is useful (esp. manipulating floats for transcendental functions, or working with half FP16 types, requires a lot of it), but should be restricted to integer types. Having dedicated signed and unsigned right shifts would be pretty nice too (since about 95% of my right shifts end up needing to be of the zero-extended variety, even though I had to cast to 'vector integers').
Jan 14 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/5/2012 5:42 PM, Manu wrote:
 So I've been hassling about this for a while now, and Walter asked me  
 to pitch
 an email detailing a minimal implementation with some initial thoughts.

 Takeaways:

 1. SIMD behavior is going to be very machine specific.
 2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.

 So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:

 Declare one new basic type:

     __v128

 which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.

 Then, have:

     import core.simd;

 which provides two functions:

     __v128 simdop(operator, __v128 op1);
     __v128 simdop(operator, __v128 op1, __v128 op2);

 This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.) The operators would be an enum listing of the SIMD opcodes, PFACC, PFADD, PFCMPEQ, etc. For:

     z = simdop(PFADD, x, y);

 the compiler would generate:

     MOV z,x
     PFADD z,y

 The code generator knows enough about these instructions to do register assignments reasonably optimally.

 What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one.
 Maybe we should just bite the bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrappers. Still simdop would not be typesafe.

As much as this proposal presents a viable solution, why not spend the time to extend inline asm?

    void foo()
    {
        __v128 a = loadss(1.0f);
        __v128 b = loadss(1.0f);
        a = addss(a, b);
    }

    __v128 loadss(float v)
    {
        __v128 res; // allocates register
        asm { movss res, v[RBP]; }
        return res; // return in XMM1 but inlineable return assignment
    }

    __v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
    {
        __v128 res = a;
        // asm prolog, allocates registers for every __v128 used within the asm
        asm { addss res, b; }
        // asm epilog, possibly restore spilled registers
        return res;
    }

What would be needed?
 - Implement the asm allocation logic.
 - Functions containing asm statements should participate in inlining.
 - Determining the inline cost of asm statements.

When being used with typedefs for __vubyte16 et al., this would allow a really clean and simple library implementation of intrinsics.
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 13:56:58 +0100, Martin Nowak <dawg dawgfoto.de> wrote:

 [...]

 When being used with typedefs for __vubyte16 et al., this would allow a
 really clean and simple library implementation of intrinsics.

Also, addss is a pure function, which could be important for optimizing out certain calls. Maybe we should allow asm to be attributed with pure.
Jan 06 2012
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 14:56, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <
 newshound2 digitalmars.com> wrote:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints or
 2 packed doubles. One problem with making it typed is it'll add 10 more
 types to the base compiler, instead of one. Maybe we should just bite the
 bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrapper. Still simdop would not be typesafe.

I think they should be well-defined structs with lots of type safety and sensible methods, not just a typedef of the typeless primitive.
 As much as this proposal presents a viable solution,
 why not spending the time to extend inline asm.

I think there are too many risky problems with the inline assembler (as raised in my discussion about supporting pseudo registers in inline asm blocks):

 * No way to allow the compiler to assign registers (pseudo registers).
 * Assembly blocks present problems for the optimiser; it's not reliable that it can optimise around an inline asm block. How bad will it be when trying to optimise around 100 small inlined functions, each containing its own inline asm block?
 * D's inline assembly syntax has to be carefully translated to GCC's inline asm format when using GCC, and this needs to be done PER-ARCHITECTURE, which Iain should not be expected to do for all the obscure architectures GCC supports.
 What would be needed?
  - Implement the asm allocation logic.
  - Functions containing asm statements should participate in inlining.
  - Determining inline cost of asm statements.

I raised these points in my other thread; these are all far more complicated problems, I think, than exposing opcode intrinsics would be. Opcode intrinsics are almost certainly the way to go.

 When being used with typedefs for __vubyte16 et.al. this would
 allow a really clean and simple library implementation of intrinsics.

The type safety you're imagining here might actually be annoying when working with the raw type and opcodes. Consider this common situation and the code that will be built around it:

    __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; // pack some other useful data in W

If vec were strongly typed, I would now need to start casting all over the place to use various float and uint opcodes on this value?

I think it's correct when using SIMD at the raw level to express the type as it is: typeless. SIMD regs are in fact typeless regs; they only gain a concept of type the moment you perform an opcode on them, and only for the duration of that opcode.

You will get your strong type safety when you make use of the float4 types which will be created in the libs.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 5:44 AM, Manu wrote:
 The type safety you're imagining here might actually be annoying when working
 with the raw type and opcodes..
 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); // pack
some
 other useful data in W
 If vec were strongly typed, I would now need to start casting all over the
place
 to use various float and uint opcodes on this value?
 I think it's correct when using SIMD at the raw level to express the type as it
 is, typeless... SIMD regs are infact typeless regs, they only gain concept of
 type the moment you perform an opcode on it, and only for the duration of that
 opcode.

 You will get your strong type safety when you make use of the float4 types
which
 will be created in the libs.

Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language; we paint the fiction of a type onto it, and that works very well.

To me, the advantages of making the SIMD types typed are:

1. The language does typechecking; for example, trying to add a vector of 4 floats to 16 bytes would be (and should be) an error.

2. Some of the SIMD operations do map nicely onto the operators, so one could write:

    a = b + c + -d;

and the correct SIMD opcodes will be generated based on the types. I think that would be one hell of a lot nicer than using function syntax. Of course, this will only be for those SIMD ops that do map; for the rest you're stuck with the functions.

3. A lot of the SIMD opcodes have 10 variants, one for each of the 10 types. The user would only need to remember the operation, not the variants, and let the usual overloading rules apply.

And, of course, casting would be allowed and would be zero cost.

I've been thinking about this a lot since last night, and I think that since the back end already supports XMM registers, most of the hard work is done, and doing it this way would fit in well. (At least for 64 bit code, where the alignment issue is solved, but that's an orthogonal issue.)
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 5:44 AM, Manu wrote:

 The type safety you're imagining here might actually be annoying when
 working
 with the raw type and opcodes..
 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //
 pack some

Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language, we paint the fiction of a type onto it, and that works very well.

Damn it, I thought we already reached agreement, why are you having second thoughts? :)

Your analogy to EAX is not really valid. EAX may hold some data that is not an int, but it is incapable of performing a float operation on that data. SIMD registers are capable of performing operations of any type at any time on any register; I think this is the key distinction that justifies them as inherently 'typeless' registers.
 To me, the advantage of making the SIMD types typed are:

 1. the language does typechecking, for example, trying to add a vector of
 4 floats to 16 bytes would be (and should be) an error.

The language WILL do that checking as soon as we create the strongly typed libraries. And people will use those libraries; they'll never touch the primitive type.

The primitive type however must not inhibit the programmer from being able to perform any operation that the hardware is technically capable of. The primitive type will be used behind the scenes for building said libraries... nobody will use it in front-end code. It's not really a useful type; it doesn't do anything. It just allows the ABI and register semantics to be expressed in the language.
 2. Some of the SIMD operations do map nicely onto the operators, so one
 could write:

   a = b + c + -d;

This is not even true, as you said yourself in a previous post. SIMD int ops may wrap, or saturate... which is it? Don't try and express this at the language level. Let the libraries do it, and if they fail, or are revealed to be poorly defined, they can be updated/changed.

 3. A lot of the SIMD opcodes have 10 variants, one for each of the 10
 types. The user would only need to remember the operation, not the
 variants, and let the usual overloading rules apply.

Correct, and they will be hidden behind the api of their strongly typed library counterparts. The user will never need to be aware of the opcodes, or their variants.
 And, of course, casting would be allowed and would be zero cost.

Zero cost? You're suggesting all casts would be reinterprets? Surely:

    float4 fVec = (float4)intVec;

should perform a type conversion?

Again, this is detail that can/should be discussed when implementing the standard library; leave this sort of problem out of the language.

Your earlier email detailing your simple API with an enum of opcodes sounded fine... whatever's easiest really. The hard part will be implementing the alignment, and the literal syntax.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:41 AM, Manu wrote:
 On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/6/2012 5:44 AM, Manu wrote:

         The type safety you're imagining here might actually be annoying when
         working
         with the raw type and opcodes..
         Consider this common situation and the code that will be built around
it:
         __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //
         pack some


     Consider an analogy with the EAX register. It's untyped. But we don't find
     it convenient to make it untyped in a high level language, we paint the
     fiction of a type onto it, and that works very well.


 Damn it, I though we already reached agreement, why are you having second
 thoughts? :)
 Your analogy to EAX is not really valid. EAX may hold some data that is not an
 int, but it is incapable of performing a float operation on that data.
 SIMD registers are capable of performing operations of any type at any time to
 any register, I think this is the key distinction that justifies them as
 inherently 'typeless' registers.

I strongly disagree with this. EAX can be (and is) at various times used as byte, ubyte, short, ushort, int, uint, pointer, and yes, even floats! Anything that fits in it, actually. It is typeless. The types used on it are a fiction perpetrated by the language, but a very very useful fiction.
     To me, the advantage of making the SIMD types typed are:

     1. the language does typechecking, for example, trying to add a vector of 4
     floats to 16 bytes would be (and should be) an error.


 The language WILL do that checking as soon as we create the strongly typed
 libraries. And people will use those libraries, they'll never touch the
 primitive type.

I'm not so sure this will work out satisfactorily.
 The primitive type however must not inhibit the programmer from being able to
 perform any operation that the hardware is technically capable of.
 The primitive type will be used behind the scenes for building said
libraries...
 nobody will use it in front-end code. It's not really a useful type, it doesn't
 do anything. It just allows the ABI and register semantics to be expressed in
 the language.

     2. Some of the SIMD operations do map nicely onto the operators, so one
     could write:

        a = b + c + -d;


 This is not even true, as you said yourself in a previous post.
 SIMD int ops may wrap, or saturate... which is it?

It would only be for those ops that actually do map onto the D operators. (This is already done by the library implementation of the array arithmetic operations.) The saturated int ops would not be usable this way.
 Don't try and express this at the language level. Let the libraries do it, and
 if they fail, or are revealed to be poorly defined, they can be
updated/changed.

Doing it as a library type pretty much prevents certain optimizations, for example, the fused operations, from being expressed using infix operators.
     3. A lot of the SIMD opcodes have 10 variants, one for each of the 10
types.
     The user would only need to remember the operation, not the variants, and
     let the usual overloading rules apply.


 Correct, and they will be hidden behind the api of their strongly typed library
 counterparts. The user will never need to be aware of the opcodes, or their
 variants.

     And, of course, casting would be allowed and would be zero cost.


 Zero cost? You're suggesting all casts would be reinterprets? Surely: float4
 fVec = (float4)intVec; should perform a type conversion?
 Again, this is detail that can/should be discussed when implementing the
 standard library, leave this sort of problem out of the language.

Painting a new type onto a value (i.e. a reinterpret cast) does have zero runtime cost. I don't think it's a real problem - we do it all the time when, for example, we want to retype an int as a uint:

    int i;
    uint u = cast(uint)i;
 Your earlier email detailing your simple API with an enum of opcodes sounded
 fine... whatever's easiest really. The hard part will be implementing the
 alignment, and the literal syntax.

Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 22:36, Walter Bright <newshound2 digitalmars.com> wrote:

    To me, the advantage of making the SIMD types typed are:
    1. the language does typechecking, for example, trying to add a vector
 of 4
    floats to 16 bytes would be (and should be) an error.


 The language WILL do that checking as soon as we create the strongly typed
 libraries. And people will use those libraries, they'll never touch the
 primitive type.

I'm not so sure this will work out satisfactorily.

How so, can you support this theory?
    2. Some of the SIMD operations do map nicely onto the operators, so one
    could write:

       a = b + c + -d;


 This is not even true, as you said yourself in a previous post.
 SIMD int ops may wrap, or saturate... which is it?

It would only be for those ops that actually do map onto the D operators. (This is already done by the library implementation of the array arithmetic operations.) The saturated int ops would not be usable this way.

But why are you against adding this stuff in the library? It's contrary to the general sentiment around here, where people like putting stuff in libraries where possible. It's less committing, and allows alternative implementations if desired.

 Don't try and express this at the language level. Let the libraries do it,
 and
 if they fail, or are revealed to be poorly defined, they can be
 updated/changed.

Doing it as a library type pretty much prevents certain optimizations, for example, the fused operations, from being expressed using infix operators.

You're talking about MADD? I was going to make a separate suggestion regarding that actually.

Multiply-add is a common concept, often available to FPUs as well (with no way to express it)... I was going to suggest an opMultiplyAdd() operator, which you could have the language call if it detects a conforming arrangement of * and + operators on a type. This would allow operator access to madd in library vectors too.

 And, of course, casting would be allowed and would be zero cost.
 Zero cost? You're suggesting all casts would be reinterprets? Surely:
 float4
 fVec = (float4)intVec; should perform a type conversion?
 Again, this is detail that can/should be discussed when implementing the
 standard library, leave this sort of problem out of the language.

Painting a new type (i.e. reinterpret casts) do have zero runtime cost to them. I don't think it's a real problem - we do it all the time when, for example, we want to retype an int as a uint: int i; uint u = cast(uint)i;

Yeah sure, but I don't think that's fundamentally correct. If you're drifting towards typing these things in the language, then you should also start considering cast mechanics... and that's a larger topic of debate. I don't really think "float4 floatVec = (float4)intVec;" should be a reinterpret... surely, as a high level type, this should perform a type conversion?

I'm afraid this has become a lot more complicated than it needs to be. Can you illustrate your current thoughts/plan, to have it summarised in one place? Has it drifted from what you said last night?
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:32 PM, Manu wrote:
 On 6 January 2012 22:36, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

             To me, the advantage of making the SIMD types typed are:

             1. the language does typechecking, for example, trying to add a
         vector of 4
             floats to 16 bytes would be (and should be) an error.


         The language WILL do that checking as soon as we create the strongly
typed
         libraries. And people will use those libraries, they'll never touch the
         primitive type.


     I'm not so sure this will work out satisfactorily.


 How so, can you support this theory?

For one thing, the compiler has a very hard time optimizing library implemented types. It's why int, float, etc., are native types. We've come a long way with library types, but there are limits.
             2. Some of the SIMD operations do map nicely onto the operators,
so one
             could write:

                a = b + c + -d;


         This is not even true, as you said yourself in a previous post.
         SIMD int ops may wrap, or saturate... which is it?


     It would only be for those ops that actually do map onto the D operators.
     (This is already done by the library implementation of the array arithmetic
     operations.) The saturated int ops would not be usable this way.


 But why are you against adding this stuff in the library? It's contrary to the
 general sentiment around here where people like putting stuff in libraries
where
 possible. It's less committing, and allows alternative implementations if
desired.

         Don't try and express this at the language level. Let the libraries do
         it, and
         if they fail, or are revealed to be poorly defined, they can be
         updated/changed.


     Doing it as a library type pretty much prevents certain optimizations, for
     example, the fused operations, from being expressed using infix operators.


 You're talking about MADD? I was going to make a separate suggestion regarding
 that actually.
 Multiply-add is a common concept, often available to FPU's aswell (and no way
to
 express it)... I was going to suggest an opMultiplyAdd() operator, which you
 could have the language call if it detects a conforming arrangement of * and +
 operators on a type. This would allow operator access to madd in library
vectors
 too.

Detecting a "conforming arrangement" is how native types work! Once you wed the compiler logic to a particular library implementation, it combines the worst aspects of a native type with the worst aspects of a library type.
 Yeah sure, but I don't think that's fundamentally correct, if you're drifting
 towards typing these things in the language, then you should also start
 considering cast mechanics... and that's a larger topic of debate.
 I don't really think "float4 floatVec = (float4)intVec;" should be a
 reinterpret... surely, as a high level type, this should perform a type
conversion?

That's a good point.
 I'm afraid this is become a lot more complicated than it needs to be.
 Can you illustrate your current thoughts/plan, to have it summarised in one
 place.

Support the 10 vector types as basic types, support them with the arithmetic infix operators, and use intrinsics for the rest of the operations. I believe this scheme:

1. will look better in code, and will be easier to use
2. will allow for better error detection and more comprehensible error messages when things are misused
3. will generate better code
4. shouldn't be hard to implement, as I already did most of the work when I did the SIMD support for float and double.
 Has it drifted from what you said last night?

Yes.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 01:52, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:32 PM, Manu wrote:

 Yeah sure, but I don't think that's fundamentally correct, if you're
 drifting

towards typing these things in the language, then you should also start
 considering cast mechanics... and that's a larger topic of debate.
 I don't really think "float4 floatVec = (float4)intVec;" should be a
 reinterpret... surely, as a high level type, this should perform a type
 conversion?

That's a good point.

.. oh god, what have I done. :/

 I'm afraid this is become a lot more complicated than it needs to be.
 Can you illustrate your current thoughts/plan, to have it summarised in
 one
 place.

 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

 Has it drifted from what you said last night?

Yes.

Okay, I'm very worried at this point. Please don't just do this... There are so many details and gotchas in what you suggest. I couldn't feel comfortable short of reading a thorough proposal. Come on IRC? This requires involved conversation.

I'm sure you realise how much more work this is... Why would you commit to this right off the bat? Why not produce the simple primitive type, and allow me the opportunity to try it with the libraries before polluting the language itself with a massive volume of stuff?

I'm genuinely concerned that once you add this to the language, it's done, and it'll be stuck there like lots of other debatable features... we can tweak the library implementation as we gain experience with usage of the feature.

MS also agree that the primitive __m128 is the right approach. I'm not basing my opinion on their judgement at all, I independently conclude it is the right approach, but it's encouraging that they agree... and perhaps they're a more respectable authority than me and my opinion :)

What I proposed in the OP is the simplest, most non-destructive initial implementation in the language. I think there is the least opportunity for making a mistake/wrong decision in my initial proposal, and it can be extended with what you're suggesting in time, after we have the opportunity to prove that it's correct. We can test and prove the rest with libraries before committing to implement it in the language...
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 4:12 PM, Manu wrote:
 Come on IRC? This requires involved conversation.

I'm on skype.
 I'm sure you realise how much more work this is...

Actually, not that much. Surprising, no? <g> I think I already did the hard stuff by supporting SIMD for float/double.
 Why would you commit to this right off the bat? Why not produce the simple
 primitive type, and allow me the opportunity to try it with the libraries
before
 polluting the language its self with a massive volume of stuff...
 I'm genuinely concerned that once you add this to the language, it's done, and
 it'll be stuck there like lots of other debatable features... we can tweak the
 library implementation as we gain experience with usage of the feature.

If it doesn't work, we can back it out. I'm willing to add it as an experimental feature because I don't see it breaking any existing code.
 MS also agree that the primitive __m128 is the right approach. I'm not basing
my
 opinion on their judgement at all, I independently conclude it is the right
 approach, but it's encouraging that they agree... and perhaps they're a more
 respectable authority than me and my opinion :)

Can you show me a typical example of how it looks in action in source code?
 What I proposed in the OP is the simplest, most non-destructive initial
 implementation in the language. I think there is the lest opportunity for
making
 a mistake/wrong decision in my initial proposal, and it can be extended with
 what you're suggesting in time after we have the opportunity to prove that it's
 correct. We can test and prove the rest with libraries before committing to
 implement it in the language...

I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:52, Walter Bright <newshound2 digitalmars.com> wrote:

 MS also agree that the primitive __m128 is the right approach. I'm not
 basing my

opinion on their judgement at all, I independently conclude it is the right
 approach, but it's encouraging that they agree... and perhaps they're a
 more
 respectable authority than me and my opinion :)

Can you show me a typical example of how it looks in action in source code?

Not without breaking NDAs... but maybe I will anyway, I'll dig some stuff up...
 What I proposed in the OP is the simplest, most non-destructive initial
 implementation in the language. I think there is the lest opportunity for
 making
 a mistake/wrong decision in my initial proposal, and it can be extended
 with
 what you're suggesting in time after we have the opportunity to prove
 that it's
 correct. We can test and prove the rest with libraries before committing
 to
 implement it in the language...

I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.

Symbolic debugger support, eh... now that is a compelling argument! :)

Okay, I'm prepared to reconsider... but I'm still apprehensive.

I'm manuevans on skype, on there now if you want to add me...
Jan 06 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/6/12 5:52 PM, Walter Bright wrote:
 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

I think it would be great to try avoiding the barbarism of adding 10 built-in types and a bunch of built-ins. Historically, D has erred heavily on the side of building things into the compiler. Consider the complex numbers affair, in tow with crackpot science arguments on why they're a must. It's great that that embarrassment is behind us.

Also consider how the hard-coding of associative arrays in an awkward interface inside the runtime has stifled efficient implementations, progress, and innovation in that area. Still a lot of work is needed there, too, to essentially undo a bad decision.

Let's not repeat history. Months later we'll look at the hecatomb of types and builtins we dumped into the language and we'll be like, what were we /thinking/?

Adding built-in types and functions is giving up on good design, judgment, and using what we have creatively. It may mean failure to understand and use the expressive power of the language, and worse, compensating by adding even more poorly designed artifacts to it.

I would very strongly suggest we reconsider the tactics of it all. Yes, it's great to have SIMD support in the language. No, I don't think adding a wheelbarrow to the language is the right way.

Thanks,

Andrei
Jan 06 2012
parent reply Don <nospam nospam.com> writes:
On 07.01.2012 04:18, Andrei Alexandrescu wrote:
 On 1/6/12 5:52 PM, Walter Bright wrote:
 Support the 10 vector types as basic types, support them with the
 arithmetic infix operators, and use intrinsics for the rest of the
 operations. I believe this scheme:

 1. will look better in code, and will be easier to use
 2. will allow for better error detection and more comprehensible error
 messages when things are misused
 3. will generate better code
 4. shouldn't be hard to implement, as I already did most of the work
 when I did the SIMD support for float and double.

I think it would be great to try avoiding the barbarism of adding 10 built-in types and a bunch of built-ins. Historically, D has erred heavily on the side of building in the compiler. Consider the the complex numbers affair, in tow with crackpot science arguments on why they're a must. It's great that embarrassment is behind us.

 Also consider how the hard-coding of associative arrays in
 an awkward interface inside the runtime has stifled efficient
 implementations, progress, and innovation in that area. Still a lot of
 work needed there, too, to essentially undo a bad decision.

Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction!

Moving AAs from a built-in to a library type has been an unmitigated disaster from the implementation side. And it has so far brought us *nothing* in return. Not "hardly anything", but *NOTHING*. I don't even have any idea of what good could possibly come from it. Note that you CANNOT have multiple implementations on a given platform, or you'll get linker errors! So I think there is more pain to come from it.

It seems to have been motivated by religious reasons and nothing more. Why should anyone believe the same argument again?
Jan 07 2012
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 7 January 2012 at 16:10:32 UTC, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest 
 possible terms. I would have mentioned AAs as a very strong 
 argument in the opposite direction!

Amen. AAs are *still* broken from this change. If you take a look at my cgi.d, you'll find this:

    // Referencing this gigantic typeid seems to remind the compiler
    // to actually put the symbol in the object file. I guess the immutable
    // assoc array array isn't actually included in druntime
    void hackAroundLinkerError() {
        writeln(typeid(const(immutable(char)[][])[immutable(char)[]]));
        writeln(typeid(immutable(char)[][][immutable(char)[]]));
        writeln(typeid(Cgi.UploadedFile[immutable(char)[]]));
        writeln(typeid(immutable(Cgi.UploadedFile)[immutable(char)[]]));
        writeln(typeid(immutable(char[])[immutable(char)[]]));
        // this is getting kinda ridiculous btw. Moving assoc arrays
        // to the library is the pain that keeps on coming.

        // eh this broke the build on the work server
        // writeln(typeid(immutable(char)[][immutable(string[])]));
        writeln(typeid(immutable(string[])[immutable(char)[]]));
    }

It was never a problem before... but if I take that otherwise useless function out, it still randomly breaks my builds to this day.
Jan 07 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 10:19 AM, Adam D. Ruppe wrote:
 On Saturday, 7 January 2012 at 16:10:32 UTC, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest possible
 terms. I would have mentioned AAs as a very strong argument in the
 opposite direction!

Amen. AAs are *still* broken from this change.

Well they are broken because the change has not been carried to completion.

I think that baking AAs into the compiler is poor programming language design. There are absolutely no ifs and buts about it, and the matter is obvious enough to me and sufficiently internalized to cause me difficulty arguing it.

Andrei
Jan 07 2012
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 7 January 2012 at 16:55:00 UTC, Andrei Alexandrescu 
wrote:
 Well they are broken because the change has not been carried to 
 completion.

Here's my position: if we get a library implementation that works better than the compiler implementation, let's do it. If they are equal in use, or the library is only worse in minor syntax or some other trivial matter, let's do the library, since that's nicer indeed.

But if the library one doesn't work as well as the compiler implementation, whether due to design, bugs, or any other practical consideration, let's not break things until the library impl catches up. If that's going to take several years, we have to consider the benefit of having it now rather than later too.
Jan 07 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 10:10 AM, Don wrote:
 Sorry Andrei, I have to disagree with that in the strongest possible
 terms. I would have mentioned AAs as a very strong argument in the
 opposite direction!

 Moving AAs from a built-in to a library type has been an unmitigated
 disaster from the implementation side. And it has so far brought us
 *nothing* in return. Not "hardly anything", but *NOTHING*.

It would be premature to conclude from this, as the conversion is incomplete.
 I don't even
 have any idea of what good could possibly come from it.

Using static calls for hashing and comparisons instead of indirect calls comes to mind.

Andrei
Jan 07 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/07/12 17:10, Don wrote:
 On 07.01.2012 04:18, Andrei Alexandrescu wrote:
 Also consider how the hard-coding of associative arrays in
 an awkward interface inside the runtime has stifled efficient
 implementations, progress, and innovation in that area. Still a lot of
 work needed there, too, to essentially undo a bad decision.

Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction! Moving AAs from a built-in to a library type has been an unmitigated disaster from the implementation side. And it has so far brought us *nothing* in return. Not "hardly anything", but *NOTHING*. I don't even have any idea of what good could possibly come from it. Note that you CANNOT have multiple implementations on a given platform, or you'll get linker errors! So I think there is more pain to come from it. It seems to have been motivated by religious reasons and nothing more. Why should anyone believe the same argument again?

Reminded me of this: "static immutable string[string] aa = [ "a": "b" ];" isn't currently possible (AA literals are non-const expressions); could this work w/o compiler support?.. artur
Jan 07 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 8:10 AM, Don wrote:
 Moving AAs from a built-in to a library type has been an unmitigated disaster
 from the implementation side. And it has so far brought us *nothing* in return.
 Not "hardly anything", but *NOTHING*. I don't even have any idea of what good
 could possibly come from it. Note that you CANNOT have multiple implementations
 on a given platform, or you'll get linker errors! So I think there is more pain
 to come from it.
 It seems to have been motivated by religious reasons and nothing more.
 Why should anyone believe the same argument again?

Having a pluggable interface so the implementation can be changed is all right, as long as the binary API does not change. If the binary API changes, then of course, two different libraries cannot be linked together. I strongly oppose any changes which would lead to a balkanization of D libraries. (Consider the disaster C++ has had forever with everyone inventing their own string type. That ensured zero interoperability between C++ libraries, a situation that persists even for 10 years after C++ finally acquired a standard string library.)
Jan 07 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/12 12:48 PM, Walter Bright wrote:
 On 1/7/2012 8:10 AM, Don wrote:
 Moving AAs from a built-in to a library type has been an unmitigated
 disaster
 from the implementation side. And it has so far brought us *nothing*
 in return.
 Not "hardly anything", but *NOTHING*. I don't even have any idea of
 what good
 could possibly come from it. Note that you CANNOT have multiple
 implementations
 on a given platform, or you'll get linker errors! So I think there is
 more pain
 to come from it.
 It seems to have been motivated by religious reasons and nothing more.
 Why should anyone believe the same argument again?

Having a pluggable interface so the implementation can be changed is all right, as long as the binary API does not change. If the binary API changes, then of course, two different libraries cannot be linked together. I strongly oppose any changes which would lead to a balkanization of D libraries.

In my opinion this statement is thoroughly wrong and backwards. I also think it reflects a misunderstanding of what my stance is. Allow me to clarify how I see the situation.

Currently built-in hash table use generates special-cased calls to non-template functions implemented surreptitiously in druntime. The underlying theory, also sustained by the statement quoted above, is that we are interested in supporting linking together object files and libraries BUILT WITH DISTINCT MAJOR RELEASES OF DRUNTIME. There is zero interest for that. ZERO. No language even attempts to do so. Runtimes that are not compatible with their previous versions are common, frequent, and well understood as an issue.

In an ideal world, built-in hash tables should work in a very simple manner. The compiler lowers all special hashtable syntax - in a manner that's MINIMAL, SIMPLE, and CLEAR - into D code that resolves to use of object.di (not some random user-defined library!). From then on, druntime code takes over. It could choose to use templates, dynamic type info, whatever. It's NOT the concern of the compiler. The compiler has NO BUSINESS taking library code and hardwiring it in for no good reason.

This setup allows static and dynamic linking of libraries, as long as the runtimes they were built with are compatible. This is expected, by design, and a good thing.
 (Consider the disaster C++ has had forever with everyone inventing their
 own string type. That ensured zero interoperability between C++
 libraries, a situation that persists even for 10 years after C++ finally
 acquired a standard string library.)

It is exactly this kind of canned statement and prejudice that we must avoid. It unfairly singles out C++ when there also exist incompatible libraries in C, Java, Python, you name it. Also, the last time the claim that everyone invented their own string type could have been credibly aired was around 2004.

What's built inside the compiler is like axioms in math, and what's library is like theorems supported by the axioms. A good language, just like a good mathematical system, has few axioms and many theorems. That means the system is coherent and expressive. Hardwiring stuff in the language definition is almost always a failure of the expressive power of the language. Sometimes it's fine to just admit it and hardwire inside the compiler e.g. the prior knowledge that "+" on int does modulo addition. But most always it's NOT, and definitely not in the context of a complex data structure like a hash table. I also think that adding a hecatomb of built-in types and functions has smells, though to a good extent I concede to the necessity of it.

We should start from what the user wants to accomplish. Then figure how to express that within the language. And only lastly, when needed, change the language to mandate lowering constructs to the MINIMUM EXTENT POSSIBLE into constructs that can be handled within the existing language. This approach has been immensely successful virtually whenever we applied it: foreach for ranges (though there's work left to do there), operator overloading, and too little with hashes. Lately I see a sort of getting lazy and skipping the second pass entirely. Need something? Yeah, what the hell, we'll put it in the language.

I am a bit worried about the increasing radicalization of the discussion here, but recent statements come in frontal collision with my core principles, which I think stand on solid evidential ground. I am appealing for building consensus and staying principled instead of reaching for the cheap solution.
If we do the latter, it's quite likely we'll regret it later. Andrei
Jan 07 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 1:28 PM, Andrei Alexandrescu wrote:
 Having a pluggable interface so the implementation can be changed is all
 right, as long as the binary API does not change.
 If the binary API changes, then of course, two different libraries
 cannot be linked together. I strongly oppose any changes which would
 lead to a balkanization of D libraries.

In my opinion this statement is thoroughly wrong and backwards. I also think it reflects a misunderstanding of what my stance is. Allow me to clarify how I see the situation. Currently built-in hash table use generates special-cased calls to non-template functions implemented surreptitiously in druntime. The underlying theory, also sustained by the statement quoted above, is that we are interested in supporting linking together object files and libraries BUILT WITH DISTINCT MAJOR RELEASES OF DRUNTIME. There is zero interest for that. ZERO. No language even attempts to do so. Runtimes that are not compatible with their previous versions are common, frequent, and well understood as an issue.

We've agreed on this before; perhaps I misstated it here, but I am not talking about changing druntime. I'm talking about someone providing their own hash table implementation that has a different binary API than the one in druntime, such that code from their library cannot be linked with any other code that uses the regular hashtable. A different implementation of hashtable would be fine, as long as it is binary compatible. We did this when we switched from a binary tree collision resolution to a linear one, and the switchover went without a hitch because it did not require even a recompile of existing binaries.
 In an ideal world, built-in hash tables should work in a very simple manner.
The
 compiler lowers all special hashtable syntax - in a manner that's MINIMAL,
 SIMPLE, and CLEAR - into D code that resolves to use of object.di (not some
 random user-defined library!). From then on, druntime code takes over. It could
 choose to use templates, dynamic type info, whatever. It's NOT the concern of
 the compiler. The compiler has NO BUSINESS taking library code and hardwiring
it
 in for no good reason.

That was already true of the hashtables - it's just that the interface to them was through a set of fixed function calls, rather than a template interface. To the compiler, the hashtables were a completely opaque void*. The compiler had zero knowledge of how they actually were implemented inside the runtime. Changing it to a template implementation enables a more efficient interface, as inlining, etc., can be done instead of the slow opApply() interface. The downside of that is it becomes a bit perilous, as the binary API is not so flexible anymore.
 (Consider the disaster C++ has had forever with everyone inventing their
 own string type. That ensured zero interoperability between C++
 libraries, a situation that persists even for 10 years after C++ finally
 acquired a standard string library.)

It is exactly this kind of canned statement and prejudice that we must avoid. It unfairly singles out C++ when there also exist incompatible libraries in C, Java, Python, you name it.

Of course, but strings are a fundamental data type, and so it was worse with C++. I don't agree that my opinion on it is prejudicial or unfair, because many times I was stuck with having to deal with the issues of trying to glue together disparate code that had differing string classes. Often, it was the only incompatibility, but it permeated the library interfaces.
 Also, the last time the claim that everyone invented their own string type
 could have been credibly aired was around 2004.

Sure, people rarely (never?) do their own C++ string classes anymore, but that old code and those old libraries are still around, and are actively maintained. http://msdn.microsoft.com/en-us/library/ms174288.aspx Notice that's for Visual Studio C++ 2010. The string problem was a mistake I was determined not to make with D. I have agreed with you and still agree with the notion of using lowering instead of custom code. Also, keep in mind that the hashtable design was done long before D even had templates. It was "lowered" to what D had at the time - function calls and opApply.
 What's built inside the compiler is like axioms in math, and what's library is
 like theorems supported by the axioms. A good language, just like a good
 mathematical system, has few axioms and many theorems. That means the system is
 coherent and expressive. Hardwiring stuff in the language definition is almost
 always a failure of the expressive power of the language.

True.
 Sometimes it's fine to
 just admit it and hardwire inside the compiler e.g. the prior knowledge that
"+"
 on int does modulo addition.

Right, I understand that the abstraction abilities of D are not good enough to produce a credible 'int' type, or 'float', etc., hence they are wired in.
 But most always it's NOT, and definitely not in the
 context of a complex data structure like a hash table. I also think that adding
 a hecatomb of built-in types and functions has smells, though to a good extent
I
 concede to the necessity of it.

I want to reiterate that I don't think there is a way with the current compiler technology to make a library SIMD type that will perform as well as a builtin one, and those who use SIMD tend to be extremely demanding of performance. (One could make a semantic equivalent, but not a performance equivalent.)
 We should start from what the user wants to accomplish. Then figure how to
 express that within the language. And only lastly, when needed, change the
 language to mandate lowering constructs to the MINIMUM EXTENT POSSIBLE into
 constructs that can be handled within the existing language. This approach has
 been immensely successful virtually whenever we applied it: foreach for ranges
 (though there's work left to do there), operator overloading, and too little
 with hashes. Lately I see a sort of getting lazy and skipping the second pass
 entirely. Need something? Yeah, what the hell, we'll put it in the language.

I don't think that is entirely fair in regards to the SIMD stuff. It reminds me of after I spent a couple years at Caltech, where every class was essentially a math class. My sister asked me for help with her high school trig homework, and I just glanced at it and wrote down all the answers. She said she was supposed to show the steps involved, but to me I was so used to doing it there was only one step. So while it may seem I'm skipping steps with the SIMD, I have been thinking about it for years off and on, and I have a fair experience with what needs to be done to generate good code.
 I am a bit worried about the increasing radicalization of the discussion here,
 but recent statements come in frontal collision with my core principles, which
I
 think stand on solid evidential ground. I am appealing for building consensus
 and staying principled instead of reaching for the cheap solution. If we do the
 latter, it's quite likely we'll regret it later.


 Andrei

Jan 07 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 I don't think that is entirely fair in regards to the SIMD stuff. It reminds
me 
 of after I spent a couple years at Caltech, where every class was essentially
a 
 math class.

I think that in several (but not all) fields of science and technology your limits are often determined by how much (in depth, and especially in how much variety) mathematics you know :-) Unfortunately, a lot of people don't seem able or willing to learn it... Bye, bearophile
Jan 07 2012
prev sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 12:14 AM, Walter Bright wrote:
 On 1/7/2012 1:28 PM, Andrei Alexandrescu wrote:
 But most always it's NOT, and definitely not in the
 context of a complex data structure like a hash table. I also think
 that adding
 a hecatomb of built-in types and functions has smells, though to a
 good extent I
 concede to the necessity of it.

I want to reiterate that I don't think there is a way with the current compiler technology to make a library SIMD type that will perform as well as a builtin one, and those who use SIMD tend to be extremely demanding of performance.

Considering that the entire purpose of SIMD is performance, I think the demand is reasonable :-)
Jan 07 2012
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 What's built inside the compiler is like axioms in math, and what's
 library is like theorems supported by the axioms. A good language, just
 like a good mathematical system, has few axioms and many theorems. That
 means the system is coherent and expressive. Hardwiring stuff in the
 language definition is almost always a failure of the expressive power
 of the language.

Yes, but when it comes to register allocation and platform specific instruction selection, that really is the job of the compiler. It is not something that can be done in a library (without rewriting the compiler in the language, which defeats the purpose of having a language in the first place). I agree that the language should add the minimum number of features to support what we want, although in this case (due to how platform-specific the solutions are) I think it simply requires a lot of work in the compiler.
 We should start from what the user wants to accomplish. Then figure how
 to express that within the language. And only lastly, when needed,
 change the language to mandate lowering constructs to the MINIMUM EXTENT
 POSSIBLE into constructs that can be handled within the existing
 language.

I agree. Essentially, we need at least:

- Some type (or types) that map directly to SIMD registers.
- The type must be separate from static arrays (aligned or not).
- Automatic register allocation, just like other primitive types.
- Automatic instruction scheduling.
- Ability to specify what instructions to use.

I agree with Manu that we should just have a single type like __m128 in MSVC. The other types and their conversions should be solvable in a library with something like strong typedefs. As the *sole* reason for this enhancement is performance, the compiler absolutely must have all the information it needs to produce optimal code.
 I am a bit worried about the increasing radicalization of the discussion
 here, but recent statements come in frontal collision with my core
 principles, which I think stand on solid evidential ground. I am
 appealing for building consensus and staying principled instead of
 reaching for the cheap solution. If we do the latter, it's quite likely
 we'll regret it later.

We also need to be pragmatic. There is no point defining a perfect, modular, clean solution to the problem if it is going to take years to realize. In years, the problem may not exist anymore. This is especially true when it comes to hardware issues like the one we are discussing here.
Jan 07 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 8 January 2012 02:54, Peter Alexander <peter.alexander.au gmail.com>wrote:

 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a library
 with something like strong typedefs.

Walter put in a reasonable effort to sway me to his side of the fence last night. I'm still not entirely sold that implementation inside the language is necessary to achieve these details, but I don't have enough background info to argue, and I'm not the one that has to maintain the code :)

Here are some points we discussed... how do we do these (efficiently) in a library?

** Literal syntax.. and constant folding:

Constants and literals also need to be aligned. If we use array syntax to express literals, this will be a problem.

  int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];

Any constant expressions need to be simplified at compile time: int4 vec = [ 6,8,10,12 ];
Perhaps this is possible with CTFE? Or will it be automatic if you express literals as if they were arrays?

** Expression interpretation/simplification:

  float4 v = -b + a;

Obviously, this should be simplified to 'a - b'.

  float4 v = a*b + c;

This should use a multiply-accumulate opcode on most architectures: FMADDPS v, a, b, c

** Typed debug info

In a debugger it's nice to inspect variables in their supposed type. Can probably use unions to do this... probably wouldn't be as nice though.

** God knows what other optimisations

  float4 v = [ 0,0,0,0 ]; // XOR v
  etc...

I don't know what amount of this is achievable with libraries, but Walter seems to think this will all work much better in the language... I'm inclined to trust his judgement.
Jan 07 2012
next sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 1:32 AM, Manu wrote:
 On 8 January 2012 02:54, Peter Alexander <peter.alexander.au gmail.com
 <mailto:peter.alexander.au gmail.com>> wrote:

     I agree with Manu that we should just have a single type like __m128
     in MSVC. The other types and their conversions should be solvable in
     a library with something like strong typedefs.


 Walter put in a reasonable effort to sway me to his side of the fence
 last night. I'm still not entirely sold that implementation inside the
 language is necessary to achieve these details, but I don't have enough
 background info to argue, and I'm not the one that has to maintain the
 code :)

 Here are some points we discussed... how do we do these (efficiently) in
 a library?

Just to be clear, it was only the types and conversions that I thought would be suitable for a library. Operations, along with their optimisations are best for compiler.
 ** Literal syntax.. and constant folding:

 Constants and literals also need to be aligned. If we use array syntax
 to express literals, this will be a problem.

   int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];

 Any constant expressions need to be simplified at compile time: int4 vec
 = [ 6,8,10,12 ];
 Perhaps this is possible with CTFE? Or will it be automatic if you
 express literals as if they were arrays?

You could use array syntax for vector literals, as long as they are stored directly into vector variables. e.g.

  immutable int4 a = [1, 2, 3, 4];
  immutable int4 b = [5, 6, 7, 8];
  int4 v = a + b;

Constant folding can be done by the compiler, although I don't think this is a priority.
 ** Expression interpretation/simplification:

   float4 v = -b + a;

 Obviously, this should be simplified to 'a - b'.

   float4 v = a*b + c;

 This should use a multiply-accumulate opcode on most architectures:
 FMADDPS v, a, b, c

The compiler should make these decisions, just like it does with int/float etc. In some cases these kinds of simplifications can affect the result due to numeric issues. You can use expression templates for this sort of thing as well, but they are a horrible mess, so I don't think I'd like to see them.
 ** Typed debug info

 In a debugger it's nice to inspect variables in their supposed type.
 Can probably use unions to do this... probably wouldn't be as nice though.

Good point. I'm not an expert on this, but I suspect that a union would be good enough?
 ** God knows what other optimisations

 float4 v = [ 0,0,0,0 ]; // XOR v
 etc...

Again, I think you could use expression templates for this, but it's so much simpler to leave this optimisation to the compiler. Even if the compiler doesn't do it, it's not difficult to do it manually when you really need it:

  float4 v = void;
  asm { pxor v, v; }

Honestly, I'm not too bothered with these types of optimisations. As long as the compiler does the register allocation and instruction scheduling for me, I would be 99% happy because those things are the most tedious when trying to write structured code. I can easily enough change (-b + a) to (a - b) if that's faster, or insert specific instructions for generating vector constants, or do constant folding manually. Of course, it would be nice if the compiler did them, but that's just icing on the cake. The meat of the problem is register allocation.
 I don't know what amount of this is achievable with libraries, but
 Walter seems to think this will all work much better in the language...
 I'm inclined to trust his judgement.

I agree.
Jan 07 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 5:32 PM, Manu wrote:
 Here are some points we discussed... how do we do these (efficiently) in a
library?

Another issue - matching the name mangling and parameter passing/return conventions of how other C/C++ compilers deal with vector types. That is currently not doable with a library type.
Jan 07 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 4:54 PM, Peter Alexander wrote:
 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.
Jan 07 2012
parent reply Manu <turkeyman gmail.com> writes:
On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/7/2012 4:54 PM, Peter Alexander wrote:

 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?
Jan 07 2012
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 1:48 AM, Manu wrote:
 On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/7/2012 4:54 PM, Peter Alexander wrote:

         I think it simply requires a lot of work in the compiler.


     Not that much work. Most of it segues nicely into the previous work
     I did supporting the XMM floating point code gen.


 What is this previous work you speak of? Is there already XMM stuff in
 there somewhere?

On 64-bit, floats are stored in XMM registers (just as single scalars). I don't think it does any vectorization yet though. It does mean that the register allocation of those registers is already complete though.
Jan 07 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/7/2012 6:32 PM, Peter Alexander wrote:
 On 64-bit, floats are stored in XMM registers (just as single scalars).

Yes.
 I don't think it does any vectorization yet though.

Right. It doesn't do that.
 It does mean that the register
 allocation of those registers is already complete though.

Yup. Does a nice job of it, too :-)
Jan 07 2012
prev sibling parent reply "a" <a a.com> writes:
On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:
 On 8 January 2012 03:44, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 On 1/7/2012 4:54 PM, Peter Alexander wrote:

 I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?

DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM registers and instructions that work on them (addss, addsd, mulsd...) for scalar floating point operations.
Jan 08 2012
parent Manu <turkeyman gmail.com> writes:
On 8 January 2012 11:56, a <a a.com> wrote:

 On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:

 On 8 January 2012 03:44, Walter Bright <newshound2 digitalmars.com>
 wrote:

  On 1/7/2012 4:54 PM, Peter Alexander wrote:
  I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

What is this previous work you speak of? Is there already XMM stuff in there somewhere?

DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM registers and instructions that work on them (addss, addsd, mulsd...) for scalar floating point operations.

Yeah of course! >_< I forgot that they did that in x64 (I never work with x64), but I recall thinking that was the single most awesome change to the architecture! :)
Jan 08 2012
prev sibling parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
MS has three types, __m128, __m128i and __m128d  (float, int, double)

Six if you count AVX's 256 forms.

On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.

Jan 14 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme is, given the following:

  __m128i v;
  v += 2;

it can't tell what to do. With D,

  int4 v;
  v += 2;

it's clear (add 2 to each of the 4 ints).
Jan 14 2012
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/15/2012 12:09 AM, Walter Bright wrote:
 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:
 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme, is given the following: __m128i v; v += 2; Can't tell what to do. With D, int4 v; v += 2; it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can start having sensible operator overloads, and so you can write code that is readable.

  if (any4(a > b))
  {
      // do stuff
  }

is way way way better than (pseudocode)

  if (__movemask_ps(_mm_gt_ps(a, b)) != 0)
  {
  }

and (if the ternary operator were overridable in C++)

  float4 foo = (a > b) ? c : d;

would be better than

  float4 mask = _mm_gt_ps(a, b);
  float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps(mask, d));
Jan 14 2012
parent reply Manu <turkeyman gmail.com> writes:
On 15 January 2012 08:16, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:

 On 1/15/2012 12:09 AM, Walter Bright wrote:

 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:

 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:

 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme, is given the following: __m128i v; v += 2; Can't tell what to do. With D, int4 v; v += 2; it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can start having sensible operator overloads, and so you can write code that is readable. if (any4(a > b)) { // do stuff } is way way way better than (pseudocode) if (__movemask_ps(_mm_gt_ps(a, b)) == 0x0F) { } and (if the ternary operator was overrideable in C++) float4 foo = (a > b) ? c : d; would be better than float4 mask = _mm_gt_ps(a, b); float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps_(mask, d));

Yep, it's coming... baby steps :) Walter: I told you games devs would be all over this! :P
Jan 15 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 15.01.2012, 11:45 Uhr, schrieb Manu <turkeyman gmail.com>:

 On 15 January 2012 08:16, Sean Cavanaugh <WorksOnMyMachine gmail.com>  
 wrote:

 On 1/15/2012 12:09 AM, Walter Bright wrote:

 On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:

 MS has three types, __m128, __m128i and __m128d (float, int, double)

 Six if you count AVX's 256 forms.

 On 1/7/2012 6:54 PM, Peter Alexander wrote:

 On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
 I agree with Manu that we should just have a single type like __m128  
 in
 MSVC. The other types and their conversions should be solvable in a
 library with something like strong typedefs.


The trouble with MS's scheme is that, given the following:

    __m128i v;
    v += 2;

it can't tell what to do. With D:

    int4 v;
    v += 2;

it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinlined functions), so you can have sensible operator overloads and write code that is readable.

    if (any4(a > b))
    {
        // do stuff
    }

is way way way better than (pseudocode)

    if (__movemask_ps(_mm_gt_ps(a, b)) == 0x0F)
    {
    }

and (if the ternary operator were overridable in C++)

    float4 foo = (a > b) ? c : d;

would be better than

    float4 mask = _mm_gt_ps(a, b);
    float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_nand_ps_(mask, d));

Yep, it's coming... baby steps :) Walter: I told you games devs would be all over this! :P

And even compression algorithms. I found one written in C that uses external .asm files, compiled into object files with NASM for use on the linker command line. They contain some MMX/SSE code depending on the processor you plan to use. The author claims that the MMX versions of the 'outsourced' routines run 8x faster. I didn't verify this, but the idea that these instructions become part of the language and easy to use for regular programmers like me (and not just console game developers) is exciting. I bet there are more programs that could benefit from SSE than is obvious, or code that could be rewritten in a way that lets multiple data sets be processed simultaneously.
Jan 16 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 5:06 AM, Marco Leise wrote:
 I bet there are more programs that
 could benefit from SSE than is obvious, or code that could be rewritten in a way
 that lets multiple data sets be processed simultaneously.

I think there's quite a bit more; it's just that using SIMD instructions has historically been so clumsy that few take advantage. For example, a memchr operation could be dramatically sped up with SIMD, which has implications for regex.
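As a concrete illustration of the memchr point, here is a minimal SSE2 sketch in C. The `memchr_sse2` name is hypothetical (this is not the libc implementation); alignment handling and per-platform tuning are deliberately omitted:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sketch of an SSE2-accelerated memchr: compare 16 bytes at a time,
 * then use the compare mask to locate the first match. Illustrative
 * only; real implementations also tune for alignment and page ends. */
const void *memchr_sse2(const void *s, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    __m128i needle = _mm_set1_epi8((char)c);       /* broadcast target byte */

    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                                  /* some byte matched */
            return p + __builtin_ctz(mask);        /* index of first match */
        p += 16;
        n -= 16;
    }
    for (; n; --n, ++p)                            /* scalar tail */
        if (*p == (unsigned char)c)
            return p;
    return NULL;
}
```

The same single-compare-per-16-bytes structure is what makes a SIMD regex literal scan fast.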
Jan 16 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 21:21, Walter Bright <newshound2 digitalmars.com> wrote:

 1. the language does typechecking, for example, trying to add a vector of
 4 floats to 16 bytes would be (and should be) an error.

I want to sell you on the point that primitive SIMD registers are truly typeless (which I thought you had already agreed with). :) Here are some examples of tight interaction between int/float, and of operating ON floats with int operations... Naturally the examples I present will be wrapped as useful functions in libraries, but the primitive type shouldn't make this more annoying by enforcing pointless type safety errors like you seem to be suggesting.

In computer graphics it's common to work with float16s, a type not supported by SIMD units. Pack/unpack code involves detailed float/int interaction. You might take a register of floats, mask the exponent, and then perform integer arithmetic on the exponent to shift it into the float16 exponent range... then you mask the bottom of the mantissa and shift the bits into place. Unpacking is the same process in reverse.

There are other tricks with the float sign bits: making everything negative by OR-ing 1s into the top bits, or gathering the signs using various techniques... useful for identifying the cell in a quad-tree, for instance. Integer manipulation of floats is surprisingly common.
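The float16 pack described above boils down to mask-and-shift integer arithmetic on the float's bits. Here is a scalar C sketch of the idea (simplified: rounding, NaN/Inf and denormals are ignored, in the spirit of the post's note that games code disables those); each mask and shift maps one-to-one onto SIMD integer ops applied to a whole register of floats:

```c
#include <stdint.h>
#include <string.h>

/* float -> float16 packing via integer manipulation of the float's
 * bits. Simplified sketch: truncates the mantissa, flushes small
 * values to zero, clamps large ones to infinity. */
uint16_t pack_float16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                 /* reinterpret, not convert */

    uint32_t sign     = (bits >> 16) & 0x8000;      /* sign into half position */
    int32_t  exponent = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; /* re-bias */
    uint32_t mantissa = (bits >> 13) & 0x03FF;      /* top 10 mantissa bits */

    if (exponent <= 0)  return (uint16_t)sign;             /* flush to zero */
    if (exponent >= 31) return (uint16_t)(sign | 0x7C00);  /* clamp to Inf */
    return (uint16_t)(sign | ((uint32_t)exponent << 10) | mantissa);
}
```

On a SIMD unit the `memcpy` disappears entirely: the register already holds the bits, and the question is only whether the type system lets you apply integer shifts to them.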
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 12:45 PM, Manu wrote:
 Here are some examples of tight interacting between int/float, and interacting
 ON floats with int operations...
 Naturally the examples I present will be wrapped as useful functions in
 libraries, but the primitive type shouldn't try and make this more annoying by
 trying to enforce pointless type safety errors like you seem to be suggesting.

I am suggesting it, no doubt about it!
 In computer graphics it's common to work with float16's, a type not supported
by
 simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them into
place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by or-ing in
 1's into the top bits. or you can gather the signs using various techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type.

I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX: sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocally state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.
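A minimal C sketch of what "a few reinterpret casts" looks like for the sign-bit trick mentioned above. The `force_negative` helper is hypothetical; `memcpy` is standing in for a reinterpreting cast, which is C's well-defined way to view float bits as an integer:

```c
#include <stdint.h>
#include <string.h>

/* Force the sign bit on: the "make everything negative" trick, with
 * the reinterpret step spelled out explicitly instead of shifting or
 * OR-ing a float directly. */
float force_negative(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret float as uint32 */
    bits |= 0x80000000u;             /* OR a 1 into the sign bit */
    memcpy(&f, &bits, sizeof f);     /* reinterpret back to float */
    return f;
}
```

The argument here is that the two `memcpy` lines document intent; the counter-argument in the thread is that with vectors this boilerplate appears on nearly every line.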
Jan 06 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 I've worked a lot with large assembler programs. As you know, EAX has no type. 
 The assembler code would constantly shift the type of things that were in EAX, 
 sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating
a 
 pointer as an int, etc. I can unequivocably state that this typeless approach
is 
 confusing, buggy, hard to untangle, and ultimately a freedom that is not 
 justifiable.

There is even some desire for a typed assembly. It's not easy to design and implement, but it seems able to avoid some bugs: http://www.cs.cornell.edu/talc/papers.html Bye, bearophile
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:00, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 12:45 PM, Manu wrote:

 In computer graphics it's common to work with float16's, a type not
 supported by

simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then
 perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them
 into place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by
 or-ing in
 1's into the top bits. or you can gather the signs using various
 techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type. I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX, sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocably state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.

To be clear, I'm not opposing strongly typed vector types... that's my primary goal too. But they're not as simple as I think you believe.

From experience: Microsoft provides __m128, but GCC does what you're proposing (although I get the feeling it's not a 'proposal' anymore). GCC uses 'vector float', 'vector int', 'vector unsigned short', etc... I hate writing vector code the GCC way; it's really ugly. The lines tend to become dominated by casts, and it's all for nothing, since it all gets wrapped up behind a library anyway.

Secondly, you're introducing confusion. A cast from float4 to int4... does it reinterpret, or does it type convert? In GCC it reinterprets, but what do you actually expect? And regardless of what you expect, what do you actually WANT most of the time? I'm sure you'll agree that the expected/'proper' thing would be a type conversion (and I know you're into 'proper'-ness), but in practice you almost always want to reinterpret. This inevitably leads to ugly reinterpret syntax all over the place. If it were a typeless vector reg type, it all goes away.

Despite all this worry and effort, NOBODY will ever use these strongly typed (but still primitive) types of yours. They will need to be extended with bunches of methods, which means wrapping them up in libraries anyway to add all the higher level functionality... so what's the point? The only reason people will use them is to wrap them up in a library of their own, at which point I promise you they'll be just as annoyed as me by the typing and the need for casts all over the place to pass them into basic intrinsics.

But if you're insistent on doing this, can you detail the proposal...
 * What types will exist?
 * How will each one cast/interact?
 * What about error conditions/exceptions? How do I control these... on a per-type basis?
 * What about CTFE? Will you add understanding for every operation supported by each type? This is easily handled in a library...
 * How will you assign literals? How can you assign a typeless literal (a single 128bit value, used primarily for masks)?
 * What operators will be supported... and what will they do?
 * Will you extend support for 64bit and 256bit vector types? That's a whole bundle more types again...

I really feel this is polluting the language. ...Is this whole thing just so you can support MADD? If so, there are others to worry about too...
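The reinterpret-vs-convert ambiguity raised above can be demonstrated directly with GCC/Clang vector extensions, which are roughly the scheme under discussion: a plain cast between same-sized vector types reinterprets the bits, while an element-wise conversion needs `__builtin_convertvector` (Clang, GCC 9+). The `casts_agree` helper is illustrative only:

```c
#include <stdint.h>

/* GCC/Clang vector extensions: two 16-byte vector types. */
typedef float   float4 __attribute__((vector_size(16)));
typedef int32_t int4   __attribute__((vector_size(16)));

int casts_agree(void)
{
    float4 f = {1.5f, 2.5f, 3.5f, 4.5f};

    int4 reint = (int4)f;                          /* reinterpret: raw bit patterns */
    int4 conv  = __builtin_convertvector(f, int4); /* convert: truncate each lane */

    /* lane 0: 1.5f has bit pattern 0x3FC00000; converting 1.5f gives 1 */
    return reint[0] == 0x3FC00000 && conv[0] == 1 && conv[3] == 4;
}
```

Two very different operations hide behind "cast float4 to int4", which is exactly the confusion the post complains about.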
Jan 06 2012
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 7 January 2012 00:38, Manu <turkeyman gmail.com> wrote:
 On 7 January 2012 02:00, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/6/2012 12:45 PM, Manu wrote:
 In computer graphics it's common to work with float16's, a type not
 supported by

 simd units. Pack/Unpack code involved detailed float/int interaction.
 You might take a register of floats, then mask the exponent and then
 perform
 integer arithmetic on the exponent to shift it into the float16 exponent
 range... then you will mask the bottom of the mantissa and shift them
 into place.
 Unpacking is same process in reverse.

 Other tricks with the float sign bits, making everything negative, by
 or-ing in
 1's into the top bits. or you can gather the signs using various
 techniques..
 useful for identifying the cell in a quad-tree for instance.
 Integer manipulation of floats is surprisingly common.

I'm aware of such tricks, and actually do them with the floating point code generation in the compiler back end. I don't think that renders the idea that floats and ints should be different types a bad one. I'd also argue that such tricks are tricks, and using a reinterpret cast on them makes it clear in the code that you know what you're doing, rather than doing something bizarre like a left shift on a float type. I've worked a lot with large assembler programs. As you know, EAX has no type. The assembler code would constantly shift the type of things that were in EAX, sometimes a pointer, sometimes an int, sometimes a ushort, sometimes treating a pointer as an int, etc. I can unequivocably state that this typeless approach is confusing, buggy, hard to untangle, and ultimately a freedom that is not justifiable. Static typing is a big improvement, and having to insert a few reinterpret casts is a good thing, not a detriment.

To be clear, I'm not opposing strongly typing vector types... that's my primary goal too. But they're not as simple I think you believe. From experience, microsoft provices __m128, but GCC does what you're proposing (although I get the feeling it's not a 'proposal' anymore). GCC uses 'vector float', 'vector int', 'vector unsigned short', etc... I hate writing vector code the GCC way, it's really ugly. The lines tend to become dominated by casts, and it's all for nothing, since it all gets wrapped up behind a library anyway. Secondly, you're introducing confusion. A cast from float4 to int4... does it reinterpret, or does it type convert? In GCC it reinterprets, but what do you actually expect? and regardless of what you expect, what do you actually WANT most of the time... I'm sure you'll agree that the expected/'proper' thing would be a type conversion (and I know you're into 'proper'-ness), but in practise you almost always want to reinterpret. This inevitably leads to ugly reinterpret syntax all over the place. If it were a typeless vector reg type, it all goes away.

FYI, vector conversion in GCC is roughly equivalent to the idiom *(float4 *)&X; in C. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 14:44:53 +0100, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 14:56, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <
 newshound2 digitalmars.com> wrote:

 One caveat is it is typeless; a __v128 could be used as 4 packed ints  
 or
 2 packed doubles. One problem with making it typed is it'll add 10 more
 types to the base compiler, instead of one. Maybe we should just bite  
 the
 bullet and do the types:

     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16

Those could be typedefs, i.e. alias this wrapper. Still simdop would not be typesafe.

I think they should be well-defined structs with lots of type safety and sensible methods, not just a typedef of the typeless primitive.
 As much as this proposal presents a viable solution,
 why not spending the time to extend inline asm.

I think there are too many risky problems with the inline assembler (as raised in my discussion about supporting pseudo registers in inline asm blocks):
 * No way to allow the compiler to assign registers (pseudo registers)

That's what I propose he should do. IMHO it's a huge improvement when register variables can be used directly in asm:

    int a, b;
    __vec128 c;
    asm (a, b, c)
    {
        mov EAX, a;
        add b, EAX;
        movps XMM1, c;
        mulps c, XMM1;
    }

The compiler has enough knowledge to do this, and it's the common basic block spilling scheme that is used here.

There is another benefit. Consider the following:

    __vec128 addps(__vec128 a, __vec128 b) pure
    {
        __vec128 res = a;

        if (__ctfe)
        {
            foreach(i; 0 .. 4)
                res[i] += b[i];
        }
        else
        {
            asm (b, res)
            {
                addps res, b;
            }
        }
        return res;
    }
   * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be when
 trying to optimise around 100 small inlined functions each containing its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics. The only implementation issue could be that lots of inlined asm snippets make plenty of basic blocks, which could slow down certain compiler algorithms.
   * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

??? This would be needed for opcodes as well. Your initial goal was to directly influence code gen down to the instruction level; how should that be achieved without platform-specific extensions? Quite the contrary: with ops and asm he will need two hack paths into gcc's codegen. What I see here is that we can do much good for the inline assembler while achieving the same goal. With intrinsics, on the other hand, we're adding a very specialized maintenance burden.
 What would be needed?
  - Implement the asm allocation logic.
  - Functions containing asm statements should participate in inlining.
  - Determining inline cost of asm statements.

I raised these points in my other thread; these are all far more complicated problems, I think, than exposing opcode intrinsics would be. Opcode intrinsics are almost certainly the way to go. When being used with typedefs for __vubyte16 et al. this would
 allow a really clean and simple library implementation of intrinsics.

The type safety you're imagining here might actually be annoying when working with the raw type and opcodes. Consider this common situation and the code that will be built around it: __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also, mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed internally by the CPU.
 pack
 some other useful data in W
 If vec were strongly typed, I would now need to start casting all over  
 the
 place to use various float and uint opcodes on this value?
 I think it's correct when using SIMD at the raw level to express the type
 as it is, typeless... SIMD regs are infact typeless regs, they only gain
 concept of type the moment you perform an opcode on it, and only for the
 duration of that opcode.

 You will get your strong type safety when you make use of the float4  
 types
 which will be created in the libs.

Jan 06 2012
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 1/6/2012 12:43 AM, Walter Bright wrote:
 Declare one new basic type:
 
     __v128
 
 which represents the 16 byte aligned 128 bit vector type. The only operations
defined to work on it would be
 construction and assignment. The __ prefix signals that it is non-portable.
 
 Then, have:
 
    import core.simd;
 
 which provides two functions:
 
    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);

How is making __v128 a builtin type better than defining it as:

    align(16) struct __v128
    {
        ubyte[16] data;
    }
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 10:25 AM, Brad Roberts wrote:
 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
      ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.
Jan 06 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/6/12 1:11 PM, Walter Bright wrote:
 On 1/6/2012 10:25 AM, Brad Roberts wrote:
 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
 ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.

If it's possible, then it would be great to express the new constructs within the existing language (optionally by leaving it to the implementation to strengthen guarantees of certain constructs). I very warmly recommend avoiding defining things in the language and compiler wherever the same is possible within a library (however non-portable). Confining features to the language/compiler drastically reduces the number of people that can work on them. Andrei
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 23:23, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org>wrote:

 On 1/6/12 1:11 PM, Walter Bright wrote:

 On 1/6/2012 10:25 AM, Brad Roberts wrote:

 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
 ubyte[16] data;
 }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.

If it's possible, then it would be great to express the new constructs within the existing language (optionally by leaving it to the implementation to strengthen guarantees of certain constructs).

Now you're at odds with Walter's new take on it... He seems to have changed his mind and decided library implementation of the complex/strict types is a bad idea now?
 I very warmly recommend avoiding defining things in the language and
 compiler wherever the same is possible within a library (however
 non-portable). Confining features to the language/compiler drastically
 reduces the number of people that can work on them.

Aye, and my proposal requests only the minimum support required from the language, allowing libraries to do the rest. For some reason Walter seems to have done a bit of a 180 in the last few hours ;)
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:46 PM, Manu wrote:
 For some reason Walter seems to have done a bit of a 180 in the last few hours
;)

It must be the drugs!
Jan 06 2012
parent Manu <turkeyman gmail.com> writes:
On 7 January 2012 01:34, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:46 PM, Manu wrote:

 For some reason Walter seems to have done a bit of a 180 in the last few
 hours ;)

It must be the drugs!

That's what I was starting to suspect too! :P
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:

 There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
    __vec128 res = a;

    if (__ctfe)
    {
        foreach(i; 0 .. 4)
           res[i] += b[i];
    }
    else
    {
        asm (b, res)
        {
            addps res, b;
        }
    }
    return res;

 }

You don't need to use inline ASM to be able to do this, it will work the same with intrinsics. I've detailed numerous problems with using inline asm, and complications with extending the inline assembler to support this. * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be when
 trying to optimise around 100 small inlined functions each containing its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics.

Most compilers can't reschedule code around inline asm blocks. There are a lot of reasons for this; Google can help you. The main reason is that a COMPILER doesn't attempt to understand the assembly it's being asked to insert inline. The information that it may use for optimisation is never present, so it can't do its job.
 The only implementation issue could be that lots of inlined asm snippets
 make plenty basic blocks which could slow down certain compiler algorithms.

Same problem as above. The compiler would need to understand enough about assembly to perform optimisation on the assembly itself to clean this up. Using intrinsics, all the register allocation, load/store code, etc., is in the regular realm of compiling the language, and the code generation and optimisation all work as usual.

 * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

  ???

This would be needed for opcodes as well. You initial goal was to directly influence code gen up to instruction level, how should that be achieved without platform specific extension. Quite contrary with ops and asm he will need two hack paths into gcc's codegen.

 What I see here is that we can do much good things to the inline
 assembler while achieving the same goal.
 With intrinsics on the other hand we're adding a very specialized
 maintenance burden.

You need to understand how the inline assembler works in GCC to understand the problems with this. GCC basically receives a string containing assembly code. It does not attempt to understand it; it just pastes it into the .s file verbatim. This means you can support any architecture without any additional work... you just type the appropriate architecture's asm in your program and it's fine... but now if we want to perform pseudo-register assignment, or parameter substitution, we need a front end that parses the D asm expressions and generates a valid asm string for GCC. It can't generate that string without detailed knowledge of the architecture it's targeting, and it's not feasible to implement that support for all the architectures GCC supports.

Even after all that, it's still not ideal. Inline asm reduces the ability of the compiler to perform many optimisations.

 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour ); //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed CPU internally.

It's a very good idea: I am saving memory, and also saving memory accesses. This leads back to the point in my OP where I said that most games programmers turn NaN, Den, and FP exceptions off. As I've also raised before, most vectors are actually float[3]s; W is usually ignored and contains rubbish. It's conventional to stash some 32bit value in the W to fill the otherwise wasted space, and also get the load for free alongside the position. The typical program flow, in this case:
 * The colour will be copied out into a separate register, where it will be reinterpreted as a uint and have an unpack process applied to it.
 * XYZ will then be used to perform maths, ignoring W, which will continue to accumulate rubbish values... it doesn't matter, all FP exceptions and such are disabled.
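A C/SSE2 sketch of the extraction step described above. The `extract_packed_colour` helper is hypothetical; the point is that `_mm_castps_si128` is the "reinterpret, don't convert" operation, so the colour bits never pass through float arithmetic:

```c
#include <emmintrin.h>
#include <stdint.h>

/* {x, y, z, packedColour} loaded as one 128-bit value: pull the raw
 * 32 bits out of the W lane as an integer, leaving the float lanes
 * untouched for XYZ maths. */
uint32_t extract_packed_colour(__m128 v)
{
    __m128i vi = _mm_castps_si128(v);  /* reinterpret lanes as integers, no conversion */
    __m128i w  = _mm_shuffle_epi32(vi, _MM_SHUFFLE(3, 3, 3, 3)); /* W lane to lane 0 */
    return (uint32_t)_mm_cvtsi128_si32(w);
}
```

Because the extraction stays in the integer domain, it works even when the packed colour's bit pattern happens to be a denormal or NaN as a float.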
Jan 06 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 6 January 2012 20:25, Brad Roberts <braddr puremagic.com> wrote:

 How is making __v128 a builtin type better than defining it as:

 align(16) struct __v128
 {
    ubyte[16] data;
 }

Where in that code is the compiler informed that your structure should occupy a SIMD registers, and apply SIMD ABI conventions?
Jan 06 2012
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 1/6/2012 11:06 AM, Manu wrote:
 On 6 January 2012 20:25, Brad Roberts <braddr puremagic.com
<mailto:braddr puremagic.com>> wrote:
 
     How is making __v128 a builtin type better than defining it as:
 
     align(16) struct __v128
     {
        ubyte[16] data;
     }
 
 
 Where in that code is the compiler informed that your structure should occupy
a SIMD registers, and apply SIMD ABI
 conventions?

Good point; those rules would need to be added. I'd argue that it's not unreasonable to allow any properly aligned and sized type to occupy those registers, though that's likely not optimal for cases that won't actually use the operations that modify them. However, as a counter example, it'd be a lot easier to write a memcpy routine that uses them without having to resort to asm code under this theoretical model.
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a memcpy routine
that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy. Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.
Jan 06 2012
next sibling parent Brad Roberts <braddr puremagic.com> writes:
On Fri, 6 Jan 2012, Walter Bright wrote:

 On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a memcpy routine
 that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy. Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.

Oh, I completely agree. Intel has people who work on that as their primary job. There's a constant trickle of changes going into glibc's mem{cpy,cmp}-type routines to specialize for each of the ever-evolving set of platforms out there. No way should that effort be duplicated. All I was pondering was how much cleaner much of that could be if it were expressed in higher level representations. But you'd still wind up playing serious tweaking and validation games that would largely, if not completely, invalidate the utility of being expressed in higher level forms. Probably. Later, Brad
Jan 06 2012
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:
 On 1/6/2012 11:16 AM, Brad Roberts wrote:
 However, a counter example, it'd be a lot easier to write a 
 memcpy routine that uses them
 without having to resort to asm code under this theoretical 
 model.

I would seriously argue that individuals not attempt to write their own memcpy.

Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions, however they are GPL.
 Why? Because the C one has had probably thousands of 
 programmers looking at it for the last 30 years. You're not 
 going to spend 5 minutes, or even 5 days, and make it faster.

This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.
Jan 06 2012
parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 03:46, Vladimir Panteleev <vladimir thecybershadow.net>wrote:

 On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:

 On 1/6/2012 11:16 AM, Brad Roberts wrote:

 However, a counter example, it'd be a lot easier to write a memcpy
 routine that uses them
 without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy.

Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions, however they are GPL. Why? Because the C one has had probably thousands of programmers looking
 at it for the last 30 years. You're not going to spend 5 minutes, or even 5
 days, and make it faster.

This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.

I've never seen a memcpy on any console system I've ever worked on that takes advantage of its large registers... writing a fast memcpy is usually one of the first things we do when we get a new platform ;)
Jan 06 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/6/2012 7:58 PM, Manu wrote:
 On 7 January 2012 03:46, Vladimir Panteleev <vladimir thecybershadow.net> wrote:

 I've never seen a memcpy on any console system I've ever worked on that
 takes advantage of its large registers... writing a fast memcpy is
 usually one of the first things we do when we get a new platform ;)

Plus, memcpy is optimized for reading and writing cached virtual memory, so you need several other variants to write to write-combined or uncached memory efficiently, and whatnot.
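As a hypothetical illustration of this point (not code from the thread): the cached and write-combined cases differ mainly in which store instruction is used. This sketch assumes SSE2, a 16-byte-aligned destination, and a size that is a multiple of 16; function names are invented.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Copy via regular (cache-allocating) stores: good for memory that will
// be read again soon.
void copy_cached(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(d + i, _mm_loadu_si128(s + i));
}

// Copy via non-temporal (streaming) stores: bypasses the cache, which is
// what you want for write-combined or frame-buffer-like memory.
void copy_streaming(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  // make the streaming stores globally visible
}
```

A real implementation would additionally handle unaligned heads/tails and small sizes; the point here is only that one memcpy cannot serve both memory types well.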
Jan 14 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman gmail.com> wrote:

 On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:

 There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
    __vec128 res = a;

    if (__ctfe)
    {
        foreach(i; 0 .. 4)
           res[i] += b[i];
    }
    else
    {
        asm (res, b)
        {
            addps res, b;
        }
    }
    return res;

 }

You don't need to use inline ASM to be able to do this, it will work the same with intrinsics. I've detailed numerous problems with using inline asm, and complications with extending the inline assembler to support this.

Don't get me wrong here. The idea is to find out whether intrinsics can be built with the help of inlineable asm functions. The ctfe support is one good reason to go with a library solution.
  * Assembly blocks present problems for the optimiser, it's not reliable
 that it can optimise around an inline asm blocks. How bad will it be  
 when
 trying to optimise around 100 small inlined functions each containing  
 its
 own inline asm blocks?

What do you mean by optimizing around? I don't see any apparent reason why that should perform worse than using intrinsics.

 Most compilers can't reschedule code around inline asm blocks. There are
 a lot of reasons for this, google can help you. The main reason is that a
 COMPILER doesn't attempt to understand the assembly it's being asked to
 insert inline. The information that it may use for optimisation is never
 present, so it can't do its job.

It doesn't have to understand the assembly. Wrapping these in functions creates an IR expression with inputs and outputs. Declaring them as pure gives the compiler free hands to apply whatever optimizations it does normally on an IR tree: common subexpression elimination, removing dead expressions...


 The only implementation issue could be that lots of inlined asm snippets
 make plenty basic blocks which could slow down certain compiler  
 algorithms.

Same problem as above. The compiler would need to understand enough about assembly to perform optimisation on the assembly itself to clean this up. Using intrinsics, all the register allocation, load/store code, etc., is in the regular realm of compiling the language, and the code generation and optimisation will all work as usual.

There is no informational difference between the intrinsic

__m128 _mm_add_ps(__m128 a, __m128 b);

and an inline assembler version

__m128 _mm_add_ps(__m128 a, __m128 b) { asm { addps a, b; } }
  * D's inline assembly syntax has to be carefully translated to GCC's
 inline asm format when using GCC, and this needs to be done
 PER-ARCHITECTURE, which Iain should not be expected to do for all the
 obscure architectures GCC supports.

  ???

This would be needed for opcodes as well. Your initial goal was to directly influence code generation down to the instruction level; how should that be achieved without platform-specific extensions? Quite the contrary: with ops and asm he will need two hack paths into gcc's codegen.

 What I see here is that we can do much good things to the inline
 assembler while achieving the same goal.
 With intrinsics on the other hand we're adding a very specialized
 maintenance burden.

You need to understand how the inline assembler works in GCC to understand the problems with this. GCC basically receives a string containing assembly code. It does not attempt to understand it, it just pastes it in the .s file verbatim. This means you can support any architecture without any additional work... you just type the appropriate architecture's asm in your program and it's fine... but now if we want to perform pseudo-register assignment, or parameter substitution, we need a front end that parses the D asm expressions and generates a valid asm string for GCC. It can't generate that string without detailed knowledge of the architecture it's targeting, and it's not feasible to implement that support for all the architectures GCC supports.

So the argument here is that intrinsics in D can be mapped more easily to existing intrinsics in GCC? I do understand that this will be pretty difficult for GDC to implement. Reminds me that Walter has stated several times how much better an internal assembler can integrate with the language.
 Even after all that, It's still not ideal.. Inline asm reduces the  
 ability
 of the compiler to perform many optimisations.

 Consider this common situation and the code that will be built around it:
 __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //

Such is really not a good idea if the bit pattern of packedColour is a denormal. How can you even execute a single useful command on the floats here? Also, mixing integer and FP instructions on the same register may cause performance degradation. The registers are indeed typed internally in the CPU.

It's a very good idea: I am saving memory, and also saving memory accesses. This leads back to the point in my OP where I said that most games programmers turn NaN, Den, and FP exceptions off. As I've also raised before, most vectors are actually float[3]'s; W is usually ignored and contains rubbish. It's conventional to stash some 32bit value in the W to fill the otherwise wasted space, and also get the load for free alongside the position. The typical program flow, in this case:
 * the colour will be copied out into a separate register where it will be reinterpreted as a uint, and have an unpack process applied to it.
 * XYZ will then be used to perform maths, ignoring W, which will continue to accumulate rubbish values... it doesn't matter, all FP exceptions and such are disabled.

Putting the uint to the front slot would make your life simpler then, only MOVD, no unpacking :).
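The float-XYZ-plus-packed-colour layout described above can be sketched with C/C++ SSE intrinsics (illustrative only; the struct and function names are invented, SSE2 assumed). The colour bits ride along in the fourth lane and are only ever reinterpreted as an integer, never operated on as a float:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// One 16-byte load brings in position and colour together; the W lane is
// never used as a float, so its (possibly denormal) bit pattern is inert.
struct PosColour {
    alignas(16) float data[4];  // x, y, z, then raw RGBA8 bits in data[3]
};

inline __m128 load_pos_colour(const PosColour& pc) {
    return _mm_load_ps(pc.data);
}

// Pull the packed colour back out of lane 3 as an integer, without ever
// treating those bits as a float value.
inline std::uint32_t extract_colour(__m128 v) {
    __m128i bits = _mm_castps_si128(v);   // free bit-for-bit reinterpret
    bits = _mm_srli_si128(bits, 12);      // shift lane 3 down to lane 0
    return (std::uint32_t)_mm_cvtsi128_si32(bits);
}
```

Note that the cast intrinsic costs no instructions; only the shift/extract to unpack the colour is real work, which matches the "copy out and unpack" flow described above.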
Jan 06 2012
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 6 January 2012 22:40, Martin Nowak <dawg dawgfoto.de> wrote:

 On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman gmail.com> wrote:

  On 6 January 2012 20:17, Martin Nowak <dawg dawgfoto.de> wrote:
  There is another benefit.
 Consider the following:

 __vec128 addps(__vec128 a, __vec128 b) pure
 {
   __vec128 res = a;

   if (__ctfe)
   {
       foreach(i; 0 .. 4)
          res[i] += b[i];
   }
   else
   {
   asm (res, b)
       {
           addps res, b;
       }
   }
   return res;

 }

 You don't need to use inline ASM to be able to do this, it will work the
 same with intrinsics. I've detailed numerous problems with using inline
 asm, and complications with extending the inline assembler to support
 this.

 Don't get me wrong here. The idea is to find out whether intrinsics can
 be built with the help of inlineable asm functions. The ctfe support is
 one good reason to go with a library solution.

/agree, this is a nice argument to support putting it in libraries.
 Most compilers can't reschedule code around inline asm blocks. There are a
 lot of reasons for this, google can help you.
 The main reason is that a COMPILER doesn't attempt to understand the
 assembly it's being asked to insert inline. The information that it may
 use

 It doesn't have to understand the assembly. Wrapping these in functions
 creates an IR expression with inputs and outputs. Declaring them as pure
 gives the compiler free hands to apply whatever optimizations it does
 normally on an IR tree: common subexpression elimination, removing dead
 expressions...

These functions shouldn't be functions... if they're not all inlined, then the implementation is broken. Once you inline all these micro asm blocks, 100 small asm blocks inlined in a single function, you're giving the optimiser a very hard time.
 Same problem as above. The compiler would need to understand enough about
 assembly to perform optimisation on the assembly its self to clean this
 up.
 Using intrinsics, all the register allocation, load/store code, etc, is
 all
 in the regular realm of compiling the language, and the code generation
 and
 optimisation will all work as usual.

 There is no informational difference between the intrinsic
 __m128 _mm_add_ps(__m128 a, __m128 b); and an inline assembler version

There is actually. To the compiler, the intrinsic is a normal function, with some hook in the code generator to produce the appropriate opcode when it's performing actual code generation. On most compilers, the inline asm on the other hand is unknown to the compiler; the optimiser can't do much anymore, because it doesn't know what the inline asm has done, and the code generator just goes and pastes your asm code inline where you told it to. It doesn't know if you've written to aliased variables, called functions, etc., so it can no longer safely rearrange code around the inline asm block, which means it's not free to pipeline the code efficiently.

 So the argument here is that intrinsics in D can be mapped more easily to
 existing intrinsics in GCC? I do understand that this will be pretty
 difficult for GDC to implement. Reminds me that Walter has stated several
 times how much better an internal assembler can integrate with the
 language.

Basically yes.
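For comparison, this is roughly what the intrinsic route looks like in C/C++ (a sketch, not code from the thread; `add4` and `twice` are invented names). The compiler sees ordinary functions, so it is free to inline them, eliminate the common subexpression, and schedule the resulting addps instructions:

```cpp
#include <xmmintrin.h>  // SSE

// To the compiler this is just a normal function; the intrinsic lowers to
// a single addps during code generation, so the optimiser can inline it,
// CSE it, and reschedule it freely.
inline __m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

// c and d are identical expressions; with intrinsics the compiler may
// compute the sum once, something it cannot safely do across opaque
// inline-asm blocks.
inline __m128 twice(__m128 a, __m128 b) {
    __m128 c = add4(a, b);
    __m128 d = add4(a, b);
    return _mm_add_ps(c, d);
}
```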
Jan 06 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 1:43 PM, Manu wrote:
 There is actually. To the compiler, the intrinsic is a normal function, with
 some hook in the code generator to produce the appropriate opcode when it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the
compiler,
 the optimiser can't do much anymore, because it doesn't know what the inline
asm
 has done, and the code generator just goes and pastes your asm code inline
where
 you told it to. It doesn't know if you've written to aliased variables, called
 functions, etc.. it can no longer safely rearrange code around the inline asm
 block.. which means it's not free to pipeline the code efficiently.

And, in fact, the compiler should not try to optimize inline assembler. The IA is there so that the programmer can hand tweak things without the compiler defeating his attempts.

For example, suppose the compiler schedules instructions for processor X. The programmer writes inline asm to schedule for Y, because the compiler doesn't specifically support Y. The compiler goes ahead and reschedules it for X. Arggh!

What dmd does do with the inline assembler is it keeps track of which registers are read/written, so that effective register allocation can be done for the non-asm code.
Jan 06 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 7 January 2012 02:06, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:43 PM, Manu wrote:

 There is actually. To the compiler, the intrinsic is a normal function,
 with
 some hook in the code generator to produce the appropriate opcode when
 it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the
 compiler,
 the optimiser can't do much anymore, because it doesn't know what the
 inline asm
 has done, and the code generator just goes and pastes your asm code
 inline where
 you told it to. It doesn't know if you've written to aliased variables,
 called
 functions, etc.. it can no longer safely rearrange code around the inline
 asm
 block.. which means it's not free to pipeline the code efficiently.

 And, in fact, the compiler should not try to optimize inline assembler.
 The IA is there so that the programmer can hand tweak things without the
 compiler defeating his attempts. For example, suppose the compiler
 schedules instructions for processor X. The programmer writes inline asm
 to schedule for Y, because the compiler doesn't specifically support Y.
 The compiler goes ahead and reschedules it for X. Arggh! What dmd does do
 with the inline assembler is it keeps track of which registers are
 read/written, so that effective register allocation can be done for the
 non-asm code.

And I agree this is exactly correct for the IA... and also why intrinsics must be used to do this work, not IA.
Jan 06 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2012 4:15 PM, Manu wrote:
 And I agree this is exactly correct for the IA... and also why intrinsics must
 be used to do this work, not IA.

Yup.
Jan 06 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sat, 07 Jan 2012 01:06:21 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/6/2012 1:43 PM, Manu wrote:
 There is actually. To the compiler, the intrinsic is a normal function,  
 with
 some hook in the code generator to produce the appropriate opcode when  
 it's
 performing actual code generation.
 On most compilers, the inline asm on the other hand, is unknown to the  
 compiler,
 the optimiser can't do much anymore, because it doesn't know what the  
 inline asm
 has done, and the code generator just goes and pastes your asm code  
 inline where
 you told it to. It doesn't know if you've written to aliased variables,  
 called
 functions, etc.. it can no longer safely rearrange code around the  
 inline asm
 block.. which means it's not free to pipeline the code efficiently.

 And, in fact, the compiler should not try to optimize inline assembler.
 The IA is there so that the programmer can hand tweak things without the
 compiler defeating his attempts. For example, suppose the compiler
 schedules instructions for processor X. The programmer writes inline asm
 to schedule for Y, because the compiler doesn't specifically support Y.
 The compiler goes ahead and reschedules it for X. Arggh!

Yes, but that's not what I meant. Consider:

__v128 a = load(1), b = load(2);
__v128 c = add(a, b);
__v128 d = add(a, b);

A valid optimization could be:

__v128 b = load(2);
__v128 a = load(1);
__v128 tmp = add(a, b);
__v128 d = tmp;
__v128 c = tmp;

__v128 load(int v) pure
{
    __v128 res;
    asm (res, v)
    {
        MOVD res, v;
        SHUF res, 0x0000;
    }
    return res;
}

__v128 add(__v128 a, __v128 b) pure
{
    __v128 res = a;
    asm (res, b)
    {
        ADD res, b;
    }
    return res;
}

The compiler might drop evaluation of d and just use the comsub of c. It might also evaluate d before c. The important point is to mark those functions as having no side effects, which can be checked if instructions are classified. Thus the compiler can do all kinds of optimizations at the expression level. After inlining it would look like this:

__v128 b;
asm (b) { MOV b, 2; }
__v128 a;
asm (a) { MOV a, 1; }
__v128 tmp;
asm (a, b, tmp)
{
    MOV tmp, a;
    ADD tmp, b;
}
__v128 c;
asm (c, tmp) { MOV c, tmp; }
__v128 d;
asm (d, tmp) { MOV d, tmp; }

Then it will do the usual register assignment, except that variables must be assigned a register for the asm blocks they are used in. This effectively achieves the same as writing it with intrinsics. It also greatly improves the composition of inline asm.
 What dmd does do with the inline assembler is it keeps track of which  
 registers are read/written, so that effective register allocation can be  
 done for the non-asm code.

Which is why the compiler should be the one to allocate pseudo-registers.
Jan 06 2012
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 01/07/12 04:27, Martin Nowak wrote:
 __v128 add(__v128 a, __v128 b) pure
 {
     __v128 res = a;
     asm (res, b)
     {
         ADD res, b;
     }
     return res;
 }

 This is effectively achieves the same as writing this with intrinsics.
 It also greatly improves the composition of inline asm.

What it also does is allow mixing "ordinary" asm with the SIMD instructions. People will do that, because it's easier this way (less typing), and then the result is practically unportable, because every compiler would now have to fully understand and support that one asm variant.

If you do "__v128 __simd_add(__v128 a, __v128 b)" instead, you don't lose anything; in fact it could be internally implemented with your asm(). But now the "real" asm code is separate from the more generic (and sometimes even portable) simd ops: the compiler does not need to understand asm() to be able to use it. It can still do every optimization as with the raw asm, and possibly more, as it knows exactly what's going on. The explicit pure annotations are not needed. It has more freedom to choose better scheduling, ordering, sometimes instruction selection (if there's more than one alternative) and even various code transformations. Even CTFE works. Consider the case when a lot of your above add()-like functions are inlined into another one, which will be a common pattern: you don't want any false dependencies. (If you do care about exact instruction scheduling you're writing asm, not D, so for that case asm() is a better choice.)

I wrote "__v128 __simd_add(__v128 a, __v128 b)" above, but that was just to keep things simple. What you actually want is "vfloat4 __simd_add(vfloat4 a, vfloat4 b)" etc. I.e. strongly typed. Whether this needs to go into the compiler itself depends on only one thing: whether it can be done efficiently in a library. Efficiently in this case means "zero-cost" or "free". Having different static types (in addition to the untyped __v(64|128|256) ones) gives you not only security (you don't accidentally end up operating on the wrong data/format because you forgot about some version() combination etc), but also allows things like overloading. Then you can write more generic code, which works with all available formats. And eg changing the precision used by some app module involves only changing a few declarations plus data entry/exit points, not modifying every single SIMD instruction. Untyped __v128 only really works for memcpy() type functions; other than that it is mainly useful for conversions and passing data etc, the cases where you don't care about the content in transit.
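The strongly-typed wrapper idea can be sketched in C++ over the raw intrinsic types (a sketch only; the `vfloat4`/`vdouble2` names are invented for illustration). Overloading then selects the right instruction per format:

```cpp
#include <emmintrin.h>  // SSE2

// Thin typed wrappers over the untyped 128-bit register. The wrapper
// costs nothing at runtime but stops you from accidentally mixing
// formats.
struct vfloat4  { __m128  v; };
struct vdouble2 { __m128d v; };

// Overloading picks addps or addpd from the static type, so generic code
// written against operator+ works for both formats unchanged.
inline vfloat4 operator+(vfloat4 a, vfloat4 b) {
    return { _mm_add_ps(a.v, b.v) };
}
inline vdouble2 operator+(vdouble2 a, vdouble2 b) {
    return { _mm_add_pd(a.v, b.v) };
}
```

Changing a module's precision then means changing a few declarations, as argued above, rather than touching every individual SIMD operation.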
 What dmd does do with the inline assembler is it keeps track of which
registers are read/written, so that effective register allocation can be done
for the non-asm code.

 Which is why the compiler should be the one to allocate pseudo-registers.

Yep. artur
Jan 07 2012
prev sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
simdop will need more overloads, e.g. some
instructions need immediate bytes.
z = simdop(SHUFPS, x, y, 0);

How about this:
__v128 simdop(T...)(SIMD op, T args);
Jan 08 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 8/01/12 5:02 PM, Martin Nowak wrote:
 simdop will need more overloads, e.g. some
 instructions need immediate bytes.
 z = simdop(SHUFPS, x, y, 0);

 How about this:
 __v128 simdop(T...)(SIMD op, T args);

These don't make a lot of sense to return as value, e.g.

__v128 a, b;
a = simdop(movhlps, b); // ???

movhlps moves the top 64 bits of b into the bottom 64 bits of a. Can't be done as an expression like this. Would make more sense to just write the instructions like they appear in asm:

simdop(movhlps, a, b);
simdop(addps, a, b);

etc. The difference between this and inline asm would be:

1. Registers are automatically allocated.
2. Loads/stores are inserted when we spill to stack.
3. Instructions can be scheduled and optimised by the compiler.

We could then extend this with user-defined types:

struct float4
{
    union
    {
        __v128 v;
        float[4] for_debugging;
    }
    float4 opBinary(string op:"+")(float4 rhs) forceinline
    {
        __v128 result = v;
        simdop(addps, result, rhs);
        return float4(result);
    }
}

We'd need a strong guarantee of inlining and removal of redundant loads/stores though for this to work well. We'd also need a guarantee that float4's would get the same treatment as __v128 (as it is the only element).
Jan 08 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 8 January 2012 19:56, Peter Alexander <peter.alexander.au gmail.com>wrote:

 These don't make a lot of sense to return as value, e.g.

 __v128 a, b;
 a = simdop(movhlps, b); // ???

 movhlps moves the top 64-bits of b into the bottom 64-bits of a. Can't be
 done as an expression like this.

The conventional way is to write it like this:

r = simdop(movhlps, a, b);

This allows you to chain the functions together, i.e. passing the result on as an arg.
Jan 08 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 08 Jan 2012 18:56:04 +0100, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 On 8/01/12 5:02 PM, Martin Nowak wrote:
 simdop will need more overloads, e.g. some
 instructions need immediate bytes.
 z = simdop(SHUFPS, x, y, 0);

 How about this:
 __v128 simdop(T...)(SIMD op, T args);

 These don't make a lot of sense to return as value, e.g.

 __v128 a, b;
 a = simdop(movhlps, b); // ???

 movhlps moves the top 64 bits of b into the bottom 64 bits of a. Can't be
 done as an expression like this. Would make more sense to just write the
 instructions like they appear in asm:

 simdop(movhlps, a, b);
 simdop(addps, a, b);

 etc.

Yeah, also thought of this. Having a copy as default would require eliminating them again.
 The difference between this and inline asm would be:

 1. Registers are automatically allocated.

See asm pseudo-registers.
 2. Loads/stores are inserted when we spill to stack.

There are sequencing points before and after asm blocks.
 3. Instructions can be scheduled and optimised by the compiler.

Optimization can be done on IR level. Scheduling is done after all code is emitted.
Jan 08 2012
prev sibling next sibling parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
On 06.01.2012 02:42, Manu wrote:
 I like v128, or something like that. I'll use that for the sake of this
 document. I think it is preferable to float4 for a few reasons...

I do not agree at all. That way, the type loses all semantic information. This is not only breaking with C/C++/D philosophy but actually *hides* an essential hardware detail on Intel SSE: an SSE register is 128 bit, but the processor actually cares about the semantics of the content. There are different commands for loading two doubles, four singles or integers to a register. They all load the same 128 bits from memory into the same register. Anyhow, the specs warn about a performance penalty when loading a register as one type and then using it as another. I do not know the internals of the processor, but my understanding is that the CPU splits the floats into mantissa, exponent and sign already at the moment of loading, and has to drop that information when you reinterpret the bit pattern stored in the register.

A type v128 would not provide the necessary information for the compiler to produce the correct mov statements. There definitely must be a float4 and a double2 type to express these semantics. For integers, I am not quite sure. I believe that integer SSE commands can be mixed more freely, so a single 128bit type would be sufficient.

Considering these hardware details of the SSE architecture alone, I fear that portable low-level support for SIMD is very hard to achieve. If you want to offer access to the raw power of each architecture, it might be simpler to have machine-specific language extensions for SIMD and leave the portability to a wrapper library with a common front-end and various back-ends for the different architectures.
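This distinction shows up directly in the C/C++ intrinsic headers, where the single-, double- and integer views of an XMM register are distinct types and reinterpretation must be spelled out explicitly (a sketch, SSE2 assumed; function names invented):

```cpp
#include <emmintrin.h>  // SSE2

// __m128 (4 x float), __m128d (2 x double) and __m128i (integers) are the
// same 128-bit register underneath, but the type system keeps them apart
// so the compiler can emit the right moves (movaps/movapd/movdqa).
inline __m128d sum_as_doubles(__m128d a, __m128d b) {
    return _mm_add_pd(a, b);     // addpd: operates on the double2 view
}

inline __m128 reinterpret_bits(__m128d d) {
    // A cast, not a conversion: zero instructions, it just tells the
    // compiler (and the reader) that the bits will now be used as floats.
    return _mm_castpd_ps(d);
}

inline __m128 convert_values(__m128d d) {
    // An actual value conversion: cvtpd2ps rounds the two doubles to
    // floats and zeroes the upper two lanes.
    return _mm_cvtpd_ps(d);
}
```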
Jan 12 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/12/2012 12:13 PM, Norbert Nemec wrote:
 A type v128 would not provide the necessary information for the compiler to
 produce the correct mov statements.

 There definitely must be a float4 and a double2 type to express these
semantics.
 For integers, I am not quite sure. I believe that integer SSE commands can be
 mixed more so a single 128bit type would be sufficient.

 Considering these hardware details of the SSE architecture alone, I fear that
 portable low-level support for SIMD is very hard to achieve. If you want to
 offer access to the raw power of each architecture, it might be simpler to have
 machine-specific language extensions for SIMD and leave the portability for a
 wrapper library with a common front-end and various back-ends for the different
 architectures.

That's what we're doing for D's SIMD support. Although the syntax will support any vector type, the semantics will constrain it to what works for the target hardware.

Manu has convinced me that to emulate vector types that don't have hardware support is a very bad idea, because then naive users will assume they'll be getting hardware performance, but in reality will have truly execrable performance.

Note that gcc does do the emulation for unsupported ops (like some of the multiplies). Take a gander at the code generated: instead of one instruction, it's a page of them. I think this will be an unwelcome surprise to the performance minded vector programmer.

Note that explicit emulation will be possible, using D's general purpose vector syntax:

a[] = b[] + c[];
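A concrete instance of this kind of emulation (a sketch, not code from the thread): baseline SSE2 has no packed 32-bit integer multiply (pmulld only arrived with SSE4.1), so the operation has to be assembled from pmuludq and shuffles, which is the sort of expansion a compiler performs silently when it emulates an op the hardware lacks:

```cpp
#include <emmintrin.h>  // SSE2 only

// 4 x 32-bit multiply emulated on SSE2: pmuludq produces 64-bit products
// of the even lanes, so two multiplies plus shuffles are needed where
// SSE4.1 would use a single pmulld.
inline __m128i mul32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                   // lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));   // lanes 1 and 3
    // Repack the low 32 bits of each 64-bit product into one vector.
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}
```

Five instructions instead of one, and this is a mild case; other missing ops expand far worse, which is the "page of them" being described.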
Jan 12 2012
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 12/01/12 8:13 PM, Norbert Nemec wrote:
 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

You are right, but don't forget that the same is true for instructions already in the language. For example, (1 << x) is a very slow operation on PPUs (it's micro-coded). It's simply not possible to be portable and achieve maximum performance for all language features, not just vectors. Algorithms must be tuned for specific architectures in version statements. However, you can get a decent baseline by providing the lowest common denominator in functionality. This v128 type (or whatever it will be called) does that.
Jan 12 2012
parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
On 12.01.2012 23:10, Peter Alexander wrote:
 On 12/01/12 8:13 PM, Norbert Nemec wrote:
 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

 You are right, but don't forget that the same is true for instructions
 already in the language. For example, (1 << x) is a very slow operation
 on PPUs (it's micro-coded). It's simply not possible to be portable and
 achieve maximum performance for any language features, not just vectors.
 Algorithms must be tuned for specific architectures in version
 statements. However, you can get a decent baseline by providing the
 lowest common denominator in functionality. This v128 type (or whatever
 it will be called) does that.

Actually, my essential message is: The single v128 is too simplistic for the SSE architecture. You actually need different types because the compiler needs to know what type is stored in any given register to be able to move it around.
Jan 12 2012
parent reply Manu <turkeyman gmail.com> writes:
On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:

 On 12.01.2012 23:10, Peter Alexander wrote:

 On 12/01/12 8:13 PM, Norbert Nemec wrote:

 Considering these hardware details of the SSE architecture alone, I fear
 that portable low-level support for SIMD is very hard to achieve. If you
 want to offer access to the raw power of each architecture, it might be
 simpler to have machine-specific language extensions for SIMD and leave
 the portability for a wrapper library with a common front-end and
 various back-ends for the different architectures.

 You are right, but don't forget that the same is true for instructions
 already in the language. For example, (1 << x) is a very slow operation
 on PPUs (it's micro-coded). It's simply not possible to be portable and
 achieve maximum performance for any language features, not just vectors.
 Algorithms must be tuned for specific architectures in version
 statements. However, you can get a decent baseline by providing the
 lowest common denominator in functionality. This v128 type (or whatever
 it will be called) does that.

 Actually, my essential message is: The single v128 is too simplistic for
 the SSE architecture. You actually need different types because the
 compiler needs to know what type is stored in any given register to be
 able to move it around.

This has already been concluded some days back; the language has a suite of types, just like GCC.
Jan 13 2012
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/13/2012 7:38 AM, Manu wrote:
 On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:


 This has already been concluded some days back; the language has a suite
 of types, just like GCC.

So I would definitely like to help out on the SIMD stuff in some way, as I have a lot of experience using SIMD math to speed up the games I work on. I've got a vectorized set of transcendental functions (currently in the form of MSVC++ intrinsics) for float and double that would be a good start if anyone is interested. Beyond that I just want to help 'make it right' because it's a topic I care a lot about, and is my personal biggest gripe with the language at the moment. I also have experience with VMX, as the two are not exactly the same; it definitely would help to avoid making the code too Intel-centric (though typically the VMX is the more flexible design, as it can do dynamic shuffling based on the contents of the vector registers etc)
Jan 14 2012
parent reply Manu <turkeyman gmail.com> writes:
On 15 January 2012 09:20, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:

 On 1/13/2012 7:38 AM, Manu wrote:

 On 13 January 2012 08:34, Norbert Nemec <Norbert nemec-online.de> wrote:


 This has already been concluded some days back; the language has a suite
 of types, just like GCC.

 So I would definitely like to help out on the SIMD stuff in some way, as
 I have a lot of experience using SIMD math to speed up the games I work
 on. I've got a vectorized set of transcendental functions (currently in
 the form of MSVC++ intrinsics) for float and double that would be a good
 start if anyone is interested. Beyond that I just want to help 'make it
 right' because it's a topic I care a lot about, and is my personal
 biggest gripe with the language at the moment. I also have experience
 with VMX, as the two are not exactly the same; it definitely would help
 to avoid making the code too Intel-centric (though typically the VMX is
 the more flexible design, as it can do dynamic shuffling based on the
 contents of the vector registers etc)

I too have a long history with VMX, CELL SPU, ARM's VFP/NEON, and others (PSP's VFPU, PS2's VU, SH4), and SSE of course, and with writing the efficient libraries that take all this hardware into consideration. We should compare notes, are you on IRC? :)
Jan 15 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 3:02 AM, Manu wrote:
 On 15 January 2012 09:20, Sean Cavanaugh <WorksOnMyMachine gmail.com> wrote:
     I also have experience with VMX; as the two are not exactly the same, it
     definitely would help to avoid making the code too Intel-centric (though
     typically VMX is the more flexible design, as it can do dynamic shuffling
     based on the contents of the vector registers etc.)


 I too have a long history with VMX, CELL SPU, ARM's VFP/NEON, and others
 (PSP's VFPU, PS2's VU, SH4), and SSE of course, and with writing efficient
 libraries that take all hardware into consideration. We should compare
 notes, are you on IRC? :)

A nice vector math library for D that puts us competitive will be a nice addition to Phobos.
Jan 15 2012
parent reply JoeCoder <dnewsgroup2 yage3d.net> writes:
On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library (https://bitbucket.org/dav1d/gl3n) might be something good to build on. It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) C++ library, which emulates GLSL vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.
Jan 15 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.
Jan 15 2012
parent reply suicide <suicide xited.de> writes:
Here is mine:
http://suicide.zoadian.de/ext/math/geometry/vector.d
I haven't tested (not even compiled) it yet. It needs polishing, but I have not much time to work on it atm. But you may use it as you wish ;) Any suggestions/improvements are welcome.

Greetings,
Felix



On 16.01.2012 04:00, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.

Jan 16 2012
parent reply "F i L" <witte2008 gmail.com> writes:
On Monday, 16 January 2012 at 17:57:38 UTC, suicide wrote:
 Here is mine:
 http://suicide.zoadian.de/ext/math/geometry/vector.d
 i haven't tested (not even compiled) it yet. It needs 
 polishing, but i have not much time to work on it atm. But you 
 may use it as you wish ;)
 Any suggestions/improvement is welcome.

 Greetings,
 Felix



 Am 16.01.2012, 04:00 Uhr, schrieb Walter Bright 
 <newshound2 digitalmars.com>:

 On 1/15/2012 6:54 PM, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive 
 will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

I have never used libraries like that, and so it isn't obvious to me what a good one would look like.


Nice start, though it has quite a few issues.

1. for (i; 0 .. D) needs to be: foreach (i; 0 .. D)
2. assert(r != 0) should be done in a contract
3. 'Vector!(D, T)' can be internally used as just 'Vector'
4. instead of making opAdd, opSub, opMul, etc., use opBinary and mixins
5. don't pass vectors as 'ref' unless they are going to be modified
6. for performance, don't pass all values through 'real'

    auto opBinary(string op, U)(U r)
        if (U.sizeof <= T.sizeof && isImplicitlyConvertible!(T, U))
    in { assert(r != 0); }
    body
    {
        Vector nvec = this;
        foreach (i; 0 .. D)
            mixin("nvec.vec[i] " ~ op ~ "= r;");
        return nvec;
    }

    auto opBinary(string op, U)(U r)
        if (U.sizeof > T.sizeof && isImplicitlyConvertible!(U, T))
    in { assert(r != 0); }
    body
    {
        Vector nvec = this;
        foreach (i; 0 .. D)
            mixin("nvec.vec[i] " ~ op ~ "= cast(T) r;");
        return nvec;
    }

    . . . . .

    auto opBinary(string op, V, U)(Vector!(V, U) vec)
        if (U.sizeof <= T.sizeof && isImplicitlyConvertible!(U, T))
    in
    {
        foreach (i; 0 .. V)
            assert(vec.vec[i] != 0);
    }
    body
    {
        Vector nvec = this;
        static if (D <= V)
        {
            foreach (i; 0 .. D)
                mixin("nvec.vec[i] " ~ op ~ "= vec.vec[i];");
        }
        else
        {
            foreach (i; 0 .. V)
                mixin("nvec.vec[i] " ~ op ~ "= vec.vec[i];");
        }
        return nvec;
    }

    // etc...

Something along those lines. Also, make sure you can't create a vector of zero or one length (struct Vector(D, T) if (D >= 2) { ... }). Plus, none of your Vector(D, T) instances will compile because you forgot the '!' mark: Vector!(D, T)
Jan 16 2012
parent reply "F i L" <witte2008 gmail.com> writes:
Whoops! opBinary should be opOpAssign in my examples.
Jan 16 2012
parent "F i L" <witte2008 gmail.com> writes:
On Monday, 16 January 2012 at 19:27:16 UTC, F i L wrote:
 Whoops! opBinary should be opOpAssign in my examples.

wait... no, opBinary was the right one... I'm confusing myself here :D
Jan 16 2012
prev sibling parent reply David <d dav1d.de> writes:
On 16.01.2012 03:54, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into Phobos is a good idea. Why does Phobos, the std. lib, need a vector lib? I haven't seen any other language with something like gl3n in the std. lib. Also, I used my own PEP-8 / C (K&R with spaces) style; it would be a real pain changing this to the Phobos style. One more point is that it's not just a vector lib: it also does matrix and quaternion math, interpolation, and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long: gl3n will have core.simd support soon.
Jan 16 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 1:26 PM, David wrote:
 PS:// I already talked with Manu about this topic, and I don't wait too long,
 gl3n will have core.simd support soon.

Awesome. I was hoping that adding simd support would have an "enabling" effect for a lot of great library improvements!
Jan 16 2012
prev sibling next sibling parent reply Simen Kjærås <simen.kjaras gmail.com> writes:
On Mon, 16 Jan 2012 22:26:55 +0100, David <d dav1d.de> wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a  
 nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib?

To make it all the easier for those who want to make games in D?
 I haven't seen any other language with something like gl3n in the std.  
 lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a  
 real pain changing this to the Phobos style. One more point is, that  
 it's not just a Vector-lib, it also does Matrix-, Quaternion-math,  
 interpolation and implements some other useful mathematical functions  
 (as found in GLSL).
 Of course I am open to a discussion.

IMO, we should have vectors, matrices, quaternions and all those other neat things easily accessible in the language (dual quaternions? Are they used in games?)
 PS:// I already talked with Manu about this topic, and I don't wait too  
 long, gl3n will have core.simd support soon.

Looking forward to it.
Jan 16 2012
next sibling parent reply Danni Coy <danni.coy gmail.com> writes:
 (dual quaternions? Are they
 used in games?)

 yes

Jan 16 2012
parent Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 1/16/2012 7:21 PM, Danni Coy wrote:
     (dual quaternions? Are they
     used in games?)

 yes

While the GPU tends to do this particular step of the work, the answer in general is 'definitely'. One of the most immediate applications of dual quats was to improve the image quality of joints on characters that twist and rotate at the same time (shoulder blades, wrists, etc), at a minimal increase (or in some cases equivalent) computational cost over older methods. http://isg.cs.tcd.ie/projects/DualQuaternions/
Jan 17 2012
prev sibling parent reply David <d dav1d.de> writes:
On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.
 IMO, we should have vectors, matrices, quaternions and all those other
 neat things easily accessible in the language (dual quaternions? Are they
 used in games?)

The question is: what's the aim of Phobos? Including everything? Including the basics, so you can implement nearly everything on top of it (similar to Python's)? Or being as minimalistic as possible? I get the impression Phobos tries to include everything possible; yeah, I am pointing at curl (wtf, a C lib, not even a D implementation; also you need libcurl now to get Phobos compiled (correct me if I am wrong here)). IMHO a std. lib should include just the basics, so you can build on top of it. Including gl3n into Phobos would be a real honor for me, but is the goal of Phobos to include everything and ship a fat lib (lol, reminds me a bit of Boost) with dmd, or even call it the power of D, an overloaded std. lib?
Jan 17 2012
parent reply Simen Kjærås <simen.kjaras gmail.com> writes:
On Tue, 17 Jan 2012 13:52:01 +0100, David <d dav1d.de> wrote:

 On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.

Or any one of a 100 other places, with incompatible implementations? Vectors and matrices are low enough level that people generally won't need to write their own to match *their* exact use case. That makes them prime stdlib material.
Jan 17 2012
parent Danni Coy <danni.coy gmail.com> writes:
+1

On Wed, Jan 18, 2012 at 4:35 AM, Simen Kjærås <simen.kjaras gmail.com> wrote:

 On Tue, 17 Jan 2012 13:52:01 +0100, David <d dav1d.de> wrote:

  On 16.01.2012 23:28, Simen Kjærås wrote:
 To make it all the easier for those who want to make games in D?

Then they can get my lib easily from bitbucket.

Or any one of a 100 other places, with incompatible implementations? Vectors and matrices are low enough level that people generally won't need to write their own to match *their* exact use case. That makes them prime stdlib material.

Jan 17 2012
prev sibling next sibling parent reply JoeCoder <dnewsgroup2 yage3d.net> writes:
On 1/16/2012 4:26 PM, David wrote:
 Why does phobos, the std. lib, need a vector-lib? I haven't seen any
 other language with something like gl3n in the std.

I guess this depends on the goals for phobos. Is it minimal, or batteries included? As for not seeing it in other languages, I don't think there are very many low-level, high-performance languages that take a batteries-included approach to the standard library. I'd argue that a std.math3d would be used just as much, if not more, than std.complex.
Jan 16 2012
parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 22:32, JoeCoder <dnewsgroup2 yage3d.net> wrote:
 On 1/16/2012 4:26 PM, David wrote:
 Why does phobos, the std. lib, need a vector-lib? I haven't seen any
 other language with something like gl3n in the std.

 I guess this depends on the goals for phobos.  Is it minimal, or batteries
 included?  As for not seeing it in other languages, I don't think there's
 very many low-level, high performance languages that take a
 batteries-included approach to the standard library.

 I'd argue that a std.math3d would be used just as much, if not more, than
 std.complex.

Since when did people use std.complex? :~)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
prev sibling parent reply Kiith-Sa <42 theanswer.com> writes:
David wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).
Jan 16 2012
parent reply David <d dav1d.de> writes:
On 17.01.2012 05:31, Kiith-Sa wrote:
 David wrote:

 Am 16.01.2012 03:54, schrieb JoeCoder:
 On 1/15/2012 1:42 PM, Walter Bright wrote:
 A nice vector math library for D that puts us competitive will be a nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitely possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS: I already talked with Manu about this topic, and I won't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).

AABBs are also planned for gl3n.
Jan 17 2012
parent Manu <turkeyman gmail.com> writes:
On 17 January 2012 14:43, David <d dav1d.de> wrote:

  On 17.01.2012 05:31, Kiith-Sa wrote:

  David wrote:
  On 16.01.2012 03:54, JoeCoder wrote:
 On 1/15/2012 1:42 PM, Walter Bright wrote:

 A nice vector math library for D that puts us competitive will be a
 nice
 addition to Phobos.

The gl3n library might be something good to build on: https://bitbucket.org/dav1d/gl3n It looks to be a continuation of the OMG library used by Deadlock, and is similar to the glm (http://glm.g-truc.net) c++ library which emulates glsl vector ops in software. We'd need to ask if it can be re-licensed from MIT to Boost.

Hi, that's definitly possible! But to be honest, I don't think putting gl3n into phobos is a good idea. Why does phobos, the std. lib, need a vector-lib? I haven't seen any other language with something like gl3n in the std. lib. Also I used my own PEP-8, C (K&R with spaces) style, it would be a real pain changing this to the Phobos style. One more point is, that it's not just a Vector-lib, it also does Matrix-, Quaternion-math, interpolation and implements some other useful mathematical functions (as found in GLSL). Of course I am open to a discussion. PS:// I already talked with Manu about this topic, and I don't wait too long, gl3n will have core.simd support soon.

gl3n has a really good API with regards to game development (resembling GLSL helps), although I guess changing to a more Phobos style might be needed for inclusion. I think having it in the standard library would be extremely useful, though: no need to implement it myself then. Typical matrices used in gamedev (4x4 etc.) would be really useful as well (as said before, I'd even like stuff like AABBoxes, but let's go for vectors/matrices/quaternions first).

 AABBs are also planned for gl3n.

Yeah, I probably wouldn't put anything that high-level in a standard library; everyone will want a slightly different flavour. I think linear algebra with vectors, matrices, and quats is about the fair extent of a std lib. That stuff is pretty undebatable, but beyond that it starts getting very subjective or context-specific. Better left for higher-level libraries that may also integrate with renderers/physics systems/etc.
Jan 17 2012
prev sibling parent reply Mehrdad <wfunction hotmail.com> writes:
In case this is at all helpful...
[see attached]
Jan 14 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2012 2:11 AM, Mehrdad wrote:
 In case this is at all helpful...
 [see attached]

Hope you like the new simd compiler stuff.
Jan 14 2012