digitalmars.D - core.simd woes
- "F i L" <witte2008 gmail.com> Aug 04 2012
- Denis Shelomovskij <verylonglogin.reg gmail.com> Aug 05 2012
- Manu <turkeyman gmail.com> Aug 06 2012
- Dmitry Olshansky <dmitry.olsh gmail.com> Oct 01 2012
- Walter Bright <newshound2 digitalmars.com> Oct 14 2012
- "jerro" <a a.com> Aug 06 2012
- "F i L" <witte2008 gmail.com> Aug 06 2012
- "F i L" <witte2008 gmail.com> Aug 06 2012
- Manu <turkeyman gmail.com> Aug 07 2012
- Manu <turkeyman gmail.com> Aug 07 2012
- "jerro" <a a.com> Aug 07 2012
- "F i L" <witte2008 gmail.com> Aug 07 2012
- Jacob Carlborg <doob me.com> Oct 09 2012
- Jacob Carlborg <doob me.com> Oct 09 2012
- "F i L" <witte2008 gmail.com> Aug 07 2012
- "David Nadlinger" <see klickverbot.at> Aug 08 2012
- "F i L" <witte2008 gmail.com> Aug 08 2012
- "F i L" <witte2008 gmail.com> Oct 01 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- "jerro" <a a.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- "F i L" <witte2008 gmail.com> Oct 02 2012
- "jerro" <a a.com> Oct 02 2012
- Manu <turkeyman gmail.com> Oct 02 2012
- "jerro" <a a.com> Oct 02 2012
- "F i L" <witte2008 gmail.com> Oct 02 2012
- "F i L" <witte2008 gmail.com> Oct 02 2012
- "jerro" <a a.com> Oct 02 2012
- "F i L" <witte2008 gmail.com> Oct 02 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 03 2012
- "jerro" <a a.com> Oct 03 2012
- "F i L" <witte2008 gmail.com> Oct 03 2012
- Manu <turkeyman gmail.com> Oct 05 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 05 2012
- Manu <turkeyman gmail.com> Oct 07 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 08 2012
- "F i L" <witte2008 gmail.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 08 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 08 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 08 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 08 2012
- "David Nadlinger" <see klickverbot.at> Oct 08 2012
- "F i L" <witte2008 gmail.com> Oct 08 2012
- "David Nadlinger" <see klickverbot.at> Oct 08 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 08 2012
- Manu <turkeyman gmail.com> Oct 09 2012
- Manu <turkeyman gmail.com> Oct 09 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 09 2012
- "jerro" <a a.com> Oct 09 2012
- "Simen Kjaeraas" <simen.kjaras gmail.com> Oct 09 2012
- "David Nadlinger" <see klickverbot.at> Oct 09 2012
- "David Nadlinger" <see klickverbot.at> Oct 09 2012
- "jerro" <a a.com> Oct 09 2012
- "F i L" <witte2008 gmail.com> Oct 09 2012
- "F i L" <witte2008 gmail.com> Oct 09 2012
- Manu <turkeyman gmail.com> Oct 10 2012
- Manu <turkeyman gmail.com> Oct 10 2012
- "David Nadlinger" <see klickverbot.at> Oct 10 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 10 2012
- Manu <turkeyman gmail.com> Oct 10 2012
- "F i L" <witte2008 gmail.com> Oct 10 2012
- Manu <turkeyman gmail.com> Oct 10 2012
- "David Nadlinger" <see klickverbot.at> Oct 14 2012
- "F i L" <witte2008 gmail.com> Oct 14 2012
- "David Nadlinger" <see klickverbot.at> Oct 14 2012
- "David Nadlinger" <see klickverbot.at> Oct 14 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 14 2012
- Iain Buclaw <ibuclaw ubuntu.com> Oct 14 2012
- "jerro" <a a.com> Oct 14 2012
- Manu <turkeyman gmail.com> Oct 15 2012
- "jerro" <a a.com> Oct 15 2012
- Manu <turkeyman gmail.com> Oct 15 2012
- "jerro" <a a.com> Oct 15 2012
- Manu <turkeyman gmail.com> Oct 15 2012
core.simd vectors are limited in a couple of annoying ways.
First, if I define:
    @property pure nothrow
    {
        auto x(float4 v) { return v.ptr[0]; }
        auto y(float4 v) { return v.ptr[1]; }
        auto z(float4 v) { return v.ptr[2]; }
        auto w(float4 v) { return v.ptr[3]; }

        void x(ref float4 v, float val) { v.ptr[0] = val; }
        void y(ref float4 v, float val) { v.ptr[1] = val; }
        void z(ref float4 v, float val) { v.ptr[2] = val; }
        void w(ref float4 v, float val) { v.ptr[3] = val; }
    }
Then use it like:
    float4 a, b;

    a.x = a.x + b.x;

it's actually somehow faster than directly using:

    a.ptr[0] += b.ptr[0];
However, notice that I can't use '+=' in the first case, because
'x' isn't an lvalue. That's really annoying. Moreover, I can't
assign anything to a vector other than an array of constant
expressions, which means I have to write functions just to assign
vectors in a convenient way.

    float rand = ...;
    float4 vec = [rand, 1, 1, 1]; // ERROR: expected constant
Now, none of this would be an issue at all if I could wrap
core.simd vectors into custom structs... but doing that completely
negates their performance gain (I'm guessing because of boxing?).
It's the difference between a 2-10x speed improvement when using
float4 directly (depending on CPU), and only a few milliseconds of
improvement when wrapping float4 in a struct.
So, it's not my ideal situation, but I wouldn't mind at all
having to use core.simd vector types directly, and moving things
like dot/cross/normalize/etc to external functions, but if that's
the case then I would _really_ like some basic usability features
added to the vector types.
Mono C#'s Mono.Simd.Vector4f, etc, types have these basic
features, and working with them is much nicer than using D's
core.simd vectors.
Aug 04 2012
On 05.08.2012 7:33, F i L wrote:
> ...I'm guessing because of boxing?...

There is no boxing in the D language itself. One should use library solutions for such functionality.

--
Денис В. Шеломовский
Denis V. Shelomovskij
Aug 05 2012
On 5 August 2012 06:33, F i L <witte2008 gmail.com> wrote:
> core.simd vectors are limited in a couple of annoying ways. [...]

I think core.simd is only designed for the lowest level of access to the SIMD hardware. I started writing std.simd some time back; it is mostly finished in a fork, but there are some bugs/missing features in D's SIMD support preventing me from finishing/releasing it (incomplete dmd implementation, missing intrinsics, no SIMD literals, can't do unit testing, etc).

The intention was that std.simd would be a flat C-style API, which would be the lowest level required for practical and portable use. It's almost done, and it should make it a lot easier for people to build their own SIMD libraries on top. It supplies most useful linear algebraic operations, and implements them as efficiently as possible for architectures other than just SSE.
Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

On a side note, your example where you're performing a scalar add within a vector: this is bad, don't ever do this. SSE (i.e., x86) is the most tolerant architecture in this regard, but it's VERY bad SIMD design. You should never perform any component-wise arithmetic when working with SIMD; it's absolutely not portable.
Basically, a good rule of thumb is: if the keyword 'float' appears anywhere that interacts with your SIMD code, you are likely to see worse performance than just using float[4] on most architectures. Better to factor your code to eliminate any scalar work, and make sure 'scalars' are broadcast across all 4 components and continue doing 4d operations.

Instead of:
    @property pure nothrow float x(float4 v) { return v.ptr[0]; }
Better to use:
    @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
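
The rule of thumb above (keep "scalars" broadcast across all four lanes and stay in 4-wide operations) can be sketched with plain core.simd roughly as follows. This is only an illustration under assumptions: `splat` and `xxxx` are hypothetical helper names, not part of core.simd or std.simd, and they are written with element access for clarity rather than speed. A real library would presumably map them to a single shuffle or broadcast instruction.

```d
import core.simd;

// Hypothetical helper: broadcast one float into all four lanes.
float4 splat(float s)
{
    float4 r;
    r.array[0] = s;
    r.array[1] = s;
    r.array[2] = s;
    r.array[3] = s;
    return r;
}

// Hypothetical helper: reference version of what swizzle!"xxxx" is
// described as doing (copy lane 0 into every lane).
float4 xxxx(float4 v)
{
    return splat(v.array[0]);
}

void main()
{
    float4 pos = [1, 2, 3, 4];
    float4 scale = splat(2.0f);   // load once, outside any hot loop

    pos *= scale;                 // one 4-wide multiply, no scalar math
    assert(pos.array == [2.0f, 4.0f, 6.0f, 8.0f]);

    float4 wideX = pos.xxxx;      // UFCS call; the result stays a vector
    assert(wideX.array == [2.0f, 2.0f, 2.0f, 2.0f]);
}
```

The point of the design is that the value of `x` never leaves the vector type, so later code can keep multiplying or adding it against full vectors.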
Aug 06 2012
On 02-Oct-12 06:28, F i L wrote:
> D has a big advantage over C/C++ here because of UFCS, in that we can write external functions that appear no different to encapsulated object methods. That combined with public-aliasing means the end-user only sees our pretty functions, but we're not sacrificing performance at all.

Yeah, but it won't cover operators. If only opBinary could be defined at global scope... I think I've seen an enhancement to that end though. But even then simd types are built-in and operator overloading only works with user-defined types.

--
Dmitry Olshansky
Oct 01 2012
On 10/8/2012 4:52 PM, David Nadlinger wrote:
> With all due respect to Walter, core.simd isn't really "designed" much at all, or at least this isn't visible in its current state – it rather seems like a quick hack to get some basic SIMD code working with DMD (but beware of ICEs).

That is correct. I have little experience with SIMD on x86, and none on other platforms. I'm not in a good position to do a portable and useful design. I was mainly interested in providing a very low level method for a more useful design that could be layered over it.

> Walter, if you are following this thread, do you have any plans for SIMD on non-x86 platforms?

I'm going to leave that up to those who are doing non-x86 platforms for now.
Oct 14 2012
> The intention was that std.simd would be flat C-style api, which would be the lowest level required for practical and portable use.

Since LDC and GDC implement intrinsics with an API different from that used in DMD, there are actually two kinds of portability we need to worry about: portability across different compilers and portability across different architectures. std.simd solves both of those problems, which is great for many use cases (for example when dealing with geometric vectors), but it doesn't help when you want to use architecture-dependent functionality directly. In this case one would want to have an interface as close to the actual instructions as possible but uniform across compilers. I think we should define such an interface as functions and templates in core.simd, so you would have for example:

    float4 unpcklps(float4, float4);
    float4 shufps(int, int, int, int)(float4, float4);

Then each compiler would implement this API in its own way. DMD would use __simd (1), GDC would use GCC builtins and LDC would use LLVM intrinsics and shufflevector. If we don't include something like that in core.simd, many applications will need to implement their own versions of it. Using this would also reduce the amount of code needed to implement std.simd (currently most of std.simd only supports GDC and it's already pretty large). What do you think about adding such an API to core.simd?

(1) Some way to support the rest of the SSE instructions needs to be added to DMD, of course.
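
To make the proposed interface concrete, here is a minimal scalar reference sketch of the two functions named above. It shows only the semantics a compiler-specific version would need to reproduce; it does not use __simd, GCC builtins or LLVM intrinsics. The order of the `shufps` template arguments is an assumption (lane order, with the first argument selecting result lane 0), since the post does not specify it.

```d
import core.simd;

// Scalar reference for SSE UNPCKLPS: interleave the low halves of a and b.
float4 unpcklps(float4 a, float4 b)
{
    float4 r;
    r.array[0] = a.array[0];
    r.array[1] = b.array[0];
    r.array[2] = a.array[1];
    r.array[3] = b.array[1];
    return r;
}

// Scalar reference for SSE SHUFPS: result lanes 0 and 1 are picked from a,
// lanes 2 and 3 from b. Assumption: i0..i3 are given in lane order.
float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 b)
{
    static assert(i0 >= 0 && i0 < 4 && i1 >= 0 && i1 < 4 &&
                  i2 >= 0 && i2 < 4 && i3 >= 0 && i3 < 4,
                  "lane indices must be in 0 .. 4");
    float4 r;
    r.array[0] = a.array[i0];
    r.array[1] = a.array[i1];
    r.array[2] = b.array[i2];
    r.array[3] = b.array[i3];
    return r;
}

void main()
{
    float4 a = [0, 1, 2, 3];
    float4 b = [4, 5, 6, 7];

    float4 u = unpcklps(a, b);
    assert(u.array == [0.0f, 4.0f, 1.0f, 5.0f]);

    float4 s = shufps!(3, 2, 1, 0)(a, b);
    assert(s.array == [3.0f, 2.0f, 5.0f, 4.0f]);
}
```

A reference version like this could also serve as the fallback and as the unit-test oracle for each compiler's intrinsic-backed implementation.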
Aug 06 2012
On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:
> I think core.simd is only designed for the lowest level of access to the SIMD hardware. I started writing std.simd some time back; it is mostly finished in a fork, but there are some bugs/missing features in D's SIMD support preventing me from finishing/releasing it. (incomplete dmd implementation, missing intrinsics, no SIMD literals, can't do unit testing, etc)

Yes, I found, and have been referring to, your std.simd library for a while now. Even with your library only supporting GDC at the moment, it's been a help. Thank you.

> The intention was that std.simd would be flat C-style api, which would be the lowest level required for practical and portable use. It's almost done, and it should make it a lot easier for people to build their own SIMD libraries on top. It supplies most useful linear algebraic operations, and implements them as efficiently as possible for other architectures than just SSE. Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD right now, and I haven't built GDC yet, so I can't do performance comparisons between the two. I really need to get around to setting up GDC, because I've always planned on using that as a "release compiler" for my code.

The problem is, as I mentioned above, that the performance of SIMD gets completely shot when wrapping a float4 in a struct, rather than using float4 directly. There are some places (like matrices) where they do make a big impact, but I'm trying to find the best solution for general code. For instance, my current math library looks like:

    struct Vector4 { float x, y, z, w; ... }
    struct Matrix4 { Vector4 x, y, z, w; ... }

but I was planning on changing over to (something like):

    alias float4 Vector4;
    alias float4[4] Matrix4;

so I could use the types directly and reap the performance gains. I'm currently doing this to both my D code (still in an early state), and our C# code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector types, but Mono's version gives me access to component channels and simple constructors I can use, so for user code (and types like the Matrix above, with internal vectors) it's very convenient and natural. D's simply isn't, and I'm not sure there's any way around it since again, at least with DMD, performance is shot when I put it in a struct.

> On a side note, your example where you're performing a scalar add within a vector; this is bad, don't ever do this. SSE (ie, x86) is the most tolerant architecture in this regard, but it's VERY bad SIMD design. You should never perform any component-wise arithmetic when working with SIMD; It's absolutely not portable. Basically, a good rule of thumb is, if the keyword 'float' appears anywhere that interacts with your SIMD code, you are likely to see worse performance than just using float[4] on most architectures. Better to factor your code to eliminate any scalar work, and make sure 'scalars' are broadcast across all 4 components and continue doing 4d operations.
>
> Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
> Better to use: @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }

Thanks a lot for telling me this, I don't know much about SIMD stuff. You're actually the exact person I wanted to talk to, because you do know a lot about this and I've always respected your opinions.

I'm not opposed to doing something like:

    float4 addX(ref float4 v, float val)
    {
        float4 f = [0, 0, 0, 0]; // zero the other lanes (floats default to NaN in D)
        f.x = val;
        v += f;
        return v;
    }

to do single component scalars, but it's very inconvenient for users to remember to use:

    vec.addX(scalar);

instead of:

    vec.x += scalar;

But that wouldn't be an issue if I could write custom operators for the components that basically did that. But I can't without wrapping float4, which is why I am requesting that these magic types get some basic features like that.

I'm wondering if I should be looking at just using inline ASM and using the ASM SIMD instructions directly. I know basic ASM, but I don't know what the potential pitfalls of doing that are, especially with portability. Is there a reason not to do this (short of complexity)? I'm also wondering why wrapping a core.simd type in a struct completely negates performance... I'm guessing it's because when I return the struct type, the compiler has to think about it as a struct, instead of its "magic" type, and all struct types have a bit more overhead.

On a side note, DMD without SIMD is much faster than C# without SIMD, by a factor of 8x usually on simple vector types (micro-benchmarks), and that's not counting the runtime's startup time either. However, when I use Mono.Simd, both DMD (with core.simd) and C# have similar performance (see below). Math code with Mono C# (with SIMD) actually runs faster on Linux (even without the SGen GC or LLVM codegen) than it does on Windows 8 with MS .NET, which I find to be pretty impressive and encouraging for our future games with Mono on Android (which has been our biggest performance PITA platform so far).

I've noticed some really odd things with core.simd as well, which is another reason I'm thinking of trying inline ASM. I'm not sure what's causing certain compiler optimizations. For instance, given the basic test program, when I do:

    float rand = ...; // user input value

    float4 a, b = [1, 4, -12, 5];

    a.ptr[0] = rand;
    a.ptr[1] = rand + 1;
    a.ptr[2] = rand + 2;
    a.ptr[3] = rand + 3;

    ulong mil;
    StopWatch sw;

    foreach (t; 0 .. testCount)
    {
        sw.start();
        foreach (i; 0 .. 1_000_000)
        {
            a += b;
            b -= a;
        }
        sw.stop();

        mil += sw.peek().msecs;
        sw.reset();
    }

    writeln(a.array, ", ", b.array);
    writeln(cast(double) mil / testCount);

When I run this on my Phenom II X4 920, it completes in ~9ms. For comparison, C# Mono.Simd gets almost identical performance with identical code. However, if I add:

    auto vec4(float x, float y, float z, float w)
    {
        float4 result;

        result.ptr[0] = x;
        result.ptr[1] = y;
        result.ptr[2] = z;
        result.ptr[3] = w;

        return result;
    }

then replace the vector initialization lines:

    float4 a, b = [ ... ];
    a.ptr[0] = rand;
    ...

with ones using my factory function:

    auto a = vec4(rand, rand+1, rand+2, rand+3);
    auto b = vec4(1, 4, -12, 5);

then the program consistently completes in 2.15ms... wtf, right? The printed vector output is identical, and there are no changes to the loop code (a += b, etc); I just change the construction code of the vectors and it runs 4.5x faster. Beats me, but I'll take it. Btw, for comparison, if I use a struct with an internal float4 it runs in ~19ms, and a struct with four floats runs in ~22ms. So you can see my concerns with using core.simd types directly, especially when my Intel Mac gets even better improvements with SIMD code. I haven't done extensive tests on the Intel, but in my original test (the one above, only in C# using Mono.Simd) the results were ~55ms using a struct with an internal float4, and ~5ms using float4 directly.

Anyway, thanks for the feedback.
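
For readers who want to reproduce the "struct with an internal float4" variant mentioned above, a minimal wrapper might look like the sketch below. The type name `Vec4` and its operator set are hypothetical; the post does not show the actual struct that was benchmarked. The static asserts only check that the wrapper adds no size or alignment overhead on the compiler used, and they say nothing about whether the generated code is as fast as using float4 directly.

```d
import core.simd;

// Hypothetical wrapper around the built-in vector type.
struct Vec4
{
    float4 v;

    // Forward the element-wise operators to the underlying SIMD vector.
    Vec4 opBinary(string op)(Vec4 rhs)
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        return Vec4(mixin("v " ~ op ~ " rhs.v"));
    }

    // Same for the compound assignment forms (+=, -=, *=, /=).
    ref Vec4 opOpAssign(string op)(Vec4 rhs)
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        mixin("v " ~ op ~ "= rhs.v;");
        return this;
    }
}

// The wrapper should be layout-compatible with the raw vector.
static assert(Vec4.sizeof == float4.sizeof);
static assert(Vec4.alignof == float4.alignof);

void main()
{
    float4 x = [1, 2, 3, 4];
    float4 y = [1, 4, -12, 5];
    auto a = Vec4(x);
    auto b = Vec4(y);

    a += b;            // element-wise add on the wrapped vector
    Vec4 c = a - b;    // element-wise subtract, returns a new wrapper

    assert(a.v.array == [2.0f, 6.0f, -9.0f, 9.0f]);
    assert(c.v.array == [1.0f, 2.0f, 3.0f, 4.0f]);
}
```

Swapping `Vec4` for `float4` in the timing loop above is then a one-line change, which makes the two variants easy to compare on a given compiler.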
Aug 06 2012
F i L wrote:
> On a side note, DMD without SIMD is much faster than C# without SIMD, by a factor of 8x usually on simple vector types...

Excuse me, this should have said a factor of 4x, not 8x.
Aug 06 2012
On 6 August 2012 22:57, jerro <a a.com> wrote:
>> The intention was that std.simd would be flat C-style api, which would be the lowest level required for practical and portable use.
>
> Since LDC and GDC implement intrinsics with an API different from that used in DMD, there are actually two kinds of portability we need to worry about - portability across different compilers and portability across different architectures. std.simd solves both of those problems, which is great for many use cases (for example when dealing with geometric vectors), but it doesn't help when you want to use architecture dependant functionality directly. In this case one would want to have an interface as close to the actual instructions as possible but uniform across compilers. I think we should define such an interface as functions and templates in core.simd, so you would have for example:
>
> float4 unpcklps(float4, float4);
> float4 shufps(int, int, int, int)(float4, float4);

I can see your reasoning, but I think that should be in core.sse, or core.simd.sse personally. Otherwise you'll end up with VMX, NEON, etc. all blobbed in one huge intrinsic wrapper file.

That said, almost all SIMD opcodes are directly accessible in std.simd. There are relatively few obscure operations that don't have a corresponding function. The unpck/shuf example above, for instance: they both effectively perform a sort of swizzle, and both are accessible through swizzle!(). The swizzle mask is analysed by the template, and it produces the best opcode to match the pattern. Take a look at swizzle, it's bloody complicated to do that the most efficient way on x86. Other architectures are not so much trouble ;)

So while you may argue that it might be simpler to use an opcode intrinsic wrapper directly, the opcode is actually still directly accessible via swizzle and an appropriate swizzle arrangement, which it might also be argued is more readable to the end user, since the result of the opcode is clearly written...

> Then each compiler would implement this API in its own way. DMD would use __simd (1), gdc would use GCC builtins and LDC would use LLVM intrinsics and shufflevector. If we don't include something like that in core.simd, many applications will need to implement their own versions of it. Using this would also reduce the amount of code needed to implement std.simd (currently most of the std.simd only supports GDC and it's already pretty large). What do you think about adding such an API to core.simd?
>
> (1) Some way to support the rest of SSE instructions needs to be added to DMD, of course.

The reason I didn't write the DMD support yet is because it was incomplete, and many opcodes weren't yet accessible, like shuf for instance... and I just wasn't finished. I stopped to wait for DMD to be feature complete.

I'm not opposed to this idea, although I do have a concern that, because there's no __forceinline in D (or macros), adding another layer of abstraction will make maths code REALLY slow in unoptimised builds. Can you suggest a method where these would be treated as C macros, and not produce additional layers of function calls? I'm already unhappy that std.simd produces redundant function calls.

<rant> please please please can haz __forceinline! </rant>
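
As a side note for readers on newer toolchains: D compilers released after this thread support `pragma(inline, true)`, which asks the compiler to inline a function at its call sites. A minimal sketch of a thin wrapper using it follows; the function name `madd` is hypothetical and is not taken from std.simd, and whether the call really disappears in an unoptimised build depends on the compiler and its flags.

```d
import core.simd;

// Hypothetical thin wrapper: multiply-add on full vectors.
// pragma(inline, true) requests inlining at every call site.
pragma(inline, true)
float4 madd(float4 a, float4 b, float4 c)
{
    return a * b + c;
}

void main()
{
    float4 a = [1, 2, 3, 4];
    float4 b = [2, 2, 2, 2];
    float4 c = [1, 1, 1, 1];

    float4 r = madd(a, b, c);
    assert(r.array == [3.0f, 5.0f, 7.0f, 9.0f]);
}
```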
Aug 07 2012
--047d7b10ca1dd38c0804c6a9173d Content-Type: text/plain; charset=UTF-8 On 7 August 2012 04:24, F i L <witte2008 gmail.com> wrote:Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD right now, and I haven't built GDC yet, so I can't do performance comparisons between the two. I really need to get around to setting up GDC, because I've always planned on using that as a "release compiler" for my code. The problem is, as I mentioned above, that performance of SIMD completely get's shot when wrapping a float4 into a struct, rather than using float4 directly. There are some places where (like matrices), where they do make a big impact, but I'm trying to find the best solution for general code. For instance my current math library looks like: struct Vector4 { float x, y, z, w; ... } struct Matrix4 { Vector4 x, y, z, w; ... } but I was planning on changing over to (something like): alias float4 Vector4; alias float4[4] Matrix4; So I could use the types directly and reap the performance gains. I'm currently doing this to both my D code (still in early state), and our C# code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector types, but Mono's version gives me access to component channels and simple constructors I can use, so for user code (and types like the Matrix above, with internal vectors) it's very convenient and natural. D's simply isn't, and I'm not sure there's any ways around it since again, at least with DMD, performance is shot when I put it in a struct.
I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...

Another suggestion I might make is to write DMD intrinsics that mirror the GDC code in std.simd and use that, then I'll sort out any performance problems as soon as I have all the tools I need to finish the module :)
There's nothing inherent in the std.simd API that will produce slower than optimal code when everything is working properly.

>> Better to factor your code to eliminate any scalar work, and make sure 'scalars' are broadcast across all 4 components and continue doing 4d operations.
>>
>> Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
>> Better to use: @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
> Thanks a lot for telling me this, I don't know much about SIMD stuff. You're actually the exact person I wanted to talk to, because you do know a lot about this and I've always respected your opinions.
>
> I'm not opposed to doing something like:
>
>     float4 addX(ref float4 v, float val)
>     {
>         float4 f;
>         f.x = val;
>         v += f;
>         return v;
>     }
>
> to do single component scalars, but it's very inconvenient for users to remember to use:
>
>     vec.addX(scalar);
>
> instead of:
>
>     vec.x += scalar;
And this is precisely what I suggest you don't do. x64-SSE is the only architecture that can reasonably tolerate this (although it's still not the most efficient way). So if portability is important, you need to find another way.

A 'proper' way to do this is something like:

    // This function loads a float into all 4 components.
    // Note: this is a little slow, so factor these float->vector loads
    // outside the hot loops as far as is practical.
    float4 wideScalar = loadScalar(scalar);

    // We can make shorthand for this, like 'vec.xxxx' for instance...
    float4 vecX = getX(vec);

    // All 4 components maintain the same scalar value, so that you can
    // apply them back to non-scalar vectors later.
    vecX += wideScalar;

With this, there are 2 typical uses. One is to scale another vector by your scalar, for instance:

    someOtherVector *= vecX; // perform a scale of a full 4d vector by our 'wide' scalar

The other, less common operation, is that you may want to directly set the scalar to a component of another vector, setting Y to lock something to a height map for instance:

    // Note: it is still important that you have a 'wide' scalar in this case
    // for portability, since different architectures have very different
    // interleave operations.
    someOtherVector = setY(someOtherVector, wideScalar);

Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express what you as a programmer intuitively want as convenient operations.
Most SIMD hardware has absolutely no connection between the FPU and the SIMD unit, resulting in loads and stores to memory, and this in turn introduces another set of performance hazards.
x64 is actually the only architecture that does allow interaction between the FPU and SIMD; however, it's still no less efficient to do it how I describe, and as a bonus, your code will be portable.

> But that wouldn't be an issue if I could write custom operators for the components that basically did that. But I can't without wrapping float, which is why I am requesting these magic types get some basic features like that.
See above.

> I'm wondering if I should be looking at just using inlined ASM and use the ASM SIMD instructions directly. I know basic ASM, but I don't know what the potential pitfalls of doing that are, especially with portability. Is there a reason not to do this (short of complexity)? I'm also wondering why wrapping a core.simd type into a struct completely negates performance.. I'm guessing because when I return the struct type, the compiler has to think about it as a struct, instead of its "magic" type, and all struct types have a bit more overhead.
Inline asm is usually less efficient for large blocks of code: it requires that you hand-tune the opcode sequencing, which is very hard to do, particularly for SSE.
Small inline asm blocks are also usually less efficient, since most compilers can't rearrange other code within the function around the asm block, which leads to poor opcode sequencing.
I recommend avoiding inline asm where performance is desired unless you're confident in writing the ENTIRE function/loop in asm, and hand tuning the opcode sequencing. But that's not portable...

> On a side note, DMD without SIMD is much faster than C# without SIMD, by a factor of 8x usually on simple vector types (micro-benchmarks), and that's not counting the runtime's startup times either. However, when I use Mono.Simd, both DMD (with core.simd) and C# have similar performance (see below). Math code with Mono C# (with SIMD) actually runs faster on Linux (even without the SGen GC or LLVM codegen) than it does on Windows 8 with MS .NET, which I find to be pretty impressive and encouraging for our future games with Mono on Android (which has been our biggest performance PITA platform so far).
Android? But you're benchmarking x64-SSE right? I don't think it's reasonable to expect that performance characteristics for one architecture's SIMD hardware will be any indicator at all of how another architecture may perform.
Also, if you're doing any of the stuff I've been warning against above, NEON will suffer very hard, whereas x64-SSE will mostly shrug it off.

I'm very interested to hear your measurements when you try it out!

> I've noticed some really odd things with core.simd as well, which is another reason I'm thinking of trying inlined ASM. I'm not sure what's causing certain compiler optimizations. For instance, given the basic test program, when I do:
>
>     float rand = ...; // user input value
>
>     float4 a, b = [1, 4, -12, 5];
>
>     a.ptr[0] = rand;
>     a.ptr[1] = rand + 1;
>     a.ptr[2] = rand + 2;
>     a.ptr[3] = rand + 3;
>
>     ulong mil;
>     StopWatch sw;
>
>     foreach (t; 0 .. testCount)
>     {
>         sw.start();
>         foreach (i; 0 .. 1_000_000)
>         {
>             a += b;
>             b -= a;
>         }
>         sw.stop();
>         mil += sw.peek().msecs;
>         sw.reset();
>     }
>
>     writeln(a.array, ", ", b.array);
>     writeln(cast(double) mil / testCount);
>
> When I run this on my Phenom II X4 920, it completes in ~9ms. For comparison, C# Mono.Simd gets almost identical performance with identical code. However, if I add:
>
>     auto vec4(float x, float y, float z, float w)
>     {
>         float4 result;
>
>         result.ptr[0] = x;
>         result.ptr[1] = y;
>         result.ptr[2] = z;
>         result.ptr[3] = w;
>
>         return result;
>     }
>
> then replace the vector initialization lines:
>
>     float4 a, b = [ ... ];
>     a.ptr[0] = rand;
>     ...
>
> with ones using my factory function:
>
>     auto a = vec4(rand, rand+1, rand+2, rand+3);
>     auto b = vec4(1, 4, -12, 5);
>
> Then the program consistently completes in 2.15ms...
>
> wtf right? The printed vector output is identical, and there are no changes to the loop code (a += b, etc), I just change the construction code of the vectors and it runs 4.5x faster. Beats me, but I'll take it. Btw, for comparison, if I use a struct with an internal float4 it runs in ~19ms, and a struct with four floats runs in ~22ms. So you can see my concerns with using core.simd types directly, especially when my Intel Mac gets even better improvements with SIMD code.
> I haven't done extensive tests on the Intel, but in my original test (the one above, only in C# using Mono.Simd) the results were ~55ms using a struct with internal float4, and ~5ms for using float4 directly.
wtf indeed! O_o

Can you paste the disassembly?
There should be no loads or stores in the loop, therefore it should be unaffected... but it obviously is, so the only thing I can imagine that could make a difference in the inner loop like that is a change in alignment. Wrapping a vector in a struct will break the alignment, since, until recently, DMD didn't propagate aligned members outwards to the containing struct (which I think Walter fixed in 2.60??).

I can tell you this though: as soon as DMD's SIMD support is able to do the missing stuff I need to complete std.simd, I shall do that, along with intensive benchmarks where I'll be scrutinising the code-gen very closely.
I expect performance peculiarities like you are seeing will be found and fixed at that time...
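The alignment hypothesis above can be checked directly rather than guessed at. The following is a minimal sketch, assuming a compiler and target where core.simd's float4 is available (for example DMD on x86_64); `Vec4`, `Plain4` and `isAligned` are hypothetical names introduced here for illustration, and the thread reports that compilers of the time did not necessarily satisfy these checks.

```d
import core.simd;

// Hypothetical wrapper, standing in for the "struct with an internal float4".
struct Vec4
{
    float4 data;
}

// Four plain floats for comparison: only float alignment is required.
struct Plain4
{
    float x, y, z, w;
}

// Compile-time checks: the wrapper should inherit the vector's alignment.
static assert(float4.alignof == 16);
static assert(Vec4.alignof == float4.alignof);
static assert(Vec4.sizeof == 16);
static assert(Plain4.alignof == float.alignof);

/// Returns true if `p` is a multiple of `alignment` bytes.
bool isAligned(const(void)* p, size_t alignment) pure nothrow @nogc @trusted
{
    return (cast(size_t) p) % alignment == 0;
}

unittest
{
    Vec4 v; // stack instance; its address should be 16-byte aligned
    assert(isAligned(&v, 16));
}
```

If one of the static asserts fails on a given compiler, that would support the suspicion that the wrapped vector is not being kept aligned; if they all pass, the slowdown would have to come from somewhere else, which is where the disassembly becomes useful.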
Aug 07 2012
> I can see your reasoning, but I think that should be in core.sse, or core.simd.sse personally. Or you'll end up with VMX, NEON, etc all blobbed in one huge intrinsic wrapper file.
I would be okay with core.simd.sse or core.sse.

> That said, almost all simd opcodes are directly accessible in std.simd. There are relatively few obscure operations that don't have a representing function. The unpck/shuf example above for instance, they both effectively perform a sort of swizzle, and both are accessible through swizzle!().
They aren't. Swizzle only takes one argument, so you can't use it to select elements from two vectors. Both unpcklps and shufps take two arguments. Writing a swizzle with two arguments would be much harder.

> The swizzle mask is analysed by the template, and it produces the best opcode to match the pattern. Take a look at swizzle, it's bloody complicated to do that the most efficient way on x86.
Now imagine how complicated it would be to write a swizzle with two vector arguments.

> The reason I didn't write the DMD support yet is because it was incomplete, and many opcodes weren't yet accessible, like shuf for instance... and I just wasn't finished. Stopped to wait for DMD to be feature complete. I'm not opposed to this idea, although I do have a concern that, because there's no __forceinline in D (or macros), adding another layer of abstraction will make maths code REALLY slow in unoptimised builds. Can you suggest a method where these would be treated as C macros, and not produce additional layers of function calls?
Unfortunately I can't, at least not a clean one. Using string mixins would be one way but I think no one wants that kind of API in Druntime or Phobos.

> I'm already unhappy that std.simd produces redundant function calls. <rant> please please please can haz __forceinline! </rant>
I agree that we need that.
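To make the one-argument versus two-argument distinction concrete, here is a minimal sketch of what a two-source shuffle has to express. It assumes core.simd's float4 is available; `shuffle2` is a hypothetical name, not part of std.simd, and the body is a readable lane-by-lane fallback rather than the opcode-selecting template described above.

```d
import core.simd;

/**
 * Hypothetical two-source shuffle. Each index in `mask` picks lane 0..3
 * from `a`, or lane 4..7, meaning lane 0..3 of `b`.
 * This is a scalar fallback that only shows the semantics; it makes no
 * attempt to pick an optimal instruction for the mask.
 */
float4 shuffle2(mask...)(float4 a, float4 b)
    if (mask.length == 4)
{
    float4 result;
    static foreach (i, m; mask)
    {
        static assert(m >= 0 && m < 8, "lane index must be in 0 .. 8");
        // `m` is a compile-time constant, so the source is chosen statically.
        static if (m < 4)
            result.ptr[i] = a.array[m];
        else
            result.ptr[i] = b.array[m - 4];
    }
    return result;
}

unittest
{
    float4 a = [0.0f, 1.0f, 2.0f, 3.0f];
    float4 b = [10.0f, 11.0f, 12.0f, 13.0f];

    // Interleave the low halves of both inputs (the pattern unpcklps produces).
    float4 lo = shuffle2!(0, 4, 1, 5)(a, b);
    assert(lo.array == [0.0f, 10.0f, 1.0f, 11.0f]);
}
```

Because the mask is a template argument, a real implementation could inspect it at compile time and choose among several instructions per pattern, which is presumably where most of the complexity mentioned above would come from.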
Aug 07 2012
Manu wrote:
> I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...
I've tried all combinations with align() before and inside the struct, with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm unaware of, I don't think it's been adjusted to fix any alignment issues with SIMD stuff. It would be great to be able to wrap float4 into a struct, but for now I've come up with an easy and understandable alternative using SIMD types directly.

> Another suggestion I might make is to write DMD intrinsics that mirror the GDC code in std.simd and use that, then I'll sort out any performance problems as soon as I have all the tools I need to finish the module :)
Sounds like a good idea. I'll try and keep my code in line with yours to make transitioning to it easier when it's complete.

> And this is precisely what I suggest you don't do. x64-SSE is the only architecture that can reasonably tolerate this (although it's still not the most efficient way). So if portability is important, you need to find another way.
>
> A 'proper' way to do this is something like:
>
>     // This function loads a float into all 4 components.
>     // Note: this is a little slow, so factor these float->vector loads
>     // outside the hot loops as far as is practical.
>     float4 wideScalar = loadScalar(scalar);
>
>     // We can make shorthand for this, like 'vec.xxxx' for instance...
>     float4 vecX = getX(vec);
>
>     // All 4 components maintain the same scalar value, so that you can
>     // apply them back to non-scalar vectors later.
>     vecX += wideScalar;
>
> With this, there are 2 typical uses. One is to scale another vector by your scalar, for instance:
>
>     someOtherVector *= vecX; // perform a scale of a full 4d vector by our 'wide' scalar
>
> The other, less common operation, is that you may want to directly set the scalar to a component of another vector, setting Y to lock something to a height map for instance:
>
>     // Note: it is still important that you have a 'wide' scalar in this case
>     // for portability, since different architectures have very different
>     // interleave operations.
>     someOtherVector = setY(someOtherVector, wideScalar);
>
> Something like '.x' can never appear in efficient code.
> Sadly, most modern SIMD hardware is simply not able to efficiently express what you as a programmer intuitively want as convenient operations.
> Most SIMD hardware has absolutely no connection between the FPU and the SIMD unit, resulting in loads and stores to memory, and this in turn introduces another set of performance hazards.
> x64 is actually the only architecture that does allow interaction between the FPU and SIMD; however, it's still no less efficient to do it how I describe, and as a bonus, your code will be portable.
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However I'm also a bit confused. At some point, like in your heightmap example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:

    @property @trusted pure nothrow
    {
        auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
        auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
        auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
        auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

        void x(T:float4)(ref T v, float val) { v.ptr[0] = val; }
        void y(T:float4)(ref T v, float val) { v.ptr[1] = val; }
        void z(T:float4)(ref T v, float val) { v.ptr[2] = val; }
        void w(T:float4)(ref T v, float val) { v.ptr[3] = val; }
    }

I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar; // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.

> Inline asm is usually less efficient for large blocks of code: it requires that you hand-tune the opcode sequencing, which is very hard to do, particularly for SSE.
> Small inline asm blocks are also usually less efficient, since most compilers can't rearrange other code within the function around the asm block, which leads to poor opcode sequencing.
> I recommend avoiding inline asm where performance is desired unless you're confident in writing the ENTIRE function/loop in asm, and hand tuning the opcode sequencing. But that's not portable...
Yes, after a bit of messing around with and researching ASM yesterday, I came to the conclusion that it's not a good fit for this. DMD can't inline functions with ASM blocks right now anyway (although LDC can), which would kill any performance gains SSE brings, I'd imagine. Plus, ASM is a pain in the ass. :-)

> Android? But you're benchmarking x64-SSE right? I don't think it's reasonable to expect that performance characteristics for one architecture's SIMD hardware will be any indicator at all of how another architecture may perform.
I only meant that, since Mono C# is what we're using for our game code on any platform besides Windows/WP7/Xbox, and since Android has been really the only performance PITA for our Mono C# code, upgrading our Vector libraries to use Mono.Simd should yield significant improvements there.

I'm just learning about SSE and proper vector utilization. In our last game we actually used Vector3's everywhere :-V , which even we should have known not to, because you have to convert them to float4's anyway to pass them into shader constants... I'm guessing this was our main performance issue with SmartPhones.. ahh, oh well.

> Also, if you're doing any of the stuff I've been warning against above, NEON will suffer very hard, whereas x64-SSE will mostly shrug it off.
>
> I'm very interested to hear your measurements when you try it out!
I'll let you know if changing over to proper Vector code makes huge changes.

> wtf indeed! O_o
>
> Can you paste the disassembly?
I'm not sure how to do that with DMD. I remember GDC has an output-to-asm flag, but not DMD. Or is there an external tool you use to look at .o/.obj files?

> I can tell you this though: as soon as DMD's SIMD support is able to do the missing stuff I need to complete std.simd, I shall do that, along with intensive benchmarks where I'll be scrutinising the code-gen very closely.
> I expect performance peculiarities like you are seeing will be found and fixed at that time...
For now I've come to terms with using core.simd.float4 types directly, and have created acceptable solutions to my original problems. But I'm glad to hear that in the future I'll have more flexibility within my libraries.
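The component accessors described in this post can be sketched in a more compact form. This is a minimal sketch, assuming core.simd's float4 is available; it is not the poster's actual library code. Each accessor returns the lane by reference, which is what lets `vec.x += scalar` compile, so separate setter overloads are not needed in this version. As noted earlier in the thread, per-lane access like this is convenient but is exactly the pattern that is expected to perform poorly on most SIMD hardware.

```d
import core.simd;

// Lane accessors called through UFCS. Returning `ref float` makes
// compound assignments such as `vec.x += s` work on lvalue vectors.
// These read and write a single lane, so keep them out of hot loops.
@property ref float x(ref float4 v) pure nothrow @nogc { return v.ptr[0]; }
@property ref float y(ref float4 v) pure nothrow @nogc { return v.ptr[1]; }
@property ref float z(ref float4 v) pure nothrow @nogc { return v.ptr[2]; }
@property ref float w(ref float4 v) pure nothrow @nogc { return v.ptr[3]; }

unittest
{
    float4 vec = [1.0f, 2.0f, 3.0f, 4.0f];

    vec.x += 0.5f;      // lane 0: 1.0 -> 1.5
    vec.w *= 2.0f;      // lane 3: 4.0 -> 8.0
    float second = vec.y;

    assert(second == 2.0f);
    assert(vec.array == [1.5f, 2.0f, 3.0f, 8.0f]);
}
```

The accessors take the vector by `ref`, so they only apply to lvalues; a temporary such as the result of a function call would need to be stored in a variable first.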
Aug 07 2012
On 2012-10-09 09:50, Manu wrote:
> std.simd already does have a mammoth mess of static if(arch & compiler). The thing about std.simd is that it's designed to be portable, so it doesn't make sense to expose the low-level sse intrinsics directly there. But giving it some thought, it might be nice to produce std.simd.sse and std.simd.vmx, etc for collating the intrinsics used by different compilers, and anyone who is writing sse code explicitly might use std.simd.sse to avoid having to support all the different compilers' intrinsics themselves. This sounds like a reasonable approach, the only question is what all these wrappers will do to the code-gen. I'll need to experiment/prove that out.
An alternative approach is to have one module per architecture or compiler.

--
/Jacob Carlborg
Oct 09 2012
On 2012-10-09 16:52, Simen Kjaeraas wrote:
> Nope, like:
>
>     module std.simd;
>
>     version(Linux64) {
>         public import std.internal.simd_linux64;
>     }
>
> Then all std.internal.simd_* modules have the same public interface, and only the version that fits /your/ platform will be included.
Exactly, what he said. -- /Jacob Carlborg
Oct 09 2012
F i L wrote:
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However I'm also a bit confused. At some point, like in your heightmap example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:

    @property @trusted pure nothrow
    {
      auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
      auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
      auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
      auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

      void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
      void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
      void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
      void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
    }

I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar; // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.
Okay, disregard this. I see you were talking about your function in std.simd (setY), and I'm referring to that for an example of the appropriate vector functions.
Aug 07 2012
On Wednesday, 8 August 2012 at 01:45:52 UTC, F i L wrote:
I'm not sure how to do that with DMD. I remember GDC has an output-to-asm flag, but not DMD. Or is there an external tool you use to look at .o/.obj files?
objdump, otool – depending on your OS. David
Aug 08 2012
David Nadlinger wrote:
objdump, otool – depending on your OS.
Hey, nice tools. Good to know, thanks!

Manu:

Here's the disassembly for my benchmark code earlier, isolated between StopWatch .start()/.stop()

https://gist.github.com/3294283

Also, I noticed your std.simd.setY() function uses the _blendps() op, but DMD's core.simd doesn't support this op (yet? It's there but commented out). Is there an alternative operation I can use for setY()?
Aug 08 2012
Not to resurrect the dead, I just wanted to share an article I came across concerning SIMD with Manu..

http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

QUOTE:

1. Returning results by value

By observing the intrinsics interface a vector library must imitate that interface to maximize performance. Therefore, you must return the results by value and not by reference, as such:

    //correct
    inline Vec4 VAdd(Vec4 va, Vec4 vb)
    {
        return(_mm_add_ps(va, vb));
    };

On the other hand if the data is returned by reference the interface will generate code bloat. The incorrect version below:

    //incorrect (code bloat!)
    inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
    {
        vr = _mm_add_ps(va, vb);
    };

The reason you must return data by value is because the quad-word (128-bit) fits nicely inside one SIMD register. And one of the key factors of a vector library is to keep the data inside these registers as much as possible. By doing that, you avoid unnecessary load and store operations from SIMD registers to memory or FPU registers. When combining multiple vector operations the "returned by value" interface allows the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers.

2. Data Declared "Purely"

Here, "pure data" is defined as data declared outside a "class" or "struct" by a simple "typedef" or "define". When I was researching various vector libraries before coding VMath, I observed one common pattern among all libraries I looked at during that time. In all cases, developers wrapped the basic quad-word type inside a "class" or "struct" instead of declaring it purely, as follows:

    class Vec4
    {
        ...
    private:
        __m128 xyzw;
    };

This type of data encapsulation is a common practice among C++ developers to make the architecture of the software robust. The data is protected and can be accessed only by the class interface functions. Nonetheless, this design causes code bloat by many different compilers in different platforms, especially if some sort of GCC port is being used.

An approach that is much friendlier to the compiler is to declare the vector data "purely", as follows:

    typedef __m128 Vec4;

ENDQUOTE;

The article is 2 years old, but it appears my earlier performance issue wasn't D related at all, but an issue with C as well. I think in this situation, it might be best (most optimized) to handle simd "the C way" by creating an alias or union of a simd intrinsic. D has a big advantage over C/C++ here because of UFCS, in that we can write external functions that appear no different to encapsulated object methods. That combined with public-aliasing means the end-user only sees our pretty functions, but we're not sacrificing performance at all.
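A minimal sketch of the alias-plus-UFCS idea described above, assuming a compiler and target where core.simd provides float4 (for example DMD on x86_64). The names Vec4, add and mul are hypothetical and only for illustration; they are not part of core.simd or std.simd:

```d
import core.simd : float4;

// "Pure" declaration: Vec4 is only another name for the built-in vector
// type, so no wrapper struct sits between the code and the SIMD register.
alias Vec4 = float4;

// Free functions that take and return vectors by value.
Vec4 add(Vec4 a, Vec4 b) pure nothrow @nogc { return a + b; }
Vec4 mul(Vec4 a, Vec4 b) pure nothrow @nogc { return a * b; }

unittest
{
    Vec4 a = [1f, 2f, 3f, 4f];
    Vec4 b = [10f, 20f, 30f, 40f];

    // UFCS lets the free functions read like methods: (a + b) * a
    Vec4 c = a.add(b).mul(a);
    assert(c.array == [11f, 44f, 99f, 176f]);
}
```

Built with -unittest -main, the unittest block exercises the chained calls. Whether the generated code really stays in registers depends on the compiler and optimisation flags, so the disassembly is still worth checking.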
Oct 01 2012
On 7 August 2012 16:56, jerro <a a.com> wrote:
That said, almost all simd opcodes are directly accessible in std.simd. There are relatively few obscure operations that don't have a representing function. The unpck/shuf example above for instance, they both effectively perform a sort of swizzle, and both are accessible through swizzle!().
They aren't. Swizzle only takes one argument, so you can't use it to select elements from two vectors. Both unpcklps and shufps take two arguments. Writing a swizzle with two arguments would be much harder.
Any usages I've missed/haven't thought of; I'm all ears.
The swizzle mask is analysed by the template, and it produces the best opcode to match the pattern. Take a look at swizzle, it's bloody complicated to do that the most efficient way on x86.
Now imagine how complicated it would be to write a swizzle with two vector arguments.
I can imagine, I'll have a go at it... it's something I considered, but not all architectures can do it efficiently. That said, a most-efficient implementation would probably still be useful on all architectures, but for cross platform code, I usually prefer to encourage people taking another approach rather than supply a function that is not particularly portable (or not efficient when ported).
The reason I didn't write the DMD support yet is because it was incomplete, and many opcodes weren't yet accessible, like shuf for instance... and I just wasn't finished. Stopped to wait for DMD to be feature complete. I'm not opposed to this idea, although I do have a concern that, because there's no __forceinline in D (or macros), adding another layer of abstraction will make maths code REALLY slow in unoptimised builds. Can you suggest a method where these would be treated as C macros, and not produce additional layers of function calls?
Unfortunately I can't, at least not a clean one. Using string mixins would be one way but I think no one wants that kind of API in Druntime or Phobos.
Yeah, absolutely not. This is possibly the most compelling motivation behind a __forceinline mechanism that I've seen come up... ;)
I'm already unhappy that std.simd produces redundant function calls. <rant> please please please can haz __forceinline! </rant>
I agree that we need that.
Huzzah! :)
Oct 02 2012
On 8 August 2012 04:45, F i L <witte2008 gmail.com> wrote:
Manu wrote:
I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...
I've tried all combinations with align() before and inside the struct, with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm unaware of, I don't think it's been adjusted to fix any alignment issues with SIMD stuff. It would be great to be able to wrap float4 into a struct, but for now I've come up with an easy and understandable alternative using SIMD types directly.
I actually haven't had time to try out the new 2.060 alignment changes in practice yet. As a Win64 D user, I'm stuck with compilers that are forever 3-6 months out of date (2.058). >_<
The use cases I required to do this stuff efficiently were definitely agreed to by Walter, and to my knowledge, implemented... so it might be some other subtle details. It's possible that the intrinsic vector code-gen is hard-coded to use unaligned loads too. You might need to assert appropriate alignment, and then issue the movaps intrinsics directly, but I'm sure DMD can be fixed to emit movaps when it detects the vector is aligned >= 16 bytes.
[*clip* portability and cross-lane efficiency *clip*]
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However I'm also a bit confused. At some point, like in your heightmap example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?
Well, actually, height maps are one thing that hardware SIMD units aren't intrinsically good at, because they do specifically require component-wise access. That said, there are still lots of interesting possibilities.

If you're operating on a height map, for instance, rather than looping over the position vectors, fetching y from it, doing something with y (which I presume involves foreign data?), then putting it back, and repeating over the next vector...

Do something like:

  align(16) float[as-many-as-there-are-verts] height_offsets;
  foreach(ref h; height_offsets)
  {
    // do some work to generate deltas for each vertex...
  }

Now what you can probably do is unpack them and apply them directly to the vertex stream in a single pass:

  for(i = 0; i < numVerts; i += 4)
  {
    four_heights = loadaps(&height_offsets[i]);

    float4[4] heights;
    // do some shuffling/unpacking to result: four_heights.xyzw -> heights[0].y, heights[1].y, heights[2].y, heights[3].y (fiddly, but simple enough)

    vertices[i + 0] += heights[0];
    vertices[i + 1] += heights[1];
    vertices[i + 2] += heights[2];
    vertices[i + 3] += heights[3];
  }

... I'm sure that could be improved, but you get the idea..? This approach should pipeline well, have reasonably low bandwidth, make good use of registers, and you can see there is no interaction between the FPU and SIMD unit.

Disclaimer: I just made that up off the top of my head ;), but that illustrates the kind of approach you would usually take in efficient (and portable) SIMD code.
If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:

    @property @trusted pure nothrow
    {
      auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
      auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
      auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
      auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

      void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
      void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
      void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
      void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
    }

This is fine if your vectors are in memory to begin with. But if you're already doing work on them, and they are in registers/local variables, this is the worst thing you can do. It's a generally bad practice, and someone who isn't careful with their usage will produce very slow code (on some platforms).
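On the alignment question raised earlier in this post, one way to see what a given compiler does is to ask it directly. The sketch below assumes a reasonably recent compiler where core.simd.float4 is available and where a vector member propagates its alignment to the enclosing struct; Particle and isAligned16 are hypothetical names used only for illustration:

```d
import core.simd : float4;

// Hypothetical wrapper struct: a scalar field followed by a vector field.
struct Particle
{
    float  mass;
    float4 position;
}

// Compile-time checks: these fail the build if the compiler does not
// propagate the 16-byte alignment of float4 to the containing struct.
static assert(float4.alignof == 16);
static assert(Particle.alignof == 16);
static assert(Particle.position.offsetof % 16 == 0);

// Run-time check of an actual address.
bool isAligned16(const(void)* p) pure nothrow @nogc
{
    return (cast(size_t) p & 15) == 0;
}

unittest
{
    auto ps = new Particle[3];
    foreach (ref p; ps)
        assert(isAligned16(&p.position));
}
```

If the static asserts pass but a benchmark is still slow, the alignment itself is probably not the culprit and the disassembly is the next place to look.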
Oct 02 2012
On 8 August 2012 07:54, F i L <witte2008 gmail.com> wrote:
F i L wrote:
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However I'm also a bit confused. At some point, like in your heightmap example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:

    @property @trusted pure nothrow
    {
      auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
      auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
      auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
      auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

      void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
      void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
      void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
      void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
    }

I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar; // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.
Okay, disregard this. I see you were talking about your function in std.simd (setY), and I'm referring to that for an example of the appropriate vector functions.
>_<
Oct 02 2012
On 8 August 2012 14:14, F i L <witte2008 gmail.com> wrote:
David Nadlinger wrote:
objdump, otool – depending on your OS.
Hey, nice tools. Good to know, thanks!

Manu:

Here's the disassembly for my benchmark code earlier, isolated between StopWatch .start()/.stop()

https://gist.github.com/3294283

Also, I noticed your std.simd.setY() function uses the _blendps() op, but DMD's core.simd doesn't support this op (yet? It's there but commented out). Is there an alternative operation I can use for setY()?
I haven't considered/written an SSE2 fallback yet, but I expect some trick using shuf and/or shifts to blend the 2 vectors together will do it.
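One possible fallback, sketched here with plain bitwise masks rather than the shuf/shift trick mentioned above, is to AND each input with a lane mask and OR the results together. This is only an illustration and not how std.simd implements setY: setYMasked is a hypothetical name, and the sketch assumes core.simd provides float4 and int4 with bitwise operators on the target:

```d
import core.simd : float4, int4;

/// Returns `v` with its y lane (index 1) replaced by the y lane of `src`.
/// Hypothetical helper: blends two vectors using only AND/OR on the raw bits.
float4 setYMasked(float4 v, float4 src) pure nothrow @nogc
{
    int4 takeSrc = [0, -1, 0, 0];    // all bits set in the lane taken from src
    int4 keepV   = [-1, 0, -1, -1];  // all bits set in the lanes kept from v

    // Casts between same-sized vector types reinterpret the bits,
    // so the float payloads pass through the integer ops unchanged.
    int4 bits = (cast(int4) v & keepV) | (cast(int4) src & takeSrc);
    return cast(float4) bits;
}

unittest
{
    float4 v   = [1f, 2f, 3f, 4f];
    float4 src = [10f, 20f, 30f, 40f];
    assert(setYMasked(v, src).array == [1f, 20f, 3f, 4f]);
}
```

Both inputs and the result stay vector values throughout, so there is no per-component pointer access of the kind warned against earlier in the thread. The instructions actually emitted are up to the compiler.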
Oct 02 2012
--20cf307f3a9e770c0f04cb10d52c Content-Type: text/plain; charset=UTF-8 On 2 October 2012 05:28, F i L <witte2008 gmail.com> wrote:Not to resurrect the dead, I just wanted to share an article I came across concerning SIMD with Manu.. http://www.gamasutra.com/view/**feature/4248/designing_fast_** crossplatform_simd_.php<http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php> QUOTE: 1. Returning results by value By observing the intrisics interface a vector library must imitate that interface to maximize performance. Therefore, you must return the results by value and not by reference, as such: //correct inline Vec4 VAdd(Vec4 va, Vec4 vb) { return(_mm_add_ps(va, vb)); }; On the other hand if the data is returned by reference the interface will generate code bloat. The incorrect version below: //incorrect (code bloat!) inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb) { vr = _mm_add_ps(va, vb); }; The reason you must return data by value is because the quad-word (128-bit) fits nicely inside one SIMD register. And one of the key factors of a vector library is to keep the data inside these registers as much as possible. By doing that, you avoid unnecessary loads and stores operations from SIMD registers to memory or FPU registers. When combining multiple vector operations the "returned by value" interface allows the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers. 2. Data Declared "Purely" Here, "pure data" is defined as data declared outside a "class" or "struct" by a simple "typedef" or "define". When I was researching various vector libraries before coding VMath, I observed one common pattern among all libraries I looked at during that time. In all cases, developers wrapped the basic quad-word type inside a "class" or "struct" instead of declaring it purely, as follows: class Vec4 { ... 
private: __m128 xyzw; }; This type of data encapsulation is a common practice among C++ developers to make the architecture of the software robust. The data is protected and can be accessed only by the class interface functions. Nonetheless, this design causes code bloat by many different compilers in different platforms, especially if some sort of GCC port is being used. An approach that is much friendlier to the compiler is to declare the vector data "purely", as follows: typedef __m128 Vec4; ENDQUOTE; The article is 2 years old, but It appears my earlier performance issue wasn't D related at all, but an issue with C as well. I think in this situation, it might be best (most optimized) to handle simd "the C way" by creating and alias or union of a simd intrinsic. D has a big advantage over C/C++ here because of UFCS, in that we can write external functions that appear no different to encapsulated object methods. That combined with public-aliasing means the end-user only sees our pretty functions, but we're not sacrificing performance at all.
These are indeed common gotchas. But they don't necessarily apply to D, and if they do, then they should be bugged and hopefully addressed. There is no reason that D needs to follow these typical performance patterns from C. It's worth noting that not all C compilers suffer from this problem. There are many (most actually) compilers that can recognise a struct with a single member and treat it as if it were an instance of that member directly when being passed by value. It only tends to be a problem on older games-console compilers. As I said earlier, when I get back to finishing std.simd off (I presume this will be some time after Walter has finished Win64 support), I'll go through and scrutinise the code-gen for the API very thoroughly. We'll see what that reveals. But I don't think there's any reason we should suffer the same legacy C by-value code-gen problems in D... (hopefully I won't eat those words ;)
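For readers who want to see the "alias plus UFCS" idea from the post above in concrete form, here is a minimal sketch. It assumes an x86-64 target and a compiler where core.simd provides float4; the names Vec4, vadd and vmul are made up for illustration and are not part of std.simd or any other library.

```d
import core.simd;

// Hypothetical names for illustration only.
// The vector type is declared "purely": an alias, not a wrapper struct.
alias Vec4 = float4;

// Free functions that take and return the vector by value.
Vec4 vadd(Vec4 a, Vec4 b) { return a + b; }
Vec4 vmul(Vec4 a, Vec4 b) { return a * b; }

void main()
{
    Vec4 a = 1.0f; // scalar is broadcast to all four lanes
    Vec4 b = 2.0f;

    // UFCS lets the free functions read like methods.
    Vec4 r = a.vadd(b).vmul(b);

    // Assign to a named variable before reading .array.
    assert(r.array == [6.0f, 6.0f, 6.0f, 6.0f]);
}
```

Whether this generates better code than a single-member struct depends on the compiler, as the reply above points out, so the generated assembly is worth checking before settling on either form.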
Oct 02 2012
On Tuesday, 2 October 2012 at 08:17:33 UTC, Manu wrote: On 7 August 2012 16:56, jerro <a a.com> wrote: That said, almost all simd opcodes are directly accessible in std.simd. There are relatively few obscure operations that don't have a representing function. The unpck/shuf example above for instance, they both effectively perform a sort of swizzle, and both are accessible through swizzle!().
They aren't. Swizzle only takes one argument, so you can't use it to select elements from two vectors. Both unpcklps and shufps take two arguments. Writing a swizzle with two arguments would be much harder.
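To make the one-input versus two-input distinction concrete, here is a scalar model of what the two instructions compute. This is only a sketch of their semantics using plain float[4] arrays, so it runs on any D compiler but emits no SIMD instructions; shufpsModel and unpcklpsModel are hypothetical names, not std.simd functions.

```d
alias V4 = float[4];

// Model of shufps: the low two result lanes are picked from a,
// the high two from b, each by a 2-bit field of the immediate.
V4 shufpsModel(ubyte imm)(V4 a, V4 b)
{
    return [a[imm & 3], a[(imm >> 2) & 3],
            b[(imm >> 4) & 3], b[(imm >> 6) & 3]];
}

// Model of unpcklps: interleave the low halves of a and b.
V4 unpcklpsModel(V4 a, V4 b)
{
    return [a[0], b[0], a[1], b[1]];
}

void main()
{
    V4 a = [0, 1, 2, 3];
    V4 b = [4, 5, 6, 7];

    // 0x44 selects a[0], a[1], b[0], b[1].
    assert(shufpsModel!0x44(a, b) == [0f, 1f, 4f, 5f]);
    assert(unpcklpsModel(a, b) == [0f, 4f, 1f, 5f]);
}
```

Neither result can be produced by a function that only sees one of the two input vectors, which is the limitation being described.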
Any usages I've missed/haven't thought of; I'm all ears.
I don't think it is possible to think of all usages of this, but for every simd instruction there are valid usages. At least for writing pfft, I found shuffling two vectors very useful. For example, I needed a function that takes a small, square, power of two number of elements stored in vectors and bit-reverses them - it rearranges them so that you can calculate the new index of each element by reversing bits of the old index (for 16 elements using 4 element vectors this can actually be done using std.simd.transpose, but for AVX it was more efficient to make this function work on 64 elements). There are other places in pfft where I need to select elements from two vectors (for example, here https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 is the platform specific code for AVX). I don't think these are the kind of things that should be implemented in std.simd. If you wanted to implement all such operations (for example bit reversing a small array) that somebody may find useful at some time, std.simd would need to be huge, and most of it would never be used.
I can imagine, I'll have a go at it... it's something I considered, but not all architectures can do it efficiently. That said, a most-efficient implementation would probably still be useful on all architectures, but for cross platform code, I usually prefer to encourage people taking another approach rather than supply a function that is not particularly portable (or not efficient when ported).
One way to do it would be to do the following for every set of selected indices: go through all the two element one instruction operations, and check if any of them does exactly what you need, and use it if it does. Otherwise do something that will always work although it may not always be optimal. One option would be to use swizzle on both vectors to get each of the elements to their final index and then blend the two vectors together. For sse 1, 2 and 3 you would need to use xorps to blend them, so I guess this is one more place where you would need vector literals. Someone who knows which two element shuffling operations the platform supports could still write optimal platform specific (but portable across compilers) code this way and for others this would still be useful to some degree (the documentation should mention that it may not be very efficient, though). But I think that it would be better to have platform specific APIs for platform specific code, as I said earlier in this thread.
Unfortunately I can't, at least not a clean one. Using string mixins would be one way but I think no one wants that kind of API in Druntime or Phobos.
Yeah, absolutely not. This is possibly the most compelling motivation behind a __forceinline mechanism that I've seen come up... ;) I'm already unhappy that std.simd produces redundant function calls. <rant> please please please can haz __forceinline! </rant>
I agree that we need that.
Huzzah! :)
Walter opposes this, right? I wonder how we could convince him. There's one more thing that I wanted to ask you. If I were to add LDC support to std.simd, should I just add version(LDC) blocks to all the functions? Sounds like a lot of duplicated code...
Oct 02 2012
On 2 October 2012 13:49, jerro <a a.com> wrote: I don't think it is possible to think of all usages of this, but for every simd instruction there are valid usages. At least for writing pfft, I found shuffling two vectors very useful. For example, I needed a function that takes a small, square, power of two number of elements stored in vectors and bit-reverses them - it rearranges them so that you can calculate the new index of each element by reversing bits of the old index (for 16 elements using 4 element vectors this can actually be done using std.simd.transpose, but for AVX it was more efficient to make this function work on 64 elements). There are other places in pfft where I need to select elements from two vectors (for example, here https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 is the platform specific code for AVX). I don't think these are the kind of things that should be implemented in std.simd. If you wanted to implement all such operations (for example bit reversing a small array) that somebody may find useful at some time, std.simd would need to be huge, and most of it would never be used.
I was referring purely to your 2-vector swizzle idea (or useful high-level ideas in general). Not to hyper-context-specific functions :P
I can imagine, I'll have a go at it... it's something I considered, but not all architectures can do it efficiently. That said, a most-efficient implementation would probably still be useful on all architectures, but for cross platform code, I usually prefer to encourage people taking another approach rather than supply a function that is not particularly portable (or not efficient when ported).
One way to do it would be to do the following for every set of selected indices: go through all the two element one instruction operations, and check if any of them does exactly what you need, and use it if it does. Otherwise do something that will always work although it may not always be optimal. One option would be to use swizzle on both vectors to get each of the elements to their final index and then blend the two vectors together. For sse 1, 2 and 3 you would need to use xorps to blend them, so I guess this is one more place where you would need vector literals. Someone who knows which two element shuffling operations the platform supports could still write optimal platform specific (but portable across compilers) code this way and for others this would still be useful to some degree (the documentation should mention that it may not be very efficient, though). But I think that it would be better to have platform specific APIs for platform specific code, as I said earlier in this thread.
Yeah, I have some ideas. Some permutations are obvious, the worst-case fallback is also obvious, but there are a lot of semi-efficient in-between cases which could take a while to identify and test. It'll be a massive block of static-if code to be sure ;)
Unfortunately I can't, at least not a clean one. Using string mixins would be one way but I think no one wants that kind of API in Druntime or Phobos.
Yeah, absolutely not. This is possibly the most compelling motivation behind a __forceinline mechanism that I've seen come up... ;) I'm already unhappy that std.simd produces redundant function calls. <rant> please please please can haz __forceinline! </rant>
Walter opposes this, right? I wonder how we could convince him.
I just don't think he's seen solid undeniable cases where it's necessary.
There's one more thing that I wanted to ask you. If I were to add LDC support to std.simd, should I just add version(LDC) blocks to all the functions? Sounds like a lot of duplicated code...
Go for it. And yeah, just add another version(). I don't think it can be done without blatant duplication. Certainly not without __forceinline anyway, and even then I'd be apprehensive to trust the code-gen of intrinsics wrapped in inline wrappers. That file will most likely become a nightmarish bloated mess... but that's the point of libraries ;) .. It's best all that horrible munge-ing for different architectures/compilers is put in one place and tested thoroughly, than to not provide it and allow an infinite variety of different implementations to appear. What we may want to do in the future is to split the different compilers/architectures into readable sub-modules, and public include the appropriate one based on version logic from std.simd... but I wouldn't want to do that until the API has stabilised.
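As a rough picture of what "just add another version()" implies, here is a sketch of one function with a branch per compiler. The function name add4 is hypothetical, and every branch here uses the same portable array operation as a stand-in; in real code each branch would hold that compiler's own intrinsics, which is where the duplication comes from. The version identifiers LDC, GNU and DigitalMars are the standard predefined ones.

```d
// Hypothetical example function, not taken from std.simd.
float[4] add4(float[4] a, float[4] b)
{
    float[4] r;
    version (LDC)
    {
        // LDC-specific intrinsics would go here.
        r[] = a[] + b[];
    }
    else version (GNU)
    {
        // GDC builtins would go here.
        r[] = a[] + b[];
    }
    else version (DigitalMars)
    {
        // DMD core.simd code would go here.
        r[] = a[] + b[];
    }
    else
    {
        // Scalar fallback for any other compiler.
        foreach (i; 0 .. 4)
            r[i] = a[i] + b[i];
    }
    return r;
}

void main()
{
    assert(add4([1f, 2f, 3f, 4f], [10f, 20f, 30f, 40f])
           == [11f, 22f, 33f, 44f]);
}
```

Multiply this by every function in the module and the "nightmarish bloated mess" remark becomes easy to believe, which is also the argument for later splitting the branches into per-compiler sub-modules.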
Oct 02 2012
Manu wrote: These are indeed common gotchas. But they don't necessarily apply to D, and if they do, then they should be bugged and hopefully addressed. There is no reason that D needs to follow these typical performance patterns from C. It's worth noting that not all C compilers suffer from this problem. There are many (most actually) compilers that can recognise a struct with a single member and treat it as if it were an instance of that member directly when being passed by value. It only tends to be a problem on older games-console compilers. As I said earlier, when I get back to finishing std.simd off (I presume this will be some time after Walter has finished Win64 support), I'll go through and scrutinise the code-gen for the API very thoroughly. We'll see what that reveals. But I don't think there's any reason we should suffer the same legacy C by-value code-gen problems in D... (hopefully I won't eat those words ;)
Thanks for the insight (and the code examples, though I've been researching SIMD best-practice in C recently). It's good to know that D should (hopefully) be able to avoid these pitfalls. On a side note, I'm not sure how easy LLVM is to build on Windows (I think I built it once a long time ago), but recent performance comparisons between DMD, LDC, and GDC show that LDC (with LLVM 3.1 auto-vectorization and not using GCC -ffast-math) actually produces on-par-or-faster binary compared to GDC, at least in my code on Linux64. SIMD in LDC is currently broken, but you might consider using that if you're having trouble keeping a D release compiler up-to-date.
Oct 02 2012
On Tuesday, 2 October 2012 at 13:36:37 UTC, Manu wrote: On 2 October 2012 13:49, jerro <a a.com> wrote: I don't think it is possible to think of all usages of this, but for every simd instruction there are valid usages. At least for writing pfft, I found shuffling two vectors very useful. For example, I needed a function that takes a small, square, power of two number of elements stored in vectors and bit-reverses them - it rearranges them so that you can calculate the new index of each element by reversing bits of the old index (for 16 elements using 4 element vectors this can actually be done using std.simd.transpose, but for AVX it was more efficient to make this function work on 64 elements). There are other places in pfft where I need to select elements from two vectors (for example, here https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 is the platform specific code for AVX). I don't think these are the kind of things that should be implemented in std.simd. If you wanted to implement all such operations (for example bit reversing a small array) that somebody may find useful at some time, std.simd would need to be huge, and most of it would never be used.
I was referring purely to your 2-vector swizzle idea (or useful high-level ideas in general). Not to hyper-context-specific functions :P
My point was that those context specific functions can be implemented using a 2 vector swizzle. LLVM, for example, actually provides access to most vector shuffling instructions through "shufflevector", which is basically a 2 vector swizzle.
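A scalar sketch of what such a two-vector swizzle could look like, following the shufflevector convention that indices 0 to 3 pick from the first vector and 4 to 7 from the second. The names shuffle2 and singleInstruction are hypothetical, and the instruction table is deliberately incomplete; it only illustrates the "check whether one instruction does exactly this, otherwise fall back" strategy described earlier in the thread.

```d
alias V4 = float[4];

enum inRange(int i) = i >= 0 && i < 8;

// Reference semantics: index 0..3 selects from a, 4..7 from b.
V4 shuffle2(int i0, int i1, int i2, int i3)(V4 a, V4 b)
{
    static assert(inRange!i0 && inRange!i1 && inRange!i2 && inRange!i3,
                  "shuffle indices must be in 0 .. 8");
    float pick(int i) { return i < 4 ? a[i] : b[i - 4]; }
    return [pick(i0), pick(i1), pick(i2), pick(i3)];
}

// Compile-time classification of an index set (illustrative, not exhaustive).
template singleInstruction(int i0, int i1, int i2, int i3)
{
    static if (i0 == 0 && i1 == 4 && i2 == 1 && i3 == 5)
        enum singleInstruction = "unpcklps";
    else static if (i0 == 2 && i1 == 6 && i2 == 3 && i3 == 7)
        enum singleInstruction = "unpckhps";
    else static if (i0 < 4 && i1 < 4 && i2 >= 4 && i3 >= 4)
        enum singleInstruction = "shufps";
    else
        enum singleInstruction = "fallback";
}

void main()
{
    V4 a = [0, 1, 2, 3];
    V4 b = [4, 5, 6, 7];

    assert(shuffle2!(0, 4, 1, 5)(a, b) == [0f, 4f, 1f, 5f]);
    assert(shuffle2!(3, 2, 5, 4)(a, b) == [3f, 2f, 5f, 4f]);

    static assert(singleInstruction!(0, 4, 1, 5) == "unpcklps");
    static assert(singleInstruction!(3, 2, 5, 4) == "shufps");
    static assert(singleInstruction!(4, 0, 5, 1) == "fallback");
}
```

In a real implementation each branch of the static if chain would emit the matching intrinsic, and the "fallback" case would use the swizzle-both-then-blend approach.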
Oct 02 2012
On 2 October 2012 23:52, jerro <a a.com> wrote: On Tuesday, 2 October 2012 at 13:36:37 UTC, Manu wrote: On 2 October 2012 13:49, jerro <a a.com> wrote: I don't think it is possible to think of all usages of this, but for every simd instruction there are valid usages. At least for writing pfft, I found shuffling two vectors very useful. For example, I needed a function that takes a small, square, power of two number of elements stored in vectors and bit-reverses them - it rearranges them so that you can calculate the new index of each element by reversing bits of the old index (for 16 elements using 4 element vectors this can actually be done using std.simd.transpose, but for AVX it was more efficient to make this function work on 64 elements). There are other places in pfft where I need to select elements from two vectors (for example, here https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 is the platform specific code for AVX). I don't think these are the kind of things that should be implemented in std.simd. If you wanted to implement all such operations (for example bit reversing a small array) that somebody may find useful at some time, std.simd would need to be huge, and most of it would never be used.
I was referring purely to your 2-vector swizzle idea (or useful high-level ideas in general). Not to hyper-context-specific functions :P
My point was that those context specific functions can be implemented using a 2 vector swizzle. LLVM, for example, actually provides access to most vector shuffling instruction through "shufflevector", which is basically a 2 vector swizzle.
Yeah, I understand. And it's a good suggestion. I'll add support for 2-vector swizzling next time I'm working on it.
Oct 02 2012
SIMD in LDC is currently broken
What problems did you have with it? It seems to work fine for me.
Oct 02 2012
On Tuesday, 2 October 2012 at 21:03:36 UTC, jerro wrote: SIMD in LDC is currently broken
What problems did you have with it? It seems to work fine for me.
Can you post an example of doing a simple arithmetic with two 'float4's? My simple tests either fail with LLVM errors or don't produce correct results (which reminds me, I meant to report them, I'll do that). Here's an example:

    import core.simd, std.stdio;

    void main()
    {
      float4 a = 1, b = 2;
      writeln((a + b).array); // WORKS: [3, 3, 3, 3]

      float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
                               // not match pointer operand type!"
                               // [..a bunch of LLVM error code..]

      float4 c = 0, d = 1;
      c.array[0] = 4;
      c.ptr[1] = 4;
      writeln((c + d).array); // WRONG: [1, 1, 1, 1]
    }
Oct 02 2012
Also, I'm using the LDC off the official Arch community repo.
Oct 02 2012
    import core.simd, std.stdio;

    void main()
    {
      float4 a = 1, b = 2;
      writeln((a + b).array); // WORKS: [3, 3, 3, 3]

      float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
                               // not match pointer operand type!"
                               // [..a bunch of LLVM error code..]

      float4 c = 0, d = 1;
      c.array[0] = 4;
      c.ptr[1] = 4;
      writeln((c + d).array); // WRONG: [1, 1, 1, 1]
    }

Oh, that doesn't work for me either. I never tried to use those, so I didn't notice that before. This code gives me internal compiler errors with GDC and DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD 2.060 and recent versions of GDC and LDC on 64 bit Linux.
Oct 02 2012
jerro wrote: This code gives me internal compiler errors with GDC and DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD 2.060 and recent versions of GDC and LDC on 64 bit Linux.
Yes, the SIMD situation isn't entirely usable right now with DMD and LDC. Only simple vector arithmetic is possible to my knowledge. The internal DMD error is actually from processing '(a + b)' and returning it to writeln() without assigning to a separate float4 first. For example, this compiles with DMD and outputs correctly:

    import core.simd, std.stdio;

    void main()
    {
      float4 a = 1, b = 2;
      float4 r = a + b;
      writeln(r.array);

      float4 c = [1, 2, 3, 4];
      float4 d = 1;

      c.array[0] = 4;
      c.ptr[1] = 4;
      r = c + d;
      writeln(r.array);
    }

correctly outputs:

    [3, 3, 3, 3]
    [5, 5, 4, 5]

I've never tried to do SIMD with GDC, though I understand it's done differently and core.simd XMM operations aren't supported (though I can't get them to work in DMD either... *sigh*). Take a look at Manu's std.simd library for reference on GDC SIMD support: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d
Oct 02 2012
On 3 October 2012 02:31, jerro <a a.com> wrote:

    import core.simd, std.stdio;

    void main()
    {
      float4 a = 1, b = 2;
      writeln((a + b).array); // WORKS: [3, 3, 3, 3]

      float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
                               // not match pointer operand type!"
                               // [..a bunch of LLVM error code..]

      float4 c = 0, d = 1;
      c.array[0] = 4;
      c.ptr[1] = 4;
      writeln((c + d).array); // WRONG: [1, 1, 1, 1]
    }

Oh, that doesn't work for me either. I never tried to use those, so I didn't notice that before. This code gives me internal compiler errors with GDC and DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD 2.060 and recent versions of GDC and LDC on 64 bit Linux.
Then don't just talk about it, raise a bug - otherwise how do you expect it to get fixed! ( http://www.gdcproject.org/bugzilla ) I've made a note of the error you get with `__vector(float[4]) c = [1,2,3,4];' - That is because vector expressions implementation is very basic at the moment. Look forward to hear from all your experiences so we can make vector support rock solid in GDC. ;-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 03 2012
Then don't just talk about it, raise a bug - otherwise how do you expect it to get fixed! ( http://www.gdcproject.org/bugzilla )
I'm trying to create a bugzilla account on that site now, but account creation doesn't seem to be working (I never get the confirmation e-mail).
Oct 03 2012
jerro wrote: I'm trying to create a bugzilla account on that site now, but account creation doesn't seem to be working (I never get the confirmation e-mail).
I never received an email either. Is there an expected time delay?
Oct 03 2012
On 3 October 2012 16:40, Iain Buclaw <ibuclaw ubuntu.com> wrote: On 3 October 2012 02:31, jerro <a a.com> wrote:

    import core.simd, std.stdio;

    void main()
    {
      float4 a = 1, b = 2;
      writeln((a + b).array); // WORKS: [3, 3, 3, 3]

      float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
                               // not match pointer operand type!"
                               // [..a bunch of LLVM error code..]

      float4 c = 0, d = 1;
      c.array[0] = 4;
      c.ptr[1] = 4;
      writeln((c + d).array); // WRONG: [1, 1, 1, 1]
    }

Oh, that doesn't work for me either. I never tried to use those, so I didn't notice that before. This code gives me internal compiler errors with GDC and DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD 2.060 and a recent versions of GDC and LDC on 64 bit Linux.
Then don't just talk about it, raise a bug - otherwise how do you expect it to get fixed! ( http://www.gdcproject.org/bugzilla )

I've made a note of the error you get with `__vector(float[4]) c = [1,2,3,4];' - that is because the vector expression implementation is very basic at the moment. Looking forward to hearing about all your experiences so we can make vector support rock solid in GDC. ;-)
I didn't realise vector literals like that were supported properly in the front end yet? Do they work at all? What does the code generated look like?
Oct 05 2012
On 5 October 2012 11:28, Manu <turkeyman gmail.com> wrote:
I didn't realise vector literals like that were supported properly in the front end yet? Do they work at all? What does the code generated look like?
They get passed to the backend as of 2.060 - so looks like the semantic passes now allow them.

I've just recently added backend support in GDC -
https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194

The codegen looks like so:

float4 a = 2;
float4 b = [1,2,3,4];

==>
vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };

==>
        movaps  .LC0, %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  .LC1, %xmm0
        movaps  %xmm0, -40(%ebp)

        .align 16
.LC0:
        .long   1073741824
        .long   1073741824
        .long   1073741824
        .long   1073741824
        .align 16
.LC1:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432

Regards,
--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 05 2012
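The .long values in the constant pool above are the raw IEEE-754 single-precision bit patterns of the vector lanes, which is easy to check by reinterpreting them. A minimal sketch; bitsToFloat is a hypothetical helper, not part of any library:

```d
import std.stdio;

/// Reinterprets a 32-bit pattern as an IEEE-754 single-precision float.
float bitsToFloat(uint bits)
{
    return *cast(float*) &bits;
}

void main()
{
    // .LC0: the value broadcast to every lane by `float4 a = 2`.
    assert(bitsToFloat(1073741824) == 2.0f);

    // .LC1: the four lanes of `float4 b = [1,2,3,4]`.
    immutable uint[4] lc1 = [1065353216, 1073741824, 1077936128, 1082130432];
    immutable float[4] expected = [1.0f, 2.0f, 3.0f, 4.0f];
    foreach (i, bits; lc1)
        assert(bitsToFloat(bits) == expected[i]);

    // Print each constant as decimal, hexadecimal and decoded float.
    foreach (bits; lc1)
        writefln("%d = 0x%08X = %s", bits, bits, bitsToFloat(bits));
}
```

So both literals end up as 16-byte aligned constants that a single movaps can load, which is the result Manu was asking about.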
On 5 October 2012 14:46, Iain Buclaw <ibuclaw ubuntu.com> wrote:
They get passed to the backend as of 2.060 - so looks like the semantic passes now allow them.

I've just recently added backend support in GDC -
https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194

The codegen looks like so:

float4 a = 2;
float4 b = [1,2,3,4];

==>
vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };

==>
        movaps  .LC0, %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  .LC1, %xmm0
        movaps  %xmm0, -40(%ebp)

        .align 16
.LC0:
        .long   1073741824
        .long   1073741824
        .long   1073741824
        .long   1073741824
        .align 16
.LC1:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
Perfect! I can get on with my unittests :P
Oct 07 2012
On 7 October 2012 13:12, Manu <turkeyman gmail.com> wrote:
Perfect! I can get on with my unittests :P
I fixed them again.

https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201

float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
        movss   -16(%rbp), %xmm0
        movss   -12(%rbp), %xmm1

Regards
--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 08 2012
Iain Buclaw wrote:
I fixed them again.

https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201

float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
        movss   -16(%rbp), %xmm0
        movss   -12(%rbp), %xmm1
Nice, not even DMD can do this yet. Can these changes be pushed upstream?

On a side note, I understand GDC doesn't support the core.simd.__simd(...) command, and I'm sure you have good reasons for this. However, it would still be nice if:

a) this interface was supported through function-wrappers, or..
b) DMD/LDC could find common ground with GDC in SIMD instructions

I just think this sort of difference should be worked out early on. If this simply can't or won't be changed, would you mind giving a short explanation as to why? (Please forgive if you've explained this already before). Is core.simd designed to really never be used and Manu's std.simd is really the starting place for libraries? (I believe I remember him mentioning that)
Oct 08 2012
On 8 October 2012 23:05, Iain Buclaw <ibuclaw ubuntu.com> wrote:
I fixed them again.

https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201

float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
        movss   -16(%rbp), %xmm0
        movss   -12(%rbp), %xmm1
Errr, that's not fixed...? movss is not the opcode you're looking for. Surely that should produce a single movaps...
Oct 08 2012
On 8 October 2012 22:18, Manu <turkeyman gmail.com> wrote:
Errr, that's not fixed...? movss is not the opcode you're looking for. Surely that should produce a single movaps...
I didn't say I compiled with optimisations - only -march=native. =)

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 08 2012
On 8 October 2012 22:18, F i L <witte2008 gmail.com> wrote:
Nice, not even DMD can do this yet. Can these changes be pushed upstream? On a side note, I understand GDC doesn't support the core.simd.__simd(...) command, and I'm sure you have good reasons for this. However, it would still be nice if: a) this interface was supported through function-wrappers, or.. b) DMD/LDC could find common ground with GDC in SIMD instructions I just think this sort of difference should be worked out early on. If this simply can't or won't be changed, would you mind giving a short explanation as to why? (Please forgive if you've explained this already before). Is core.simd designed to really never be used and Manu's std.simd is really the starting place for libraries? (I believe I remember him mentioning that)
I'm refusing to implement any intrinsic that is tied to a specific architecture.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 08 2012
On 9 October 2012 00:18, F i L <witte2008 gmail.com> wrote:
Nice, not even DMD can do this yet. Can these changes be pushed upstream? On a side note, I understand GDC doesn't support the core.simd.__simd(...) command, and I'm sure you have good reasons for this. However, it would still be nice if: a) this interface was supported through function-wrappers, or.. b) DMD/LDC could find common ground with GDC in SIMD instructions I just think this sort of difference should be worked out early on. If this simply can't or won't be changed, would you mind giving a short explanation as to why? (Please forgive if you've explained this already before). Is core.simd designed to really never be used and Manu's std.simd is really the starting place for libraries? (I believe I remember him mentioning that)
core.simd just provides what the compiler provides in its most primal state. As far as I'm concerned, it's just not meant to be used directly except by library authors.
It's possible that a uniform suite of names could be made to wrap all the compiler-specific names (LDC is different again), but that would just get wrapped a second time one level higher. Hardly seems worth the effort.
Oct 08 2012
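The kind of uniform wrapper being discussed could look roughly like the following. This is only a sketch under assumptions: add4 is a hypothetical name, the DMD branch assumes core.simd.__simd with the XMM.ADDPS opcode on an x86 target, and the other compilers simply fall back to the ordinary vector operator instead of their own builtins.

```d
import core.simd;

/// Hypothetical portable wrapper: one name, one implementation per compiler.
float4 add4(float4 a, float4 b)
{
    version (DigitalMars)
    {
        // DMD exposes raw SSE opcodes through core.simd.__simd.
        return cast(float4) __simd(XMM.ADDPS, a, b);
    }
    else
    {
        // GDC and LDC: use the architecture-neutral vector operator.
        return a + b;
    }
}

void main()
{
    float4 x = 1.0f, y = 2.0f;
    float4 s = add4(x, y);
    float[4] r = s.array;
    assert(r == [3.0f, 3.0f, 3.0f, 3.0f]);
}
```

For an operation this simple the wrapper adds nothing over a + b, which is the point about wrapping things twice; it would only start to pay off for operations that have no operator form.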
On 9 October 2012 00:29, Iain Buclaw <ibuclaw ubuntu.com> wrote:
I didn't say I compiled with optimisations - only -march=native. =)
Either way, that code is wrong. The prior code was correct (albeit with the redundant store, which I presume would have gone away with optimisation enabled) --bcaec54fb81aa4eafa04cb92f4fb Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 9 October 2012 00:29, Iain Buclaw <span dir=3D"ltr"><<a href=3D"mailt= o:ibuclaw ubuntu.com" target=3D"_blank">ibuclaw ubuntu.com</a>></span> w= rote:<br><div class=3D"gmail_quote"><blockquote class=3D"gmail_quote" style= =3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div class=3D"HOEnZb"><div class=3D"h5">On 8 October 2012 22:18, Manu <<= a href=3D"mailto:turkeyman gmail.com">turkeyman gmail.com</a>> wrote:<br=
ntu.com">ibuclaw ubuntu.com</a>> wrote:<br> >><br> >> On 7 October 2012 13:12, Manu <<a href=3D"mailto:turkeyman gmai= l.com">turkeyman gmail.com</a>> wrote:<br> >> > On 5 October 2012 14:46, Iain Buclaw <<a href=3D"mailto:ib= uclaw ubuntu.com">ibuclaw ubuntu.com</a>> wrote:<br> >> >><br> >> >> On 5 October 2012 11:28, Manu <<a href=3D"mailto:turke= yman gmail.com">turkeyman gmail.com</a>> wrote:<br> >> >> > On 3 October 2012 16:40, Iain Buclaw <<a href=3D"= mailto:ibuclaw ubuntu.com">ibuclaw ubuntu.com</a>> wrote:<br> >> >> >><br> >> >> >> On 3 October 2012 02:31, jerro <<a href=3D"ma= ilto:a a.com">a a.com</a>> wrote:<br> >> >> >> >> import core.simd, std.stdio;<br> >> >> >> >><br> >> >> >> >> void main()<br> >> >> >> >> {<br> >> >> >> >> =C2=A0 float4 a =3D 1, b =3D 2;<br> >> >> >> >> =C2=A0 writeln((a + b).array); // WORKS= : [3, 3, 3, 3]<br> >> >> >> >><br> >> >> >> >> =C2=A0 float4 c =3D [1, 2, 3, 4]; // ER= ROR: "Stored value type does<br> >> >> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// not match poi= nter operand type!"<br> >> >> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// [..a bunch of= LLVM error code..]<br> >> >> >> >><br> >> >> >> >> =C2=A0 float4 c =3D 0, d =3D 1;<br> >> >> >> >> =C2=A0 c.array[0] =3D 4;<br> >> >> >> >> =C2=A0 c.ptr[1] =3D 4;<br> >> >> >> >> =C2=A0 writeln((c + d).array); // WRONG= : [1, 1, 1, 1]<br> >> >> >> >> }<br> >> >> >> ><br> >> >> >> ><br> >> >> >> > Oh, that doesn't work for me either. I = never tried to use those,<br> >> >> >> > so I<br> >> >> >> > didn't<br> >> >> >> > notice that before. This code gives me inte= rnal compiler errors<br> >> >> >> > with<br> >> >> >> > GDC<br> >> >> >> > and<br> >> >> >> > DMD too (with "float4 c =3D [1, 2, 3, = 4]" commented out). 
I'm using<br> >> >> >> > DMD<br> >> >> >> > 2.060<br> >> >> >> > and a recent versions of GDC and LDC on 64 = bit Linux.<br> >> >> >><br> >> >> >> Then don't just talk about it, raise a bug -= otherwise how do you<br> >> >> >> expect it to get fixed! =C2=A0( <a href=3D"http:= //www.gdcproject.org/bugzilla" target=3D"_blank">http://www.gdcproject.org/= bugzilla</a> )<br> >> >> >><br> >> >> >> I've made a note of the error you get with `= __vector(float[4]) c =3D<br> >> >> >> [1,2,3,4];' - That is because vector express= ions implementation is<br> >> >> >> very basic at the moment. =C2=A0Look forward to = hear from all your<br> >> >> >> experiences so we can make vector support rock s= olid in GDC. ;-)<br> >> >> ><br> >> >> ><br> >> >> > I didn't realise vector literals like that were = supported properly in<br> >> >> > the<br> >> >> > front end yet?<br> >> >> > Do they work at all? What does the code generated lo= ok like?<br> >> >><br> >> >> They get passed to the backend as of 2.060 - so looks lik= e the<br> >> >> semantic passes now allow them.<br> >> >><br> >> >> I've just recently added backend support in GDC -<br> >> >><br> >> >><br> >> >> <a href=3D"https://github.com/D-Programming-GDC/GDC/commi= t/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194" target=3D"_blank">https= ://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b6= 89eb143c#L1R1194</a><br> >> >><br> >> >> The codegen looks like so:<br> >> >><br> >> >> float4 a =3D 2;<br> >> >> float4 b =3D [1,2,3,4];<br> >> >><br> >> >> =3D=3D><br> >> >> vector(4) float a =3D { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };= <br> >> >> vector(4) float b =3D { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };= <br> >> >><br> >> >> =3D=3D><br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movaps =C2=A0.LC0, %xmm0<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movaps =C2=A0%xmm0, -24(%ebp)= <br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movaps =C2=A0.LC1, %xmm0<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movaps =C2=A0%xmm0, -40(%ebp)= <br> >> >><br> >> >> 
=C2=A0 =C2=A0 =C2=A0 =C2=A0 .align 16<br> >> >> .LC0:<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1073741824<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1073741824<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1073741824<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1073741824<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .align 16<br> >> >> .LC1:<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1065353216<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1073741824<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1077936128<br> >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 .long =C2=A0 1082130432<br> >> ><br> >> ><br> >> > Perfect!<br> >> > I can get on with my unittests :P<br> >><br> >> I fixed them again.<br> >><br> >><br> >> <a href=3D"https://github.com/D-Programming-GDC/GDC/commit/9402516= e0b07031e841a15849f5dc94ae81dccdc#L4R1201" target=3D"_blank">https://github= .com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#= L4R1201</a><br> >><br> >><br> >> float a =3D 1, b =3D 2, c =3D 3, d =3D 4;<br> >> float4 f =3D [a,b,c,d];<br> >><br> >> =3D=3D=3D><br> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movss =C2=A0 -16(%rbp), %xmm0<br> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 movss =C2=A0 -12(%rbp), %xmm1<br> ><br> ><br> > Errr, that's not fixed...?<br> > movss is not the opcode you're looking for.<br> > Surely that should produce a single movaps...<br> <br> </div></div>I didn't say I compiled with optimisations - only -march=3D= native. =C2=A0=3D)<br></blockquote><div><br></div><div>Either way, that cod= e is wrong. The prior code was correct (albeit with the redundant store, wh= ich I presume would have gone away with optimisation enabled)</div> </div> --bcaec54fb81aa4eafa04cb92f4fb--
Oct 08 2012
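A side note on the constant pools quoted above: the `.long` entries under `.LC0` and `.LC1` are the raw IEEE-754 single-precision bit patterns of the vector elements, which is why a single `movaps` from the pool is enough to materialise each literal. The sketch below decodes them; `bitsToFloat` is a made-up helper name for illustration, and it assumes `float` is a 32-bit IEEE-754 value, as it is on the x86 targets discussed here.

```d
import std.stdio;

// Hypothetical helper (not from the thread): reinterpret a 32-bit
// pattern as an IEEE-754 single-precision float.
float bitsToFloat(uint bits)
{
    return *cast(float*) &bits;
}

void main()
{
    // .LC0 from the quoted listing: the pool entry for `float4 a = 2`.
    immutable uint[4] lc0 = [1073741824, 1073741824, 1073741824, 1073741824];
    // .LC1 from the quoted listing: the pool entry for `float4 b = [1,2,3,4]`.
    immutable uint[4] lc1 = [1065353216, 1073741824, 1077936128, 1082130432];

    foreach (bits; lc0)
        write(bitsToFloat(bits), " ");
    writeln();

    foreach (bits; lc1)
        write(bitsToFloat(bits), " ");
    writeln();
}
```

In hexadecimal the four distinct values are 0x3F800000, 0x40000000, 0x40400000 and 0x40800000, i.e. 1.0, 2.0, 3.0 and 4.0.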
On 9 October 2012 00:30, Iain Buclaw <ibuclaw ubuntu.com> wrote:
> On 8 October 2012 22:18, F i L <witte2008 gmail.com> wrote:
>> Iain Buclaw wrote:
>>> I fixed them again.
>>>
>>> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>>>
>>> float a = 1, b = 2, c = 3, d = 4;
>>> float4 f = [a,b,c,d];
>>>
>>> ===>
>>>         movss   -16(%rbp), %xmm0
>>>         movss   -12(%rbp), %xmm1
>>
>> Nice, not even DMD can do this yet. Can these changes be pushed upstream?
>>
>> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
>> command, and I'm sure you have good reasons for this. However, it would
>> still be nice if:
>>
>> a) this interface was supported through function-wrappers, or..
>> b) DMD/LDC could find common ground with GDC in SIMD instructions
>>
>> I just think this sort of difference should be worked out early on. If this
>> simply can't or won't be changed, would you mind giving a short explanation
>> as to why? (Please forgive if you've explained this already before). Is
>> core.simd designed to really never be used and Manu's std.simd is really the
>> starting place for libraries? (I believe I remember him mentioning that)
>
> I'm refusing to implement any intrinsic that is tied to a specific
> architecture.

GCC offers perfectly good intrinsics anyway. And they're superior to the DMD intrinsics too.
Oct 08 2012
On 8 October 2012 22:34, Manu <turkeyman gmail.com> wrote:
> On 9 October 2012 00:30, Iain Buclaw <ibuclaw ubuntu.com> wrote:
>> On 8 October 2012 22:18, F i L <witte2008 gmail.com> wrote:
>>> Iain Buclaw wrote:
>>>> I fixed them again.
>>>>
>>>> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>>>>
>>>> float a = 1, b = 2, c = 3, d = 4;
>>>> float4 f = [a,b,c,d];
>>>>
>>>> ===>
>>>>         movss   -16(%rbp), %xmm0
>>>>         movss   -12(%rbp), %xmm1
>>>
>>> Nice, not even DMD can do this yet. Can these changes be pushed upstream?
>>>
>>> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
>>> command, and I'm sure you have good reasons for this. However, it would
>>> still be nice if:
>>>
>>> a) this interface was supported through function-wrappers, or..
>>> b) DMD/LDC could find common ground with GDC in SIMD instructions
>>>
>>> I just think this sort of difference should be worked out early on. If this
>>> simply can't or won't be changed, would you mind giving a short explanation
>>> as to why? (Please forgive if you've explained this already before). Is
>>> core.simd designed to really never be used and Manu's std.simd is really the
>>> starting place for libraries? (I believe I remember him mentioning that)
>>
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>
> GCC offers perfectly good intrinsics anyway. And they're superior to the DMD
> intrinsics too.

Provided that a) the architecture provides them, and b) you have the right -march/-mtune flags turned on.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 08 2012
On Monday, 8 October 2012 at 20:23:50 UTC, Iain Buclaw wrote:
> float a = 1, b = 2, c = 3, d = 4;
> float4 f = [a,b,c,d];
>
> ===>
>         movss   -16(%rbp), %xmm0
>         movss   -12(%rbp), %xmm1

The obligatory "me too" post:

LDC turns

---
import core.simd;

struct T {
    float a, b, c, d;
    ubyte[100] passOnStack;
}

void test(T t) {
    receiver([t.a, t.b, t.c, t.d]);
}

void receiver(float4 f);
---

into

---
0000000000000000 <_D4test4testFS4test1TZv>:
   0:   50                      push   rax
   1:   0f 28 44 24 10          movaps xmm0,XMMWORD PTR [rsp+0x10]
   6:   e8 00 00 00 00          call   b <_D4test4testFS4test1TZv+0xb>
   b:   58                      pop    rax
   c:   c3                      ret
---

(the struct is just there so that the values are actually on the stack, and receiver just so that the optimizer doesn't eat everything for breakfast).

David
Oct 08 2012
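For readers wondering why one `movaps` can stand in for four scalar loads in the LDC listing above: the four `float` fields sit next to each other in memory and together occupy 16 bytes, the width of one XMM register. A minimal sketch to check that layout follows; the struct name `Floats` and the helper `fieldOffsets` are made up for illustration, and the padding array from the original post is left out.

```d
import std.stdio;

// Illustrative struct: the same four leading fields as T in the post
// above, without the ubyte[100] that forced it onto the stack.
struct Floats
{
    float a, b, c, d;
}

// Byte offsets of the four fields within the struct.
size_t[4] fieldOffsets()
{
    return [Floats.a.offsetof, Floats.b.offsetof,
            Floats.c.offsetof, Floats.d.offsetof];
}

void main()
{
    writeln("offsets: ", fieldOffsets());
    writeln("size:    ", Floats.sizeof);
}
```

Note that `movaps` also needs its memory operand to be 16-byte aligned; the listing loads from `[rsp+0x10]`, so this only explains the contiguity, not how the compiler established alignment.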
Iain Buclaw wrote:
> I'm refusing to implement any intrinsic that is tied to a specific
> architecture.

I see. So the __builtin_ia32_***() functions in gcc.builtins are architecture agnostic? I couldn't find much documentation about them on the web. Do you have any references you could pass on?

I guess it makes sense to just make std.simd the lib everyone uses for a "base-line" support of SIMD and let DMD do what it wants with its core.simd lib. It sounds like gcc.builtins is just a layer above core.simd anyway. Although now it seems that DMD's std.simd will need a bunch of 'static if (architectureX) { ... }' for every GDC builtin... I wonder if that shouldn't later be moved to (and standardized in) a 'core.builtins' module or something.

Thanks for the explanation.
Oct 08 2012
On Monday, 8 October 2012 at 21:36:08 UTC, F i L wrote:
> Iain Buclaw wrote:
>> float a = 1, b = 2, c = 3, d = 4;
>> float4 f = [a,b,c,d];
>>
>> ===>
>>         movss   -16(%rbp), %xmm0
>>         movss   -12(%rbp), %xmm1
>
> Nice, not even DMD can do this yet. Can these changes be pushed upstream?

No, the actual codegen is compiler-specific (and apparently wrong in the case of GDC, if this is the actual piece of code emitted for the code snippet).

> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
> command, and I'm sure you have good reasons for this. However, it would
> still be nice if:
>
> a) this interface was supported through function-wrappers, or..
> b) DMD/LDC could find common ground with GDC in SIMD instructions

LDC won't support core.simd.__simd in the foreseeable future either. The reason is that it is a) untyped and b) highly x86-specific, both of which make it hard to integrate with LLVM – __simd is really just a glorified inline assembly expression (hm, this makes me think, maybe it could be implemented quite easily in terms of a transformation to LLVM inline assembly expressions…).

> Is core.simd designed to really never be used and Manu's std.simd is really
> the starting place for libraries? (I believe I remember him mentioning that)

With all due respect to Walter, core.simd isn't really "designed" much at all, or at least this isn't visible in its current state – it rather seems like a quick hack to get some basic SIMD code working with DMD (but beware of ICEs).

Walter, if you are following this thread, do you have any plans for SIMD on non-x86 platforms?

David
Oct 08 2012
On 9 October 2012 00:38, F i L <witte2008 gmail.com> wrote:
> Iain Buclaw wrote:
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>
> I see. So the __builtin_ia32_***() functions in gcc.builtins are
> architecture agnostic? I couldn't find much documentation about them on the
> web. Do you have any references you could pass on?
>
> I guess it makes sense to just make std.simd the lib everyone uses for a
> "base-line" support of SIMD and let DMD do what it wants with its core.simd
> lib. It sounds like gcc.builtins is just a layer above core.simd anyway.
> Although now it seems that DMD's std.simd will need a bunch of 'static if
> (architectureX) { ... }' for every GDC builtin... I wonder if that shouldn't
> later be moved to (and standardized in) a 'core.builtins' module or
> something.
>
> Thanks for the explanation.

gcc.builtins does something different depending on architecture and target CPU flags. All I do is take what the gcc backend gives to the frontend, and hash it out to D.

What I meant is that I won't implement a frontend intrinsic that...

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 08 2012
On 9 October 2012 02:38, F i L <witte2008 gmail.com> wrote:
> Iain Buclaw wrote:
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>
> I see. So the __builtin_ia32_***() functions in gcc.builtins are
> architecture agnostic? I couldn't find much documentation about them on the
> web. Do you have any references you could pass on?
>
> I guess it makes sense to just make std.simd the lib everyone uses for a
> "base-line" support of SIMD and let DMD do what it wants with its core.simd
> lib. It sounds like gcc.builtins is just a layer above core.simd anyway.
> Although now it seems that DMD's std.simd will need a bunch of 'static if
> (architectureX) { ... }' for every GDC builtin... I wonder if that shouldn't
> later be moved to (and standardized in) a 'core.builtins' module or
> something.
>
> Thanks for the explanation.

std.simd already does have a mammoth mess of static if(arch & compiler). The thing about std.simd is that it's designed to be portable, so it doesn't make sense to expose the low-level sse intrinsics directly there.

But giving it some thought, it might be nice to produce std.simd.sse and std.simd.vmx, etc. for collating the intrinsics used by different compilers, and anyone who is writing sse code explicitly might use std.simd.sse to avoid having to support all the different compilers' intrinsics themselves.

This sounds like a reasonable approach; the only question is what all these wrappers will do to the code-gen. I'll need to experiment/prove that out.
Oct 09 2012
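To make the "static if(arch & compiler)" collation discussed above a little more concrete, here is a rough sketch of compile-time selection in D. It is not the std.simd source. The `version` identifiers (GNU, LDC, DigitalMars, X86, X86_64, ARM, AArch64, PPC, PPC64) are predefined by the language, but `compilerName`, `wrapperModule` and `describeTarget` are invented for illustration, and the module names are only the ones floated in this thread; mapping 64-bit ARM and PPC onto them is an assumption.

```d
import std.stdio;

// Which compiler is building this module?
version (GNU)              enum compilerName = "GDC";
else version (LDC)         enum compilerName = "LDC";
else version (DigitalMars) enum compilerName = "DMD";
else                       enum compilerName = "unknown";

// Hypothetical mapping from target architecture to the per-architecture
// wrapper modules proposed in the thread (none of these are assumed to exist).
version (X86_64)       enum wrapperModule = "std.simd.sse";
else version (X86)     enum wrapperModule = "std.simd.sse";
else version (ARM)     enum wrapperModule = "std.simd.neon";
else version (AArch64) enum wrapperModule = "std.simd.neon";
else version (PPC)     enum wrapperModule = "std.simd.vmx";
else version (PPC64)   enum wrapperModule = "std.simd.vmx";
else                   enum wrapperModule = "";

// Everything is resolved at compile time; only one branch is compiled in.
string describeTarget()
{
    static if (wrapperModule.length == 0)
        return compilerName ~ ": no SIMD wrapper module, scalar fallback";
    else
        return compilerName ~ ": would import " ~ wrapperModule;
}

void main()
{
    writeln(describeTarget());
}
```

A real wrapper would put the compiler-specific intrinsic call inside each branch; whether that extra layer inlines away cleanly is exactly the code-gen question raised in the post.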
On 9 October 2012 02:52, David Nadlinger <see klickverbot.at> wrote:
>> Is core.simd designed to really never be used and Manu's std.simd is
>> really the starting place for libraries? (I believe I remember him
>> mentioning that)
>
> With all due respect to Walter, core.simd isn't really "designed" much at
> all, or at least this isn't visible in its current state – it rather seems
> like a quick hack to get some basic SIMD code working with DMD (but beware
> of ICEs).
>
> Walter, if you are following this thread, do you have any plans for SIMD on
> non-x86 platforms?

DMD doesn't support non-x86 platforms... What DMD offers is fine, since it all needs to be collated anyway; GDC/LDC don't agree on intrinsics either.

I already support ARM and did some PPC experiments in std.simd. I just use the intrinsics that gdc/ldc provide, that's perfectly fine.

As said in my prior post, I think std.simd.sse, std.simd.neon, and friends, might all be a valuable addition. But we'll need to see about the codegen after it unravels a bunch of wrappers...
Oct 09 2012
On 9 October 2012 00:52, David Nadlinger <see klickverbot.at> wrote: On Monday, 8 October 2012 at 21:36:08 UTC, F i L wrote: Iain Buclaw wrote: float a = 1, b = 2, c = 3, d = 4; float4 f = [a,b,c,d]; ===> movss -16(%rbp), %xmm0 movss -12(%rbp), %xmm1
Nice, not even DMD can do this yet. Can these changes be pushed upstream?
No, the actual codegen is compiler-specific (and apparently wrong in the case of GDC, if this is the actual piece of code emitted for the code snippet).
On a side note, I understand GDC doesn't support the core.simd.__simd(..) command, and I'm sure you have good reasons for this. However, it would still be nice if: a) this interface was supported through function-wrappers, or.. b) DMD/LDC could find common ground with GDC in SIMD instructions
LDC won't support core.simd.__simd in the foreseeable future either. The reason is that it is a) untyped and b) highly x86-specific, both of which make it hard to integrate with LLVM – __simd is really just a glorified inline assembly expression (hm, this makes me think, maybe it could be implemented quite easily in terms of a transformation to LLVM inline assembly expressions…).
Is core.simd designed to really never be used and Manu's std.simd is really the starting place for libraries? (I believe I remember him mentioning that)
With all due respect to Walter, core.simd isn't really "designed" much at all, or at least this isn't visible in its current state – it rather seems like a quick hack to get some basic SIMD code working with DMD (but beware of ICEs). Walter, if you are following this thread, do you have any plans for SIMD on non-x86 platforms? David
Vector types already support the same basic operations that can be done on D arrays. So that itself guarantees cross platform. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 09 2012
An alternative approach is to have one module per architecture or compiler.
You mean something like std.simd.x86_gdc? In this case a user would need to write a different version of his code for each compiler or write his own wrappers (which is basically what we have now). This could cause a lot of redundant work. What is worse, some people wouldn't bother, and then we would have code that only works with one D compiler.
Oct 09 2012
On 2012-10-09, 16:20, jerro wrote: An alternative approach is to have one module per architecture or compiler.
You mean something like std.simd.x86_gdc? In this case a user would need to write a different version of his code for each compiler or write his own wrappers (which is basically what we have now). This could cause a lot of redundant work. What is worse, some people wouldn't bother, and then we would have code that only works with one D compiler.
Nope, like:

module std.simd;

version(Linux64) {
    public import std.internal.simd_linux64;
}

Then all std.internal.simd_* modules have the same public interface, and only the version that fits /your/ platform will be included. -- Simen
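A single-file sketch of the same idea may help, keyed on CPU architecture rather than OS. X86_64 and AArch64 are predefined D version identifiers; backendName and simdWidthBytes are made-up names for illustration, and a real std.simd would presumably do a public import of a per-platform module in each branch instead of declaring constants.

```d
// Every branch exposes the same public names, so callers never see
// which branch was compiled in. (Hypothetical names, illustration only.)
version (X86_64)
{
    enum backendName = "x86_64";
    enum simdWidthBytes = 16;   // SSE registers are 128 bits wide
}
else version (AArch64)
{
    enum backendName = "aarch64";
    enum simdWidthBytes = 16;   // NEON registers are 128 bits wide
}
else
{
    enum backendName = "generic";
    enum simdWidthBytes = 0;    // no SIMD assumed: scalar fallback
}

unittest
{
    // Exactly one branch is active, so both names always exist.
    static assert(backendName.length > 0);
    static assert(simdWidthBytes == 0 || simdWidthBytes == 16);
}
```

Only the branch matching the target is compiled; the other branches merely have to parse.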
Oct 09 2012
On Tuesday, 9 October 2012 at 10:29:25 UTC, Iain Buclaw wrote: Vector types already support the same basic operations that can be done on D arrays. So that itself guarantees cross platform.
That's obviously true, but not at all enough for most of the "interesting" use cases of vector types (otherwise, you could use array operations just as well). You need at least some sort of broadcasting/swizzling support for it to be interesting. David
Oct 09 2012
On Tuesday, 9 October 2012 at 16:59:58 UTC, Jacob Carlborg wrote: On 2012-10-09 16:52, Simen Kjaeraas wrote: Nope, like:

module std.simd;

version(Linux64) {
    public import std.internal.simd_linux64;
}

Then all std.internal.simd_* modules have the same public interface, and only the version that fits /your/ platform will be included.
Exactly, what he said.
I'm guessing the platform in this case would be the CPU architecture, since that determines what SIMD instructions are available, not the OS. But anyway, this does not address the problem Manu was talking about. The problem is that the API for the intrinsics for the same architecture is not consistent across compilers. So for example, if you wanted to generate the instruction "shufps XMM1, XMM2, 0x88" (this extracts all even elements from two vectors), you would need to write:

version(GNU)
{
    return __builtin_ia32_shufps(a, b, 0x88);
}
else version(LDC)
{
    return shufflevector(a, b, 0, 2, 4, 6);
}
else version(DigitalMars)
{
    // can't do that in DMD yet, but the way to do it will probably be different from the way it is done in LDC and GDC
}

What Manu meant with having std.simd.sse and std.simd.neon was to have modules that would provide access to the platform dependent instructions that would be portable across compilers. So for the shufps instruction above you would have something like this in std.simd.sse:

float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 b){ ... }

std.simd currently takes care of cases when the code can be written in a cross platform way. But when you need to use platform specific instructions directly, std.simd doesn't currently help you, while std.simd.sse, std.simd.neon and others would. What Manu is worried about is that having instructions wrapped in another level of functions would hurt performance. It certainly would slow things down in debug builds (and IIRC he has written in his previous posts that he does care about that). I don't think it would make much of a difference when compiled with optimizations turned on, at least not with LDC and GDC.
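To make the proposed signature concrete, here is a scalar reference sketch of what such a shufps wrapper would compute. It is not the std.simd.sse implementation (no such module exists in this thread). The names shufpsImm and shufpsRef are invented for illustration, and the sketch assumes a target where core.simd defines float4 (e.g. x86_64). A real version would forward the computed immediate to the compiler-specific intrinsic instead of copying lanes one by one.

```d
import core.simd;

/// Packs four 2-bit lane selectors into the 8-bit immediate shufps expects.
template shufpsImm(int i0, int i1, int i2, int i3)
{
    static assert(i0 >= 0 && i0 < 4 && i1 >= 0 && i1 < 4 &&
                  i2 >= 0 && i2 < 4 && i3 >= 0 && i3 < 4,
                  "each lane selector must be in the range 0 .. 4");
    enum ubyte shufpsImm = cast(ubyte)(i0 | (i1 << 2) | (i2 << 4) | (i3 << 6));
}

/// Scalar reference: result lanes 0-1 are picked from a, lanes 2-3 from b.
float4 shufpsRef(int i0, int i1, int i2, int i3)(float4 a, float4 b)
{
    float4 r;
    r.ptr[0] = a.array[i0];
    r.ptr[1] = a.array[i1];
    r.ptr[2] = b.array[i2];
    r.ptr[3] = b.array[i3];
    return r;
}

unittest
{
    // Selectors (0, 2, 0, 2) encode the 0x88 immediate from the post above.
    static assert(shufpsImm!(0, 2, 0, 2) == 0x88);

    float4 a = [0.0f, 1.0f, 2.0f, 3.0f];
    float4 b = [4.0f, 5.0f, 6.0f, 7.0f];
    float4 r = shufpsRef!(0, 2, 0, 2)(a, b);
    float[4] expected = [0.0f, 2.0f, 4.0f, 6.0f];
    assert(r.array == expected);   // the even elements of both inputs
}
```

Taking the selectors as template parameters keeps the immediate a compile-time constant, which the instruction requires.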
Oct 09 2012
Manu wrote: std.simd already does have a mammoth mess of static if(arch & compiler). The thing about std.simd is that it's designed to be portable, so it doesn't make sense to expose the low-level sse intrinsics directly there.
Well, that's not really what I was suggesting. I was saying maybe eventually matching the agnostic gdc builtins in a separate module:

    // core.builtins

    import core.simd;

    version (GNU)
      import gcc.builtins;

    void madd(ref float4 r, float4 a, float4 b)
    {
      version (X86_OR_X64)
      {
        version (DigitalMars)
        {
          r = __simd(XMM.PMADDWD, a, b);
        }
        else version (GNU)
        {
          __builtin_ia32_fmaddpd(r, a, b);
        }
      }
    }

then std.simd can just use a single function (madd) and forget about all the compiler-specific switches. This may be more work than it's worth and std.simd should just contain all the platform specific switches... idk, I'm just throwing out ideas.
Oct 09 2012
On Tuesday, 9 October 2012 at 19:18:35 UTC, F i L wrote: Manu wrote: std.simd already does have a mammoth mess of static if(arch & compiler). The thing about std.simd is that it's designed to be portable, so it doesn't make sense to expose the low-level sse intrinsics directly there.
Well, that's not really what I was suggesting. I was saying maybe eventually matching the agnostic gdc builtins in a separate module:

    // core.builtins

    import core.simd;

    version (GNU)
      import gcc.builtins;

    void madd(ref float4 r, float4 a, float4 b)
    {
      version (X86_OR_X64)
      {
        version (DigitalMars)
        {
          r = __simd(XMM.PMADDWD, a, b);
        }
        else version (GNU)
        {
          __builtin_ia32_fmaddpd(r, a, b);
        }
      }
    }

then std.simd can just use a single function (madd) and forget about all the compiler-specific switches. This may be more work than it's worth and std.simd should just contain all the platform specific switches... idk, I'm just throwing out ideas.
You know... now that I think about it, this is pretty much EXACTLY what std.simd IS already... lol, forget all of that, please.
Oct 09 2012
On 9 October 2012 20:46, jerro <a a.com> wrote: On Tuesday, 9 October 2012 at 16:59:58 UTC, Jacob Carlborg wrote: On 2012-10-09 16:52, Simen Kjaeraas wrote: Nope, like:

module std.simd;

version(Linux64) {
    public import std.internal.simd_linux64;
}

Then all std.internal.simd_* modules have the same public interface, and only the version that fits /your/ platform will be included.
Exactly, what he said.
I'm guessing the platform in this case would be the CPU architecture, since that determines what SIMD instructions are available, not the OS. But anyway, this does not address the problem Manu was talking about. The problem is that the API for the intrinsics for the same architecture is not consistent across compilers. So for example, if you wanted to generate the instruction "shufps XMM1, XMM2, 0x88" (this extracts all even elements from two vectors), you would need to write:

version(GNU)
{
    return __builtin_ia32_shufps(a, b, 0x88);
}
else version(LDC)
{
    return shufflevector(a, b, 0, 2, 4, 6);
}
else version(DigitalMars)
{
    // can't do that in DMD yet, but the way to do it will probably be different from the way it is done in LDC and GDC
}

What Manu meant with having std.simd.sse and std.simd.neon was to have modules that would provide access to the platform dependent instructions that would be portable across compilers. So for the shufps instruction above you would have something like this in std.simd.sse:

float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 b){ ... }

std.simd currently takes care of cases when the code can be written in a cross platform way. But when you need to use platform specific instructions directly, std.simd doesn't currently help you, while std.simd.sse, std.simd.neon and others would. What Manu is worried about is that having instructions wrapped in another level of functions would hurt performance. It certainly would slow things down in debug builds (and IIRC he has written in his previous posts that he does care about that). I don't think it would make much of a difference when compiled with optimizations turned on, at least not with LDC and GDC.
Perfect! You saved me writing anything at all ;) I do indeed care about debug builds, but one interesting possibility that I discussed with Walter last week was a #pragma inline statement, which may force-enable inlining even in debug. I'm not sure how that would translate to GDC/LDC, and that's an important consideration. I'd also like to prove that the code-gen does work well with 2 or 3 levels of inlining, and that the optimiser is still able to perform sensible code reordering in the target context.
Oct 10 2012
On 9 October 2012 21:56, F i L <witte2008 gmail.com> wrote: On Tuesday, 9 October 2012 at 19:18:35 UTC, F i L wrote: Manu wrote: std.simd already does have a mammoth mess of static if(arch & compiler). The thing about std.simd is that it's designed to be portable, so it doesn't make sense to expose the low-level sse intrinsics directly there.
Well, that's not really what I was suggesting. I was saying maybe eventually matching the agnostic gdc builtins in a separate module:

    // core.builtins

    import core.simd;

    version (GNU)
      import gcc.builtins;

    void madd(ref float4 r, float4 a, float4 b)
    {
      version (X86_OR_X64)
      {
        version (DigitalMars)
        {
          r = __simd(XMM.PMADDWD, a, b);
        }
        else version (GNU)
        {
          __builtin_ia32_fmaddpd(r, a, b);
        }
      }
    }

then std.simd can just use a single function (madd) and forget about all the compiler-specific switches. This may be more work than it's worth and std.simd should just contain all the platform specific switches... idk, I'm just throwing out ideas.
You know... now that I think about it, this is pretty much EXACTLY what std.simd IS already... lol, forget all of that, please.
Yes, I was gonna say... We're discussing providing convenient access to the arch intrinsics directly, which may be useful in many situations, although I think use of std.simd would be encouraged for the most part, for portability reasons. I'll take some time this weekend to do some experiments with GDC and LDC... actually, no I won't, I'm doing a 48 hour game jam (which I'll probably write in D too), but I'll do it soon! ;)
Oct 10 2012
On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote: I do indeed care about debug builds, but one interesting possibility that I discussed with Walter last week was a #pragma inline statement, which may force-enable inlining even in debug. I'm not sure how that would translate to GDC/LDC, […]
pragma(always_inline) or something like that would be trivially easy to implement in LDC. David
Oct 10 2012
On 10 October 2012 12:25, David Nadlinger <see klickverbot.at> wrote: On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote: I do indeed care about debug builds, but one interesting possibility that I discussed with Walter last week was a #pragma inline statement, which may force-enable inlining even in debug. I'm not sure how that would translate to GDC/LDC, […]
pragma(always_inline) or something like that would be trivially easy to implement in LDC. David
Currently pragma(attribute, always_inline) in GDC, but I am considering scrapping pragma(attribute) - the current implementation kept only for attributes used by gcc builtin functions - and introducing each supported attribute as an individual pragma.
Oct 10 2012
On 10 October 2012 14:50, Iain Buclaw <ibuclaw ubuntu.com> wrote: On 10 October 2012 12:25, David Nadlinger <see klickverbot.at> wrote: On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote: I do indeed care about debug builds, but one interesting possibility that I discussed with Walter last week was a #pragma inline statement, which may force-enable inlining even in debug. I'm not sure how that would translate to GDC/LDC, […]
pragma(always_inline) or something like that would be trivially easy to implement in LDC. David
Currently pragma(attribute, always_inline) in GDC, but I am considering scrapping pragma(attribute) - the current implementation kept only for attributes used by gcc builtin functions - and introducing each supported attribute as an individual pragma.
Right, well that's encouraging then. Maybe all the pieces fit, and we can perform liberal wrapping of the compiler-specific intrinsics in that case.
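For readers following along today: the D language now specifies pragma(inline, true), so two levels of wrapping can at least be written down in a compiler-neutral way. The sketch below assumes a target where core.simd defines float4. The names addImpl and add are hypothetical stand-ins for an intrinsic wrapper and a portable front end. Whether a given compiler actually inlines both levels in an unoptimised build is not something this sketch demonstrates.

```d
import core.simd;

// Innermost wrapper: stands in for a compiler-specific intrinsic call.
pragma(inline, true)
float4 addImpl(float4 a, float4 b)
{
    return a + b;
}

// Second level: the portable, std.simd-style entry point.
pragma(inline, true)
float4 add(float4 a, float4 b)
{
    return addImpl(a, b);
}

unittest
{
    float4 x = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 y = [10.0f, 20.0f, 30.0f, 40.0f];
    float4 r = add(x, y);
    float[4] expected = [11.0f, 22.0f, 33.0f, 44.0f];
    assert(r.array == expected);
}
```

Checking the generated assembly for a single addps with no calls would be the way to confirm the wrappers really vanish.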
Oct 10 2012
Manu wrote: actually, no I won't, I'm doing a 48 hour game jam (which I'll probably write in D too), but I'll do it soon! ;)
Nice :) For a competition or casual? I would love to see what you come up with. My brother and I released our second game (this time written with our own game-engine) a while back: http://www.youtube.com/watch?v=7pvCcgQiXNk Right now we're working on building 3D animation and physics into the engine for our next project. It's written in C#, but I've had plans for a while to port it to D once it gets to a certain point.
Oct 10 2012
On 10 October 2012 17:53, F i L <witte2008 gmail.com> wrote: Manu wrote: actually, no I won't, I'm doing a 48 hour game jam (which I'll probably write in D too), but I'll do it soon! ;)
Nice :) For a competition or casual? I would love to see what you come up with. My brother and I released our second game (this time written with our own game-engine) a while back: http://www.youtube.com/watch?v=7pvCcgQiXNk Right now we're working on building 3D animation and physics into the engine for our next project. It's written in C#, but I've had plans for a while to port it to D once it gets to a certain point.
It's a work event. Weekend office party effectively, with lots of beer and sauna (essential ingredients in any Finnish happenings!) I expect it'll be open source, should be on github, whatever it is. I'll build it on my toy engine (also open source): https://github.com/TurkeyMan/fuji/wiki, which has D bindings.
Oct 10 2012
On Tuesday, 9 October 2012 at 08:13:39 UTC, Manu wrote: DMD doesn't support non-x86 platforms... What DMD offers is fine, since it all needs to be collated anyway; GDC/LDC don't agree on intrinsics either.
By the way, I just committed a patch to auto-generate GCC->LLVM intrinsic mappings to LDC – thanks, Jernej! –, which would mean that you could in theory use the GDC code path for LDC as well. David
Oct 14 2012
David Nadlinger wrote: By the way, I just committed a patch to auto-generate GCC->LLVM intrinsic mappings to LDC – thanks, Jernej! –, which would mean that you could in theory use the GDC code path for LDC as well.
You're awesome, David!
Oct 14 2012
On Sunday, 14 October 2012 at 19:40:08 UTC, F i L wrote: David Nadlinger wrote: By the way, I just committed a patch to auto-generate GCC->LLVM intrinsic mappings to LDC – thanks, Jernej! –, which would mean that you could in theory use the GDC code path for LDC as well.
You're awesome, David!
Usually, yes, but in this case even I must admit that it was jerro who did the work… :P David
Oct 14 2012
On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote: Perfect! I can get on with my unittests :P
Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens. David
Oct 14 2012
On 14 October 2012 21:05, David Nadlinger <see klickverbot.at> wrote: On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote: Perfect! I can get on with my unittests :P
Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens. David
Could you pastebin a header generation of the gccbuiltins module? We can compare. =)
Oct 14 2012
On 14 October 2012 21:58, Iain Buclaw <ibuclaw ubuntu.com> wrote: On 14 October 2012 21:05, David Nadlinger <see klickverbot.at> wrote: On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote: Perfect! I can get on with my unittests :P
Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens. David
Could you pastebin a header generation of the gccbuiltins module? We can compare. =)
http://dpaste.dzfl.pl/4edb9ecc
Oct 14 2012
Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens. David
I have a fork of std.simd with LDC support at https://github.com/jerro/phobos/tree/std.simd and some tests for it at https://github.com/jerro/std.simd-tests .
Oct 14 2012
On 15 October 2012 02:50, jerro <a a.com> wrote: Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens. David
I have a fork of std.simd with LDC support at https://github.com/jerro/phobos/tree/std.simd and some tests for it at https://github.com/jerro/std.simd-tests .
Awesome. Pull request plz! :) That said, how did you come up with a lot of these implementations? Some don't look particularly efficient, and others don't even look right. xor for instance:

    return cast(T) (cast(int4) v1 ^ cast(int4) v2);

This is wrong for float types. x86 has separate instructions for doing this to floats, which make sure to do the right thing by the flags registers. Most of the LDC blocks assume that it could be any architecture... I don't think this will produce good portable code. It needs to be much more carefully hand-crafted, but it's a nice working start.
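For reference, a minimal runnable sketch of the cast-based xor quoted above, assuming a target where core.simd defines float4 and int4 (e.g. x86_64). The wrapper name xorBits is made up here. The sketch only shows the bit-level behaviour of the cast-based version; it says nothing about which instruction a given compiler emits for it, which is the concern raised in the post.

```d
import core.simd;

// Bitwise xor for a 128-bit vector type, going through a reinterpreting
// cast to int4. (Hypothetical helper name; only the body is from the thread.)
T xorBits(T)(T v1, T v2)
    if (T.sizeof == int4.sizeof)
{
    return cast(T)(cast(int4) v1 ^ cast(int4) v2);
}

unittest
{
    // Xor-ing with the IEEE sign bit flips the sign of every lane.
    float4 v = [1.0f, -2.0f, 3.0f, -4.0f];
    int4 signBits = [int.min, int.min, int.min, int.min];
    float4 negated = xorBits(v, cast(float4) signBits);
    float[4] expected = [-1.0f, 2.0f, -3.0f, 4.0f];
    assert(negated.array == expected);
}
```

A hand-tuned x86 path would instead ask for the float-domain instruction directly through each compiler's own intrinsic.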
;,Courier,monospace;font-size:12px;line-height:16px;white-space:pre;backgro= und-color:rgb(255,255,255)"> </span><span class=3D"n" style=3D"margin:0px;p= adding:0px;border:0px;color:rgb(51,51,51);font-family:Consolas,'Liberat= ion Mono',Courier,monospace;font-size:12px;line-height:16px;white-space= :pre;background-color:rgb(255,255,255)">v2</span><span class=3D"p" style=3D= "margin:0px;padding:0px;border:0px;color:rgb(51,51,51);font-family:Consolas= ,'Liberation Mono',Courier,monospace;font-size:12px;line-height:16p= x;white-space:pre;background-color:rgb(255,255,255)">);</span></div> <div><span style=3D"color:rgb(51,51,51);font-family:Consolas,'Liberatio= n Mono',Courier,monospace;font-size:12px;line-height:16px;white-space:p= re;background-color:rgb(255,255,255)"><br></span></div><div><span style=3D"= color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Courier,= monospace;font-size:12px;line-height:16px;white-space:pre;background-color:= rgb(255,255,255)">This is wrong for float types. x86 has separate instructi= ons for doing this to floats, which make sure to do the right thing by the = flags registers.</span></div> <div><span style=3D"color:rgb(51,51,51);font-family:Consolas,'Liberatio= n Mono',Courier,monospace;font-size:12px;line-height:16px;white-space:p= re;background-color:rgb(255,255,255)">Most of the LDC blocks assume that it= could be any architecture... I don't think this will produce good port= able code. It needs to be much more cafully hand-crafted, but it's a ni= ce working start.</span></div> --bcaec51d25d41c645004cc180d22--
Oct 15 2012
On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
On 15 October 2012 02:50, jerro <a a.com> wrote:
Speaking of test – are they available somewhere? Now that LDC at least theoretically supports most of the GCC builtins, I'd like to throw some tests at it to see what happens.
David
I have a fork of std.simd with LDC support at https://github.com/jerro/phobos/tree/std.simd and some tests for it at https://github.com/jerro/std.simd-tests.
Awesome. Pull request plz! :)
I did change the API for a few functions like loadUnaligned, though. In those cases the signatures needed to be changed because the functions used T or T* for scalar parameters and return types and Vector!T for the vector parameters and return types. This only compiles if T is a static array, which I don't think makes much sense. I changed those to take the vector type as a template parameter. The vector type cannot be inferred from the scalar type because you can use vector registers of different sizes simultaneously (with AVX, for example). Because of that the vector type must be passed explicitly for some functions, so I made it the first template parameter in those cases, so that Ver doesn't always need to be specified.
There is one more issue that I need to solve (and that may be a problem in some cases with GDC too): the pure, @safe and nothrow attributes. Currently gcc builtin declarations in LDC have none of those attributes (I have to look into which of those can be added and if it can be done automatically). I've just commented out the attributes in my std.simd fork for now, but this isn't a proper solution.

That said, how did you come up with a lot of these implementations? Some don't look particularly efficient, and others don't even look right.
xor for instance:
    return cast(T) (cast(int4) v1 ^ cast(int4) v2);
This is wrong for float types. x86 has separate instructions for doing this to floats, which make sure to do the right thing by the flags registers.
Most of the LDC blocks assume that it could be any architecture... I don't think this will produce good portable code. It needs to be much more carefully hand-crafted, but it's a nice working start.
The problem is that LLVM doesn't provide intrinsics for those operations. The xor function does compile to a single xorps instruction when compiling with -O1 or higher, though. I have looked at the code generated for many (most, I think, but not for all possible types) of those LDC blocks and most of them compile to the appropriate single instruction when compiled with -O2 or -O3. Even the ones for which the D source code looks horribly inefficient, like for example loadUnaligned.

By the way, clang does those in a similar way. For example, here is what clang emits for a wrapper around _mm_xor_ps when compiled with -O1 -emit-llvm:

    define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable readnone {
      %1 = bitcast <4 x float> %a to <4 x i32>
      %2 = bitcast <4 x float> %b to <4 x i32>
      %3 = xor <4 x i32> %1, %2
      %4 = bitcast <4 x i32> %3 to <4 x float>
      ret <4 x float> %4
    }

AFAICT, the only way to ensure that a certain instruction will be used with LDC when there is no LLVM intrinsic for it is to use inline assembly expressions. I remember having some problems with those in the past, but it could be that I was doing something wrong. Maybe we should look into that option too.
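To make the two points in this message concrete, here is a minimal D sketch. It is not the code from the std.simd fork: the names loadUnalignedSketch and xorSketch are hypothetical, and it assumes an x86-64 compiler where core.simd provides float4 and int4. The first function shows a vector type passed as the first template parameter, since it cannot be inferred from a scalar pointer. The second shows the cast-xor-cast pattern, which has the same shape as the bitcast/xor/bitcast sequence in the clang output above.

```d
import core.simd;

// Hypothetical stand-in for the reworked API: the vector type V is the first
// template parameter because it cannot be inferred from the scalar pointer.
V loadUnalignedSketch(V)(const(typeof(V.init.array[0]))* p)
{
    V v;
    // Element-wise copy; whether this becomes a single unaligned load
    // is left to the optimizer.
    foreach (i; 0 .. v.array.length)
        v.array[i] = p[i];
    return v;
}

// Cast-xor-cast: reinterpret the float lanes as int lanes, xor them,
// and reinterpret the result as floats again.
float4 xorSketch(float4 a, float4 b)
{
    return cast(float4)(cast(int4) a ^ cast(int4) b);
}

unittest
{
    float[5] data = [0.0f, 1.0f, -2.0f, 3.0f, -4.0f];
    // data.ptr + 1 is not guaranteed to be 16-byte aligned.
    auto v = loadUnalignedSketch!float4(data.ptr + 1);
    float4 signBits = -0.0f;   // broadcast: only the sign bit is set in each lane
    auto flipped = xorSketch(v, signBits);
    assert(flipped.array == [-1.0f, 2.0f, -3.0f, 4.0f]);
}
```

Whether either function ends up as a single instruction depends on the compiler and optimization level, so the generated code would still need to be inspected per target.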
Oct 15 2012
On 15 October 2012 16:34, jerro <a a.com> wrote:
AFAICT, the only way to ensure that a certain instruction will be used with LDC when there is no LLVM intrinsic for it is to use inline assembly expressions. I remember having some problems with those in the past, but it could be that I was doing something wrong. Maybe we should look into that option too.
Inline assembly usually ruins optimising (code reordering around inline asm blocks is usually considered impossible).
It's interesting that the x86 codegen makes such good sense of those sequences, but I'm rather more concerned about other platforms. I wonder if other platforms have a similarly incomplete subset of intrinsics? :/
Oct 15 2012
On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:
Inline assembly usually ruins optimising (code reordering around inline asm blocks is usually considered impossible).
I don't see a reason why the compiler couldn't reorder code around GCC style inline assembly blocks. You are supposed to specify which registers are changed in the block. Doesn't that give the compiler enough information to reorder code?
It's interesting that the x86 codegen makes such good sense of those sequences, but I'm rather more concerned about other platforms. I wonder if other platforms have a similarly incomplete subset of intrinsics? :/
It looks to me like LLVM does provide intrinsics for those operations that can't be expressed in other ways. So my guess is that if some intrinsics are absolutely needed for some platform, they will probably be there. If an intrinsic is needed, I also don't see a reason why they wouldn't accept a patch that adds it.
Oct 15 2012
On 15 October 2012 17:07, jerro <a a.com> wrote:
I don't see a reason why the compiler couldn't reorder code around GCC style inline assembly blocks. You are supposed to specify which registers are changed in the block. Doesn't that give the compiler enough information to reorder code?
Not necessarily. If you affect various flags registers or whatever, or do direct memory access, you might violate its assumptions about the state of memory/stack.
I don't think I've come in contact with any compilers that aren't super-conservative about this sort of thing.
It's interesting that the x86 codegen makes such good sense of those sequences, but I'm rather more concerned about other platforms. I wonder if other platforms have a similarly incomplete subset of intrinsics? :/
It looks to me like LLVM does provide intrinsics for those operations that can't be expressed in other ways. So my guess is that if some intrinsics are absolutely needed for some platform, they will probably be there. If an intrinsic is needed, I also don't see a reason why they wouldn't accept a patch that adds it.
Fair enough. Interesting to know. This means that cross-platform LDC SIMD code will need to be thoroughly scrutinised for codegen quality in all targets.
Oct 15 2012









Denis Shelomovskij <verylonglogin.reg gmail.com> 