
digitalmars.D - SIMD ideas for Rust

"bearophile" <bearophileHUGS lycos.com> writes:
A small blog post about SIMD ideas for the Rust language:

http://blog.aventine.se/post/55669497784/my-vision-for-rust-simd

Some info on one of the discussed intrinsics, _mm_mul_epu32:
http://msdn.microsoft.com/en-us/library/f3d9e6fk%28v=vs.90%29.aspx

Bye,
bearophile
Jul 17 2013
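For readers who do not want to follow the MSDN link, here is a minimal scalar sketch in D of what _mm_mul_epu32 (the PMULUDQ instruction) computes: it multiplies the unsigned 32-bit elements 0 and 2 of each operand and yields two full 64-bit products. The function name mulEpu32Model is made up for this illustration; it is a plain-array model, not an intrinsic binding.

```d
/// Scalar model of _mm_mul_epu32 (PMULUDQ). Each operand is viewed as
/// four unsigned 32-bit lanes; only lanes 0 and 2 take part, and each
/// product is kept at full 64-bit width. Lanes 1 and 3 are ignored.
/// (mulEpu32Model is a hypothetical name used only for this sketch.)
ulong[2] mulEpu32Model(in uint[4] a, in uint[4] b)
{
    ulong[2] r;
    r[0] = cast(ulong) a[0] * b[0]; // widen first, so nothing is truncated
    r[1] = cast(ulong) a[2] * b[2];
    return r;
}

void main()
{
    uint[4] a = [0xFFFF_FFFFu, 7, 3, 9];
    uint[4] b = [2, 11, 5, 13];
    auto r = mulEpu32Model(a, b);
    assert(r[0] == 8_589_934_590UL); // 0xFFFFFFFF * 2 does not fit in 32 bits
    assert(r[1] == 15);              // lanes 1 and 3 (7*11, 9*13) are unused
}
```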
Manu <turkeyman gmail.com> writes:

Interesting. Almost all his points are what we do already in D.
Always nice to see others come to the same conclusions :)


On 18 July 2013 11:15, bearophile <bearophileHUGS lycos.com> wrote:

 A small blog post about SIMD ideas for Rust language:

 http://blog.aventine.se/post/55669497784/my-vision-for-rust-simd

 Some info on one of the discussed intrinsics, _mm_mul_epu32:
 http://msdn.microsoft.com/en-us/library/f3d9e6fk%28v=vs.90%29.aspx

 Bye,
 bearophile

Jul 18 2013
"bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 Interesting. Almost all his points are what we do already in D.
 Always nice to see others come to the same conclusions :)

There is also some discussion here:
http://www.reddit.com/r/rust/comments/1igvye/vision_for_rust_simd/

Regarding SIMD, a few days ago I added some notes to this thread that are significant:
http://forum.dlang.org/thread/kpt0ja$1fk6$1 digitalmars.com?page=2

Bye,
bearophile
Jul 18 2013
"bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 Interesting. Almost all his points are what we do already in D.
 Always nice to see others come to the same conclusions :)

While trying to write a multiplication of two complex numbers using SSE3 with LDC2 I found about seven or more bugs, which I will discuss elsewhere.

But regarding the syntax, in nice code like this D requires adding ".array" before all those subscripts (code adapted from Fog):

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b.array[1], b.array[0]];
    double2 a_im = [a.array[1], a.array[1]];
    double2 a_re = [a.array[0], a.array[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
}

A line like this:

double2 b_flip = [b.array[1], b.array[0]];

becomes something like:

pshufd   $238,  %xmm1, %xmm3

Similarly, all the other lines become single instructions (except the last one, because LDC2 fails to use addsubpd).

I vaguely remember you saying that slow SIMD operations shouldn't have too short a syntax, to avoid giving an illusion of efficiency. But given that "often" the CPU executes such array subscripting and shuffling efficiently, isn't it nicer/enough to support a simpler syntax like this in D?

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b[1], b[0]];
    double2 a_im = [a[1], a[1]];
    double2 a_re = [a[0], a[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb[0] - aib[0], arb[1] + aib[1]];
}

Bye,
bearophile
Jul 19 2013
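To make the lane arithmetic in the post above easier to check, here is a small scalar sketch in D that follows the same steps with a plain double[2] standing in for double2 (index 0 is the real part, index 1 the imaginary part). The name complexMultModel is an assumption made for this illustration, and the code says nothing about what LDC2 actually generates.

```d
/// Lane-by-lane model of the complexMult routine from the post, using
/// double[2] instead of double2. (complexMultModel is a hypothetical
/// name; index 0 = real part, index 1 = imaginary part.)
double[2] complexMultModel(in double[2] a, in double[2] b)
{
    const double[2] bFlip = [b[1], b[0]]; // swap real and imaginary of b
    const double[2] aIm = [a[1], a[1]];   // broadcast imaginary part of a
    const double[2] aRe = [a[0], a[0]];   // broadcast real part of a

    double[2] aib, arb;
    foreach (i; 0 .. 2)
    {
        aib[i] = aIm[i] * bFlip[i]; // [a.im*b.im, a.im*b.re]
        arb[i] = aRe[i] * b[i];     // [a.re*b.re, a.re*b.im]
    }

    // Subtract in lane 0, add in lane 1: the pattern ADDSUBPD implements.
    double[2] r = [arb[0] - aib[0], arb[1] + aib[1]];
    return r;
}

void main()
{
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    double[2] a = [1.0, 2.0];
    double[2] b = [3.0, 4.0];
    auto r = complexMultModel(a, b);
    assert(r[0] == -5.0);
    assert(r[1] == 10.0);
}
```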
Manu <turkeyman gmail.com> writes:

On 19 July 2013 19:33, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

  Interesting. Almost all his points are what we do already in D.
 Always nice to see others come to the same conclusions :)

 While trying to write a multiplication of two complex numbers using SSE3 with LDC2 I have found about seven or more bugs, that I will discuss elsewhere.

 But regarding the syntax, in nice code like this D requires to add ".array" before all those subscripts (code adapted from Fog):

 double2 complexMult(in double2 a, in double2 b) pure nothrow {
     double2 b_flip = [b.array[1], b.array[0]];
     double2 a_im = [a.array[1], a.array[1]];
     double2 a_re = [a.array[0], a.array[0]];
     double2 aib = a_im * b_flip;
     double2 arb = a_re * b;
     return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
 }

 A line like this:

 double2 b_flip = [b.array[1], b.array[0]];

 becomes something like:

 pshufd   $238,  %xmm1, %xmm3

 Similarly all the other lines become single instructions (but the last one, because LDC2 misses to use a addsubpd).

 I vaguely remember you saying that slow SIMD operations shouldn't have a too much short syntax to avoid giving an illusion of efficiency. But given that "often" the CPU executes such array subscripting and shuffling efficiently, isn't it nicer/enough to support a simpler syntax like this in D?

 double2 complexMult(in double2 a, in double2 b) pure nothrow {
     double2 b_flip = [b[1], b[0]];
     double2 a_im = [a[1], a[1]];
     double2 a_re = [a[0], a[0]];
     double2 aib = a_im * b_flip;
     double2 arb = a_re * b;
     return [arb[0] - aib[0], arb[1] + aib[1]];
 }

The point about eliminating the index operator is because it implies a vector->float cast.
You want to perform a shuffle(/swizzle), but you are only really performing the operation incidentally.
What you're really doing is casting a bunch of vector components to floats, and then rebuilding a vector, and LLVM can helpfully deal with that.

I would suggest calling a spade a spade and using a swizzle function to perform a swizzle, instead of code like what you wrote.
Wouldn't this be better:

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = b.yx; // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
    double2 a_im = a.yy;
    double2 a_re = a.xx;
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;

//    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

    // Maybe:
    return select([-1, 0], arb-aib, arb+aib);
    // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
}

I think that would be better. More portable, and it eliminates the code that implies a vector->float->vector cast sequence, which I maintain, should be syntactically discouraged at all costs.
You don't want to be giving people bad ideas that it's reasonable code to write ;)
Jul 19 2013
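As a rough idea of what a named swizzle such as b.swizzle!"yx" could look like, here is a minimal sketch in D over plain static arrays. Both swizzle and laneIndex are hypothetical helpers written for this note, not an existing library API, and a real implementation would operate on core.simd vector types and lower to a shuffle instruction rather than copy elements.

```d
/// Maps a lane letter to its index (x = 0, y = 1, z = 2, w = 3).
/// Hypothetical helper; evaluated at compile time by swizzle below.
size_t laneIndex(char c)
{
    switch (c)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default: assert(0, "unknown lane name");
    }
}

/// Hypothetical swizzle over a static array: the pattern string names,
/// in order, which source lane each result lane is copied from.
T[pattern.length] swizzle(string pattern, T, size_t N)(in T[N] v)
{
    T[pattern.length] r;
    static foreach (i, c; pattern)
    {{
        enum size_t lane = laneIndex(c);
        static assert(lane < N, "lane out of range for this vector");
        r[i] = v[lane];
    }}
    return r;
}

void main()
{
    double[2] b = [3.0, 4.0];
    double[2] bFlip = b.swizzle!"yx"; // swap the two lanes
    double[2] bIm = b.swizzle!"yy";   // broadcast lane 1
    assert(bFlip == [4.0, 3.0]);
    assert(bIm == [4.0, 4.0]);
}
```

Because the pattern is a template argument, the lane choice is fixed at compile time, which matches the point made above: the operation is stated as a shuffle instead of a chain of element reads and a rebuild.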
"bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 What you're really doing is casting a bunch of vector components to floats,
 and then rebuilding a vector, and LLVM can helpfully deal with that.

 I would suggest calling a spade a spade and using a swizzle function to
 perform a swizzle, instead of code like what you wrote.
 Wouldn't this be better:

 double2 complexMult(in double2 a, in double2 b) pure nothrow {
     double2 b_flip = b.yx; // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
     double2 a_im = a.yy;
     double2 a_re = a.xx;
     double2 aib = a_im * b_flip;
     double2 arb = a_re * b;

I see and you are right.

(If I turn the basic type into a struct containing a double2 with alias this to the whole structure, the generated code becomes awful.)

A YMM register already contains 8 floats, and SIMD registers will probably keep growing, maybe to become 1024 bits long. So swizzle item names like x y z w will not suffice, and some more general naming scheme is needed.

 //    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

     // Maybe:
     return select([-1, 0], arb-aib, arb+aib);
     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
 }

 I think that would be better. More portable, and it eliminates the code
 that implies a vector->float->vector cast sequence, which I maintain,
 should be syntactically discouraged at all costs.
 You don't want to be giving people bad ideas that it's reasonable code to
 write ;)

My experience in writing this kind of code is limited. I will try your select to see what kind of code LDC2-LLVM generates.

Bye,
bearophile
Jul 19 2013
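One possible shape for a more general naming scheme, once x y z w run out, is to pass the source lane numbers as compile-time integers. The sketch below is in D over plain static arrays; shuffle is a hypothetical name chosen for this illustration and is not an existing core.simd function.

```d
/// Hypothetical index-based shuffle: result lane i is copied from
/// source lane lanes[i]. The indices are template arguments, so the
/// pattern is fixed at compile time for any vector width.
template shuffle(lanes...)
{
    T[lanes.length] shuffle(T, size_t N)(in T[N] v)
    {
        T[lanes.length] r;
        static foreach (i, lane; lanes)
        {
            static assert(lane < N, "lane index out of range");
            r[i] = v[lane];
        }
        return r;
    }
}

void main()
{
    // Eight lanes, as in a YMM register holding floats.
    float[8] v = [0, 1, 2, 3, 4, 5, 6, 7];
    float[8] reversed = v.shuffle!(7, 6, 5, 4, 3, 2, 1, 0);
    float[8] expected = [7, 6, 5, 4, 3, 2, 1, 0];
    assert(reversed == expected);
}
```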
Manu <turkeyman gmail.com> writes:

On 20 July 2013 03:43, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

  What you're really doing is casting a bunch of vector components to
 floats,
 and then rebuilding a vector, and LLVM can helpfully deal with that.

 I would suggest calling a spade a spade and using a swizzle function to
 perform a swizzle, instead of code like what you wrote.
 Wouldn't this be better:

 double2 complexMult(in double2 a, in double2 b) pure nothrow {
     double2 b_flip = b.yx; // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
     double2 a_im = a.yy;
     double2 a_re = a.xx;
     double2 aib = a_im * b_flip;
     double2 arb = a_re * b;

 I see and you are right.

 (If I turn the basic type into a struct containing a double2 aliased-this to the whole structure, the generated code becomes awful).

 A YMM that already contains 8 floats, and probably SIMD registers will keep growing, maybe to become 1024 bits long. So the swizzle item names like x y z w will not suffice and some more general naming scheme is needed.

Swizzling bytes already has that problem. Hexadecimal swizzle strings work nicely up to 16 elements, but past that, I'd probably require the template receive a tuple of ints.
These are trivial details. .xyzw are particularly useful for 2-4d vectors. They can be removed for anything higher. The nicest/most preferred interface can be decided with experience.
As yet there's not a lot of practical experience with >128bit registers, and the sorts of patterns that appear frequently.

 //    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

     // Maybe:
     return select([-1, 0], arb-aib, arb+aib);
     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
 }

 I think that would be better. More portable, and it eliminates the code
 that implies a vector->float->vector cast sequence, which I maintain,
 should be syntactically discouraged at all costs.
 You don't want to be giving people bad ideas that it's reasonable code to
 write ;)

My experience in writing such kind of code is limited. I will try your select to see what kind of code LDC2-LLVM generates.

It probably won't be good because I haven't paid attention to how it optimises on SSE yet.
You need to encourage the compiler to generate ADDSUBPD for SSE, and any (or none) of the possible expressions may result in it choosing the proper opcode.
I'm apprehensive to add a helper function for that operation, since it's dreadfully SSE-specific. It's the sort of thing where you might rather carefully make sure the standard API will reliably encourage the optimiser to do it.
If you can find a pattern of operations that optimises to ADDSUBPD, I'm interested to know what the sequence(/s) are.
If not, we'll consider an explicit function. It can be emulated within reason on other architectures, but I think it would be better to work a different solution though. Ie, perform 2 (or 4) side by side (stream processing)... That will work well on all architectures.
Jul 19 2013
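The "side by side" idea at the end of the post above can be sketched as follows, under the assumption that it means keeping real and imaginary parts in separate vectors so that two products are computed at once. The Complex2 struct and complexMult2 function are names invented for this illustration; D array operations on double[2] stand in for whole-vector SIMD operations.

```d
/// Two complex numbers stored with real parts and imaginary parts in
/// separate lanes ("structure of arrays"). Hypothetical type for this
/// sketch: number k is re[k] + i*im[k].
struct Complex2
{
    double[2] re;
    double[2] im;
}

/// Multiplies two pairs of complex numbers at once. Every statement is
/// a whole-vector multiply, add or subtract, so no lane shuffling and
/// no mixed add/subtract (ADDSUBPD-style) step is needed.
Complex2 complexMult2(in Complex2 a, in Complex2 b)
{
    Complex2 r;
    r.re[] = a.re[] * b.re[] - a.im[] * b.im[];
    r.im[] = a.re[] * b.im[] + a.im[] * b.re[];
    return r;
}

void main()
{
    // Pair 0: (1 + 2i) * (3 + 4i) = -5 + 10i
    // Pair 1: (0 + 1i) * (0 + 1i) = -1 + 0i
    auto a = Complex2([1.0, 0.0], [2.0, 1.0]);
    auto b = Complex2([3.0, 0.0], [4.0, 1.0]);
    auto r = complexMult2(a, b);
    assert(r.re == [-5.0, -1.0]);
    assert(r.im == [10.0, 0.0]);
}
```

The trade-off is in the data layout: inputs have to arrive already split into real and imaginary streams, which is why this suits batch (stream) processing better than a single isolated multiplication.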
"bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 //    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

     // Maybe:
     return select([-1, 0], arb-aib, arb+aib);
     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
 }

 I think that would be better. More portable, and it eliminates the code
 that implies a vector->float->vector cast sequence, which I maintain,
 should be syntactically discouraged at all costs.
 You don't want to be giving people bad ideas that it's reasonable code to
 write ;)

Apparently that select() uses __builtin_ia32_pblendvb128, which is an SSE4.1 instruction. At the moment I have only SSE3 :-(

Bye,
bearophile
Jul 23 2013
Manu <turkeyman gmail.com> writes:

On 23 July 2013 17:05, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


 //    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

     // Maybe:
     return select([-1, 0], arb-aib, arb+aib);
     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
 }

 I think that would be better. More portable, and it eliminates the code
 that implies a vector->float->vector cast sequence, which I maintain,
 should be syntactically discouraged at all costs.
 You don't want to be giving people bad ideas that it's reasonable code to
 write ;)

Apparently that select() uses __builtin_ia32_pblendvb128, that is a SSE4.1 instruction. At the moment I have only SSE3 :-(

It's probably better to use a shuf, or a shift for compatibility, since the selection predicate is constant anyway.
Jul 23 2013
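To spell out why a constant predicate makes the blend avoidable, here is a scalar sketch in D. It assumes select(mask, x, y) takes x in lanes where the mask is non-zero and y elsewhere, which is how the call in the thread reads; selectModel and addSubModel are hypothetical names. The second function is a different technique from the shuffle or shift suggested above: it negates lane 0 of one operand and then does a plain add, which needs no SSE4.1 blend.

```d
/// Scalar model of the select() used in the thread, assuming that a
/// non-zero mask lane picks x and a zero mask lane picks y.
double[2] selectModel(in long[2] mask, in double[2] x, in double[2] y)
{
    double[2] r;
    foreach (i; 0 .. 2)
        r[i] = mask[i] != 0 ? x[i] : y[i];
    return r;
}

/// Alternative without any select: flip the sign of lane 0 of aib and
/// add. This gives [arb[0] - aib[0], arb[1] + aib[1]], the same result
/// ADDSUBPD would produce.
double[2] addSubModel(in double[2] arb, in double[2] aib)
{
    const double[2] flipped = [-aib[0], aib[1]];
    double[2] r = [arb[0] + flipped[0], arb[1] + flipped[1]];
    return r;
}

void main()
{
    double[2] arb = [3.0, 4.0];
    double[2] aib = [8.0, 6.0];
    double[2] diff = [arb[0] - aib[0], arb[1] - aib[1]];
    double[2] sum = [arb[0] + aib[0], arb[1] + aib[1]];
    long[2] mask = [-1, 0];

    auto viaSelect = selectModel(mask, diff, sum);
    auto viaSignFlip = addSubModel(arb, aib);
    assert(viaSelect == [-5.0, 10.0]);
    assert(viaSignFlip == viaSelect);
}
```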