
digitalmars.D.announce - DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Apologies for the delay, we're moving and things are a bit hectic.

reddit: 
http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/

twitter: https://twitter.com/D_Programming/status/347433981928693760

hackernews: https://news.ycombinator.com/item?id=5907624

facebook: https://www.facebook.com/dlang.org/posts/659747567372261

youtube: http://youtube.com/watch?v=q_39RnxtkgM


Andrei
Jun 19 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 http://youtube.com/watch?v=q_39RnxtkgM
Very nice.

- - - - - - - - - - - - - - - - - - -

Slide 3:
 In practice, say we have iterative code like this:
 
 int data[100];
 
 for(int i = 0; i < data.length; ++i) {
   data[i] += 10; }
For code like that in D we have vector ops:

int[100] data;
data[] += 10;

Regarding vector ops: currently they are written with handwritten asm that
uses SIMD where possible. Once std.simd is in good shape I think the array
ops can be rewritten (and completed in their missing parts) using a
higher-level style of coding.

- - - - - - - - - - - - - - - - - - -

Slide 22:
 Comparisons:
 Full suite of comparisons Can produce bit-masks, or boolean 
 'any'/'all' logic.
Maybe a little compiler support (for the syntax) will help here.

- - - - - - - - - - - - - - - - - - -

Slide 26:
 Always pass vectors by value.
Unfortunately it seems like a bad idea to give a warning if you pass one of
those by reference.

- - - - - - - - - - - - - - - - - - -

Slide 27:
 3. Use ‘leaf’ functions where possible.
I am not sure how much good it would do to enforce leaf functions with a
'leaf' annotation.

- - - - - - - - - - - - - - - - - - -

Slide 32:
 Experiment with prefetching?
Are D intrinsics offering instructions to perform prefetching?

- - - - - - - - - - - - - - - - - - -

LDC2 supports SIMD on Windows32 too. So for this code:

void main() {
    alias double2 = __vector(double[2]);
    auto a = new double[200];
    auto b = cast(double2[])a;
    double2 tens = [10.0, 10.0];
    b[] += tens;
}

LDC2 compiles it to:

        movl    $200, 4(%esp)
        movl    $__D11TypeInfo_Ad6__initZ, (%esp)
        calll   __d_newarrayiT
        movl    %edx, %esi
        movl    %eax, (%esp)
        movl    $16, 8(%esp)
        movl    $8, 4(%esp)
        calll   __d_array_cast_len
        testl   %eax, %eax
        je      LBB0_3
        movapd  LCPI0_0, %xmm0
        .align  16, 0x90
LBB0_2:
        movapd  (%esi), %xmm1
        addpd   %xmm0, %xmm1
        movapd  %xmm1, (%esi)
        addl    $16, %esi
        decl    %eax
        jne     LBB0_2
LBB0_3:
        xorl    %eax, %eax
        addl    $12, %esp
        popl    %esi
        ret

It uses addpd, which works on two doubles at the same time.

- - - - - - - - - - - - - - - - - - -

The Reddit thread contains a link to this page, a compiler for a C variant
from Intel that's optimized for SIMD:
http://ispc.github.io/

Some of the syntax of ispc:

- - - - - -

The first of these statements is cif, indicating an if statement that is
expected to be coherent. The usage of cif in code is just the same as if:

cif (x < y) {
    ...
} else {
    ...
}

cif provides a hint to the compiler that you expect that most of the
executing SPMD programs will all have the same result for the if condition.
Along similar lines, cfor, cdo, and cwhile check to see if all program
instances are running at the start of each loop iteration; if so, they can
run a specialized code path that has been optimized for the "all on"
execution mask case.

- - - - - -

foreach_tiled (y = y0 ... y1, x = 0 ... w,
               u = 0 ... nsubsamples, v = 0 ... nsubsamples) {
    float du = (float)u * invSamples, dv = (float)v * invSamples;

- - - - - -

I'll take a better look at ispc.

Bye,
bearophile
Jun 20 2013
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 20 June 2013 21:58, bearophile <bearophileHUGS lycos.com> wrote:

 Andrei Alexandrescu:

  http://youtube.com/watch?v=q_39RnxtkgM

 Very nice.

 - - - - - - - - - - - - - - - - - - -

 Slide 3:

 In practice, say we have iterative code like this:
 int data[100];

 for(int i = 0; i < data.length; ++i) {
   data[i] += 10; }

 For code like that in D we have vector ops:

 int[100] data;
 data[] += 10;

 Regarding vector ops: currently they are written with handwritten asm that
 uses SIMD where possible. Once std.simd is in good shape I think the array
 ops can be rewritten (and completed in their missing parts) using a
 higher-level style of coding.
I was trying to illustrate a process, not so much a comment on D array
syntax.

The problem with auto-simd applied to array operations is that D doesn't
assert that arrays are aligned, nor that they are multiples of 'N' elements
wide, which means they lose the opportunity to make a lot of assumptions
that make the biggest performance difference. They must be aligned, and
multiples of N elements.

By using explicit SIMD types, you're forced to adhere to those rules as a
programmer, and the compiler can optimise properly. You take on the
responsibility to handle mis-alignment and stragglers as the programmer,
and perhaps make less conservative choices.

- - - - - - - - - - - - - - - - - - -
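[To make the head/straggler handling concrete, here is a minimal C sketch of the pattern being discussed, using SSE2 intrinsics; the helper name add10 is made up for illustration. Scalar head until the pointer is 16-byte aligned, an aligned SIMD body, and a scalar tail for the stragglers:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add 10 to every element of an int array of arbitrary length and
   alignment: scalar head, aligned SIMD body, scalar tail. */
void add10(int *data, size_t n) {
    size_t i = 0;

    /* scalar head: advance until the pointer is 16-byte aligned */
    while (i < n && ((uintptr_t)(data + i) & 15) != 0)
        data[i++] += 10;

    /* aligned SIMD body: 4 ints per iteration (paddd) */
    __m128i tens = _mm_set1_epi32(10);
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_load_si128((__m128i *)(data + i));
        _mm_store_si128((__m128i *)(data + i), _mm_add_epi32(v, tens));
    }

    /* scalar tail for the stragglers */
    for (; i < n; ++i)
        data[i] += 10;
}
```

[Note how both the head and tail loops add branches whose trip counts depend on the data pointer and length — exactly the unpredictability Manu's later slides discuss.]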
 Slide 22:

  Comparisons:
 Full suite of comparisons Can produce bit-masks, or boolean 'any'/'all'
 logic.
Maybe a little of compiler support (for the syntax) will help here.
Well, each is a valid comparison in different situations. I'm not sure how
syntax could clearly select the one you want.

- - - - - - - - - - - - - - - - - - -
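[Both result styles from the slide can be seen with SSE intrinsics in C; a sketch, with the helper names any_lt/all_lt made up for illustration: _mm_cmplt_ps produces the per-lane bit-mask, and _mm_movemask_ps reduces it for boolean 'any'/'all' logic:]

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* 'any': at least one lane of a is less than the matching lane of b.
   cmplt yields all-ones/all-zeros per lane; movemask packs the four
   lane sign bits into a 4-bit integer mask. */
int any_lt(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b)) != 0;
}

/* 'all': every lane of a is less than the matching lane of b. */
int all_lt(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b)) == 0xF;
}
```

[The same comparison instruction serves both uses; only the reduction of the mask differs, which is why it is hard for one syntax to select the intended variant.]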
 Slide 26:

  Always pass vectors by value.

 Unfortunately it seems a bad idea to give a warning if you pass one of
 those by reference.
And I don't think it should. Passing by ref isn't 'wrong', you just
shouldn't do it if you care about performance.

- - - - - - - - - - - - - - - - - - -
 Slide 27:

 3. Use 'leaf' functions where possible.

 I am not sure how much good it is to enforce leaf functions with a  leaf
 annotation.
I don't think it would be useful. It should only be considered a general
rule when people are very specifically considering performance above all
else. It's just a very important detail to be aware of when optimising your
code, particularly so when you're dealing with maths code (often involving
simd).

- - - - - - - - - - - - - - - - - - -
 Slide 32:

  Experiment with prefetching?

 Are D intrinsics offering instructions to perform prefetching?
Well, GCC does at least. If you're worried about performance at this level,
you're probably already using GCC :)

- - - - - - - - - - - - - - - - - - -
 LDC2 supports SIMD on Windows32 too.

 So for this code:


 void main() {
     alias double2 = __vector(double[2]);
     auto a = new double[200];
     auto b = cast(double2[])a;
     double2 tens = [10.0, 10.0];
     b[] += tens;
 }


 LDC2 compiles it to:

         movl    $200, 4(%esp)
         movl    $__D11TypeInfo_Ad6__initZ, (%esp)
         calll   __d_newarrayiT
         movl    %edx, %esi
         movl    %eax, (%esp)
         movl    $16, 8(%esp)
         movl    $8, 4(%esp)
         calll   __d_array_cast_len
         testl   %eax, %eax
         je      LBB0_3
         movapd  LCPI0_0, %xmm0
         .align  16, 0x90
 LBB0_2:
         movapd  (%esi), %xmm1
         addpd   %xmm0, %xmm1
         movapd  %xmm1, (%esi)
         addl    $16, %esi
         decl    %eax
         jne     LBB0_2
 LBB0_3:
         xorl    %eax, %eax
         addl    $12, %esp
         popl    %esi
         ret


 It uses addpd that works with two doubles at the same time.
Sure... did I say this wasn't supported somewhere? Sorry if I gave that
impression.

- - - - - - - - - - - - - - - - - - -
 The Reddit thread contains a link to this page, a compiler for a C variant
 from Intel that's optimized for SIMD:
 http://ispc.github.io/

 Some of the syntax of ispc:

 - - - - - -

 The first of these statements is cif, indicating an if statement that is
 expected to be coherent. The usage of cif in code is just the same as if:

 cif (x < y) {
     ...
 } else {
     ...
 }

 cif provides a hint to the compiler that you expect that most of the
 executing SPMD programs will all have the same result for the if condition.
 Along similar lines, cfor, cdo, and cwhile check to see if all program
 instances are running at the start of each loop iteration; if so, they can
 run a specialized code path that has been optimized for the "all on"
 execution mask case.
This is interesting. I didn't know about this.
Jun 20 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 They must be aligned, and multiples of N elements.
The D GC currently allocates them 16-bytes aligned (but if you slice the
array you can lose some alignment). On some new CPUs the penalty for
misalignment is small.

You often have "n" values, where n is variable. If n is large enough and
you are using D vector ops, the handling of the head and tail doesn't waste
too much time. If you have very few values it's much better to use the SIMD
code.
 Well, each are valid comparisons in different situations. I'm 
 not sure how syntax could clearly select the one you want.
Maybe later we'll look for some syntax sugar for this.
 Are D intrinsics offering instructions to perform prefetching?
 Well, GCC does at least. If you're worried about performance at this
 level, you're probably already using GCC :)
I think D SIMD programmers will expect something functionally like
__builtin_prefetch to be available in D too:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396

Thank you, bye,
bearophile
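[As a concrete reference point, this is what the GCC/Clang builtin looks like in C; a minimal sketch, where sum_prefetched and the lookahead distance of 16 elements are made-up illustrations rather than tuned values:]

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while prefetching ahead of the current position.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin and is
   only a hint: the code is correct even if the hint is ignored. */
long sum_prefetched(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/3);
        total += data[i];
    }
    return total;
}
```

[For a tiny linear scan like this the hardware prefetcher usually wins anyway; explicit prefetching tends to pay off on strided or pointer-chasing access patterns.]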
Jun 20 2013
parent Manu <turkeyman gmail.com> writes:
On 21 June 2013 00:03, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  They must be aligned, and multiples of N elements.

 The D GC currently allocates them 16-bytes aligned (but if you slice the
 array you can lose some alignment). On some new CPUs the penalty for
 misalignment is small.
Yes, the GC allocates 16-byte aligned memory, this is good. It's critical
actually. But if the data types themselves weren't aligned, then the alloc
alignment would be lost as soon as they were used in structs.

You'll notice I made a point of focusing on _portable_ simd. It's true,
some new chips can deal with it at virtually no additional cost, but they
lose nothing by aligning their data regardless, and you can run on
anything. I hope that people write libraries that can run well on anything,
not just their architecture of choice. The guidelines I presented, if
followed, will give you good performance on all architectures. They're not
even very inconvenient.

If your point is about auto-vectorisation being much simpler without the
alignment restrictions, this is true. But again, I'm talking about portable
and RELIABLE implementations, that is, the programmer should know that SIMD
was used effectively, and not have to hope the optimiser was able to do a
good job. Make these guidelines second nature, and you'll foster a habit of
writing portable code even if you don't intend to do so personally. Someone
somewhere may want to use your library...

 You often have "n" values, where n is variable. If n is large enough and
 you are using D vector ops, the handling of the head and tail doesn't
 waste too much time. If you have very few values it's much better to use
 the SIMD code.

See my later slides about branch predictability. When you need to handle
stragglers on the head or tail, then you've introduced 2 sources of
unpredictability (and also bloated your code). If the arrays are very long,
this may be okay as you say, but if they're not it becomes significant.

But there is a new issue that appears; if the output array is not the same
as the input array, then you have a new mis-alignment where the bases of
the 2 arrays might not share the same alignment, and you can't do a simd
load from one and store to the other without a series of corrective shifts
and merges, which will effectively result in similar code to my un-aligned
load demonstration.

So the case where this is reliable is:
* long data array
* output array is the same as the input array (overwrites the input?)

I don't consider that reliable, and I don't think special-case awareness of
those criteria is any easier than carefully/deliberately using SIMD in the
first place.

 Well, each are valid comparisons in different situations. I'm not sure how
 syntax could clearly select the one you want.
Maybe later we'll look for some syntax sugar for this.
I'm definitely curious... but I'm not sure it's necessary.

 Are D intrinsics offering instructions to perform prefetching?

 Well, GCC does at least. If you're worried about performance at this
 level, you're probably already using GCC :)
 I think D SIMD programmers will expect something functionally like
 __builtin_prefetch to be available in D too:
 http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396
Yup, I toyed with the idea of adding it to std.simd, but I didn't think it fit there.
Jun 20 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 This is interesting. I didn't know about this.
An important thing here is: what's the semantics present in that language
that is missing in D (and that is useful for the optimizer)? Is it
possible/worthwhile to add it?

Bye,
bearophile
Jun 23 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 This is interesting. I didn't know about this.
I have taken a look at this page:
https://github.com/ispc/ispc

There is a free compiler binary for various operating systems:
http://ispc.github.io/downloads.html

I have tried the Windows compiler on some of the given examples of code,
and it works! And the resulting asm is excellent. Normal compilers for the
usual languages aren't able to produce asm anywhere near as good.

To try the code of the examples I compile like this:

ispc.exe --emit-asm stencil.ispc -o stencil.s

Or even like this to see the AVX2 asm instructions:

ispc.exe --target=avx2 --emit-asm stencil.ispc -o stencil.s

As an example it compiles a function like this:

static void stencil_step(uniform int x0, uniform int x1,
                         uniform int y0, uniform int y1,
                         uniform int z0, uniform int z1,
                         uniform int Nx, uniform int Ny, uniform int Nz,
                         uniform const float coef[4],
                         uniform const float vsq[],
                         uniform const float Ain[], uniform float Aout[]) {
    const uniform int Nxy = Nx * Ny;

    foreach (z = z0 ... z1, y = y0 ... y1, x = x0 ... x1) {
        int index = (z * Nxy) + (y * Nx) + x;
#define A_cur(x, y, z) Ain[index + (x) + ((y) * Nx) + ((z) * Nxy)]
#define A_next(x, y, z) Aout[index + (x) + ((y) * Nx) + ((z) * Nxy)]
        float div = coef[0] * A_cur(0, 0, 0) +
                    coef[1] * (A_cur(+1, 0, 0) + A_cur(-1, 0, 0) +
                               A_cur(0, +1, 0) + A_cur(0, -1, 0) +
                               A_cur(0, 0, +1) + A_cur(0, 0, -1)) +
                    coef[2] * (A_cur(+2, 0, 0) + A_cur(-2, 0, 0) +
                               A_cur(0, +2, 0) + A_cur(0, -2, 0) +
                               A_cur(0, 0, +2) + A_cur(0, 0, -2)) +
                    coef[3] * (A_cur(+3, 0, 0) + A_cur(-3, 0, 0) +
                               A_cur(0, +3, 0) + A_cur(0, -3, 0) +
                               A_cur(0, 0, +3) + A_cur(0, 0, -3));

        A_next(0, 0, 0) = 2 * A_cur(0, 0, 0) - A_next(0, 0, 0) +
                          vsq[index] * div;
    }
}

To asm like (using SSE4):

        pshufd  $0, %xmm7, %xmm9        # xmm9 = xmm7[0,0,0,0]
        cmpl    156(%rsp), %r8d         # 4-byte Folded Reload
        movups  (%r14,%rdi), %xmm0
        mulps   %xmm6, %xmm5
        pshufd  $0, %xmm2, %xmm2        # xmm2 = xmm2[0,0,0,0]
        movq    424(%rsp), %rdx
        movups  (%rdx,%r15), %xmm6
        mulps   %xmm3, %xmm2
        pshufd  $0, %xmm10, %xmm3       # xmm3 = xmm10[0,0,0,0]
        mulps   %xmm11, %xmm3
        addps   %xmm2, %xmm3
        addps   %xmm5, %xmm3
        addps   %xmm4, %xmm0
        movslq  %eax, %rax
        movups  (%r14,%rax), %xmm2
        addps   %xmm0, %xmm2
        movslq  %ebp, %rax
        movups  (%r14,%rax), %xmm0
        addps   %xmm2, %xmm0
        mulps   %xmm9, %xmm0
        addps   %xmm3, %xmm0
        mulps   %xmm6, %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%r11,%r15)
        jl      .LBB0_5
.LBB0_6:                                # %partial_inner_all_outer
                                        # in Loop: Header=BB0_4 Depth=2
        movq    %r13, %rbp
        movl    156(%rsp), %r13d        # 4-byte Reload
        movq    424(%rsp), %r9
        cmpl    64(%rsp), %r8d          # 4-byte Folded Reload
        movq    %r11, %rbx
        jge     .LBB0_257
# BB#7:                                 # %partial_inner_only
                                        # in Loop: Header=BB0_4 Depth=2
        movd    %r8d, %xmm0
        movl    84(%rsp), %r15d         # 4-byte Reload
        imull   400(%rsp), %r15d
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_1(%rip), %xmm0
        movdqa  %xmm8, %xmm1
        pcmpgtd %xmm0, %xmm1
        movmskps %xmm1, %edi
        addl    68(%rsp), %r15d         # 4-byte Folded Reload
        leal    (%r8,%r15), %r10d

Or using AVX2 (this is not exactly the equivalent piece of code; the asm
generated with AVX2 is usually significantly shorter):

        movslq  %edx, %rdx
        vaddps  (%rax,%rdx), %ymm5, %ymm5
        movl    52(%rsp), %edx          # 4-byte Reload
        leal    (%rdx,%r15), %edx
        movslq  %edx, %rdx
        vaddps  (%rax,%rdx), %ymm5, %ymm5
        vaddps  %ymm11, %ymm6, %ymm9
        vaddps  %ymm10, %ymm8, %ymm10
        vmovups (%rax,%rdi), %ymm12
        addl    $32, %r15d
        vmovups (%r11,%rcx), %ymm6
        movq    400(%rsp), %rdx
        vbroadcastss 12(%rdx), %ymm7
        vbroadcastss 8(%rdx), %ymm8
        vbroadcastss (%rdx), %ymm11
        vaddps  %ymm12, %ymm10, %ymm10
        cmpl    48(%rsp), %r14d         # 4-byte Folded Reload
        vbroadcastss 4(%rdx), %ymm12
        vmulps  %ymm9, %ymm12, %ymm9
        vfmadd213ps %ymm9, %ymm3, %ymm11
        vfmadd213ps %ymm11, %ymm8, %ymm10
        vfmadd213ps %ymm10, %ymm5, %ymm7
        vfmadd213ps %ymm4, %ymm6, %ymm7
        vmovups %ymm7, (%r8,%rcx)
        jl      .LBB0_8
# BB#4:                                 # %partial_inner_all_outer.us
                                        # in Loop: Header=BB0_7 Depth=2
        cmpl    40(%rsp), %r14d         # 4-byte Folded Reload
        movq    400(%rsp), %r10
        jge     .LBB0_6
# BB#5:                                 # %partial_inner_only.us
                                        # in Loop: Header=BB0_7 Depth=2
        vmovd   %r14d, %xmm3
        vbroadcastss %xmm3, %ymm3
        movl    100(%rsp), %ecx         # 4-byte Reload
        leal    (%rcx,%r15), %ecx
        vpaddd  %ymm2, %ymm3, %ymm3

Sometimes it gives performance warnings:

rt.ispc:257:9: Performance Warning: Scatter required to store value.
        image[offset] = ray.maxt;
        ^^^^^^^^^^^^^
rt.ispc:258:9: Performance Warning: Scatter required to store value.
        id[offset] = ray.hitId;
        ^^^^^^^^^^

A somewhat larger example, with this function from a little ray-tracer:

static bool TriIntersect(const uniform Triangle &tri, Ray &ray) {
    uniform float3 p0 = { tri.p[0][0], tri.p[0][1], tri.p[0][2] };
    uniform float3 p1 = { tri.p[1][0], tri.p[1][1], tri.p[1][2] };
    uniform float3 p2 = { tri.p[2][0], tri.p[2][1], tri.p[2][2] };
    uniform float3 e1 = p1 - p0;
    uniform float3 e2 = p2 - p0;

    float3 s1 = Cross(ray.dir, e2);
    float divisor = Dot(s1, e1);
    bool hit = true;

    if (divisor == 0.)
        hit = false;
    float invDivisor = 1.f / divisor;

    // Compute first barycentric coordinate
    float3 d = ray.origin - p0;
    float b1 = Dot(d, s1) * invDivisor;
    if (b1 < 0. || b1 > 1.)
        hit = false;

    // Compute second barycentric coordinate
    float3 s2 = Cross(d, e1);
    float b2 = Dot(ray.dir, s2) * invDivisor;
    if (b2 < 0. || b1 + b2 > 1.)
        hit = false;

    // Compute _t_ to intersection point
    float t = Dot(e2, s2) * invDivisor;
    if (t < ray.mint || t > ray.maxt)
        hit = false;

    if (hit) {
        ray.maxt = t;
        ray.hitId = tri.id;
    }
    return hit;
}

The (more or less) complete asm with AVX2 for that function:

"TriIntersect___REFs[_c_unTriangle]REFs[vyRay]": # "TriIntersect___REFs[_c_unTriangle]REFs[vyRay]"
# BB#0:                                 # %allocas
        subq    $248, %rsp
        vmovaps %xmm15, 80(%rsp)        # 16-byte Spill
        vmovaps %xmm14, 96(%rsp)        # 16-byte Spill
        vmovaps %xmm13, 112(%rsp)       # 16-byte Spill
        vmovaps %xmm12, 128(%rsp)       # 16-byte Spill
        vmovaps %xmm11, 144(%rsp)       # 16-byte Spill
        vmovaps %xmm10, 160(%rsp)       # 16-byte Spill
        vmovaps %xmm9, 176(%rsp)        # 16-byte Spill
        vmovaps %xmm8, 192(%rsp)        # 16-byte Spill
        vmovaps %xmm7, 208(%rsp)        # 16-byte Spill
        vmovaps %xmm6, 224(%rsp)        # 16-byte Spill
        vmovss  (%rcx), %xmm0
        vmovss  16(%rcx), %xmm2
        vinsertps $16, 4(%rcx), %xmm0, %xmm0
        vinsertps $32, 8(%rcx), %xmm0, %xmm0
        vinsertf128 $1, %xmm0, %ymm0, %ymm10
        vmovaps (%rdx), %ymm0
        vmovaps 32(%rdx), %ymm3
        vmovaps 64(%rdx), %ymm1
        vmovaps 96(%rdx), %ymm9
        vpbroadcastd .LCPI1_0(%rip), %ymm12
        vxorps  %ymm4, %ymm4, %ymm4
        vpermps %ymm10, %ymm4, %ymm4
        vsubps  %ymm4, %ymm0, %ymm0
        vpermps %ymm10, %ymm12, %ymm4
        vsubps  %ymm4, %ymm3, %ymm13
        vmovups %ymm13, (%rsp)          # 32-byte Folded Spill
        vpbroadcastd .LCPI1_1(%rip), %ymm8
        vpermps %ymm10, %ymm8, %ymm3
        vsubps  %ymm3, %ymm1, %ymm6
        vmovss  32(%rcx), %xmm1
        vinsertps $16, 36(%rcx), %xmm1, %xmm1
        vinsertps $32, 40(%rcx), %xmm1, %xmm1
        vinsertf128 $1, %xmm0, %ymm1, %ymm1
        vsubps  %ymm10, %ymm1, %ymm15
        vbroadcastss %xmm15, %ymm3
        vmovups %ymm3, 32(%rsp)         # 32-byte Folded Spill
        vmovaps 128(%rdx), %ymm11
        vpermps %ymm15, %ymm12, %ymm5
        vmulps  %ymm3, %ymm11, %ymm1
        vmovaps %ymm3, %ymm4
        vmovaps %ymm5, %ymm7
        vfmsub213ps %ymm1, %ymm9, %ymm7
        vinsertps $16, 20(%rcx), %xmm2, %xmm1
        vinsertps $32, 24(%rcx), %xmm1, %xmm1
        vinsertf128 $1, %xmm0, %ymm1, %ymm1
        vsubps  %ymm10, %ymm1, %ymm1
        vpermps %ymm1, %ymm12, %ymm14
        vpermps %ymm1, %ymm8, %ymm12
        vmulps  %ymm6, %ymm14, %ymm3
        vmovaps %ymm13, %ymm2
        vfmsub213ps %ymm3, %ymm12, %ymm2
        vbroadcastss %xmm1, %ymm1
        vmulps  %ymm0, %ymm12, %ymm3
        vmovaps %ymm6, %ymm13
        vfmsub213ps %ymm3, %ymm1, %ymm13
        vmulps  %ymm13, %ymm11, %ymm3
        vmovaps %ymm2, %ymm10
        vfmadd213ps %ymm3, %ymm9, %ymm10
        vpermps %ymm15, %ymm8, %ymm8
        vmulps  %ymm8, %ymm9, %ymm3
        vmovaps 160(%rdx), %ymm9
        vmulps  %ymm5, %ymm9, %ymm15
        vfmsub213ps %ymm15, %ymm8, %ymm11
        vmovaps %ymm4, %ymm15
        vfmsub213ps %ymm3, %ymm9, %ymm15
        vmulps  %ymm15, %ymm14, %ymm4
        vmovaps %ymm11, %ymm3
        vfmadd213ps %ymm4, %ymm1, %ymm3
        vfmadd213ps %ymm3, %ymm7, %ymm12
        vmovups (%rsp), %ymm4           # 32-byte Folded Reload
        vmulps  %ymm4, %ymm15, %ymm3
        vfmadd213ps %ymm3, %ymm0, %ymm11
        vfmadd213ps %ymm11, %ymm6, %ymm7
        vmulps  %ymm4, %ymm1, %ymm1
        vfmsub213ps %ymm1, %ymm14, %ymm0
        vbroadcastss .LCPI1_2(%rip), %ymm1
        vrcpps  %ymm12, %ymm3
        vmovaps %ymm12, %ymm4
        vfnmadd213ps %ymm1, %ymm3, %ymm4
        vmulps  %ymm4, %ymm3, %ymm4
        vxorps  %ymm14, %ymm14, %ymm14
        vcmpeqps %ymm14, %ymm12, %ymm1
        vcmpunordps %ymm14, %ymm12, %ymm3
        vorps   %ymm1, %ymm3, %ymm11
        vmulps  %ymm13, %ymm5, %ymm1
        vmulps  %ymm4, %ymm7, %ymm5
        vmovups 32(%rsp), %ymm3         # 32-byte Folded Reload
        vfmadd213ps %ymm1, %ymm3, %ymm2
        vbroadcastss .LCPI1_3(%rip), %ymm1
        vcmpnleps %ymm1, %ymm5, %ymm3
        vcmpnleps %ymm5, %ymm14, %ymm6
        vorps   %ymm3, %ymm6, %ymm6
        vbroadcastss .LCPI1_0(%rip), %ymm3
        vblendvps %ymm11, %ymm14, %ymm3, %ymm7
        vfmadd213ps %ymm10, %ymm0, %ymm9
        vmovaps (%r8), %ymm3
        vmovmskps %ymm3, %eax
        leaq    352(%rdx), %r8
        cmpl    $255, %eax
        vmulps  %ymm9, %ymm4, %ymm9
        vaddps  %ymm9, %ymm5, %ymm10
        vmovups 320(%rdx), %ymm5
        vfmadd213ps %ymm2, %ymm8, %ymm0
        vblendvps %ymm6, %ymm14, %ymm7, %ymm6
        vpcmpeqd %ymm2, %ymm2, %ymm2
        vcmpnleps %ymm1, %ymm10, %ymm1
        vcmpnleps %ymm9, %ymm14, %ymm7
        vorps   %ymm1, %ymm7, %ymm1
        vblendvps %ymm1, %ymm14, %ymm6, %ymm6
        vmulps  %ymm0, %ymm4, %ymm1
        vcmpnleps 352(%rdx), %ymm1, %ymm0
        vcmpnleps %ymm1, %ymm5, %ymm4
        vorps   %ymm0, %ymm4, %ymm0
        vblendvps %ymm0, %ymm14, %ymm6, %ymm0
        vpcmpeqd %ymm14, %ymm0, %ymm4
        vpxor   %ymm2, %ymm4, %ymm2
        je      .LBB1_1
# BB#4:                                 # %some_on
        vpand   %ymm3, %ymm2, %ymm2
.LBB1_1:                                # %all_on
        vmovmskps %ymm2, %eax
        testl   %eax, %eax
        je      .LBB1_3
# BB#2:                                 # %safe_if_run_true356
        vmaskmovps %ymm1, %ymm2, (%r8)
        vpbroadcastd 48(%rcx), %ymm1
        vmaskmovps %ymm1, %ymm2, 384(%rdx)
.LBB1_3:                                # %safe_if_after_true
        vmovaps 224(%rsp), %xmm6        # 16-byte Reload
        vmovaps 208(%rsp), %xmm7        # 16-byte Reload
        vmovaps 192(%rsp), %xmm8        # 16-byte Reload
        vmovaps 176(%rsp), %xmm9        # 16-byte Reload
        vmovaps 160(%rsp), %xmm10       # 16-byte Reload
        vmovaps 144(%rsp), %xmm11       # 16-byte Reload
        vmovaps 128(%rsp), %xmm12       # 16-byte Reload
        vmovaps 112(%rsp), %xmm13       # 16-byte Reload
        vmovaps 96(%rsp), %xmm14        # 16-byte Reload
        vmovaps 80(%rsp), %xmm15        # 16-byte Reload
        addq    $248, %rsp
        ret

Using SSE4 the function asm starts like this:

"TriIntersect___REFs[_c_unTriangle]REFs[vyRay]": # "TriIntersect___REFs[_c_unTriangle]REFs[vyRay]"
# BB#0:                                 # %allocas
        subq    $248, %rsp
        movaps  %xmm15, 80(%rsp)        # 16-byte Spill
        movaps  %xmm14, 96(%rsp)        # 16-byte Spill
        movaps  %xmm13, 112(%rsp)       # 16-byte Spill
        movaps  %xmm12, 128(%rsp)       # 16-byte Spill
        movaps  %xmm11, 144(%rsp)       # 16-byte Spill
        movaps  %xmm10, 160(%rsp)       # 16-byte Spill
        movaps  %xmm9, 176(%rsp)        # 16-byte Spill
        movaps  %xmm8, 192(%rsp)        # 16-byte Spill
        movaps  %xmm7, 208(%rsp)        # 16-byte Spill
        movaps  %xmm6, 224(%rsp)        # 16-byte Spill
        movss   (%rcx), %xmm1
        movss   16(%rcx), %xmm0
        insertps $16, 4(%rcx), %xmm1
        insertps $32, 8(%rcx), %xmm1
        movss   32(%rcx), %xmm7
        insertps $16, 36(%rcx), %xmm7
        insertps $32, 40(%rcx), %xmm7
        subps   %xmm1, %xmm7
        pshufd  $-86, %xmm1, %xmm2      # xmm2 = xmm1[2,2,2,2]
        pshufd  $85, %xmm1, %xmm3       # xmm3 = xmm1[1,1,1,1]
        movaps  (%rdx), %xmm4
        movaps  16(%rdx), %xmm12
        movaps  32(%rdx), %xmm5
        movaps  48(%rdx), %xmm11
        subps   %xmm3, %xmm12
        subps   %xmm2, %xmm5
        movaps  %xmm5, 32(%rsp)         # 16-byte Spill
        pshufd  $0, %xmm1, %xmm2        # xmm2 = xmm1[0,0,0,0]
        subps   %xmm2, %xmm4
        pshufd  $85, %xmm7, %xmm3       # xmm3 = xmm7[1,1,1,1]
        movdqa  %xmm3, 48(%rsp)         # 16-byte Spill
        movaps  %xmm11, %xmm2
        mulps   %xmm3, %xmm2
        movaps  %xmm3, %xmm6
        pshufd  $-86, %xmm7, %xmm3      # xmm3 = xmm7[2,2,2,2]
        movdqa  %xmm3, 64(%rsp)         # 16-byte Spill
        movaps  %xmm11, %xmm9
        mulps   %xmm3, %xmm9
        movaps  %xmm3, %xmm10
        insertps $16, 20(%rcx), %xmm0
        insertps $32, 24(%rcx), %xmm0
        subps   %xmm1, %xmm0
        pshufd  $85, %xmm0, %xmm3       # xmm3 = xmm0[1,1,1,1]
        movdqa  %xmm3, (%rsp)           # 16-byte Spill
        movdqa  %xmm3, %xmm1
        movdqa  %xmm3, %xmm14
        mulps   %xmm5, %xmm1
        pshufd  $-86, %xmm0, %xmm8      # xmm8 = xmm0[2,2,2,2]
        ...

Even the LDC2 compiler doesn't get anywhere close to such good/efficient
usage of SIMD instructions. And the compiler is also able to spread the
work on multiple cores.

I think D is meant to be used for similar numerical code too, so perhaps
the little amount of ideas contained in this very C-like language is worth
stealing and adding to D.

Bye,
bearophile
Jul 12 2013
prev sibling parent Sönke Ludwig <sludwig outerproduct.org> writes:
On 20.06.2013 13:58, bearophile wrote:
 
 The Reddit thread contains a link to this page, a compiler for a C
 variant from Intel that's optimized for SIMD:
 http://ispc.github.io/
 
Since you mention that, I developed a similar compiler/language in parallel
to Intel at the time. The main differences were an implicit approach to
handling the main loop (it didn't expose the SIMD target through explicit
indices like ISPC did at the beginning) and the ability to target GPUs in
addition to outputting SIMD code. It was primarily used for high
performance image processing on a commercial application that needed a safe
CPU fallback path.

Although I didn't have the time to implement many comprehensive
optimization techniques (apart from some basic ones and from what LLVM
provides), the results were quite impressive, depending of course on how
well a program lends itself to the SPMD->SIMD transformation. There are
some benchmarks at the end of the thesis:
http://outerproduct.org/research/msc-thesis-slurp.pdf

Unfortunately, at this point it is purely academic because the copyright to
the source code has been left with my former employer and now Google. It's
a bit of a pity, as it was filling a certain niche that nobody else did
(fortunately this has become less of an issue with the ubiquitous
distribution of shader-capable GPUs and improving drivers from a certain
GPU vendor).
Jun 20 2013
prev sibling next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Wed, 19 Jun 2013 15:25:29 -0400
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 
 reddit: 
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
 
A bit late, but torrents/links up: http://semitwist.com/download/misc/dconf2013/
Jun 20 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/21/13 12:38 AM, Nick Sabalausky wrote:
 On Wed, 19 Jun 2013 15:25:29 -0400
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 reddit:
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
A bit late, but torrents/links up: http://semitwist.com/download/misc/dconf2013/
Thanks for this work. I'll be late with torrents for the last two talks until I get to a broadband connection. Andrei
Jun 21 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/19/13 12:25 PM, Andrei Alexandrescu wrote:
 Apologies for the delay, we're moving and things are a bit hectic.

 reddit:
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/


 twitter: https://twitter.com/D_Programming/status/347433981928693760

 hackernews: https://news.ycombinator.com/item?id=5907624

 facebook: https://www.facebook.com/dlang.org/posts/659747567372261

 youtube: http://youtube.com/watch?v=q_39RnxtkgM


 Andrei
Now available in HD: https://archive.org/details/dconf2013-day03-talk05 Andrei
Jun 24 2013