
digitalmars.D - Very simple SIMD programming

reply "bearophile" <bearophileHUGS lycos.com> writes:
I have found a nice paper, "Extending a C-like Language for 
Portable SIMD Programming", (2012), by Roland L., Sebastian Hack 
and Ingo Wald:

http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

SIMD programming is necessary in a systems language, or in any 
language that wants to use modern CPUs well. So languages like C, 
C++, D (and Mono C#) support such wide registers.

The authors of this paper have understood that it's also 
important to make SIMD programming easy, almost as easy as scalar 
code, so that most programmers are able to write such code 
correctly.

So this paper presents ideas to better express SIMD semantics in 
a C-like language. The authors introduce a few new constructs 
into a large subset of C. The resulting coding patterns seem easy 
enough (they surely look simpler than most multi-core coding 
patterns I've seen, including Cilk+).


They present a simple scalar program in C:

struct data_t {
     int key;
     int other;
};

int search(data_t* data , int N) {
     for (int i = 0; i < N; i++) {
         int x = data[i].key;
         if (4 < x & x <= 8) return x;
     }
     return -1;
}


Then they explain the three most common ways to represent an 
array of structs, here for a struct that contains 3 values:

x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 x6 y6 z6 x7 
y7 z7
(a) Array of Structures (AoS)

x0 x1 x2 x3 x4 x5 x6 x7   y0 y1 y2 y3 y4 y5 y6 y7   z0 z1 z2 z3 
z4 z5 z6 z7
(b) Structure of Arrays (SoA)

x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 x6 x7 y4 y5 y6 y7 z4 z5 
z6 z7
(c) Hybrid Structure of Arrays (Hybrid SoA)

They explain why (c) is the preferred pattern in SIMD 
programming.
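For concreteness, the three layouts can be written down as plain struct declarations. This is my own sketch, with the SIMD width fixed at a hypothetical W = 4:

```cpp
#include <cassert>
#include <cstddef>

// Assumed SIMD width for this sketch.
constexpr std::size_t W = 4;

// (a) Array of Structures: memory holds x0 y0 z0 x1 y1 z1 ...
struct Vec3 { float x, y, z; };
// usage: Vec3 aos[n];

// (b) Structure of Arrays: one big array per field.
struct SoA {
    float *x, *y, *z; // each points to n floats
};

// (c) Hybrid SoA: W consecutive x's, then W y's, then W z's, repeated.
struct Vec3Block { float x[W], y[W], z[W]; };
// usage: Vec3Block hsoa[(n + W - 1) / W];
```

The point of (c) is that each `x`, `y`, or `z` field of a block is a contiguous group of W floats, i.e. exactly one SIMD register load, while the container still grows like an ordinary array of blocks.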


Using the (c) data layout, they show how, in C with (nice) SIMD 
intrinsics, you write vectorized code (a simd_data_t struct 
instance contains 8 int values):

struct simd_data_t {
     simd_int key;
     simd_int other;
};

int search(simd_data_t* data, int N) {
    for (int i = 0; i < N/L; ++i) {
        simd_int x = load(data[i].key);
        simd_int cmp = simd_and(simd_lt(4, x), simd_le(x, 8));
        int mask = simd_to_mask(cmp);
        if (mask != 0) {
            simd_int result = simd_and(mask, x);
            for (int j = 0; j < log2(L); j++)
                result = simd_or(result, whole_reg_shr(result, 1 << j));
            return simd_extract(result, 0);
        }
    }
    return -1;
}
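The subtle part of this listing is the tail: `simd_and(mask, x)` zeroes the non-matching lanes, and the log2(L) shift-or loop then ors every lane down into lane 0, so `simd_extract(result, 0)` returns the or of all matching keys. A scalar C++ model of that reduction, with a `std::array` standing in for a SIMD register (all names here are my own stand-ins, not real intrinsics, and L is fixed at 4):

```cpp
#include <array>
#include <cassert>

constexpr int L = 4; // modelled SIMD width

using Lanes = std::array<int, L>;

// whole_reg_shr: shift the whole "register" right by n lanes, zero-filling
Lanes whole_reg_shr(const Lanes &v, int n) {
    Lanes r{};
    for (int i = 0; i + n < L; ++i) r[i] = v[i + n];
    return r;
}

// One block of the vectorized search: keep lanes with 4 < x <= 8,
// then or the surviving lanes down into lane 0.
int search_block(const Lanes &x) {
    Lanes result{};
    bool any = false;
    for (int i = 0; i < L; ++i)
        if (4 < x[i] && x[i] <= 8) { result[i] = x[i]; any = true; }
    if (!any) return -1;                  // mask == 0: no lane matched
    for (int j = 1; j < L; j <<= 1) {     // log2(L) shift-or steps
        Lanes s = whole_reg_shr(result, j);
        for (int i = 0; i < L; ++i) result[i] |= s[i];
    }
    return result[0];                     // or of all matching lanes
}
```

Note that when several lanes match, lane 0 ends up holding the bitwise or of all the matching keys rather than any single key, which is arguably surprising behaviour.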


D should become able to do this (which is not too bad), or 
better.


Their C language extensions allow one to write nicer code, like:

struct data_t {
     int key;
     int other;
};

int search(data_t *scalar data, int scalar N) {
    int L = lengthof(*data);
    for (int i = 0; i < N/L; ++i) {
        int x = data[i].key;
        if (4 < x & x <= 8)
            int block[L] result = [x, 0];
        scalar {
            for (int j = 0; j < log2(L); ++j)
                result |= whole_reg_shr(result, 1 << j);
            return get(x, 0);
        }
    }
    return -1;
}


This is based on just a few simple ideas, explained in the paper 
(they are interesting, but quoting those parts of the paper here 
is not a good idea). Such ideas are not directly portable to D 
unless the front-end is changed (their compiler works by 
lowering, and emits regular C++ code with intrinsics).


Near the end of the paper they also propose some C++ library code:

the C++ template mechanism would allow to define a hybrid SoA 
container class: Similar to std::vector, which abstracts a 
traditional C array, one could implement a wrapper around a 
T block[N]*:

// scalar context throughout this example
struct vec3 { float x, y, z; };

// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);

// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;

int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;

vec3 v = vecs[i];            // gather
vecs[i] = v;                 // scatter
vec3 block[N] w = vecs[ii];  // fetch whole block

hsoa<vec3>::ref r = vecs[i]; // get proxy to a scalar
r = v;                       // pipe through proxy

// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });

Regardless of the other ideas of their C-like language, a similar 
struct should be added to Phobos once the higher level SIMD 
support is in better shape in D. Supporting Hybrid SoA and a few 
operations on it will be an important but probably quite short 
and simple addition to the Phobos collections (it's essentially a 
struct that acts like an array, with a few simple extra 
operations).

I think no commonly used language allows both very simple and 
quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, 
Go, and currently Rust too, are not able to support SIMD 
programming well. I think Haskell currently doesn't support it 
well either, but Haskell is very flexible and is compiled by a 
native compiler, so such things can maybe be added). So 
supporting it well will be an interesting selling point of D. 
(Supporting very simple SIMD coding in D will make D more 
widespread, but this kind of programming will probably remain a 
small niche.)

Bye,
bearophile
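As a rough illustration of the container the authors describe, here is a toy C++ hybrid-SoA wrapper. This is my own sketch, far simpler than the paper's hsoa: it is fixed to one 3-float element type and supports only gather/scatter, with a hypothetical block width W:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Toy hybrid-SoA container: elements live in blocks of W x's, W y's, W z's.
template <std::size_t W = 4>
class HsoaVec3 {
    struct Block { float x[W], y[W], z[W]; };
    std::vector<Block> blocks_; // ceil(n/W) blocks, zero-initialized
    std::size_t n_;
public:
    explicit HsoaVec3(std::size_t n) : blocks_((n + W - 1) / W), n_(n) {}
    std::size_t size() const { return n_; }

    // gather: rebuild one logical Vec3 from its block
    Vec3 get(std::size_t i) const {
        const Block &b = blocks_[i / W];
        return {b.x[i % W], b.y[i % W], b.z[i % W]};
    }

    // scatter: write one logical Vec3 into its block
    void set(std::size_t i, Vec3 v) {
        Block &b = blocks_[i / W];
        b.x[i % W] = v.x; b.y[i % W] = v.y; b.z[i % W] = v.z;
    }
};
```

A real version would add the proxy-reference and foreach pieces quoted above, plus alignment guarantees so each field of a Block maps to exactly one SIMD register.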
Oct 23 2012
next sibling parent reply Don Clugston <dac nospam.com> writes:
On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

Try compiling it in D.

test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &

Making that an error was such a good idea. <g>
Oct 24 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.
 Try compiling it in D.

 test.d(8): Error: 4 < x must be parenthesized when next to operator &
 test.d(8): Error: x <= 8 must be parenthesized when next to operator &

 Making that an error was such a good idea.
 <g>

C's precedence rules are the same as in math in this case.
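This can be checked directly: in C and C++ the relational operators bind tighter than bitwise &, so the paper's condition parses as (4 < x) & (x <= 8), exactly the intended range test (many compilers still warn about it, which is the warning D turns into a hard error). A small demonstration:

```cpp
#include <cassert>

// 4 < x & x <= 8 parses as (4 < x) & (x <= 8):
// relational < and <= bind tighter than bitwise &.
bool in_range(int x) {
    return 4 < x & x <= 8;
}
```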
Oct 24 2012
parent Don Clugston <dac nospam.com> writes:
On 24/10/12 11:33, Timon Gehr wrote:
 On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.

Oh, you're right. It's crap code though.
Oct 24 2012
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Don Clugston:

 Making that an error was such a good idea.
 <g>

There are two other common sources of bugs that I'd like to see removed from D code:

http://d.puremagic.com/issues/show_bug.cgi?id=5409
http://d.puremagic.com/issues/show_bug.cgi?id=8757

Bye,
bearophile
Oct 24 2012
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for 
 Portable SIMD Programming", (2012), by Roland L., Sebastian 
 Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 [...]

Actually, I have yet to see any language that has SIMD as part of the language standard, rather than as an extension where each vendor does it its own way.
Oct 24 2012
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part of 
 the language standard and not as an extension where each vendor 
 does its own way.

D is, or is going to be, one such language :-)

Bye,
bearophile
Oct 24 2012
parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:47:38 UTC, bearophile wrote:
 Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part 
 of the language standard and not as an extension where each 
 vendor does its own way.

D is, or is going to be, one such language :-) Bye, bearophile

Is it so? From what I can understand, the SIMD calls depend on which D compiler is being used.

That doesn't look like part of the language standard to me. :(

--
Paulo
Oct 24 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:

 [...]

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

HLSL, GLSL, Cg? :)

I don't think it's possible considering that D is designed to plug on to
various backends. D already has what's required to do some fairly nice (by
comparison) simd stuff with good supporting libraries.

One thing I can think of that would really improve simd (and not only simd)
would be a way to define compound operators. If the library could
detect/hook sequences of operations and implement them more efficiently as
a compound, that would make some very powerful optimisations available.

Simple example:

T opCompound(string seq)(T a, T b, T c) if (seq == "* +") { return _madd(a, b, c); }
T opCompound(string seq)(T a, T b, T c) if (seq == "+ *") { return _madd(b, c, a); }
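For comparison, C++ libraries approximate this fusion today with expression templates: operator* returns a small proxy instead of computing, and an operator+ overload on that proxy lowers a*b + c into one fused call. A minimal scalar sketch of the pattern, with std::fma standing in for the hypothetical _madd intrinsic:

```cpp
#include <cassert>
#include <cmath>

struct Val { double v; };

// operator* computes nothing; it just captures its operands.
struct MulExpr { double a, b; };
inline MulExpr operator*(Val a, Val b) { return {a.v, b.v}; }

// The "compound" a*b + c (or c + a*b) is caught here and fused.
inline Val operator+(MulExpr m, Val c) { return {std::fma(m.a, m.b, c.v)}; }
inline Val operator+(Val c, MulExpr m) { return {std::fma(m.a, m.b, c.v)}; }
```

A library-defined hook like the opCompound sketched above would give D the same effect without the proxy machinery.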
Oct 24 2012
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 D already has what's required to do some fairly nice (by 
 comparison) simd stuff with good supporting libraries.

After reading that paper I am not sure you are right. See how their
language manages masks by itself. This is from page 3:

// vector length of context = 1; current_mask = T
int block[4] v = <0,3,4,1>;
int block[4] w = 3;      // <3,3,3,3> via broadcast
bool block[4] m = v < w; // <T,F,F,T>
++v;                     // <1,4,5,2>
if (m) {   // vector length of context = 4; current_mask = m
    v += 2; // <3,4,5,4>
} else {   // vector length of context = 4; current_mask = ~m
    v += 3; // <3,7,8,4>
}
// vector length of context = 1; current_mask = T

(The simple benchmarks of the paper show a 5-15% performance loss
compared to handwritten SIMD code.)

Bye,
bearophile
Oct 24 2012
parent reply Manu <turkeyman gmail.com> writes:
On 24 October 2012 16:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  D already has what's required to do some fairly nice (by comparison) simd
 stuff with good supporting libraries.

After reading that paper I am not sure you are right. See how their
language manages masks by itself. This is from page 3:

 // vector length of context = 1; current_mask = T
 int block[4] v = <0,3,4,1>;
 int block[4] w = 3;      // <3,3,3,3> via broadcast
 bool block[4] m = v < w; // <T,F,F,T>
 ++v;                     // <1,4,5,2>
 if (m) {   // vector length of context = 4; current_mask = m
     v += 2; // <3,4,5,4>
 } else {   // vector length of context = 4; current_mask = ~m
     v += 3; // <3,7,8,4>
 }
 // vector length of context = 1; current_mask = T

I agree that if is kinda neat, but it's probably not out of the question
for future extension. All the other stuff here is possible.

That said, it's not necessarily optimal either, just conveniently written.
The compiler would have to do some serious magic to optimise that:
flattening both sides of the if into parallel expressions, and then
applying the mask to combine...

I'm personally not in favour of SIMD constructs that are anything less
than optimal (but I appreciate I'm probably in the minority here).

 (The simple benchmarks of the paper show a 5-15% performance loss
 compared to handwritten SIMD code.)

Right, as I suspected.
Oct 24 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 The compiler would have to do some serious magic to optimise 
 that;
 flattening both sides of the if into parallel expressions, and 
 then applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things that are clearly inefficient.
 I'm personally not in favour of SIMD constructs that are 
 anything less than
 optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance 
 loss compared
 to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss if the alternative for the programmer
is writing scalar code that is 2 or 3 times slower :-)

The SIMD programmers who can't stand a 1% loss of performance use the
intrinsics manually (or write asm) and ignore everything else. A much
larger population of system programmers wish to use modern CPUs
efficiently, but they don't have the time (or the skill, which means their
programs are too often buggy) for assembly-level programming. Currently
they use smart numerical C++ libraries, use modern Fortran versions,
and/or write scalar C/C++ (or Fortran) code, add "restrict" annotations,
and look at the produced asm hoping the modern compiler back-ends will
vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and
approachable for a wider audience, with a small performance loss compared
to handwritten code. This is what language designers have been doing for
60+ years :-)

Bye,
bearophile
Oct 24 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  The compiler would have to do some serious magic to optimise that;
 flattening both sides of the if into parallel expressions, and then
 applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that
paper are fully focused on SIMD programming, so they aren't introducing
things clearly not efficient.

 I'm personally not in favour of SIMD constructs that are anything less
 than optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance loss compared

 to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss, if for the programmer the
alternative is writing scalar code that is 2 or 3 times slower :-)

The SIMD programmers that can't stand a 1% loss of performance use the
intrinsics manually (or write in asm) and they ignore all other things. A
much larger population of system programmers wish to use modern CPUs
efficiently, but they don't have the time (or skill, which means their
programs are too often buggy) for assembly-level programming. Currently
they use smart numerical C++ libraries, use modern Fortran versions,
and/or write C/C++ scalar code (or Fortran), add "restrict" annotations,
and take a look at the produced asm hoping the modern compiler back-ends
will vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and
approachable for a wider audience, with a small performance loss compared
to handwritten code. This is what language designers have been doing for
60+ years :-)

I don't disagree with you, it is fairly cool! I can't imagine D adopting
those sorts of language features any time soon, but it's probably
possible. I guess the keys are defining the bool vector concept, and some
tech to flatten both sides of a vector if statement, but that's far from
simple... Particularly so if someone puts some unrelated code in those if
blocks.

Chances are it offers too much freedom that wouldn't be well used or
understood by the average programmer, and that still leaves you in a
similar land of only being particularly worthwhile in the hands of a
fairly advanced/competent user.

The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower. There are some cool new
expressions offered here, fairly convenient (although easy[er?] to write
in other ways too), but I don't think it would likely change that
fundamental premise for the average programmer beyond some very simple
parallel constructs that the compiler can easily get right.

I'd certainly love to see it, but is it realistic that someone would take
the time to do all of that any time soon when the benefits are
controversial? It may even open the possibility for un-skilled people to
write far worse code.

Let's consider your example above for instance. I would rewrite it (given
existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]

Or the whole thing further simplified:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3));

I actually find this more convenient. I also find the if syntax you
demonstrate to be rather deceptive and possibly misleading. 'if' suggests
a branch, whereas the construct you demonstrate will evaluate both sides
every time. Inexperienced programmers may not really grasp that.
Evaluating the true side and the false side inline, and then performing
the select serially, is more honest; it's actually what the computer will
do, and I don't really see it being particularly less convenient either.
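To check that this select-based rewrite really matches the masked if, here is a scalar C++ model of the maskLess/select pair (std::array in place of int4; the helper names mirror Manu's hypothetical ones and are not an existing D or intrinsics API):

```cpp
#include <array>
#include <cassert>

constexpr int N = 4;
using Int4 = std::array<int, N>;

// all-ones lane where a[i] < b[i], all-zeroes otherwise
Int4 maskLess(const Int4 &a, const Int4 &b) {
    Int4 m{};
    for (int i = 0; i < N; ++i) m[i] = a[i] < b[i] ? -1 : 0;
    return m;
}

// per-lane branchless select: an all-ones mask lane picks from t, else f
Int4 select(const Int4 &mask, const Int4 &t, const Int4 &f) {
    Int4 r{};
    for (int i = 0; i < N; ++i) r[i] = (mask[i] & t[i]) | (~mask[i] & f[i]);
    return r;
}

// broadcast-add a scalar to every lane
Int4 add(Int4 a, int s) {
    for (int &e : a) e += s;
    return a;
}
```

Running bearophile's example through it: v = [0,3,4,1] and w = 3 give m = [T,F,F,T]; after the increment, both sides are evaluated and the select yields [3,7,8,4], matching the comments in the masked-if version.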
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:
 Manu:


 The compiler would have to do some serious magic to optimise that;
 flattening both sides of the if into parallel expressions, and then
 applying the mask to combine...

 [...]

  // the if block is trivially rewritten:
  int4 trueSide = v + int4(2);
  int4 falseSize = v + int4(3);
  v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work:

int4 trueSide = v + 2;
int4 falseSide = v + 3;

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(
Oct 25 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through the .array[] property, which isn't an lvalue.

Regards
--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 25 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 25 October 2012 13:38, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through .array[] property, which isn't an lvalue.

Yeah, good. That's how I thought it was :)

Let me rewrite it again then:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
v = selectLess(v, w, v + 3, v + 4);
// combines the prior few lines: v < w = [T,F,F,T] -> [0+3, 3+4, 4+4, 1+3] == [3,7,8,4]

I think this is far more convenient than any crazy 'if' syntax :) .. It's also perfectly optimal on all architectures I know as well!
Oct 25 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 I think this is far more convenient than any crazy 'if' syntax 
 :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is light-years ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too late to fix them :-)

That if syntax (which is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized and a scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

Bye,
bearophile
Oct 25 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 14:13, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-)

I think I said numerous times in my former email that it's really cool, and certainly very interesting. I just can't imagine it appearing in D any time soon. We do have some ways to conveniently do lots of that stuff right now, and improve on other competing languages in the area. I'd like to see more realistic case studies of their approach, where it significantly simplifies particular workloads.

 That if syntax (that is integrated in a type system that manages the masks,
 plus implicit polymorphism that allows the same function to be used both in
 a vectorized or scalar context) works with larger amounts of code too,
 while you are just doing a differential assignment.

And that's likely where it all starts getting very complicated. If the branches start doing significant (and unbalanced) work, an un-skilled programmer will have a lot of trouble understanding what sort of mess they may be making. And as usual, x86 will be the most tolerant, so they may not even know when profiling.

I've said before, it's very interesting, but it also sounds potentially very dangerous. It's probably also an awful lot of work, I'd wager... I doubt we'll see those expressions any time soon.
Oct 25 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/25/12 7:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-)

The part about respect for one and one's work applies right back at you.

Andrei
Oct 25 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/25/2012 4:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-) That if syntax (that is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized or scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

The interesting thing about SIMD code is that if you just read the data sheets for SIMD instructions, and write some SIMD code based on them, you're going to get lousy results. I know this from experience (see the array op SIMD implementations in the D runtime library).

Making SIMD code that delivers performance turns out to be a highly quirky and subtle exercise, one that is resistant to formalization. Despite the availability of SIMD hardware, there is a terrible lack of quality information on the internet about how to do it right, written by people who know what they're talking about.

Manu is on the daily front lines of doing competitive, real world SIMD programming. He leads a team doing SIMD work. Hence, I am going to strongly weight his opinions on any high level SIMD design constructs.

Interestingly, both of us have rejected the "auto-vectorization" approach popular in C/C++ compilers, for very different reasons.
Oct 25 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 Making SIMD code that delivers performance turns out to be a 
 highly quirky and subtle exercise, one that is resistant to 
 formalization.

I have written some SIMD code, with mixed results, so I understand part of such problems, though my overall experience with such things is limited. Despite those problems and their failures, I think it's important to support computer scientists who try to invent languages that offer medium-level means to write this kind of code :-) Reading and studying CS papers is important.
 Manu is on the daily front lines of doing competitive, real 
 world SIMD programming. He leads a team doing SIMD work. Hence, 
 I am going to strongly weight his opinions on any high level 
 SIMD design constructs.

I respect both Manu and his work (and you, Walter, are the one at the top of my list of programming heroes).
 Interestingly, both of us have rejected the 
 "auto-vectorization" approach popular in C/C++ compilers, for 
 very different reasons.

The authors of that paper have rejected it too: it doesn't give the compiler enough semantic information. They have explored a different solution.

Bye,
bearophile
Oct 25 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:50:44 UTC, Manu wrote:
 On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> 
 wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile 
 wrote:

 [...]

 Using the (c) data pattern they show how in C with (nice) 
 SIMD intrinsics
 you write vectorized code (a simd_data_t struct instance 
 contains 8 int
 values):

 struct simd_data_t {
     simd_int key;
     simd_int other;
 };

 int search(simd_data_t* data , int N) {
     for (int i = 0; i < N/L; ++i) {
         simd_int x = load(data[i].key);
         simd_int cmp = simd_and(simd_lt(4, x),
         simd_le(x, 8));
         int mask = simd_to_mask(cmp);
         if (mask != 0) {
             simd_int result = simd_and(mask , x);
             for (int j = 0; j < log2(L); j++)
                 result = simd_or(result ,
                 whole_reg_shr(result , 1 << j));
                 return simd_extract(result , 0);
             }
         }
     return -1;
 }


 D should do become able to do this (that is not too much 
 bad), or better.


 Their C language extensions allow to write nicer code like:

 struct data_t {
     int key;
     int other;
 };

 int search(data_t *scalar data , int scalar N) {
     int L = lengthof(*data);
     for (int i = 0; i < N/L; ++i) {
         int x = data[i].key;
         if (4 < x & x <= 8)
             int block[L] result = [x, 0];
         scalar {
             for (int j = 0; j < log2(L); ++j)
                 result |= whole_reg_shr(result , 1 << j);
             return get(x, 0);
         }
     }
     return -1;
 }


 This is based on just few simple ideas, explained in the 
 paper (they are
 interesting, but quoting here those parts of the paper is not 
 a good idea).
 Such ideas are not directly portable to D (unless the 
 front-end is changed.
 Their compiler works by lowering, and emits regular C++ code 
 with
 intrinsics).


 Near the end of the paper they also propose some C++ library 
 code:

  the C++ template mechanism would allow to define a hybrid 
 SoA container
 class: Similar to std::vector which abstracts a traditional 
 C array, one
 could implement a wrapper around a T block[N]*:<

// scalar context throughout this example
struct vec3 { float x, y, z; };

// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);

// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;

int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;

vec3 v = vecs[i];               // gather
vecs[i] = v;                    // scatter
vec3 block[N] w = vecs[ii];     // fetch whole block
hsoa<vec3>::ref r = vecs[i];    // get proxy to a scalar
r = v;                          // pipe through proxy

// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });

Regardless of the other ideas of their C-like language, a similar struct should be added to Phobos once the higher-level SIMD support in D is in better shape. Supporting Hybrid SoA and a few operations on it will be an important but probably quite short and simple addition to the Phobos collections (it's essentially a struct that acts like an array, with a few simple extra operations).

I think no commonly used language allows both very simple and quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, Go, and currently Rust too, are not able to support SIMD programming well. I think currently Haskell too doesn't support it well, but Haskell is very flexible and it's compiled by a native compiler, so such things may be possible to add). So supporting it well in D would be an interesting selling point of D. (Supporting very simple SIMD coding in D will make D more widespread, but this kind of programming will probably remain a small niche.)

Bye,
bearophile

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

HLSL, GLSL, Cg? :)

I was thinking about general purpose programming languages, not domain specific ones.

--
Paulo
Oct 24 2012
prev sibling next sibling parent reply "jerro" <a a.com> writes:
 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }

It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:

float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

with gdc with the flags -O2 -mfma, you get:

0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1    vfmadd132ps xmm0,xmm2,xmm1
   5:   c3                ret
Oct 24 2012
parent Manu <turkeyman gmail.com> writes:
On 24 October 2012 18:12, jerro <a a.com> wrote:

  Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return
 _madd(a, b, c); }

 It may be useful to have a way to define compound operators for other
 things (although you can already write expression templates), but this
 is an optimization that the compiler back end can do. If you compile
 this code:

 float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

 with gdc with the flags -O2 -mfma, you get:

 0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
    0:   c4 e2 69 98 c1    vfmadd132ps xmm0,xmm2,xmm1
    5:   c3                ret

Right, I suspected GDC might do that, but it was just an example. You can extend that to many more complicated scenarios. What does it do on less mature architectures like MIPS, PPC, ARM?
Oct 24 2012
prev sibling parent "F i L" <witte2008 gmail.com> writes:
Manu wrote:
 One thing I can think of that would really improve simd (and 
 not only simd)
 would be a way to define compound operators.
 If the library could detect/hook sequences of operations and 
 implement them
 more efficiently as a compound, that would make some very 
 powerful
 optimisations available.

 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }
   T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { 
 return
 _madd(b, c, a); }

I thought about that before, and it might be nice to have that level of control in the language, but ultimately, like jerro said, I think it would be better suited to the compiler's back-end optimization. Unfortunately, I don't think more complex patterns, such as matrix multiplications, are found and optimized by GCC/LLVM... I could be wrong, but these are areas where my hand-tuned code always outperforms basic math code.

I think having that in the back-end makes a lot of sense, because your code is easier to read and understand, without sacrificing performance. Plus, it would be difficult to map a sequence as complex as matrix multiplication to a single compound operator.

That being said, I do think something similar would be useful in general:

struct Vector {
    ...
    static float distance(Vector a, Vector b) { ... }
    static float distanceSquared(Vector a, Vector b) { ... }

    float opSequence(string funcs...)(Vector a, Vector b)
        if (funcs[0] == "Math.sqrt" && funcs[1] == "Vector.distance")
    {
        return distanceSquared(a, b);
    }
}

void main() {
    auto a = Vector.random( ... );
    auto b = Vector.random( ... );

    // Below is turned into a 'distanceSquared()' call
    float dis = Math.sqrt(Vector.distance(a, b));
}

Since distance requires a 'Math.sqrt()', this pseudo-code could avoid the operation entirely by calling 'distanceSquared()', even if the programmer is a noob and doesn't know to do it explicitly.
Oct 24 2012