
digitalmars.D - Very simple SIMD programming

reply "bearophile" <bearophileHUGS lycos.com> writes:
I have found a nice paper, "Extending a C-like Language for 
Portable SIMD Programming", (2012), by Roland L., Sebastian Hack 
and Ingo Wald:

http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

SIMD programming is necessary in a systems language, or in any 
language that wants to use modern CPUs well. So languages like C, 
C++, D (and Mono C#) support such wide registers.

The authors of this paper have understood that it's also 
important to make SIMD programming easy, almost as easy as scalar 
code, so that most programmers are able to write such code 
correctly.

So this paper presents ideas to better express SIMD semantics in 
a C-like language. The authors introduce a few new constructs 
into a large subset of C. The resulting coding patterns seem easy 
enough (they surely look simpler than most multi-core coding 
patterns I've seen, including Cilk+).


They present a simple scalar program in C:

struct data_t {
     int key;
     int other;
};

int search(data_t* data , int N) {
     for (int i = 0; i < N; i++) {
         int x = data[i].key;
         if (4 < x & x <= 8) return x;
     }
     return -1;
}


Then they explain the three most common ways to represent an 
array of structs, here for a struct that contains 3 values:

x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 x6 y6 z6 x7 
y7 z7
(a) Array of Structures (AoS)

x0 x1 x2 x3 x4 x5 x6 x7   y0 y1 y2 y3 y4 y5 y6 y7   z0 z1 z2 z3 
z4 z5 z6 z7
(b) Structure of Arrays (SoA)

x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 x6 x7 y4 y5 y6 y7 z4 z5 
z6 z7
(c) Hybrid Structure of Arrays (Hybrid SoA)

They explain why (c) is the preferred pattern in SIMD 
programming.
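For concreteness, the three layouts can be written down as plain struct declarations. This is my own sketch, with the SIMD width fixed at a hypothetical W = 4:

```cpp
#include <cassert>
#include <cstddef>

// Assumed SIMD width for this sketch.
constexpr std::size_t W = 4;

// (a) Array of Structures: memory holds x0 y0 z0 x1 y1 z1 ...
struct Vec3 { float x, y, z; };
// usage: Vec3 aos[n];

// (b) Structure of Arrays: one big array per field.
struct SoA {
    float *x, *y, *z; // each points to n floats
};

// (c) Hybrid SoA: W consecutive x's, then W y's, then W z's, repeated.
struct Vec3Block { float x[W], y[W], z[W]; };
// usage: Vec3Block hsoa[(n + W - 1) / W];
```

The point of (c) is that each `x`, `y`, or `z` field of a block is a contiguous group of W floats, i.e. exactly one SIMD register load, while the container still grows like an ordinary array of blocks.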


Using the (c) data layout, they show how, in C with (nice) SIMD 
intrinsics, you write vectorized code (a simd_data_t struct 
instance contains 8 int values):

struct simd_data_t {
     simd_int key;
     simd_int other;
};

int search(simd_data_t* data, int N) {
    for (int i = 0; i < N/L; ++i) {
        simd_int x = load(data[i].key);
        simd_int cmp = simd_and(simd_lt(4, x), simd_le(x, 8));
        int mask = simd_to_mask(cmp);
        if (mask != 0) {
            simd_int result = simd_and(mask, x);
            for (int j = 0; j < log2(L); j++)
                result = simd_or(result, whole_reg_shr(result, 1 << j));
            return simd_extract(result, 0);
        }
    }
    return -1;
}
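The subtle part of this listing is the tail: `simd_and(mask, x)` zeroes the non-matching lanes, and the log2(L) shift-or loop then ors every lane down into lane 0, so `simd_extract(result, 0)` returns the or of all matching keys. A scalar C++ model of that reduction, with a `std::array` standing in for a SIMD register (all names here are my own stand-ins, not real intrinsics, and L is fixed at 4):

```cpp
#include <array>
#include <cassert>

constexpr int L = 4; // modelled SIMD width

using Lanes = std::array<int, L>;

// whole_reg_shr: shift the whole "register" right by n lanes, zero-filling
Lanes whole_reg_shr(const Lanes &v, int n) {
    Lanes r{};
    for (int i = 0; i + n < L; ++i) r[i] = v[i + n];
    return r;
}

// One block of the vectorized search: keep lanes with 4 < x <= 8,
// then or the surviving lanes down into lane 0.
int search_block(const Lanes &x) {
    Lanes result{};
    bool any = false;
    for (int i = 0; i < L; ++i)
        if (4 < x[i] && x[i] <= 8) { result[i] = x[i]; any = true; }
    if (!any) return -1;                  // mask == 0: no lane matched
    for (int j = 1; j < L; j <<= 1) {     // log2(L) shift-or steps
        Lanes s = whole_reg_shr(result, j);
        for (int i = 0; i < L; ++i) result[i] |= s[i];
    }
    return result[0];                     // or of all matching lanes
}
```

Note that when several lanes match, lane 0 ends up holding the bitwise or of all the matching keys rather than any single key, which is arguably surprising behaviour.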


D should become able to do this (which is not too bad), or 
better.


Their C language extensions allow one to write nicer code, like:

struct data_t {
     int key;
     int other;
};

int search(data_t *scalar data, int scalar N) {
    int L = lengthof(*data);
    for (int i = 0; i < N/L; ++i) {
        int x = data[i].key;
        if (4 < x & x <= 8)
            int block[L] result = [x, 0];
        scalar {
            for (int j = 0; j < log2(L); ++j)
                result |= whole_reg_shr(result, 1 << j);
            return get(x, 0);
        }
    }
    return -1;
}


This is based on just a few simple ideas, explained in the paper 
(they are interesting, but quoting those parts of the paper here 
is not a good idea). Such ideas are not directly portable to D 
unless the front-end is changed (their compiler works by 
lowering, and emits regular C++ code with intrinsics).


Near the end of the paper they also propose some C++ library code:

the C++ template mechanism would allow to define a hybrid SoA 
container class: Similar to std::vector, which abstracts a 
traditional C array, one could implement a wrapper around a 
T block[N]*:

// scalar context throughout this example
struct vec3 { float x, y, z; };

// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);

// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;

int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;

vec3 v = vecs[i];            // gather
vecs[i] = v;                 // scatter
vec3 block[N] w = vecs[ii];  // fetch whole block

hsoa<vec3>::ref r = vecs[i]; // get proxy to a scalar
r = v;                       // pipe through proxy

// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });

Regardless of the other ideas of their C-like language, a similar 
struct should be added to Phobos once the higher level SIMD 
support is in better shape in D. Supporting Hybrid SoA and a few 
operations on it will be an important but probably quite short 
and simple addition to the Phobos collections (it's essentially a 
struct that acts like an array, with a few simple extra 
operations).

I think no commonly used language allows both very simple and 
quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, 
Go, and currently Rust too, are not able to support SIMD 
programming well. I think Haskell currently doesn't support it 
well either, but Haskell is very flexible and is compiled by a 
native compiler, so such things can maybe be added). So 
supporting it well will be an interesting selling point of D. 
(Supporting very simple SIMD coding in D will make D more 
widespread, but this kind of programming will probably remain a 
small niche.)

Bye,
bearophile
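As a rough illustration of the container the authors describe, here is a toy C++ hybrid-SoA wrapper. This is my own sketch, far simpler than the paper's hsoa: it is fixed to one 3-float element type and supports only gather/scatter, with a hypothetical block width W:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Toy hybrid-SoA container: elements live in blocks of W x's, W y's, W z's.
template <std::size_t W = 4>
class HsoaVec3 {
    struct Block { float x[W], y[W], z[W]; };
    std::vector<Block> blocks_; // ceil(n/W) blocks, zero-initialized
    std::size_t n_;
public:
    explicit HsoaVec3(std::size_t n) : blocks_((n + W - 1) / W), n_(n) {}
    std::size_t size() const { return n_; }

    // gather: rebuild one logical Vec3 from its block
    Vec3 get(std::size_t i) const {
        const Block &b = blocks_[i / W];
        return {b.x[i % W], b.y[i % W], b.z[i % W]};
    }

    // scatter: write one logical Vec3 into its block
    void set(std::size_t i, Vec3 v) {
        Block &b = blocks_[i / W];
        b.x[i % W] = v.x; b.y[i % W] = v.y; b.z[i % W] = v.z;
    }
};
```

A real version would add the proxy-reference and foreach pieces quoted above, plus alignment guarantees so each field of a Block maps to exactly one SIMD register.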
Oct 23 2012
next sibling parent reply Don Clugston <dac nospam.com> writes:
On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

Try compiling it in D.

test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &

Making that an error was such a good idea. <g>
Oct 24 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.
 Try compiling it in D.

 test.d(8): Error: 4 < x must be parenthesized when next to operator &
 test.d(8): Error: x <= 8 must be parenthesized when next to operator &

 Making that an error was such a good idea.
 <g>

C's precedence rules are the same as in math in this case.
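This can be checked directly: in C and C++ the relational operators bind tighter than bitwise &, so the paper's condition parses as (4 < x) & (x <= 8), exactly the intended range test (many compilers still warn about it, which is the warning D turns into a hard error). A small demonstration:

```cpp
#include <cassert>

// 4 < x & x <= 8 parses as (4 < x) & (x <= 8):
// relational < and <= bind tighter than bitwise &.
bool in_range(int x) {
    return 4 < x & x <= 8;
}
```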
Oct 24 2012
parent Don Clugston <dac nospam.com> writes:
On 24/10/12 11:33, Timon Gehr wrote:
 On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.

Oh, you're right. It's crap code though.
Oct 24 2012
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Don Clugston:

 Making that an error was such a good idea.
 <g>

There are two other common sources of bugs that I'd like to see removed from D code:

http://d.puremagic.com/issues/show_bug.cgi?id=5409
http://d.puremagic.com/issues/show_bug.cgi?id=8757

Bye,
bearophile
Oct 24 2012
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for 
 Portable SIMD Programming", (2012), by Roland L., Sebastian 
 Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 [...]

Actually, I have yet to see any language that has SIMD as part of the language standard, rather than as an extension where each vendor does it its own way.
Oct 24 2012
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part of 
 the language standard and not as an extension where each vendor 
 does its own way.

D is, or is going to be, one such language :-)

Bye,
bearophile
Oct 24 2012
parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:47:38 UTC, bearophile wrote:
 Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part 
 of the language standard and not as an extension where each 
 vendor does its own way.

D is, or is going to be, one such language :-) Bye, bearophile

Is it so? From what I can understand, the SIMD calls depend on which D compiler is being used.

That doesn't look like part of the language standard to me. :(

--
Paulo
Oct 24 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:

 [...]

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

HLSL, GLSL, Cg? :)

I don't think it's possible considering that D is designed to plug on to
various backends. D already has what's required to do some fairly nice (by
comparison) simd stuff with good supporting libraries.

One thing I can think of that would really improve simd (and not only simd)
would be a way to define compound operators. If the library could
detect/hook sequences of operations and implement them more efficiently as
a compound, that would make some very powerful optimisations available.

Simple example:

T opCompound(string seq)(T a, T b, T c) if (seq == "* +") { return _madd(a, b, c); }
T opCompound(string seq)(T a, T b, T c) if (seq == "+ *") { return _madd(b, c, a); }
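For comparison, C++ libraries approximate this fusion today with expression templates: operator* returns a small proxy instead of computing, and an operator+ overload on that proxy lowers a*b + c into one fused call. A minimal scalar sketch of the pattern, with std::fma standing in for the hypothetical _madd intrinsic:

```cpp
#include <cassert>
#include <cmath>

struct Val { double v; };

// operator* computes nothing; it just captures its operands.
struct MulExpr { double a, b; };
inline MulExpr operator*(Val a, Val b) { return {a.v, b.v}; }

// The "compound" a*b + c (or c + a*b) is caught here and fused.
inline Val operator+(MulExpr m, Val c) { return {std::fma(m.a, m.b, c.v)}; }
inline Val operator+(Val c, MulExpr m) { return {std::fma(m.a, m.b, c.v)}; }
```

A library-defined hook like the opCompound sketched above would give D the same effect without the proxy machinery.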
Oct 24 2012
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 D already has what's required to do some fairly nice (by 
 comparison) simd stuff with good supporting libraries.

After reading that paper I am not sure you are right. See how their
language manages masks by itself. This is from page 3:

// vector length of context = 1; current_mask = T
int block[4] v = <0,3,4,1>;
int block[4] w = 3;      // <3,3,3,3> via broadcast
bool block[4] m = v < w; // <T,F,F,T>
++v;                     // <1,4,5,2>
if (m) {   // vector length of context = 4; current_mask = m
    v += 2; // <3,4,5,4>
} else {   // vector length of context = 4; current_mask = ~m
    v += 3; // <3,7,8,4>
}
// vector length of context = 1; current_mask = T

(The simple benchmarks of the paper show a 5-15% performance loss
compared to handwritten SIMD code.)

Bye,
bearophile
Oct 24 2012
parent reply Manu <turkeyman gmail.com> writes:
On 24 October 2012 16:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  D already has what's required to do some fairly nice (by comparison) simd
 stuff with good supporting libraries.

After reading that paper I am not sure you are right. See how their
language manages masks by itself. This is from page 3:

 // vector length of context = 1; current_mask = T
 int block[4] v = <0,3,4,1>;
 int block[4] w = 3;      // <3,3,3,3> via broadcast
 bool block[4] m = v < w; // <T,F,F,T>
 ++v;                     // <1,4,5,2>
 if (m) {   // vector length of context = 4; current_mask = m
     v += 2; // <3,4,5,4>
 } else {   // vector length of context = 4; current_mask = ~m
     v += 3; // <3,7,8,4>
 }
 // vector length of context = 1; current_mask = T

I agree that if is kinda neat, but it's probably not out of the question
for future extension. All the other stuff here is possible.

That said, it's not necessarily optimal either, just conveniently written.
The compiler would have to do some serious magic to optimise that:
flattening both sides of the if into parallel expressions, and then
applying the mask to combine...

I'm personally not in favour of SIMD constructs that are anything less
than optimal (but I appreciate I'm probably in the minority here).

 (The simple benchmarks of the paper show a 5-15% performance loss
 compared to handwritten SIMD code.)

Right, as I suspected.
Oct 24 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 The compiler would have to do some serious magic to optimise 
 that;
 flattening both sides of the if into parallel expressions, and 
 then applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things that are clearly inefficient.
 I'm personally not in favour of SIMD constructs that are 
 anything less than
 optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance 
 loss compared
 to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss if the alternative for the programmer
is writing scalar code that is 2 or 3 times slower :-)

The SIMD programmers who can't stand a 1% loss of performance use the
intrinsics manually (or write asm) and ignore everything else. A much
larger population of system programmers wish to use modern CPUs
efficiently, but they don't have the time (or the skill, which means their
programs are too often buggy) for assembly-level programming. Currently
they use smart numerical C++ libraries, use modern Fortran versions,
and/or write scalar C/C++ (or Fortran) code, add "restrict" annotations,
and look at the produced asm hoping the modern compiler back-ends will
vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and
approachable for a wider audience, with a small performance loss compared
to handwritten code. This is what language designers have been doing for
60+ years :-)

Bye,
bearophile
Oct 24 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  The compiler would have to do some serious magic to optimise that;
 flattening both sides of the if into parallel expressions, and then
 applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that
paper are fully focused on SIMD programming, so they aren't introducing
things clearly not efficient.

 I'm personally not in favour of SIMD constructs that are anything less
 than optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance loss compared

 to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss, if for the programmer the
alternative is writing scalar code that is 2 or 3 times slower :-)

The SIMD programmers that can't stand a 1% loss of performance use the
intrinsics manually (or write in asm) and they ignore all other things. A
much larger population of system programmers wish to use modern CPUs
efficiently, but they don't have the time (or skill, which means their
programs are too often buggy) for assembly-level programming. Currently
they use smart numerical C++ libraries, use modern Fortran versions,
and/or write C/C++ scalar code (or Fortran), add "restrict" annotations,
and take a look at the produced asm hoping the modern compiler back-ends
will vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and
approachable for a wider audience, with a small performance loss compared
to handwritten code. This is what language designers have been doing for
60+ years :-)

I don't disagree with you, it is fairly cool! I can't imagine D adopting
those sorts of language features any time soon, but it's probably
possible. I guess the keys are defining the bool vector concept, and some
tech to flatten both sides of a vector if statement, but that's far from
simple... Particularly so if someone puts some unrelated code in those if
blocks.

Chances are it offers too much freedom that wouldn't be well used or
understood by the average programmer, and that still leaves you in a
similar land of only being particularly worthwhile in the hands of a
fairly advanced/competent user.

The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower. There are some cool new
expressions offered here, fairly convenient (although easy[er?] to write
in other ways too), but I don't think it would likely change that
fundamental premise for the average programmer beyond some very simple
parallel constructs that the compiler can easily get right.

I'd certainly love to see it, but is it realistic that someone would take
the time to do all of that any time soon when the benefits are
controversial? It may even open the possibility for un-skilled people to
write far worse code.

Let's consider your example above for instance. I would rewrite it (given
existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]

Or the whole thing further simplified:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3));

I actually find this more convenient. I also find the if syntax you
demonstrate to be rather deceptive and possibly misleading. 'if' suggests
a branch, whereas the construct you demonstrate will evaluate both sides
every time. Inexperienced programmers may not really grasp that.
Evaluating the true side and the false side inline, and then performing
the select serially, is more honest; it's actually what the computer will
do, and I don't really see it being particularly less convenient either.
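To check that this select-based rewrite really matches the masked if, here is a scalar C++ model of the maskLess/select pair (std::array in place of int4; the helper names mirror Manu's hypothetical ones and are not an existing D or intrinsics API):

```cpp
#include <array>
#include <cassert>

constexpr int N = 4;
using Int4 = std::array<int, N>;

// all-ones lane where a[i] < b[i], all-zeroes otherwise
Int4 maskLess(const Int4 &a, const Int4 &b) {
    Int4 m{};
    for (int i = 0; i < N; ++i) m[i] = a[i] < b[i] ? -1 : 0;
    return m;
}

// per-lane branchless select: an all-ones mask lane picks from t, else f
Int4 select(const Int4 &mask, const Int4 &t, const Int4 &f) {
    Int4 r{};
    for (int i = 0; i < N; ++i) r[i] = (mask[i] & t[i]) | (~mask[i] & f[i]);
    return r;
}

// broadcast-add a scalar to every lane
Int4 add(Int4 a, int s) {
    for (int &e : a) e += s;
    return a;
}
```

Running bearophile's example through it: v = [0,3,4,1] and w = 3 give m = [T,F,F,T]; after the increment, both sides are evaluated and the select yields [3,7,8,4], matching the comments in the masked-if version.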
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:
 Manu:


 The compiler would have to do some serious magic to optimise that;
 flattening both sides of the if into parallel expressions, and then
 applying the mask to combine...

 [...]

  // the if block is trivially rewritten:
  int4 trueSide = v + int4(2);
  int4 falseSize = v + int4(3);
  v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work:

int4 trueSide = v + 2;
int4 falseSide = v + 3;

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(
Oct 25 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through the .array[] property, which isn't an lvalue.

Regards
--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 25 2012
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 25 October 2012 13:38, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax): [...]

This should work.... int4 trueSide = v + 2; int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through .array[] property, which isn't an lvalue.

Yeah, good. That's how I thought it was :)

Let me rewrite it again then:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
v = selectLess(v, w, v + 3, v + 4);
// combines the prior few lines: v < w = [T,F,F,T] -> [0+3, 3+4, 4+4, 1+3] == [3,7,8,4]

I think this is far more convenient than any crazy 'if' syntax :) .. It's also perfectly optimal on all architectures I know as well!
Oct 25 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 I think this is far more convenient than any crazy 'if' syntax 
 :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is light-years ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too late to fix them :-)

That if syntax (which is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized and a scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

Bye,
bearophile
Oct 25 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 14:13, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-)

I think I said numerous times in my former email that it's really cool, and certainly very interesting. I just can't imagine it appearing in D any time soon. We do have some ways to conveniently do lots of that stuff right now, and improve on other competing languages in the area. I'd like to see more realistic case studies of their approach, where it significantly simplifies particular workloads.

 That if syntax (that is integrated in a type system that manages the masks,
 plus implicit polymorphism that allows the same function to be used both in
 a vectorized or scalar context) works with larger amounts of code too,
 while you are just doing a differential assignment.

And that's likely where it all starts getting very complicated. If the branches start doing significant (and unbalanced) work, an un-skilled programmer will have a lot of trouble understanding what sort of mess they may be making. And as usual, x86 will be the most tolerant, so they may not even know when profiling.

I've said before, it's very interesting, but it also sounds potentially very dangerous. It's probably also an awful lot of work, I'd wager... I doubt we'll see those expressions any time soon.
Oct 25 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/25/12 7:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-)

The part about respect for one and one's work applies right back at you.

Andrei
Oct 25 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/25/2012 4:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-) That if syntax (that is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized or scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

The interesting thing about SIMD code is that if you just read the data sheets for SIMD instructions, and write some SIMD code based on them, you're going to get lousy results. I know this from experience (see the array op SIMD implementations in the D runtime library).

Making SIMD code that delivers performance turns out to be a highly quirky and subtle exercise, one that is resistant to formalization. Despite the availability of SIMD hardware, there is a terrible lack of quality information on the internet about how to do it right, written by people who know what they're talking about.

Manu is on the daily front lines of doing competitive, real world SIMD programming. He leads a team doing SIMD work. Hence, I am going to strongly weight his opinions on any high level SIMD design constructs.

Interestingly, both of us have rejected the "auto-vectorization" approach popular in C/C++ compilers, for very different reasons.
Oct 25 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 Making SIMD code that delivers performance turns out to be a 
 highly quirky and subtle exercise, one that is resistant to 
 formalization.

I have written some SIMD code, with mixed results, so I understand part of such problems, though my overall experience with such things is limited. Despite those problems and their failures, I think it's important to support computer scientists who try to invent languages that offer medium-level means to write this kind of code :-) Reading and studying CS papers is important.
 Manu is on the daily front lines of doing competitive, real 
 world SIMD programming. He leads a team doing SIMD work. Hence, 
 I am going to strongly weight his opinions on any high level 
 SIMD design constructs.

I respect both Manu and his work (and you, Walter, are the one at the top of my list of programming heroes).
 Interestingly, both of us have rejected the 
 "auto-vectorization" approach popular in C/C++ compilers, for 
 very different reasons.

The authors of that paper have rejected it too: it doesn't give the compiler enough semantic information. They have explored a different solution.

Bye,
bearophile
Oct 25 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:50:44 UTC, Manu wrote:
 On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> 
 wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile 
 wrote:

 [...]

 Using the (c) data pattern they show how in C with (nice) 
 SIMD intrinsics
 you write vectorized code (a simd_data_t struct instance 
 contains 8 int
 values):

 struct simd_data_t {
     simd_int key;
     simd_int other;
 };

 int search(simd_data_t* data , int N) {
     for (int i = 0; i < N/L; ++i) {
         simd_int x = load(data[i].key);
         simd_int cmp = simd_and(simd_lt(4, x),
         simd_le(x, 8));
         int mask = simd_to_mask(cmp);
         if (mask != 0) {
             simd_int result = simd_and(mask , x);
             for (int j = 0; j < log2(L); j++)
                 result = simd_or(result ,
                 whole_reg_shr(result , 1 << j));
                 return simd_extract(result , 0);
             }
         }
     return -1;
 }


 D should do become able to do this (that is not too much 
 bad), or better.


 Their C language extensions allow to write nicer code like:

 struct data_t {
     int key;
     int other;
 };

 int search(data_t *scalar data , int scalar N) {
     int L = lengthof(*data);
     for (int i = 0; i < N/L; ++i) {
         int x = data[i].key;
         if (4 < x & x <= 8)
             int block[L] result = [x, 0];
         scalar {
             for (int j = 0; j < log2(L); ++j)
                 result |= whole_reg_shr(result , 1 << j);
             return get(x, 0);
         }
     }
     return -1;
 }


 This is based on just few simple ideas, explained in the 
 paper (they are
 interesting, but quoting here those parts of the paper is not 
 a good idea).
 Such ideas are not directly portable to D (unless the 
 front-end is changed.
 Their compiler works by lowering, and emits regular C++ code 
 with
 intrinsics).


 Near the end of the paper they also propose some C++ library 
 code:

  the C++ template mechanism would allow to define a hybrid 
 SoA container
 class: Similar to std::vector which abstracts a traditional 
 C array, one
 could implement a wrapper around a T block[N]*:<

// scalar context throughout this example
struct vec3 { float x, y, z; };

// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);

// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;

int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;

vec3 v = vecs[i];               // gather
vecs[i] = v;                    // scatter
vec3 block[N] w = vecs[ii];     // fetch whole block
hsoa<vec3>::ref r = vecs[i];    // get proxy to a scalar
r = v;                          // pipe through proxy

// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });

Regardless of the other ideas of their C-like language, a similar struct should be added to Phobos once the higher-level SIMD support in D is in better shape. Supporting Hybrid SoA and a few operations on it will be an important but probably quite short and simple addition to the Phobos collections (it's essentially a struct that acts like an array, with a few simple extra operations).

I think no commonly used language allows both very simple and quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, Go, and currently Rust too, are not able to support SIMD programming well. I think currently Haskell too doesn't support it well, but Haskell is very flexible and it's compiled by a native compiler, so such things may be possible to add). So supporting it well in D would be an interesting selling point of D. (Supporting very simple SIMD coding in D will make D more widespread, but this kind of programming will probably remain a small niche.)

Bye,
bearophile

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

HLSL, GLSL, Cg? :)

I was thinking about general purpose programming languages, not domain specific ones.

--
Paulo
Oct 24 2012
prev sibling next sibling parent reply "jerro" <a a.com> writes:
 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }

It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:

float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

with gdc with the flags -O2 -mfma, you get:

0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1    vfmadd132ps xmm0,xmm2,xmm1
   5:   c3                ret
Oct 24 2012
parent Manu <turkeyman gmail.com> writes:
On 24 October 2012 18:12, jerro <a a.com> wrote:

  Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return
 _madd(a, b, c); }

 It may be useful to have a way to define compound operators for other
 things (although you can already write expression templates), but this
 is an optimization that the compiler back end can do. If you compile
 this code:

 float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

 with gdc with the flags -O2 -mfma, you get:

 0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
    0:   c4 e2 69 98 c1    vfmadd132ps xmm0,xmm2,xmm1
    5:   c3                ret

Right, I suspected GDC might do that, but it was just an example. You can extend that to many more complicated scenarios. What does it do on less mature architectures like MIPS, PPC, ARM?
Oct 24 2012
prev sibling parent "F i L" <witte2008 gmail.com> writes:
Manu wrote:
 One thing I can think of that would really improve simd (and 
 not only simd)
 would be a way to define compound operators.
 If the library could detect/hook sequences of operations and 
 implement them
 more efficiently as a compound, that would make some very 
 powerful
 optimisations available.

 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }
   T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { 
 return
 _madd(b, c, a); }

I thought about that before, and it might be nice to have that level of control in the language, but ultimately, like jerro said, I think it would be better suited to the compiler's back-end optimization. Unfortunately, I don't think more complex patterns, such as matrix multiplications, are found and optimized by GCC/LLVM... I could be wrong, but these are areas where my hand-tuned code always outperforms basic math code.

I think having that in the back-end makes a lot of sense, because your code is easier to read and understand, without sacrificing performance. Plus, it would be difficult to map a sequence as complex as matrix multiplication to a single compound operator.

That being said, I do think something similar would be useful in general:

struct Vector {
    ...
    static float distance(Vector a, Vector b) { ... }
    static float distanceSquared(Vector a, Vector b) { ... }

    float opSequence(string funcs...)(Vector a, Vector b)
        if (funcs[0] == "Math.sqrt" && funcs[1] == "Vector.distance")
    {
        return distanceSquared(a, b);
    }
}

void main() {
    auto a = Vector.random( ... );
    auto b = Vector.random( ... );

    // Below is turned into a 'distanceSquared()' call
    float dis = Math.sqrt(Vector.distance(a, b));
}

Since distance requires a 'Math.sqrt()', this pseudo-code could avoid the operation entirely by calling 'distanceSquared()', even if the programmer is a noob and doesn't know to do it explicitly.
Oct 24 2012