digitalmars.D - Very simple SIMD programming

reply "bearophile" <bearophileHUGS lycos.com> writes:
I have found a nice paper, "Extending a C-like Language for 
Portable SIMD Programming", (2012), by Roland L., Sebastian Hack 
and Ingo Wald:

http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

SIMD programming is necessary in a systems language, or in any 
language that wants to use modern CPUs well. So languages 
like C, C++, D (and Mono-C#) support such wider registers.

The authors of this paper have understood that it's also 
important to make SIMD programming easy, almost as easy as scalar 
code, so that most programmers are able to write this kind of 
code correctly.

So this paper presents ideas to better express SIMD 
semantics in a C-like language. They introduce a few new 
constructs into a large subset of the C language, based on a few 
ideas. The resulting coding patterns seem easy enough (they 
surely look simpler than most multi-core coding patterns I've 
seen, including Cilk+).


They present a simple scalar program in C:

struct data_t {
     int key;
     int other;
};

int search(data_t* data , int N) {
     for (int i = 0; i < N; i++) {
         int x = data[i].key;
         if (4 < x & x <= 8) return x;
     }
     return -1;
}


Then they explain the three most common ways to represent an 
array of structs, here a struct that contains 3 values:

x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 x6 y6 z6 x7 y7 z7
(a) Array of Structures (AoS)

x0 x1 x2 x3 x4 x5 x6 x7   y0 y1 y2 y3 y4 y5 y6 y7   z0 z1 z2 z3 z4 z5 z6 z7
(b) Structure of Arrays (SoA)

x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3   x4 x5 x6 x7 y4 y5 y6 y7 z4 z5 z6 z7
(c) Hybrid Structure of Arrays (Hybrid SoA)

They explain why (c) is the preferred pattern in SIMD 
programming.
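
To make the three layouts concrete, here is a small C++ sketch of where field f of element i lands in a flat array under each scheme. The helper names (aos_index, soa_index, hsoa_index) and the constant kFields are made up for illustration and do not come from the paper; L stands for the block (vector) length, which is 4 in the diagram above.

```cpp
#include <cstddef>

// Hypothetical helpers, not from the paper. Fields are numbered
// 0 = x, 1 = y, 2 = z, matching the three-value struct in the diagram.
constexpr std::size_t kFields = 3;

// (a) AoS: all fields of one element are adjacent.
constexpr std::size_t aos_index(std::size_t i, std::size_t f) {
    return i * kFields + f;
}

// (b) SoA: one contiguous array per field; n is the element count.
constexpr std::size_t soa_index(std::size_t i, std::size_t f, std::size_t n) {
    return f * n + i;
}

// (c) Hybrid SoA: blocks of L elements, stored SoA-style inside each block.
constexpr std::size_t hsoa_index(std::size_t i, std::size_t f, std::size_t L) {
    return (i / L) * (kFields * L) + f * L + (i % L);
}
```

With L = 4, y5 sits at offset 17 in layout (c): one full block of 12 values, then the four x values of the second block, then y4 and y5.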


Using data pattern (c), they show how you write vectorized code 
in C with (nice) SIMD intrinsics (a simd_data_t struct instance 
contains 8 int values):

struct simd_data_t {
     simd_int key;
     simd_int other;
};

int search(simd_data_t* data , int N) {
     for (int i = 0; i < N/L; ++i) {
         simd_int x = load(data[i].key);
         simd_int cmp = simd_and(simd_lt(4, x),
                                 simd_le(x, 8));
         int mask = simd_to_mask(cmp);
         if (mask != 0) {
             simd_int result = simd_and(mask , x);
             for (int j = 0; j < log2(L); j++)
                 result = simd_or(result ,
                                  whole_reg_shr(result , 1 << j));
             return simd_extract(result , 0);
         }
     }
     return -1;
}


D should become able to do this (which is not too bad), or 
better.
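
For readers without the intrinsics at hand, the block-wise search can be mimicked in portable C++ by emulating the lanes with a plain inner loop. This is only a sketch under assumptions: the lane count L = 8 is picked arbitrarily, std::array stands in for simd_int, and the function returns the first matching lane instead of using the OR-reduction from the listing above. Whether a compiler turns the inner loop into real SIMD instructions is not guaranteed.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t L = 8;  // assumed lane count, chosen for illustration

// Stand-in for the intrinsic-based struct: one block of L keys and L others.
struct simd_data_t {
    std::array<int, L> key;
    std::array<int, L> other;
};

// Returns the first key x with 4 < x <= 8, or -1 if no block contains one.
inline int search(const std::vector<simd_data_t>& data) {
    for (const simd_data_t& block : data) {
        // Build a bit mask with one bit per lane, like simd_to_mask would.
        unsigned mask = 0;
        for (std::size_t lane = 0; lane < L; ++lane) {
            const int x = block.key[lane];
            const bool hit = (4 < x) & (x <= 8);
            mask |= static_cast<unsigned>(hit) << lane;
        }
        if (mask != 0) {
            // Pick the lowest set lane; mask != 0 guarantees termination.
            std::size_t lane = 0;
            while (((mask >> lane) & 1u) == 0) ++lane;
            return block.key[lane];
        }
    }
    return -1;
}
```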


Their C language extensions allow writing nicer code like:

struct data_t {
     int key;
     int other;
};

int search(data_t *scalar data , int scalar N) {
     int L = lengthof(*data);
     for (int i = 0; i < N/L; ++i) {
         int x = data[i].key;
         if (4 < x & x <= 8)
             int block[L] result = [x, 0];
         scalar {
             for (int j = 0; j < log2(L); ++j)
                 result |= whole_reg_shr(result , 1 << j);
             return get(x, 0);
         }
     }
     return -1;
}


This is based on just a few simple ideas, explained in the paper 
(they are interesting, but quoting those parts of the paper here 
is not a good idea). Such ideas are not directly portable to D 
unless the front-end is changed: their compiler works by 
lowering, and emits regular C++ code with intrinsics.


Near the end of the paper they also propose some C++ library code:

"the C++ template mechanism would allow to define a hybrid SoA 
container class: Similar to std::vector which abstracts a 
traditional C array, one could implement a wrapper around a T 
block[N]*:"

// scalar context throughout this example
struct vec3 { float x, y, z; };
// vec3 block[N]* pointing to ceil(n/N) elements
hsoa <vec3 > vecs(n);
// preferred vector length of vec3 automatically derived
static const int N = hsoa <vec3 >::vector_length;
int i = /*...*/
hsoa <vec3 >::block_index ii = /*...*/
vec3 v = vecs[i]; // gather
vecs[i] = v; // scatter
vec3 block[N] w = vecs[ii]; // fetch whole block
hsoa <vec3 >::ref r = vecs[i]; // get proxy to a scalar
r = v; // pipe through proxy
// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });


Regardless of the other ideas of their C-like language, a similar 
struct should be added to Phobos once slightly higher-level SIMD 
support is in better shape in D. Supporting Hybrid-SoA and a few 
operations on it will be an important but probably quite short 
and simple addition to Phobos collections (it's essentially a 
struct that acts like an array, with a few simple extra 
operations).

I think no commonly used language allows both very simple and 
quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, 
Go, and currently Rust too, are not able to support SIMD 
programming well. I think currently Haskell too is not supporting 
it well, but Haskell is very flexible, and it's compiled by a 
native compiler, so such things are maybe possible to add). So 
supporting it well in D will be an interesting selling point of 
D. (Supporting very simple SIMD coding in D will make D more 
widespread, but this kind of programming will probably remain a 
small niche.)

Bye,
bearophile
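
As a rough idea of how small such a hybrid SoA container can be, here is a plain C++ sketch for the vec3 case discussed in the post above. It is not the paper's hsoa class: the name hsoa_vec3, the fixed block length N as a template parameter, and the get/set/xs/block_count members are all assumptions made for illustration, and there is no proxy reference or vectorized foreach.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct vec3 { float x, y, z; };

// Hypothetical hybrid-SoA container: elements are grouped into blocks of N,
// and each block stores its x, y and z values in separate small arrays.
template <std::size_t N>
class hsoa_vec3 {
    struct block { float x[N], y[N], z[N]; };
    std::vector<block> blocks_;
    std::size_t size_;

public:
    static constexpr std::size_t vector_length = N;

    // Allocates ceil(n / N) zero-initialized blocks.
    explicit hsoa_vec3(std::size_t n)
        : blocks_((n + N - 1) / N), size_(n) {}

    std::size_t size() const { return size_; }
    std::size_t block_count() const { return blocks_.size(); }

    // Gather: assemble one scalar vec3 from the three per-field arrays.
    vec3 get(std::size_t i) const {
        assert(i < size_);
        const block& b = blocks_[i / N];
        return {b.x[i % N], b.y[i % N], b.z[i % N]};
    }

    // Scatter: spread one scalar vec3 over the three per-field arrays.
    void set(std::size_t i, const vec3& v) {
        assert(i < size_);
        block& b = blocks_[i / N];
        b.x[i % N] = v.x;
        b.y[i % N] = v.y;
        b.z[i % N] = v.z;
    }

    // Contiguous x values of one block, ready to be loaded into a register.
    const float* xs(std::size_t block_index) const {
        assert(block_index < blocks_.size());
        return blocks_[block_index].x;
    }
};
```

The point of the layout is the last accessor: the N values of one field in a block are contiguous, so a whole block can be fetched with one vector load per field, while get and set keep scalar access available.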
Oct 23 2012
next sibling parent reply Don Clugston <dac nospam.com> writes:
On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.
Try compiling it in D.

test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &

Making that an error was such a good idea.
<g>
Oct 24 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.
 Try compiling it in D.

 test.d(8): Error: 4 < x must be parenthesized when next to operator &
 test.d(8): Error: x <= 8 must be parenthesized when next to operator &

 Making that an error was such a good idea.
 <g>

C's precedence rules are the same as in math in this case.
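
A quick way to check this is to compare the unparenthesized condition with the explicitly parenthesized one. The sketch below is C++ (the function names are made up for illustration); relational operators bind tighter than bitwise &, so both forms should agree for every x. Some compilers warn about the first form, which is in the same spirit as the D error quoted above.

```cpp
// Hypothetical helpers for comparing the two spellings of the condition.

// As written in the paper's example: parsed as (4 < x) & (x <= 8).
constexpr bool in_range_unparenthesized(int x) {
    return 4 < x & x <= 8;
}

// The spelling D requires.
constexpr bool in_range_parenthesized(int x) {
    return (4 < x) & (x <= 8);
}
```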
Oct 24 2012
parent Don Clugston <dac nospam.com> writes:
On 24/10/12 11:33, Timon Gehr wrote:
 On 10/24/2012 11:24 AM, Don Clugston wrote:
 On 24/10/12 04:41, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for Portable
 SIMD Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 They present a simple scalar program in C:

 struct data_t {
      int key;
      int other;
 };

 int search(data_t* data , int N) {
      for (int i = 0; i < N; i++) {
          int x = data[i].key;
          if (4 < x & x <= 8) return x;
      }
      return -1;
 }

I don't know what that code does. I think the if statement is always true.

No, the code is fine.

Oh, you're right. It's crap code though.
Oct 24 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Don Clugston:

 Making that an error was such a good idea.
 <g>

There are two other common sources of bugs that I'd like to see removed from D code:

http://d.puremagic.com/issues/show_bug.cgi?id=5409
http://d.puremagic.com/issues/show_bug.cgi?id=8757

Bye,
bearophile
Oct 24 2012
prev sibling next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:
 I have found a nice paper, "Extending a C-like Language for 
 Portable SIMD Programming", (2012), by Roland L., Sebastian 
 Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.
Oct 24 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/25/12 7:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-)

The part with respect for one and one's work applies right back at you.

Andrei
Oct 25 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/25/2012 4:13 AM, bearophile wrote:
 Manu:

 I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too much late to fix them :-) That if syntax (that is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized or scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

The interesting thing about SIMD code is that if you just read the data sheets for SIMD instructions, and write some SIMD code based on them, you're going to get lousy results. I know this from experience (see the array op SIMD implementations in the D runtime library).

Making SIMD code that delivers performance turns out to be a highly quirky and subtle exercise, one that is resistant to formalization. Despite the availability of SIMD hardware, there is a terrible lack of quality information on how to do it right on the internet by people who know what they're talking about.

Manu is on the daily front lines of doing competitive, real world SIMD programming. He leads a team doing SIMD work. Hence, I am going to strongly weight his opinions on any high level SIMD design constructs.

Interestingly, both of us have rejected the "auto-vectorization" approach popular in C/C++ compilers, for very different reasons.
Oct 25 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part of 
 the language standard and not as an extension where each vendor 
 does its own way.

D is, or is going to be, one such language :-)

Bye,
bearophile
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:

 I have found a nice paper, "Extending a C-like Language for Portable SIMD
 Programming", (2012), by Roland L., Sebastian Hack and Ingo Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

HLSL, GLSL, Cg? :)
I don't think it's possible considering that D is designed to plug on to various backends.
D already has what's required to do some fairly nice (by comparison) simd stuff with good supporting libraries.

One thing I can think of that would really improve simd (and not only simd) would be a way to define compound operators.
If the library could detect/hook sequences of operations and implement them more efficiently as a compound, that would make some very powerful optimisations available.

Simple example:
  T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
  T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { return _madd(b, c, a); }
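
The compound-operator idea above can be approximated today in C++ with a tiny expression-template trick: the multiplication returns a proxy instead of a value, and an addition that sees the proxy issues a fused multiply-add. This is only a sketch of the general technique, not the proposed D feature; the types scalar and product are invented for illustration, and std::fma stands in for the _madd intrinsic.

```cpp
#include <cmath>

// Hypothetical wrapper type standing in for a SIMD vector type.
struct scalar { float value; };

// Proxy remembering the operands of a not-yet-evaluated multiplication.
struct product {
    float a, b;
    // Falls back to a plain multiply when no addition follows.
    operator scalar() const { return {a * b}; }
};

inline product operator*(scalar a, scalar b) { return {a.value, b.value}; }

// "* +" sequence: a * b + c becomes one fused multiply-add.
inline scalar operator+(product p, scalar c) {
    return {std::fma(p.a, p.b, c.value)};
}

// "+ *" sequence: c + a * b is fused the same way.
inline scalar operator+(scalar c, product p) {
    return {std::fma(p.a, p.b, c.value)};
}

// Ordinary addition when no product is involved.
inline scalar operator+(scalar a, scalar b) { return {a.value + b.value}; }
```

The drawback is that every fusable sequence needs its own proxy type and overloads written by hand, which is the boilerplate a language-level hook would remove.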
Oct 24 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 D already has what's required to do some fairly nice (by 
 comparison) simd stuff with good supporting libraries.

After reading that paper I am not sure you are right. See how their language manages masks by itself. This is from page 3:

// vector length of context = 1; current_mask = T
int block[4] v = <0,3,4,1>;
int block[4] w = 3; // <3,3,3,3> via broadcast
bool block[4] m = v < w; // <T,F,F,T>
++v; // <1,4,5,2>
if (m) {
     // vector length of context = 4; current_mask = m
     v += 2; // <3,4,5,4>
} else {
     // vector length of context = 4; current_mask = ~m
     v += 3; // <3,7,8,4>
}
// vector length of context = 1; current_mask = T

(The simple benchmarks of the paper show a 5-15% performance loss compared to handwritten SIMD code.)

Bye,
bearophile
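
What the page-3 example does behind the scenes can be spelled out in plain C++ by applying each branch only to the lanes selected by the mask. The sketch below emulates four lanes with std::array; the names int4, bool4, masked_add and run_example are made up for illustration, and a real implementation would use blend or masked instructions instead of a loop.

```cpp
#include <array>
#include <cstddef>

using int4 = std::array<int, 4>;
using bool4 = std::array<bool, 4>;

// Hypothetical helper: adds amount to the lanes whose mask bit equals active.
inline void masked_add(int4& v, const bool4& mask, bool active, int amount) {
    for (std::size_t lane = 0; lane < v.size(); ++lane) {
        if (mask[lane] == active) v[lane] += amount;
    }
}

// Replays the example step by step and returns the final value of v.
inline int4 run_example() {
    int4 v = {{0, 3, 4, 1}};
    const int4 w = {{3, 3, 3, 3}};  // scalar 3 broadcast to all lanes
    bool4 m = {{false, false, false, false}};
    for (std::size_t lane = 0; lane < v.size(); ++lane) {
        m[lane] = v[lane] < w[lane];  // <T,F,F,T>
    }
    for (std::size_t lane = 0; lane < v.size(); ++lane) {
        ++v[lane];  // <1,4,5,2>
    }
    masked_add(v, m, true, 2);   // "if" branch, current_mask = m:   <3,4,5,4>
    masked_add(v, m, false, 3);  // "else" branch, current_mask = ~m: <3,7,8,4>
    return v;
}
```

Both branches run over the whole vector; the mask only decides which lanes keep the result, which is exactly the bookkeeping their type system does for the programmer.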
Oct 24 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:50:44 UTC, Manu wrote:
 On 24 October 2012 15:39, Paulo Pinto <pjmlp progtools.org> 
 wrote:

 On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile 
 wrote:

 I have found a nice paper, "Extending a C-like Language for 
 Portable SIMD
 Programming", (2012), by Roland L., Sebastian Hack and Ingo 
 Wald:

 http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

 SIMD programming is necessary in a system language, or in any 
 language
 that wants to use the modern CPUs well. So languages like C, 
 C++, D (and
 Mono-C#) support such wider registers.

 The authors of this paper have understood that it's also 
 important to
 make SIMD programming easy, almost as easy as scalar code, so 
 most
 programmers are able to write such kind of correct code.

 So this this paper presents ideas to better express SIMD 
 semantics in a
 C-like language. They introduce few new constructs in a large 
 subset of C
 language, with few ideas. The result coding patterns seem 
 easy enough (they
 are surely look simpler than most multi-core coding patterns 
 I've seen,
 including Cilk+).


 They present a simple scalar program in C:

 struct data_t {
     int key;
     int other;
 };

 int search(data_t* data , int N) {
     for (int i = 0; i < N; i++) {
         int x = data[i].key;
         if (4 < x & x <= 8) return x;
     }
     return -1;
 }


 Then they explain the three most common ways to represent an 
 array of
 structs, here a struct that contains 3 values:

 x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 x6 y6 
 z6 x7 y7 z7
 (a) Array of Structures (AoS)

 x0 x1 x2 x3 x4 x5 x6 x7   y0 y1 y2 y3 y4 y5 y6 y7   z0 z1 z2 
 z3 z4 z5 z6
 z7
 (b) Structure of Arrays (SoA)

 x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 x6 x7 y4 y5 y6 y7 
 z4 z5 z6 z7
 (c) Hybrid Structure of Arrays (Hybrid SoA)

 They explain how the (c) is the preferred pattern in SIMD 
 programming.


 Using the (c) data pattern they show how in C with (nice) 
 SIMD intrinsics
 you write vectorized code (a simd_data_t struct instance 
 contains 8 int
 values):

 struct simd_data_t {
     simd_int key;
     simd_int other;
 };

 int search(simd_data_t* data , int N) {
     for (int i = 0; i < N/L; ++i) {
         simd_int x = load(data[i].key);
         simd_int cmp = simd_and(simd_lt(4, x),
         simd_le(x, 8));
         int mask = simd_to_mask(cmp);
         if (mask != 0) {
             simd_int result = simd_and(mask , x);
             for (int j = 0; j < log2(L); j++)
                 result = simd_or(result ,
                 whole_reg_shr(result , 1 << j));
                 return simd_extract(result , 0);
             }
         }
     return -1;
 }


 D should do become able to do this (that is not too much 
 bad), or better.


 Their C language extensions allow to write nicer code like:

 struct data_t {
     int key;
     int other;
 };

 int search(data_t *scalar data , int scalar N) {
     int L = lengthof(*data);
     for (int i = 0; i < N/L; ++i) {
         int x = data[i].key;
         if (4 < x & x <= 8)
             int block[L] result = [x, 0];
         scalar {
             for (int j = 0; j < log2(L); ++j)
                 result |= whole_reg_shr(result , 1 << j);
             return get(x, 0);
         }
     }
     return -1;
 }


 This is based on just few simple ideas, explained in the 
 paper (they are
 interesting, but quoting here those parts of the paper is not 
 a good idea).
 Such ideas are not directly portable to D (unless the 
 front-end is changed.
 Their compiler works by lowering, and emits regular C++ code 
 with
 intrinsics).


 Near the end of the paper they also propose some C++ library 
 code:

  the C++ template mechanism would allow to define a hybrid 
 SoA container
 class: Similar to std::vector which abstracts a traditional 
 C array, one
 could implement a wrapper around a T block[N]*:<

 // scalar context throughout this example
 struct vec3 { float x, y, z; };
 // vec3 block[N]* pointing to ceil(n/N) elements
 hsoa <vec3 > vecs(n);
 // preferred vector length of vec3 automatically derived
 static const int N = hsoa <vec3 >::vector_length;
 int i = /*...*/
 hsoa <vec3 >::block_index ii = /*...*/
 vec3 v = vecs[i]; // gather
 vecs[i] = v; // scatter
 vec3 block[N] w = vecs[ii]; // fetch whole block
 hsoa <vec3 >::ref r = vecs[i]; // get proxy to a scalar
 r = v; // pipe through proxy
 // for each element
 vecs.foreach([](vec3& scalar v) { /*...*/ });

 Regardless of the other ideas of their C-like language, a similar struct should be added to Phobos once a bit higher level SIMD support is in better shape in D. Supporting Hybrid-SoA and few operations on it will be an important but probably quite short and simple addition to Phobos collections (it's essentially an struct that acts like an array, with few simple extra operations).

 I think no commonly used language allows both very simple and quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, Go, and currently Rust too, are not able to support SIMD programming well. I think currently Haskell too is not supporting it well, but Haskell is very flexible, and it's compiled by a native compiler, so such things are maybe possible to add). So supporting it well in D will be an interesting selling point of D. (Supporting a very simple SIMD coding in D will make D more widespread, but such kind of programming will probably keep being a small niche).

 Bye,
 bearophile
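The gather and scatter in the quoted hsoa container come down to index arithmetic over blocks. A rough sketch in plain C, assuming a block width of 4; `vec3_block`, `hsoa_get` and `hsoa_set` are hypothetical names and are not taken from the paper:

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* hypothetical block width (lanes per SIMD register) */

/* One hybrid SoA block: N x values, then N y values, then N z values. */
struct vec3_block {
    float x[N];
    float y[N];
    float z[N];
};

struct vec3 { float x, y, z; };

/* Gather: read logical element i out of the blocked storage. */
struct vec3 hsoa_get(const struct vec3_block *blocks, size_t i)
{
    assert(blocks != NULL);
    const struct vec3_block *b = &blocks[i / N];  /* which block */
    size_t lane = i % N;                          /* which lane inside it */
    struct vec3 v = { b->x[lane], b->y[lane], b->z[lane] };
    return v;
}

/* Scatter: write logical element i back into the blocked storage. */
void hsoa_set(struct vec3_block *blocks, size_t i, struct vec3 v)
{
    assert(blocks != NULL);
    struct vec3_block *b = &blocks[i / N];
    size_t lane = i % N;
    b->x[lane] = v.x;
    b->y[lane] = v.y;
    b->z[lane] = v.z;
}
```

With N = 4, element 5 lives in block 1, lane 1. Fetching a whole block, as in `vecs[ii]` above, is then just reading one `vec3_block`.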

 Actually, I am yet to see any language that has SIMD as part of the language standard and not as an extension where each vendor does its own way.

 HLSL, GLSL, Cg? :)

I was thinking about general purpose programming languages, not domain specific ones.

--
Paulo
Oct 24 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }

It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:

float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

With gdc with flags -O2 -fma, you get:

0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1          vfmadd132ps xmm0,xmm2,xmm1
   5:   c3                      ret
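The same experiment can be tried from C. This is only a sketch: `madd4` is a made-up name, and whether the multiply and add end up as a single fused instruction depends on the compiler, the target and the flags (on x86, GCC spells the FMA target option `-mfma`, and floating-point contraction has to be permitted).

```c
#include <assert.h>

/* Plain multiply-add over four lanes. No FMA intrinsic is named anywhere;
   any fusion is left entirely to the compiler back end. */
void madd4(const float a[4], const float b[4], const float c[4], float out[4])
{
    assert(a && b && c && out);
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i] + c[i];  /* candidate for fusion into one FMA */
}
```

Looking at the generated assembly (for example with `objdump -d`) is the only reliable way to see what the back end actually did.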
Oct 24 2012
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 October 2012 at 12:47:38 UTC, bearophile wrote:
 Paulo Pinto:

 Actually, I am yet to see any language that has SIMD as part 
 of the language standard and not as an extension where each 
 vendor does its own way.

 D is, or is going to be, one such language :-)

 Bye,
 bearophile

Is it so? From what I can understand, the SIMD calls depend on which D compiler is being used. That doesn't look like part of the language standard to me. :(

--
Paulo
Oct 24 2012
prev sibling next sibling parent "F i L" <witte2008 gmail.com> writes:
Manu wrote:
 One thing I can think of that would really improve simd (and 
 not only simd)
 would be a way to define compound operators.
 If the library could detect/hook sequences of operations and 
 implement them
 more efficiently as a compound, that would make some very 
 powerful
 optimisations available.

 Simple example:
   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { 
 return
 _madd(a, b, c); }
   T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { 
 return
 _madd(b, c, a); }

I thought about that before and it might be nice to have that level of control in the language, but ultimately, like jerro said, I think it would be better suited for the compiler's backend optimization. Unfortunately I don't think more complex patterns, such as Matrix multiplications, are found and optimized by GCC/LLVM... I could be wrong, but these are areas where my hand-tuned code always outperforms basic math code.

I think having that in the back-end makes a lot of sense, because your code is easier to read and understand, without sacrificing performance. Plus, it would be difficult to map a sequence as complex as matrix multiplication to a single compound operator. That being said, I do think something similar would be useful in general:

struct Vector
{
    ...

    static float distance(Vector a, Vector b) {...}
    static float distanceSquared(Vector a, Vector b) {...}

    float opSequence(string funcs...)(Vector a, Vector b)
        if (funcs[0] == "Math.sqrt" &&
            funcs[1] == "Vector.distance")
    {
        return distanceSquared(a, b);
    }
}

void main()
{
    auto a = Vector.random( ... );
    auto b = Vector.random( ... );

    // Below is turned into a 'distanceSquared()' call
    float dis = Math.sqrt(Vector.distance(a, b));
}

Since distance requires a 'Math.sqrt()', this pseudo-code could avoid the operation entirely by calling 'distanceSquared()' even if the programmer is a noob and doesn't know to do it explicitly.
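For comparison, this is roughly what doing it explicitly looks like today. It is a minimal C sketch with made-up names (`vector3`, `distance_squared`, `within_radius`), and it covers only the common radius test, where the square root can be dropped by comparing squared quantities.

```c
#include <assert.h>
#include <stdbool.h>

struct vector3 { float x, y, z; };

/* Squared Euclidean distance: three subtractions, three multiplies, no sqrt. */
float distance_squared(struct vector3 a, struct vector3 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* For radius >= 0, distance(a, b) <= radius holds exactly when
   distance_squared(a, b) <= radius * radius, so sqrt is never called. */
bool within_radius(struct vector3 a, struct vector3 b, float radius)
{
    assert(radius >= 0.0f);
    return distance_squared(a, b) <= radius * radius;
}
```

An `opSequence`-style hook would let the library apply this kind of substitution for the caller; here the programmer has to know to call the squared version.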
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 24 October 2012 16:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

  D already has what's required to do some fairly nice (by comparison) simd
  stuff with good supporting libraries.

 After reading that paper I am not sure you are right. See how their language manages masks by itself. This is from page 3:

 // vector length of context = 1; current_mask = T
 int block[4] v = <0,3,4,1>;
 int block[4] w = 3; // <3,3,3,3> via broadcast
 bool block[4] m = v < w; // <T,F,F,T>
 ++v; // <1,4,5,2>
 if (m) {
     // vector length of context = 4; current_mask = m
     v += 2; // <3,4,5,4>
 } else {
     // vector length of context = 4; current_mask = ~m
     v += 3; // <3,7,8,4>
 }
 // vector length of context = 1; current_mask = T

I agree that if is kinda neat, but it's probably not out of the question for future extension. All the other stuff here is possible.
That said, it's not necessarily optimal either, just conveniently written. The compiler would have to do some serious magic to optimise that; flattening both sides of the if into parallel expressions, and then applying the mask to combine...
I'm personally not in favour of SIMD constructs that are anything less than optimal (but I appreciate I'm probably in the minority here).

 (The simple benchmarks of the paper show a 5-15% performance loss compared
 to handwritten SIMD code.)

Right, as I suspected.
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 24 October 2012 18:12, jerro <a a.com> wrote:

  Simple example:
    T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return
  _madd(a, b, c); }

 It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:

 float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

 With gdc with flags -O2 -fma, you get:

 0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
    0:   c4 e2 69 98 c1          vfmadd132ps xmm0,xmm2,xmm1
    5:   c3                      ret

Right, I suspected GDC might do that, but it was just an example. You can extend that to many more complicated scenarios.
What does it do on less mature architectures like MIPS, PPC, ARM?
Oct 24 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 The compiler would have to do some serious magic to optimise 
 that;
 flattening both sides of the if into parallel expressions, and 
 then applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things clearly not efficient.
 I'm personally not in favour of SIMD constructs that are 
 anything less than
 optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance 
 loss compared
 to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss, if for the programmer the alternative is writing scalar code, that is 2 or 3 times slower :-)

The SIMD programmers that can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and they ignore all other things.

A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill, which means their programs are too often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write C/C++ scalar code (or Fortran), add "restrict" annotations, and take a look at the produced asm hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)

Bye,
bearophile
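The "write scalar code, add restrict, and hope" workflow looks roughly like this in C. `add_arrays` is a hypothetical example; nothing in it forces vectorization, which is exactly the weakness described above.

```c
#include <assert.h>
#include <stddef.h>

/* `restrict` promises the compiler that dst, a and b never overlap,
   which removes one common obstacle to auto-vectorizing the loop.
   Whether SIMD instructions are actually emitted is up to the back end. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    assert(dst && a && b);
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The source says nothing about lane count or layout, so the only way to know whether the loop was vectorized is to read the generated assembly, as the post describes.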
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

  The compiler would have to do some serious magic to optimise that;
  flattening both sides of the if into parallel expressions, and then
  applying the mask to combine...

 I think it's a small amount of magic.

 The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things clearly not efficient.

  I'm personally not in favour of SIMD constructs that are anything less
  than optimal (but I appreciate I'm probably in the minority here).

   (The simple benchmarks of the paper show a 5-15% performance loss compared
   to handwritten SIMD code.)

  Right, as I suspected.

 15% is a very small performance loss, if for the programmer the alternative is writing scalar code, that is 2 or 3 times slower :-)

 The SIMD programmers that can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and they ignore all other things.

 A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill, which means their programs are too often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write C/C++ scalar code (or Fortran), add "restrict" annotations, and take a look at the produced asm hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.

 This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)

I don't disagree with you, it is fairly cool!
I can't imagine D adopting those sort of language features any time soon, but it's probably possible.
I guess the keys are defining the bool vector concept, and some tech to flatten both sides of a vector if statement, but that's far from simple... Particularly so if someone puts some unrelated code in those if blocks.
Chances are it offers too much freedom that wouldn't be well used or understood by the average programmer, and that still leaves you in a similar land of only being particularly worthwhile in the hands of a fairly advanced/competent user.
The main error that most people make is thinking SIMD code is faster by nature. Truth is, in the hands of someone who doesn't know precisely what they're doing, SIMD code is almost always slower.
There are some cool new expressions offered here, fairly convenient (although easy[er?] to write in other ways too), but I don't think it would likely change that fundamental premise for the average programmer beyond some very simple parallel constructs that the compiler can easily get right.
I'd certainly love to see it, but is it realistic that someone would take the time to do all of that any time soon when benefits are controversial? It may even open the possibility for un-skilled people to write far worse code.

Let's consider your example above for instance, I would rewrite (given existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]


Or the whole thing further simplified:
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3)); // combine the prior few lines

I actually find this more convenient. I also find the if syntax you demonstrate to be rather deceptive and possibly misleading. 'if' suggests a branch, whereas the construct you demonstrate will evaluate both sides every time. Inexperienced programmers may not really grasp that. Evaluating the true side and the false side inline, and then performing the select serially is more honest; it's actually what the computer will do, and I don't really see it being particularly less convenient either.
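To make the "ones and zeroes" mask convention concrete, here is a sketch of the compare-then-select pattern in plain C. `mask_less` and `select4` are illustrative stand-ins for the `maskLess` and `select` helpers named above, not an existing library API, and the lanes are emulated with arrays.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* hypothetical vector length */

/* Per-lane compare: all ones where a < b, all zeroes otherwise. */
void mask_less(const int32_t a[LANES], const int32_t b[LANES],
               uint32_t m[LANES])
{
    assert(a && b && m);
    for (int i = 0; i < LANES; ++i)
        m[i] = a[i] < b[i] ? UINT32_MAX : 0u;
}

/* Branch-free blend: keep t where the mask is set, f elsewhere.
   Both inputs have already been fully computed by the caller. */
void select4(const uint32_t m[LANES], const int32_t t[LANES],
             const int32_t f[LANES], int32_t out[LANES])
{
    assert(m && t && f && out);
    for (int i = 0; i < LANES; ++i)
        out[i] = (int32_t)((m[i] & (uint32_t)t[i]) | (~m[i] & (uint32_t)f[i]));
}
```

Feeding it the values from the example (mask from [0,3,4,1] < [3,3,3,3], true side [3,6,7,4], false side [4,7,8,5]) gives [3,7,8,4]. The and/or form makes it visible that no branch is involved.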
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 01:00, bearophile <bearophileHUGS lycos.com> wrote:
 Manu:


 The compiler would have to do some serious magic to optimise that;
 flattening both sides of the if into parallel expressions, and then
 applying the mask to combine...

I think it's a small amount of magic. The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things clearly not efficient.
 I'm personally not in favour of SIMD constructs that are anything less
 than
 optimal (but I appreciate I'm probably in the minority here).


 (The simple benchmarks of the paper show a 5-15% performance loss
 compared
 to handwritten SIMD code.)

Right, as I suspected.

 15% is a very small performance loss, if for the programmer the alternative is writing scalar code, that is 2 or 3 times slower :-)

 The SIMD programmers that can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and they ignore all other things.

 A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill, which means their programs are too often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write C/C++ scalar code (or Fortran), add "restrict" annotations, and take a look at the produced asm hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.

 This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)

I don't disagree with you, it is fairly cool! I can't imagine D adopting those sorts of language features any time soon, but it's probably possible. I guess the keys are defining the bool vector concept, and some tech to flatten both sides of a vector if statement, but that's far from simple... Particularly so if someone puts some unrelated code in those if blocks.

Chances are it offers too much freedom that wouldn't be well used or understood by the average programmer, and that still leaves you in a similar land of only being particularly worthwhile in the hands of a fairly advanced/competent user.

The main error that most people make is thinking SIMD code is faster by nature. Truth is, in the hands of someone who doesn't know precisely what they're doing, SIMD code is almost always slower. There are some cool new expressions offered here, fairly convenient (although easy[er?] to write in other ways too), but I don't think it would likely change that fundamental premise for the average programmer beyond some very simple parallel constructs that the compiler can easily get right.

I'd certainly love to see it, but is it realistic that someone would take the time to do all of that any time soon when the benefits are controversial? It may even open the possibility for unskilled people to write far worse code.

Let's consider your example above for instance, I would rewrite (given existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

....

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
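For readers following the rewrite quoted above, here is a minimal sketch of what the mask-and-select sequence computes. It is a scalar model only: `Int4` is a plain `int[4]`, and `maskLess`, `select` and `rewrittenIf` are hypothetical helpers written for this illustration, not functions from core.simd or any posted library. Real SIMD code would map each helper to a single compare or blend instruction.

```d
// Scalar model of a 4-lane integer vector (assumption: not real SIMD).
alias Int4 = int[4];

// Hypothetical helper: all ones (-1) in lanes where a < b, zero elsewhere.
Int4 maskLess(Int4 a, Int4 b)
{
    Int4 m;
    foreach (i; 0 .. 4)
        m[i] = a[i] < b[i] ? -1 : 0;
    return m;
}

// Hypothetical helper: per-lane blend, t where the mask is set, f otherwise.
Int4 select(Int4 m, Int4 t, Int4 f)
{
    Int4 r;
    foreach (i; 0 .. 4)
        r[i] = (t[i] & m[i]) | (f[i] & ~m[i]);
    return r;
}

// The example from the thread, one step per line.
Int4 rewrittenIf()
{
    Int4 v = [0, 3, 4, 1];
    Int4 w = 3;                  // broadcast: [3,3,3,3]
    Int4 m = maskLess(v, w);     // [T,F,F,T]
    v[] += 1;                    // [1,4,5,2]

    Int4 trueSide, falseSide;
    trueSide[] = v[] + 2;        // [3,6,7,4], computed for every lane
    falseSide[] = v[] + 3;       // [4,7,8,5], computed for every lane
    return select(m, trueSide, falseSide);
}
```

Both sides are computed for all four lanes before the blend, which is the point made earlier in the thread about the `if` syntax hiding that cost.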
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

Probably, just wasn't sure.
Oct 24 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
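As a point of comparison, this is what the array-operation form looks like on an ordinary `int[4]`. It is only a sketch: `addScalar` is a hypothetical name, and whether a given compiler accepts the same `v + 2` shorthand directly on `int4` is left as an assumption here, as it was in the thread.

```d
// Element-wise addition of a scalar using ordinary D array operations.
int[4] addScalar(int[4] v, int s)
{
    int[4] r;
    r[] = v[] + s;   // s is applied to every element
    return r;
}
```

With `v = [1,4,5,2]`, `addScalar(v, 2)` gives the `trueSide` value from the example and `addScalar(v, 3)` gives `falseSide`.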
Oct 24 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(
Oct 25 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite
 (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through .array[] property, which isn't an lvalue.

Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Oct 25 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 13:38, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 October 2012 09:36, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:18, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 25 October 2012 00:16, Manu <turkeyman gmail.com> wrote:
 On 25 October 2012 02:01, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 24 October 2012 23:46, Manu <turkeyman gmail.com> wrote:

 Let's consider your example above for instance, I would rewrite
 (given
 existing syntax):

 // vector length of context = 1; current_mask = T
 int4 v = [0,3,4,1];
 int4 w = 3; // [3,3,3,3] via broadcast
 uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
 v += int4(1); // [1,4,5,2]

 // the if block is trivially rewritten:
 int4 trueSide = v + int4(2);
 int4 falseSide = v + int4(3);
 v = select(m, trueSide, falseSide); // [3,7,8,4]

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;

Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)

I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no. Only through .array[] property, which isn't an lvalue.

Yeah, good. That's how I thought it was :)

Let me rewrite it again then:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
v = selectLess(v, w, v + 3, v + 4); // combine the prior few lines: v < w = [T,F,F,T] -> [0+3, 3+4, 4+4, 1+3] == [3,7,8,4]

I think this is far more convenient than any crazy 'if' syntax :) .. It's also perfectly optimal on all architectures I know as well!
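A minimal sketch of the semantics `selectLess` is assumed to have in the post above, again as a scalar model on `int[4]`. The signature and the helper `combinedExample` are guesses made for this illustration; no implementation of `selectLess` was posted in the thread.

```d
// Scalar model of a 4-lane integer vector (assumption: not real SIMD).
alias Int4 = int[4];

// Hypothetical: per lane, returns t where a < b and f otherwise.
// The caller has already computed t and f for every lane.
Int4 selectLess(Int4 a, Int4 b, Int4 t, Int4 f)
{
    Int4 r;
    foreach (i; 0 .. 4)
        r[i] = a[i] < b[i] ? t[i] : f[i];
    return r;
}

// The one-line rewrite from the post, spelled out.
Int4 combinedExample()
{
    Int4 v = [0, 3, 4, 1];
    Int4 w = 3;              // broadcast: [3,3,3,3]
    Int4 t, f;
    t[] = v[] + 3;           // taken where v < w
    f[] = v[] + 4;           // taken elsewhere
    return selectLess(v, w, t, f);
}
```

Folding the compare and the blend into one call keeps the mask out of user code, at the price of one such function per comparison operator.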
Oct 25 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 I think this is far more convenient than any crazy 'if' syntax 
 :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too late to fix them :-)

That if syntax (which is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used in both a vectorized and a scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

Bye,
bearophile
Oct 25 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 25 October 2012 14:13, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:


  I think this is far more convenient than any crazy 'if' syntax :) .. It's
 also perfectly optimal on all architectures I know aswell!

You should show more respect for them and their work. Their ideas seem very far from being crazy. They have also proved their type system to be sound. This kind of work is lightyears ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too late to fix them :-)

I think I said numerous times in my former email that it's really cool, and certainly very interesting. I just can't imagine it appearing in D any time soon. We do have some ways to conveniently do lots of that stuff right now, and make some improvement on other competing languages in the area. I'd like to see more realistic case studies of their approach where it significantly simplifies particular workloads.

 That if syntax (that is integrated in a type system that manages the masks,
 plus implicit polymorphism that allows the same function to be used both in
 a vectorized or scalar context) works with larger amounts of code too,
 while you are just doing a differential assignment.

And that's likely where it all starts getting very complicated. If the branches start doing significant (and unbalanced) work, an unskilled programmer will have a lot of trouble understanding what sort of mess they may be making. And as usual, x86 will be the most tolerant, so they may not even know when profiling.

I've said before, it's very interesting, but it also sounds potentially very dangerous. It's probably also an awful lot of work, I'd wager... I doubt we'll see those expressions any time soon.
Oct 25 2012
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 Making SIMD code that delivers performance turns out to be a 
 highly quirky and subtle exercise, one that is resistant to 
 formalization.

I have written some SIMD code, with mixed results, so I understand part of such problems, although my total experience with such things is limited. Despite those problems and their failures, I think it's important to support computer scientists who try to invent languages that offer medium-level means to write this kind of code :-) Reading and studying CS papers is important.
 Manu is on the daily front lines of doing competitive, real 
 world SIMD programming. He leads a team doing SIMD work. Hence, 
 I am going to strongly weight his opinions on any high level 
 SIMD design constructs.

I respect both Manu and his work (and you Walter are the one at the top of my list of programming heroes).
 Interestingly, both of us have rejected the 
 "auto-vectorization" approach popular in C/C++ compilers, for 
 very different reasons.

The authors of that paper too have rejected it. It doesn't give enough semantics to the compilers. They have explored a different solution.

Bye,
bearophile
Oct 25 2012