
digitalmars.D - Reduce has dreadful performance?

reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
  auto sum = 0.0;
  foreach (immutable i; 1 .. n + 1) {
    immutable x = (i - 0.5) * delta;
    sum += 1.0 / (1.0 + x * x);
  }
  return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
  return 4.0 * delta * reduce!((double t, int i){
    immutable x = (i - 0.5) * delta;
    return t + 1.0 / (1.0 + x * x);
  })(0.0, iota(1, n + 1));
}

runs in about 17.03s, whilst:

double sequential_reduce_alt(const int n, const double delta) {
  return 4.0 * delta * reduce!"a + b"(
      map!((int i){
        immutable x = (i - 0.5) * delta;
        return 1.0 / (1.0 + x * x);
      })(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very possible), this
is not going to create a good advert for D as an imperative language
with declarative (internal iteration) expression.
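(For context: if delta = 1.0/n, this sum is the midpoint-rule approximation of the integral of 4/(1 + x^2) over [0, 1], which equals pi, so all three variants should converge to pi. A quick sanity-check sketch, not from the original post; the value of n here is illustrative, not the one used for the timings:)

```d
import std.math : PI, abs;
import std.stdio : writeln;

double sequential_loop(const int n, const double delta) {
  auto sum = 0.0;
  foreach (immutable i; 1 .. n + 1) {
    immutable x = (i - 0.5) * delta;
    sum += 1.0 / (1.0 + x * x);
  }
  return 4.0 * delta * sum;
}

void main() {
  immutable n = 1_000_000;  // illustrative only
  immutable approx = sequential_loop(n, 1.0 / n);
  writeln(abs(approx - PI) < 1e-6);  // midpoint rule converges to pi
}
```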


--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jun 18 2015
next sibling parent reply "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
which compiler ?
Jun 18 2015
parent "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
On Thursday, 18 June 2015 at 10:34:46 UTC, Laeeth Isharc wrote:
 On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
 which compiler ?
fwiw, n = 1bn, delta = 0.1:

ldc no params: 8.749s / 17.466s
ldc -O2:       4.541s / 4.562s
gdc no params: 4.690s / 22.163s (!)
gdc -O2:       4.552s / 4.551s
Jun 18 2015
prev sibling next sibling parent "weaselcat" <weaselcat gmail.com> writes:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
The first two run in the same time on LDC; the third one runs about 30-40% faster.
Jun 18 2015
prev sibling next sibling parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
import std.stdio, std.datetime, std.algorithm, std.range, std.conv;

double sequential_loop(const int n, const double delta) {
  auto sum = 0.0;
  foreach (immutable i; 1 .. n + 1) {
    immutable x = (i - 0.5) * delta;
    sum += 1.0 / (1.0 + x * x);
  }
  return 4.0 * delta * sum;
}
// runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
  return 4.0 * delta * reduce!((double t, int i){
    immutable x = (i - 0.5) * delta;
    return t + 1.0 / (1.0 + x * x);
  })(0.0, iota(1, n + 1));
}
// runs in about 17.03s, whilst:

double sequential_reduce_alt(const int n, const double delta) {
  return 4.0 * delta * reduce!"a + b"(
      map!((int i){
        immutable x = (i - 0.5) * delta;
        return 1.0 / (1.0 + x * x);
      })(iota(1, n + 1)));
}

void main() {
  auto res = benchmark!(
    { sequential_loop(1000, 10); },
    { sequential_reduce(1000, 10); },
    { sequential_reduce_alt(1000, 10); },
  )(1000);
  writeln(res[].map!(to!Duration));
}

$ dmd -run test.d
[9 ms, 305 μs, and 9 hnsecs, 16 ms, 625 μs, and 3 hnsecs, 27 ms, 417 μs, and 3 hnsecs]

$ dmd -O -inline -release -run test.d
[4 ms, 567 μs, and 6 hnsecs, 4 ms, 853 μs, and 8 hnsecs, 5 ms, 52 μs, and 5 hnsecs]

OS X, DMD64 D Compiler v2.067.1

Looks like you forgot the optimisation troika "-O -inline -release".

Regards,
Ilya
Jun 18 2015
next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
On Thursday, 18 June 2015 at 10:46:18 UTC, Ilya Yaroshenko wrote:
 On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
What about this:

double sequential_alternative(const int n, const double delta) {
  return 4.0 * delta * iota(1, n + 1)
      .map!(x => (x - 0.5) * delta)
      .map!(x => 1.0 / (1.0 + x * x))
      .sum;
}

?
Jun 18 2015
parent "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Thursday, 18 June 2015 at 10:50:09 UTC, Andrea Fontana wrote:
 On Thursday, 18 June 2015 at 10:46:18 UTC, Ilya Yaroshenko wrote:
 [...]
 What about this:

 double sequential_alternative(const int n, const double delta) {
   return 4.0 * delta * iota(1, n + 1)
       .map!(x => (x - 0.5) * delta)
       .map!(x => 1.0 / (1.0 + x * x))
       .sum;
 }

 ?
This is not equivalent, because sum uses a different (more precise) summation algorithm. It should be a little bit slower.
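The precision difference is easy to see in isolation; a sketch (not from the thread, and no exact figures claimed since the drift depends on the platform):

```d
import std.algorithm.iteration : reduce, sum;
import std.range : repeat;
import std.stdio : writeln;

void main() {
  // One million copies of 0.1f: naive left-to-right accumulation in float
  // drifts visibly from 100_000, while sum's pairwise summation (used for
  // random-access ranges with length) stays much closer.
  auto xs = repeat(0.1f, 1_000_000);
  immutable naive = reduce!"a + b"(0.0f, xs);
  immutable pairwise = xs.sum;
  writeln(naive, " vs ", pairwise);
}
```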
Jun 18 2015
prev sibling parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via Digitalmars-d
wrote:
[…]
 Looks like you forgot optimisation troika "-O -inline -release"

Don't I feel stupid :-) Thanks to all for quick answers, most helpful.

Now I get:

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067.

Anyone know how to get PyD to use LDC rather than DMD in the setuptools
build spec? (I can't use GDC just now as it is not packaged for Fedora.)
Jun 18 2015
next sibling parent reply "Laeeth Isharc" <nospamlaeeth laeethnospam.laeeth.com> writes:
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
 On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via 
 Digitalmars-d wrote:
 
[…]
 [...]
 Anyone know how to get PyD to use LDC rather than DMD in the setuptools
 build spec? (I can't use GDC just now as it is not packaged for Fedora.)
python setup.py pydexe --compiler=ldc
Jun 18 2015
next sibling parent "Laeeth Isharc" <nospamlaeeth laeethnospam.laeeth.com> writes:
On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
 On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
 [...]
 python setup.py pydexe --compiler=ldc
GDC doesn't work anyway, I think, because of the shared library problem. It would be great to have a dub setup so one can build in a standard way without using python.
Jun 18 2015
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
 On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
 [...]
 python setup.py pydexe --compiler=ldc
Just a note: if you're using PydMagic you can pass --compiler=ldc to %%pyd
Jun 19 2015
prev sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2015-06-18 at 20:18 +0000, Laeeth Isharc via Digitalmars-d wrote:
 […]
 python setup.py pydexe --compiler=ldc

I now have this working, and reduce is no longer an embarrassment. Also we
have reaffirmed that LDC is the tool of choice and DMD unusable. But this
surprises no-one involved with CPU-bound activity, I suspect.
Jun 19 2015
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
 Now I get

 Loop: 3.14s
 Reduce 1: 4.76s
 Reduce 2: 5.12s
I expect that at the moment, range+algorithms code will likely be somewhat
slower than old fashioned loops. This is because code generators have been
tuned for decades to do a great job with loops.

There's no intrinsic reason why ranges must do worse, so I expect they'll
achieve parity. Ranges can move ahead because they can reduce the
algorithmic complexity, whereas user written loops tend to be suboptimal.
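As an illustration of the equivalence the optimizer has to discover (a sketch, not from the thread; both versions accumulate in the same left-to-right order, so the results match exactly):

```d
import std.algorithm.iteration : map, reduce;
import std.range : iota;
import std.stdio : writeln;

// Hand-written loop: the form code generators have been tuned for.
double loopVersion(int n) {
  double s = 0.0;
  foreach (i; 0 .. n)
    s += i * 0.5;
  return s;
}

// The same computation as a range pipeline; nothing intrinsic stops a
// compiler from lowering this to the identical loop.
double rangeVersion(int n) {
  return reduce!"a + b"(0.0, iota(0, n).map!(i => i * 0.5));
}

void main() {
  writeln(loopVersion(1000) == rangeVersion(1000));
}
```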
Jun 18 2015
next sibling parent reply "weaselcat" <weaselcat gmail.com> writes:
On Thursday, 18 June 2015 at 20:53:20 UTC, Walter Bright wrote:
 On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
 [...]
 There's no intrinsic reason why ranges must do worse, so I expect
 they'll achieve parity.
ldc and gdc have no issue optimizing the range code (the range code is actually faster with ldc).
Jun 18 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/15 2:04 PM, weaselcat wrote:
 On Thursday, 18 June 2015 at 20:53:20 UTC, Walter Bright wrote:
 [...]
 There's no intrinsic reason why ranges must do worse, so I expect
 they'll achieve parity.
ldc and gdc have no issue optimizing the range code (the range code is actually faster with ldc).
Russel, do you have numbers for ldc by any chance? Thx! -- Andrei
Jun 18 2015
parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2015-06-18 at 15:54 -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 […]
 Russel, do you have numbers for ldc by any chance? Thx! -- Andrei
Single data point only, and I need to check this, but:

Loop: 1.44s
Reduce 1: 1.43s
Reduce 2: 1.42s

LDC seems to beat DMD into the ground. But then I think we all know this ;-)

The PyD setuptools support appears to not properly set RPATH, which is a
bit annoying. Actually it is a lot annoying.
Jun 19 2015
parent "weaselcat" <weaselcat gmail.com> writes:
On Saturday, 20 June 2015 at 05:55:07 UTC, Russel Winder wrote:
 On Thu, 2015-06-18 at 15:54 -0700, Andrei Alexandrescu via 
 Digitalmars-d wrote:
 […]
 
 Russel, do you have numbers for ldc by any chance? Thx! -- 
 Andrei
 Single data point only, and I need to check this, but:

 Loop: 1.44s
 Reduce 1: 1.43s
 Reduce 2: 1.42s
 [...]
Interesting, I saw a bigger speedup from the ranges. Do you use Windows? What LLVM/LDC version? Which flags?
Jun 20 2015
prev sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2015-06-18 at 13:53 -0700, Walter Bright via Digitalmars-d wrote:
 […]
 There's no intrinsic reason why ranges must do worse, so I expect
 they'll achieve parity.

reduce has already achieved parity using LDC – once you install it properly,
unlike what I had been doing. Explicit loops are just so last millennium. :-)

PyData London 2015 here we come.
Jun 19 2015
prev sibling parent "Martin Nowak" <code dawg.eu> writes:
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
 Loop: 3.14s
 Reduce 1: 4.76s
 Reduce 2: 5.12s

 This is DMD 2.067
Don't compare performance numbers on dmd, particularly not when assessing abstraction overhead.
Jun 20 2015
prev sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
What you've forgotten is to turn on your optimisation flags. Even DMD gets
the first two vaguely right with -O -inline -release.

gdc is able to vectorize all three examples if you give it -Ofast.

The third one does appear to be a bit slower in all cases; not sure why
right now.
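For reference, the flag sets being discussed, as command lines (the source file name pi.d is an assumption, as is using ldc2 as the LDC driver name):

```shell
# DMD: the "optimisation troika"
dmd -O -inline -release pi.d

# GDC: -Ofast lets it vectorize all three variants
gdc -Ofast -o pi pi.d

# LDC: the compiler binary is typically named ldc2
ldc2 -O3 -release pi.d
```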
Jun 18 2015
parent "weaselcat" <weaselcat gmail.com> writes:
On Thursday, 18 June 2015 at 13:20:04 UTC, John Colvin wrote:
 On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
 [...]
 What you've forgotten is to turn on your optimisation flags. Even DMD
 gets the first 2 vaguely right with -O -inline -release
 [...]
The third one was faster for me on ldc, fwiw.
Jun 18 2015