## digitalmars.D - Reduce has dreadful performance?

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On a given machine, the code:

double sequential_loop(const int n, const double delta) {
    auto sum = 0.0;
    foreach (immutable i; 1 .. n + 1) {
        immutable x = (i - 0.5) * delta;
        sum += 1.0 / (1.0 + x * x);
    }
    return 4.0 * delta * sum;
}

runs in about 6.70s. However, the code:

double sequential_reduce(const int n, const double delta) {
    return 4.0 * delta * reduce!((double t, int i){ immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x); })(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
    return 4.0 * delta * reduce!"a + b"(
        map!((int i){ immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x); })(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very possible), this
is not going to create a good advert for D as an imperative language
with declarative (internal iteration) expression.

-- 
Russel.
==============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
```
Jun 18 2015
"Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
```On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very
possible), this
is not going to create a good advert for D as an imperative
language
with declarative (internal iteration) expression.

which compiler ?
```
Jun 18 2015
"Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
```On Thursday, 18 June 2015 at 10:34:46 UTC, Laeeth Isharc wrote:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very
possible), this
is not going to create a good advert for D as an imperative
language
with declarative (internal iteration) expression.

which compiler ?

fwiw n=1bn, delta=0.1
ldc no params:  8.749s / 17.466s
ldc -O2:        4.541s / 4.562s
gdc no params:  4.690s / 22.163s (!)
gdc -O2:        4.552s / 4.551s
```
Jun 18 2015
"weaselcat" <weaselcat gmail.com> writes:
```On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very
possible), this
is not going to create a good advert for D as an imperative
language
with declarative (internal iteration) expression.

first two run in the same time on LDC, third one runs about
30-40% faster.
```
Jun 18 2015
"Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
```On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very
possible), this
is not going to create a good advert for D as an imperative
language
with declarative (internal iteration) expression.

import std.stdio, std.datetime, std.algorithm, std.range,
std.conv;

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

//runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

void main() {
auto res = benchmark!(
{sequential_loop(1000, 10);},
{sequential_reduce(1000, 10);},
{sequential_reduce_alt(1000, 10);},
)(1000);
writeln(res[].map!(to!Duration));
}

$ dmd -run test.d
[9 ms, 305 μs, and 9 hnsecs, 16 ms, 625 μs, and 3 hnsecs, 27 ms, 417 μs, and 3 hnsecs]

$ dmd -O -inline -release -run test.d
[4 ms, 567 μs, and 6 hnsecs, 4 ms, 853 μs, and 8 hnsecs, 5 ms, 52 μs, and 5 hnsecs]

OS X, DMD64 D Compiler v2.067.1

Looks like you forgot optimisation troika "-O -inline -release"

Regards,
Ilya
```
Jun 18 2015
"Andrea Fontana" <nospam example.com> writes:
```On Thursday, 18 June 2015 at 10:46:18 UTC, Ilya Yaroshenko wrote:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

double sequential_alternative(const int n, const double delta) {
return 4.0 * delta * iota(1, n+1).map!(x => (x-0.5)*delta).map!(x => 1.0/(1.0 + x*x)).sum;
}

?
```
Jun 18 2015
"Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
```On Thursday, 18 June 2015 at 10:50:09 UTC, Andrea Fontana wrote:
On Thursday, 18 June 2015 at 10:46:18 UTC, Ilya Yaroshenko
wrote:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

double sequential_alternative(const int n, const double delta) {
return 4.0 * delta * iota(1, n+1).map!(x => (x-0.5)*delta).map!(x => 1.0/(1.0 + x*x)).sum;
}

?

This is not equivalent, because sum uses another (more precise)
algorithm to do the summation. It should be a little bit slower.
```
Jun 18 2015
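
To make the distinction Ilya mentions concrete, here is a minimal sketch (not taken from the thread; the n and delta values are arbitrary illustrative choices). reduce!"a + b" accumulates left to right, while std.algorithm.sum chooses a more accurate strategy such as pairwise summation for random-access floating-point ranges, which is why it can be marginally slower:

```
import std.algorithm : map, reduce, sum;
import std.range : iota;
import std.stdio : writefln;

void main() {
    enum n = 10_000_000;
    enum delta = 1.0 / n;

    // Same per-term expression as the thread's examples.
    auto terms = iota(1, n + 1)
        .map!(i => (i - 0.5) * delta)
        .map!(x => 1.0 / (1.0 + x * x));

    // Naive left-to-right accumulation, like reduce!"a + b" above.
    immutable naive = reduce!"a + b"(0.0, terms);

    // sum switches to a more accurate summation scheme for this kind of range.
    immutable accurate = terms.sum;

    writefln("reduce: %.17g", 4.0 * delta * naive);
    writefln("sum:    %.17g", 4.0 * delta * accurate);
}
```
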
Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via Digitalmars-d wrote:

[…]
Looks like you forgot optimisation troika "-O -inline -release"

Don't I feel stupid :-)

Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Anyone know how to get PyD to use LDC rather than DMD in the setuptools
build spec?

(I can't use GDC just now as it is not packaged for Fedora.)

-- 
Russel.
```
Jun 18 2015
"Laeeth Isharc" <nospamlaeeth laeethnospam.laeeth.com> writes:
```On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via
Digitalmars-d wrote:

[…]
Looks like you forgot optimisation troika "-O -inline -release"

Don't I feel stupid :-)

Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Anyone know how to get PyD to use LDC rather than DMD in the
setuptools build spec?

(I can't use GDC just now as it is not packaged for Fedora.)

python setup.py pydexe --compiler=ldc
```
Jun 18 2015
"Laeeth Isharc" <nospamlaeeth laeethnospam.laeeth.com> writes:
```On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via
Digitalmars-d wrote:

[…]
Looks like you forgot optimisation troika "-O -inline
-release"

Don't I feel stupid :-)

Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Anyone know how to get PyD to use LDC rather than DMD in the
setuptools build spec?

(I can't use GDC just now as it is not packaged for Fedora.)

python setup.py pydexe --compiler=ldc

GDC doesn't work anyway, I think, because of the shared library
problem.

would be great to have a dub setup so one can build in a standard
way without using python.
```
Jun 18 2015
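
On the dub wish: for the benchmark code itself (as opposed to the PyD extension build Laeeth is describing), a plain dub package already gives a standard, Python-free build. A minimal sketch, assuming the usual source/app.d layout with the code from this thread:

```
name "reduce-bench"
description "Loop vs reduce benchmark from the thread"
targetType "executable"
```

With that dub.sdl in place, `dub run --build=release --compiler=ldc2` builds and runs it with LDC, and `--compiler=dmd` or `--compiler=gdc` selects the other compilers.
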
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via
Digitalmars-d wrote:

[…]
Looks like you forgot optimisation troika "-O -inline
-release"

Don't I feel stupid :-)

Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Anyone know how to get PyD to use LDC rather than DMD in the
setuptools build spec?

(I can't use GDC just now as it is not packaged for Fedora.)

python setup.py pydexe --compiler=ldc

Just a note: if you're using PydMagic you can pass --compiler=ldc
to %%pyd
```
Jun 19 2015
Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On Thu, 2015-06-18 at 20:18 +0000, Laeeth Isharc via Digitalmars-d wrote:
[…]

python setup.py pydexe --compiler=ldc

I now have this working and reduce is no longer an embarrassment. Also we
have reaffirmed that LDC is the tool of choice and DMD unusable. But this
surprises no-one involved with CPU-bound activity, I suspect.

-- 
Russel.
```
Jun 19 2015
Walter Bright <newshound2 digitalmars.com> writes:
```On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

I expect that at the moment, range+algorithms code will likely be somewhat
slower than old-fashioned loops. This is because code generators have been
tuned for decades to do a great job with loops.

There's no intrinsic reason why ranges must do worse, so I expect they'll
achieve parity.

Ranges can move ahead because they can reduce the algorithmic complexity,
whereas user-written loops tend to be suboptimal.
```
Jun 18 2015
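
A toy illustration of that last point (an editorial sketch, not from Walter's post): the range primitive assumeSorted gives a binary-search membership test essentially for free, while the obvious hand-written loop stays a linear scan.

```
import std.array : array;
import std.range : assumeSorted, iota;

// The obvious hand-written membership test: a linear scan, O(n).
bool inArrayNaive(const(int)[] haystack, int needle) {
    foreach (v; haystack)
        if (v == needle)
            return true;
    return false;
}

// The range-based version: assumeSorted records the caller's promise that the
// data is sorted, and contains() then performs a binary search, O(log n).
bool inArraySorted(const(int)[] haystack, int needle) {
    return haystack.assumeSorted.contains(needle);
}

void main() {
    auto data = iota(0, 1_000_000).array;
    assert(inArrayNaive(data, 999_999));
    assert(inArraySorted(data, 999_999));
    assert(!inArraySorted(data, -1));
}
```
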
"weaselcat" <weaselcat gmail.com> writes:
```On Thursday, 18 June 2015 at 20:53:20 UTC, Walter Bright wrote:
On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

I expect that at the moment, range+algorithms code will likely
be somewhat slower than old fashioned loops. This is because
code generators have been tuned for decades to do a great job
with loops.

There's no intrinsic reason why ranges must do worse, so I
expect they'll achieve parity.

Ranges can move ahead because they can reduce the algorithmic
complexity, whereas user written loops tend to be suboptimal.

ldc and gdc have no issue optimizing the range code (range code is
actually faster with ldc)
```
Jun 18 2015
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 6/18/15 2:04 PM, weaselcat wrote:
On Thursday, 18 June 2015 at 20:53:20 UTC, Walter Bright wrote:
On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
Now I get

Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

I expect that at the moment, range+algorithms code will likely be
somewhat slower than old fashioned loops. This is because code
generators have been tuned for decades to do a great job with loops.

There's no intrinsic reason why ranges must do worse, so I expect
they'll achieve parity.

Ranges can move ahead because they can reduce the algorithmic
complexity, whereas user written loops tend to be suboptimal.

ldc and gdc have no issue optimizing the range code (range code is
actually faster with ldc)

Russel, do you have numbers for ldc by any chance? Thx! -- Andrei
```
Jun 18 2015
Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On Thu, 2015-06-18 at 15:54 -0700, Andrei Alexandrescu via Digitalmars-d wrote:
[…]

Russel, do you have numbers for ldc by any chance? Thx! -- Andrei

Single data point only, and I need to check this, but:

Loop: 1.44s
Reduce 1: 1.43s
Reduce 2: 1.42s

LDC seems to beat DMD into the ground.

But then I think we all know this ;-)

The PyD setuptools support appears to not properly set RPATH, which is a bit
annoying. Actually it is a lot annoying.

-- 
Russel.
```
Jun 19 2015
"weaselcat" <weaselcat gmail.com> writes:
```On Saturday, 20 June 2015 at 05:55:07 UTC, Russel Winder wrote:
On Thu, 2015-06-18 at 15:54 -0700, Andrei Alexandrescu via
Digitalmars-d wrote:
[…]

Russel, do you have numbers for ldc by any chance? Thx! --
Andrei

Single data point only, and I need to check this, but:

Loop: 1.44s
Reduce 1: 1.43s
Reduce 2: 1.42s

LDC seems to beat DMD into the ground.

But then I think we all know this ;-)

The PyD setuptools support appears to not properly ser RPATH
which is a bit annoying. Actually it is a lot annoying.

Interesting, I saw a bigger speedup from the ranges.
Do you use Windows? What LLVM/LDC version? Flags?
```
Jun 20 2015
Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
```On Thu, 2015-06-18 at 13:53 -0700, Walter Bright via Digitalmars-d wrote:
[…]

There's no intrinsic reason why ranges must do worse, so I expect they'll
achieve parity.

reduce already has achieved parity using LDC – once you install it properly,
unlike what I had been doing.

Explicit loops are just so last millennium. :-)

PyData London 2015 here we come.

-- 
Russel.
```
Jun 19 2015
"Martin Nowak" <code dawg.eu> writes:
```On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
Loop: 3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Don't compare performance numbers on dmd, particularly not when
```
Jun 20 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
auto sum = 0.0;
foreach (immutable i; 1 .. n + 1) {
immutable x = (i - 0.5) * delta;
sum += 1.0 / (1.0 + x * x);
}
return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
return 4.0 * delta * reduce!((double t, int i){immutable x = (i - 0.5) * delta; return t + 1.0 / (1.0 + x * x);})(0.0, iota(1, n + 1));
}

double sequential_reduce_alt(const int n, const double delta) {
return 4.0 * delta * reduce!"a + b"(
map!((int i){immutable x = (i - 0.5) * delta; return 1.0 / (1.0 + x * x);})(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very
possible), this
is not going to create a good advert for D as an imperative
language
with declarative (internal iteration) expression.

What you've forgotten is to turn on your optimisation flags. Even
DMD gets the first 2 vaguely right with -O -inline -release

gdc is able to vectorize all 3 examples if you give it -Ofast

The 3rd one does appear to be a bit slower in all cases, not sure
why right now.
```
Jun 18 2015
"weaselcat" <weaselcat gmail.com> writes:
```On Thursday, 18 June 2015 at 13:20:04 UTC, John Colvin wrote:
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
[...]

What you've forgotten is to turn on your optimisation flags.
Even DMD gets the first 2 vaguely right with -O -inline -release

gdc is able to vectorize all 3 examples if you give it -Ofast

The 3rd one does appear to be a bit slower in all cases, not
sure why right now.

third one was faster for me on ldc, fwiw
```
Jun 18 2015