digitalmars.D.learn - Simple performance question from a newcomer

dextorious <dextorious gmail.com> writes:
I've been vaguely aware of D for many years, but the recent 
addition of std.experimental.ndslice finally inspired me to give 
it a try, since my main expertise lies in the domain of 
scientific computing and I primarily use Python/Julia/C++, where 
multidimensional arrays can be handled with a great deal of 
expressiveness and flexibility. Before writing anything serious, 
I wanted to get a sense for the kind of code I would have to 
write to get the best performance for numerical calculations, so 
I wrote a trivial summation benchmark. The following code gave me 
slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
	int N = 1000;
	int Q = 20;
	int times = 1_000;
	double[] res1 = uninitializedArray!(double[])(N);
	double[] res2 = uninitializedArray!(double[])(N);
	double[] res3 = uninitializedArray!(double[])(N);
	auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
	StopWatch sw;
	double t0, t1, t2;
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res1[i] = sumtest1(f[i]);
		}
	}
	sw.stop();
	t0 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res2[i] = sumtest2(f[i]);
		}
	}
	sw.stop();
	t1 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		sumtest3(f, res3, N, Q);
	}
	sw.stop();
	t2 = sw.peek().msecs;
	writeln(t0, " ms");
	writeln(t1, " ms");
	writeln(t2, " ms");
	assert( res1 == res2 );
	assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
	return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
	double retval = 0.0;
	foreach (double f_ ; f)	{
		retval += f_;
	}
	return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double Q) @safe pure nothrow @nogc {
	for (int i=0; i<N; ++i)	{
		for (int j=1; j<Q; ++j)	{
			retval[i] += f[i,j];
		}
	}
}

When I compiled it using dmd -release -inline -O -noboundscheck 
../src/main.d, I got the following timings:
1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit 
loops are generally frowned upon and not necessary for the usual 
performance reasons. Nevertheless, the two explicit loop 
functions gave me an improvement by a factor of 4+. Furthermore, 
the difference between sumtest2 and sumtest3 seems to indicate 
that function calls have a significant overhead. I also tried 
using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, 
but that yielded even worse performance. I did not try the 
GDC/LDC compilers yet, since they don't seem to be up to date on 
the standard library and don't include the ndslice package last I 
checked.

Now, seeing as how my experience writing D is literally a few 
hours, is there anything I did blatantly wrong? Did I miss any 
optimizations? Most importantly, can the elegant operator 
chaining style be generally made as fast as the explicit loops 
we've all been writing for decades?
Feb 21 2016
jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
 Now, seeing as how my experience writing D is literally a few 
 hours, is there anything I did blatantly wrong? Did I miss any 
 optimizations? Most importantly, can the elegant operator 
 chaining style be generally made as fast as the explicit loops 
 we've all been writing for decades?
I didn't look at your code that thoroughly, but it is generally 
recommended to compile with gdc or ldc if you're concerned about 
performance. dmd has fast compile times, but does not produce 
code that is as fast. You might want to check if the performance 
differential is still large with one of those.
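
For reference, optimized builds with the three compilers look 
roughly like this (a sketch only; flag sets vary by compiler 
version):

dmd -release -inline -O -noboundscheck main.d
ldc2 -release -O3 -boundscheck=off main.d
gdc -frelease -O3 main.d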
Feb 21 2016
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
You can use -profile to see what is causing it.

   Num          Tree        Func        Per
   Calls        Time        Time        Call

23000000   550799875   550243765          23     pure nothrow @nogc 
@safe double std.algorithm.iteration.sumPairwise!(double, 
std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, 
double, double).iota(double, double, 
double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, 
std.range.iota!(double, double, double).iota(double, double, 
double).Result).Slice)

Dne 21.2.2016 v 15:32 dextorious via Digitalmars-d-learn napsal(a):
 I've been vaguely aware of D for many years, but the recent addition 
 of std.experimental.ndslice finally inspired me to give it a try 
 [...]
Feb 21 2016
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
So I guess pairwise summation is one to blame here.

Dne 21.2.2016 v 16:56 Daniel Kozak napsal(a):
 You can use -profile to see what is causing it.

   Num          Tree        Func        Per
   Calls        Time        Time        Call

 23000000   550799875   550243765          23     pure nothrow @nogc 
 @safe double std.algorithm.iteration.sumPairwise!(double, 
 std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, 
 double, double).iota(double, double, 
 double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, 
 std.range.iota!(double, double, double).iota(double, double, 
 double).Result).Slice)

 Dne 21.2.2016 v 15:32 dextorious via Digitalmars-d-learn napsal(a):
 I've been vaguely aware of D for many years, but the recent addition 
 of std.experimental.ndslice finally inspired me to give it a try 
 [...]
Feb 21 2016
Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
 Now, seeing as how my experience writing D is literally a few 
 hours, is there anything I did blatantly wrong? Did I miss any 
 optimizations? Most importantly, can the elegant operator 
 chaining style be generally made as fast as the explicit loops 
 we've all been writing for decades?
First off, you should really be using GDC or LDC if you want 
speed. On how to do that, see my blog post here: 
http://jackstouffer.com/blog/nd_slice.html, specifically the 
section titled "Getting Hands On".

Secondly, both of your other sum examples use naive 
element-by-element summation rather than the more accurate 
pairwise summation which sum uses with random-access floating 
point ranges. So you're not really comparing apples to apples 
here. Since Phobos' pairwise summation is recursive, it's very 
likely that DMD isn't doing all the optimizations that LDC or 
GDC can, such as inlining or tail call optimizations. I haven't 
compiled your code so I can't check myself.

Also, templates have their attributes inferred automatically, so 
there's no reason to include @safe, nothrow, etc. on templated 
functions.
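
Roughly, pairwise summation splits the range in half and 
recurses, so rounding error grows like O(log n) instead of O(n). 
A minimal sketch of the idea (illustrative only, not the Phobos 
implementation; the real base-case threshold differs):

double pairwiseSum(double[] a)
{
    // Small base case: plain left-to-right summation.
    if (a.length <= 8)
    {
        double s = 0.0;
        foreach (x; a) s += x;
        return s;
    }
    // Split and recurse; the two halves are summed independently,
    // so errors accumulate along a tree rather than a chain.
    immutable mid = a.length / 2;
    return pairwiseSum(a[0 .. mid]) + pairwiseSum(a[mid .. $]);
}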
Feb 21 2016
bachmeier <no spam.net> writes:
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
 I had heard while reading up on the language that in D explicit 
 loops are generally frowned upon and not necessary for the 
 usual performance reasons.
First, a minor point: the D community is usually pretty careful 
not to frown on a particular coding style (unlike some 
communities), so if you are comfortable writing loops and it 
gives you the fastest code, you should do so.

On the performance issue, you can see this related post about 
performance with reduce:
http://forum.dlang.org/post/mailman.4829.1434623275.7663.digitalmars-d puremagic.com

This was Walter's response:
http://forum.dlang.org/post/mlvb40$1tdf$1 digitalmars.com

And this shows that LDC flat out does a better job of 
optimization in this case:
http://forum.dlang.org/post/mailman.4899.1434779705.7663.digitalmars-d puremagic.com
Feb 21 2016
dextorious <dextorious gmail.com> writes:
On Sunday, 21 February 2016 at 16:20:30 UTC, bachmeier wrote:
 First, a minor point, the D community is usually pretty careful 
 not to frown on a particular coding style (unlike some 
 communities) so if you are comfortable writing loops and it 
 gives you the fastest code, you should do so.

 On the performance issue, you can see this related post about 
 performance with reduce:
 http://forum.dlang.org/post/mailman.4829.1434623275.7663.digitalmars-d puremagic.com

 This was Walter's response:
 http://forum.dlang.org/post/mlvb40$1tdf$1 digitalmars.com

 And this shows that LDC flat out does a better job of 
 optimization in this case:
 http://forum.dlang.org/post/mailman.4899.1434779705.7663.digitalmars-d puremagic.com
While I certainly do not doubt the open-mindedness of the D 
community, it was in part Walter Bright's statement during a 
keynote speech that "loops are bugs" which motivated me to look 
at D for a fresh approach to writing numerical code. For 
decades, explicit loops have been the only way to attain good 
performance for certain kinds of code in virtually all languages 
(discounting a few quirky high-level languages like MATLAB), and 
the notion that this need not be the case is quite attractive to 
many people, myself included.

While the point Walter makes, that there is no mathematical 
reason ranges should be slower than loops and that loops are 
generally easier to get wrong, is certainly true, D is the first 
general purpose language I've ever seen that makes this 
sentiment come close to reality.
Feb 22 2016
sigod <sigod.mail gmail.com> writes:
On Sunday, 21 February 2016 at 16:20:30 UTC, bachmeier wrote:
 On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
 I had heard while reading up on the language that in D 
 explicit loops are generally frowned upon and not necessary 
 for the usual performance reasons.
 First, a minor point: the D community is usually pretty careful 
 not to frown on a particular coding style (unlike some 
 communities), so if you are comfortable writing loops and it 
 gives you the fastest code, you should do so. [...]
I can't agree with that. Between `for` and `foreach` you should 
choose the one that is more readable/understandable for the 
particular situation. It's the compiler's task to optimize such 
small things.
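
For concreteness, the two spellings being compared (using the 
names from the original post) look like this; with optimizations 
enabled a compiler is expected to generate equivalent code for 
both:

// Explicit C-style loop:
for (int i = 0; i < N; ++i) {
    res2[i] = sumtest2(f[i]);
}

// Equivalent foreach over the same index range:
foreach (i; 0 .. N) {
    res2[i] = sumtest2(f[i]);
}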
Feb 22 2016
ZombineDev <petar.p.kirov gmail.com> writes:
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
 I've been vaguely aware of D for many years, but the recent 
 addition of std.experimental.ndslice finally inspired me to 
 give it a try [...] Most importantly, can the elegant operator 
 chaining style be generally made as fast as the explicit loops 
 we've all been writing for decades?
The problem is not with ranges, but with the particular 
algorithm used for summing. If you look at the docs, you'll see 
that if the range has random access, `sum` will use the 
pair-wise algorithm. About the second and third tests, the 
problem is with DMD, which should not be used when measuring 
performance (but only for development, because it has fast 
compile-times).

These are the results that I get with LDC, pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms
The second two results are probably larger due to noise.

And if I increase N to 100_000, pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms
Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command-line was roughly this:
ldc2 -ofapp -release -O5 -singleobj -w source/app.d 
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d 
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d 
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d 
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d 
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d
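
For readers unfamiliar with it, Kahan summation carries a 
running compensation term that recovers the low-order bits lost 
in each addition. A minimal sketch of the idea (illustrative 
only, not necessarily the exact code benchmarked above):

auto kahanSum(Range)(Range r)
{
    double sum = 0.0;
    double c = 0.0;            // running compensation for lost low-order bits
    foreach (double x; r)
    {
        immutable y = x - c;   // apply the correction to the next term
        immutable t = sum + y; // low-order digits of y may be lost here
        c = (t - sum) - y;     // algebraically zero; captures the lost part
        sum = t;
    }
    return sum;
}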
Feb 21 2016
ZombineDev <petar.p.kirov gmail.com> writes:
On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
 ...
 And if I use the Kahan algorithm:
 106 ms
 36 ms
 31 ms
 The second two results are probably larger due to noise.
I did some more testing and clearly the larger times for N=1000 
were just noise (one run per line, sumtest1/2/3):

[LDC Kahan N=1000]
106 ms   36 ms   31 ms
 59 ms   24 ms   22 ms
 46 ms   21 ms   20 ms
 45 ms   21 ms   20 ms
 45 ms   21 ms   20 ms
 59 ms   24 ms   21 ms
102 ms   35 ms   30 ms
104 ms   37 ms   29 ms
107 ms   36 ms   31 ms
 46 ms   21 ms   20 ms
Feb 21 2016
ZombineDev <petar.p.kirov gmail.com> writes:
On Sunday, 21 February 2016 at 16:36:22 UTC, ZombineDev wrote:
 On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
 ...
 And if I use the Kahan algorithm:
 106 ms
 36 ms
 31 ms
 The second two results are probably larger due to noise.
 I did some more testing and clearly the larger times for N=1000 
 were just noise:

 [LDC Kahan N=1000]
 106 ms   36 ms   31 ms
  59 ms   24 ms   22 ms
 ...
Just for the record, with `DMD -release -O -inline`, Kahan, 
N=1000 I get (one run per line, sumtest1/2/3):

325 ms  217 ms  165 ms
231 ms  117 ms   58 ms
131 ms  109 ms   58 ms
131 ms  109 ms   57 ms
131 ms  112 ms   57 ms
125 ms  106 ms   55 ms
125 ms  104 ms   55 ms
125 ms  105 ms   55 ms
125 ms  104 ms   55 ms
230 ms  115 ms   58 ms
131 ms  112 ms   58 ms
131 ms  109 ms   57 ms
Feb 21 2016
dextorious <dextorious gmail.com> writes:
First of all, I am pleasantly surprised by the rapid influx of 
helpful responses. The community here seems quite wonderful. In 
the interests of not cluttering the thread too much, since the 
advice given here has many commonalities, I will only try to 
respond once to each type of suggestion.

On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
 The problem is not with ranges, but with the particular 
 algorithm used for summing. If you look at the docs, you'll see 
 that if the range has random access, `sum` will use the 
 pair-wise algorithm. About the second and third tests, the 
 problem is with DMD, which should not be used when measuring 
 performance (but only for development, because it has fast 
 compile-times).
 ...
 According to `dub --verbose`, my command-line was roughly this:
 ldc2 -ofapp -release -O5 -singleobj -w source/app.d
 ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d
 ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d
 ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d
 ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d
 ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d
It appears that I cannot use the GDC compiler for this 
particular problem due to it using a comparatively older version 
of the DMD frontend (I understand Mir requires >=2.068), but I 
did manage to get LDC working on my system after a bit of work.

Since I've been using dub to manage my project, I used the 
default "release" build type. I also tried compiling manually 
with LDC, using the -O5 switch you mentioned. These are the 
results (I increased the iteration count to lessen the noise; 
the array is now 10000x20 and each function is run a thousand 
times). The last column is LDC with -release -enable-inlining 
-O5 -w -singleobj:

            DMD        LDC (dub)   LDC (manual)
sumtest1:   12067 ms   6899 ms     1940 ms
sumtest2:    3076 ms   1349 ms      452 ms
sumtest3:    2526 ms    847 ms      434 ms
sumtest4:    5614 ms   1481 ms      452 ms

The sumtest1, 2 and 3 functions are as given in the first post; 
sumtest4 uses the range.reduce!((a, b) => a + b) approach to 
enforce naive summation. Much to my satisfaction, the 
range.reduce version is now exactly as quick as the traditional 
loop, and while function inlining isn't quite perfect, the 4% 
performance penalty incurred by the 10_000 function calls (or 
whatever inlined form the function finally takes) is quite 
acceptable.

I do have to wonder, however, about the default settings of dub 
in this case. Having gone through its documentation, I might 
still not have guessed to try the compiler options you provided, 
thereby losing out on a 2-3x performance improvement. What build 
options did you use in your dub.json that it managed to 
translate to the correct compiler switches?
Feb 22 2016
ixid <nuaccount gmail.com> writes:
On Monday, 22 February 2016 at 15:43:23 UTC, dextorious wrote:
 I do have to wonder, however, about the default settings of dub 
 in this case. Having gone through its documentation, I might 
 still not have guessed to try the compiler options you 
 provided, thereby losing out on a 2-3x performance improvement. 
 What build options did you use in your dub.json that it managed 
 to translate to the correct compiler switches?
Your experience is exactly what the D community needs to get 
right. You've come in as an interested user with patience, and 
initially D has offered slightly disappointing performance, both 
for technical reasons and because of the different compilers. 
You've gotten to the right place in the end, but we need point A 
to point B to be a lot smoother and more obvious so more people 
get a good initial impression of D.

Every D user thread seems to go like this: someone starts with 
DMD, they then struggle a little and hopefully get LDC working 
with a list of slightly obscure compiler switches offered. A 
standard algorithm performs disappointingly for somewhat valid 
technical reasons and more clunky alternatives are then 
deployed. We really need the standard algorithms to be fast, and 
perhaps have separate ones for perfect technical accuracy.

What are your thoughts on D now? What would have helped you get 
to the right place much faster?
Feb 23 2016
Marc Schütz <schuetzm gmx.net> writes:
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
 We really need to standard algorithms to be fast and perhaps 
 have separate ones for perfect technical accuracy.
While I agree with most of what you're saying, I don't think we 
should prioritize performance over accuracy or correctness. 
Especially for numerics people, precision is very important, and 
it can make just as bad a first impression if we don't get this 
right.

We can however make the note in the documentation (which already 
talks about performance) a bit more prominent:
http://dlang.org/phobos/std_algorithm_iteration.html#sum
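
To see why the default errs on the side of accuracy, here is a 
tiny self-contained illustration (values chosen for effect): 
with naive left-to-right addition, terms far below the running 
total's precision are silently dropped.

import std.stdio;

void main()
{
    double naive = 1e16;
    foreach (i; 0 .. 10)
        naive += 1.0;               // each 1.0 is below the ulp of 1e16 and is lost
    writeln(naive - 1e16);          // prints 0

    // Grouping the small terms first (as pairwise summation tends to) keeps them:
    double small = 0.0;
    foreach (i; 0 .. 10)
        small += 1.0;
    writeln((1e16 + small) - 1e16); // prints 10
}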
Feb 23 2016
ixid <nuaccount gmail.com> writes:
On Tuesday, 23 February 2016 at 14:07:22 UTC, Marc Schütz wrote:
 On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
 We really need to standard algorithms to be fast and perhaps 
 have separate ones for perfect technical accuracy.
While I agree with most of what you're saying, I don't think we should prioritize performance over accuracy or correctness. Especially for numerics people, precision is very important, and it can make a just as bad first impression if we don't get this right. We can however make the note in the documentation (which already talks about performance) a bit more prominent: http://dlang.org/phobos/std_algorithm_iteration.html#sum
Wouldn't it be better to have technically perfect 
implementations for those numerics people? Sum is a basic 
function that almost everyone may want to use; this is a factor 
of four slowdown for the sake of one user group who could be 
perfectly well served by a sub-library that contains 
high-accuracy versions. It might make sense if the speed 
difference were only a few percent.
Feb 23 2016
dextorious <dextorious gmail.com> writes:
On Tuesday, 23 February 2016 at 14:07:22 UTC, Marc Schütz wrote:
 On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
 We really need to standard algorithms to be fast and perhaps 
 have separate ones for perfect technical accuracy.
While I agree with most of what you're saying, I don't think we should prioritize performance over accuracy or correctness. Especially for numerics people, precision is very important, and it can make a just as bad first impression if we don't get this right. We can however make the note in the documentation (which already talks about performance) a bit more prominent: http://dlang.org/phobos/std_algorithm_iteration.html#sum
Being new to the language, I certainly make no claims about what 
the Phobos library should do, but coming from a heavy numerics 
background in many languages, I can say that this is the first 
time I've seen a common summation function do anything beyond 
naive summation. Some languages feature more accurate options 
separately, but never as the default, so it did not occur to me 
to specifically check the documentation for something like sum() 
(which is my fault, of course, no issues there). Having the more 
accurate pairwise summation algorithm in the standard library is 
certainly worthwhile for some applications, but I was a bit 
surprised to see it as the default.
Feb 23 2016
Ali Çehreli <acehreli yahoo.com> writes:
On 02/23/2016 12:12 PM, dextorious wrote:

 Some languages
 feature more accurate options separately, but never as the default, so
 it did not occur to me to specifically check the documentation for
 something like sum() (which is my fault, of course, no issues there).
 Having the more accurate pairwise summation algorithm in the standard
 library is certainly worthwhile for some applications, but I was a bit
 surprised to see it as the default.
According to Wikipedia, pairwise summation is the default 
algorithm in NumPy and Julia as well: "Pairwise summation is the 
default summation algorithm in NumPy and the Julia 
technical-computing language".

https://en.wikipedia.org/wiki/Pairwise_summation

Ali
Feb 24 2016
bachmeier <no spam.com> writes:
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
 On Monday, 22 February 2016 at 15:43:23 UTC, dextorious wrote:
 I do have to wonder, however, about the default settings of 
 dub in this case. Having gone through its documentation, I 
 might still not have guessed to try the compiler options you 
 provided, thereby losing out on a 2-3x performance 
 improvement. What build options did you use in your dub.json 
 that it managed to translate to the correct compiler switches?
Your experience is exactly what the D community needs to get right. You've come in as an interested user with patience and initially D has offered slightly disappointing performance for both technical reasons and because of the different compilers.
His concern is with the default settings of Dub. I've tried Dub 
and given up several times, and I've been using D since 2013. 
The community needs to provide real documentation. It's 
embarrassing that it's pushed as the official package manager 
and will soon be included with DMD.
Feb 23 2016
dextorious <dextorious gmail.com> writes:
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
 On Monday, 22 February 2016 at 15:43:23 UTC, dextorious wrote:
 I do have to wonder, however, about the default settings of 
 dub in this case. Having gone through its documentation, I 
 might still not have guessed to try the compiler options you 
 provided, thereby losing out on a 2-3x performance 
 improvement. What build options did you use in your dub.json 
 that it managed to translate to the correct compiler switches?
Your experience is exactly what the D community needs to get right. You've come in as an interested user with patience and initially D has offered slightly disappointing performance for both technical reasons and because of the different compilers. You've gotten to the right place in the end but we need point A to point B to be a lot smoother and more obvious so more people get a good initial impression of D. Every D user thread seems to go like this- someone starts with DMD, they then struggle a little and hopefully get LDC working with a list of slightly obscure compiler switches offered. A standard algorithm performs disappointingly for somewhat valid technical reasons and more clunky alternatives are then deployed. We really need to standard algorithms to be fast and perhaps have separate ones for perfect technical accuracy. What are your thoughts on D now? What would have helped you get to the right place much faster?
Personally, I think a few aspects of documentation for the 
various compilers, dub and possibly the dlang.org website itself 
could be improved, if accessibility is considered important. For 
instance, just to take my journey with trying out D as an 
example, I can immediately list a few points where I 
misunderstood or failed to find relevant information:

1. While the dlang.org website does a good job presenting the 
three compilers side by side with a short pro/con list for each 
and does mention that DMD produces slower code, I did not at 
first expect the difference to be half an order of magnitude or 
more. In retrospect, after reading the forums and learning about 
how each compiler works, this is quite obvious, but the initial 
impression was misleading.

2. The LDC compiler gave me a few issues during setup, 
particularly on Windows. The binaries supplied are dynamically 
linked against the MSVS2015 runtime (and will fail on any other 
system) and seem to require a full Visual Studio installation. I 
assume there are good reasons for this (though I hope in the 
future a more widely usable version could be made available), 
but the fact itself could be made clearer on the download page 
(it can be found after some searching on the D wiki and the 
forums).

3. The documentation for the dub package is useful, but somewhat 
difficult to read due to how it is structured, and does not seem 
complete. For instance, I am still not sure how to make it pass 
the -O5 switch to the LDC2 compiler, and the impression I got 
from the documentation is that explicit manual switches can only 
be supplied for the DMD compiler. It says that when using other 
compilers, the relevant switches are automatically translated to 
appropriate options for GDC/LDC, but no further details are 
supplied, and no matter what options I set for the DMD compiler, 
using --compiler=ldc2 only yields -O and not -O5. For the 
moment, I'm compiling my code and managing dependencies manually 
like I would in C++, which is just fine for me personally, but 
does leave a slightly disappointing impression about what is 
apparently considered a semi-official package manager for the D 
language.

Of course, this is just my anecdotal experience and should not 
be taken as major criticism. It may be that I missed something 
or did not do enough research. Certainly, some amount of 
adjustment is to be expected when learning a new language, but 
there does seem to be some room for improvement.
Feb 23 2016
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 23 February 2016 at 20:03:30 UTC, dextorious wrote:
 Personally, I think a few aspects of documentation for the 
 various compilers, dub and possibly the dlang.org website 
 itself could be improved, if accessibility is considered 
 important.
Couldn't agree more.
 Being new to the language, I certainly make no claims about 
 what the Phobos library should do, but coming from a heavy 
 numerics background in many languages, I can say that this is 
 the first time I've seen a common summation function do 
 anything beyond naive summation. Some languages feature more 
 accurate options separately, but never as the default, so it 
 did not occur to me to specifically check the documentation for 
 something like sum() (which is my fault, of course, no issues 
 there). Having the more accurate pairwise summation algorithm 
 in the standard library is certainly worthwhile for some 
 applications, but I was a bit surprised to see it as the 
 default.
I think that's fair. I think part of the reason for the focus on 
accuracy over speed is that floats can have really weird 
behavior sometimes. For most people, it's better to be a little 
slower all the time in order to get the right answer all the 
time (or as often as possible with floats). And people who want 
more speed can look at the docs and figure out what they need to 
do to get more.
Feb 23 2016
Mike Parker <aldacron gmail.com> writes:
On Tuesday, 23 February 2016 at 20:03:30 UTC, dextorious wrote:
 For instance, I am still not sure how to make it pass the -O5 
 switch to the LDC2 compiler and the impression I got from the 
 documentation is that explicit manual switches can only be 
 supplied for the DMD compiler.
If you're referring to this:

"Additional flags passed to the D compiler - note that these 
flags are usually specific to the compiler in use, but a set of 
flags is automatically translated from DMD to the selected 
compiler"

My take is that a specific set of flags are automatically 
translated (so you don't need to make a separate dflags entry 
for each compiler you support if you only use those flags), but 
you can pass any compiler-specific flags you need.
Feb 23 2016
dextorious <dextorious gmail.com> writes:
On Wednesday, 24 February 2016 at 03:33:14 UTC, Mike Parker wrote:
 On Tuesday, 23 February 2016 at 20:03:30 UTC, dextorious wrote:
 For instance, I am still not sure how to make it pass the -O5 
 switch to the LDC2 compiler and the impression I got from the 
 documentation is that explicit manual switches can only be 
 supplied for the DMD compiler.
 If you're referring to this:

 "Additional flags passed to the D compiler - note that these 
 flags are usually specific to the compiler in use, but a set of 
 flags is automatically translated from DMD to the selected 
 compiler"

 My take is that a specific set of flags are automatically 
 translated (so you don't need to make a separate dflags entry 
 for each compiler you support if you only use those flags), but 
 you can pass any compiler-specific flags you need.
There's part of what I'm referring to, yes. There doesn't seem 
to be any documentation on what gets translated and what 
doesn't.

For the moment, the only way I've found to manually pass 
specific compiler options ("-O5 -singleobj" in my case) is by 
setting the dflags attribute when defining a buildType. However, 
there doesn't seem to be any way to specify different dflags for 
different compilers, so I am forced to introduce separately 
named buildTypes for each compiler. Since I still need to 
manually specify the compiler using the --compiler option when 
running dub, this feels like I'm using a hacky workaround rather 
than a consistently designed CLI. Furthermore, from the 
documentation, I have no idea if what I'm doing is the intended 
way or just an ugly hack around whatever piece of information 
I've missed.
Feb 24 2016
jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 24 February 2016 at 19:15:23 UTC, dextorious wrote:
 However, there doesn't seem to be any way to specify different 
 dflags for different compilers
There are examples like in the package format page:

"dflags-dmd": ["-vtls"],
"sourceFiles-windows-x86_64-dmd": ["lib/win32/mylib.lib"],

that would give some idea of how to do it. Combining them 
together, you are able to do things like:

"dflags-windows-x86_64-dmd": ["-vtls"],

So yes, it is do-able, but as you mention above, the docs page 
could use some work in making this functionality clear.
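
Putting the pieces together, per-compiler flags inside a custom 
build type might look roughly like this in dub.json (an untested 
sketch; the build type name is arbitrary and the exact attribute 
suffixes accepted depend on the dub version):

{
    "name": "app",
    "dependencies": { "mir": "~>0.10.1-alpha" },
    "buildTypes": {
        "release-fast": {
            "buildOptions": ["releaseMode", "optimize", "inline"],
            "dflags-ldc": ["-O5", "-singleobj"]
        }
    }
}

which would then be selected with something like: 
dub build -b release-fast --compiler=ldc2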
Feb 24 2016
Kapps <opantm2+spam gmail.com> writes:
If you do want to test the differences between the range approach 
and the loop approach, something like:
auto sumtest4(Range)(Range range) @safe pure {
	return range.reduce!((a, b) => a + b);
}
is a more fair comparison. I get results within 15% of sumtest2 
with this using dmd. I think with ldc this would be identical, 
but the version in homebrew is too old to compile this.
Feb 21 2016
Kapps <opantm2+spam gmail.com> writes:
On Monday, 22 February 2016 at 07:10:23 UTC, Kapps wrote:
 If you do want to test the differences between the range 
 approach and the loop approach, something like:
 auto sumtest4(Range)(Range range) @safe pure {
 	return range.reduce!((a, b) => a + b);
 }
 is a more fair comparison. I get results within 15% of sumtest2 
 with this using dmd. I think with ldc this would be identical, 
 but the version in homebrew is too old to compile this.
Using LDC with the mir version of ndslice so it compiles, and 
the following code:

sw.reset();
sw.start();
foreach (unused; 0..times) {
    for (int i=0; i<N; ++i) {
        res4[i] = sumtest4(f[i]);
    }
}
t3 = sw.peek().msecs;

and

auto sumtest4(Range)(Range range) {
    return range.reduce!((a, b) => a + b);
}

I get:

145 ms
19 ms
19 ms
19 ms

So, with LDC, there is no performance hit doing this. The only 
performance hit is when .sum uses a different algorithm for a 
more accurate result. Also, the LDC version appears to be 
roughly 5x faster than the DMD version.
Feb 21 2016