www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Code runtime

reply Ausprobierer <c3550227 trbvn.com> writes:
I've written this simple piece of code:

[CODE]
import std.algorithm;
import std.array;
import std.conv;
import std.datetime;
import std.parallelism;
import std.range;
import std.stdio;
import core.atomic;
import core.thread;

void main()
{
     immutable long n = 11;

     long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0];
     long j = 0;
     long l = z.length;

     StopWatch w; w.reset(); w.start;

     do
     {
         if (z[0] == 0) break; else
         {

             long m = 0;
             for (long i = 0; i < z.length; i += 2)
             {
                 m += (z[i]*10 + z[i+1]);
             }
             if (m % n == 0) j++;
         }
     }
     while (nextPermutation(z));

     w.stop();
     j.writeln;
     w.peek().msecs.writeln;
}
[/CODE]

I've compiled it with "dmd -m64 -release -inline <filename>.d".
Output is 58679513, and it takes around 52 seconds to run.

Then I've replicated this code in MSVS2015 C++:

[CODE]
#include <algorithm>
#include <iostream>
#include <ctime>

using namespace std;

const long N = 11;

long Z[] = { 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 0, 0 };
long L = sizeof(Z) / sizeof(Z[0]);

int main(void)
{
	long j = 0;

	clock_t t = clock();

	do {
		if (Z[0] == 0) break;

		long m = 0;
		for (long i = 0; i < L; i += 2) {
			m += (Z[i] * 10 + Z[i + 1]);
		}

		if (m % N == 0) {
			j++;
		}
	} while (next_permutation(Z, Z+L));

	t = clock() - t;

	cout << j << endl;
	cout << (float(t)) / CLOCKS_PER_SEC << endl;

	return 0;
}
[/CODE]

Compile mode is also Release/x64. I've run the program and was 
eagerly surprised, not to say stunned. The output is also 
58679513, but it's done in under 6 seconds!

Care to explain? I haven't found a solution to this :(
Jun 05 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/05/2016 11:52 PM, Ausprobierer wrote:
 I've compiled it with "dmd -m64 -release -inline <filename>.d".
Misses -O for actual optimizations. You can also try ldc or gdc for better performance.
Jun 05 2016
parent reply Ausprobierer <c3550227 trbvn.com> writes:
On Sunday, 5 June 2016 at 21:56:30 UTC, ag0aep6g wrote:
 On 06/05/2016 11:52 PM, Ausprobierer wrote:
 I've compiled it with "dmd -m64 -release -inline <filename>.d".
Misses -O for actual optimizations. You can also try ldc or gdc for better performance.
Still a runtime of approx. 36 seconds.
Jun 05 2016
parent ag0aep6g <anonymous example.com> writes:
On 06/06/2016 12:00 AM, Ausprobierer wrote:
 On Sunday, 5 June 2016 at 21:56:30 UTC, ag0aep6g wrote:
 On 06/05/2016 11:52 PM, Ausprobierer wrote:
 I've compiled it with "dmd -m64 -release -inline <filename>.d".
Misses -O for actual optimizations. You can also try ldc or gdc for better performance.
Still a runtime of approx. 36 seconds.
When I compare `ldc2 -release -O` against `g++ -O`, I get 13 seconds and 9 seconds. Still slower, but it's in the same ballpark. `dmd -release -O -inline` does significantly worse: 40 seconds. dmd's strength are quick compilations, not fast executables (not that the two are mutually exclusive).
Jun 05 2016
prev sibling next sibling parent reply Donald Drumpf <donald.trump dt.com> writes:
On Sunday, 5 June 2016 at 21:52:20 UTC, Ausprobierer wrote:
 I've written this simple piece of code:

 [CODE]
 import std.algorithm;
 import std.array;
 import std.conv;
 import std.datetime;
 import std.parallelism;
 import std.range;
 import std.stdio;
 import core.atomic;
 import core.thread;

 [...]
How many times did you run each? A single run isn't ever really useful. Also, not sure if this would account for the drastic difference, but declare long Z[] in global scope over in local scope might have some impact. If I wrote this, it would be faster, I guarantee it.
Jun 05 2016
parent reply Ausprobierer <c3550227 trbvn.com> writes:
On Sunday, 5 June 2016 at 21:58:46 UTC, Donald Drumpf wrote:
 On Sunday, 5 June 2016 at 21:52:20 UTC, Ausprobierer wrote:
 I've written this simple piece of code:

 [CODE]
 import std.algorithm;
 import std.array;
 import std.conv;
 import std.datetime;
 import std.parallelism;
 import std.range;
 import std.stdio;
 import core.atomic;
 import core.thread;

 [...]
How many times did you run each? A single run isn't ever really useful. Also, not sure if this would account for the drastic difference, but declare long Z[] in global scope over in local scope might have some impact. If I wrote this, it would be faster, I guarantee it.
With [CODE] import std.algorithm; import std.array; import std.conv; import std.datetime; import std.parallelism; import std.range; import std.stdio; import core.atomic; import core.thread; long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0]; void main() { immutable long n = 11; long l = z.length; long j = 0; [...] [/CODE] and "dmd -m64 -O -inline -release <filename>.d" I still get 40 seconds runtime.
Jun 05 2016
parent Donald Drumpf <donald.trump dt.com> writes:
On Sunday, 5 June 2016 at 22:08:21 UTC, Ausprobierer wrote:
 On Sunday, 5 June 2016 at 21:58:46 UTC, Donald Drumpf wrote:
 [...]
With [CODE] import std.algorithm; import std.array; import std.conv; import std.datetime; import std.parallelism; import std.range; import std.stdio; import core.atomic; import core.thread; long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0]; void main() { immutable long n = 11; long l = z.length; long j = 0; [...] [/CODE] and "dmd -m64 -O -inline -release <filename>.d" I still get 40 seconds runtime.
Well that's embarrassing.
Jun 05 2016
prev sibling next sibling parent reply extrawurst <stephan extrawurst.org> writes:
On Sunday, 5 June 2016 at 21:52:20 UTC, Ausprobierer wrote:
 I've written this simple piece of code:

 [CODE]
 import std.algorithm;
 import std.array;
 import std.conv;
 import std.datetime;
 import std.parallelism;
 import std.range;
 import std.stdio;
 import core.atomic;
 import core.thread;

 void main()
 {
     immutable long n = 11;

     long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0];
     long j = 0;
     long l = z.length;

     StopWatch w; w.reset(); w.start;

     do
     {
         if (z[0] == 0) break; else
         {

             long m = 0;
             for (long i = 0; i < z.length; i += 2)
             {
                 m += (z[i]*10 + z[i+1]);
             }
             if (m % n == 0) j++;
         }
     }
     while (nextPermutation(z));

     w.stop();
     j.writeln;
     w.peek().msecs.writeln;
 }
 [/CODE]

 I've compiled it with "dmd -m64 -release -inline <filename>.d".
 Output is 58679513, and it takes around 52 seconds to run.

 Then I've replicated this code in MSVS2015 C++:

 [CODE]
 #include <algorithm>
 #include <iostream>
 #include <ctime>

 using namespace std;

 const long N = 11;

 long Z[] = { 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 0, 0 };
 long L = sizeof(Z) / sizeof(Z[0]);

 int main(void)
 {
 	long j = 0;

 	clock_t t = clock();

 	do {
 		if (Z[0] == 0) break;

 		long m = 0;
 		for (long i = 0; i < L; i += 2) {
 			m += (Z[i] * 10 + Z[i + 1]);
 		}

 		if (m % N == 0) {
 			j++;
 		}
 	} while (next_permutation(Z, Z+L));

 	t = clock() - t;

 	cout << j << endl;
 	cout << (float(t)) / CLOCKS_PER_SEC << endl;

 	return 0;
 }
 [/CODE]

 Compile mode is also Release/x64. I've run the program and was 
 eagerly surprised, not to say stunned. The output is also 
 58679513, but it's done in under 6 seconds!

 Care to explain? I haven't found a solution to this :(
I never used those methods neither in D nor in C++ but I have a gut feeling it is the nextPermutation method in phobos, allocating internally (it is not nogc at least) but you can compare it easily to c++: https://github.com/dlang/phobos/blob/cdc8218e9840353f6ed79f00a63280c3002b198f/std/algorithm/sorting.d#L2786 could be less noise if you benchmark just calling this method vs. in C++ and even dig into the disassembly difference there. --stephan
Jun 05 2016
next sibling parent Ausprobierer <c3550227 trbvn.com> writes:
 I never used those methods neither in D nor in C++ but I have a 
 gut feeling it is the nextPermutation method in phobos, [...]

 --stephan
I think you're right, it probably is the nextPermutation() function: <imports> long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0]; void main() { do { if (z[0] == 0) break; } while (nextPermutation(z)); } ...takes approx. 31 seconds, compiled with "-m64 -O -release -inline".
Jun 05 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 05, 2016 at 10:29:09PM +0000, extrawurst via Digitalmars-d wrote:
[...]
 I never used those methods neither in D nor in C++ but I have a gut
 feeling it is the nextPermutation method in phobos, allocating
 internally
I wrote nextPermutation. It does not allocate. For performance, I recommend using gdc or ldc. It is not dmd's strength. T -- If blunt statements had a point, they wouldn't be blunt...
Jun 05 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/6/16 1:12 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sun, Jun 05, 2016 at 10:29:09PM +0000, extrawurst via Digitalmars-d wrote:
 [...]
 I never used those methods neither in D nor in C++ but I have a gut
 feeling it is the nextPermutation method in phobos, allocating
 internally
I wrote nextPermutation. It does not allocate. For performance, I recommend using gdc or ldc. It is not dmd's strength.
FWIW: * gdc should be used when benchmarking to ensure comparable backends. * Just looked at std.permutation (https://github.com/dlang/phobos/blob/cdc8218e9840353f6ed79f00a63280c3002b198f/std/algorit m/sorting.d#L2786), looks good but operates on complex compound ranges so it puts more pressure on the optimizer. Compare to C++'s std::next_permutation (http://stackoverflow.com/questions/11483060/stdnext-permutation-impleme tation-explanation) which directly bumps iterators etc. We also have trouble because of ranges' much discussed difficulties with iterators pointing somewhere inside them (e.g. we need an extra counter n). A specialization of nextPermutation for random-access ranges should pay off here. Andrei
Jun 05 2016
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Monday, 6 June 2016 at 06:25:19 UTC, Andrei Alexandrescu wrote:
 * Just looked at std.permutation
FWIW, there's also this: https://dlang.org/library/std/algorithm/iteration/permutations.html
Jun 06 2016
parent Seb <seb wilzba.ch> writes:
On Monday, 6 June 2016 at 10:44:10 UTC, Vladimir Panteleev wrote:
 On Monday, 6 June 2016 at 06:25:19 UTC, Andrei Alexandrescu 
 wrote:
 * Just looked at std.permutation
FWIW, there's also this: https://dlang.org/library/std/algorithm/iteration/permutations.html
and mir.combinatorics http://docs.mir.dlang.io/latest/mir_combinatorics.html if you are searching for more algorithms with range support. For all of these you can use the new Allocator API to allocate the initial state.
Jun 06 2016
prev sibling next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Sunday, 5 June 2016 at 21:52:20 UTC, Ausprobierer wrote:
 I've written this simple piece of code:

 [CODE]
 import std.algorithm;
 import std.array;
 import std.conv;
 import std.datetime;
 import std.parallelism;
 import std.range;
 import std.stdio;
 import core.atomic;
 import core.thread;

 void main()
 {
     immutable long n = 11;

     long[] z = [1,1,2,2,3,3,4,4,5,5,6,6,0,0];
     long j = 0;
     long l = z.length;

     StopWatch w; w.reset(); w.start;

     do
     {
         if (z[0] == 0) break; else
         {

             long m = 0;
             for (long i = 0; i < z.length; i += 2)
             {
                 m += (z[i]*10 + z[i+1]);
             }
             if (m % n == 0) j++;
         }
     }
     while (nextPermutation(z));

     w.stop();
     j.writeln;
     w.peek().msecs.writeln;
 }
 [/CODE]

 I've compiled it with "dmd -m64 -release -inline <filename>.d".
 Output is 58679513, and it takes around 52 seconds to run.

 Then I've replicated this code in MSVS2015 C++:

 [CODE]
 #include <algorithm>
 #include <iostream>
 #include <ctime>

 using namespace std;

 const long N = 11;

 long Z[] = { 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 0, 0 };
 long L = sizeof(Z) / sizeof(Z[0]);

 int main(void)
 {
 	long j = 0;

 	clock_t t = clock();

 	do {
 		if (Z[0] == 0) break;

 		long m = 0;
 		for (long i = 0; i < L; i += 2) {
 			m += (Z[i] * 10 + Z[i + 1]);
 		}

 		if (m % N == 0) {
 			j++;
 		}
 	} while (next_permutation(Z, Z+L));

 	t = clock() - t;

 	cout << j << endl;
 	cout << (float(t)) / CLOCKS_PER_SEC << endl;

 	return 0;
 }
 [/CODE]

 Compile mode is also Release/x64. I've run the program and was 
 eagerly surprised, not to say stunned. The output is also 
 58679513, but it's done in under 6 seconds!

 Care to explain? I haven't found a solution to this :(
On OS X using ldc and clang, I get C++: ~6s and D: ~7s. The slowdown in D seems to be due to parts of nextPermutation not ending up inlined. Be careful with benchmarks like this, you are giving the compiler a lot more information than it usually has in any real world case (here it knows the exact values of all the input data/parameters!).
Jun 05 2016
next sibling parent Ausprobierer <c3553265 trbvn.com> writes:
 On OS X using ldc and clang, I get C++: ~6s and D: ~7s. The 
 slowdown in D seems to be due to parts of nextPermutation not 
 ending up inlined.

 Be careful with benchmarks like this, you are giving the 
 compiler a lot more information than it usually has in any real 
 world case (here it knows the exact values of all the input 
 data/parameters!).
I've compiled and run the code on Win7 x64. I think it's a compiler issue then, with DMD causing the same slow behaviour on Linux x64. If so, I should possibly resort to e. g. LDC.
Jun 05 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Sunday, 5 June 2016 at 22:34:47 UTC, John Colvin wrote:
 Be careful with benchmarks like this, you are giving the 
 compiler a lot more information than it usually has in any real 
 world case (here it knows the exact values of all the input 
 data/parameters!).
+1 ! For the input data, I think you could have a function return the data and mark that function with weak linkage so the compiler cannot reason about it nor inline it. -Johan
Jun 06 2016
prev sibling parent Max Samukha <maxsamukha gmail.com> writes:
On Sunday, 5 June 2016 at 21:52:20 UTC, Ausprobierer wrote:

 Then I've replicated this code in MSVS2015 C++:
Note that 'long' in D is 64 bits, whereas in MSVC it is 32 bits.
Jun 06 2016