digitalmars.D.learn - math.log() benchmark of first 1 billion int using std.parallelism

reply "Iov Gherman" <iovisx gmail.com> writes:
Hi everybody,

I am a Java developer and have used C/C++ only for some home 
projects, so I never mastered native programming.

I am currently learning D and I find it fascinating. I was 
reading the documentation about std.parallelism and I wanted to 
experiment a bit with the example "Find the logarithm of every 
number from 1 to 10_000_000 in parallel".

So, first, I changed the limit to 1 billion and ran it. I was 
blown away by the performance: the program ran in 4 secs, 670 ms, 
using a workUnitSize of 200. I have a 4th generation i7 
processor with 8 cores.
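
The D version is essentially the documentation example scaled up 
to 1 billion; roughly something like this (a sketch, not the 
exact code I ran):

    import std.parallelism, std.math, std.datetime, std.stdio;

    void main() {
        auto t1 = Clock.currTime();
        // 1 billion doubles, roughly 8 GB
        auto logs = new double[1_000_000_000];
        // workUnitSize of 200: each task handles 200 elements
        foreach (i, ref elem; taskPool.parallel(logs, 200)) {
            elem = log(i + 1.0);
        }
        writeln("time: ", Clock.currTime() - t1);
    }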

Then I was curious to try the same test in Java, just to see how 
much slower it would be (at least that was what I expected). I 
used Java's ExecutorService with a thread pool of size 8 and 
created 5_000_000 tasks, each task calculating log() for 200 
numbers. The whole program ran in 3 secs, 315 ms.

Now, can anyone explain why this program ran faster in Java? I 
ran both programs multiple times and the results were always 
close to these execution times.

Can the implementation of the log() function be the reason for 
the slower execution time in D?

I then decided to run the same program in a single thread, a 
simple foreach/for loop. I also tried it in C and Go. These are 
the results:
- D: 24 secs, 32 ms.
- Java: 20 secs, 881 ms.
- C: 21 secs
- Go: 37 secs

I run Arch Linux on my PC. I compiled the D programs using 
dmd-2.066 with no compile arguments (dmd prog.d).
I used Oracle's Java 8 (I also tried 7 and 6; it seems that with 
Java 6 the performance is a bit better than with 7 and 8).
To compile the C program I used gcc 4.9.2.
For the Go program I used go 1.4.

I really, really like the built-in support in D for parallel 
processing and how easy it is to schedule tasks taking advantage 
of workUnitSize.

Thanks,
Iov
Dec 22 2014
next sibling parent "bachmeier" <no spam.net> writes:
On Monday, 22 December 2014 at 10:12:52 UTC, Iov Gherman wrote:
 Now, can anyone explain why this program ran faster in Java? I 
 ran both programs multiple times and the results were always 
 close to this execution times.

 Can the implementation of log() function be the reason for a 
 slower execution time in D?

 I then decided to ran the same program in a single thread, a 
 simple foreach/for loop. I tried it in C and Go also. This are 
 the results:
 - D: 24 secs, 32 ms.
 - Java: 20 secs, 881 ms.
 - C: 21 secs
 - Go: 37 secs

 I run Arch Linux on my PC. I compiled D programs using 
 dmd-2.066 and used no compile arguments (dmd prog.d).
 I used Oracle's Java 8 (tried 7 and 6, seems like with Java 6 
 the performance is a bit better then 7 and 8).
 To compile the C program I used: gcc 4.9.2
 For Go program I used go 1.4

 I really really like the built in support in D for parallel 
 processing and how easy is to schedule tasks taking advantage 
 of workUnitSize.

 Thanks,
 Iov
DMD is generally going to produce the slowest code. LDC and GDC will normally do better.
Dec 22 2014
prev sibling next sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
 I run Arch Linux on my PC. I compiled D programs using dmd-2.066 
 and used no compile arguments (dmd prog.d)
You should try using some arguments: -O -release -inline -noboundscheck. And maybe trying gdc or ldc should help with performance. Can you post your code in all languages somewhere? I would like to try it on my machine :)
Dec 22 2014
parent reply "Daniel Kozak" <kozzi11 gmail.com> writes:
On Monday, 22 December 2014 at 10:35:52 UTC, Daniel Kozak via 
Digitalmars-d-learn wrote:
 I run Arch Linux on my PC. I compiled D programs using 
 dmd-2.066 and used no compile arguments (dmd prog.d)
You should try use some arguments -O -release -inline -noboundscheck and maybe try use gdc or ldc should help with performance can you post your code in all languages somewhere? I like to try it on my machine :)
Btw. try using the C log function, maybe it would be faster: import core.stdc.math;
Dec 22 2014
parent reply "aldanor" <i.s.smirnov gmail.com> writes:
On Monday, 22 December 2014 at 10:40:45 UTC, Daniel Kozak wrote:
 On Monday, 22 December 2014 at 10:35:52 UTC, Daniel Kozak via 
 Digitalmars-d-learn wrote:
 I run Arch Linux on my PC. I compiled D programs using 
 dmd-2.066 and used no compile arguments (dmd prog.d)
You should try use some arguments -O -release -inline -noboundscheck and maybe try use gdc or ldc should help with performance can you post your code in all languages somewhere? I like to try it on my machine :)
Btw. try use C log function, maybe it would be faster: import core.stdc.math;
Just tried it out myself (E5 Xeon / Linux):

D version: 19.64 sec (avg 3 runs)

    import core.stdc.math;

    void main() {
        double s = 0;
        foreach (i; 1 .. 1_000_000_000)
            s += log(i);
    }

    // build flags: -O -release

C version: 19.80 sec (avg 3 runs)

    #include <math.h>

    int main() {
        double s = 0;
        long i;
        for (i = 1; i < 1000000000; i++)
            s += log(i);
        return 0;
    }

    // build flags: -O3 -lm
Dec 22 2014
parent reply "aldanor" <i.s.smirnov gmail.com> writes:
On Monday, 22 December 2014 at 11:11:07 UTC, aldanor wrote:
 Just tried it out myself (E5 Xeon / Linux):

 D version: 19.64 sec (avg 3 runs)

     import core.stdc.math;

     void main() {
         double s = 0;
         foreach (i; 1 .. 1_000_000_000)
             s += log(i);
     }

     // build flags: -O -release

 C version: 19.80 sec (avg 3 runs)

     #include <math.h>

     int main() {
         double s = 0;
         long i;
         for (i = 1; i < 1000000000; i++)
             s += log(i);
         return 0;
     }

     // build flags: -O3 -lm
Replacing "import core.stdc.math" with "import std.math" in the 
D example increases the avg runtime from 19.64 to 23.87 seconds 
(~20% slower), which is consistent with the OP's statement.
Dec 22 2014
parent "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
 Replacing "import core.stdc.math" with "import std.math" in the 
 D example increases the avg runtime from 19.64 to 23.87 seconds 
 (~20% slower) which is consistent with OP's statement.
+ GDC/LDC vs DMD
+ nobounds, release

Do you think we should start a topic on the D wiki front page for benchmarking/performance tips, to organize people's experience of what works? I took a quick look and couldn't see anything already. It seems to be a topic that comes up quite frequently (less on the forum than people doing their own benchmarks and it getting picked up on reddit etc.). I am not so experienced in this area, otherwise I would write a first draft myself.

Laeeth
Dec 22 2014
prev sibling next sibling parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 2014-12-22 at 10:12 +0000, Iov Gherman via Digitalmars-d-learn wrote:
 […]
 - D: 24 secs, 32 ms.
 - Java: 20 secs, 881 ms.
 - C: 21 secs
 - Go: 37 secs
 
Without the source codes and the commands used to create and run, it is impossible to offer constructive criticism of the results. However, a priori the above does not surprise me.

I'll wager ldc2 or gdc will beat dmd for CPU-bound code, so as others have said, for benchmarking use ldc2 or gdc with all optimization on (-O3). If you used gc for Go then switch to gccgo (again with -O3) and see a huge performance improvement on CPU-bound code.

Java beating C and C++ is fairly normal these days due to the tricks you can play with JIT over AOT optimization. Once Java has proper support for GPGPU, it will be hard for native code languages to get any new converts from JVM.

Put the source up and I and others will try things out.

-- 
Russel.
=============================================================================
Dr Russel Winder                  t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road                m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK               w: www.russel.org.uk   skype: russel_winder
Dec 22 2014
prev sibling parent reply "Iov Gherman" <iovisx gmail.com> writes:
Hi Guys,

First of all, thank you all for responding so quickly; it is so 
nice to see D having such an active community.

As I said in my first post, I used no other parameters to dmd 
when compiling because I don't know too much about dmd 
compilation flags. I can't wait to try the flags Daniel suggested 
with dmd (-O -release -inline -noboundscheck) and the other two 
compilers (ldc2 and gdc). Thank you guys for your suggestions.

Meanwhile, I created a git repository on GitHub and put all my 
code there. If you find any errors, please let me know. Because I 
am keeping the results in a big array, the programs take 
approximately 8 GB of RAM. If you don't have enough RAM, feel 
free to decrease the size of the array. For the Java code you 
will also need to change 'compile-run.bsh' and use the right 
memory parameters.


Thank you all for helping,
Iov
Dec 22 2014
parent reply "bachmeier" <no spam.com> writes:
On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
 Hi Guys,

 First of all, thank you all for responding so quick, it is so 
 nice to see D having such an active community.

 As I said in my first post, I used no other parameters to dmd 
 when compiling because I don't know too much about dmd 
 compilation flags. I can't wait to try the flags Daniel 
 suggested with dmd (-O -release -inline -noboundscheck) and the 
 other two compilers (ldc2 and gdc). Thank you guys for your 
 suggestions.

 Meanwhile, I created a git repository on github and I put there 
 all my code. If you find any errors please let me know. Because 
 I am keeping the results in a big array the programs take 
 approximately 8Gb of RAM. If you don't have enough RAM feel 
 free to decrease the size of the array. For java code you will 
 also need to change 'compile-run.bsh' and use the right memory 
 parameters.


 Thank you all for helping,
 Iov
Link to your repo?
Dec 22 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
On Monday, 22 December 2014 at 17:16:05 UTC, bachmeier wrote:
 On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
 Hi Guys,

 First of all, thank you all for responding so quick, it is so 
 nice to see D having such an active community.

 As I said in my first post, I used no other parameters to dmd 
 when compiling because I don't know too much about dmd 
 compilation flags. I can't wait to try the flags Daniel 
 suggested with dmd (-O -release -inline -noboundscheck) and 
 the other two compilers (ldc2 and gdc). Thank you guys for 
 your suggestions.

 Meanwhile, I created a git repository on github and I put 
 there all my code. If you find any errors please let me know. 
 Because I am keeping the results in a big array the programs 
 take approximately 8Gb of RAM. If you don't have enough RAM 
 feel free to decrease the size of the array. For java code you 
 will also need to change 'compile-run.bsh' and use the right 
 memory parameters.


 Thank you all for helping,
 Iov
Link to your repo?
Sorry, forgot about it: https://github.com/ghermaniov/benchmarks
Dec 22 2014
next sibling parent reply "Iov Gherman" <iovisx gmail.com> writes:
So, I did some more testing with the one processing in parallel:

--- dmd:
4 secs, 977 ms

--- dmd with flags: -O -release -inline -noboundscheck:
4 secs, 635 ms

--- ldc:
6 secs, 271 ms

--- gdc:
10 secs, 439 ms

I also pushed the new bash scripts to the git repository.
Dec 22 2014
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
Flag suggestions:

ldc2 -O3 -release -mcpu=native -singleobj
gdc -O3 -frelease -march=native
Dec 22 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
On Monday, 22 December 2014 at 17:50:20 UTC, John Colvin wrote:
 On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
Flag suggestions: ldc2 -O3 -release -mcpu=native -singleobj gdc -O3 -frelease -march=native
Tried it, here are the results:

--- ldc:
6 secs, 271 ms

--- ldc -O3 -release -mcpu=native -singleobj:
5 secs, 686 ms

--- gdc:
10 secs, 439 ms

--- gdc -O3 -frelease -march=native:
9 secs, 180 ms
Dec 22 2014
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 22 December 2014 at 18:27:48 UTC, Iov Gherman wrote:
 On Monday, 22 December 2014 at 17:50:20 UTC, John Colvin wrote:
 On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in 
 paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
Flag suggestions: ldc2 -O3 -release -mcpu=native -singleobj gdc -O3 -frelease -march=native
Tried it, here are the results: --- ldc: 6 secs, 271 ms --- ldc -O3 -release -mcpu=native -singleobj: 5 secs, 686 ms --- gdc: 10 secs, 439 ms --- gdc -O3 -frelease -march=native: 9 secs, 180 ms
That's very different from my results.

I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math, ldc halves its time, whereas dmd only gets down to ~80% of its original time.
Dec 22 2014
next sibling parent reply "Daniel Kozak" <kozzi11 gmail.com> writes:
 That's very different to my results.

 I see no important difference between ldc and dmd when using 
 std.math, but when using core.stdc.math ldc halves its time 
 where dmd only manages to get to ~80%
What CPU do you have? On my Intel Core i3 I have a similar experience to Iov Gherman, but on my AMD FX4200 I get the same results as you. It seems std.math.log is not good for my AMD CPU :)
Dec 22 2014
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 23 December 2014 at 07:26:27 UTC, Daniel Kozak wrote:
 That's very different to my results.

 I see no important difference between ldc and dmd when using 
 std.math, but when using core.stdc.math ldc halves its time 
 where dmd only manages to get to ~80%
What CPU do you have? On my Intel Core i3 I have similar experience as Iov Gherman, but on my Amd FX4200 I have same results as you. Seems std.math.log is not good for my AMD CPU :)
Intel Core i5-4278U
Dec 23 2014
prev sibling parent reply "Iov Gherman" <iovisx gmail.com> writes:
 That's very different to my results.

 I see no important difference between ldc and dmd when using 
 std.math, but when using core.stdc.math ldc halves its time 
 where dmd only manages to get to ~80%
I checked again today and the results are interesting: on my PC I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers.

- with std.math:
dmd: 4 secs, 878 ms
ldc: 5 secs, 650 ms
gdc: 9 secs, 161 ms

- with core.stdc.math:
dmd: 5 secs, 991 ms
ldc: 5 secs, 572 ms
gdc: 7 secs, 957 ms
Dec 23 2014
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
 That's very different to my results.

 I see no important difference between ldc and dmd when using 
 std.math, but when using core.stdc.math ldc halves its time 
 where dmd only manages to get to ~80%
I checked again today and the results are interesting, on my pc I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers. - with std.math: dmd: 4 secs, 878 ms ldc: 5 secs, 650 ms gdc: 9 secs, 161 ms - with core.stdc.math: dmd: 5 secs, 991 ms ldc: 5 secs, 572 ms gdc: 7 secs, 957 ms
These multi-threaded benchmarks can be very sensitive to their environment; you should try running them with nice -20 and do multiple passes to get a vague idea of the variability in the result. Also, it's important to minimise the number of other running processes.
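
For example, something along these lines (just a sketch; "workload" here is a placeholder for whatever loop is being measured):

    import std.datetime, std.stdio;

    void workload() {
        // placeholder: the benchmark body under test goes here
    }

    void main() {
        enum passes = 5;
        Duration best = Duration.max;
        foreach (p; 0 .. passes) {
            auto t1 = Clock.currTime();
            workload();
            auto elapsed = Clock.currTime() - t1;
            if (elapsed < best) best = elapsed;
            writeln("pass ", p + 1, ": ", elapsed);
        }
        writeln("best: ", best);
    }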
Dec 23 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
 These multi-threaded benchmarks can be very sensitive to their 
 environment, you should try running it with nice -20 and do 
 multiple passes to get a vague idea of the variability in the 
 result. Also, it's important to minimise the number of other 
 running processes.
I did not use the nice parameter, but I always ran them multiple times and chose the average time. My system has very few running processes (minimalist Arch Linux with Xfce4), so I don't think the running processes are affecting my tests in any way.
Dec 23 2014
parent reply "Daniel Kozak" <kozzi11 gmail.com> writes:
On Tuesday, 23 December 2014 at 10:39:13 UTC, Iov Gherman wrote:
 These multi-threaded benchmarks can be very sensitive to their 
 environment, you should try running it with nice -20 and do 
 multiple passes to get a vague idea of the variability in the 
 result. Also, it's important to minimise the number of other 
 running processes.
I did not use the nice parameter but I always ran them multiple times and choose the average time. My system has very few running processes, minimalist ArchLinux with Xfce4 so I don't think the running processes are affecting in any way my tests.
And what about the single-threaded version?

Btw. one reason why DMD is faster is that it uses the fyl2x x87 instruction. Here is a version for the other compilers:

    import std.math, std.stdio, std.datetime;

    enum SIZE = 100_000_000;

    version(GNU) {
        real mylog(double x) pure nothrow {
            real result;
            double y = LN2;
            asm {
                "fldl %2\n"
                "fldl %1\n"
                "fyl2x"
                : "=t" (result)
                : "m" (x), "m" (y);
            }
            return result;
        }
    } else {
        real mylog(double x) pure nothrow {
            return yl2x(x, LN2);
        }
    }

    void main() {
        auto t1 = Clock.currTime();
        auto logs = new double[SIZE];
        foreach (i; 0 .. SIZE) {
            logs[i] = mylog(i + 1.0);
        }
        auto t2 = Clock.currTime();
        writeln("time: ", (t2 - t1));
    }

But it is faster only on Intel CPUs; on one of my AMD machines it is slower than the core.stdc.math log.
Dec 23 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
 And what about single threaded version?
Just ran the single-thread examples after I moved the time start before the array allocation, thanks for that, good catch. Still better results in Java:

- java: 21 secs, 612 ms

- with std.math:
dmd: 23 secs, 994 ms
ldc: 31 secs, 668 ms
gdc: 52 secs, 576 ms

- with core.stdc.math:
dmd: 30 secs, 724 ms
ldc: 30 secs, 988 ms
gdc: 25 secs, 970 ms
Dec 23 2014
parent "Ola Fosheim Grøstad" writes:
On Tuesday, 23 December 2014 at 12:26:28 UTC, Iov Gherman wrote:
 And what about single threaded version?
Just ran the single thread examples after I moved time start before array allocation, thanks for that, good catch. Still better results in Java: - java: 21 secs, 612 ms - with std.math: dmd: 23 secs, 994 ms ldc: 31 secs, 668 ms gdc: 52 secs, 576 ms - with core.stdc.math: dmd: 30 secs, 724 ms ldc: 30 secs, 988 ms gdc: time: 25 secs, 970 ms
Note that log is done in software on x86 with different levels of precision and with different ability to handle corner cases. It is therefore a very bad benchmark tool.
Dec 23 2014
prev sibling parent reply "Daniel Kozak" <kozzi11 gmail.com> writes:
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
 That's very different to my results.

 I see no important difference between ldc and dmd when using 
 std.math, but when using core.stdc.math ldc halves its time 
 where dmd only manages to get to ~80%
I checked again today and the results are interesting, on my pc I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers. - with std.math: dmd: 4 secs, 878 ms ldc: 5 secs, 650 ms gdc: 9 secs, 161 ms - with core.stdc.math: dmd: 5 secs, 991 ms ldc: 5 secs, 572 ms gdc: 7 secs, 957 ms
Btw. I just noticed a small issue with D vs. Java: you start measuring in D before the allocation, but in the Java case after the allocation.
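
Roughly this difference (just an illustration, not the exact code from the repo):

    import std.datetime;

    void main() {
        auto t1 = Clock.currTime();              // the D code starts its clock here...
        auto logs = new double[1_000_000_000];   // ...so the ~8 GB allocation is included in the D time
        // ... compute the logs ...
        auto t2 = Clock.currTime();
        // The Java version only starts its timer after the array has been
        // allocated, so the two programs are not measuring the same work.
    }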
Dec 23 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
 Btw. I just noticed small issue with D vs. java, you start 
 messure in D before allocation, but in case of Java after 
 allocation
Here is the Java result for parallel processing after moving the start time to the first line of main. Still the best result: 4 secs, 50 ms average
Dec 23 2014
next sibling parent "Iov Gherman" <iovisx gmail.com> writes:
Forgot to mention that I pushed my changes to github.
Dec 23 2014
prev sibling parent reply "Daniel Kozak" <kozzi11 gmail.com> writes:
On Tuesday, 23 December 2014 at 12:31:47 UTC, Iov Gherman wrote:
 Btw. I just noticed small issue with D vs. java, you start 
 messure in D before allocation, but in case of Java after 
 allocation
Here is the java result for parallel processing after moving the start time as the first line in main. Still best result: 4 secs, 50 ms average
Java:
Exec time: 6 secs, 421 ms

LDC (-O3 -release -mcpu=native -singleobj -inline -boundscheck=off):
time: 5 secs, 321 ms, 877 μs, and 2 hnsecs

GDC (-O3 -frelease -march=native -finline -fno-bounds-check):
time: 5 secs, 237 ms, 453 μs, and 7 hnsecs

DMD (-O -release -inline -noboundscheck):
time: 5 secs, 107 ms, 931 μs, and 3 hnsecs

So all D compilers beat Java in my case, but I have made some changes in the D version:

    import std.parallelism, std.math, std.stdio, std.datetime;
    import core.memory;

    enum XMS = 3*1024*1024*1024; //3GB

    version(GNU) {
        real mylog(double x) pure nothrow {
            double result;
            double y = LN2;
            asm {
                "fldl %2\n"
                "fldl %1\n"
                "fyl2x\n"
                : "=t" (result)
                : "m" (x), "m" (y);
            }
            return result;
        }
    } else {
        real mylog(double x) pure nothrow {
            return yl2x(x, LN2);
        }
    }

    void main() {
        GC.reserve(XMS);
        auto t1 = Clock.currTime();
        auto logs = new double[1_000_000_000];

        foreach (i, ref elem; taskPool.parallel(logs, 200)) {
            elem = mylog(i + 1.0);
        }
        auto t2 = Clock.currTime();
        writeln("time: ", (t2 - t1));
    }
Dec 23 2014
parent "Casper Færgemand" <shorttail hotmail.com> writes:
I'm getting faster execution on Java than dmd; gdc beats it 
though.

...although, what this topic really provides is a reason for me 
to get more RAM for my next laptop. How much do you people run 
with? I had to scale the Java version down to 300 million to 
avoid dying with 4G of memory.
Dec 23 2014
prev sibling parent reply "aldanor" <i.s.smirnov gmail.com> writes:
On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".
Dec 22 2014
parent reply "Iov Gherman" <iovisx gmail.com> writes:
On Monday, 22 December 2014 at 18:00:18 UTC, aldanor wrote:
 On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".
Tried it, it is worse: 6 secs, 78 ms, while the initial one was 4 secs, 977 ms and sometimes even better.
Dec 22 2014
parent "Marc Schütz" <schuetzm gmx.net> writes:
On Monday, 22 December 2014 at 18:23:29 UTC, Iov Gherman wrote:
 On Monday, 22 December 2014 at 18:00:18 UTC, aldanor wrote:
 On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
 So, I did some more testing with the one processing in 
 paralel:

 --- dmd:
 4 secs, 977 ms

 --- dmd with flags: -O -release -inline -noboundscheck:
 4 secs, 635 ms

 --- ldc:
 6 secs, 271 ms

 --- gdc:
 10 secs, 439 ms

 I also pushed the new bash scripts to the git repository.
import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".
Tried it, it is worst: 6 secs, 78 ms while the initial one was 4 secs, 977 ms and sometimes even better.
Strange... for me, core.stdc.math.log is about twice as fast as std.math.log.
Dec 22 2014
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 22 December 2014 at 17:16:49 UTC, Iov Gherman wrote:
 On Monday, 22 December 2014 at 17:16:05 UTC, bachmeier wrote:
 On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
 Hi Guys,

 First of all, thank you all for responding so quick, it is so 
 nice to see D having such an active community.

 As I said in my first post, I used no other parameters to dmd 
 when compiling because I don't know too much about dmd 
 compilation flags. I can't wait to try the flags Daniel 
 suggested with dmd (-O -release -inline -noboundscheck) and 
 the other two compilers (ldc2 and gdc). Thank you guys for 
 your suggestions.

 Meanwhile, I created a git repository on github and I put 
 there all my code. If you find any errors please let me know. 
 Because I am keeping the results in a big array the programs 
 take approximately 8Gb of RAM. If you don't have enough RAM 
 feel free to decrease the size of the array. For java code 
 you will also need to change 'compile-run.bsh' and use the 
 right memory parameters.


 Thank you all for helping,
 Iov
Link to your repo?
Sorry, forgot about it: https://github.com/ghermaniov/benchmarks
For posix-style threads, a per-thread workload of 200 calls to log seems rather small. It would be interesting to see a graph of execution time as a function of workgroup size.

Traditionally one would use a workgroup size of (nElements / nCores) or similar, in order to get all the cores working while also minimising pressure on the scheduler, inter-thread communication and so on.
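
For this particular benchmark that would look roughly like this (a sketch using totalCPUs from std.parallelism, not tested against the repo code):

    import std.parallelism, std.math;

    void main() {
        auto logs = new double[1_000_000_000];
        // one contiguous chunk per core instead of millions of 200-element tasks
        immutable workUnitSize = logs.length / totalCPUs;
        foreach (i, ref elem; taskPool.parallel(logs, workUnitSize)) {
            elem = log(i + 1.0);
        }
    }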
Dec 23 2014