www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - summing large arrays - or - performance of tight loops

reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've done a few tests on how DMD and GDC handle tight loops
and was surprised by the slow code DMD generates for "foreach".

results and source code:
http://dstress.kuehne.cn/benchmark/tight_loops/sum.html

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGJkFyLK5blCcjpWoRAgM4AJ9JnbxtHTlBFwXdWfwfcuShxi/2iACfVaxI
ULxRdOI+28mzZXCbv8RiPO8=
=Z0K0
-----END PGP SIGNATURE-----
Apr 18 2007
next sibling parent reply "Craig Black" <cblack ara.com> writes:
Not Found

The requested URL /benchmark/tight_loops/sum.html was not found on this 
server.

-Craig 
Apr 18 2007
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Craig Black schrieb am 2007-04-18:
 Not Found

 The requested URL /benchmark/tight_loops/sum.html was not found on this 
 server.

Seems to be some server issue, backup can be found here: http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html Thomas -----BEGIN PGP SIGNATURE----- iD8DBQFGJkfJLK5blCcjpWoRAnMnAJ0dabQGW2edXBXlXVbXqqUSlgUmjACgqJUh kh2pP8ZmgultrR2U+Cusgyc= =tUSf -----END PGP SIGNATURE-----
Apr 18 2007
next sibling parent reply "Craig Black" <cblack ara.com> writes:
Good work on this.  It is certainly a problem for Walter if he sticks with 
his "trust the compiler" philiosphy.  Since the DMD documentation claims 
that foreach should emit optimized code, and it clearly doesn't in these 
cases, I think you could submit this as a bug.

-Craig 
Apr 18 2007
next sibling parent reply Dan <murpsoft hotmail.com> writes:
Indeed, this is clearly an optimization problem.  I agree that we should put
our faith in the compiler because it's bad to undermine it to try to write
optimal code when the compiler may still be improved to solve the problem
*properly*

That said, DMDScript undermines associative arrays by completely
re-implementing them with a tweak - instead of using opIndex and opIndexAssign.
 : p

But yes, let's work on getting the compiler to Do The Right Thing(tm) so we
don't have to kludge our code, mm?
Apr 18 2007
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan schrieb am 2007-04-18:
 Indeed, this is clearly an optimization problem.  I agree that we should
 put our faith in the compiler because it's bad to undermine it to try to
 write optimal code when the compiler may still be improved to solve the
 problem *properly*

 That said, DMDScript undermines associative arrays by completely
 re-implementing them with a tweak - instead of using opIndex and
 opIndexAssign.  : p

 But yes, let's work on getting the compiler to Do The Right Thing(tm) so
 we don't have to kludge our code, mm?

Originally I needed to find the fastest way to iterate repeatedly over large arrays and apply trivial operations to each element. I've put the results only so that others can have a look to, not to prioritise compiler tweaks over bug fixing. Thomas -----BEGIN PGP SIGNATURE----- iD8DBQFGKhTGLK5blCcjpWoRAg2XAJ4w03fB+vup/GKPkV/MBtLKb7oVowCeNz8Z opdrrDv1fXLJLQrx9TEZYKg= =Zr9Q -----END PGP SIGNATURE-----
Apr 23 2007
prev sibling parent reply Dave <Dave_member pathlink.com> writes:
Craig Black wrote:
 Good work on this.  It is certainly a problem for Walter if he sticks with 
 his "trust the compiler" philiosphy.  Since the DMD documentation claims 

Walter consistently mentioned that a) most of the DM compiler back-end is 20 years old and b) is not optimized for D yet. I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations. Plus for integer stuff it seems to do pretty well (GCC is no slouch). That said, both foreach and FP should perform better!
 that foreach should emit optimized code, and it clearly doesn't in these 
 cases, I think you could submit this as a bug.
 
 -Craig 
 

Apr 18 2007
parent reply "Craig Black" <cblack ara.com> writes:
  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig
Apr 19 2007
next sibling parent reply Dan <murpsoft hotmail.com> writes:
Craig Black Wrote:

  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig

I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. : )
Apr 19 2007
parent reply Dave <Dave_member pathlink.com> writes:
Dan wrote:
 Craig Black Wrote:
 
  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

is preferred over other looping mechanisms because the compiler will optimize it. -Craig

I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. : )

You might want to try it out first before making any judgments anyhow. I tried a couple of the benchmark tests and 'foreach' performed the same as 'array' for example. Mine were ran on a P4 chip and an AMD64 chip (original was Turion).
Apr 19 2007
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

It is disgusting to see, that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD1.012. Quotients (foreach/other): Thomas: 200% Dave: 100% me: 85% What causes those huge differences? Here is my main: import std.stdio, std.perf; void main(){ const int f= 256; int[ ] a= new int[ 1024*1024*f]; auto c=new PerformanceCounter; auto c2=new PerformanceCounter; c.start; for(int i= 1; i<=5120/f; i++) volatile sum_foreach( a); c.stop; volatile auto t= c.microseconds; c2.start; for(int i= 1; i<=5120/f; i++) volatile sum_array( a); c2.stop; writefln( "%s %s %s" , cast(real)t/c2.microseconds , t , c2.microseconds ); } -manfred
Apr 19 2007
parent reply Dave <Dave_member pathlink.com> writes:
Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

It is disgusting to see, that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD1.012. Quotients (foreach/other): Thomas: 200% Dave: 100% me: 85% What causes those huge differences? Here is my main: import std.stdio, std.perf; void main(){ const int f= 256; int[ ] a= new int[ 1024*1024*f]; auto c=new PerformanceCounter; auto c2=new PerformanceCounter; c.start; for(int i= 1; i<=5120/f; i++) volatile sum_foreach( a); c.stop; volatile auto t= c.microseconds; c2.start; for(int i= 1; i<=5120/f; i++) volatile sum_array( a); c2.stop; writefln( "%s %s %s" , cast(real)t/c2.microseconds , t , c2.microseconds ); } -manfred

I was using a simple 1024 x 1024 size array before. Using a 1024 x 1024 x 64 array, I got: P4: 97% (linux32 FC5) AMD64: 92% (WinXP32) So, the array size seems to make some difference, at least on AMD machines. Here's all of the code I tested with: import std.date, std.perf, std.stdio; void main() { int[] arr = new int[1024*1024*64]; int result1 = 0, result2 = 0; // initialize with values that won't cause an integer overflow foreach(i, ref e; arr) e = i % 5 + 1; version(perftime) { auto t1 = new PerformanceCounter; t1.start; for(auto i = 0; i < 10; i++) result1 = sum_foreach(arr); t1.stop; auto t2 = new PerformanceCounter; t2.start; for(auto i = 0; i < 10; i++) result2 = sum_array(arr); t2.stop; writefln("%s %s %s %s %s" , cast(real)t1.microseconds/t2.microseconds , t1.microseconds , t2.microseconds , result1 , result2 ); } else { d_time s1 = getUTCtime; for(auto i = 0; i < 10; i++) result1 = sum_foreach(arr); d_time e1 = getUTCtime; d_time s2 = getUTCtime; for(auto i = 0; i < 10; i++) result2 = sum_array(arr); d_time e2 = getUTCtime; writefln("%s %s %s %s %s" , cast(real)(e1-s1)/(e2-s2) , e1-s1 , e2-s2 , result1 , result2 ); } } T sum_foreach(T)(T[] data) { T result = 0; foreach(element; data) { result += element; } return result; } T sum_array(T)(T[] data) { T result = 0; size_t index = 0; while(index < data.length) { result += data[index]; index++; } return result; }
Apr 19 2007
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dave schrieb am 2007-04-19:
 Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

It is disgusting to see, that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD1.012. Quotients (foreach/other): Thomas: 200% Dave: 100% me: 85% What causes those huge differences? Here is my main:


<snip>
 I was using a simple 1024 x 1024 size array before.

 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser extend on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double. The benchmarking was done as follows: dmd sum.html -O -release -inline -version=benchmark ./sum 2> /dev/null Thomas -----BEGIN PGP SIGNATURE----- iD8DBQFGKhZsLK5blCcjpWoRAha5AJ40lu7apkLufJgfi8wFYiFfIuhWTACdGUJg 3MiQxN7N15GlC+J6OEilwuM= =dXko -----END PGP SIGNATURE-----
Apr 23 2007
next sibling parent reply Dan <murpsoft hotmail.com> writes:
 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser extend on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.

Actually, the size of the data type doesn't matter at all for a properly implemented algorithm - as a general rule, you implement a duff's device to align and then use the largest sized instruction you can fit. Right now the SSE2 instruction "movaps" is quite effective for copying memory. Also, each operating system implements Page tracking differently. Some do it by inserting some metadata onto the Page itself. For that reason, implementing 4kb arrays can perform really well on some OS's, and very very poorly on others (they take 2 pages when you think they're taking 1, so cache and page misses screw you up) For that reason, it can (very rarely but occassionally) actually improve performance to *not* use a power of two array, but something just short of it. Sincerely, Dan
Apr 23 2007
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan schrieb am 2007-04-23:
 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser extend on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.

Actually, the size of the data type doesn't matter at all for a properly implemented algorithm - as a general rule, you implement a duff's device to align and then use the largest sized instruction you can fit. Right now the SSE2 instruction "movaps" is quite effective for copying memory.

That's what I thought too, but while my SSE version for float and double didn't have the worst performance they were by no means the fastest. Thomas -----BEGIN PGP SIGNATURE----- iD8DBQFGLRkOLK5blCcjpWoRAkhDAKCb4IU0RG6HTzL1DywM4yClWwK9eACfYshW vdYZYS3eIKhBLclsDOyq19M= =n+9d -----END PGP SIGNATURE-----
Apr 23 2007
prev sibling parent Dave <Dave_member pathlink.com> writes:
Thomas Kuehne wrote:
 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 Dave schrieb am 2007-04-19:
 Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD1.012. Quotients (foreach/other): Thomas: 200% Dave: 100% me: 85% What causes those huge differences? Here is my main:


<snip>
 I was using a simple 1024 x 1024 size array before.

 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser

Yea, with "...the array size seems to make some difference...", I was referring to slight differences between tests on my machines.
 extend on the element values. I've put an updated version online that
 contains results for byte, short, int, long, float and double.
 

I agree, I think the big (relative foreach/other) difference between machines is probably because of different cache sizes, or it could be alignment issues I guess. All this seems to indicate that the code generated for foreach is more sensitive to mem. arch. than the other loops. That said, three of the four architectures spoken of in this thread had foreach perform as well as or better than the common while-loop anyway.
 The benchmarking was done as follows:
 
 dmd sum.html -O -release -inline -version=benchmark
 ./sum 2> /dev/null
 
 Thomas
 
 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGKhZsLK5blCcjpWoRAha5AJ40lu7apkLufJgfi8wFYiFfIuhWTACdGUJg
 3MiQxN7N15GlC+J6OEilwuM=
 =dXko
 -----END PGP SIGNATURE-----

Apr 23 2007
prev sibling parent reply Dave <Dave_member pathlink.com> writes:
Craig Black wrote:
  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig

Sorry - I misunderstood you, and hadn't seen that documentation. Thanks, - Dave
Apr 19 2007
parent Dave <Dave_member pathlink.com> writes:
Dave wrote:
 Craig Black wrote:
  I think the "trust the compiler" philosophy is for stability (20 
 years old), not optimizations.

Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig

Sorry - I misunderstood you, and hadn't seen that documentation.

I still haven't seen that documentation 'cause I can't find it in the foreach section. Could you point it out?
 Thanks,
 
 - Dave

Apr 23 2007
prev sibling parent Dave <Dave_member pathlink.com> writes:
Thomas Kuehne wrote:
 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 Craig Black schrieb am 2007-04-18:
 Not Found

 The requested URL /benchmark/tight_loops/sum.html was not found on this 
 server.

Seems to be some server issue, backup can be found here: http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html Thomas

Nice, but didn't the 'result' variable overflow several times during the loop for the int tests?
 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGJkfJLK5blCcjpWoRAnMnAJ0dabQGW2edXBXlXVbXqqUSlgUmjACgqJUh
 kh2pP8ZmgultrR2U+Cusgyc=
 =tUSf
 -----END PGP SIGNATURE-----

Apr 18 2007
prev sibling next sibling parent torhu <fake address.dude> writes:
That link is dead.
Apr 18 2007
prev sibling next sibling parent Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Thomas Kuehne wrote:

 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 I've done a few tests on how DMD and GDC handle tight loops
 and was surprised by the slow code DMD generates for "foreach".
 
 results and source code:
 http://dstress.kuehne.cn/benchmark/tight_loops/sum.html
 
 Thomas
 
 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGJkFyLK5blCcjpWoRAgM4AJ9JnbxtHTlBFwXdWfwfcuShxi/2iACfVaxI
 ULxRdOI+28mzZXCbv8RiPO8=
 =Z0K0
 -----END PGP SIGNATURE-----

Invalid URL. I tried looking at http://dstress.kuehne.cn/benchmark/ and there is not even a tight_loops directory!
Apr 18 2007
prev sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Thomas Kuehne wrote

surprised by the slow code DMD generates for "foreach".

Not confirmed.

On my machine foreach is about 15% faster than pointer_2a.

-manfred
Apr 18 2007