digitalmars.D - summing large arrays - or - performance of tight loops
- Thomas Kuehne (12/12) Apr 18 2007 -----BEGIN PGP SIGNED MESSAGE-----
- Craig Black (4/4) Apr 18 2007 Not Found
- Thomas Kuehne (11/14) Apr 18 2007 -----BEGIN PGP SIGNED MESSAGE-----
- Craig Black (5/5) Apr 18 2007 Good work on this. It is certainly a problem for Walter if he sticks wi...
- Dan (3/3) Apr 18 2007 Indeed, this is clearly an optimization problem. I agree that we should...
- Thomas Kuehne (13/22) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----
- Dave (5/12) Apr 18 2007 Walter consistently mentioned that a) most of the DM compiler back-end i...
- Craig Black (4/6) Apr 19 2007 Nope. Read the documentation on foreach. It says specifically that for...
- Dan (2/10) Apr 19 2007 I'm not going to change any foreach code over. I'm going to wait until ...
- Dave (4/16) Apr 19 2007 You might want to try it out first before making any judgments anyhow. I...
- Manfred Nowak (34/36) Apr 19 2007 It is disgusting to see, that three random benchmarkers all get
- Dave (73/115) Apr 19 2007 I was using a simple 1024 x 1024 size array before.
- Thomas Kuehne (16/38) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----
- Dan (5/15) Apr 23 2007 Actually, the size of the data type doesn't matter at all for a properly...
- Thomas Kuehne (11/25) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----
- Dave (8/58) Apr 23 2007 Yea, with "...the array size seems to make some difference...", I was re...
- Dave (4/14) Apr 19 2007 Sorry - I misunderstood you, and hadn't seen that documentation.
- Dave (3/19) Apr 23 2007 I still haven't seen that documentation 'cause I can't find it in the fo...
- Dave (2/23) Apr 18 2007
- torhu (1/1) Apr 18 2007 That link is dead.
- Tomas Lindquist Olsen (3/21) Apr 18 2007 Invalid URL. I tried looking at http://dstress.kuehne.cn/benchmark/ and
- Manfred Nowak (5/5) Apr 18 2007 Thomas Kuehne wrote
I've done a few tests on how DMD and GDC handle tight loops and was surprised by the slow code DMD generates for "foreach".

Results and source code: http://dstress.kuehne.cn/benchmark/tight_loops/sum.html

Thomas
Apr 18 2007
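For readers who can't reach the benchmark page, the kinds of loop variants it compares presumably look like the following sketch. The function names here are illustrative only; the actual variants on the page (such as the "pointer_2a" mentioned later in this thread) may differ in detail.

```d
import std.stdio;

// Three ways to sum an array: foreach, explicit indexing, and
// pointer walking. The benchmark compares code generated for
// variants along these lines.
T sumForeach(T)(T[] data) {
    T result = 0;
    foreach (element; data)
        result += element;
    return result;
}

T sumIndex(T)(T[] data) {
    T result = 0;
    for (size_t i = 0; i < data.length; i++)
        result += data[i];
    return result;
}

T sumPointer(T)(T[] data) {
    T result = 0;
    // In D, both declarators here are T* (the * binds to the type).
    for (T* p = data.ptr, end = data.ptr + data.length; p < end; p++)
        result += *p;
    return result;
}

void main() {
    int[] a = new int[1000];
    foreach (i, ref e; a)
        e = cast(int)(i % 5 + 1);   // values 1..5, no overflow risk
    writefln("%s %s %s", sumForeach(a), sumIndex(a), sumPointer(a));
}
```

All three compute the same sum; the question in this thread is only which one the compiler turns into the tightest loop.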
Not Found

The requested URL /benchmark/tight_loops/sum.html was not found on this server.

-Craig
Apr 18 2007
Craig Black wrote on 2007-04-18:
> Not Found
> The requested URL /benchmark/tight_loops/sum.html was not found on this server.

Seems to be a server issue; a backup can be found here:
http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html

Thomas
Apr 18 2007
Good work on this. It is certainly a problem for Walter if he sticks with his "trust the compiler" philosophy. Since the DMD documentation claims that foreach should emit optimized code, and it clearly doesn't in these cases, I think you could submit this as a bug.

-Craig
Apr 18 2007
Indeed, this is clearly an optimization problem. I agree that we should put our faith in the compiler, because it's bad to undermine it by hand-writing "optimal" code when the compiler may still be improved to solve the problem *properly*.

That said, DMDScript undermines associative arrays by completely re-implementing them with a tweak, instead of using opIndex and opIndexAssign. :p

But yes, let's work on getting the compiler to Do The Right Thing(tm) so we don't have to kludge our code, mm?
Apr 18 2007
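For reference, routing a container's lookups through D's index operators, rather than re-implementing the container wholesale, looks roughly like this minimal sketch. The `CountingMap` type and its counting "tweak" are hypothetical, not DMDScript's actual code.

```d
import std.stdio;

// Hypothetical wrapper: lookups and stores are routed through
// opIndex/opIndexAssign instead of reimplementing the container.
struct CountingMap {
    private int[string] data;   // built-in associative array underneath
    size_t lookups;             // the "tweak": count read accesses

    int opIndex(string key) {
        lookups++;
        return data[key];
    }

    void opIndexAssign(int value, string key) {
        data[key] = value;
    }
}

void main() {
    CountingMap m;
    m["answer"] = 42;           // calls opIndexAssign(42, "answer")
    int v = m["answer"];        // calls opIndex("answer")
    writefln("%s %s", v, m.lookups);
}
```

The point of the overloads is that `m[key]` syntax keeps working while the wrapper stays free to change the underlying implementation.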
Dan wrote on 2007-04-18:
> Indeed, this is clearly an optimization problem. <snip>
> But yes, let's work on getting the compiler to Do The Right Thing(tm) so we don't have to kludge our code, mm?

Originally I needed to find the fastest way to iterate repeatedly over large arrays and apply trivial operations to each element. I've put the results up only so that others can have a look too, not to prioritise compiler tweaks over bug fixing.

Thomas
Apr 23 2007
Craig Black wrote:
> Good work on this. It is certainly a problem for Walter if he sticks with his "trust the compiler" philosophy. Since the DMD documentation claims that foreach should emit optimized code, and it clearly doesn't in these cases, I think you could submit this as a bug. -Craig

Walter has consistently mentioned that a) most of the DM compiler back-end is 20 years old and b) it is not optimized for D yet. I think the "trust the compiler" philosophy is about stability (20 years old), not optimizations. Plus, for integer stuff it seems to do pretty well (GCC is no slouch).

That said, both foreach and FP should perform better!
Apr 18 2007
> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.

Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it.

-Craig
Apr 19 2007
Craig Black wrote:
>> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.
> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig

I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. :)
Apr 19 2007
Dan wrote:
>>> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.
>> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig
> I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. :)

You might want to try it out first before making any judgments anyhow. I tried a couple of the benchmark tests, and 'foreach' performed the same as 'array', for example. Mine were run on a P4 chip and an AMD64 chip (the original was a Turion).
Apr 19 2007
Dave wrote:
> Mine were run on a P4 chip and an AMD64 chip (the original was a Turion).

It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.

Quotients (foreach/other):
  Thomas: 200%
  Dave:   100%
  me:      85%

What causes those huge differences? Here is my main:

import std.stdio, std.perf;

void main() {
    const int f = 256;
    int[] a = new int[1024 * 1024 * f];
    auto c = new PerformanceCounter;
    auto c2 = new PerformanceCounter;

    c.start;
    for (int i = 1; i <= 5120 / f; i++)
        volatile sum_foreach(a);
    c.stop;
    volatile auto t = c.microseconds;

    c2.start;
    for (int i = 1; i <= 5120 / f; i++)
        volatile sum_array(a);
    c2.stop;

    writefln("%s %s %s",
        cast(real)t / c2.microseconds, t, c2.microseconds);
}

-manfred
Apr 19 2007
Manfred Nowak wrote:
>> Mine were run on a P4 chip and an AMD64 chip (the original was a Turion).
> It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.
> Quotients (foreach/other): Thomas: 200%, Dave: 100%, me: 85%.
> What causes those huge differences? Here is my main: <snip>

I was using a simple 1024 x 1024 size array before. Using a 1024 x 1024 x 64 array, I got:

  P4:    97% (linux32 FC5)
  AMD64: 92% (WinXP32)

So the array size seems to make some difference, at least on AMD machines. Here's all of the code I tested with:

import std.date, std.perf, std.stdio;

void main()
{
    int[] arr = new int[1024*1024*64];
    int result1 = 0, result2 = 0;

    // initialize with values that won't cause an integer overflow
    foreach (i, ref e; arr)
        e = i % 5 + 1;

    version (perftime)
    {
        auto t1 = new PerformanceCounter;
        t1.start;
        for (auto i = 0; i < 10; i++)
            result1 = sum_foreach(arr);
        t1.stop;

        auto t2 = new PerformanceCounter;
        t2.start;
        for (auto i = 0; i < 10; i++)
            result2 = sum_array(arr);
        t2.stop;

        writefln("%s %s %s %s %s",
            cast(real)t1.microseconds / t2.microseconds,
            t1.microseconds, t2.microseconds, result1, result2);
    }
    else
    {
        d_time s1 = getUTCtime;
        for (auto i = 0; i < 10; i++)
            result1 = sum_foreach(arr);
        d_time e1 = getUTCtime;

        d_time s2 = getUTCtime;
        for (auto i = 0; i < 10; i++)
            result2 = sum_array(arr);
        d_time e2 = getUTCtime;

        writefln("%s %s %s %s %s",
            cast(real)(e1 - s1) / (e2 - s2),
            e1 - s1, e2 - s2, result1, result2);
    }
}

T sum_foreach(T)(T[] data)
{
    T result = 0;
    foreach (element; data)
    {
        result += element;
    }
    return result;
}

T sum_array(T)(T[] data)
{
    T result = 0;
    size_t index = 0;
    while (index < data.length)
    {
        result += data[index];
        index++;
    }
    return result;
}
Apr 19 2007
Dave wrote on 2007-04-19:
> Manfred Nowak wrote: <snip>
> I was using a simple 1024 x 1024 size array before. Using a 1024 x 1024 x 64 array, I got: P4: 97% (linux32 FC5), AMD64: 92% (WinXP32).
> So the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture, and to a lesser extent on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double. The benchmarking was done as follows:

dmd sum.html -O -release -inline -version=benchmark
./sum 2> /dev/null

Thomas
Apr 23 2007
>> Using a 1024 x 1024 x 64 array, I got: P4: 97% (linux32 FC5), AMD64: 92% (WinXP32). So the array size seems to make some difference, at least on AMD machines.
> The results strongly depend on the memory architecture and to a lesser extent on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.

Actually, the size of the data type doesn't matter at all for a properly implemented algorithm. As a general rule, you implement a Duff's device to align, and then use the largest-sized instruction you can fit. Right now the SSE2 instruction "movaps" is quite effective for copying memory.

Also, each operating system implements page tracking differently. Some do it by inserting some metadata onto the page itself. For that reason, 4 KB arrays can perform really well on some OSes and very poorly on others (they take two pages when you think they're taking one, so cache and page misses hurt you). For that reason, it can (very rarely, but occasionally) actually improve performance *not* to use a power-of-two array size, but something just short of it.

Sincerely,
Dan
Apr 23 2007
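Dan's align-then-go-wide idea can be sketched in plain D without the interleaved switch/loop of a classic Duff's device. This hypothetical version just unrolls the main loop four ways and handles the remainder with a separate scalar loop; a real SIMD variant would additionally align `data.ptr` to 16 bytes before using instructions like movaps.

```d
import std.stdio;

// Unrolled summation sketch: process four elements per iteration in
// the main loop, then handle the remaining 0-3 elements one at a time.
int sumUnrolled(int[] data) {
    int result = 0;
    size_t i = 0;
    size_t stop = data.length - data.length % 4;  // largest multiple of 4

    for (; i < stop; i += 4)
        result += data[i] + data[i + 1] + data[i + 2] + data[i + 3];

    for (; i < data.length; i++)   // remainder loop
        result += data[i];

    return result;
}

void main() {
    int[] a = new int[1003];       // deliberately not a multiple of 4
    foreach (idx, ref e; a)
        e = cast(int)(idx % 5 + 1);
    writefln("%s", sumUnrolled(a));
}
```

Note the unrolled body only widens the work per iteration; whether it beats the compiler's own foreach codegen is exactly what this thread's benchmarks disagree about.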
Dan wrote on 2007-04-23:
> Actually, the size of the data type doesn't matter at all for a properly implemented algorithm. As a general rule, you implement a Duff's device to align, and then use the largest-sized instruction you can fit. Right now the SSE2 instruction "movaps" is quite effective for copying memory.

That's what I thought too, but while my SSE versions for float and double didn't have the worst performance, they were by no means the fastest.

Thomas
Apr 23 2007
Thomas Kuehne wrote:
> Dave wrote on 2007-04-19: <snip>
> The results strongly depend on the memory architecture, and to a lesser extent on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.

Yea, with "...the array size seems to make some difference...", I was referring to slight differences between tests on my machines.

I agree; I think the big (relative foreach/other) difference between machines is probably because of different cache sizes, or it could be alignment issues, I guess. All this seems to indicate that the code generated for foreach is more sensitive to memory architecture than the other loops. That said, three of the four architectures spoken of in this thread had foreach perform as well as or better than the common while-loop anyway.
Apr 23 2007
Craig Black wrote:
>> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.
> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig

Sorry - I misunderstood you, and hadn't seen that documentation.

Thanks,
- Dave
Apr 19 2007
Dave wrote:
>> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it. -Craig
> Sorry - I misunderstood you, and hadn't seen that documentation.

I still haven't seen that documentation 'cause I can't find it in the foreach section. Could you point it out?
Apr 23 2007
Thomas Kuehne wrote:
> Seems to be a server issue; a backup can be found here:
> http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html

Nice, but didn't the 'result' variable overflow several times during the loop for the int tests?
Apr 18 2007
Thomas Kuehne wrote:
> I've done a few tests on how DMD and GDC handle tight loops and was surprised by the slow code DMD generates for "foreach".
> Results and source code: http://dstress.kuehne.cn/benchmark/tight_loops/sum.html

Invalid URL. I tried looking at http://dstress.kuehne.cn/benchmark/ and there is not even a tight_loops directory!
Apr 18 2007
Thomas Kuehne wrote:
> ... surprised by the slow code DMD generates for "foreach".

Not confirmed. On my machine foreach is about 15% faster than pointer_2a.

-manfred
Apr 18 2007