digitalmars.D - summing large arrays - or - performance of tight loops

Thomas Kuehne (12/12) Apr 18 2007 -----BEGIN PGP SIGNED MESSAGE-----

Craig Black (4/4) Apr 18 2007 Not Found

Thomas Kuehne (11/14) Apr 18 2007 -----BEGIN PGP SIGNED MESSAGE-----

Craig Black (5/5) Apr 18 2007 Good work on this. It is certainly a problem for Walter if he sticks wi...

Dan (3/3) Apr 18 2007 Indeed, this is clearly an optimization problem. I agree that we should...

Thomas Kuehne (13/22) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----

Dave (5/12) Apr 18 2007 Walter consistently mentioned that a) most of the DM compiler back-end i...

Craig Black (4/6) Apr 19 2007 Nope. Read the documentation on foreach. It says specifically that for...

Dan (2/10) Apr 19 2007 I'm not going to change any foreach code over. I'm going to wait until ...

Dave (4/16) Apr 19 2007 You might want to try it out first before making any judgments anyhow. I...

Manfred Nowak (34/36) Apr 19 2007 It is disgusting to see, that three random benchmarkers all get

Dave (73/115) Apr 19 2007 I was using a simple 1024 x 1024 size array before.

Thomas Kuehne (16/38) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----

Dan (5/15) Apr 23 2007 Actually, the size of the data type doesn't matter at all for a properly...

Thomas Kuehne (11/25) Apr 23 2007 -----BEGIN PGP SIGNED MESSAGE-----

Dave (8/58) Apr 23 2007 Yea, with "...the array size seems to make some difference...", I was re...

Dave (4/14) Apr 19 2007 Sorry - I misunderstood you, and hadn't seen that documentation.

Dave (3/19) Apr 23 2007 I still haven't seen that documentation 'cause I can't find it in the fo...

Dave (2/23) Apr 18 2007

torhu (1/1) Apr 18 2007 That link is dead.
Tomas Lindquist Olsen (3/21) Apr 18 2007 Invalid URL. I tried looking at http://dstress.kuehne.cn/benchmark/ and
Manfred Nowak (5/5) Apr 18 2007 Thomas Kuehne wrote

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've done a few tests on how DMD and GDC handle tight loops
and was surprised by the slow code DMD generates for "foreach".

results and source code:
http://dstress.kuehne.cn/benchmark/tight_loops/sum.html

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGJkFyLK5blCcjpWoRAgM4AJ9JnbxtHTlBFwXdWfwfcuShxi/2iACfVaxI
ULxRdOI+28mzZXCbv8RiPO8=
=Z0K0
-----END PGP SIGNATURE-----

Apr 18 2007

"Craig Black" <cblack ara.com> writes:

Not Found

The requested URL /benchmark/tight_loops/sum.html was not found on this 
server.

-Craig

Apr 18 2007

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Craig Black schrieb am 2007-04-18:
 Not Found

 The requested URL /benchmark/tight_loops/sum.html was not found on this 
 server.

Seems to be some server issue, backup can be found here:
http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGJkfJLK5blCcjpWoRAnMnAJ0dabQGW2edXBXlXVbXqqUSlgUmjACgqJUh
kh2pP8ZmgultrR2U+Cusgyc=
=tUSf
-----END PGP SIGNATURE-----

Apr 18 2007

"Craig Black" <cblack ara.com> writes:

Good work on this.  It is certainly a problem for Walter if he sticks with 
his "trust the compiler" philiosphy.  Since the DMD documentation claims 
that foreach should emit optimized code, and it clearly doesn't in these 
cases, I think you could submit this as a bug.

-Craig

Apr 18 2007

Dan <murpsoft hotmail.com> writes:

Indeed, this is clearly an optimization problem.  I agree that we should put
our faith in the compiler because it's bad to undermine it to try to write
optimal code when the compiler may still be improved to solve the problem
*properly*

That said, DMDScript undermines associative arrays by completely
re-implementing them with a tweak - instead of using opIndex and opIndexAssign.
 : p

But yes, let's work on getting the compiler to Do The Right Thing(tm) so we
don't have to kludge our code, mm?

Apr 18 2007

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan schrieb am 2007-04-18:
 Indeed, this is clearly an optimization problem.  I agree that we should
 put our faith in the compiler because it's bad to undermine it to try to
 write optimal code when the compiler may still be improved to solve the
 problem *properly*

 That said, DMDScript undermines associative arrays by completely
 re-implementing them with a tweak - instead of using opIndex and
 opIndexAssign.  : p

 But yes, let's work on getting the compiler to Do The Right Thing(tm) so
 we don't have to kludge our code, mm?

Originally I needed to find the fastest way to iterate repeatedly over
large arrays and apply trivial operations to each element. I've put the
results only so that others can have a look to, not to prioritise compiler
tweaks over bug fixing.

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGKhTGLK5blCcjpWoRAg2XAJ4w03fB+vup/GKPkV/MBtLKb7oVowCeNz8Z
opdrrDv1fXLJLQrx9TEZYKg=
=Zr9Q
-----END PGP SIGNATURE-----

Apr 23 2007

Dave <Dave_member pathlink.com> writes:

Craig Black wrote:
 Good work on this.  It is certainly a problem for Walter if he sticks with 
 his "trust the compiler" philiosphy.  Since the DMD documentation claims 

Walter consistently mentioned that a) most of the DM compiler back-end is 20
years old and b) is not 
optimized for D yet. I think the "trust the compiler" philosophy is for
stability (20 years old), 
not optimizations. Plus for integer stuff it seems to do pretty well (GCC is no
slouch).

That said, both foreach and FP should perform better!

 that foreach should emit optimized code, and it clearly doesn't in these 
 cases, I think you could submit this as a bug.
 
 -Craig

Apr 18 2007

"Craig Black" <cblack ara.com> writes:

  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

Nope.  Read the documentation on foreach.  It says specifically that foreach 
is preferred over other looping mechanisms because the compiler will 
optimize it.

-Craig

Apr 19 2007

Dan <murpsoft hotmail.com> writes:

Craig Black Wrote:

  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

 
 Nope.  Read the documentation on foreach.  It says specifically that foreach 
 is preferred over other looping mechanisms because the compiler will 
 optimize it.
 
 -Craig 

I'm not going to change any foreach code over.  I'm going to wait until someone
in Walter's group fixes it.  : )

Apr 19 2007

Dave <Dave_member pathlink.com> writes:

Dan wrote:
 Craig Black Wrote:
 
  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

 Nope.  Read the documentation on foreach.  It says specifically that foreach 
 is preferred over other looping mechanisms because the compiler will 
 optimize it.

 -Craig 

 
 I'm not going to change any foreach code over.  I'm going to wait until
someone in Walter's group fixes it.  : )
 

You might want to try it out first before making any judgments anyhow. I tried
a couple of the 
benchmark tests and 'foreach' performed the same as 'array' for example. Mine
were ran on a P4 chip 
and an AMD64 chip (original was Turion).

Apr 19 2007

Manfred Nowak <svv1999 hotmail.com> writes:

Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

It is disgusting to see, that three random benchmarkers all get 
different results. Mine were run on an AMD64 X2 under WinXP32 and 
compiled with DMD1.012.

Quotients (foreach/other):
Thomas: 200%
Dave:   100%
me:      85%

What causes those huge differences?

Here is my main:

import std.stdio, std.perf;
void main(){
  const int f= 256;
  int[ ] a= new int[ 1024*1024*f];
  auto c=new PerformanceCounter;
  auto c2=new PerformanceCounter;
  c.start;
  for(int i= 1; i<=5120/f; i++)
    volatile sum_foreach( a);
  c.stop;
  volatile auto t= c.microseconds;
  
  c2.start;
  for(int i= 1; i<=5120/f; i++)
    volatile sum_array( a);
  c2.stop;
  writefln( "%s %s %s"
    , cast(real)t/c2.microseconds
    , t
    , c2.microseconds
  );
}

-manfred

Apr 19 2007

Dave <Dave_member pathlink.com> writes:

Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

 
 It is disgusting to see, that three random benchmarkers all get 
 different results. Mine were run on an AMD64 X2 under WinXP32 and 
 compiled with DMD1.012.
 
 Quotients (foreach/other):
 Thomas: 200%
 Dave:   100%
 me:      85%
 
 What causes those huge differences?
 
 Here is my main:
 
 import std.stdio, std.perf;
 void main(){
   const int f= 256;
   int[ ] a= new int[ 1024*1024*f];
   auto c=new PerformanceCounter;
   auto c2=new PerformanceCounter;
   c.start;
   for(int i= 1; i<=5120/f; i++)
     volatile sum_foreach( a);
   c.stop;
   volatile auto t= c.microseconds;
   
   c2.start;
   for(int i= 1; i<=5120/f; i++)
     volatile sum_array( a);
   c2.stop;
   writefln( "%s %s %s"
     , cast(real)t/c2.microseconds
     , t
     , c2.microseconds
   );
 }
 
 -manfred
  

I was using a simple 1024 x 1024 size array before.

Using a 1024 x 1024 x 64 array, I got:

P4:    97% (linux32 FC5)
AMD64: 92% (WinXP32)

So, the array size seems to make some difference, at least on AMD machines.

Here's all of the code I tested with:

import std.date, std.perf, std.stdio;

void main()
{
   int[] arr = new int[1024*1024*64];
   int result1 = 0, result2 = 0;

   // initialize with values that won't cause an integer overflow
   foreach(i, ref e; arr) e = i % 5 + 1;

  version(perftime)
  {
   auto t1 = new PerformanceCounter;
   t1.start;
   for(auto i = 0; i < 10; i++)
     result1 = sum_foreach(arr);
   t1.stop;

   auto t2 = new PerformanceCounter;
   t2.start;
   for(auto i = 0; i < 10; i++)
     result2 = sum_array(arr);
   t2.stop;

   writefln("%s %s %s %s %s"
     , cast(real)t1.microseconds/t2.microseconds
     , t1.microseconds
     , t2.microseconds
     , result1
     , result2
   );
  }
  else
  {
   d_time s1 = getUTCtime;
   for(auto i = 0; i < 10; i++)
     result1 = sum_foreach(arr);
   d_time e1 = getUTCtime;

   d_time s2 = getUTCtime;
   for(auto i = 0; i < 10; i++)
     result2 = sum_array(arr);
   d_time e2 = getUTCtime;

   writefln("%s %s %s %s %s"
     , cast(real)(e1-s1)/(e2-s2)
     , e1-s1
     , e2-s2
     , result1
     , result2
   );
  }
}

T sum_foreach(T)(T[] data)
{
	T result = 0;
	foreach(element; data)
         {
		result += element;
	}
	return result;
}

T sum_array(T)(T[] data)
{
	T result = 0;
	size_t index = 0;
	while(index < data.length)
	{
		result += data[index];
		index++;
	}
	return result;
}

Apr 19 2007

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dave schrieb am 2007-04-19:
 Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

 
 It is disgusting to see, that three random benchmarkers all get 
 different results. Mine were run on an AMD64 X2 under WinXP32 and 
 compiled with DMD1.012.
 
 Quotients (foreach/other):
 Thomas: 200%
 Dave:   100%
 me:      85%
 
 What causes those huge differences?
 
 Here is my main:


<snip>

 I was using a simple 1024 x 1024 size array before.

 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser
extend on the element values. I've put an updated version online that
contains results for byte, short, int, long, float and double.

The benchmarking was done as follows:

dmd sum.html -O -release -inline -version=benchmark
./sum 2> /dev/null

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGKhZsLK5blCcjpWoRAha5AJ40lu7apkLufJgfi8wFYiFfIuhWTACdGUJg
3MiQxN7N15GlC+J6OEilwuM=
=dXko
-----END PGP SIGNATURE-----

Apr 23 2007

Dan <murpsoft hotmail.com> writes:

 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

 
 The results strongly depend on the memory architecture and to a lesser
 extend on the element values. I've put an updated version online that
 contains results for byte, short, int, long, float and double.

Actually, the size of the data type doesn't matter at all for a properly
implemented algorithm - as a general rule, you implement a duff's device to
align and then use the largest sized instruction you can fit.  Right now the
SSE2 instruction "movaps" is quite effective for copying memory.

Also, each operating system implements Page tracking differently.  Some do it
by inserting some metadata onto the Page itself.  For that reason, implementing
4kb arrays can perform really well on some OS's, and very very poorly on others
(they take 2 pages when you think they're taking 1, so cache and page misses
screw you up)

For that reason, it can (very rarely but occassionally) actually improve
performance to *not* use a power of two array, but something just short of it.

Sincerely,
Dan

Apr 23 2007

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan schrieb am 2007-04-23:
 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

 
 The results strongly depend on the memory architecture and to a lesser
 extend on the element values. I've put an updated version online that
 contains results for byte, short, int, long, float and double.

 Actually, the size of the data type doesn't matter at all for a properly
 implemented algorithm - as a general rule, you implement a duff's device
 to align and then use the largest sized instruction you can fit.  Right now
 the SSE2 instruction "movaps" is quite effective for copying memory.

That's what I thought too, but while my SSE version for float and double
didn't have the worst performance they were by no means the fastest.

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFGLRkOLK5blCcjpWoRAkhDAKCb4IU0RG6HTzL1DywM4yClWwK9eACfYshW
vdYZYS3eIKhBLclsDOyq19M=
=n+9d
-----END PGP SIGNATURE-----

Apr 23 2007

Dave <Dave_member pathlink.com> writes:

Thomas Kuehne wrote:
 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 Dave schrieb am 2007-04-19:
 Manfred Nowak wrote:
 Dave wrote
 Mine were ran on a P4 chip and an AMD64 chip (original was
 Turion). 

 It is disgusting to see, that three random benchmarkers all get 
 different results. Mine were run on an AMD64 X2 under WinXP32 and 
 compiled with DMD1.012.

 Quotients (foreach/other):
 Thomas: 200%
 Dave:   100%
 me:      85%

 What causes those huge differences?

 Here is my main:


 
 <snip>
 
 I was using a simple 1024 x 1024 size array before.

 Using a 1024 x 1024 x 64 array, I got:

 P4:    97% (linux32 FC5)
 AMD64: 92% (WinXP32)

 So, the array size seems to make some difference, at least on AMD machines.

 
 The results strongly depend on the memory architecture and to a lesser

Yea, with "...the array size seems to make some difference...", I was referring
to slight 
differences between tests on my machines.

 extend on the element values. I've put an updated version online that
 contains results for byte, short, int, long, float and double.
 

I agree, I think the big (relative foreach/other) difference between machines
is probably because of 
different cache sizes, or it could be alignment issues I guess. All this seems
to indicate that the 
code generated for foreach is more sensitive to mem. arch. than the other loops.

That said, three of the four architectures spoken of in this thread had foreach
perform as well as 
or better than the common while-loop anyway.

 The benchmarking was done as follows:
 
 dmd sum.html -O -release -inline -version=benchmark
 ./sum 2> /dev/null
 
 Thomas
 
 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGKhZsLK5blCcjpWoRAha5AJ40lu7apkLufJgfi8wFYiFfIuhWTACdGUJg
 3MiQxN7N15GlC+J6OEilwuM=
 =dXko
 -----END PGP SIGNATURE-----

Apr 23 2007

Dave <Dave_member pathlink.com> writes:

Craig Black wrote:
  I think the "trust the compiler" philosophy is for stability (20 years 
 old), not optimizations.

 
 Nope.  Read the documentation on foreach.  It says specifically that foreach 
 is preferred over other looping mechanisms because the compiler will 
 optimize it.
 
 -Craig 
 
 

Sorry - I misunderstood you, and hadn't seen that documentation.

Thanks,

- Dave

Apr 19 2007

Dave <Dave_member pathlink.com> writes:

Dave wrote:
 Craig Black wrote:
  I think the "trust the compiler" philosophy is for stability (20 
 years old), not optimizations.

 Nope.  Read the documentation on foreach.  It says specifically that 
 foreach is preferred over other looping mechanisms because the 
 compiler will optimize it.

 -Craig

 
 Sorry - I misunderstood you, and hadn't seen that documentation.
 

I still haven't seen that documentation 'cause I can't find it in the foreach
section. Could you 
point it out?

 Thanks,
 
 - Dave

Apr 23 2007

Dave <Dave_member pathlink.com> writes:

Thomas Kuehne wrote:
 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 Craig Black schrieb am 2007-04-18:
 Not Found

 The requested URL /benchmark/tight_loops/sum.html was not found on this 
 server.

 
 Seems to be some server issue, backup can be found here:
 http://svn.berlios.de/svnroot/repos/dstress/trunk/benchmark/tight_loops/sum.html
 
 Thomas
 

Nice, but didn't the 'result' variable overflow several times during the loop
for the int tests?

 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGJkfJLK5blCcjpWoRAnMnAJ0dabQGW2edXBXlXVbXqqUSlgUmjACgqJUh
 kh2pP8ZmgultrR2U+Cusgyc=
 =tUSf
 -----END PGP SIGNATURE-----

Apr 18 2007

torhu <fake address.dude> writes:

That link is dead.

Apr 18 2007

Tomas Lindquist Olsen <tomas famolsen.dk> writes:

Thomas Kuehne wrote:

 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 I've done a few tests on how DMD and GDC handle tight loops
 and was surprised by the slow code DMD generates for "foreach".
 
 results and source code:
 http://dstress.kuehne.cn/benchmark/tight_loops/sum.html
 
 Thomas
 
 
 -----BEGIN PGP SIGNATURE-----
 
 iD8DBQFGJkFyLK5blCcjpWoRAgM4AJ9JnbxtHTlBFwXdWfwfcuShxi/2iACfVaxI
 ULxRdOI+28mzZXCbv8RiPO8=
 =Z0K0
 -----END PGP SIGNATURE-----

Invalid URL. I tried looking at http://dstress.kuehne.cn/benchmark/ and
there is not even a tight_loops directory!

Apr 18 2007

Manfred Nowak <svv1999 hotmail.com> writes:

Thomas Kuehne wrote

surprised by the slow code DMD generates for "foreach".

Not confirmed.

On my machine foreach is about 15% faster than pointer_2a.

-manfred

Apr 18 2007

D Programming

C/C++ Programming

Other

digitalmars.D - summing large arrays - or - performance of tight loops