
digitalmars.D - Is 2X faster large memcpy interesting?

reply Don <nospam nospam.com> writes:
The next D2 runtime will include my cache-size detection code. This 
makes it possible to write a cache-aware memcpy, using (for example) 
non-temporal writes when the arrays being copied exceed the size of the 
largest cache.
In my tests, it gives a speed-up of approximately 2X in such cases.
The downside is, it's a fair bit of work to implement, and it only 
affects extremely large arrays, so I fear it's basically irrelevant (It 
probably won't help arrays < 32K in size). Do people actually copy 
megabyte-sized arrays?
Is it worth spending any more time on it?
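
For the curious, the core of the large-copy path looks something like
this minimal sketch. It is not the actual runtime code: it assumes x86
with SSE2, 16-byte-aligned pointers, and a length that is a multiple
of 64:

void nonTemporalCopy(void* dst, const(void)* src, size_t nbytes)
{
    asm
    {
        mov EAX, src;        // EAX, ECX, EDX are scratch in the D ABI
        mov EDX, dst;
        mov ECX, nbytes;
        shr ECX, 6;          // 64 bytes (one cache line) per iteration
    L1:
        movdqa XMM0, [EAX];  // aligned 16-byte loads
        movdqa XMM1, [EAX+16];
        movdqa XMM2, [EAX+32];
        movdqa XMM3, [EAX+48];
        movntdq [EDX], XMM0; // streaming stores bypass the cache
        movntdq [EDX+16], XMM1;
        movntdq [EDX+32], XMM2;
        movntdq [EDX+48], XMM3;
        add EAX, 64;
        add EDX, 64;
        dec ECX;
        jnz L1;
        sfence;              // drain the write-combining buffers
    }
}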


BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
manual, and in Intel's 2007 manual. Only one of them actually gave any 
benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
Intel!)
I've noticed that AMD's docs are usually greatly superior to Intel's, but 
this time the difference is unbelievable.
Mar 26 2009
next sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
 
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
 manual, and in Intel's 2007 manual. Only one of them actually gave any 
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but 
 this time the difference is unbelievable.
What's the alternative? What would you do instead? Is there something
cooler or more important for D to do? (IMHO, if the other alternatives
have any merit, then I'd vote for them.)

But then again, you've already invested in this, and it clearly
interests you. Laborious, yes, but it sounds fun.
Mar 26 2009
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I'd think so. In this day and age it is appalling that we don't quite
know how to quickly copy memory around. A long time ago I ran some
measurements (http://www.ddj.com/cpp/184403799) and I was quite
surprised. My musings were as true then as now. And now we're getting
to the second freakin' Space Odyssey!

===============

Things are clearly hazy, aren't they? First off, maybe it came as a
surprise to you that there's more than one way to fill and copy
objects. Then, there's no single variant of fill and copy that works
best on all compilers, data sets, and machines. (I guess if I tested
the same code on a Celeron, which has less cache, I would have gotten
very different results. To say nothing about other architectures.)

As a rule of thumb, it's generally good to use memcpy (and consequently
fill-by-copy) if you can — for large data sets, memcpy doesn't make
much difference, and for smaller data sets, it might be much faster.
For cheap-to-copy objects, Duff's Device might perform faster than a
simple for loop. Ultimately, all this is subject to your compiler's and
machine's whims and quirks.

There is a very deep, and sad, realization underlying all this. We are
in 2001, the year of the Spatial Odyssey. We've done electronic
computing for more than 50 years now, and we strive to design more and
more complex systems, with unsatisfactory results. Software development
is messy. Could it be because the fundamental tools and means we use
are low-level, inefficient, and not standardized? Just step out of the
box and look at us — after 50 years, we're still not terribly good at
filling and copying memory.

================

Andrei
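
PS: For concreteness, here is the unrolling idea behind Duff's Device
sketched as a plain unrolled loop in D. fillUnrolled is a hypothetical
name, not anyone's library code:

void fillUnrolled(T)(T[] dest, T value)
{
    size_t i = 0;
    // Main body: four stores per iteration.
    for (; i + 4 <= dest.length; i += 4)
    {
        dest[i]     = value;
        dest[i + 1] = value;
        dest[i + 2] = value;
        dest[i + 3] = value;
    }
    // Handle the 0..3 leftover elements.
    for (; i < dest.length; ++i)
        dest[i] = value;
}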
Mar 26 2009
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 As a rule of thumb, it's generally good to use memcpy (and consequently
 fill-by-copy) if you can — for large data sets, memcpy doesn't make much
 difference, and for smaller data sets, it might be much faster. For
 cheap-to-copy objects, Duff's Device might perform faster than a simple
 for loop. Ultimately, all this is subject to your compiler's and
 machine's whims and quirks.
 There is a very deep, and sad, realization underlying all this. We are
 in 2001, the year of the Spatial Odyssey. We've done electronic
 computing for more than 50 years now, and we strive to design more and
 more complex systems, with unsatisfactory results. Software development
 is messy. Could it be because the fundamental tools and means we use are
 low-level, inefficient, and not standardized? Just step out of the box
 and look at us — after 50 years, we're still not terribly good at
 filling and copying memory.
I don't know how sad this is. For better or worse, programming is still a craft, much like blacksmithing. Code is largely written from scratch for each project, techniques are jealously guarded (in our case via copyright law), etc. This may not be great from the perspective of progress, but it certainly makes the work more interesting. But then I'm a tinker at heart, so YMMV.
Mar 26 2009
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 I'd think so. In this day and age it is appalling that we don't quite 
 know how to quickly copy memory around. A long time ago I ran some 
 measurements (http://www.ddj.com/cpp/184403799) and I was quite 
 surprised. My musings were as true then as now. And now we're getting to 
 the second freakin' Space Odyssey!
It turns out that efficiently copying objects only a few bytes long requires a bunch of code. So, I gave up on having the compiler generate intrinsic memcpy code, and instead just call the library function memcpy. This is implemented in the next update.
Mar 26 2009
prev sibling parent Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of 
 the largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant 
 (It probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 I'd think so. In this day and age it is appalling that we don't quite
 know how to quickly copy memory around. A long time ago I ran some
 measurements (http://www.ddj.com/cpp/184403799) and I was quite
 surprised. My musings were as true then as now. And now we're getting
 to the second freakin' Space Odyssey!

 ===============

 Things are clearly hazy, aren't they? First off, maybe it came as a
 surprise to you that there's more than one way to fill and copy
 objects. Then, there's no single variant of fill and copy that works
 best on all compilers, data sets, and machines. (I guess if I tested
 the same code on a Celeron, which has less cache, I would have gotten
 very different results. To say nothing about other architectures.)
More precisely, the optimal algorithm depends on the size of the affected memory, and the size of the CPU caches. However, the only thing which really varies between machines is where the thresholds are.
 As a rule of thumb, it's generally good to use memcpy (and consequently 
 fill-by-copy) if you can — for large data sets, memcpy doesn't make much 
 difference, and for smaller data sets, it might be much faster. For 
 cheap-to-copy objects, Duff's Device might perform faster than a simple 
 for loop. Ultimately, all this is subject to your compiler's and 
 machine's whims and quirks.
This is why I think it's absolutely critical to have cache size
determination as a fundamental runtime library primitive. Memory speeds
have only tripled since 1980, while processor speeds have improved a
thousandfold, so I think it's become less and less acceptable for a
systems language to be cache-unaware.
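
As a sketch of what such a primitive enables, the dispatch could look
like this. The module and field names follow druntime's core.cpuid (the
detection code in question), but treat the exact API as an assumption:

import core.cpuid : dataCaches;
import core.stdc.string : memcpy;

void smartCopy(void* dst, const(void)* src, size_t nbytes)
{
    // dataCaches reports each data cache's size in KB, smallest first.
    size_t largestKB = 0;
    foreach (c; dataCaches)
        if (c.size > largestKB)
            largestKB = c.size;

    if (nbytes > largestKB * 1024)
        nonTemporalCopy(dst, src, nbytes); // streaming-store path, as
                                           // sketched in the first post
    else
        memcpy(dst, src, nbytes);          // in-cache path
}

The threshold is the only machine-specific part; everything else is the
same on every x86.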
 There is a very deep, and sad, realization underlying all this. We are 
 in 2001, the year of the Spatial Odyssey. We've done electronic 
 computing for more than 50 years now, and we strive to design more and 
 more complex systems, with unsatisfactory results. Software development 
 is messy. Could it be because the fundamental tools and means we use are 
 low-level, inefficient, and not standardized? Just step out of the box 
 and look at us — after 50 years, we're still not terribly good at 
 filling and copying memory.
 ================
 
 Andrei
Summarizing everyone's comments -- nobody thinks that a fast large
memcpy is terribly important, but a slow memcpy seems philosophically
wrong.

I wanted to work on getting array operations to be cache-aware. Of
course, the simplest possible array operation is a[]=b[]. On a Core2
with 1MB/core L2 cache, DMD's memcpy performance drops to 0.6
bytes/cycle once you fall out of the L2 cache; but a 64-bit CPU can do
8 bytes/cycle under optimum conditions, when the data is coming from
the L1 cache. So I figured it doesn't make sense to optimize the more
complex operations when the trivial one has so much potential for
improvement.
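
If you want to reproduce the measurement, a rough benchmark along these
lines works. It uses the current std.datetime.stopwatch and reports
wall-clock throughput rather than cycles, so the numbers will vary by
machine:

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    // Sizes straddling typical L1/L2/main-memory boundaries.
    foreach (size; [16 * 1024, 256 * 1024, 4 * 1024 * 1024,
                    64 * 1024 * 1024])
    {
        auto a = new ubyte[size];
        auto b = new ubyte[size];
        enum reps = 50;

        auto sw = StopWatch(AutoStart.yes);
        foreach (_; 0 .. reps)
            a[] = b[];              // the array-copy primitive itself
        sw.stop();

        immutable secs = sw.peek.total!"nsecs" / 1e9;
        writefln("%8s KB: %6.2f GB/s", size / 1024,
                 size * cast(double) reps / secs / 1e9);
    }
}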
Mar 27 2009
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I say go for it!
Mar 26 2009
prev sibling next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I don't use large arrays very often. When I do, I would not copy them
if I could avoid it. Usually I keep catenating to an array up to a
certain point, and from then on I only ever need to read from it, with
no copying ever necessary. So I would rarely, if ever, benefit from
this.
Mar 26 2009
prev sibling next sibling parent Trass3r <mrmocool gmx.de> writes:
Don schrieb:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
Well, arrays > 32K aren't that unusual, esp. in scientific computing.
Even a small 200x200 matrix of doubles takes up 40000*8 = 320,000 bytes.
Mar 26 2009
prev sibling next sibling parent reply Thomas Moran <no spam.plx> writes:
On 26/03/2009 20:08, Don wrote:
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation
 manual, and in Intel's 2007 manual. Only one of them actually gave any
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but
 this time the difference is unbelievable.
Don, have you seen Agner Fog's memcpy() and memmove() implementations
included with the most recent versions of his manuals? In the unaligned
case they read two XMM words and shift/combine them into the target
alignment, so all loads and stores are aligned. Pretty cool.

He says (modestly):

; This method is 2 - 6 times faster than the implementations in the
; standard C libraries (MS, Gnu) when src or dest are misaligned.
; When src and dest are aligned by 16 (relative to each other) then this
; function is only slightly faster than the best standard libraries.
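
For reference, here is a rough scalar illustration of the same
align-and-combine idea. It's only a sketch: Agner's version does this
16 bytes at a time in XMM registers, whereas this works one machine
word at a time, assumes a little-endian target, and may read up to one
word past the end of the source:

// Copy nWords machine words from a possibly misaligned src to an
// aligned dst, using only aligned loads.
void shiftCombineCopy(size_t* dst, const(ubyte)* src, size_t nWords)
{
    enum wordBytes = size_t.sizeof;
    immutable misalign = cast(size_t) src & (wordBytes - 1);
    if (misalign == 0)
    {
        dst[0 .. nWords] = (cast(const(size_t)*) src)[0 .. nWords];
        return;
    }
    // Round src down to an aligned address; every load below is aligned.
    auto words = cast(const(size_t)*)(src - misalign);
    immutable shift = misalign * 8;    // bits to drop from the first word
    size_t cur = words[0];
    foreach (i; 0 .. nWords)
    {
        immutable next = words[i + 1]; // aligned load (may read past end)
        // Little-endian combine: tail of cur plus head of next.
        dst[i] = (cur >> shift) | (next << (wordBytes * 8 - shift));
        cur = next;
    }
}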
Mar 26 2009
parent Don <nospam nospam.com> writes:
Thomas Moran wrote:
 On 26/03/2009 20:08, Don wrote:
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation
 manual, and in Intel's 2007 manual. Only one of them actually gave any
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but
 this time the difference is unbelievable.
 Don, have you seen Agner Fog's memcpy() and memmove() implementations
 included with the most recent versions of his manuals? In the
 unaligned case they read two XMM words and shift/combine them into the
 target alignment, so all loads and stores are aligned. Pretty cool.

 He says (modestly):

 ; This method is 2 - 6 times faster than the implementations in the
 ; standard C libraries (MS, Gnu) when src or dest are misaligned.
 ; When src and dest are aligned by 16 (relative to each other) then
 ; this function is only slightly faster than the best standard
 ; libraries.
I'm aware of Agner's code (it was a motivation), but I deliberately
haven't looked at it since it's GPLed. BTW, I already have
copy-with-shifting implemented for the implementation of bigint.
Mar 27 2009
prev sibling next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Thu, 26 Mar 2009 21:08:40 +0100, Don wrote:

 Is it worth spending any more time on it?
It's a basic building block for many programs. Let the code flow! :p
Mar 26 2009
prev sibling parent JC <jcrapuchettes gmail.com> writes:
The applications that I write usually work with matrices of size
600x600 up to 2000x2000, and since they are doubles, that is a good
chunk of memory (a 2000x2000 matrix of doubles is 32 MB).
Unleash the optimizations!
JC

Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
 
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
 manual, and in Intel's 2007 manual. Only one of them actually gave any 
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but 
 this time the difference is unbelievable.
Mar 27 2009