
digitalmars.D - Is 2X faster large memcpy interesting?

reply Don <nospam nospam.com> writes:
The next D2 runtime will include my cache-size detection code. This 
makes it possible to write a cache-aware memcpy, using (for example) 
non-temporal writes when the arrays being copied exceed the size of the 
largest cache.
In my tests, it gives a speed-up of approximately 2X in such cases.
The downside is, it's a fair bit of work to implement, and it only 
affects extremely large arrays, so I fear it's basically irrelevant (It 
probably won't help arrays < 32K in size). Do people actually copy 
megabyte-sized arrays?
Is it worth spending any more time on it?
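
For the curious, the core of the large-copy path looks something like
this minimal sketch. It is not the actual runtime code: it assumes x86
with SSE2, 16-byte-aligned pointers, and a length that is a multiple
of 64:

void nonTemporalCopy(void* dst, const(void)* src, size_t nbytes)
{
    asm
    {
        mov EAX, src;        // EAX, ECX, EDX are scratch in the D ABI
        mov EDX, dst;
        mov ECX, nbytes;
        shr ECX, 6;          // 64 bytes (one cache line) per iteration
    L1:
        movdqa XMM0, [EAX];  // aligned 16-byte loads
        movdqa XMM1, [EAX+16];
        movdqa XMM2, [EAX+32];
        movdqa XMM3, [EAX+48];
        movntdq [EDX], XMM0; // streaming stores bypass the cache
        movntdq [EDX+16], XMM1;
        movntdq [EDX+32], XMM2;
        movntdq [EDX+48], XMM3;
        add EAX, 64;
        add EDX, 64;
        dec ECX;
        jnz L1;
        sfence;              // drain the write-combining buffers
    }
}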


BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
manual, and in Intel's 2007 manual. Only one of them actually gave any 
benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
Intel!)
I've noticed that AMD's docs are usually greatly superior to Intel's, but 
this time the difference is unbelievable.
Mar 26 2009
next sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
 
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
 manual, and in Intel's 2007 manual. Only one of them actually gave any 
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but 
 this time the difference is unbelievable.
What's the alternative? What would you do instead? Is there something
cooler or more important for D to do? (IMHO, if the other alternatives
have any merit, then I'd vote for them.)

But then again, you've already invested in this, and it clearly
interests you. Laborious, yes, but it sounds fun.
Mar 26 2009
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I'd think so. In this day and age it is appalling that we don't quite
know how to quickly copy memory around. A long time ago I ran some
measurements (http://www.ddj.com/cpp/184403799) and I was quite
surprised. My musings were as true then as now. And now we're getting
to the second freakin' Space Odyssey!

===============

Things are clearly hazy, aren't they? First off, maybe it came as a
surprise to you that there's more than one way to fill and copy
objects. Then, there's no single variant of fill and copy that works
best on all compilers, data sets, and machines. (I guess if I tested
the same code on a Celeron, which has less cache, I would have gotten
very different results. To say nothing about other architectures.)

As a rule of thumb, it's generally good to use memcpy (and consequently
fill-by-copy) if you can — for large data sets, memcpy doesn't make
much difference, and for smaller data sets, it might be much faster.
For cheap-to-copy objects, Duff's Device might perform faster than a
simple for loop. Ultimately, all this is subject to your compiler's and
machine's whims and quirks.

There is a very deep, and sad, realization underlying all this. We are
in 2001, the year of the Spatial Odyssey. We've done electronic
computing for more than 50 years now, and we strive to design more and
more complex systems, with unsatisfactory results. Software development
is messy. Could it be because the fundamental tools and means we use
are low-level, inefficient, and not standardized? Just step out of the
box and look at us — after 50 years, we're still not terribly good at
filling and copying memory.

================

Andrei
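
PS: For concreteness, here is the unrolling idea behind Duff's Device
sketched as a plain unrolled loop in D. fillUnrolled is a hypothetical
name, not anyone's library code:

void fillUnrolled(T)(T[] dest, T value)
{
    size_t i = 0;
    // Main body: four stores per iteration.
    for (; i + 4 <= dest.length; i += 4)
    {
        dest[i]     = value;
        dest[i + 1] = value;
        dest[i + 2] = value;
        dest[i + 3] = value;
    }
    // Handle the 0..3 leftover elements.
    for (; i < dest.length; ++i)
        dest[i] = value;
}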
Mar 26 2009
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 As a rule of thumb, it's generally good to use memcpy (and consequently
 fill-by-copy) if you can — for large data sets, memcpy doesn't make much
 difference, and for smaller data sets, it might be much faster. For
 cheap-to-copy objects, Duff's Device might perform faster than a simple
 for loop. Ultimately, all this is subject to your compiler's and
 machine's whims and quirks.
 There is a very deep, and sad, realization underlying all this. We are
 in 2001, the year of the Spatial Odyssey. We've done electronic
 computing for more than 50 years now, and we strive to design more and
 more complex systems, with unsatisfactory results. Software development
 is messy. Could it be because the fundamental tools and means we use are
 low-level, inefficient, and not standardized? Just step out of the box
 and look at us — after 50 years, we're still not terribly good at
 filling and copying memory.
I don't know how sad this is. For better or worse, programming is still a craft, much like blacksmithing. Code is largely written from scratch for each project, techniques are jealously guarded (in our case via copyright law), etc. This may not be great from the perspective of progress, but it certainly makes the work more interesting. But then I'm a tinker at heart, so YMMV.
Mar 26 2009
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 I'd think so. In this day and age it is appalling that we don't quite 
 know how to quickly copy memory around. A long time ago I ran some 
 measurements (http://www.ddj.com/cpp/184403799) and I was quite 
 surprised. My musings were as true then as now. And now we're getting to 
 the second freakin' Space Odyssey!
It turns out that efficiently copying objects only a few bytes long requires a bunch of code. So, I gave up on having the compiler generate intrinsic memcpy code, and instead just call the library function memcpy. This is implemented in the next update.
Mar 26 2009
prev sibling parent Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of 
 the largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant 
 (It probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 I'd think so. In this day and age it is appalling that we don't quite
 know how to quickly copy memory around. A long time ago I ran some
 measurements (http://www.ddj.com/cpp/184403799) and I was quite
 surprised. My musings were as true then as now. And now we're getting
 to the second freakin' Space Odyssey!

 ===============

 Things are clearly hazy, aren't they? First off, maybe it came as a
 surprise to you that there's more than one way to fill and copy
 objects. Then, there's no single variant of fill and copy that works
 best on all compilers, data sets, and machines. (I guess if I tested
 the same code on a Celeron, which has less cache, I would have gotten
 very different results. To say nothing about other architectures.)
More precisely, the optimal algorithm depends on the size of the affected memory, and the size of the CPU caches. However, the only thing which really varies between machines is where the thresholds are.
 As a rule of thumb, it's generally good to use memcpy (and consequently 
 fill-by-copy) if you can — for large data sets, memcpy doesn't make much 
 difference, and for smaller data sets, it might be much faster. For 
 cheap-to-copy objects, Duff's Device might perform faster than a simple 
 for loop. Ultimately, all this is subject to your compiler's and 
 machine's whims and quirks.
This is why I think it's absolutely critical to have cache size
determination as a fundamental runtime library primitive. Memory speeds
have only tripled since 1980, while processor speeds have improved a
thousandfold, so I think it's become less and less acceptable for a
systems language to be cache-unaware.
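
As a sketch of what such a primitive enables, the dispatch could look
like this. The module and field names follow druntime's core.cpuid (the
detection code in question), but treat the exact API as an assumption:

import core.cpuid : dataCaches;
import core.stdc.string : memcpy;

void smartCopy(void* dst, const(void)* src, size_t nbytes)
{
    // dataCaches reports each data cache's size in KB, smallest first.
    size_t largestKB = 0;
    foreach (c; dataCaches)
        if (c.size > largestKB)
            largestKB = c.size;

    if (nbytes > largestKB * 1024)
        nonTemporalCopy(dst, src, nbytes); // streaming-store path, as
                                           // sketched in the first post
    else
        memcpy(dst, src, nbytes);          // in-cache path
}

The threshold is the only machine-specific part; everything else is the
same on every x86.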
 There is a very deep, and sad, realization underlying all this. We are 
 in 2001, the year of the Spatial Odyssey. We've done electronic 
 computing for more than 50 years now, and we strive to design more and 
 more complex systems, with unsatisfactory results. Software development 
 is messy. Could it be because the fundamental tools and means we use are 
 low-level, inefficient, and not standardized? Just step out of the box 
 and look at us — after 50 years, we're still not terribly good at 
 filling and copying memory.
 ================
 
 Andrei
Summarizing everyone's comments -- nobody thinks that a fast large
memcpy is terribly important, but a slow memcpy seems philosophically
wrong.

I wanted to work on getting array operations to be cache-aware. Of
course, the simplest possible array operation is a[]=b[]. On a Core2
with 1MB/core L2 cache, DMD's memcpy performance drops to 0.6
bytes/cycle once you fall out of the L2 cache; but a 64-bit CPU can do
8 bytes/cycle under optimum conditions, when the data is coming from
the L1 cache. So I figured it doesn't make sense to optimize the more
complex operations when the trivial one has so much potential for
improvement.
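
If you want to reproduce the measurement, a rough benchmark along these
lines works. It uses the current std.datetime.stopwatch and reports
wall-clock throughput rather than cycles, so the numbers will vary by
machine:

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    // Sizes straddling typical L1/L2/main-memory boundaries.
    foreach (size; [16 * 1024, 256 * 1024, 4 * 1024 * 1024,
                    64 * 1024 * 1024])
    {
        auto a = new ubyte[size];
        auto b = new ubyte[size];
        enum reps = 50;

        auto sw = StopWatch(AutoStart.yes);
        foreach (_; 0 .. reps)
            a[] = b[];              // the array-copy primitive itself
        sw.stop();

        immutable secs = sw.peek.total!"nsecs" / 1e9;
        writefln("%8s KB: %6.2f GB/s", size / 1024,
                 size * cast(double) reps / secs / 1e9);
    }
}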
Mar 27 2009
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I say go for it!
Mar 26 2009
prev sibling next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
I don't use large arrays very often. When I do, I would not copy them
if I could avoid it. Usually I keep catenating to an array up to a
certain point, and from then on I only ever need to read from it, with
no copying ever necessary. So I would rarely, if ever, benefit from
this.
Mar 26 2009
prev sibling next sibling parent Trass3r <mrmocool gmx.de> writes:
Don schrieb:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
Well, arrays > 32K aren't that unusual, esp. in scientific computing.
Even a small 200x200 matrix of doubles takes up 40000*8 = 320,000 bytes.
Mar 26 2009
prev sibling next sibling parent reply Thomas Moran <no spam.plx> writes:
On 26/03/2009 20:08, Don wrote:
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation
 manual, and in Intel's 2007 manual. Only one of them actually gave any
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but
 this time the difference is unbelievable.
Don, have you seen Agner Fog's memcpy() and memmove() implementations
included with the most recent versions of his manuals? In the unaligned
case they read two XMM words and shift/combine them into the target
alignment, so all loads and stores are aligned. Pretty cool.

He says (modestly):

; This method is 2 - 6 times faster than the implementations in the
; standard C libraries (MS, Gnu) when src or dest are misaligned.
; When src and dest are aligned by 16 (relative to each other) then this
; function is only slightly faster than the best standard libraries.
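
For reference, here is a rough scalar illustration of the same
align-and-combine idea. It's only a sketch: Agner's version does this
16 bytes at a time in XMM registers, whereas this works one machine
word at a time, assumes a little-endian target, and may read up to one
word past the end of the source:

// Copy nWords machine words from a possibly misaligned src to an
// aligned dst, using only aligned loads.
void shiftCombineCopy(size_t* dst, const(ubyte)* src, size_t nWords)
{
    enum wordBytes = size_t.sizeof;
    immutable misalign = cast(size_t) src & (wordBytes - 1);
    if (misalign == 0)
    {
        dst[0 .. nWords] = (cast(const(size_t)*) src)[0 .. nWords];
        return;
    }
    // Round src down to an aligned address; every load below is aligned.
    auto words = cast(const(size_t)*)(src - misalign);
    immutable shift = misalign * 8;    // bits to drop from the first word
    size_t cur = words[0];
    foreach (i; 0 .. nWords)
    {
        immutable next = words[i + 1]; // aligned load (may read past end)
        // Little-endian combine: tail of cur plus head of next.
        dst[i] = (cur >> shift) | (next << (wordBytes * 8 - shift));
        cur = next;
    }
}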
Mar 26 2009
parent Don <nospam nospam.com> writes:
Thomas Moran wrote:
 On 26/03/2009 20:08, Don wrote:
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation
 manual, and in Intel's 2007 manual. Only one of them actually gave any
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but
 this time the difference is unbelievable.
 Don, have you seen Agner Fog's memcpy() and memmove() implementations
 included with the most recent versions of his manuals? In the
 unaligned case they read two XMM words and shift/combine them into the
 target alignment, so all loads and stores are aligned. Pretty cool.

 He says (modestly):

 ; This method is 2 - 6 times faster than the implementations in the
 ; standard C libraries (MS, Gnu) when src or dest are misaligned.
 ; When src and dest are aligned by 16 (relative to each other) then
 ; this function is only slightly faster than the best standard
 ; libraries.
I'm aware of Agner's code (it was a motivation), but I deliberately
haven't looked at it since it's GPLed. BTW, I already have
copy-with-shifting implemented for the implementation of bigint.
Mar 27 2009
prev sibling next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Thu, 26 Mar 2009 21:08:40 +0100, Don wrote:

 Is it worth spending any more time on it?
It's a basic building block for many programs. Let the code flow! :p
Mar 26 2009
prev sibling parent JC <jcrapuchettes gmail.com> writes:
The applications that I write usually work with matrices of size
600x600 up to 2000x2000, and since they are doubles, that is a good
chunk of memory (a 2000x2000 matrix of doubles is 32 MB).
Unleash the optimizations!
JC

Don wrote:
 The next D2 runtime will include my cache-size detection code. This 
 makes it possible to write a cache-aware memcpy, using (for example) 
 non-temporal writes when the arrays being copied exceed the size of the 
 largest cache.
 In my tests, it gives a speed-up of approximately 2X in such cases.
 The downside is, it's a fair bit of work to implement, and it only 
 affects extremely large arrays, so I fear it's basically irrelevant (It 
 probably won't help arrays < 32K in size). Do people actually copy 
 megabyte-sized arrays?
 Is it worth spending any more time on it?
 
 
 BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
 manual, and in Intel's 2007 manual. Only one of them actually gave any 
 benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
 Intel!)
 I've noticed that AMD's docs are usually greatly superior to Intel's, but 
 this time the difference is unbelievable.
Mar 27 2009