digitalmars.D - Fast file copy (threaded or not?)
- "Marco Leise" <Marco.Leise gmx.de> Sep 01 2011
- "Vladimir Panteleev" <vladimir thecybershadow.net> Sep 01 2011
- "Jonathan M Davis" <jmdavisProg gmx.com> Sep 01 2011
- Johannes Pfau <spam example.com> Sep 01 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 01 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 01 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 01 2011
- Brad Roberts <braddr puremagic.com> Sep 01 2011
- Johannes Pfau <spam example.com> Sep 02 2011
- zeljkog <zeljkog nospam.com> Sep 02 2011
- Walter Bright <newshound2 digitalmars.com> Sep 02 2011
- Jonathan M Davis <jmdavisProg gmx.com> Sep 02 2011
- Andrej Mitrovic <andrej.mitrovich gmail.com> Sep 02 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 02 2011
I split the discussion with Andrei about the benefit of a multi-threaded
file copy routine to its own thread.
This is about copying a file from and to the same HDD - a mechanical disk
with seek times.
My testing showed that Andrei was correct in assuming that the
kernel can optimize the small reads and writes in a multi-threaded
application. I had to use large buffers of up to 64 MB with my
"single-threaded, 100% synchronized writes" version before the simple
multi-threaded version from Johannes Pfau showed 4.3% overhead during a
512 MB copy operation.
Some more things I've experimented with:
- using only system API calls instead of D wrappers:
    The difference is close to background noise.
- direct I/O for writing, as used by databases:
    This worked pretty well, but you may not want to use it for
    reading as it bypasses the file cache. A file that is already
    cached would be copied more slowly as a result.
- memory maps:
    Kernel memory is shared with user space. This approach does
    not allocate memory in the application. It just makes pages
    of files directly accessible in user space. Once mapped, the
    whole copy operation comes down to a single 'memcpy' call.
- splice (zero-copy):
    This is a Linux system call that allows memory operations inside
    the kernel to be controlled from user space. The benefit is
    that the CPU never copies this memory from kernel to
    user space. Unfortunately the copy operation goes like this:
    "source file -> pipe, pipe -> destination file"
    A pipe is a hard-coded 64 KB buffer, so it is not easy to move
    large chunks of data in a single call to splice(). 512 MB are
    still divided into 16,000+ calls.
Although splice looks promising, it suffers from too many context switches.
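The call count quoted above follows from simple arithmetic, assuming every 64 KB chunk needs two splice() calls (file to pipe, then pipe to file). A quick compile-time check in D; the constant names are made up for illustration:

```d
// Back-of-the-envelope check of the splice() call count.
// Assumption: each chunk costs two calls (file -> pipe, pipe -> file).
enum ulong fileSize    = 512UL * 1024 * 1024; // 512 MB copy
enum ulong pipeBuffer  = 64UL * 1024;         // 64 KB pipe buffer
enum ulong chunks      = fileSize / pipeBuffer;
enum ulong spliceCalls = 2 * chunks;

static assert(chunks == 8_192);
static assert(spliceCalls == 16_384);
```

Each of those calls crosses the user/kernel boundary, which is consistent with the observation about context switches.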
I had the best results with direct I/O and synchronized writes for
buffer sizes from 8 MB upwards, but I found this to be too complex and
probably system-dependent. So I settled on the memory-mapped version,
which I rewrote using Phobos instead of POSIX calls, so it should run
equally well on all platforms. It is 5 lines of code at its core:
----------------------------------------------------------------------
import std.datetime, std.exception, std.stdio, std.mmfile;

void main(string[] args)
{
    if (!enforce(args.length == 3, {
        stderr.writefln("%s SOURCE DEST", args[0]);
    })) return;

    auto sw = StopWatch();
    sw.start();

    auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
    auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length, null, src.length);
    auto data = dst[];
    data[] = src[];
    dst.flush();

    sw.stop();
    writefln("Copied %s bytes in %s msec (%s kB/s)", src.length, sw.peek().msecs,
            1_000_000 * src.length / (1024 * sw.peek().usecs));
}
----------------------------------------------------------------------
This leaves it up to the kernel how to interleave disk reads and writes.
- Marco
Sep 01 2011
On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de> wrote:
> So I settled with the memory mapped version,
I wouldn't advise using memory-mapped files under the hood for anything without prior extensive testing in low-memory conditions. The kernel will be reluctant to drop pages in the file that have already been read/written. Some hinting APIs could be used, but these are not portable or reliable. (I recently had to rewrite one of my programs which used memory-mapped files with this as one of the reasons).
--
Best regards,
Vladimir mailto:vladimir thecybershadow.net
Sep 01 2011
On Thursday, September 01, 2011 13:13 Marco Leise wrote:
> I split the discussion with Andrei about the benefit of a multi-threaded file copy routine to its own thread. This is about copying a file from and to the same HDD - a mechanical disk with seek times. [...]
I would point out that regardless of what happens with performance with synchronous vs asynchronous I/O on a single HDD, it's pretty much a guarantee that in the general case asynchronous I/O is going to be faster when dealing with different HDDs. So, while we should definitely get hard data, unless copying asynchronously on a single hard drive is significantly worse than copying synchronously, it's pretty much a given that we'd want to go with asynchronous I/O by default. If it were found that asynchronous I/O was significantly better on a single HDD, then that would make the question much more interesting, but as long as it's at least close to, if not better than, synchronous I/O on the same HDD, asynchronous I/O would be the way to go.
- Jonathan M Davis
Sep 01 2011
Marco Leise wrote:
> I split the discussion with Andrei about the benefit of a multi-threaded file copy routine to its own thread. This is about copying a file from and to the same HDD - a mechanical disk with seek times. [...]
Related link: http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/
More related information: the maximum Linux readahead buffer is 128 KB (but I think that can be overridden). It seems like there's no 'per file' limit for the write buffer. The only limit seems to be the memory available for caching (for example, in my case with 3 GB of RAM, 1118 MB are available for the write cache).
--
Johannes Pfau
Sep 01 2011
On 01.09.2011 at 22:43, Vladimir Panteleev <vladimir thecybershadow.net> wrote:
> On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de> wrote:
>> So I settled with the memory mapped version,
> I wouldn't advise using memory-mapped files under the hood for anything without prior extensive testing in low-memory conditions. The kernel will be reluctant to drop pages in the file that have already been read/written. Some hinting APIs could be used, but these are not portable or reliable. (I recently had to rewrite one of my programs which used memory-mapped files with this as one of the reasons).
So cached data from memory-mapped files is not handled the same way that cache from normal reads/writes is handled? Good catch, I'll remember that.
Sep 01 2011
On 01.09.2011 at 23:55, Johannes Pfau <spam example.com> wrote:
> Related link: http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/ More related information: Linux maximum readahead buffer is 128KB (but I think that can be overwritten). It seems like there's no 'per file' limit for the write buffer. The only limit seems to be the memory available for caching (for example, in my case with 3GB of ram 1118MB are available for the write cache)
As far as I can tell, the write buffer is influenced by several settings, the free RAM and timers :). It will be different for virtually every environment. I didn't know about the readahead buffer, though. I read that you can double it with the POSIX_FADV_SEQUENTIAL advice on the file. But to be honest, this probably has little effect unless you process the data while reading tiny blocks of it; one large read is faster than lots of small reads. I tried the POSIX_FADV_SEQUENTIAL flag in my copy routine and it had no observable influence. POSIX_FADV_NOREUSE doesn't seem to be implemented :D. A full DMA copy from file descriptor to file descriptor would be nice, or adjustable pipe sizes, so that splice() can do more work in the background.
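For readers who want to repeat the POSIX_FADV_SEQUENTIAL experiment from D, a minimal sketch follows. It assumes 64-bit Linux; the C function is declared by hand so the snippet stays self-contained, the constant value 2 is the Linux definition, and `adviseSequential` is a hypothetical helper name, not part of Phobos.

```d
// Sketch: hint sequential access on an open file (64-bit Linux assumed).
import core.sys.posix.sys.types : off_t;
import std.stdio : File;

// Declared by hand for this sketch. The value 2 is the Linux definition
// of POSIX_FADV_SEQUENTIAL and may differ on other systems.
extern (C) int posix_fadvise(int fd, off_t offset, off_t len, int advice);
enum POSIX_FADV_SEQUENTIAL = 2;

/// Returns true if the kernel accepted the hint (hypothetical helper).
bool adviseSequential(ref File f)
{
    // offset 0 with len 0 covers the whole file; 0 is returned on success.
    return posix_fadvise(f.fileno, 0, 0, POSIX_FADV_SEQUENTIAL) == 0;
}
```

The call is only advisory, so a copy routine should treat a `false` result as harmless and carry on.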
Sep 01 2011
On 01.09.2011 at 22:38, Jonathan M Davis <jmdavisProg gmx.com> wrote:
> I would point out that regardless of what happens with performance with synchronous vs asynchronous I/O on a single HDD, it's pretty much a guarantee that in the general case asynchronous I/O is going to be faster when dealing with different HDDs. [...]
I guess you are right. Neither my expectations nor Andrei's were met. I/O from multiple threads to a single device is handled remarkably well on today's systems. While it looked like a no-go to me and others on the net, we see no major difference in performance between the two approaches with typical buffer sizes and Phobos routines. If you want the extra 5% in some cases, you can go for that 100 MB buffer, OS-specific functions and file usage hints, but that's never good for a standard library routine that is meant to be short, solid and portable.
Sep 01 2011
On 9/1/2011 1:13 PM, Marco Leise wrote:
> Although splice looks promising it suffers from too many context switches. I had the best results with direct I/O and using synchronized writes for buffer sizes from 8 MB onwards, but I found this to be too complex and probably system dependent. So I settled with the memory mapped version, that I rewrote using Phobos instead of POSIX calls, so it should run equally well on all platforms and is 5 lines of code at its core: [...]
mmap has an issue with files larger than the mappable address space. While that case is not hard to handle, it does complicate the code in ways that the other options don't.
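To give an idea of the extra code involved, here is a rough sketch that copies through a sliding mapping window instead of mapping both files whole. It relies on the `window` parameter of MmFile and uses the current Phobos spelling `MmFile.Mode.read` / `readWriteNew` (the 2011 code in this thread used `Read` / `ReadWriteNew`). The helper name and the 64 MiB default are assumptions, and an empty source file is not handled.

```d
import std.algorithm : min;
import std.mmfile : MmFile;

/// Hypothetical helper: copy `from` to `to` through a sliding mapping window.
/// Assumes a non-empty source file.
void copyWindowed(string from, string to, size_t window = 64 * 1024 * 1024)
{
    // `window` should be a multiple of the allocation granularity; 64 MiB
    // fits both 4 KiB pages and the 64 KiB granularity used on Windows.
    auto src = new MmFile(from, MmFile.Mode.read, 0, null, window);
    scope (exit) destroy(src);
    const ulong total = src.length;
    auto dst = new MmFile(to, MmFile.Mode.readWriteNew, total, null, window);
    scope (exit) destroy(dst);

    for (ulong off = 0; off < total; off += window)
    {
        const ulong end = min(off + window, total);
        // Slicing remaps on demand, so only a few windows are mapped at once.
        auto target = cast(ubyte[]) dst[off .. end];
        target[] = cast(const(ubyte)[]) src[off .. end];
    }
    dst.flush();
}
```

Compared with the single slice assignment in the original program, the loop and the alignment requirement are the added complexity.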
Sep 01 2011
Marco Leise wrote:
> I split the discussion with Andrei about the benefit of a multi-threaded file copy routine to its own thread. This is about copying a file from and to the same HDD - a mechanical disk with seek times. [...]
allocate buffers dynamically: https://gist.github.com/1188128 I hope I didn't screw up there. The idea is to have 2 buffers. Then, at the same time, one buffer is read from and one buffer is written to. When both read & write are finished, the buffers are swapped. -- Johannes Pfau
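The two-buffer idea described above can be sketched roughly as follows. This is not the code from the linked gist; it is a minimal illustration using std.parallelism, with hypothetical names (`writeChunk`, `copyDoubleBuffered`) and an arbitrary 1 MiB default buffer size.

```d
import std.algorithm : swap;
import std.parallelism : task;
import std.stdio : File;

// Runs in the worker thread: write one filled chunk to the destination.
void writeChunk(File dst, ubyte[] chunk)
{
    dst.rawWrite(chunk);
}

/// Hypothetical helper illustrating the two-buffer scheme.
void copyDoubleBuffered(string from, string to, size_t bufSize = 1024 * 1024)
{
    auto src = File(from, "rb");
    auto dst = File(to, "wb");
    auto front = new ubyte[bufSize]; // chunk currently being written
    auto back  = new ubyte[bufSize]; // chunk currently being read into

    ubyte[] filled = src.rawRead(front);
    while (filled.length > 0)
    {
        auto writer = task!writeChunk(dst, filled);
        writer.executeInNewThread();      // write the front buffer...
        ubyte[] next = src.rawRead(back); // ...while reading into the back one
        writer.yieldForce;                // wait for the write, rethrow errors
        swap(front, back);                // front now holds the fresh data
        filled = next;
    }
    // The tasks keep copies of the File handle, so close it explicitly
    // to make sure buffered data reaches the file before returning.
    dst.close();
}
```

Starting a thread per chunk is wasteful; a long-lived writer thread would be the obvious refinement, but it would hide the swap step this sketch is meant to show.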
Sep 02 2011
Marco Leise Wrote:
> [...]
> writefln("Copied %s bytes in %s msec (%s kB/s)", src.length, sw.peek().msecs, 1_000_000 * src.length / (1024 * sw.peek().usecs));
> [...]
Looking at this code, should StopWatch.peek() be defined as a property?
Sep 02 2011
On 9/1/2011 1:13 PM, Marco Leise wrote:
> I split the discussion with Andrei about the benefit of a multi-threaded file copy routine to its own thread. This is about copying a file from and to the same HDD - a mechanical disk with seek times.
On Windows, we should just stick with the Windows CopyFile function: http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx And let the MS guys do their thing. Presumably they will do what works best on Windows.
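A sketch of what that could look like from D, assuming the druntime Windows bindings expose `CopyFileW` through `core.sys.windows.windows`. The wrapper name `platformCopy` is hypothetical. As far as I know, `std.file.copy` already calls `CopyFileW` on Windows, so the explicit branch is shown only to make the idea concrete.

```d
import std.file : copy;

/// Hypothetical wrapper: let the platform decide how to copy.
void platformCopy(string from, string to)
{
    version (Windows)
    {
        import core.sys.windows.windows : CopyFileW;
        import std.exception : enforce;
        import std.utf : toUTF16z;

        // Last argument 0 (FALSE): overwrite an existing destination.
        enforce(CopyFileW(from.toUTF16z, to.toUTF16z, 0) != 0,
                "CopyFileW failed for " ~ from);
    }
    else
    {
        // Elsewhere, fall back to the portable Phobos routine.
        copy(from, to);
    }
}
```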
Sep 02 2011
On Friday, September 02, 2011 04:58:46 zeljkog wrote:
> Marco Leise Wrote:
>> [...]
>> writefln("Copied %s bytes in %s msec (%s kB/s)", src.length, sw.peek().msecs, 1_000_000 * src.length / (1024 * sw.peek().usecs));
>> [...]
> Looking at this code, should StopWatch.peek() be defined as a property?
Why? Its name isn't a noun, and conceptually, it's not really a property. You're "peeking" at the current time elapsed. That's very much an action, not a property.
- Jonathan M Davis
Sep 02 2011
On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:
> On Windows, we should just stick with the Windows CopyFile function: http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx And let the MS guys do their thing. Presumably they will do what works best on Windows.
I've given OP's code a few test runs but I just get inconsistent results. Sometimes the async version is twice as fast, other times a simple call via system("copy file1 file2") is faster. Anyway, I'm assuming the MS devs optimized copying beyond the little snippet we have here.. :p
Sep 02 2011
On 02.09.2011 at 16:08, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
> On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:
>> On Windows, we should just stick with the Windows CopyFile function: http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx And let the MS guys do their thing. Presumably they will do what works best on Windows.
> I've given OP's code a few test runs but I just get inconsistent results. Sometimes the async version is twice as fast, other times a simple call via system("copy file1 file2") is faster. Anyway, I'm assuming the MS devs optimized copying beyond the little snippet we have here.. :p
Yeah, to get consistent results we'd need at minimum:
- a fixed target location on disk (sectors towards the end are ~2x slower; this can be ensured by not truncating/erasing the target on every run)
- the ability to disable / clear the read cache (possible on Linux)
- real-time I/O priority for the process
Sep 02 2011