
digitalmars.D.announce - parallel copy directory, faster than robocopy

reply Jay Norwood <jayn prismnet.com> writes:
Attached is the source for a small parallel app that copies a source folder to
a destination.  It creates the directory structure first using the breadth
ordering, then uses a parallel foreach loop with the taskPool to copy all the
regular files in parallel.  On my Core i7, this copied a 1.5GB folder with
around 36K entries to a destination in about 11.5 secs (src and dest on the
same SSD).  This was about a second faster than robocopy, which is the
fastest alternative I could find.  The regular Windows 7 64-bit copy takes
41 secs for the same folder.

I'd like to add wildcard processing for the sources, but haven't found a good
example. 
Feb 13 2012
next sibling parent reply Jay Norwood <jayn prismnet.com> writes:
ok, so I guess the Add File didn't work for some reason, so here's the source.



module main;

import std.stdio;
import std.file;
import std.path;
import std.datetime;
import std.parallelism;

int main(string[] argv)
{
 	if (argv.length != 3){
 		writeln ("need to specify src and dest dir");
 		return 0;
 	}

	// TODO expand this to handle wildcard

 	string dest = argv[$-1];
 	foreach(string dir; argv[1..$-1])
	{
		writeln("copying directory: "~ dir );
		auto st1 = Clock.currTime(); //Current time in local time.
		cpdir(dir,dest); 
 		auto st2 = Clock.currTime(); //Current time in local time.
		auto dif = st2  - st1 ;
		auto ts= dif.toString();
		writeln("time:"~ts);
	}
	writeln("finished !");
	return 0;
}
void cpdir(in char[] pathname ,in char[] dest){
    DirEntry deSrc = dirEntry(pathname);
	string[] files;

	if (!exists(dest)){
		mkdir (dest); // makes dest root
	}
 	DirEntry destDe = dirEntry(dest);
	if(!destDe.isDir()){        
		throw new FileException( destDe.name, " is not a directory"); 
	}
	string destName = destDe.name ~ '/';

	if(!deSrc.isDir()){
		copy(deSrc.name,dest); 
	}
	else    { 
		string srcRoot = deSrc.name;
		size_t srcLen = srcRoot.length; // length is size_t; assigning it to int fails on 64-bit targets
		string destRoot = destName ~ baseName(deSrc.name);
        mkdir(destRoot);
		
		// make an array of the regular files only, also create the directory structure
		// Since it is SpanMode.breadth, can just use mkdir
 		foreach(DirEntry e; dirEntries(deSrc.name, SpanMode.breadth, false)){
			if (attrIsDir(e.linkAttributes)){
				string destDir = destRoot ~ e.name[srcLen..$];
				mkdir(destDir);
			}
			else{
				files ~= e.name;
			}
 		} 

		// parallel foreach for regular files
		foreach(fn ; taskPool.parallel(files)) {
			string dfn = destRoot ~ fn[srcLen..$];
			copy(fn,dfn);
		}
	}
}
Feb 13 2012
parent reply Jay Norwood <jayn prismnet.com> writes:
ok, I didn't test that first one very well.  It worked for directory copies,
but I didn't test non directories.  So here is the fixed operation for non
directories, where it just copies the single file.  

So it now does two cases:
copy regular_file destinationDirectory
copy folder destinationDirectory

What I'd like to add is wildcard support for something like 
 copy folder/* destinationDirectory

I suppose also it could be enhanced to handle all the robocopy options, but I'm
just trying out the copy speeds for now.


module main;

import std.stdio;
import std.file;
import std.path;
import std.datetime;
import std.parallelism;

int main(string[] argv)
{
 	if (argv.length != 3){
 		writeln ("need to specify src and dest dir");
 		return 0;
 	}

	// TODO expand this to handle wildcard

 	string dest = argv[$-1];
 	foreach(string dir; argv[1..$-1])
	{
		writeln("copying directory: "~ dir );
		auto st1 = Clock.currTime(); //Current time in local time.
		cpdir(dir,dest); 
 		auto st2 = Clock.currTime(); //Current time in local time.
		auto dif = st2  - st1 ;
		auto ts= dif.toString();
		writeln("time:"~ts);
	}
	writeln("finished !");
	return 0;
}
void cpdir(in char[] pathname ,in char[] dest){
    DirEntry deSrc = dirEntry(pathname);
	string[] files;

	if (!exists(dest)){
		mkdir (dest); // makes dest root
	}
 	DirEntry destDe = dirEntry(dest);
	if(!destDe.isDir()){        
		throw new FileException( destDe.name, " is not a directory"); 
	}
	string destName = destDe.name ~ '/';
	string destRoot = destName ~ baseName(deSrc.name);

	if(!deSrc.isDir()){
		copy(deSrc.name,destRoot); 
	}
	else    { 
		string srcRoot = deSrc.name;
		size_t srcLen = srcRoot.length; // length is size_t; assigning it to int fails on 64-bit targets
        mkdir(destRoot);
		
		// make an array of the regular files only, also create the directory structure
		// Since it is SpanMode.breadth, can just use mkdir
 		foreach(DirEntry e; dirEntries(deSrc.name, SpanMode.breadth, false)){
			if (attrIsDir(e.linkAttributes)){
				string destDir = destRoot ~ e.name[srcLen..$];
				mkdir(destDir);
			}
			else{
				files ~= e.name;
			}
 		} 

		// parallel foreach for regular files
		foreach(fn ; taskPool.parallel(files)) {
			string dfn = destRoot ~ fn[srcLen..$];
			copy(fn,dfn);
		}
	}
}
Feb 13 2012
parent reply Jay Norwood <jayn prismnet.com> writes:
An improvement is to change this first mkdir to mkdirRecurse.

  if (!exists(dest)){
                mkdir (dest); // makes dest root
        }
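
A minimal sketch of that change, assuming a hypothetical helper name (ensureDestDir) that does not appear in the posted source. mkdirRecurse creates any missing parent directories, which plain mkdir does not.

```d
import std.file : exists, isDir, mkdirRecurse, FileException;

// Hypothetical helper (name is an assumption, not from the posted code).
// Makes sure dest exists as a directory, creating parents when needed.
void ensureDestDir(string dest)
{
    if (!exists(dest))
        mkdirRecurse(dest);   // unlike mkdir, also creates missing parents
    else if (!isDir(dest))
        throw new FileException(dest, "is not a directory");
}
```

With this in place, a destination such as out/a/b/c can be given even when out does not exist yet.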
Feb 14 2012
parent deadalnix <deadalnix gmail.com> writes:
On 14/02/2012 14:29, Jay Norwood wrote:
 An improvement is to change this first mkdir to mkdirRecurse.

    if (!exists(dest)){
                  mkdir (dest); // makes dest root
          }

If I could suggest something, it would be great to see this added to std.file, as well as the multithreaded remove we talked about recently in another thread.
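
For illustration only, a multithreaded remove could follow the same pattern as the copy. This is a sketch under assumptions: the name rmdirParallel is hypothetical, it is not the code from the other thread, and it is not an existing std.file function. It collects entries in depth order, removes the files in parallel, then removes the now-empty directories serially.

```d
import std.file : attrIsDir, dirEntries, DirEntry, remove, rmdir, SpanMode;
import std.parallelism : taskPool;

// Hypothetical sketch; rmdirParallel is an assumed name, not a Phobos function.
void rmdirParallel(string root)
{
    string[] files;
    string[] dirs;

    // SpanMode.depth yields a directory's contents before the directory
    // itself, so dirs ends up ordered children-first.
    foreach (DirEntry e; dirEntries(root, SpanMode.depth, false))
    {
        if (attrIsDir(e.linkAttributes))
            dirs ~= e.name;
        else
            files ~= e.name;
    }

    // File removals are independent of each other, so run them in parallel.
    if (files.length)
        foreach (fn; taskPool.parallel(files))
            remove(fn);

    // rmdir needs an empty directory, so directories go serially, in order.
    foreach (d; dirs)
        rmdir(d);
    rmdir(root);
}
```

The directory pass stays serial because a parent can only be removed after all of its children are gone.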
Feb 14 2012
prev sibling next sibling parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 2/13/2012 10:58 PM, Jay Norwood wrote:
 Attached is the source for a small parallel app that copies a source folder to
a destination.  It creates the directory structure first using the breadth
ordering, then uses a parallel foreach loop with the taskPool to copy all the
regular files in parallel.  On my corei7, this copied a 1.5GB folder with
around 36K entries to a destination in about 11.5 secs (src and dest on the
same ssd drive).  This was about a second better than robocopy, which is the
fastest alternative I could find.   The regular win7-64 copy takes 41 secs for
the same folder.

 I'd like to add wildcard processing for the sources, but haven't found a good
example.

More of an 'FYI/reminder':

At a minimum, Robocopy does additional work to preserve the timestamps and attributes of the copies of the files (by default) so it can avoid redundant copies of files in the future. This is undoubtedly creating some additional overhead.

It's probably also quite a bit worse with /SEC etc. to copy permissions.

On the plus side, you would have Windows scheduling the IO, which in theory would be able to minimize seeking to some degree, compared to Robocopy's serial copying.
Feb 14 2012
parent "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 15 February 2012 at 00:11:32 UTC, Sean Cavanaugh 
wrote:
 more of an 'FYI/reminder':

 At a minimum Robocopy does additional work to preserve the 
 timestamps and attributes of the copies of the files (by 
 default) so it can avoid redundant copies of files in the 
 future.  This is undoubtedly creating some additional overhead.

 Its probably also quite a bit worse with /SEC etc to copy 
 permissions.

 On the plus side you would have windows scheduling the IO which 
 in theory would be able to minimize seeking to some degree, 
 compared to robocopy's serial copying.

Yeah, Robocopy has a lot of nice options. Currently the D library has copy(srcpath, destpath), which goes directly to the OS copy. If it had something like copy(DirectoryEntry, destpath, options), with the options being like the Robocopy options, that might be more efficient.

On the SSD, seeking is on the order of 0.2 msec vs 16 msec on my 7200 rpm Seagate hard drive. I do think seeks on a hard drive will be a problem with all the small, individual file copies. So is Robocopy bundling these up in some way?

I did find a nice solution in std.file for the argv expansion, btw, and posted an example on D.learn. It uses a version of dirEntries that has an extra pattern parameter for the expansion; the matching it relies on is available in std.path.
Feb 14 2012
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"Jay Norwood" <jayn prismnet.com> wrote in message 
news:jhcplo$1jj8$1 digitalmars.com...
 Attached is the source for a small parallel app that copies a source 
 folder to a destination.  It creates the directory structure first using 
 the breadth ordering, then uses a parallel foreach loop with the taskPool 
 to copy all the regular files in parallel.  On my corei7, this copied a 
 1.5GB folder with around 36K entries to a destination in about 11.5 secs 
 (src and dest on the same ssd drive).  This was about a second better than 
 robocopy, which is the fastest alternative I could find.   The regular 
 win7-64 copy takes 41 secs for the same folder.

 I'd like to add wildcard processing for the sources, but haven't found a 
 good example.

Nice! Is it possible this could increase disk fragmentation though? Or do the filesystem drivers on Win/Lin/etc work in a way that mitigates that possibility?
Feb 14 2012
prev sibling next sibling parent reply "Jay Norwood" <jayn prismnet.com> writes:
I placed the two parallel file operations, rmdir and copy on 
github in

https://github.com/jnorwood/file_parallel

These combine the std.parallelism operations with the std.file 
operations to speed up the processing on Windows.
-----------
I also put a useful function that does argv pathname wildcard 
expansion in

https://github.com/jnorwood/file_utils

This makes use of one of the existing dirEntries calls that has 
the pattern matching parameter which enables simple * and ? 
expansions in windows args.  I'm only allowing expansions in the 
basename, and only expanding in one level of the directory.
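
A rough sketch of that kind of expansion, for readers who do not want to open the repository. The function name expandWildcard is an assumption and this is not the code in file_utils; it only shows the dirEntries overload that takes a pattern, restricted to the basename and to one directory level as described above.

```d
import std.algorithm.searching : canFind;
import std.file : dirEntries, SpanMode;
import std.path : baseName, dirName;

// Hypothetical helper, not the function from the file_utils repository.
// Expands * and ? in the last path component only, one directory level deep.
string[] expandWildcard(string arg)
{
    string pattern = baseName(arg);
    if (!pattern.canFind('*') && !pattern.canFind('?'))
        return [arg];   // no wildcard: pass the argument through unchanged

    string[] matches;
    // This dirEntries overload keeps only entries whose base name
    // matches the glob pattern.
    foreach (e; dirEntries(dirName(arg), pattern, SpanMode.shallow))
        matches ~= e.name;
    return matches;
}
```

Each element of argv[1..$-1] could be passed through such a function before the copy loop, so that folder/* becomes the list of entries directly under folder.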

There are example Windows command-line utilities that use each of 
the functions in file_parallel/examples.

I've only tested these on win7, 64 bit.
Mar 04 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/12 2:53 PM, Jay Norwood wrote:
 I placed the two parallel file operations, rmdir and copy on github in

 https://github.com/jnorwood/file_parallel

 These combine the std.parallelism operations with the std.file
 operations to speed up the processing on Windows.
 -----------
 I also put a useful function that does argv pathname wildcard expansion in

 https://github.com/jnorwood/file_utils

 This makes use of one of the existing dirEntries calls that has the
 pattern matching parameter which enables simple * and ? expansions in
 windows args. I'm only allowing expansions in the basename, and only
 expanding in one level of the directory.

 There are example Windows command-line utilities that use each of the
 functions in file_parallel/examples.

 I've only tested these on win7, 64 bit.

Sounds great! Next step, should you be interested, is to create a pull request for Phobos so we can integrate your code within.

Andrei
Mar 05 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Monday, 5 March 2012 at 12:48:54 UTC, Andrei Alexandrescu 
wrote:
 Sounds great! Next step, should you be interested, is to create 
 a pull request for phobos so we can integrate your code within.

 Andrei

I considered that. I suppose the wildArgv code could go in std.path and the file operations into std.file, with the pull requests made against those files. I haven't followed the discussions closely enough to know the rules/politics about adding another std library import into those. It would require adding an import of std.parallelism into std.file.
Mar 05 2012
prev sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 14.02.2012 05:58, Jay Norwood wrote:
 Attached is the source for a small parallel app that copies a source folder to
a destination.  It creates the directory structure first using the breadth
ordering, then uses a parallel foreach loop with the taskPool to copy all the
regular files in parallel.  On my corei7, this copied a 1.5GB folder with
around 36K entries to a destination in about 11.5 secs (src and dest on the
same ssd drive).  This was about a second better than robocopy, which is the
fastest alternative I could find.   The regular win7-64 copy takes 41 secs for
the same folder.

 I'd like to add wildcard processing for the sources, but haven't found a good
example.

Do you compare single-threaded robocopy with your implementation, or multithreaded?

You can command robocopy to use multiple threads with /MT[:n]
Mar 05 2012
next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Monday, 5 March 2012 at 16:35:09 UTC, dennis luehring wrote:
 do you compare single-threaded robocopy with your 
 implementation or multithreaded?

 you can command robocopy to use multiple threads with /MT[:n]

Yes, I tested vs multithreaded robocopy. As someone pointed out, robocopy has lots of nice options, which I didn't try to duplicate, and is only about 10% slower on my test. I was happy to see the D app in the same ballpark as robocopy, which means to me that the very simple and clean std.parallelism taskPool foreach loop can produce very good multi-core results in a very concise and readable piece of code. I've done some projects previously using omp pragmas in C++ and it is just so ugly.
Mar 05 2012
prev sibling parent "Jay Norwood" <jayn prismnet.com> writes:
So here is the output of a batch file I just ran on the ssd drive 
for the 1.5GB copy.  Robocopy displays that it took around 14 
secs, while the release build of the D commandline cpd utility 
took around 12 secs.  That's a pretty consistent result on the 
ssd drive, which are more sensitive to cpu pr.

06:12 PM

H:\xx8>robocopy /E /NDL /NFL /NC /NS /MT:8 xx8c xx8ca

-------------------------------------------------------------------------------
    ROBOCOPY     ::     Robust File Copy for Windows
-------------------------------------------------------------------------------

   Started : Mon Mar 05 18:12:33 2012

    Source : H:\xx8\xx8c\
      Dest : H:\xx8\xx8ca\

     Files : *.*

   Options : *.* /NS /NC /NDL /NFL /S /E /COPY:DAT /MT:8 /R:1000000 /W:30

------------------------------------------------------------------------------
100%

------------------------------------------------------------------------------

                Total    Copied   Skipped  Mismatch    FAILED    Extras
     Dirs :      2627      2626         1         0         0         0
    Files :     36969     36969         0         0         0         0
    Bytes :   1.502 g   1.502 g         0         0         0         0
    Times :   0:02:05   0:00:12                       0:00:00   0:00:01

    Ended : Mon Mar 05 18:12:47 2012

H:\xx8>time /T
06:12 PM

H:\xx8>rmd xx8ca\*
removing: xx8ca\Cross_Tools
removing: xx8ca\eclipse
removing: xx8ca\gnu
removing: xx8ca\PA
finished! time:17889 ms

H:\xx8>time /T
06:13 PM

H:\xx8>cpd xx8c\* xx8ca
copying: xx8c\Cross_Tools
copying: xx8c\eclipse
copying: xx8c\gnu
copying: xx8c\PA
finished! time: 11681 ms

H:\xx8>time /T
06:13 PM

btw, I just ran robocopy with /mt:1, and it took around 42 
seconds on the same drive, which is about what I see with the 
standard windows copy, including the gui copy.  So, at least for 
these ssd drives the parallel processing results in worthwhile 
speed-ups.

Started : Mon Mar 05 18:24:31 2012
Ended : Mon Mar 05 18:25:13 2012
Mar 05 2012