
digitalmars.D.announce - 4x speedup of recursive rmdir in std.file

reply Jay Norwood <jayn prismnet.com> writes:
It would be good if the std.file operations used the D multi-
thread features, since you've done such a nice job of making them
easy.   I hacked up your std.file recursive remove and got a 4x
speed-up on a win7 system with corei7 using the examples from the
D programming language book.  Code is below with a hard-coded file
I was using for test.  I'm just learning this, so I know you can
do better ...

Delete time dropped from 1 minute 5 secs to less than 15 secs.
This was on an ssd drive.

module main;

import std.stdio;
import std.file;
import std.datetime;
import std.concurrency;

const int THREADS = 16;

int main(string[] argv)
{
    writeln("removing H:/pa10_120130/xx8");
    auto st1 = Clock.currTime(); // current time in local time
    rmdirRecurse2("H:/pa10_120130/xx8");
    auto st2 = Clock.currTime();
    auto dif = st2 - st1;
    auto ts = dif.toString();
    writeln("time:");
    writeln(ts);
    writeln("finished !");
    return 0;
}

void rmdirRecurse2(in char[] pathname)
{
    DirEntry de = dirEntry(pathname);
    rmdirRecurse2(de);
}

void rmdirRecurse2(ref DirEntry de)
{
    if (!de.isDir)
        throw new FileException(de.name, " is not a directory");
    if (de.isSymlink())
        remove(de.name);
    else
    {
        // spawn the pool of file-removal workers, plus one directory remover
        Tid[THREADS] tid;
        for (int i = 0; i < THREADS; i++)
            tid[i] = spawn(&fileRemover);
        Tid tidd = spawn(&dirRemover);

        // all children, recursively depth-first; directories go to the
        // directory remover, regular files are dealt out round-robin
        int i = 0;
        foreach (DirEntry e; dirEntries(de.name, SpanMode.depth, false))
        {
            string nm = e.name;
            if (attrIsDir(e.linkAttributes))
                tidd.send(nm);
            else
                tid[i].send(nm);
            i = (i + 1) % THREADS;
        }

        // wait for the THREADS threads to complete their file removes
        // and acknowledge receipt of the tid
        for (i = 0; i < THREADS; i++)
        {
            tid[i].send(thisTid);
            receiveOnly!Tid();
        }
        tidd.send(thisTid);
        receiveOnly!Tid();

        // the dir itself
        rmdir(de.name);
    }
}

void fileRemover()
{
    for (bool running = true; running;)
    {
        receive(
            (string s) { remove(s); },  // remove the files
            (Tid x) { x.send(thisTid); running = false; }  // the terminator
        );
    }
}

void dirRemover()
{
    string[] dirs;
    for (bool running = true; running;)
    {
        receive(
            (string s) { dirs ~= s; },
            (Tid x)
            {
                // SpanMode.depth yields children before parents, so
                // removing in arrival order empties subdirs first
                foreach (string d; dirs)
                    rmdir(d);
                x.send(thisTid);
                running = false;
            }
        );
    }
}
Feb 04 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Jay Norwood" <jayn prismnet.com> wrote in message 
news:jgkfdf$qb5$1 digitalmars.com...
 It would be good if the std.file operations used the D multi-
 thread features, since you've done such a nice job of making them
 easy.   I hacked up your std.file recursive remove and got a 4x
 speed-up on a win7 system with corei7 using the examples from the
 D programming language book.  Code is below with a hard-coded file
 I was using for test.  I'm just learning this, so I know you can
 do better ...

 Delete time dropped from 1minute 5 secs to less than 15 secs.
 This was on an ssd drive.

Interesting. How does it perform when just running on one core?
Feb 04 2012
parent reply Jay Norwood <jayn prismnet.com> writes:
== Quote from Nick Sabalausky (a a.a)'s article
 Interesting. How does it perform when just running on one core?

The library without the threads takes 1 min 5 secs for the 1.5GB directory structure with about 32k files. This is on an Intel 510 series SSD. The Win7 OS removes it in almost exactly the same time, and you can see from its task manager that the OS delete is also being done single-core, using only a small percentage of CPU. In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes occur in parallel. A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads. I attached a screen capture.

I tried last night to do a similar thing with the unzip processing in std.zip, but the library code is written in such a way that the parallel threads would need to create the whole zip archive directory in order to process the elements. I would hope to solve this problem and provide a similar 4x speedup over the unzip of, for example, 7zip, which also currently executes on a single thread. 7zip takes about 50 seconds to unzip this file.

What is needed is probably a dumber archive element processing call that gets passed an immutable archive element structure read by the main thread. The parallel threads could then seek to the position and process just each assigned single element without loading the whole file. Also, the current design requires a memory buffer holding the whole zip archive before it can create the archive directory. There should instead be some way of processing the file sequentially.
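The per-element API being described might look something like this (purely a hypothetical sketch; none of these names exist in std.zip):

```d
// The main thread would read only the archive's central directory and
// hand each worker an immutable summary of one member.
immutable struct ZipEntryInfo
{
    string name;            // member file name
    ulong  headerOffset;    // where its local header starts in the archive
    ulong  compressedSize;
}

// Each worker opens its own handle on the archive, seeks to its member,
// and inflates just that one entry -- no whole-archive buffer required.
void extractMember(string archivePath, ZipEntryInfo info)
{
    // seek to info.headerOffset, inflate info.compressedSize bytes,
    // write the result out under info.name ...
}
```

Since ZipEntryInfo is immutable, it can be sent to worker threads through std.concurrency messages or handed to a parallel foreach, exactly like the file names in the remove example above.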
Feb 05 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Jay Norwood" <jayn prismnet.com> wrote in message 
news:jgm5vh$hbe$1 digitalmars.com...
 == Quote from Nick Sabalausky (a a.a)'s article
 Interesting. How does it perform when just running on one core?

The library without the threads is 1 min 5 secs for the 1.5GB directory structure with about 32k files. This is on an 510 series intel ssd. The win7 os removes it in almost exactly the same time, and you can see from their task manager it is also being done single core and only a small percentage of cpu. In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes are occurring in parallel. A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads. I attached a screen capture.

What I'm wondering is this: Suppose all the cores but one are already preoccupied with other stuff, or maybe you're even running on a single-core. Does the threading add enough overhead that it would actually go slower than the original single-threaded version? If not, then this would indeed be a fantastic improvement to phobos. Otherwise, I wonder how such a situation could be mitigated?
 I tried last night to do a similar thing with the unzip processing
 in std.zip, but the library code is written in such a way that the
 parallel threads would need to create the whole zip archive
 directory in order to process the elements.   I would hope to be
 able to solve this problem and provide a similar 4x speedup to the
 unzip of, for example 7zip, which is currently also showing
 execution on a single thread.  7zip takes about 50 seconds to
 unzip this file.

That would be cool.
Feb 05 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/12 10:16 AM, Nick Sabalausky wrote:
 "Jay Norwood"<jayn prismnet.com>  wrote in message
 news:jgm5vh$hbe$1 digitalmars.com...
 == Quote from Nick Sabalausky (a a.a)'s article
 Interesting. How does it perform when just running on one core?

The library without the threads is 1 min 5 secs for the 1.5GB directory structure with about 32k files. This is on an 510 series intel ssd. The win7 os removes it in almost exactly the same time, and you can see from their task manager it is also being done single core and only a small percentage of cpu. In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes are occurring in parallel. A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads. I attached a screen capture.

What I'm wondering is this: Suppose all the cores but one are already preoccupied with other stuff, or maybe you're even running on a single-core. Does the threading add enough overhead that it would actually go slower than the original single-threaded version? If not, then this would indeed be a fantastic improvement to phobos. Otherwise, I wonder how such a situation could be mitigated?

There's a variety of ways, but the simplest approach is to pass a parameter to the function telling how many threads it's allowed to spawn. Jay? Andrei
Feb 05 2012
next sibling parent reply Jay Norwood <jayn prismnet.com> writes:
== Quote from Andrei Alexandrescu
 Suppose all the cores but one are already preoccupied with other
 stuff, or maybe you're even running on a single-core. Does the
 threading add enough overhead that it would actually go slower than
 the original single-threaded version? If not, then this would indeed
 be a fantastic improvement to phobos. Otherwise, I wonder how such a
 situation could be mitigated?

 There's a variety of ways, but the simplest approach is to pass a
 parameter to the function telling how many threads it's allowed to
 spawn. Jay?

 Andrei

I can tell you that there is a couple of seconds' improvement in execution time running 16 threads vs 8 on the i7 with the SSD drive, so we aren't keeping all the cores busy with 8 threads. I suppose they are all blocked waiting on file system operations for some portion of the time even with 8 threads. I would guess that even on a single core it would be an advantage to have multiple threads available for the core to work on when one blocks waiting for the fs operations.

The previous results were on an SSD drive. I tried again on a Seagate SATA3 7200 RPM hard drive: it took 2 minutes 12 sec to delete the same layout using the OS, and never used more than 10% CPU. The one-thread configuration of the D program similarly used less than 10% CPU but took only 1 minute 50 seconds to delete the same layout. Anything above one thread began degrading the D program's performance when using the hard drive. I'll have to scratch my head on this a while. This is on an Optiplex 790, Win7-64, using the board's SATA for both the SSD and the HD.

The extract of the zip using 7zip takes 1:55 on the Seagate disk drive, btw ... vs about 50 secs on the SSD.
Feb 05 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/12 3:04 PM, Jay Norwood wrote:
 I can tell you that there are a couple of seconds improvement in
 the execution time running 16 threads vs 8 on the i7 on the ssd
 drive, so we aren't keeping all the cores busy with 8 threads. I
 suppose they are all blocked waiting for file system operations
 for some portion of time even with 8 threads.  I would guess that
 even on a single core it would be an advantage to have multiple
 threads available for the core to work on when it blocks waiting
 for the fs operations.

That's why I'm saying - let's leave the decision to the user. Take a uint parameter for the number of threads to be used, where 0 means leave it to phobos, and default to 0. Andrei
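As a rough illustration of that API shape (a hypothetical sketch only; neither this name nor this signature is in phobos), the uint parameter might look like:

```d
import std.parallelism : totalCPUs;

// Hypothetical signature: nThreads == 0 means "leave it to phobos",
// which here defaults to the number of cores.
void rmdirRecurseParallel(in char[] pathname, uint nThreads = 0)
{
    immutable workers = (nThreads == 0) ? totalCPUs : nThreads;
    // ... spawn up to `workers` file-removal threads, as in the code above ...
}
```

A caller on a busy or single-core machine could then pass 1 to get essentially the original single-threaded behavior.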
Feb 05 2012
parent Jay Norwood <jayn prismnet.com> writes:
Andrei Alexandrescu Wrote:
 That's why I'm saying - let's leave the decision to the user. Take a 
 uint parameter for the number of threads to be used, where 0 means leave 
 it to phobos, and default to 0.
 
 Andrei
 

ok, here is another version. I was reading about the std.parallelism library, and I see I can do the parallel removes more cleanly. Plus the library figures out the number of cores and limits the taskPool size accordingly. It is only a slight bit slower than the other code. It looks like they choose 7 threads in the taskPool when you have 8 cores.

So, I do the regular files in parallel, then pass it back to the original library code, which cleans up the directory-only tree non-parallel. I also added in code to get the directory names from argv.

module main;

import std.stdio;
import std.file;
import std.datetime;
import std.parallelism;

int main(string[] argv)
{
    if (argv.length < 2)
    {
        writeln("need to specify one or more directories to remove");
        return 0;
    }
    foreach (string dir; argv[1 .. $])
    {
        writeln("removing directory: " ~ dir);
        auto st1 = Clock.currTime(); // current time in local time
        rmdirRecurse2(dir);
        auto st2 = Clock.currTime();
        auto dif = st2 - st1;
        auto ts = dif.toString();
        writeln("time:" ~ ts);
    }
    writeln("finished !");
    return 0;
}

void rmdirRecurse2(in char[] pathname)
{
    DirEntry de = dirEntry(pathname);
    rmdirRecurse2(de);
}

void rmdirRecurse2(ref DirEntry de)
{
    string[] files;
    if (!de.isDir)
        throw new FileException(de.name, " is not a directory");
    if (de.isSymlink())
        remove(de.name);
    else
    {
        // make an array of the regular files only
        foreach (DirEntry e; dirEntries(de.name, SpanMode.depth, false))
        {
            if (!attrIsDir(e.linkAttributes))
                files ~= e.name;
        }

        // parallel foreach for regular files
        foreach (fn; taskPool.parallel(files, 1000))
        {
            remove(fn);
        }

        // let the original code remove the directories only
        rmdirRecurse(de);
    }
}
Feb 06 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On 05/02/2012 18:38, Andrei Alexandrescu wrote:
 On 2/5/12 10:16 AM, Nick Sabalausky wrote:
 "Jay Norwood"<jayn prismnet.com> wrote in message
 news:jgm5vh$hbe$1 digitalmars.com...
 == Quote from Nick Sabalausky (a a.a)'s article
 Interesting. How does it perform when just running on one core?

The library without the threads is 1 min 5 secs for the 1.5GB directory structure with about 32k files. This is on an 510 series intel ssd. The win7 os removes it in almost exactly the same time, and you can see from their task manager it is also being done single core and only a small percentage of cpu. In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes are occurring in parallel. A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads. I attached a screen capture.

What I'm wondering is this: Suppose all the cores but one are already preoccupied with other stuff, or maybe you're even running on a single-core. Does the threading add enough overhead that it would actually go slower than the original single-threaded version? If not, then this would indeed be a fantastic improvement to phobos. Otherwise, I wonder how such a situation could be mitigated?

There's a variety of ways, but the simplest approach is to pass a parameter to the function telling how many threads it's allowed to spawn. Jay? Andrei

That could be a solution, but it is a bad separation of concerns IMO, and shouldn't be done like that in phobos. The parameter should be a thread pool or something similar. This allows you not only to choose the number of threads, but also to choose how the task is distributed over threads, and even to mix these tasks with other tasks (by using the same thread pool in other places). It basically separates the problem of deleting from the problem of spreading the work over multiple threads, and with which policy.
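That idea could be sketched as follows (illustrative only; a TaskPool-taking rmdirRecurse2 is not an actual phobos signature):

```d
import std.file;
import std.parallelism;

// Hypothetical variant: the caller supplies the pool, and with it the
// thread count and scheduling policy; defaults to the shared taskPool.
void rmdirRecurse2(string pathname, TaskPool pool = taskPool)
{
    string[] files;
    foreach (DirEntry e; dirEntries(pathname, SpanMode.depth, false))
        if (!attrIsDir(e.linkAttributes))
            files ~= e.name;

    foreach (fn; pool.parallel(files, 1000)) // files removed on the caller's pool
        remove(fn);

    rmdirRecurse(pathname); // remaining empty dirs, single-threaded
}
```

Two call sites could then share one pool (or use differently sized pools) without the delete routine knowing anything about threading policy.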
Feb 07 2012