digitalmars.D.learn - task parallelize dirEntries


Arun Chandrasekaran <aruncxy gmail.com> writes:

I've modified the sample from tour.dlang.org to calculate the md5 
digest of the files in a directory using std.parallelism.

When I run this on a directory with a huge number of files, I get:

core.exception.OutOfMemoryError@src/core/exception.d(696): Memory allocation failed

Since dirEntries returns a range, I thought 
std.parallelism.parallel could consume it lazily, without loading 
the entire file list into memory.

What am I doing wrong here? Is there a way to achieve what I'm 
expecting?

```
import std.digest.md;
import std.stdio: writeln;
import std.file;
import std.algorithm;
import std.parallelism;

void printUsage()
{
     writeln("Loops through a given directory and calculates the md5 digest of each file encountered.");
     writeln("Usage: md <dirname>");
}

void safePrint(T...)(T args)
{
     synchronized
     {
         import std.stdio : writeln;
         writeln(args);
     }
}

void main(string[] args)
{
     if (args.length != 2)
         return printUsage;

     foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
     {
         auto md5 = new MD5Digest();
         md5.reset();
         auto data = cast(const(ubyte)[]) read(d.name);
         md5.put(data);
         auto hash = md5.finish();
         import std.array;
         string[] t = split(d.name, '/');
         safePrint(toHexString!(LetterCase.lower)(hash), "  ", t[$-1]);
     }
}
```

Aug 11 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran 
wrote:
 I've modified the sample from tour.dlang.org to calculate the

 [...]

RHEL 7.2 64 bit
dmd v2.075.0
ldc 1.1.0

Aug 11 2017

Johnson <Johnson Johnson.com> writes:

On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran 
wrote:
 I've modified the sample from tour.dlang.org to calculate the 
 md5 digest of the files in a directory using std.parallelism.

 [...]

Just a thought: maybe the GC isn't cleaning up quickly enough? You 
are allocating an MD5 digest on each iteration.

Possibly, an optimization is to use a collection of MD5 digest 
objects and reuse them, e.g., pre-allocate 100 (you probably only 
need as many as the number of parallel loops running) and then 
attempt to reuse them. If all are in use, wait for a free one. 
This might require some synchronization.
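
One way the reuse idea above could look, as a minimal sketch rather than a tested fix: module-level variables in D are thread-local by default, so each worker thread can lazily create its own `MD5Digest` and reuse it, which gives one digest per parallel worker without a shared pool or explicit locking. The names `tlsDigest`, `threadDigest` and `md5Hex` are hypothetical, and whether this alone avoids the OutOfMemoryError has not been verified here.

```d
import std.algorithm : filter;
import std.digest.md;
import std.file : dirEntries, read, SpanMode;
import std.parallelism : parallel;
import std.path : baseName;
import std.stdio : writeln;

// Hypothetical name. Module-level variables are thread-local by default in D,
// so every thread sees its own copy of this reference.
MD5Digest tlsDigest;

/// Hypothetical helper: returns this thread's digest, creating it on first use.
MD5Digest threadDigest()
{
    if (tlsDigest is null)
        tlsDigest = new MD5Digest();
    return tlsDigest;
}

/// Hypothetical helper: hashes a buffer with the reused per-thread digest.
/// Only the digest object is reused; the returned string is a new allocation.
string md5Hex(const(ubyte)[] data)
{
    auto md5 = threadDigest();
    md5.reset();
    md5.put(data);
    ubyte[] hash = md5.finish();
    return toHexString!(LetterCase.lower)(hash);
}

void main(string[] args)
{
    if (args.length != 2)
        return;

    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto data = cast(const(ubyte)[]) read(d.name);
        writeln(md5Hex(data), "  ", baseName(d.name));
    }
}
```

Because each thread only ever touches its own digest, no waiting for a free one is needed in this variant.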

Aug 11 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Friday, 11 August 2017 at 21:58:20 UTC, Johnson wrote:
 Just a thought: maybe the GC isn't cleaning up quickly enough? 
 You are allocating an MD5 digest on each iteration.

 Possibly, an optimization is to use a collection of MD5 digest 
 objects and reuse them, e.g., pre-allocate 100 (you probably only 
 need as many as the number of parallel loops running) and then 
 attempt to reuse them. If all are in use, wait for a free one. 
 This might require some synchronization.

John, thanks. That was it. md.d has a nifty function that is 
more straightforward than the OOP version.

```
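import std.algorithm : filter;
import std.digest.md;
import std.file;
import std.parallelism;
import std.stdio : writeln;

void main(string[] args)
{
     foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
     {
         auto data = cast(const(ubyte)[]) read(d.name);
         // md5Of hashes the buffer in one call and returns the digest
         // by value as a ubyte[16]; no digest object is allocated
         auto hash = md5Of(data);
         import std.array;
         string[] t = split(d.name, '/');
         writeln(toHexString(hash), "  ", t[$-1]);
     }
}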
```
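
Note that `read` still pulls each file fully into memory before hashing, so several whole files can be resident at once when several workers run. A minimal sketch, assuming bounded memory per file matters for this workload, is to feed the digest in chunks with `File.byChunk` and the `digest` template from std.digest. The helper name `md5OfFile` and the 64 KiB chunk size are made up for illustration and not tuned.

```d
import std.digest.md;
import std.stdio : File, writeln;

/// Hypothetical helper: MD5 of a file, read chunkSize bytes at a time
/// so that only one chunk of the file is held in memory at once.
ubyte[16] md5OfFile(string path, size_t chunkSize = 64 * 1024)
{
    auto file = File(path, "rb");
    // digest!MD5 consumes a range of ubyte[] chunks and returns ubyte[16]
    return digest!MD5(file.byChunk(chunkSize));
}

void main(string[] args)
{
    // Usage: pass one or more file names on the command line
    foreach (name; args[1 .. $])
        writeln(toHexString!(LetterCase.lower)(md5OfFile(name)), "  ", name);
}
```

Inside the parallel loop, `md5OfFile(d.name)` would take the place of the `read` plus `md5Of` pair.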

Also, I expected the performance to be faster than `md5sum`. 
However, that was not the case. Please see below. Is there any way 
to optimize this further?

```
11-08-2017 17:22:54 vaalaham ~/code/d/d-mpmc-sample
$ time find /home/arun/downloads/boost_1_64_0/ -type f | xargs md5sum >/dev/null 2>&1

real    0m1.124s
user    0m0.952s
sys     0m0.208s
11-08-2017 17:23:16 vaalaham ~/code/d/d-mpmc-sample
$ ldc2 pmd.d -O3
11-08-2017 17:23:31 vaalaham ~/code/d/d-mpmc-sample
$ time ./pmd ~/downloads/boost_1_64_0 > /dev/null

real    0m0.499s
user    0m1.596s
sys     0m0.580s
11-08-2017 17:23:37 vaalaham ~/code/d/d-mpmc-sample
$
```

strace showed lots of futex exchanges. Why would that be?
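
One thing that could be tried, as a sketch only: the loop above passes a work unit size of 1 to `parallel`, so every file is handed out as its own task. The std.parallelism documentation notes that smaller work units give better load balancing while larger ones reduce the overhead of communicating between threads. Whether that explains the futex calls seen in strace is not established in this thread. The helper name `hashTree` and the value 64 are assumptions for illustration, not measured optima.

```d
import std.algorithm : filter;
import std.digest.md;
import std.file : dirEntries, read, SpanMode;
import std.parallelism : parallel;
import std.path : baseName;
import std.stdio : writeln;

/// Hypothetical helper: the same hashing loop, with a tunable work unit size.
void hashTree(string root, size_t workUnitSize)
{
    auto files = dirEntries(root, SpanMode.depth).filter!(f => f.isFile);

    // Each task now covers workUnitSize entries instead of a single one,
    // so workers return to the shared task queue less often.
    foreach (d; parallel(files, workUnitSize))
    {
        auto hash = md5Of(cast(const(ubyte)[]) read(d.name));
        writeln(toHexString!(LetterCase.lower)(hash), "  ", baseName(d.name));
    }
}

void main(string[] args)
{
    if (args.length != 2)
        return;
    hashTree(args[1], 64); // arbitrary starting point; compare a few values with `time`
}
```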

Aug 11 2017
