## digitalmars.D.learn - Help with an algorithm!

• MGW (5/5) Jun 14 2017 There are two arrays of string [] mas1, mas2; Size of each about
• Ivan Kazmenko (31/37) Jun 15 2017 The approaches which come to mind are:
• CRAIG DILLABAUGH (7/17) Jun 15 2017 As a follow up to this, if your alphabet is reasonably small
• MGW (17/17) Jun 15 2017 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH wrote:
• CRAIG DILLABAUGH (22/40) Jun 15 2017 radix sort is O(N) time, which is as fast as you can hope. But
• Ivan Kazmenko (14/28) Jun 15 2017 Ugh. This can work as slow as length-of-m1 *multiplied* by
```There are two arrays, string[] mas1, mas2, each about 5M
lines. They differ in size, but about 95% of the lines match
between them. It is necessary to find all the lines in mas2 that
do not occur in mas1. The main criterion is speed. There is an
8-core processor, so it would be good to use multithreading.
```
Jun 14 2017
```On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
There are two arrays, string[] mas1, mas2, each about 5M
lines. They differ in size, but about 95% of the lines match
between them. It is necessary to find all the lines in mas2
that do not occur in mas1. The main criterion is speed. There
is an 8-core processor, so it would be good to use
multithreading.

The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like in
merge step of merge sort:

sort (mas1);
sort (mas2);

size_t index1 = 0;
foreach (str2; mas2)
{
    while (index1 < mas1.length && mas1[index1] < str2)
        index1 += 1;
    if (index1 >= mas1.length || mas1[index1] != str2)
        writeln (str2);
}

Sorting takes O (n log n * c) time, where n is the size of the
arrays, and c is the expected time of two strings comparison when
sorting.  The subsequent step is O (n * c) which is faster than
sorting.

2. Hashing.  Just put the contents of the first array into a bool
[string], and then, for each string from the second array, check
whether it is contained in the associative array.  The time will
be O (total length of all strings) multiplied by a moderate
constant, unless the strings are designed specifically to
generate hash collisions, in which case it will be slower.

3. Trie.  Similar to hashing, but the constant multipliers will
be much higher unless the strings have large common prefixes.

Whether we can do faster depends on context.  For example, if the
strings tend to all have long common prefixes, any string
comparison will be slow, but otherwise it can be thought of as
taking constant time.

Ivan Kazmenko.
```
Jun 15 2017
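Ivan's hashing suggestion (approach 2) can be sketched in a few lines of D; the file names here are made up for illustration, not taken from the thread:

```d
import std.stdio;

void main()
{
    // tiny stand-ins for the ~5M-line arrays from the question
    string[] mas1 = ["etc/hosts", "usr/bin/dmd", "home/a.txt"];
    string[] mas2 = ["etc/hosts", "home/b.txt", "usr/bin/dmd"];

    // put the contents of mas1 into a bool[string] associative array...
    bool[string] inMas1;
    foreach (s; mas1)
        inMas1[s] = true;

    // ...then keep every line of mas2 that is not in it
    string[] diff;
    foreach (s; mas2)
        if (s !in inMas1)
            diff ~= s;

    writeln(diff); // the lines of mas2 missing from mas1
}
```

The expected running time is linear in the total length of the strings, as Ivan says, barring adversarial hash collisions.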
```On Thursday, 15 June 2017 at 11:48:54 UTC, Ivan Kazmenko wrote:
On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
There are two arrays, string[] mas1, mas2, each about 5M
lines. They differ in size, but about 95% of the lines match
between them. It is necessary to find all the lines in mas2
that do not occur in mas1. The main criterion is speed. There
is an 8-core processor, so it would be good to use
multithreading.

The approaches which come to mind are:

clip
taking constant time.

Ivan Kazmenko.

As a follow-up to this, if your alphabet is reasonably small you
could perhaps run a radix sort based on the first few characters
to split your arrays into smaller subsets, and then use one of
Ivan's suggestions within each subset.  Subset searches could
easily be run in parallel.
```
Jun 15 2017
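Craig's bucket-then-search idea might look like this in D. Splitting on just the first character is a stand-in for a fuller radix step, and std.parallelism spreads the buckets over the cores; all data and names here are illustrative:

```d
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio;

void main()
{
    string[] m1 = ["a/1", "a/2", "b/1", "c/9"];
    string[] m2 = ["a/1", "b/2", "c/9"];

    // bucket both arrays by first character (a tiny stand-in
    // for radix-splitting on the first few characters)
    string[][256] b1, b2;
    foreach (s; m1) b1[s[0]] ~= s;
    foreach (s; m2) b2[s[0]] ~= s;

    // each bucket pair is independent, so buckets can be
    // searched in parallel with no shared mutable state
    string[][256] diffs;
    foreach (c; taskPool.parallel(iota(256)))
    {
        bool[string] seen;
        foreach (s; b1[c]) seen[s] = true;
        foreach (s; b2[c]) if (s !in seen) diffs[c] ~= s;
    }

    foreach (bucket; diffs)
        foreach (s; bucket)
            writeln(s); // lines of m2 absent from m1
}
```

As Craig notes later in the thread, for file paths most strings would land in few buckets, so the split key would need to be chosen with that in mind.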
```On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH wrote:

The purpose is to search for changes in a file system.
Sorting is a slow operation, and so is hashing. Building a
tree costs about as much as sorting.
So far the best result:

string[] rez;

foreach (str; m2) {
    bool fFind;
    int j;
    foreach (int i, s; m1) {
        if (str == s) { fFind = true; j = i; break; }
    }
    if (!fFind) { rez ~= str; }
    else        { m1[j] = m1[$-1]; m1.length = m1.length - 1; }
}

// rez => result

How to parallelize this across threads?
```
Jun 15 2017
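On the "how to parallelize" question: one hedged sketch with std.parallelism is to build a read-only lookup over m1 once, then scan m2 from multiple threads, each thread writing only to its own index in a result array so no locking is needed (the data below is made up for illustration):

```d
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio;

void main()
{
    string[] m1 = ["a", "b", "c", "d"];
    string[] m2 = ["b", "x", "d", "y"];

    // build the lookup once; afterwards it is only read,
    // so worker threads can share it without synchronization
    bool[string] inM1;
    foreach (s; m1) inM1[s] = true;

    // one slot per element of m2 avoids any locking on writes
    auto missing = new bool[m2.length];
    foreach (i; taskPool.parallel(iota(m2.length)))
        missing[i] = m2[i] !in inM1;

    // collect the flagged lines sequentially
    string[] rez;
    foreach (i, flag; missing)
        if (flag) rez ~= m2[i];

    writeln(rez); // ["x", "y"]
}
```

Note this parallelizes the hashing approach, not the quadratic scan above; parallelism cannot rescue an O(n^2) inner loop at 5M lines.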
```On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH
wrote:

The purpose is to search for changes in a file system.

clip

How to parallelize this across threads?

radix sort is O(N) time, which is as fast as you can hope. But
given your specific problem domain (the strings are paths) an
initial radix sort step likely won't gain you much, as everything
is going to be sorted into a small subset of the buckets. So I
guess you can scrap that idea.

Knowing that your strings are actually file paths, I think
building some sort of tree structure over M1 wouldn't be
unreasonable. Say you go two or three levels deep in your
directory structure (i.e. nodes are labelled with directory
names) and use that to split M1 into buckets. If some bucket has
too many entries you could apply this recursively. Since you are
only building a constant number of levels and the number of nodes
is not likely to be too large, you should do much better than
N log N * c time for this step.

Then you search with the elements of M2. You should be able to do
this in parallel: since the structure on M1 is read-only, you
could just split M2 over X threads and search. I am not an expert
in this regard though, so perhaps someone better informed than I
can chime in.

Since the strings will tend to have long common prefixes, Ivan's
Trie idea would also work well.
```
Jun 15 2017    Ivan Kazmenko <gassa mail.ru> writes:
```On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH
wrote:

The purpose is to search for changes in a file system.
Sorting is a slow operation, and so is hashing. Building a
tree costs about as much as sorting.
So far the best result:

foreach (str; m2) {
    bool fFind;
    int j;
    foreach (int i, s; m1) {
        if (str == s) { fFind = true; j = i; break; }
    }
    if (!fFind) { rez ~= str; }
    else        { m1[j] = m1[$-1]; m1.length = m1.length - 1; }
}

Ugh.  This can work as slow as length-of-m1 *multiplied* by
length-of-m2.  For 5,000,000 strings, it is 5,000,000 * 5,000,000
= 25,000,000,000,000.  Granted, if you run it very often, the
arrays are almost equal, and it's closer to linear.  But once
there are substantial changes between two consecutive runs, this
approach is seriously screwed.

Sorting would work in length-of-array * log(length-of-array).
For 5,000,000 strings, that is 5,000,000 * 23 = 115,000,000.
This is ~217,391 times better than your method above.  It may be
a bit slower because of long common prefixes.  Anyway, a couple
of seconds at most.  How fast do you need it to be?  Did you
actually try it?

Ivan Kazmenko.
```
Jun 15 2017
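For the record, Ivan's sort-then-scan approach (approach 1) is essentially what the standard library's setDifference does over two sorted ranges; a minimal sketch with illustrative data:

```d
import std.algorithm : sort, setDifference;
import std.array : array;
import std.stdio;

void main()
{
    string[] m1 = ["etc/hosts", "usr/bin/dmd", "var/log/x"];
    string[] m2 = ["etc/hosts", "home/new.txt", "var/log/x"];

    // setDifference requires both inputs to be sorted
    sort(m1);
    sort(m2);

    // the elements of m2 not present in m1, O(n log n) overall
    auto rez = setDifference(m2, m1).array;
    writeln(rez); // ["home/new.txt"]
}
```

This keeps the whole job at the 5,000,000 * 23 comparison budget Ivan estimates, with the linear merge-style scan done by the library.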