
digitalmars.D.learn - Help with an algorithm!

reply MGW <mgw yandex.ru> writes:
There are two arrays, string[] mas1, mas2, each about 5M lines. 
They differ in size, but about 95% of the lines match between 
them. It is necessary to find all lines in mas2 which do not 
occur in mas1. The principal criterion is speed. There is an 
8-core processor, and it would be good to use multithreading.
Jun 14 2017
parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
 There are two arrays, string[] mas1, mas2, each about 5M 
 lines. They differ in size, but about 95% of the lines match 
 between them. It is necessary to find all lines in mas2 which 
 do not occur in mas1. The principal criterion is speed. There 
 is an 8-core processor, and it would be good to use 
 multithreading.
The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like in 
the merge step of merge sort:

import std.algorithm : sort;
import std.stdio : writeln;

sort(mas1);
sort(mas2);
size_t index1 = 0;
foreach (str2; mas2)
{
    // skip past every string in mas1 that is smaller than str2
    while (index1 < mas1.length && mas1[index1] < str2)
        index1 += 1;
    // report str2 if it is absent from mas1
    if (index1 >= mas1.length || mas1[index1] != str2)
        writeln(str2);
}

Sorting takes O(n log n * c) time, where n is the size of the 
arrays and c is the expected cost of comparing two strings during 
sorting. The subsequent merge-like step is O(n * c), which is 
faster than the sorting.

2. Hashing. Just put the contents of the first array into a 
bool[string] associative array, and then, for each string from 
the second array, check whether it is contained in the 
associative array. The time will be O(total length of all 
strings) multiplied by a moderate constant, unless the strings 
are designed specifically to generate hash collisions, in which 
case it will be slower.

3. Trie. Similar to hashing, but the constant factors will be 
much higher unless the strings have long common prefixes.

Whether we can do faster depends on context. For example, if the 
strings tend to all have long common prefixes, any string 
comparison will be slow, but otherwise a comparison can be 
thought of as taking constant time.

Ivan Kazmenko.
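For concreteness, a minimal sketch of the hashing approach (2) in 
D; the function name is illustrative, and it assumes mas1 and 
mas2 are already populated:

import std.stdio : writeln;

void reportMissing(string[] mas1, string[] mas2)
{
    // build the lookup set once from mas1
    bool[string] seen;
    foreach (s; mas1)
        seen[s] = true;

    // print every string of mas2 that has no entry in the set
    foreach (s; mas2)
        if (s !in seen)
            writeln(s);
}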
Jun 15 2017
parent reply CRAIG DILLABAUGH <craig.dillabaugh gmail.com> writes:
On Thursday, 15 June 2017 at 11:48:54 UTC, Ivan Kazmenko wrote:
 On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
 There are two arrays, string[] mas1, mas2, each about 5M 
 lines. They differ in size, but about 95% of the lines match 
 between them. It is necessary to find all lines in mas2 which 
 do not occur in mas1. The principal criterion is speed. There 
 is an 8-core processor, and it would be good to use 
 multithreading.
The approaches which come to mind are:
clip
 taking constant time.

 Ivan Kazmenko.
As a follow-up to this, if your alphabet is reasonably small you 
could perhaps run a radix sort based on the first few characters 
to split your arrays up into smaller subsets, and then use one of 
Ivan's suggestions within each subset. The subset searches could 
easily be run in parallel, as in the sketch below.
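Something like the following rough sketch, bucketing on just the 
first byte and using Ivan's hashing suggestion inside each bucket 
(the function name and the 256-bucket split are mine, for 
illustration):

import std.parallelism : parallel;

string[] findMissingBucketed(string[] m1, string[] m2)
{
    enum B = 256;                       // one bucket per first byte
    string[][B] b1, b2;
    foreach (s; m1) b1[s.length ? s[0] : 0] ~= s;
    foreach (s; m2) b2[s.length ? s[0] : 0] ~= s;

    string[][B] missing;                // each task writes only its own slot
    foreach (i, bucket; parallel(b2[]))
    {
        bool[string] seen;
        foreach (s; b1[i]) seen[s] = true;
        foreach (s; bucket)
            if (s !in seen) missing[i] ~= s;
    }

    string[] rez;
    foreach (part; missing[]) rez ~= part;
    return rez;
}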
Jun 15 2017
parent reply MGW <mgw yandex.ru> writes:
On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH wrote:

The purpose is to search for changes in a file system.
Sorting is a slow operation, and so is hashing. Building a 
tree costs about as much as sorting.
So far the best result:

string[] rez;

foreach (str; m2) {
    bool fFind;
    int j;
    foreach (int i, s; m1) {
        if (str == s) { fFind = true; j = i; break; }
    }
    if (!fFind) { rez ~= str; }          // str is missing from m1
    else {
        // drop the matched element so later scans are shorter
        m1[j] = m1[$ - 1];
        m1.length = m1.length - 1;
    }
}

// rez => result

How can this be parallelized across threads?
Jun 15 2017
next sibling parent CRAIG DILLABAUGH <craig.dillabaugh gmail.com> writes:
On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH 
 wrote:

 The purpose is to search for changes in a file system.
 Sorting is a slow operation, and so is hashing. Building a 
 tree costs about as much as sorting.
 So far the best result:

 string[] rez;

 foreach (str; m2) {
     bool fFind;
     int j;
     foreach (int i, s; m1) {
         if (str == s) { fFind = true; j = i; break; }
     }
     if (!fFind) { rez ~= str; }          // str is missing from m1
     else {
         // drop the matched element so later scans are shorter
         m1[j] = m1[$ - 1];
         m1.length = m1.length - 1;
     }
 }

 // rez => result

 How can this be parallelized across threads?
A radix sort is O(N) time, which is as fast as you can hope for. 
But given your specific problem domain (the strings are paths), 
an initial radix sort step likely won't gain you much, since 
everything will be sorted into a small subset of the buckets. So 
I guess you can scrap that idea.

Knowing that your strings are actually file paths, I think 
building some sort of tree structure over M1 wouldn't be 
unreasonable. Say you go two or three levels deep in your 
directory structure (i.e. nodes are labelled with directory 
names) and use that to split M1 into buckets. If some bucket has 
too many entries, you can apply this recursively. Since you are 
only building a constant number of levels and the number of nodes 
is not likely to be too large, you should do much better than 
N log N * c time for this step.

Then you search with the elements of M2. You should be able to do 
this in a multi-threaded way: once built, your data structure on 
M1 is read-only, so you can just split M2 over X threads and 
search, as in the sketch below. I am not an expert in this regard 
though, so perhaps someone better informed than I can chime in.

Since the strings will tend to have long common prefixes, Ivan's 
trie idea would also work well.
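A rough sketch of the split-M2-over-threads part in D, with an 
associative array standing in for the tree over M1 (the names 
here are mine, for illustration):

import std.parallelism : taskPool;

string[] findMissingParallel(string[] m1, string[] m2)
{
    // read-only lookup structure built once over m1
    bool[string] inM1;
    foreach (s; m1)
        inM1[s] = true;

    // one result buffer per worker thread, so no locking is needed
    auto buffers = new string[][](taskPool.size + 1);
    foreach (str; taskPool.parallel(m2))
        if (str !in inM1)
            buffers[taskPool.workerIndex] ~= str;

    string[] rez;
    foreach (buf; buffers)
        rez ~= buf;
    return rez;
}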
Jun 15 2017
prev sibling parent Ivan Kazmenko <gassa mail.ru> writes:
On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH 
 wrote:

 The purpose is to search for changes in a file system.
 Sorting is a slow operation, and so is hashing. Building a 
 tree costs about as much as sorting.
 So far the best result:

 foreach (str; m2) {
     bool fFind;
     int j;
     foreach (int i, s; m1) {
         if (str == s) { fFind = true; j = i; break; }
     }
     if (!fFind) { rez ~= str; }
     else {
         m1[j] = m1[$ - 1];
         m1.length = m1.length - 1;
     }
 }
Ugh. This can work as slowly as length-of-m1 *multiplied* by 
length-of-m2. For 5,000,000 strings each, that is 
5,000,000 * 5,000,000 = 25,000,000,000,000 comparisons. Granted, 
if you run it very often, the arrays are almost equal and the 
behavior is closer to linear. But once there are substantial 
changes between two consecutive runs, this approach is seriously 
screwed.

Sorting would work in length-of-array * log(length-of-array). For 
5,000,000 strings, that is 5,000,000 * 23 = 115,000,000, which is 
~217,391 times better than your method above. It may be a bit 
slower because of long common prefixes, but still a couple of 
seconds at most.

How fast do you need it to be? Did you actually try it?

Ivan Kazmenko.
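For reference, the sorting approach fits in a few lines of Phobos 
(a sketch, assuming duplicates within an array are not a concern):

import std.algorithm : sort, setDifference;
import std.array : array;

sort(m1);
sort(m2);
// strings of m2 that do not occur in m1 (both ranges must be sorted)
string[] rez = setDifference(m2, m1).array;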
Jun 15 2017