www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Storing and Searching large text lists

reply brian <brian infinityplusb.com> writes:
I have a large list, B, of string items. For each item in that 
large list, I need to see if it is in the smaller list, A.

I have been using a simple string array for the storage of A

string[] A

and then using foreach to go through all the items of B and check 
they are in A

foreach(string;B)
/* this looks hacky but wasn't working without the !=0 bit )
     if(find(A,string) != 0)
         writeln("Found a line: ", string);

While this works for small datasets, but when either A or B get 
large (A could be up to 150k records, B in the millions) it takes 
quite a while to run.

I'd like to know what is the best way to store lists of text for 
searching? Is there a better container than a simply array? 
Neither A nor B need to be ordered for my purpose, but would 
sorting help the search? Would it help enough to be worth the CPU 
expense?

Regards
B
Dec 31
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
 I have a large list, B, of string items. For each item in that 
 large list, I need to see if it is in the smaller list, A.

 I have been using a simple string array for the storage of A

 string[] A

 and then using foreach to go through all the items of B and 
 check they are in A

 foreach(string;B)
 /* this looks hacky but wasn't working without the !=0 bit )
     if(find(A,string) != 0)
         writeln("Found a line: ", string);

 While this works for small datasets, but when either A or B get 
 large (A could be up to 150k records, B in the millions) it 
 takes quite a while to run.

 I'd like to know what is the best way to store lists of text 
 for searching? Is there a better container than a simply array? 
 Neither A nor B need to be ordered for my purpose, but would 
 sorting help the search? Would it help enough to be worth the 
 CPU expense?

 Regards
 B
Your problem is that your algorithm is O(m*n). (B is m ,A is n) A simple speedup would be to sort A and then use a binary search (O(m*log(n))) (partially sorting B may also help but only for cache) Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does). Changing A to a trie is another possibility. But as always it help to know your data. How long are the strings(are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are there duplicates allowed in B? Nic
Dec 31
parent via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
 

On 2016-01-01 13:22, Nicholas Wilson via Digitalmars-d-learn wrote: 

 On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
 
 I have a large list, B, of string items. For each item in that large list, I
need to see if it is in the smaller list, A. I have been using a simple string
array for the storage of A string[] A and then using foreach to go through all
the items of B and check they are in A foreach(string;B) /* this looks hacky
but wasn't working without the !=0 bit ) if(find(A,string) != 0) writeln("Found
a line: ", string); While this works for small datasets, but when either A or B
get large (A could be up to 150k records, B in the millions) it takes quite a
while to run. I'd like to know what is the best way to store lists of text for
searching? Is there a better container than a simply array? Neither A nor B
need to be ordered for my purpose, but would sorting help the search? Would it
help enough to be worth the CPU expense? Regards B
Your problem is that your algorithm is O(m*n). (B is m ,A is n) A simple speedup would be to sort A and then use a binary search (O(m*log(n))) (partially sorting B may also help but only for cache) Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does). Changing A to a trie is another possibility. But as always it help to know your data. How long are the strings(are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are there duplicates allowed in B? Nic
Thanks. I'll try your suggestions. The data I am looking at is the Microsoft Academic Graph data. https://academicgraph.blob.core.windows.net/graph-2015-11-06/index.html I have picked a topic of interest, and am pulling the IDs for papers of those topics. There will be up to 150,000 in the list (A) and each id will look something like: 000241C7 0018BC77 I then have a "Papers.txt" file, which has a list (B) of (a couple of hundred million) papers and details about the papers. I am then searching through this list for the items in A.The Papers.txt file has records that look like: 007D0259 Business Intelligence analytics without conventional data warehousing business intelligence analytics without conventional data warehousing 2010 2010 19542 0018BC77 Data mining email to discover social networks and communities of practice data mining email to discover social networks and communities of practice 2003 17584 So I need to iterate through B, and see if the first entry, is one of the papers I want, by comparing it to list A. In the above example, 007D0259 I don't want, 0018BC77 I do want, and will write that record to another file. I have started to delete entries from list A as I find them, so that the search doesn't have as many records to look through each time, but I amn't sure that that will decrease time, because it'll take time to find the record on each delete.
Dec 31