digitalmars.D.learn - Storing and Searching large text lists

brian (20/20) Dec 31 2015 I have a large list, B, of string items. For each item in that

Nicholas Wilson (14/34) Dec 31 2015 Your problem is that your algorithm is O(m*n). (B is m ,A is n)

via Digitalmars-d-learn (30/45) Dec 31 2015 Thanks. I'll try your suggestions.

brian <brian infinityplusb.com> writes:

I have a large list, B, of string items. For each item in that 
large list, I need to see if it is in the smaller list, A.

I have been using a simple string array for the storage of A

string[] A

and then using foreach to go through all the items of B and check 
they are in A

foreach(string;B)
/* this looks hacky but wasn't working without the !=0 bit )
     if(find(A,string) != 0)
         writeln("Found a line: ", string);

While this works for small datasets, but when either A or B get 
large (A could be up to 150k records, B in the millions) it takes 
quite a while to run.

I'd like to know what is the best way to store lists of text for 
searching? Is there a better container than a simply array? 
Neither A nor B need to be ordered for my purpose, but would 
sorting help the search? Would it help enough to be worth the CPU 
expense?

Regards
B

Dec 31 2015

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
 I have a large list, B, of string items. For each item in that 
 large list, I need to see if it is in the smaller list, A.

 I have been using a simple string array for the storage of A

 string[] A

 and then using foreach to go through all the items of B and 
 check they are in A

 foreach(string;B)
 /* this looks hacky but wasn't working without the !=0 bit )
     if(find(A,string) != 0)
         writeln("Found a line: ", string);

 While this works for small datasets, but when either A or B get 
 large (A could be up to 150k records, B in the millions) it 
 takes quite a while to run.

 I'd like to know what is the best way to store lists of text 
 for searching? Is there a better container than a simply array? 
 Neither A nor B need to be ordered for my purpose, but would 
 sorting help the search? Would it help enough to be worth the 
 CPU expense?

 Regards
 B

Your problem is that your algorithm is O(m*n). (B is m ,A is n)
A simple speedup would be to sort A and then use a binary search 
(O(m*log(n)))
(partially sorting B may also help but only for cache)

Fully sorting may be worth doing if you have other steps after 
that also benefit in speed when working on sorted data (like 
searching does).

Changing A to a trie is another possibility.

But as always it help to know your data. How long are the 
strings(are they of the same length)? Are there any similarities 
(e.g. are they all email addresses)? Are there duplicates allowed 
in B?

Nic

Dec 31 2015

via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

 

On 2016-01-01 13:22, Nicholas Wilson via Digitalmars-d-learn wrote: 

 On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
 
 I have a large list, B, of string items. For each item in that large list, I
need to see if it is in the smaller list, A. I have been using a simple string
array for the storage of A string[] A and then using foreach to go through all
the items of B and check they are in A foreach(string;B) /* this looks hacky
but wasn't working without the !=0 bit ) if(find(A,string) != 0) writeln("Found
a line: ", string); While this works for small datasets, but when either A or B
get large (A could be up to 150k records, B in the millions) it takes quite a
while to run. I'd like to know what is the best way to store lists of text for
searching? Is there a better container than a simply array? Neither A nor B
need to be ordered for my purpose, but would sorting help the search? Would it
help enough to be worth the CPU expense? Regards B

 
 Your problem is that your algorithm is O(m*n). (B is m ,A is n)
 A simple speedup would be to sort A and then use a binary search (O(m*log(n)))
 (partially sorting B may also help but only for cache)
 
 Fully sorting may be worth doing if you have other steps after that also
benefit in speed when working on sorted data (like searching does).
 
 Changing A to a trie is another possibility.
 
 But as always it help to know your data. How long are the strings(are they of
the same length)? Are there any similarities (e.g. are they all email
addresses)? Are there duplicates allowed in B?
 
 Nic

Thanks. I'll try your suggestions. 

The data I am looking at is the Microsoft Academic Graph data.
https://academicgraph.blob.core.windows.net/graph-2015-11-06/index.html 

I have picked a topic of interest, and am pulling the IDs for papers of
those topics.
There will be up to 150,000 in the list (A) and each id will look
something like: 

000241C7
0018BC77 

I then have a "Papers.txt" file, which has a list (B) of (a couple of
hundred million) papers and details about the papers. 
I am then searching through this list for the items in A.The Papers.txt
file has records that look like: 

007D0259 Business Intelligence analytics without conventional data
warehousing business intelligence analytics without conventional data
warehousing 2010 2010 19542 
0018BC77 Data mining email to discover social networks and communities
of practice data mining email to discover social networks and
communities of practice 2003 17584 

So I need to iterate through B, and see if the first entry, is one of
the papers I want, by comparing it to list A.
In the above example, 007D0259 I don't want, 0018BC77 I do want, and
will write that record to another file. 

I have started to delete entries from list A as I find them, so that the
search doesn't have as many records to look through each time, but I
amn't sure that that will decrease time, because it'll take time to find
the record on each delete.

Dec 31 2015

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Storing and Searching large text lists