www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [OT] Best algorithm for extremely large hashtable?

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
This isn't directly related to D (though the code will be in D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large graph,
and I need some kind of data structure to keep track of which nodes have
been visited, that (1) allows reasonably fast lookups (preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of compression
would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always between
vectors that differ only by a single coordinate; so the edges of the
graph may be thought of as a subset of the edges of an n-dimensional
grid. The hashtable, therefore, needs to represent some connected subset
of this grid in a space-efficient manner, that still allows fast lookup
times.

The nave approach of using an n-dimensional bit array is not feasible
because n can be quite large (up to 100 or so), and the size of the grid
itself can get up to about 10 in each direction, so we're looking at a
potential maximum size of 10^100, clearly impractical to store
explicitly.

Are there any known good algorithms for tackling this problem?

Thanks!


T

-- 
Creativity is not an excuse for sloppiness.
Nov 15 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
15-Nov-2013 22:41, H. S. Teoh пишет:
 This isn't directly related to D (though the code will be in D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very large graph,
 and I need some kind of data structure to keep track of which nodes have
 been visited, that (1) allows reasonably fast lookups (preferably O(1)),
Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node). -- Dmitry Olshansky
Nov 15 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:
 15-Nov-2013 22:41, H. S. Teoh пишет:
This isn't directly related to D (though the code will be in D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)),
Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node).
[...] There are too many nodes to keep in memory, actually. The algorithm also doesn't need to store the nodes; it can easily generate them on-the-fly when it needs them. The challenge is when it needs to know whether a particular generated node has been visited before. I'd like to be able to have a fast lookup (hopefully O(1)) that tells me whether node X has been seen before. The problem is, there may have been a very large number of previously-seen nodes, so I need some way of storing this information in a space-efficient way. I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes). T -- I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Nov 15 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
H. S. Teoh:

 I'm hoping for a kind of compressed representation where 
 multiple
 adjacent nodes can be represented in a very compact way, by 
 somehow
 taking advantage of the fact that they lie on an n-dimensional 
 grid and
 the fact that graph edges are always a subset of grid edges (as 
 opposed
 to edges between two arbitrary nodes).
Look for Succinct Data-Structures, Graphs. You can impose further memory savings in your problem. Bye, bearophile
Nov 15 2013
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/15/13 11:20 AM, H. S. Teoh wrote:
 I'm hoping for a kind of compressed representation where multiple
 adjacent nodes can be represented in a very compact way, by somehow
 taking advantage of the fact that they lie on an n-dimensional grid and
 the fact that graph edges are always a subset of grid edges (as opposed
 to edges between two arbitrary nodes).
The classic trick here is parallel arrays - you store a bit array in parallel with an array of nodes. Andrei
Nov 15 2013
prev sibling parent "qznc" <qznc web.de> writes:
On Friday, 15 November 2013 at 19:21:35 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky 
 wrote:
 15-Nov-2013 22:41, H. S. Teoh пишет:
This isn't directly related to D (though the code will be in 
D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very 
large
graph, and I need some kind of data structure to keep track 
of which
nodes have been visited, that (1) allows reasonably fast 
lookups
(preferably O(1)),
Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node).
[...] There are too many nodes to keep in memory, actually. The algorithm also doesn't need to store the nodes; it can easily generate them on-the-fly when it needs them. The challenge is when it needs to know whether a particular generated node has been visited before. I'd like to be able to have a fast lookup (hopefully O(1)) that tells me whether node X has been seen before. The problem is, there may have been a very large number of previously-seen nodes, so I need some way of storing this information in a space-efficient way. I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes).
A Trie might be an idea, especially the bitwise variant. http://en.wikipedia.org/wiki/Trie
Nov 15 2013
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 This isn't directly related to D (though the code will be in 
 D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very 
 large graph,
 and I need some kind of data structure to keep track of which 
 nodes have
 been visited, that (1) allows reasonably fast lookups 
 (preferably O(1)),
 and (2) doesn't require GB's of storage (i.e., some kind of 
 compression
 would be nice).
A while ago I set out to write a solver for a group of problems which can be described as traversing in breath extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM. What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables. You mentioned compression - perhaps a bloom filter will fit your needs, as an optimization?
Nov 15 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).
A while ago I set out to write a solver for a group of problems which can be described as traversing in breath extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM. What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables. You mentioned compression - perhaps a bloom filter will fit your needs, as an optimization?
Thanks!! I think delayed duplicate detection is what I'm looking for. The bloom filter idea is an interesting one; I'll definitely consider it as a later optimization once I get DDD working. T -- It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
Nov 15 2013
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 so the edges of the graph may be thought of as a subset of the 
 edges of an n-dimensional grid.
Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives. How is this the same as the edges of an n-dimensional grid?
Nov 15 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 15 November 2013 at 19:48:21 UTC, John Colvin wrote:
 i.e. the edges are the euclidean basis vectors and their 
 negatives.
sorry, I meant *integer multiples* of the euclidian basis vectors.
Nov 15 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.
Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.
Actually, i can only be 1 or -1.
 How is this the same as the edges of an n-dimensional grid?
Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1. So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's. I was hoping that this fact can be taken advantage of, to make a compact representation of visited nodes. T -- By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality. -- D. Knuth
Nov 15 2013
next sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 How is this the same as the edges of an n-dimensional grid?
Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1. So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's. I was hoping that this fact can be taken advantage of, to make a compact representation of visited nodes.
How dense is the graph? For example, if it contains every possible edge described (+-1 to every single coordinate), for a breadth-first search, we can manage to just keep track of a single integer: the farthest distance traveled. For very dense graphs, you can perhaps apply a similar idea: represent large visited areas as "diamonds" with center and radius, then try to "factor" the visited areas into diamonds on-the-fly. Possibly they will be "diamonds with a [much shorter] list of holes". For example, we say that all nodes not further than 3 from (0, 0, ..., 0) are visited, and store a single center (origin) and a radius (3) instead of all (dimensions[100] to the power of distance[3]) of them. The lookup will be at most O (number-of-diamonds) plus a hash table lookup for the individual nodes. Perhaps for certain graphs, we can find a good compromise between the number of diamonds and the number of individual stored nodes. The above is just an idea, perhaps it won't be feasible by itself when you get to the details, but can inspire some related optimization. Also, the size of the border of n-dimensional area is a (n-1)-dimensional object, and for dense enough graphs, you can hope that the number of elements in it is less by an order of magnitude than the total number of visited nodes. However, for too much dimensions (as in your case, 10^100 -> 10^99), it does not seem to help much. Another question is, when will the graph traversal end? For example, if you are certain that you won't need to visit more than say one million nodes, a simple hash table storing the node representations at the hash indices will suffice. On the other hand, if you plan to visit 10^12 nodes, and the graph is not very sparse or very dense (and not regular in any obvious way besides what is described), perhaps you won't get the required compression level (1/1000) anyway. Ivan Kazmenko.
Nov 15 2013
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.
Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.
Actually, i can only be 1 or -1.
 How is this the same as the edges of an n-dimensional grid?
Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1.
Ah, ok.
 So for the 2D case, if you graph the
 nodes on the plane, the edges would be horizontal/vertical line 
 segments
 of unit length. If you consider the 2D grid's edges to be *all* 
 possible
 such line segments, then in that sense you could think of the 
 graph's
 edges as being a subset of the 2D grid's.
If you're going to chuck away the orthogonal location of the edges and only consider parallel position then you might as well go all the way: In the 2-D example, the edges are all +/-(0,1) and +/-(1,0). There's no difference between the edge connecting (x, p) to (x, p+1) and the edge connecting (x, q) to (x, q+1). In the general case there's only 2N possible unique edges.
Nov 15 2013
prev sibling parent reply "qznc" <qznc web.de> writes:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 This isn't directly related to D (though the code will be in 
 D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very 
 large graph,
 and I need some kind of data structure to keep track of which 
 nodes have
 been visited, that (1) allows reasonably fast lookups 
 (preferably O(1)),
 and (2) doesn't require GB's of storage (i.e., some kind of 
 compression
 would be nice).

 The graph nodes can be represented in various ways, but 
 possibly the
 most convenient representation is as n-dimensional vectors of
 (relatively small) integers. Furthermore, graph edges are 
 always between
 vectors that differ only by a single coordinate; so the edges 
 of the
 graph may be thought of as a subset of the edges of an 
 n-dimensional
 grid. The hashtable, therefore, needs to represent some 
 connected subset
 of this grid in a space-efficient manner, that still allows 
 fast lookup
 times.

 The naïve approach of using an n-dimensional bit array is not 
 feasible
 because n can be quite large (up to 100 or so), and the size of 
 the grid
 itself can get up to about 10 in each direction, so we're 
 looking at a
 potential maximum size of 10^100, clearly impractical to store
 explicitly.
So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed of a compute cluster or does it fit into single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")? Can you preprocess? I mean, walk all the nodes O(n) to compute a good (perfect?) hash function. In general, I think you should either store the flag right in the graph node or mirror the graph structure. I do not know any concrete algorithms.
Nov 15 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always
between vectors that differ only by a single coordinate; so the edges
of the graph may be thought of as a subset of the edges of an
n-dimensional grid. The hashtable, therefore, needs to represent some
connected subset of this grid in a space-efficient manner, that still
allows fast lookup times.

The nave approach of using an n-dimensional bit array is not
feasible because n can be quite large (up to 100 or so), and the size
of the grid itself can get up to about 10 in each direction, so we're
looking at a potential maximum size of 10^100, clearly impractical to
store explicitly.
So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed of a compute cluster or does it fit into single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")?
The graph is not explicitly stored, because it is far too large. The nodes are generated on-the-fly as needed; the challenge is to know whether a particular generated node has been generated before. Basically, the equivalent of a hashtable of previously-visited nodes, but I'm hoping for (1) some kind of compressed representation so that as many nodes as possible can be kept in RAM for speed, and (2) something more efficient for disk I/O, because I'm expecting the number of generated nodes will not fit in RAM and will have to be put on disk (and hashtables don't do very well with disk I/O due to the latency).
 Can you preprocess? I mean, walk all the nodes O(n) to compute a
 good (perfect?) hash function.
Walking the nodes defeats the purpose, because the whole idea is to *not* have to generate every node, just those selected by the algorithm (those are already very unwieldy in number, nevermind generating *all* nodes).
 In general, I think you should either store the flag right in the
 graph node or mirror the graph structure.
[...] That doesn't solve the problem when graph nodes are generated on-the-fly, since I wouldn't know where to look up the node structure without first knowing that it has been generated before. T -- Ph.D. = Permanent head Damage
Nov 15 2013
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
 The graph is not explicitly stored, because it is far too 
 large. The
 nodes are generated on-the-fly as needed; the challenge is to 
 know
 whether a particular generated node has been generated before.
 Basically, the equivalent of a hashtable of previously-visited 
 nodes,
 but I'm hoping for (1) some kind of compressed representation 
 so that as
 many nodes as possible can be kept in RAM for speed, and (2) 
 something
 more efficient for disk I/O, because I'm expecting the number of
 generated nodes will not fit in RAM and will have to be put on 
 disk (and
 hashtables don't do very well with disk I/O due to the latency).
Have multiple levels of lookup. For example, let's say your entire grid G is 1,000,000 x 1,000,000. Have a top level grid H that is 1000 x 1000. Each H[i, j] corresponds to a 1000 x 1000 region in G. If any cell has been visited in that region then H[i, j] points to a direct representation of that 1000 x 1000 subregion of G, if not, then it points to null. Checking if G[i, j] has been visited is as easy as: visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, j%1000]; You can easily page-in and page-out those 1000 x 1000 subregions to disk if needed, just keeping the active ones in memory. H will stay in memory (only 8MB). Obviously, you'd be better off using a power of 2 instead of 1000 for fast div and mod. And of course, you can have as many levels to this as you like.
Nov 15 2013