digitalmars.D - [OT] Best algorithm for extremely large hashtable?

H. S. Teoh (25/25) Nov 15 2013 This isn't directly related to D (though the code will be in D), and I

Dmitry Olshansky (7/12) Nov 15 2013 Store as a bit-flag in whatever structure that serves as node. It's

H. S. Teoh (18/31) Nov 15 2013 [...]

bearophile (5/14) Nov 15 2013 Look for Succinct Data-Structures, Graphs. You can impose further
Andrei Alexandrescu (4/9) Nov 15 2013 The classic trick here is parallel arrays - you store a bit array in
qznc (3/50) Nov 15 2013 A Trie might be an idea, especially the bitwise variant.

Vladimir Panteleev (13/25) Nov 15 2013 A while ago I set out to write a solver for a group of problems

H. S. Teoh (8/28) Nov 15 2013 Thanks!! I think delayed duplicate detection is what I'm looking for.

John Colvin (8/10) Nov 15 2013 Could perhaps elaborate on this point?

John Colvin (2/4) Nov 15 2013 sorry, I meant *integer multiples* of the euclidian basis vectors.
H. S. Teoh (15/27) Nov 15 2013 Basically, adjacent nodes differ only in a single coordinate, and that

Ivan Kazmenko (36/52) Nov 15 2013 How dense is the graph?
John Colvin (9/38) Nov 15 2013 If you're going to chuck away the orthogonal location of the

qznc (11/45) Nov 15 2013 So, -10 to 10 in discrete steps. This means 5 bits per dimension

H. S. Teoh (22/53) Nov 15 2013 The graph is not explicitly stored, because it is far too large. The

Peter Alexander (17/32) Nov 15 2013 Have multiple levels of lookup.

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

This isn't directly related to D (though the code will be in D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large graph,
and I need some kind of data structure to keep track of which nodes have
been visited, that (1) allows reasonably fast lookups (preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of compression
would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always between
vectors that differ only by a single coordinate; so the edges of the
graph may be thought of as a subset of the edges of an n-dimensional
grid. The hashtable, therefore, needs to represent some connected subset
of this grid in a space-efficient manner, that still allows fast lookup
times.

The na�ve approach of using an n-dimensional bit array is not feasible
because n can be quite large (up to 100 or so), and the size of the grid
itself can get up to about 10 in each direction, so we're looking at a
potential maximum size of 10^100, clearly impractical to store
explicitly.

Are there any known good algorithms for tackling this problem?

Thanks!


T

-- 
Creativity is not an excuse for sloppiness.

Nov 15 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

15-Nov-2013 22:41, H. S. Teoh пишет:
 This isn't directly related to D (though the code will be in D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very large graph,
 and I need some kind of data structure to keep track of which nodes have
 been visited, that (1) allows reasonably fast lookups (preferably O(1)),

Store as a bit-flag in whatever structure that serves as node. It's 
going to be the fastest anyway and you indeed get to use only 1 bit per 
node (and given padding and whatnot you may already have a couple of 
bytes to spare per node).



-- 
Dmitry Olshansky

Nov 15 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:
 15-Nov-2013 22:41, H. S. Teoh пишет:
This isn't directly related to D (though the code will be in D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)),

 
 Store as a bit-flag in whatever structure that serves as node. It's
 going to be the fastest anyway and you indeed get to use only 1 bit
 per node (and given padding and whatnot you may already have a couple
 of bytes to spare per node).

[...]

There are too many nodes to keep in memory, actually. The algorithm also
doesn't need to store the nodes; it can easily generate them on-the-fly
when it needs them.  The challenge is when it needs to know whether a
particular generated node has been visited before. I'd like to be able
to have a fast lookup (hopefully O(1)) that tells me whether node X has
been seen before. The problem is, there may have been a very large
number of previously-seen nodes, so I need some way of storing this
information in a space-efficient way.

I'm hoping for a kind of compressed representation where multiple
adjacent nodes can be represented in a very compact way, by somehow
taking advantage of the fact that they lie on an n-dimensional grid and
the fact that graph edges are always a subset of grid edges (as opposed
to edges between two arbitrary nodes).


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't
seem to remove the bugs on my system! -- Mike Dresser

Nov 15 2013

"bearophile" <bearophileHUGS lycos.com> writes:

H. S. Teoh:

 I'm hoping for a kind of compressed representation where 
 multiple
 adjacent nodes can be represented in a very compact way, by 
 somehow
 taking advantage of the fact that they lie on an n-dimensional 
 grid and
 the fact that graph edges are always a subset of grid edges (as 
 opposed
 to edges between two arbitrary nodes).

Look for Succinct Data-Structures, Graphs. You can impose further 
memory savings in your problem.

Bye,
bearophile

Nov 15 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 11/15/13 11:20 AM, H. S. Teoh wrote:
 I'm hoping for a kind of compressed representation where multiple
 adjacent nodes can be represented in a very compact way, by somehow
 taking advantage of the fact that they lie on an n-dimensional grid and
 the fact that graph edges are always a subset of grid edges (as opposed
 to edges between two arbitrary nodes).

The classic trick here is parallel arrays - you store a bit array in 
parallel with an array of nodes.

Andrei

Nov 15 2013

"qznc" <qznc web.de> writes:

On Friday, 15 November 2013 at 19:21:35 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky 
 wrote:
 15-Nov-2013 22:41, H. S. Teoh пишет:
This isn't directly related to D (though the code will be in 
D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very 
large
graph, and I need some kind of data structure to keep track 
of which
nodes have been visited, that (1) allows reasonably fast 
lookups
(preferably O(1)),

 
 Store as a bit-flag in whatever structure that serves as node. 
 It's
 going to be the fastest anyway and you indeed get to use only 
 1 bit
 per node (and given padding and whatnot you may already have a 
 couple
 of bytes to spare per node).

 [...]

 There are too many nodes to keep in memory, actually. The 
 algorithm also
 doesn't need to store the nodes; it can easily generate them 
 on-the-fly
 when it needs them.  The challenge is when it needs to know 
 whether a
 particular generated node has been visited before. I'd like to 
 be able
 to have a fast lookup (hopefully O(1)) that tells me whether 
 node X has
 been seen before. The problem is, there may have been a very 
 large
 number of previously-seen nodes, so I need some way of storing 
 this
 information in a space-efficient way.

 I'm hoping for a kind of compressed representation where 
 multiple
 adjacent nodes can be represented in a very compact way, by 
 somehow
 taking advantage of the fact that they lie on an n-dimensional 
 grid and
 the fact that graph edges are always a subset of grid edges (as 
 opposed
 to edges between two arbitrary nodes).

A Trie might be an idea, especially the bitwise variant.

http://en.wikipedia.org/wiki/Trie

Nov 15 2013

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 This isn't directly related to D (though the code will be in 
 D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very 
 large graph,
 and I need some kind of data structure to keep track of which 
 nodes have
 been visited, that (1) allows reasonably fast lookups 
 (preferably O(1)),
 and (2) doesn't require GB's of storage (i.e., some kind of 
 compression
 would be nice).

A while ago I set out to write a solver for a group of problems 
which can be described as traversing in breath extremely large 
implicit graphs. Some code here (C++): 
https://github.com/CyberShadow/DDD. The project uses delayed 
duplicate detection, to allow the number of nodes to exceed 
available RAM.

What we found was that in certain cases, delayed duplicate 
detection beat hash tables even while filtering duplicates in 
memory, I think mainly because DDD is more parallelizable than 
hash tables.

You mentioned compression - perhaps a bloom filter will fit your 
needs, as an optimization?

Nov 15 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:

[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).

 
 A while ago I set out to write a solver for a group of problems
 which can be described as traversing in breath extremely large
 implicit graphs. Some code here (C++):
 https://github.com/CyberShadow/DDD. The project uses delayed
 duplicate detection, to allow the number of nodes to exceed
 available RAM.
 
 What we found was that in certain cases, delayed duplicate detection
 beat hash tables even while filtering duplicates in memory, I think
 mainly because DDD is more parallelizable than hash tables.
 
 You mentioned compression - perhaps a bloom filter will fit your
 needs, as an optimization?

Thanks!! I think delayed duplicate detection is what I'm looking for.
The bloom filter idea is an interesting one; I'll definitely consider it
as a later optimization once I get DDD working.


T

-- 
It won't be covered in the book. The source code has to be useful for
something, after all. -- Larry Wall

Nov 15 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 so the edges of the graph may be thought of as a subset of the 
 edges of an n-dimensional grid.

Could perhaps elaborate on this point?

As far as I can see, the possible values of the nodes are all the 
points in N^n (or Z^n, dunno whether you have negatives) and the 
edges are all in the set of i*e_j for i in Z and j in N, where 
e_j is the unit vector along one axis, i.e. the edges are the 
euclidean basis vectors and their negatives.

How is this the same as the edges of an n-dimensional grid?

Nov 15 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 15 November 2013 at 19:48:21 UTC, John Colvin wrote:
 i.e. the edges are the euclidean basis vectors and their 
 negatives.

sorry, I meant *integer multiples* of the euclidian basis vectors.

Nov 15 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.

 
 Could perhaps elaborate on this point?
 
 As far as I can see, the possible values of the nodes are all the
 points in N^n (or Z^n, dunno whether you have negatives) and the
 edges are all in the set of i*e_j for i in Z and j in N, where e_j
 is the unit vector along one axis, i.e. the edges are the euclidean
 basis vectors and their negatives.

Actually, i can only be 1 or -1.


 How is this the same as the edges of an n-dimensional grid?

Basically, adjacent nodes differ only in a single coordinate, and that
difference can only be 1 or -1. So for the 2D case, if you graph the
nodes on the plane, the edges would be horizontal/vertical line segments
of unit length. If you consider the 2D grid's edges to be *all* possible
such line segments, then in that sense you could think of the graph's
edges as being a subset of the 2D grid's.

I was hoping that this fact can be taken advantage of, to make a compact
representation of visited nodes.


T

-- 
By understanding a machine-oriented language, the programmer will tend
to use a much more efficient method; it is much closer to reality. -- D.
Knuth

Nov 15 2013

"Ivan Kazmenko" <gassa mail.ru> writes:

On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 How is this the same as the edges of an n-dimensional grid?

 Basically, adjacent nodes differ only in a single coordinate, 
 and that
 difference can only be 1 or -1. So for the 2D case, if you 
 graph the
 nodes on the plane, the edges would be horizontal/vertical line 
 segments
 of unit length. If you consider the 2D grid's edges to be *all* 
 possible
 such line segments, then in that sense you could think of the 
 graph's
 edges as being a subset of the 2D grid's.

 I was hoping that this fact can be taken advantage of, to make 
 a compact
 representation of visited nodes.

How dense is the graph?

For example, if it contains every possible edge described (+-1 to 
every single coordinate), for a breadth-first search, we can 
manage to just keep track of a single integer: the farthest 
distance traveled.

For very dense graphs, you can perhaps apply a similar idea: 
represent large visited areas as "diamonds" with center and 
radius, then try to "factor" the visited areas into diamonds 
on-the-fly.  Possibly they will be "diamonds with a [much 
shorter] list of holes".  For example, we say that all nodes not 
further than 3 from (0, 0, ..., 0) are visited, and store a 
single center (origin) and a radius (3) instead of all 
(dimensions[100] to the power of distance[3]) of them.  The 
lookup will be at most O (number-of-diamonds) plus a hash table 
lookup for the individual nodes.  Perhaps for certain graphs, we 
can find a good compromise between the number of diamonds and the 
number of individual stored nodes.

The above is just an idea, perhaps it won't be feasible by itself 
when you get to the details, but can inspire some related 
optimization.

Also, the size of the border of n-dimensional area is a 
(n-1)-dimensional object, and for dense enough graphs, you can 
hope that the number of elements in it is less by an order of 
magnitude than the total number of visited nodes.  However, for 
too much dimensions (as in your case, 10^100 -> 10^99), it does 
not seem to help much.

Another question is, when will the graph traversal end?  For 
example, if you are certain that you won't need to visit more 
than say one million nodes, a simple hash table storing the node 
representations at the hash indices will suffice.  On the other 
hand, if you plan to visit 10^12 nodes, and the graph is not very 
sparse or very dense (and not regular in any obvious way besides 
what is described), perhaps you won't get the required 
compression level (1/1000) anyway.

Ivan Kazmenko.

Nov 15 2013

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
 On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.

 
 Could perhaps elaborate on this point?
 
 As far as I can see, the possible values of the nodes are all 
 the
 points in N^n (or Z^n, dunno whether you have negatives) and 
 the
 edges are all in the set of i*e_j for i in Z and j in N, where 
 e_j
 is the unit vector along one axis, i.e. the edges are the 
 euclidean
 basis vectors and their negatives.

 Actually, i can only be 1 or -1.


 How is this the same as the edges of an n-dimensional grid?

 Basically, adjacent nodes differ only in a single coordinate, 
 and that
 difference can only be 1 or -1.

Ah, ok.

 So for the 2D case, if you graph the
 nodes on the plane, the edges would be horizontal/vertical line 
 segments
 of unit length. If you consider the 2D grid's edges to be *all* 
 possible
 such line segments, then in that sense you could think of the 
 graph's
 edges as being a subset of the 2D grid's.

If you're going to chuck away the orthogonal location of the 
edges and only consider parallel position then you might as well 
go all the way:

In the 2-D example, the edges are all +/-(0,1) and +/-(1,0). 
There's no difference between the edge connecting (x, p) to (x, 
p+1) and the edge connecting (x, q) to (x, q+1). In the general 
case there's only 2N possible unique edges.

Nov 15 2013

"qznc" <qznc web.de> writes:

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
 This isn't directly related to D (though the code will be in 
 D), and I
 thought this would be a good place to ask.

 I'm trying to implement an algorithm that traverses a very 
 large graph,
 and I need some kind of data structure to keep track of which 
 nodes have
 been visited, that (1) allows reasonably fast lookups 
 (preferably O(1)),
 and (2) doesn't require GB's of storage (i.e., some kind of 
 compression
 would be nice).

 The graph nodes can be represented in various ways, but 
 possibly the
 most convenient representation is as n-dimensional vectors of
 (relatively small) integers. Furthermore, graph edges are 
 always between
 vectors that differ only by a single coordinate; so the edges 
 of the
 graph may be thought of as a subset of the edges of an 
 n-dimensional
 grid. The hashtable, therefore, needs to represent some 
 connected subset
 of this grid in a space-efficient manner, that still allows 
 fast lookup
 times.

 The naïve approach of using an n-dimensional bit array is not 
 feasible
 because n can be quite large (up to 100 or so), and the size of 
 the grid
 itself can get up to about 10 in each direction, so we're 
 looking at a
 potential maximum size of 10^100, clearly impractical to store
 explicitly.

So, -10 to 10 in discrete steps. This means 5 bits per dimension 
and 500 bits for a single coordinate. Is the graph distributed of 
a compute cluster or does it fit into single computer? With a few 
GB of RAM, this means your graph is quite sparse, yet nodes are 
connected ("differ only by a single coordinate")?

Can you preprocess? I mean, walk all the nodes O(n) to compute a 
good (perfect?) hash function.

In general, I think you should either store the flag right in the 
graph node or mirror the graph structure.

I do not know any concrete algorithms.

Nov 15 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:
 On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:

[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always
between vectors that differ only by a single coordinate; so the edges
of the graph may be thought of as a subset of the edges of an
n-dimensional grid. The hashtable, therefore, needs to represent some
connected subset of this grid in a space-efficient manner, that still
allows fast lookup times.

The na�ve approach of using an n-dimensional bit array is not
feasible because n can be quite large (up to 100 or so), and the size
of the grid itself can get up to about 10 in each direction, so we're
looking at a potential maximum size of 10^100, clearly impractical to
store explicitly.

 
 So, -10 to 10 in discrete steps. This means 5 bits per dimension and
 500 bits for a single coordinate. Is the graph distributed of a
 compute cluster or does it fit into single computer? With a few GB
 of RAM, this means your graph is quite sparse, yet nodes are
 connected ("differ only by a single coordinate")?

The graph is not explicitly stored, because it is far too large. The
nodes are generated on-the-fly as needed; the challenge is to know
whether a particular generated node has been generated before.
Basically, the equivalent of a hashtable of previously-visited nodes,
but I'm hoping for (1) some kind of compressed representation so that as
many nodes as possible can be kept in RAM for speed, and (2) something
more efficient for disk I/O, because I'm expecting the number of
generated nodes will not fit in RAM and will have to be put on disk (and
hashtables don't do very well with disk I/O due to the latency).


 Can you preprocess? I mean, walk all the nodes O(n) to compute a
 good (perfect?) hash function.

Walking the nodes defeats the purpose, because the whole idea is to
*not* have to generate every node, just those selected by the algorithm
(those are already very unwieldy in number, nevermind generating *all*
nodes).


 In general, I think you should either store the flag right in the
 graph node or mirror the graph structure.

[...]

That doesn't solve the problem when graph nodes are generated
on-the-fly, since I wouldn't know where to look up the node structure
without first knowing that it has been generated before.


T

-- 
Ph.D. = Permanent head Damage

Nov 15 2013

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
 The graph is not explicitly stored, because it is far too 
 large. The
 nodes are generated on-the-fly as needed; the challenge is to 
 know
 whether a particular generated node has been generated before.
 Basically, the equivalent of a hashtable of previously-visited 
 nodes,
 but I'm hoping for (1) some kind of compressed representation 
 so that as
 many nodes as possible can be kept in RAM for speed, and (2) 
 something
 more efficient for disk I/O, because I'm expecting the number of
 generated nodes will not fit in RAM and will have to be put on 
 disk (and
 hashtables don't do very well with disk I/O due to the latency).

Have multiple levels of lookup.

For example, let's say your entire grid G is 1,000,000 x 
1,000,000.

Have a top level grid H that is 1000 x 1000. Each H[i, j] 
corresponds to a 1000 x 1000 region in G. If any cell has been 
visited in that region then H[i, j] points to a direct 
representation of that 1000 x 1000 subregion of G, if not, then 
it points to null. Checking if G[i, j] has been visited is as 
easy as:

visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, 
j%1000];

You can easily page-in and page-out those 1000 x 1000 subregions 
to disk if needed, just keeping the active ones in memory. H will 
stay in memory (only 8MB). Obviously, you'd be better off using a 
power of 2 instead of 1000 for fast div and mod.

And of course, you can have as many levels to this as you like.

Nov 15 2013

D Programming

C/C++ Programming

Other

digitalmars.D - [OT] Best algorithm for extremely large hashtable?