## digitalmars.D - [OT] Best algorithm for extremely large hashtable?

• H. S. Teoh (25/25) Nov 15 2013 This isn't directly related to D (though the code will be in D), and I
• Dmitry Olshansky (7/12) Nov 15 2013 Store as a bit-flag in whatever structure that serves as node. It's
• H. S. Teoh (18/31) Nov 15 2013 [...]
• bearophile (5/14) Nov 15 2013 Look for Succinct Data-Structures, Graphs. You can impose further
• Andrei Alexandrescu (4/9) Nov 15 2013 The classic trick here is parallel arrays - you store a bit array in
• qznc (3/50) Nov 15 2013 A Trie might be an idea, especially the bitwise variant.
• Vladimir Panteleev (13/25) Nov 15 2013 A while ago I set out to write a solver for a group of problems
• H. S. Teoh (8/28) Nov 15 2013 Thanks!! I think delayed duplicate detection is what I'm looking for.
• John Colvin (8/10) Nov 15 2013 Could perhaps elaborate on this point?
• John Colvin (2/4) Nov 15 2013 sorry, I meant *integer multiples* of the euclidian basis vectors.
• H. S. Teoh (15/27) Nov 15 2013 Basically, adjacent nodes differ only in a single coordinate, and that
• Ivan Kazmenko (36/52) Nov 15 2013 How dense is the graph?
• John Colvin (9/38) Nov 15 2013 If you're going to chuck away the orthogonal location of the
• qznc (11/45) Nov 15 2013 So, -10 to 10 in discrete steps. This means 5 bits per dimension
• H. S. Teoh (22/53) Nov 15 2013 The graph is not explicitly stored, because it is far too large. The
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```This isn't directly related to D (though the code will be in D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large graph,
and I need some kind of data structure to keep track of which nodes have
been visited, that (1) allows reasonably fast lookups (preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of compression
would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always between
vectors that differ only by a single coordinate; so the edges of the
graph may be thought of as a subset of the edges of an n-dimensional
grid. The hashtable, therefore, needs to represent some connected subset
of this grid in a space-efficient manner, that still allows fast lookup
times.

The naïve approach of using an n-dimensional bit array is not feasible
because n can be quite large (up to 100 or so), and the size of the grid
itself can get up to about 10 in each direction, so we're looking at a
potential maximum size of 10^100, clearly impractical to store
explicitly.

Are there any known good algorithms for tackling this problem?

Thanks!

T

--
Creativity is not an excuse for sloppiness.
```
Nov 15 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:
This isn't directly related to D (though the code will be in D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large graph,
and I need some kind of data structure to keep track of which nodes have
been visited, that (1) allows reasonably fast lookups (preferably O(1)),

Store as a bit-flag in whatever structure that serves as node. It's
going to be the fastest anyway and you indeed get to use only 1 bit per
node (and given padding and whatnot you may already have a couple of
bytes to spare per node).

--
Dmitry Olshansky
```
Nov 15 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:
15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:
This isn't directly related to D (though the code will be in D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)),

Store as a bit-flag in whatever structure that serves as node. It's
going to be the fastest anyway and you indeed get to use only 1 bit
per node (and given padding and whatnot you may already have a couple
of bytes to spare per node).

[...]

There are too many nodes to keep in memory, actually. The algorithm also
doesn't need to store the nodes; it can easily generate them on-the-fly
when it needs them.  The challenge is when it needs to know whether a
particular generated node has been visited before. I'd like to be able
to have a fast lookup (hopefully O(1)) that tells me whether node X has
been seen before. The problem is, there may have been a very large
number of previously-seen nodes, so I need some way of storing this
information in a space-efficient way.

I'm hoping for a kind of compressed representation where multiple
adjacent nodes can be represented in a very compact way, by somehow
taking advantage of the fact that they lie on an n-dimensional grid and
the fact that graph edges are always a subset of grid edges (as opposed
to edges between two arbitrary nodes).

T

--
I think Debian's doing something wrong, `apt-get install pesticide', doesn't
seem to remove the bugs on my system! -- Mike Dresser
```
Nov 15 2013
"bearophile" <bearophileHUGS lycos.com> writes:
```H. S. Teoh:

I'm hoping for a kind of compressed representation where
multiple
adjacent nodes can be represented in a very compact way, by
somehow
taking advantage of the fact that they lie on an n-dimensional
grid and
the fact that graph edges are always a subset of grid edges (as
opposed
to edges between two arbitrary nodes).

Look for Succinct Data-Structures, Graphs. You can impose further

Bye,
bearophile
```
Nov 15 2013
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 11/15/13 11:20 AM, H. S. Teoh wrote:
I'm hoping for a kind of compressed representation where multiple
adjacent nodes can be represented in a very compact way, by somehow
taking advantage of the fact that they lie on an n-dimensional grid and
the fact that graph edges are always a subset of grid edges (as opposed
to edges between two arbitrary nodes).

The classic trick here is parallel arrays - you store a bit array in
parallel with an array of nodes.

Andrei
```
Nov 15 2013
"qznc" <qznc web.de> writes:
```On Friday, 15 November 2013 at 19:21:35 UTC, H. S. Teoh wrote:
On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky
wrote:
15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:
This isn't directly related to D (though the code will be in
D), and
I thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very
large
graph, and I need some kind of data structure to keep track
of which
nodes have been visited, that (1) allows reasonably fast
lookups
(preferably O(1)),

Store as a bit-flag in whatever structure that serves as node.
It's
going to be the fastest anyway and you indeed get to use only
1 bit
per node (and given padding and whatnot you may already have a
couple
of bytes to spare per node).

[...]

There are too many nodes to keep in memory, actually. The
algorithm also
doesn't need to store the nodes; it can easily generate them
on-the-fly
when it needs them.  The challenge is when it needs to know
whether a
particular generated node has been visited before. I'd like to
be able
to have a fast lookup (hopefully O(1)) that tells me whether
node X has
been seen before. The problem is, there may have been a very
large
number of previously-seen nodes, so I need some way of storing
this
information in a space-efficient way.

I'm hoping for a kind of compressed representation where
multiple
adjacent nodes can be represented in a very compact way, by
somehow
taking advantage of the fact that they lie on an n-dimensional
grid and
the fact that graph edges are always a subset of grid edges (as
opposed
to edges between two arbitrary nodes).

A Trie might be an idea, especially the bitwise variant.

http://en.wikipedia.org/wiki/Trie
```
Nov 15 2013
```On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
This isn't directly related to D (though the code will be in
D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very
large graph,
and I need some kind of data structure to keep track of which
nodes have
been visited, that (1) allows reasonably fast lookups
(preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of
compression
would be nice).

A while ago I set out to write a solver for a group of problems
which can be described as traversing in breath extremely large
implicit graphs. Some code here (C++):
duplicate detection, to allow the number of nodes to exceed
available RAM.

What we found was that in certain cases, delayed duplicate
detection beat hash tables even while filtering duplicates in
memory, I think mainly because DDD is more parallelizable than
hash tables.

You mentioned compression - perhaps a bloom filter will fit your
needs, as an optimization?
```
Nov 15 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:

[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).

A while ago I set out to write a solver for a group of problems
which can be described as traversing in breath extremely large
implicit graphs. Some code here (C++):
duplicate detection, to allow the number of nodes to exceed
available RAM.

What we found was that in certain cases, delayed duplicate detection
beat hash tables even while filtering duplicates in memory, I think
mainly because DDD is more parallelizable than hash tables.

You mentioned compression - perhaps a bloom filter will fit your
needs, as an optimization?

Thanks!! I think delayed duplicate detection is what I'm looking for.
The bloom filter idea is an interesting one; I'll definitely consider it
as a later optimization once I get DDD working.

T

--
It won't be covered in the book. The source code has to be useful for
something, after all. -- Larry Wall
```
Nov 15 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.

Could perhaps elaborate on this point?

As far as I can see, the possible values of the nodes are all the
points in N^n (or Z^n, dunno whether you have negatives) and the
edges are all in the set of i*e_j for i in Z and j in N, where
e_j is the unit vector along one axis, i.e. the edges are the
euclidean basis vectors and their negatives.

How is this the same as the edges of an n-dimensional grid?
```
Nov 15 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Friday, 15 November 2013 at 19:48:21 UTC, John Colvin wrote:
i.e. the edges are the euclidean basis vectors and their
negatives.

sorry, I meant *integer multiples* of the euclidian basis vectors.
```
Nov 15 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.

Could perhaps elaborate on this point?

As far as I can see, the possible values of the nodes are all the
points in N^n (or Z^n, dunno whether you have negatives) and the
edges are all in the set of i*e_j for i in Z and j in N, where e_j
is the unit vector along one axis, i.e. the edges are the euclidean
basis vectors and their negatives.

Actually, i can only be 1 or -1.

How is this the same as the edges of an n-dimensional grid?

Basically, adjacent nodes differ only in a single coordinate, and that
difference can only be 1 or -1. So for the 2D case, if you graph the
nodes on the plane, the edges would be horizontal/vertical line segments
of unit length. If you consider the 2D grid's edges to be *all* possible
such line segments, then in that sense you could think of the graph's
edges as being a subset of the 2D grid's.

I was hoping that this fact can be taken advantage of, to make a compact
representation of visited nodes.

T

--
By understanding a machine-oriented language, the programmer will tend
to use a much more efficient method; it is much closer to reality. -- D.
Knuth
```
Nov 15 2013
"Ivan Kazmenko" <gassa mail.ru> writes:
```On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
How is this the same as the edges of an n-dimensional grid?

Basically, adjacent nodes differ only in a single coordinate,
and that
difference can only be 1 or -1. So for the 2D case, if you
graph the
nodes on the plane, the edges would be horizontal/vertical line
segments
of unit length. If you consider the 2D grid's edges to be *all*
possible
such line segments, then in that sense you could think of the
graph's
edges as being a subset of the 2D grid's.

I was hoping that this fact can be taken advantage of, to make
a compact
representation of visited nodes.

How dense is the graph?

For example, if it contains every possible edge described (+-1 to
every single coordinate), for a breadth-first search, we can
manage to just keep track of a single integer: the farthest
distance traveled.

For very dense graphs, you can perhaps apply a similar idea:
represent large visited areas as "diamonds" with center and
radius, then try to "factor" the visited areas into diamonds
on-the-fly.  Possibly they will be "diamonds with a [much
shorter] list of holes".  For example, we say that all nodes not
further than 3 from (0, 0, ..., 0) are visited, and store a
(dimensions[100] to the power of distance[3]) of them.  The
lookup will be at most O (number-of-diamonds) plus a hash table
lookup for the individual nodes.  Perhaps for certain graphs, we
can find a good compromise between the number of diamonds and the
number of individual stored nodes.

The above is just an idea, perhaps it won't be feasible by itself
when you get to the details, but can inspire some related
optimization.

Also, the size of the border of n-dimensional area is a
(n-1)-dimensional object, and for dense enough graphs, you can
hope that the number of elements in it is less by an order of
magnitude than the total number of visited nodes.  However, for
too much dimensions (as in your case, 10^100 -> 10^99), it does
not seem to help much.

Another question is, when will the graph traversal end?  For
example, if you are certain that you won't need to visit more
than say one million nodes, a simple hash table storing the node
representations at the hash indices will suffice.  On the other
hand, if you plan to visit 10^12 nodes, and the graph is not very
sparse or very dense (and not regular in any obvious way besides
what is described), perhaps you won't get the required
compression level (1/1000) anyway.

Ivan Kazmenko.
```
Nov 15 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
```On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
so the edges of the graph may be thought of as a subset of the
edges of an n-dimensional grid.

Could perhaps elaborate on this point?

As far as I can see, the possible values of the nodes are all
the
points in N^n (or Z^n, dunno whether you have negatives) and
the
edges are all in the set of i*e_j for i in Z and j in N, where
e_j
is the unit vector along one axis, i.e. the edges are the
euclidean
basis vectors and their negatives.

Actually, i can only be 1 or -1.

How is this the same as the edges of an n-dimensional grid?

Basically, adjacent nodes differ only in a single coordinate,
and that
difference can only be 1 or -1.

Ah, ok.

So for the 2D case, if you graph the
nodes on the plane, the edges would be horizontal/vertical line
segments
of unit length. If you consider the 2D grid's edges to be *all*
possible
such line segments, then in that sense you could think of the
graph's
edges as being a subset of the 2D grid's.

If you're going to chuck away the orthogonal location of the
edges and only consider parallel position then you might as well
go all the way:

In the 2-D example, the edges are all +/-(0,1) and +/-(1,0).
There's no difference between the edge connecting (x, p) to (x,
p+1) and the edge connecting (x, q) to (x, q+1). In the general
case there's only 2N possible unique edges.
```
Nov 15 2013
"qznc" <qznc web.de> writes:
```On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
This isn't directly related to D (though the code will be in
D), and I
thought this would be a good place to ask.

I'm trying to implement an algorithm that traverses a very
large graph,
and I need some kind of data structure to keep track of which
nodes have
been visited, that (1) allows reasonably fast lookups
(preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of
compression
would be nice).

The graph nodes can be represented in various ways, but
possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are
always between
vectors that differ only by a single coordinate; so the edges
of the
graph may be thought of as a subset of the edges of an
n-dimensional
grid. The hashtable, therefore, needs to represent some
connected subset
of this grid in a space-efficient manner, that still allows
fast lookup
times.

The naÃ¯ve approach of using an n-dimensional bit array is not
feasible
because n can be quite large (up to 100 or so), and the size of
the grid
itself can get up to about 10 in each direction, so we're
looking at a
potential maximum size of 10^100, clearly impractical to store
explicitly.

So, -10 to 10 in discrete steps. This means 5 bits per dimension
and 500 bits for a single coordinate. Is the graph distributed of
a compute cluster or does it fit into single computer? With a few
GB of RAM, this means your graph is quite sparse, yet nodes are
connected ("differ only by a single coordinate")?

Can you preprocess? I mean, walk all the nodes O(n) to compute a
good (perfect?) hash function.

In general, I think you should either store the flag right in the
graph node or mirror the graph structure.

I do not know any concrete algorithms.
```
Nov 15 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:
On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:

[...]
I'm trying to implement an algorithm that traverses a very large
graph, and I need some kind of data structure to keep track of which
nodes have been visited, that (1) allows reasonably fast lookups
(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
some kind of compression would be nice).

The graph nodes can be represented in various ways, but possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are always
between vectors that differ only by a single coordinate; so the edges
of the graph may be thought of as a subset of the edges of an
n-dimensional grid. The hashtable, therefore, needs to represent some
connected subset of this grid in a space-efficient manner, that still
allows fast lookup times.

The naïve approach of using an n-dimensional bit array is not
feasible because n can be quite large (up to 100 or so), and the size
of the grid itself can get up to about 10 in each direction, so we're
looking at a potential maximum size of 10^100, clearly impractical to
store explicitly.

So, -10 to 10 in discrete steps. This means 5 bits per dimension and
500 bits for a single coordinate. Is the graph distributed of a
compute cluster or does it fit into single computer? With a few GB
of RAM, this means your graph is quite sparse, yet nodes are
connected ("differ only by a single coordinate")?

The graph is not explicitly stored, because it is far too large. The
nodes are generated on-the-fly as needed; the challenge is to know
whether a particular generated node has been generated before.
Basically, the equivalent of a hashtable of previously-visited nodes,
but I'm hoping for (1) some kind of compressed representation so that as
many nodes as possible can be kept in RAM for speed, and (2) something
more efficient for disk I/O, because I'm expecting the number of
generated nodes will not fit in RAM and will have to be put on disk (and
hashtables don't do very well with disk I/O due to the latency).

Can you preprocess? I mean, walk all the nodes O(n) to compute a
good (perfect?) hash function.

Walking the nodes defeats the purpose, because the whole idea is to
*not* have to generate every node, just those selected by the algorithm
(those are already very unwieldy in number, nevermind generating *all*
nodes).

In general, I think you should either store the flag right in the
graph node or mirror the graph structure.

[...]

That doesn't solve the problem when graph nodes are generated
on-the-fly, since I wouldn't know where to look up the node structure
without first knowing that it has been generated before.

T

--
```
Nov 15 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
```On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
The graph is not explicitly stored, because it is far too
large. The
nodes are generated on-the-fly as needed; the challenge is to
know
whether a particular generated node has been generated before.
Basically, the equivalent of a hashtable of previously-visited
nodes,
but I'm hoping for (1) some kind of compressed representation
so that as
many nodes as possible can be kept in RAM for speed, and (2)
something
more efficient for disk I/O, because I'm expecting the number of
generated nodes will not fit in RAM and will have to be put on
disk (and
hashtables don't do very well with disk I/O due to the latency).

Have multiple levels of lookup.

For example, let's say your entire grid G is 1,000,000 x
1,000,000.

Have a top level grid H that is 1000 x 1000. Each H[i, j]
corresponds to a 1000 x 1000 region in G. If any cell has been
visited in that region then H[i, j] points to a direct
representation of that 1000 x 1000 subregion of G, if not, then
it points to null. Checking if G[i, j] has been visited is as
easy as:

visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000,
j%1000];

You can easily page-in and page-out those 1000 x 1000 subregions
to disk if needed, just keeping the active ones in memory. H will
stay in memory (only 8MB). Obviously, you'd be better off using a
power of 2 instead of 1000 for fast div and mod.

And of course, you can have as many levels to this as you like.
```
Nov 15 2013