
digitalmars.D - [OT] Efficient file structure for very large lookup tables?

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will be
very frequent.

I did some online research, and it seems that hashtables perform poorly
on disk, because the usual hash functions cause random scattering of
related data (which are likely to be accessed with higher temporal
locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain to
implement), and also only exhibit good locality if table entries are
accessed sequentially; the problem is I'm working with high-dimensional
data and the order of accesses is unlikely to be sequential. However,
they do exhibit good spatial locality in higher-dimensional space (i.e.,
if entry X is accessed first, then the next entry Y is quite likely to
be close to X in that space).  Does anybody know of a good data
structure that can take advantage of this fact to minimize disk
accesses?


T

-- 
Computers are like a jungle: they have monitor lizards, rams, mice, c-moss,
binary trees... and bugs.
Dec 17 2013
next sibling parent reply "tn" <no email.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 However, they do exhibit good spatial locality in higher-dimensional
 space (i.e., if entry X is accessed first, then the next entry Y is
 quite likely to be close to X in that space).  Does anybody know of a
 good data structure that can take advantage of this fact to minimize
 disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
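
To make the suggestion concrete, here is a minimal in-memory k-d tree in D,
insert and exact lookup only; DIM (the fixed dimensionality) and the integer
key type are assumptions for the sketch. A disk-based or cache-oblivious
variant would additionally lay subtrees out in page-sized blocks instead of
allocating nodes individually.

import std.stdio;

// Minimal in-memory k-d tree over fixed-dimension integer keys:
// insert and exact lookup only. DIM is assumed for the example.
enum DIM = 4;

struct Node
{
    int[DIM] key;
    Node* left, right;
}

Node* insert(Node* root, int[DIM] key, size_t depth = 0)
{
    if (root is null)
        return new Node(key);
    immutable axis = depth % DIM;   // cycle through the dimensions
    if (key[axis] < root.key[axis])
        root.left = insert(root.left, key, depth + 1);
    else
        root.right = insert(root.right, key, depth + 1);
    return root;
}

bool contains(Node* root, int[DIM] key, size_t depth = 0)
{
    if (root is null) return false;
    if (root.key == key) return true;
    immutable axis = depth % DIM;
    return key[axis] < root.key[axis]
        ? contains(root.left, key, depth + 1)
        : contains(root.right, key, depth + 1);
}

void main()
{
    Node* root;
    root = insert(root, [1, 2, 3, 4]);
    root = insert(root, [1, 2, 3, 5]);
    writeln(contains(root, [1, 2, 3, 5])); // true
    writeln(contains(root, [9, 9, 9, 9])); // false
}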
Dec 17 2013
parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 However, they do exhibit good spatial locality in higher-dimensional
 space (i.e., if entry X is accessed first, then the next entry Y is
 quite likely to be close to X in that space).  Does anybody know of a
 good data structure that can take advantage of this fact to minimize
 disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
It's not cache-oblivious, but he could try the KDB tree:

http://dl.acm.org/citation.cfm?id=582321

Or, again for batched queries, there is the Bkd-tree (search on
Bkd-tree and Lars Arge) - this is a type of buffer tree, so the
operations need to be able to be run as a batch.
Dec 17 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
However, they do exhibit good spatial locality in higher-dimensional
space (i.e., if entry X is accessed first, then the next entry Y is
quite likely to be close to X in that space).  Does anybody know of
a good data structure that can take advantage of this fact to
minimize disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
It's not cache-oblivious, but he could try the KDB tree:
http://dl.acm.org/citation.cfm?id=582321 or, again for batched
queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) -
this is a type of buffer tree, so the operations need to be able to
be run as a batch.
I skimmed over the KDB tree paper, and am reading the Bkd-tree paper
now; it appears that the latter is an improvement over the former, and
has better update performance characteristics. However, it also
appears much more complicated to implement.

Another point I forgot to mention, is that currently I only ever
perform insertion and queries, and the two can actually be combined
(basically the table is kept for uniqueness testing). I don't need to
delete entries from the table. Eventually I might look into that to
reduce disk space requirements, but as far as the actual algorithm is
concerned, it's not necessary. So there's potentially a good amount of
room here for optimization by foregoing the ability to delete entries.

I'm currently considering using simpler, "dumber" file structures for
storing nodes, with an in-memory Bloom filter to eliminate a
(hopefully) large number of I/O roundtrips, since I'm expecting that a
good fraction of queries will return 'not found'.


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on
a California license plate on a VW Beetle: 'FEATURE'..."
-- Joshua D. Wachs - Natural Intelligence, Inc.
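
For reference, the in-memory Bloom filter idea can be sketched in a few
lines of D. The bit count, the number of hashes and the use of seeded
FNV-1a as the hash are assumptions for the sketch; the point is that a
negative answer means the key is definitely not on disk, so that lookup
needs no I/O at all.

import std.stdio;

// Minimal Bloom filter: answers "definitely absent" or "maybe present".
struct BloomFilter
{
    private ulong[] bits;
    private size_t nbits;
    private size_t nhashes;

    this(size_t nbits, size_t nhashes)
    {
        this.nbits = nbits;
        this.nhashes = nhashes;
        bits = new ulong[(nbits + 63) / 64];
    }

    // Seeded FNV-1a, so one function gives us k different hashes.
    private size_t hash(const(ubyte)[] key, ulong seed) const
    {
        ulong h = 14695981039346656037UL ^ seed;
        foreach (b; key)
        {
            h ^= b;
            h *= 1099511628211UL;
        }
        return cast(size_t)(h % nbits);
    }

    void insert(const(ubyte)[] key)
    {
        foreach (i; 0 .. nhashes)
        {
            auto bit = hash(key, i);
            bits[bit / 64] |= 1UL << (bit % 64);
        }
    }

    // false => definitely absent (skip the disk); true => go check the disk.
    bool maybeContains(const(ubyte)[] key) const
    {
        foreach (i; 0 .. nhashes)
        {
            auto bit = hash(key, i);
            if (!(bits[bit / 64] & (1UL << (bit % 64))))
                return false;
        }
        return true;
    }
}

void main()
{
    auto bf = BloomFilter(1 << 20, 4); // ~1M bits, 4 hashes (arbitrary)
    bf.insert(cast(const(ubyte)[]) "some key");
    writeln(bf.maybeContains(cast(const(ubyte)[]) "some key"));  // true
    writeln(bf.maybeContains(cast(const(ubyte)[]) "other key")); // very likely false
}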
Dec 17 2013
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.

 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be accessed with higher temporal
 locality), which incurs lots of disk seeks.

 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?


 T
A burst trie might work, although it is originally designed for text. I
haven't had a chance to try one out in code yet, but the paper by
Heinz, Zobel, and Williams is interesting:

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499

Joseph
Dec 17 2013
prev sibling next sibling parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.

 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be accessed with higher temporal
 locality), which incurs lots of disk seeks.

 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?


 T
As a first attempt could you use a key-value database (like REDIS, if
you have enough memory to fit everything in)? Or is that out of the
question?

Another question is: can your queries be batched? If that is the case
and your data is bigger than your available memory, then try Googling
"Lars Arge Buffer Tree", which might work well. However, if you thought
implementing a B-tree was going to be painful, that might not appeal to
you. If you don't want to implement that yourself you could look at
TPIE:

http://www.madalgo.au.dk/tpie/

although it is in C++.

If I had to design something quick on the spot, my first guess would be
to use a grid on the first two dimensions and then bin the 'points' or
keys within each grid square and build a simpler structure on those (a
rough sketch follows below). This won't work so well though for really
high dimension data or if the 'points' are randomly distributed.

Also, what exactly do you mean by "in that space" when you say:

"if entry X is accessed first, then the next entry Y is quite likely
to be close to X in that space".

Do you mean that the value of Y in the next dimension is numerically
close (or expected to be) to X?

Cheers,
Craig
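
A rough sketch of the grid-binning idea in D, assuming the keys are
vectors of integers; the cell size, the cell-id formula and the
in-memory bins standing in for on-disk storage are all placeholders.

import std.stdio;

// Bucket each key by a coarse grid cell computed from its first two
// coordinates; entries in the same cell would then be stored together
// (e.g. one file or page run per cell). Cell size is a tuning knob.
enum cellSize = 16;

struct Entry
{
    int[] key;      // the n-dimensional key
    ubyte[] value;
}

// Map a key to a 2-D grid cell based on its first two coordinates.
long cellOf(const int[] key)
{
    assert(key.length >= 2);
    long cx = key[0] / cellSize;
    long cy = key[1] / cellSize;
    return cx * 1_000_003 + cy; // simple cell id; a real layout would index this
}

void main()
{
    Entry[][long] bins;  // cell id -> entries in that cell (stand-in for disk bins)

    auto e = Entry([40, 7, 3, 3], cast(ubyte[]) "payload".dup);
    bins[cellOf(e.key)] ~= e;

    writeln("cell: ", cellOf(e.key), " entries: ", bins[cellOf(e.key)].length);
}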
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.

I did some online research, and it seems that hashtables perform
poorly on disk, because the usual hash functions cause random
scattering of related data (which are likely to be accessed with higher
temporal locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain
to implement), and also only exhibit good locality if table entries
are accessed sequentially; the problem is I'm working with
high-dimensional data and the order of accesses is unlikely to be
sequential.  However, they do exhibit good spatial locality in
higher-dimensional space (i.e., if entry X is accessed first, then
the next entry Y is quite likely to be close to X in that space).
Does anybody know of a good data structure that can take advantage of
this fact to minimize disk accesses?


T
As a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.
Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)
 Another question is can your queries be batched?  If that is the
 case and your data is bigger than your available memory, then try
 Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help? [...]
 If I had to design something quick on the spot, my first guess would
 be to use a grid on the first two dimensions and then bin the
 'points' or keys within each grid square and build a simpler
 structure on those.  This won't work so well though for really high
 dimension data or if the 'points' are randomly distributed.

 Also, what exactly do you mean by "in that space" when you say:
 
 "if entry X is accessed first, then the next entry Y is quite likely
 to be close to X in that space".
 
 Do you mean that the value of Y in the next dimension is numerically
 close (or expected to be) to X?
[...] How many dimensions is considered "really high"? The number of
dimensions can be up to about 50 or so. Is that high?

Also, the points are not randomly distributed; they are
highly-connected (i.e., they tend to be adjacent to each other).
Basically, one possible representation of the keys is as an
n-dimensional array of integers. (They are not natively in that form,
but can be compressed into that form easily.) And the way the
algorithm works is that if the key (p1,p2,p3,...) is accessed, then
the next key (q1,q2,q3,...) accessed is likely to differ from
(p1,p2,p3,...) only in a small number of coordinates, and the
difference in coordinates is likely to be very small (usually in the
+1/-1 range).

Actually, now that I think of it... the problem may be far more
tractable than I realized. I was previously thinking of some kind of
n-dimensional space-partitioning storage scheme, where the state space
is partitioned into n-dimensional cube buckets, and points are placed
in their corresponding buckets (see the sketch below). However, this
had the problem that the n-cube contains 2^n vertices, so the bucket
sizes would have to be k^n, and since I can't control n, it makes it
difficult to map bucket sizes to page sizes for optimal I/O
performance. For large n, the bucket size would be unmanageably big
(e.g., in the 50-dimensional case).

But it just dawned on me that partitioning into n-cubes is
unnecessary, because most of the time, I only care about points that
differ in a small number of coordinates from the current point. So a
more efficient bucket shape would be something closer to an n-cross
(the higher-dimensional equivalent of the octahedron), which only has
2*n vertices, or one of its derivatives. This makes bucket sizes far
easier to control, and far more efficient to store. The tricky part is
only: how to partition n-space into n-cross-shaped pieces? It would
have to be some kind of tiling (otherwise there'd be ambiguity over
which bucket a given point falls into). So now I have to do some
research into n-dimensional tilings... :P


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90
degrees and try again."
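
For concreteness, the straightforward n-cube bucketing described above
can be sketched like this in D (the cell width k is an arbitrary choice
for the example); the n-cross tiling would need a different mapping,
which is exactly the open question.

import std.stdio;

// Divide every coordinate by a cell width k, so points whose coordinates
// differ only by small amounts (+1/-1) usually land in the same bucket or
// an adjacent one. As noted above, such a bucket covers k^n points, which
// is the problem for large n.
enum k = 4;

int[] bucketOf(const int[] point)
{
    auto b = new int[point.length];
    foreach (i, c; point)
        b[i] = c / k;
    return b;
}

void main()
{
    auto p = [10, 3, 7, 200, 41];
    auto q = [10, 4, 7, 200, 40]; // differs by +-1 in two coordinates
    writeln(bucketOf(p)); // [2, 0, 1, 50, 10]
    writeln(bucketOf(q)); // [2, 1, 1, 50, 10] -- same or neighbouring bucket
}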
Dec 17 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
 On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.

I did some online research, and it seems that hashtables perform
poorly on disk, because the usual hash functions cause random
scattering of related data (which are likely to be accessed with
higher temporal locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain
to implement), and also only exhibit good locality if table entries
are accessed sequentially; the problem is I'm working with
high-dimensional data and the order of accesses is unlikely to be
sequential.  However, they do exhibit good spatial locality in
higher-dimensional space (i.e., if entry X is accessed first, then
the next entry Y is quite likely to be close to X in that space).
Does anybody know of a good data structure that can take advantage of
this fact to minimize disk accesses?


T
As a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.
Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)
I can't really add anything on the fancy data structures/algorithms,
it's all a bit out of my league.

That said, have you considered asynchronously maintaining the locality
in memory, i.e. prefetching? If you have predictable access and good
locality it should perform well. In particular, you can build a
prefetcher that is tuned to the particular access pattern.
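
A minimal sketch of such a prefetcher in D, using std.parallelism to
read a predicted page in the background while the current one is
processed; the file name, page size and the "next page" prediction are
placeholders for whatever the real access-pattern model would supply.

import std.parallelism : task, taskPool;
import std.stdio;
static import std.file;

enum pageSize = 4096;

ubyte[] readPage(string file, ulong pageNo)
{
    auto f = File(file, "rb");
    f.seek(pageNo * pageSize);
    auto buf = new ubyte[pageSize];
    return f.rawRead(buf);
}

void main()
{
    string file = "table.dat";                       // hypothetical data file
    std.file.write(file, new ubyte[pageSize * 16]);  // dummy file so the sketch runs

    ulong current = 0;
    foreach (step; 0 .. 10)
    {
        // Guess the next page from spatial locality; here just "the next one".
        ulong predicted = current + 1;
        auto prefetch = task!readPage(file, predicted);
        taskPool.put(prefetch);              // start reading in the background

        auto page = readPage(file, current); // process the current page
        // ... lookups against `page` would go here ...

        auto nextPage = prefetch.yieldForce(); // ready (or nearly so) by now;
        current = predicted;                   // a real loop would reuse nextPage
    }
}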
Dec 17 2013
prev sibling parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
 On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 Another question is can your queries be batched?  If that is the
 case and your data is bigger than your available memory, then try
 Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
I should clarify a bit what I meant by batched. If you have a sequence
of queries q_1, q_2, ..., q_n (where I guess in your case a query can
be/is an insertion), then by 'batch' process what I mean is you do not
need to get the answers on the fly. Is it sufficient to have all
queries answered when the program terminates?

Note that these techniques still account for the 'order' of the
queries, so the answer to query q_i will be reported correctly (ie.
same as if the queries were performed sequentially on a standard data
structure).

Also, if you are still considering using hashing, the following paper
may be of interest to you (I haven't read it in detail, but it doesn't
look too crazy):

http://arxiv.org/abs/1104.2799
Dec 18 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Dec 18, 2013 at 03:03:20PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
Another question is can your queries be batched?  If that is the
case and your data is bigger than your available memory, then try
Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
I should clarify a bit what I meant by batched. If you have a sequence
of queries q_1, q_2, ..., q_n (where I guess in your case a query can
be/is an insertion), then by 'batch' process what I mean is you do not
need to get the answers on the fly. Is it sufficient to have all
queries answered when the program terminates?
Hmm. This is an interesting thought. In a sense, queries don't have to
be answered immediately; the program can generate a series of queries
and then continue processing the next item, and the answer can be
asynchronous. When the answer becomes available, then further
processing will happen (add new items to be processed, update existing
items, etc.). The only requirement is that queries will always be
answered in the order they were made (the correctness of the algorithm
depends on this).

In a sense, what I need is really just a single operation
"add-or-update", which can be processed in bulk. Either an item is
already in the table, in which case it gets updated in a certain way,
or it isn't, in which case it is added. In both cases, some additional
processing takes place after the table is updated, which may create
more items to be processed.
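
A toy sketch of that add-or-update batching in D: operations are
buffered, a whole batch is applied in key order (which is what a
disk-backed table would want), and the per-operation result is reported
back in the original submission order. The associative array stands in
for the on-disk structure, and the batch size is arbitrary.

import std.algorithm : sort;
import std.array : array;
import std.range : iota;
import std.stdio;

struct Op
{
    string key;
    int delta;
    bool wasPresent;   // filled in when the batch is applied
}

struct BatchedTable
{
    int[string] table;   // stand-in for the on-disk structure
    Op[] pending;

    void addOrUpdate(string key, int delta)
    {
        pending ~= Op(key, delta);
        if (pending.length >= 4)   // tiny batch size, just for the example
            flush();
    }

    void flush()
    {
        // Apply the batch in key order (friendly to a disk-based table)...
        auto order = iota(pending.length).array;
        sort!((a, b) => pending[a].key < pending[b].key)(order);
        foreach (i; order)
        {
            pending[i].wasPresent = (pending[i].key in table) !is null;
            table[pending[i].key] += pending[i].delta;
        }
        // ...but report the results back in submission order.
        foreach (op; pending)
            writefln("%s was %s, count now %s", op.key,
                     op.wasPresent ? "present" : "new", table[op.key]);
        pending.length = 0;
    }
}

void main()
{
    BatchedTable t;
    foreach (k; ["b", "a", "b", "c", "a"])
        t.addOrUpdate(k, 1);
    t.flush();   // flush whatever is left over
}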
 Note that these techniques still account for the 'order' of the
 queries, so the answer to query q_i will be reported correctly
 (ie. same as if the queries were performed sequentially on a
 standard data structure).
Then this is OK.
 Also, if you are still considering using hashing the following
 paper may be of interest to you (I haven't read it in detail, but
 it doesn't look too crazy)
 
 http://arxiv.org/abs/1104.2799
Thanks for the references! I'll look into this, it seems quite
relevant.


T

-- 
You have to expect the unexpected. -- RL
Dec 18 2013
prev sibling next sibling parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
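
The simplest form of "rely on paging" is to memory-map the whole table
file and let the OS fault pages in and out on demand, e.g. with
std.mmfile. The fixed 200-byte record layout and the file name below
are assumptions for the sketch.

import std.mmfile;
import std.stdio;

enum recordSize = 200; // ~100-byte key + ~100-byte value

void main()
{
    // Create (or enlarge) a file big enough for 10 million fixed-size
    // records -- about 2 GB, matching the figures in the thread.
    enum nRecords = 10_000_000UL;
    auto mm = new MmFile("table.dat", MmFile.Mode.readWriteNew,
                         nRecords * recordSize, null);

    // The whole table is now addressable as one big byte array;
    // touching a record pulls in just that page.
    ubyte[] data = cast(ubyte[]) mm[];
    ubyte[] record = data[42 * recordSize .. 43 * recordSize];
    record[0] = 0xFF; // written back to disk when the OS evicts the page

    writeln("mapped ", data.length, " bytes");
}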
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 09:02:41PM +0100, digitalmars-d-bounces puremagic.com
wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.
But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
Hmm. You're right, I think I underestimated my data size. :P

The 10 million figure is based on problems that my current program
successfully solved (with an in-memory hashtable). I guess that larger
problems must far exceed that number (I ran out of memory so I
couldn't check just how big it got before I killed the process).
Suffice it to say that this is a combinatorial problem, so the number
of entries grows exponentially; anything that can help reduce the
storage requirements / I/O latency would be a big help.


T

-- 
VI = Visual Irritation
Dec 17 2013
parent "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 20:54:56 UTC, H. S. Teoh wrote:
 big it got before I killed the process). Suffice it to say that this
 is a combinatorial problem, so the number of entries grows
 exponentially; anything that can help reduce the storage requirements
 / I/O latency would be a big help.
If you can buffer the queries in a queue then you can issue a
prefetch-request to the OS to bring in the memory page from disk when
you put it into the queue, to prevent the process from being put to
sleep. The length of the queue has to be tailored to how fast it can
load the page.

If the data is uniformly distributed then perhaps you could partition
the disk space with an n-dimensional grid, and then have a key-value
store that you page into memory?

If you can do some queries out of order then you probably should set
up some "buckets"/"bins" in the queue and prefetch the page that is
referenced by the fullest/oldest bucket, just to avoid pages being
pushed in and out of memory all the time.

Otherwise a kd-tree with a scaled grid per leaf that stays within a
memory page could probably work, but that sounds like a lot of work.
Implementing n-dimensional datastructures is conceptually easy, but
tedious in reality (can you trust it to be bug free? It is hard to
visualize in text.).

If you have to do the spatial datastructure yourself, then perhaps an
octree or its n-dimensional equivalent would be helpful. You could
make it shallow and in memory and use it both to buffer queries and to
store indexes to disk data. It would be 2^N children per node (2D data
has 4, 3D has 8).

I once read a paper on mapping GIS data to a regular database (mapping
2D to 1D) using Hilbert curves. Not sure if that is of any help. Hm.
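
On the Hilbert-curve point: a Z-order (Morton) code is a simpler cousin
with weaker but still useful locality, and it is easy to get right. The
sketch below handles only two dimensions, purely to illustrate mapping
a grid cell to a single 1-D key that a B-tree or flat file can then be
sorted on.

import std.stdio;

// Interleave the bits of the coordinates so that nearby grid cells tend
// to get nearby 1-D keys.
ulong morton2(uint x, uint y)
{
    // Spread the 32 bits of v out to every other bit of a 64-bit value.
    ulong spread(uint v)
    {
        ulong r = v;
        r = (r | (r << 16)) & 0x0000FFFF0000FFFFUL;
        r = (r | (r << 8))  & 0x00FF00FF00FF00FFUL;
        r = (r | (r << 4))  & 0x0F0F0F0F0F0F0F0FUL;
        r = (r | (r << 2))  & 0x3333333333333333UL;
        r = (r | (r << 1))  & 0x5555555555555555UL;
        return r;
    }
    return spread(x) | (spread(y) << 1);
}

void main()
{
    writefln("%b", morton2(3, 5)); // bits of 3 and 5 interleaved
    // Neighbouring cells -> usually nearby keys:
    writeln(morton2(100, 200), " ", morton2(101, 200), " ", morton2(100, 201));
}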
Dec 17 2013
prev sibling next sibling parent reply "Dylan Knutson" <tcdknutson gmail.com> writes:
It's not a datastructure per se, but Google's LevelDB is a very 
good, quite fast arbitrary key-value storage system (I use it in 
my project relating a 1 word key to a 200 word value) supporting 
batch transactional operations.

Think of it as a key-value storage version of SQLite.

https://code.google.com/p/leveldb/

And then the D bindings:
https://github.com/bheads/d-leveldb

And a D wrapper providing a nicer interface to LevelDB:
https://github.com/bheads/leveldb
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 09:43:37PM +0100, Dylan Knutson wrote:
It's not a datastructure per se, but Google's LevelDB is a very good,
 quite fast arbitrary key-value storage system (I use it in my
 project relating a 1 word key to a 200 word value) supporting batch
 transactional operations.
[...] I'm not looking for a transactional DB, though; I'm looking for
a fast key-value storage system with very high insert/query
throughput, and hopefully modest space requirements. Deletion is not
needed, though for very large problems it might be useful to reduce
disk space usage. It must be able to handle huge numbers of insertions
(say 10 million entries for small/moderate problem instances, perhaps
in the billions or trillions if I use it on a large problem instance),
and must be able to do so very fast -- the faster the better.
Preferably, it would take advantage of the n-dimensional locality that
I described in my other post.

Basically, the idea is to reduce I/O roundtrips by caching disk pages
in memory, because there are far more entries than will fit in memory
all at once. But since the number of entries is very large, to prevent
thrashing I have to maximize the number of cache hits. It does no good
if I have to load 100 disk pages to answer 100 queries; if 100 pages
fit in RAM, I'd like most of the entries in those 100 pages to be used
in answering queries before they get swapped out for new pages to be
loaded in.

In the ideal case (probably not attainable), those 100 pages will
contain exactly the next, say, million entries I'm about to query, so
that once I'm done with them, I can swap those pages out and never
have to load them back in again, making room for the next 100 pages to
be loaded and queried. So the layout of the disk pages should
correspond as closely as possible to the order in which the algorithm
will be looking up entries. (Unfortunately it's not 100% predictable,
as it depends on the problem being solved, but at least there are
general trends we can take advantage of.)

A generic storage scheme like SQLite will probably not perform as fast
as I'd like.


T

-- 
It is impossible to make anything foolproof because fools are so
ingenious. -- Sammy
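
The page-caching behaviour being described is essentially an LRU cache
of disk pages. A toy version in D looks like this; capacity, page size
and the loader callback are placeholders, and a real implementation
would pair this with a page layout that matches the expected access
order.

import std.stdio;

// Fixed number of disk pages kept in memory, evicting the least
// recently used one when full.
struct PageCache
{
    enum capacity = 100;            // "100 pages fit in RAM" from the post
    ubyte[][ulong] pages;           // page number -> page contents
    ulong[ulong] lastUsed;          // page number -> logical timestamp
    ulong clock;

    ubyte[] getPage(ulong pageNo, ubyte[] delegate(ulong) load)
    {
        if (auto p = pageNo in pages)
        {
            lastUsed[pageNo] = ++clock;   // cache hit
            return *p;
        }
        if (pages.length >= capacity)     // evict the least recently used page
        {
            ulong victim;
            ulong oldest = ulong.max;
            foreach (no, t; lastUsed)
                if (t < oldest) { oldest = t; victim = no; }
            pages.remove(victim);
            lastUsed.remove(victim);
        }
        auto page = load(pageNo);         // the actual disk read
        pages[pageNo] = page;
        lastUsed[pageNo] = ++clock;
        return page;
    }
}

void main()
{
    PageCache cache;
    // Fake loader standing in for a real seek+read.
    auto loader = delegate(ulong no) {
        writeln("loading page ", no);
        return new ubyte[4096];
    };
    cache.getPage(7, loader);  // miss: prints "loading page 7"
    cache.getPage(7, loader);  // hit: no load message
}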
Dec 17 2013
parent "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 23:35:40 UTC, H. S. Teoh wrote:
 to answer 100 queries; if 100 pages fit in RAM, I'd like most
On OS X a page is 4K, so there are only 20 entries per page. If you
can get hold of an SSD (to avoid rotational latency) and use a
buffer/queue while waiting, then you should be OK?
Dec 17 2013
prev sibling next sibling parent luka8088 <luka8088 owave.net> writes:
On 17.12.2013. 20:08, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)
 
 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
 
 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be access with higher temporal
 locality), which incurs lots of disk seeks.
 
 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?
 
 
 T
 
sqlite file format seems to be fairly documented: http://www.sqlite.org/fileformat.html
Dec 17 2013
prev sibling parent reply "Francesco Cattoglio" <francesco.cattoglio gmail.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
What kind of key do you have? vector of floats? integers?

How about some uberfast compression algorithm (eg: LZ4) to compress
pages when you have to offload data to the disk?
Dec 17 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Dec 18, 2013 at 01:14:58AM +0100, Francesco Cattoglio wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.
What kind of key do you have? vector of floats? integers?
Vector of (relatively small) integers.
 How about some uberfast compression algorithm (eg: LZ4) to compress
 pages when you have to offload data to the disk?
That's a good idea, actually: the keys are very similar to each other,
so they should compress very well when many are placed together in a
single page.


T

-- 
The diminished 7th chord is the most flexible and fear-instilling
chord. Use it often, use it unsparingly, to subdue your listeners into
submission!
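
As a quick illustration of how well a page of near-identical keys
compresses, here is a small D test using std.zlib as a stand-in for LZ4
(LZ4 itself would need a third-party binding; the point is the ratio,
not the codec). The fake key layout is an assumption for the sketch.

import std.stdio;
import std.zlib : compress, uncompress;

void main()
{
    // Fake "page": 1000 keys of 100 bytes that differ only in two bytes.
    ubyte[] page;
    foreach (ubyte i; 0 .. 100)
        foreach (j; 0 .. 10)
        {
            ubyte[100] key = 7;      // mostly identical bytes
            key[0] = i;              // small per-key differences
            key[1] = cast(ubyte) j;
            page ~= key[];
        }

    auto packed = compress(page);
    writefln("raw: %s bytes, compressed: %s bytes", page.length, packed.length);

    auto restored = cast(ubyte[]) uncompress(packed, page.length);
    assert(restored == page);
}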
Dec 17 2013