
digitalmars.D - [OT] Efficient file structure for very large lookup tables?

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will be
very frequent.

I did some online research, and it seems that hashtables perform poorly
on disk, because the usual hash functions cause random scattering of
related data (which are likely to be accessed with higher temporal
locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain to
implement), and also only exhibit good locality if table entries are
accessed sequentially; the problem is I'm working with high-dimensional
data and the order of accesses is unlikely to be sequential. However,
they do exhibit good spatial locality in higher-dimensional space (i.e.,
if entry X is accessed first, then the next entry Y is quite likely to
be close to X in that space).  Does anybody know of a good data
structure that can take advantage of this fact to minimize disk
accesses?


T

-- 
Computers are like a jungle: they have monitor lizards, rams, mice, c-moss,
binary trees... and bugs.
Dec 17 2013
next sibling parent reply "tn" <no email.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 However, they do exhibit good spatial locality in higher-dimensional
 space (i.e., if entry X is accessed first, then the next entry Y is
 quite likely to be close to X in that space).  Does anybody know of a
 good data structure that can take advantage of this fact to minimize
 disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
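
To make the suggestion concrete, here is a minimal in-memory k-d tree in D,
insert and exact lookup only; DIM (the fixed dimensionality) and the integer
key type are assumptions for the sketch. A disk-based or cache-oblivious
variant would additionally lay subtrees out in page-sized blocks instead of
allocating nodes individually.

import std.stdio;

// Minimal in-memory k-d tree over fixed-dimension integer keys:
// insert and exact lookup only. DIM is assumed for the example.
enum DIM = 4;

struct Node
{
    int[DIM] key;
    Node* left, right;
}

Node* insert(Node* root, int[DIM] key, size_t depth = 0)
{
    if (root is null)
        return new Node(key);
    immutable axis = depth % DIM;   // cycle through the dimensions
    if (key[axis] < root.key[axis])
        root.left = insert(root.left, key, depth + 1);
    else
        root.right = insert(root.right, key, depth + 1);
    return root;
}

bool contains(Node* root, int[DIM] key, size_t depth = 0)
{
    if (root is null) return false;
    if (root.key == key) return true;
    immutable axis = depth % DIM;
    return key[axis] < root.key[axis]
        ? contains(root.left, key, depth + 1)
        : contains(root.right, key, depth + 1);
}

void main()
{
    Node* root;
    root = insert(root, [1, 2, 3, 4]);
    root = insert(root, [1, 2, 3, 5]);
    writeln(contains(root, [1, 2, 3, 5])); // true
    writeln(contains(root, [9, 9, 9, 9])); // false
}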
Dec 17 2013
parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 However, they do exhibit good spatial locality in higher-dimensional
 space (i.e., if entry X is accessed first, then the next entry Y is
 quite likely to be close to X in that space).  Does anybody know of a
 good data structure that can take advantage of this fact to minimize
 disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
It's not cache-oblivious, but he could try the KDB tree:

http://dl.acm.org/citation.cfm?id=582321

Or, again for batched queries, there is the Bkd-tree (search on
Bkd-tree and Lars Arge) - this is a type of buffer tree, so the
operations need to be able to be run as a batch.
Dec 17 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
However, they do exhibit good spatial locality in higher-dimensional
space (i.e., if entry X is accessed first, then the next entry Y is
quite likely to be close to X in that space).  Does anybody know of
a good data structure that can take advantage of this fact to
minimize disk accesses?
Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
It's not cache-oblivious, but he could try the KDB tree:
http://dl.acm.org/citation.cfm?id=582321 or, again for batched
queries, there is the Bkd-tree (search on Bkd-tree and Lars Arge) -
this is a type of buffer tree, so the operations need to be able to
be run as a batch.
I skimmed over the KDB tree paper, and am reading the Bkd-tree paper
now; it appears that the latter is an improvement over the former, and
has better update performance characteristics. However, it also
appears much more complicated to implement.

Another point I forgot to mention, is that currently I only ever
perform insertion and queries, and the two can actually be combined
(basically the table is kept for uniqueness testing). I don't need to
delete entries from the table. Eventually I might look into that to
reduce disk space requirements, but as far as the actual algorithm is
concerned, it's not necessary. So there's potentially a good amount of
room here for optimization by foregoing the ability to delete entries.

I'm currently considering using simpler, "dumber" file structures for
storing nodes, with an in-memory Bloom filter to eliminate a
(hopefully) large number of I/O roundtrips, since I'm expecting that a
good fraction of queries will return 'not found'.


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on
a California license plate on a VW Beetle: 'FEATURE'..."
-- Joshua D. Wachs - Natural Intelligence, Inc.
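
For reference, the in-memory Bloom filter idea can be sketched in a few
lines of D. The bit count, the number of hashes and the use of seeded
FNV-1a as the hash are assumptions for the sketch; the point is that a
negative answer means the key is definitely not on disk, so that lookup
needs no I/O at all.

import std.stdio;

// Minimal Bloom filter: answers "definitely absent" or "maybe present".
struct BloomFilter
{
    private ulong[] bits;
    private size_t nbits;
    private size_t nhashes;

    this(size_t nbits, size_t nhashes)
    {
        this.nbits = nbits;
        this.nhashes = nhashes;
        bits = new ulong[(nbits + 63) / 64];
    }

    // Seeded FNV-1a, so one function gives us k different hashes.
    private size_t hash(const(ubyte)[] key, ulong seed) const
    {
        ulong h = 14695981039346656037UL ^ seed;
        foreach (b; key)
        {
            h ^= b;
            h *= 1099511628211UL;
        }
        return cast(size_t)(h % nbits);
    }

    void insert(const(ubyte)[] key)
    {
        foreach (i; 0 .. nhashes)
        {
            auto bit = hash(key, i);
            bits[bit / 64] |= 1UL << (bit % 64);
        }
    }

    // false => definitely absent (skip the disk); true => go check the disk.
    bool maybeContains(const(ubyte)[] key) const
    {
        foreach (i; 0 .. nhashes)
        {
            auto bit = hash(key, i);
            if (!(bits[bit / 64] & (1UL << (bit % 64))))
                return false;
        }
        return true;
    }
}

void main()
{
    auto bf = BloomFilter(1 << 20, 4); // ~1M bits, 4 hashes (arbitrary)
    bf.insert(cast(const(ubyte)[]) "some key");
    writeln(bf.maybeContains(cast(const(ubyte)[]) "some key"));  // true
    writeln(bf.maybeContains(cast(const(ubyte)[]) "other key")); // very likely false
}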
Dec 17 2013
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.

 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be accessed with higher temporal
 locality), which incurs lots of disk seeks.

 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?


 T
A burst trie might work, although it is originally designed for text. I
haven't had a chance to try one out in code yet, but the paper by
Heinz, Zobel, and Williams is interesting:

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499

Joseph
Dec 17 2013
prev sibling next sibling parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.

 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be accessed with higher temporal
 locality), which incurs lots of disk seeks.

 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?


 T
As a first attempt could you use a key-value database (like REDIS, if
you have enough memory to fit everything in)? Or is that out of the
question?

Another question is: can your queries be batched? If that is the case
and your data is bigger than your available memory, then try Googling
"Lars Arge Buffer Tree", which might work well. However, if you thought
implementing a B-tree was going to be painful, that might not appeal to
you. If you don't want to implement that yourself you could look at
TPIE:

http://www.madalgo.au.dk/tpie/

although it is in C++.

If I had to design something quick on the spot, my first guess would be
to use a grid on the first two dimensions and then bin the 'points' or
keys within each grid square and build a simpler structure on those (a
rough sketch follows below). This won't work so well though for really
high dimension data or if the 'points' are randomly distributed.

Also, what exactly do you mean by "in that space" when you say:

"if entry X is accessed first, then the next entry Y is quite likely
to be close to X in that space".

Do you mean that the value of Y in the next dimension is numerically
close (or expected to be) to X?

Cheers,
Craig
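
A rough sketch of the grid-binning idea in D, assuming the keys are
vectors of integers; the cell size, the cell-id formula and the
in-memory bins standing in for on-disk storage are all placeholders.

import std.stdio;

// Bucket each key by a coarse grid cell computed from its first two
// coordinates; entries in the same cell would then be stored together
// (e.g. one file or page run per cell). Cell size is a tuning knob.
enum cellSize = 16;

struct Entry
{
    int[] key;      // the n-dimensional key
    ubyte[] value;
}

// Map a key to a 2-D grid cell based on its first two coordinates.
long cellOf(const int[] key)
{
    assert(key.length >= 2);
    long cx = key[0] / cellSize;
    long cy = key[1] / cellSize;
    return cx * 1_000_003 + cy; // simple cell id; a real layout would index this
}

void main()
{
    Entry[][long] bins;  // cell id -> entries in that cell (stand-in for disk bins)

    auto e = Entry([40, 7, 3, 3], cast(ubyte[]) "payload".dup);
    bins[cellOf(e.key)] ~= e;

    writeln("cell: ", cellOf(e.key), " entries: ", bins[cellOf(e.key)].length);
}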
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.

I did some online research, and it seems that hashtables perform
poorly on disk, because the usual hash functions cause random
scattering of related data (which are likely to be accessed with higher
temporal locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain
to implement), and also only exhibit good locality if table entries
are accessed sequentially; the problem is I'm working with
high-dimensional data and the order of accesses is unlikely to be
sequential.  However, they do exhibit good spatial locality in
higher-dimensional space (i.e., if entry X is accessed first, then
the next entry Y is quite likely to be close to X in that space).
Does anybody know of a good data structure that can take advantage of
this fact to minimize disk accesses?


T
As a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.
Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)
 Another question is can your queries be batched?  If that is the
 case and your data is bigger than your available memory, then try
 Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help? [...]
 If I had to design something quick on the spot, my first guess would
 be to use a grid on the first two dimensions and then bin the
 'points' or keys within each grid square and build a simpler
 structure on those.  This won't work so well though for really high
 dimension data or if the 'points' are randomly distributed.

 Also, what exactly do you mean by "in that space" when you say:
 
 "if entry X is accessed first, then the next entry Y is quite likely
 to be close to X in that space".
 
 Do you mean that the value of Y in the next dimension is numerically
 close (or expected to be) to X?
[...] How many dimensions is considered "really high"? The number of
dimensions can be up to about 50 or so. Is that high?

Also, the points are not randomly distributed; they are
highly-connected (i.e., they tend to be adjacent to each other).
Basically, one possible representation of the keys is as an
n-dimensional array of integers. (They are not natively in that form,
but can be compressed into that form easily.) And the way the
algorithm works is that if the key (p1,p2,p3,...) is accessed, then
the next key (q1,q2,q3,...) accessed is likely to differ from
(p1,p2,p3,...) only in a small number of coordinates, and the
difference in coordinates is likely to be very small (usually in the
+1/-1 range).

Actually, now that I think of it... the problem may be far more
tractable than I realized. I was previously thinking of some kind of
n-dimensional space-partitioning storage scheme, where the state space
is partitioned into n-dimensional cube buckets, and points are placed
in their corresponding buckets (see the sketch below). However, this
had the problem that the n-cube contains 2^n vertices, so the bucket
sizes would have to be k^n, and since I can't control n, it makes it
difficult to map bucket sizes to page sizes for optimal I/O
performance. For large n, the bucket size would be unmanageably big
(e.g., in the 50-dimensional case).

But it just dawned on me that partitioning into n-cubes is
unnecessary, because most of the time, I only care about points that
differ in a small number of coordinates from the current point. So a
more efficient bucket shape would be something closer to an n-cross
(the higher-dimensional equivalent of the octahedron), which only has
2*n vertices, or one of its derivatives. This makes bucket sizes far
easier to control, and far more efficient to store. The tricky part is
only: how to partition n-space into n-cross-shaped pieces? It would
have to be some kind of tiling (otherwise there'd be ambiguity over
which bucket a given point falls into). So now I have to do some
research into n-dimensional tilings... :P


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90
degrees and try again."
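
For concreteness, the straightforward n-cube bucketing described above
can be sketched like this in D (the cell width k is an arbitrary choice
for the example); the n-cross tiling would need a different mapping,
which is exactly the open question.

import std.stdio;

// Divide every coordinate by a cell width k, so points whose coordinates
// differ only by small amounts (+1/-1) usually land in the same bucket or
// an adjacent one. As noted above, such a bucket covers k^n points, which
// is the problem for large n.
enum k = 4;

int[] bucketOf(const int[] point)
{
    auto b = new int[point.length];
    foreach (i, c; point)
        b[i] = c / k;
    return b;
}

void main()
{
    auto p = [10, 3, 7, 200, 41];
    auto q = [10, 4, 7, 200, 40]; // differs by +-1 in two coordinates
    writeln(bucketOf(p)); // [2, 0, 1, 50, 10]
    writeln(bucketOf(q)); // [2, 1, 1, 50, 10] -- same or neighbouring bucket
}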
Dec 17 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
 On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.

I did some online research, and it seems that hashtables perform
poorly on disk, because the usual hash functions cause random
scattering of related data (which are likely to be accessed with
higher temporal locality), which incurs lots of disk seeks.

I thought about B-trees, but they have high overhead (and are a pain
to implement), and also only exhibit good locality if table entries
are accessed sequentially; the problem is I'm working with
high-dimensional data and the order of accesses is unlikely to be
sequential.  However, they do exhibit good spatial locality in
higher-dimensional space (i.e., if entry X is accessed first, then
the next entry Y is quite likely to be close to X in that space).
Does anybody know of a good data structure that can take advantage of
this fact to minimize disk accesses?


T
As a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.
Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)
I can't really add anything on the fancy data structures/algorithms,
it's all a bit out of my league.

That said, have you considered asynchronously maintaining the locality
in memory, i.e. prefetching? If you have predictable access and good
locality it should perform well. In particular, you can build a
prefetcher that is tuned to the particular access pattern.
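
A minimal sketch of such a prefetcher in D, using std.parallelism to
read a predicted page in the background while the current one is
processed; the file name, page size and the "next page" prediction are
placeholders for whatever the real access-pattern model would supply.

import std.parallelism : task, taskPool;
import std.stdio;
static import std.file;

enum pageSize = 4096;

ubyte[] readPage(string file, ulong pageNo)
{
    auto f = File(file, "rb");
    f.seek(pageNo * pageSize);
    auto buf = new ubyte[pageSize];
    return f.rawRead(buf);
}

void main()
{
    string file = "table.dat";                       // hypothetical data file
    std.file.write(file, new ubyte[pageSize * 16]);  // dummy file so the sketch runs

    ulong current = 0;
    foreach (step; 0 .. 10)
    {
        // Guess the next page from spatial locality; here just "the next one".
        ulong predicted = current + 1;
        auto prefetch = task!readPage(file, predicted);
        taskPool.put(prefetch);              // start reading in the background

        auto page = readPage(file, current); // process the current page
        // ... lookups against `page` would go here ...

        auto nextPage = prefetch.yieldForce(); // ready (or nearly so) by now;
        current = predicted;                   // a real loop would reuse nextPage
    }
}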
Dec 17 2013
prev sibling parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
 On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
 Another question is can your queries be batched?  If that is the
 case and your data is bigger than your available memory, then try
 Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
I should clarify a bit what I meant by batched. If you have a sequence
of queries q_1, q_2, ..., q_n (where I guess in your case a query can
be/is an insertion), then by 'batch' process what I mean is you do not
need to get the answers on the fly. Is it sufficient to have all
queries answered when the program terminates?

Note that these techniques still account for the 'order' of the
queries, so the answer to query q_i will be reported correctly (ie.
same as if the queries were performed sequentially on a standard data
structure).

Also, if you are still considering using hashing, the following paper
may be of interest to you (I haven't read it in detail, but it doesn't
look too crazy):

http://arxiv.org/abs/1104.2799
Dec 18 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Dec 18, 2013 at 03:03:20PM +0100, Craig Dillabaugh wrote:
 On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
Another question is can your queries be batched?  If that is the
case and your data is bigger than your available memory, then try
Googling "Lars Arge Buffer Tree" which might work well.
Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
I should clarify a bit what I meant by batched. If you have a sequence
of queries q_1, q_2, ..., q_n (where I guess in your case a query can
be/is an insertion), then by 'batch' process what I mean is you do not
need to get the answers on the fly. Is it sufficient to have all
queries answered when the program terminates?
Hmm. This is an interesting thought. In a sense, queries don't have to
be answered immediately; the program can generate a series of queries
and then continue processing the next item, and the answer can be
asynchronous. When the answer becomes available, then further
processing will happen (add new items to be processed, update existing
items, etc.). The only requirement is that queries will always be
answered in the order they were made (the correctness of the algorithm
depends on this).

In a sense, what I need is really just a single operation
"add-or-update", which can be processed in bulk. Either an item is
already in the table, in which case it gets updated in a certain way,
or it isn't, in which case it is added. In both cases, some additional
processing takes place after the table is updated, which may create
more items to be processed.
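
A toy sketch of that add-or-update batching in D: operations are
buffered, a whole batch is applied in key order (which is what a
disk-backed table would want), and the per-operation result is reported
back in the original submission order. The associative array stands in
for the on-disk structure, and the batch size is arbitrary.

import std.algorithm : sort;
import std.array : array;
import std.range : iota;
import std.stdio;

struct Op
{
    string key;
    int delta;
    bool wasPresent;   // filled in when the batch is applied
}

struct BatchedTable
{
    int[string] table;   // stand-in for the on-disk structure
    Op[] pending;

    void addOrUpdate(string key, int delta)
    {
        pending ~= Op(key, delta);
        if (pending.length >= 4)   // tiny batch size, just for the example
            flush();
    }

    void flush()
    {
        // Apply the batch in key order (friendly to a disk-based table)...
        auto order = iota(pending.length).array;
        sort!((a, b) => pending[a].key < pending[b].key)(order);
        foreach (i; order)
        {
            pending[i].wasPresent = (pending[i].key in table) !is null;
            table[pending[i].key] += pending[i].delta;
        }
        // ...but report the results back in submission order.
        foreach (op; pending)
            writefln("%s was %s, count now %s", op.key,
                     op.wasPresent ? "present" : "new", table[op.key]);
        pending.length = 0;
    }
}

void main()
{
    BatchedTable t;
    foreach (k; ["b", "a", "b", "c", "a"])
        t.addOrUpdate(k, 1);
    t.flush();   // flush whatever is left over
}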
 Note that these techniques still account for the 'order' of the
 queries, so the answer to query q_i will be reported correctly
 (ie. same as if the queries were performed sequentially on a
 standard data structure).
Then this is OK.
 Also, if you are still considering using hashing the following
 paper may be of interest to you (I haven't read it in detail, but
 it doesn't look too crazy)
 
 http://arxiv.org/abs/1104.2799
Thanks for the references! I'll look into this, it seems quite
relevant.


T

-- 
You have to expect the unexpected. -- RL
Dec 18 2013
prev sibling next sibling parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
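
The simplest form of "rely on paging" is to memory-map the whole table
file and let the OS fault pages in and out on demand, e.g. with
std.mmfile. The fixed 200-byte record layout and the file name below
are assumptions for the sketch.

import std.mmfile;
import std.stdio;

enum recordSize = 200; // ~100-byte key + ~100-byte value

void main()
{
    // Create (or enlarge) a file big enough for 10 million fixed-size
    // records -- about 2 GB, matching the figures in the thread.
    enum nRecords = 10_000_000UL;
    auto mm = new MmFile("table.dat", MmFile.Mode.readWriteNew,
                         nRecords * recordSize, null);

    // The whole table is now addressable as one big byte array;
    // touching a record pulls in just that page.
    ubyte[] data = cast(ubyte[]) mm[];
    ubyte[] record = data[42 * recordSize .. 43 * recordSize];
    record[0] = 0xFF; // written back to disk when the OS evicts the page

    writeln("mapped ", data.length, " bytes");
}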
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 09:02:41PM +0100, digitalmars-d-bounces puremagic.com
wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.
But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
Hmm. You're right, I think I underestimated my data size. :P

The 10 million figure is based on problems that my current program
successfully solved (with an in-memory hashtable). I guess that larger
problems must far exceed that number (I ran out of memory so I
couldn't check just how big it got before I killed the process).
Suffice it to say that this is a combinatorial problem, so the number
of entries grows exponentially; anything that can help reduce the
storage requirements / I/O latency would be a big help.


T

-- 
VI = Visual Irritation
Dec 17 2013
parent "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 20:54:56 UTC, H. S. Teoh wrote:
 big it got before I killed the process). Suffice it to say that this
 is a combinatorial problem, so the number of entries grows
 exponentially; anything that can help reduce the storage requirements
 / I/O latency would be a big help.
If you can buffer the queries in a queue then you can issue a
prefetch-request to the OS to bring in the memory page from disk when
you put it into the queue, to prevent the process from being put to
sleep. The length of the queue has to be tailored to how fast it can
load the page.

If the data is uniformly distributed then perhaps you could partition
the disk space with an n-dimensional grid, and then have a key-value
store that you page into memory?

If you can do some queries out of order then you probably should set
up some "buckets"/"bins" in the queue and prefetch the page that is
referenced by the fullest/oldest bucket, just to avoid pages being
pushed in and out of memory all the time.

Otherwise a kd-tree with a scaled grid per leaf that stays within a
memory page could probably work, but that sounds like a lot of work.
Implementing n-dimensional datastructures is conceptually easy, but
tedious in reality (can you trust it to be bug free? It is hard to
visualize in text.).

If you have to do the spatial datastructure yourself, then perhaps an
octree or its n-dimensional equivalent would be helpful. You could
make it shallow and in memory and use it both to buffer queries and to
store indexes to disk data. It would be 2^N children per node (2D data
has 4, 3D has 8).

I once read a paper on mapping GIS data to a regular database (mapping
2D to 1D) using Hilbert curves. Not sure if that is of any help. Hm.
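
On the Hilbert-curve point: a Z-order (Morton) code is a simpler cousin
with weaker but still useful locality, and it is easy to get right. The
sketch below handles only two dimensions, purely to illustrate mapping
a grid cell to a single 1-D key that a B-tree or flat file can then be
sorted on.

import std.stdio;

// Interleave the bits of the coordinates so that nearby grid cells tend
// to get nearby 1-D keys.
ulong morton2(uint x, uint y)
{
    // Spread the 32 bits of v out to every other bit of a 64-bit value.
    ulong spread(uint v)
    {
        ulong r = v;
        r = (r | (r << 16)) & 0x0000FFFF0000FFFFUL;
        r = (r | (r << 8))  & 0x00FF00FF00FF00FFUL;
        r = (r | (r << 4))  & 0x0F0F0F0F0F0F0F0FUL;
        r = (r | (r << 2))  & 0x3333333333333333UL;
        r = (r | (r << 1))  & 0x5555555555555555UL;
        return r;
    }
    return spread(x) | (spread(y) << 1);
}

void main()
{
    writefln("%b", morton2(3, 5)); // bits of 3 and 5 interleaved
    // Neighbouring cells -> usually nearby keys:
    writeln(morton2(100, 200), " ", morton2(101, 200), " ", morton2(100, 201));
}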
Dec 17 2013
prev sibling next sibling parent reply "Dylan Knutson" <tcdknutson gmail.com> writes:
It's not a datastructure per se, but Google's LevelDB is a very 
good, quite fast arbitrary key-value storage system (I use it in 
my project relating a 1 word key to a 200 word value) supporting 
batch transactional operations.

Think of it as a key-value storage version of SQLite.

https://code.google.com/p/leveldb/

And then the D bindings:
https://github.com/bheads/d-leveldb

And a D wrapper providing a nicer interface to LevelDB:
https://github.com/bheads/leveldb
Dec 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 17, 2013 at 09:43:37PM +0100, Dylan Knutson wrote:
It's not a datastructure per se, but Google's LevelDB is a very good,
 quite fast arbitrary key-value storage system (I use it in my
 project relating a 1 word key to a 200 word value) supporting batch
 transactional operations.
[...] I'm not looking for a transactional DB, though; I'm looking for
a fast key-value storage system with very high insert/query
throughput, and hopefully modest space requirements. Deletion is not
needed, though for very large problems it might be useful to reduce
disk space usage. It must be able to handle huge numbers of insertions
(say 10 million entries for small/moderate problem instances, perhaps
in the billions or trillions if I use it on a large problem instance),
and must be able to do so very fast -- the faster the better.
Preferably, it would take advantage of the n-dimensional locality that
I described in my other post.

Basically, the idea is to reduce I/O roundtrips by caching disk pages
in memory, because there are far more entries than will fit in memory
all at once. But since the number of entries is very large, to prevent
thrashing I have to maximize the number of cache hits. It does no good
if I have to load 100 disk pages to answer 100 queries; if 100 pages
fit in RAM, I'd like most of the entries in those 100 pages to be used
in answering queries before they get swapped out for new pages to be
loaded in.

In the ideal case (probably not attainable), those 100 pages will
contain exactly the next, say, million entries I'm about to query, so
that once I'm done with them, I can swap those pages out and never
have to load them back in again, making room for the next 100 pages to
be loaded and queried. So the layout of the disk pages should
correspond as closely as possible to the order in which the algorithm
will be looking up entries. (Unfortunately it's not 100% predictable,
as it depends on the problem being solved, but at least there are
general trends we can take advantage of.)

A generic storage scheme like SQLite will probably not perform as fast
as I'd like.


T

-- 
It is impossible to make anything foolproof because fools are so
ingenious. -- Sammy
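
The page-caching behaviour being described is essentially an LRU cache
of disk pages. A toy version in D looks like this; capacity, page size
and the loader callback are placeholders, and a real implementation
would pair this with a page layout that matches the expected access
order.

import std.stdio;

// Fixed number of disk pages kept in memory, evicting the least
// recently used one when full.
struct PageCache
{
    enum capacity = 100;            // "100 pages fit in RAM" from the post
    ubyte[][ulong] pages;           // page number -> page contents
    ulong[ulong] lastUsed;          // page number -> logical timestamp
    ulong clock;

    ubyte[] getPage(ulong pageNo, ubyte[] delegate(ulong) load)
    {
        if (auto p = pageNo in pages)
        {
            lastUsed[pageNo] = ++clock;   // cache hit
            return *p;
        }
        if (pages.length >= capacity)     // evict the least recently used page
        {
            ulong victim;
            ulong oldest = ulong.max;
            foreach (no, t; lastUsed)
                if (t < oldest) { oldest = t; victim = no; }
            pages.remove(victim);
            lastUsed.remove(victim);
        }
        auto page = load(pageNo);         // the actual disk read
        pages[pageNo] = page;
        lastUsed[pageNo] = ++clock;
        return page;
    }
}

void main()
{
    PageCache cache;
    // Fake loader standing in for a real seek+read.
    auto loader = delegate(ulong no) {
        writeln("loading page ", no);
        return new ubyte[4096];
    };
    cache.getPage(7, loader);  // miss: prints "loading page 7"
    cache.getPage(7, loader);  // hit: no load message
}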
Dec 17 2013
parent "Ola Fosheim Grøstad" writes:
On Tuesday, 17 December 2013 at 23:35:40 UTC, H. S. Teoh wrote:
 to answer 100 queries; if 100 pages fit in RAM, I'd like most
On OS X a page is 4K, so there are only 20 entries per page. If you
can get hold of an SSD (to avoid rotational latency) and use a
buffer/queue while waiting, then you should be OK?
Dec 17 2013
prev sibling next sibling parent luka8088 <luka8088 owave.net> writes:
On 17.12.2013. 20:08, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)
 
 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
 
 I did some online research, and it seems that hashtables perform poorly
 on disk, because the usual hash functions cause random scattering of
 related data (which are likely to be access with higher temporal
 locality), which incurs lots of disk seeks.
 
 I thought about B-trees, but they have high overhead (and are a pain to
 implement), and also only exhibit good locality if table entries are
 accessed sequentially; the problem is I'm working with high-dimensional
 data and the order of accesses is unlikely to be sequential. However,
 they do exhibit good spatial locality in higher-dimensional space (i.e.,
 if entry X is accessed first, then the next entry Y is quite likely to
 be close to X in that space).  Does anybody know of a good data
 structure that can take advantage of this fact to minimize disk
 accesses?
 
 
 T
 
sqlite file format seems to be fairly documented: http://www.sqlite.org/fileformat.html
Dec 17 2013
prev sibling parent reply "Francesco Cattoglio" <francesco.cattoglio gmail.com> writes:
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
 Another OT thread to pick your brains. :)

 What's a good, efficient file structure for storing extremely large
 lookup tables? (Extremely large as in > 10 million entries, with keys
 and values roughly about 100 bytes each.) The structure must support
 efficient adding and lookup of entries, as these two operations will be
 very frequent.
What kind of key do you have? vector of floats? integers?

How about some uberfast compression algorithm (eg: LZ4) to compress
pages when you have to offload data to the disk?
Dec 17 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Dec 18, 2013 at 01:14:58AM +0100, Francesco Cattoglio wrote:
 On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)

What's a good, efficient file structure for storing extremely large
lookup tables? (Extremely large as in > 10 million entries, with keys
and values roughly about 100 bytes each.) The structure must support
efficient adding and lookup of entries, as these two operations will
be very frequent.
What kind of key do you have? vector of floats? integers?
Vector of (relatively small) integers.
 How about some uberfast compression algorithm (eg: LZ4) to compress
 pages when you have to offload data to the disk?
That's a good idea, actually: the keys are very similar to each other,
so they should compress very well when many are placed together in a
single page.


T

-- 
The diminished 7th chord is the most flexible and fear-instilling
chord. Use it often, use it unsparingly, to subdue your listeners into
submission!
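
As a quick illustration of how well a page of near-identical keys
compresses, here is a small D test using std.zlib as a stand-in for LZ4
(LZ4 itself would need a third-party binding; the point is the ratio,
not the codec). The fake key layout is an assumption for the sketch.

import std.stdio;
import std.zlib : compress, uncompress;

void main()
{
    // Fake "page": 1000 keys of 100 bytes that differ only in two bytes.
    ubyte[] page;
    foreach (ubyte i; 0 .. 100)
        foreach (j; 0 .. 10)
        {
            ubyte[100] key = 7;      // mostly identical bytes
            key[0] = i;              // small per-key differences
            key[1] = cast(ubyte) j;
            page ~= key[];
        }

    auto packed = compress(page);
    writefln("raw: %s bytes, compressed: %s bytes", page.length, packed.length);

    auto restored = cast(ubyte[]) uncompress(packed, page.length);
    assert(restored == page);
}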
Dec 17 2013