digitalmars.D - [OT] Efficient file structure for very large lookup tables?
- H. S. Teoh (22/22) Dec 17 2013 Another OT thread to pick your brains. :)
- tn (4/12) Dec 17 2013 Perhaps some kind of k-d tree? Maybe even some sort of cache
- Craig Dillabaugh (7/19) Dec 17 2013 It's not cache-oblivious but he could try the KDB tree:
- H. S. Teoh (21/42) Dec 17 2013 I skimmed over the KDB tree paper, and am reading the Bkd-tree paper
- Joseph Cassman (8/40) Dec 17 2013 A burst trie might work, although it is originally designed for
- Craig Dillabaugh (24/56) Dec 17 2013 As a first attempt could you use a key-value database (like REDIS
- H. S. Teoh (48/93) Dec 17 2013 Well, currently I'm just using plain ole in-memory hash tables to store
- John Colvin (8/67) Dec 17 2013 I can't really add anything on the fancy data
- Craig Dillabaugh (15/28) Dec 18 2013 I should clarify a bit what I meant by batched. If you have a
- H. S. Teoh (20/47) Dec 18 2013 Hmm. This is an interesting thought. In a sense, queries don't have to
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (3/12) Dec 17 2013 But 200*10million = 2GB. Can't you use an existing KD-tree and
- H. S. Teoh (12/21) Dec 17 2013 Hmm. You're right, I think I underestimated my data size. :P The 10
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (28/35) Dec 17 2013 If you can buffer the queries in a queue then you can issue a
- Dylan Knutson (10/10) Dec 17 2013 It's not a datastructure per se, but Google's LevelDB is very
- H. S. Teoh (33/37) Dec 17 2013 [...]
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (5/6) Dec 17 2013 On OS-X a page is 4K, so there are only 20 entries per page. If
- luka8088 (3/29) Dec 17 2013 sqlite file format seems to be fairly documented:
- Francesco Cattoglio (4/14) Dec 17 2013 What kind of key do you have? vector of floats? integers?
- H. S. Teoh (8/20) Dec 17 2013 That's a good idea, actually, since the keys are very similar to each
Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? T -- Computers are like a jungle: they have monitor lizards, rams, mice, c-moss, binary trees... and bugs.
Dec 17 2013
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
Dec 17 2013
On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:It's not cache-oblivious, but he could try the KDB tree: http://dl.acm.org/citation.cfm?id=582321 Or, for batched queries, there is the Bkd-tree (search for Bkd-tree and Lars Arge) - it is a type of buffer tree, so the operations need to be run as a batch.However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?Perhaps some kind of k-d tree? Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
Dec 17 2013
On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:I skimmed over the KDB tree paper, and am reading the Bkd-tree paper now; it appears that the latter is an improvement over the former, and has better update performance characteristics. However, it also appears much more complicated to implement. Another point I forgot to mention, is that currently I only ever perform insertion and queries, and the two can actually be combined (basically the table is kept for uniqueness testing). I don't need to delete entries from the table. Eventually I might look into that to reduce disk space requirements, but as far as the actual algorithm is concerned, it's not necessary. So there's potentially a good amount of room here for optimization by foregoing the ability to delete entries. I'm currently considering using simpler, "dumber" file structures for storing nodes, with an in-memory Bloom filter to eliminate a (hopefully) large number of I/O roundtrips, since I'm expecting that a good fraction of queries will return 'not found'. T -- "A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Its not cache-oblivious buy he could try the KDB tree: http://dl.acm.org/citation.cfm?id=582321 or, again for batched queries the is the Bkd-tree (search on Bkd-tree and Lars Arge) - this paper is a type of buffer tree so the operations need to be able to be run as a batch.However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?Perhaps some kind of k-d tree? 
Maybe even some sort of cache oblivious k-d tree to make it more suitable for storage on the disk?
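The in-memory Bloom filter mentioned above could look something like this sketch: consult it before hitting the disk, and a "definitely absent" answer skips the I/O round-trip entirely. The bit-array size and hash count here are made-up tuning knobs, not values from the thread:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely absent' or 'maybe present'."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest (4 bytes each).
        digest = hashlib.sha256(key).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        # False means the key was definitely never added; True means
        # 'probably added' (false positives possible, tunable via sizing).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Since the table is insert-and-query only (no deletions), a plain Bloom filter fits; deletions would require a counting variant.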
Dec 17 2013
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be access with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? TA burst trie might work, although it is originally designed for text. I haven't had a chance to try one out in code yet but the paper by Heinz, Zobel, and Williams is interesting. Burst Tries: A Fast, Efficient Data Structure for String Keys (2002) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499 Joseph
Dec 17 2013
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be access with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? TAs a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question. Another question is can your queries be batched? If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well. However, if you thought implementing a B-tree was going to be painful, that might not appeal to you. If you don't want to implement that yourself you could look at TPIE: http://www.madalgo.au.dk/tpie/ Although it is in C++. If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions and then bin the 'points' or keys within each grid square and build a simpler structure on those. 
This won't work so well though for really high-dimensional data or if the 'points' are randomly distributed. Also, what exactly do you mean by "in that space" when you say: "if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space". Do you mean that the value of Y in the next dimension is numerically close (or expected to be) to X? Cheers, Craig
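Craig's grid-on-the-first-two-dimensions idea could be sketched roughly as follows; the cell size and the choice of which two dimensions to grid on are arbitrary assumptions for illustration:

```python
def grid_bin(points, cell=8, dims=(0, 1)):
    """Bucket points by a coarse grid over two chosen dimensions.

    Points whose chosen coordinates fall in the same cell x cell
    square land in the same bucket, so spatially nearby keys get
    stored (and later fetched) together.
    """
    buckets = {}
    for p in points:
        key = (p[dims[0]] // cell, p[dims[1]] // cell)
        buckets.setdefault(key, []).append(p)
    return buckets
```

Each bucket would then get its own simpler on-disk structure, as suggested above.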
Dec 17 2013
On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be access with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? TAs a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.Another question is can your queries be batched? 
If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well.Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help? [...]If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions and then bin the 'points' or keys within each grid square and build a simpler structure on those. This won't work so well though for really high dimension data or if the 'points' are randomly distributed. Also, what exactly do you mean by "in that space" when you say: "if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space". Do you mean that the value of Y in the next dimension is numerically close (or expected to be) to X?[...] How many dimensions is considered "really high"? The number of dimensions can be up to about 50 or so. Is that high? Also, the points are not randomly distributed; they are highly-connected (i.e., they tend to be adjacent to each other). Basically, one possible representation of the keys is as an n-dimensional array of integers. (They are not natively in that form, but can be compressed into that form easily.) And the way the algorithm works is that if the key (p1,p2,p3,...) is accessed, then the next key (q1,q2,q3,...) accessed is likely to differ from (p1,p2,p3,...) only in a small number of coordinates, and the difference in coordinates is likely to be very small (usually in the +1/-1 range). Actually, now that I think of it... the problem may be far more tractable than I realized. I was previously thinking of some kind of n-dimensional space-partitioning storage scheme, where the state space is partitioned into n-dimensional cube buckets, and points are placed in their corresponding buckets. 
However, this had the problem that the n-cube contains 2^n vertices, so the bucket sizes would have to be k^n, and since I can't control n, it makes it difficult to map bucket sizes to page sizes for optimal I/O performance. For large n, the bucket size would be unmanageably big (e.g., in the 50-dimensional case). But it just dawned on me that partitioning into n-cubes is unnecessary, because most of the time, I only care about points that differ in a small number of coordinates from the current point. So a more efficient bucket shape would be something closer to an n-cross (higher-dimensional equivalent of the octahedron), which only has 2*n vertices, or one of its derivatives. This makes bucket sizes far easier to control, and far more efficient to store. The tricky part is only, how to partition n-space into n-cross-shaped pieces? It would have to be some kind of tiling (otherwise there'd be ambiguity over which bucket a given point falls into). So now I've to do some research into n-dimensional tilings... :P T -- "The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."
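For what it's worth, the 2*n vertices of the n-cross around a point are cheap to enumerate, which is part of what makes that bucket shape attractive compared to the 2^n vertices of the n-cube; a trivial sketch:

```python
def cross_neighbors(p):
    """Yield the 2*n single-coordinate +/-1 neighbours of point p --
    the vertices of the n-cross centred on p, which is exactly the
    set of 'next keys' the access pattern described above favours."""
    for i in range(len(p)):
        for delta in (-1, 1):
            q = list(p)
            q[i] += delta
            yield tuple(q)
```

Even at n = 50 dimensions this is only 100 neighbours per point, versus 2^50 cube vertices.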
Dec 17 2013
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:I can't really add anything on the fancy data structures/algorithms, it's all a bit out of my league. That said, have you considered asynchronously maintaining the locality in memory, i.e. prefetching? If you have predictable access and good locality it should perform well. In particular, you can build a prefetcher that is tuned to the particular access pattern.On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program uses Berkeley DB hash tables, but it becomes I/O bound and slow when the problem size is large.)Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be access with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. 
However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? TAs a first attempt could you use a key-value database (like REDIS if you have enough memory to fit everything in)? Or is that out of the question.
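John's tuned-prefetcher suggestion could be sketched as a background thread that warms a page cache from a prediction callback. Here `load_page` and `predict_next` are hypothetical stand-ins for the real disk I/O and the access-pattern model, not APIs from the thread:

```python
import queue
import threading

class Prefetcher:
    """Background thread that warms a shared cache with the pages the
    access pattern predicts will be needed soon."""

    def __init__(self, load_page, predict_next):
        self.load_page = load_page        # assumed callback: reads one page
        self.predict_next = predict_next  # assumed callback: page -> likely next pages
        self.cache = {}
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            page_no = self.requests.get()
            # Load the predicted neighbours before they are demanded.
            for nxt in self.predict_next(page_no):
                if nxt not in self.cache:
                    self.cache[nxt] = self.load_page(nxt)
            self.requests.task_done()

    def get(self, page_no):
        if page_no not in self.cache:
            self.cache[page_no] = self.load_page(page_no)  # cache miss: load now
        self.requests.put(page_no)        # ask the worker to warm neighbours
        return self.cache[page_no]
```

A real version would bound the cache and evict, but this shows the shape: demand fetches stay synchronous while predicted pages arrive asynchronously.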
Dec 17 2013
On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:I should clarify a bit what I meant by batched. If you have a sequence of queries q_1, q_2, ..., q_n (where I guess in your case a query can be/is an insertion), then by 'batch' process what I mean is you do not need to get the answers on the fly. Is it sufficient to have all queries answered when the program terminates? Note that these techniques still account for the 'order' of the queries, so the answer to query q_i will be reported correctly (i.e. same as if the queries were performed sequentially on a standard data structure). Also, if you are still considering using hashing the following paper may be of interest to you (I haven't read it in detail, but it doesn't look too crazy) http://arxiv.org/abs/1104.2799Another question is can your queries be batched? If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well.Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?
Dec 18 2013
On Wed, Dec 18, 2013 at 03:03:20PM +0100, Craig Dillabaugh wrote:On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:Hmm. This is an interesting thought. In a sense, queries don't have to be answered immediately; the program can generate a series of queries and then continue processing the next item, the answer can be asynchronous. When the answer becomes available, then further processing will happen (add new items to be processed, update existing items, etc.). The only requirement is that queries will always be answered in the order they were made (the correctness of the algorithm depends on this). In a sense, what I need is really just a single operation "add-or-update", which can be processed in bulk. Either an item is already in the table, in which case it gets updated in a certain way, or it isn't, in which case it is added. In both cases, some additional processing takes place after the table is updated, which may create more items to be processed.On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:I should clarify a bit what I meant by batched. If you have a sequence of queries q_1, q_2, .... q_n (where I guess in your case a query can be/is an insertion), then by 'batch' proccess what I mean is you do not need to get the answers on the fly. It is sufficient to have all queries answered when the program terminates?Another question is can your queries be batched? If that is the case and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree" which might work well.Hmm. I'm not sure if it's possible to batch the queries unless I multithread the algorithm; the queries are generated on-the-fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?Note that these techniques still account for the 'order' of the queries, so the answer to query q_i will be reported correctly (ie. 
same as if the queries were performed sequentially on a standard data structure).Then this is OK.Also, if you are still considering using hashing the following paper may be of interest to you (I haven't read it in detail, but it doesn't look too crazy) http://arxiv.org/abs/1104.2799Thanks for the references! I'll look into this, it seems quite relevant. T -- You have to expect the unexpected. -- RL
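The bulk "add-or-update" operation described above could be buffered like this sketch, where `combine` and the flush threshold stand in for the (unspecified) per-entry update rule and batch size; note the batch is applied in issue order, preserving the ordering the algorithm depends on:

```python
class BatchedTable:
    """Buffer add-or-update operations and apply them in one pass."""

    def __init__(self, combine, flush_size=1024):
        self.table = {}              # stands in for the on-disk table
        self.pending = []
        self.combine = combine       # assumed: merges old and new values
        self.flush_size = flush_size

    def add_or_update(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        # One pass over the buffered batch, in the order issued, so
        # each operation sees the effect of all earlier ones.
        for key, value in self.pending:
            if key in self.table:
                self.table[key] = self.combine(self.table[key], value)
            else:
                self.table[key] = value
        self.pending.clear()
```

With an on-disk backend, `flush` would sort the batch by page before applying it, turning scattered writes into sequential ones.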
Dec 18 2013
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
Dec 17 2013
On Tue, Dec 17, 2013 at 09:02:41PM +0100, digitalmars-d-bounces@puremagic.com wrote:On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Hmm. You're right, I think I underestimated my data size. :P The 10 million figure is based on problems that my current program successfully solved (with an in-memory hashtable). I guess that larger problems must far exceed that number (I ran out of memory so I couldn't check just how big it got before I killed the process). Suffice it to say that this is a combinatorial problem, so the number of entries grows exponentially; anything that can help reduce the storage requirements / I/O latency would be a big help. T -- VI = Visual IrritationWhat's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.But 200*10million = 2GB. Can't you use an existing KD-tree and tweak it to fit memory pages and rely on paging for a start?
Dec 17 2013
On Tuesday, 17 December 2013 at 20:54:56 UTC, H. S. Teoh wrote:big it got before I killed the process). Suffice it to say that this is a combinatorial problem, so the number of entries grow exponentially; anything that can help reduce the storage requirements / I/O latency would be a big help.If you can buffer the queries in a queue then you can issue a prefetch-request to the OS to bring in the memory-page from disk when you put it into the queue to prevent the process from being put to sleep, the length of the queue has to be tailored to how fast it can load the page. If the data is uniformly distributed then perhaps you could partition the disk-space with a n-dimensional grid, and then have a key-value store that you page into memory? If you can do some queries out of order then you probably should set up some "buckets"/"bins" in the queue and prefetch the page that is referenced by the fullest/oldest bucket if you can do some queries out of order. Just to avoid that pages are pushed in and out of memory all the time. Otherwise a kd-tree with a scaled grid per leaf that stays within a memorypage, probably could work, but that sounds like a lot of work. Implementing n-dimensional datastructures is conceptually easy, but tedious in reality (can you trust it to be bug free? It is hard to visualize in text.). If you have to do the spatial datastructure yourself, then perhaps an octree or n-dimensional oc-tree would be helpful. You could make it shallow and in memory and use it both to buffer queries and to store indexes to disk-data. It would be 2^N nodes per level. (2D data has 4 nodes, 3D has 8 nodes) I once read a paper of mapping GIS data to a regular database (mapping 2D to 1D) using hilbert curves. Not sure if that is of any help. Hm.
Dec 17 2013
It's not a data structure per se, but Google's LevelDB is a very good, quite fast arbitrary key-value storage system (I use it in my project, relating a 1-word key to a 200-word value) supporting batched transactional operations. Think of it as a key-value storage version of SQLite. https://code.google.com/p/leveldb/ And then the D bindings: https://github.com/bheads/d-leveldb And a D wrapper providing a nicer interface to LevelDB: https://github.com/bheads/leveldb
Dec 17 2013
On Tue, Dec 17, 2013 at 09:43:37PM +0100, Dylan Knutson wrote:It's not a datastructure persay, but Google's LevelDB is very good, quite fast arbitrary key-value storage system (I use it in my project relating a 1 word key to a 200 word value) supporting batch transactional operations.[...] I'm not looking for a transactional DB, though; I'm looking for a fast key-value storage system with very high insert/query throughput, and hopefully good space requirements. Deletion is not needed, though for very large problems it might be useful to reduce disk space usage. It must be able to handle huge numbers of insertions (say 10 million entries for small/moderate problem instances, perhaps in the billions or trillions, if I use it on a large problem instance), and must be able to do so very fast -- the faster the better. Preferably, it would take advantage of the n-dimensional locality that I described in my other post. Basically, the idea is to reduce I/O roundtrips by caching disk pages in memory, because there are far more entries than will fit in memory all at once. But since the number of entries is very large, to prevent thrashing I have to maximize the number of cache hits. It does no good if I have to load 100 disk pages to answer 100 queries; if 100 pages fit in RAM, I'd like most of the entries in those 100 pages to be used in answering queries before they get swapped out for new pages to be loaded in. In the ideal case (probably not attainable), those 100 pages will contain exactly the next, say, million entries I'm about to query, so that once I'm done with them, I can swap those pages out and never have to load them back in again, giving the space for the next 100 pages to be loaded and queried. So the layout of the disk pages should be as closely corresponding to the order the algorithm will be looking up entries as possible. 
(Unfortunately it's not 100% predictable, as it depends on the problem being solved, but at least there are general trends we can take advantage of.) A generic storage scheme like SQLite will probably not perform as fast as I'd like. T -- It is impossible to make anything foolproof because fools are so ingenious. -- Sammy
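The page-caching behaviour described above -- keep hot pages resident, swap a page out only once its entries have been consumed -- is essentially an LRU policy; a minimal sketch, with `load_page` as an assumed callback for the actual disk read:

```python
from collections import OrderedDict

class PageCache:
    """Tiny LRU cache of disk pages: evicts the least recently used
    page once capacity is reached, so hot pages stay resident."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page   # assumed callback: reads one page
        self.pages = OrderedDict()   # insertion order doubles as LRU order

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)   # mark most recently used
            return self.pages[page_no]
        page = self.load_page(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        return page
```

If the page layout matches the access order well, as hoped above, most `get` calls hit the cache and each page is loaded close to once.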
Dec 17 2013
On Tuesday, 17 December 2013 at 23:35:40 UTC, H. S. Teoh wrote:to answer 100 queries; if 100 pages fit in RAM, I'd like mostOn OS X a page is 4K, so there are only 20 entries per page. If you can get hold of an SSD (to avoid rotational latency) and use a buffer/queue while waiting, then you should be OK?
Dec 17 2013
On 17.12.2013. 20:08, H. S. Teoh wrote:Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent. I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be access with higher temporal locality), which incurs lots of disk seeks. I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses? Tsqlite file format seems to be fairly documented: http://www.sqlite.org/fileformat.html
Dec 17 2013
On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.What kind of key do you have? vector of floats? integers? How about some uberfast compression algorithm (eg: LZ4) to compress pages when you have to offload data to the disk?
Dec 17 2013
On Wed, Dec 18, 2013 at 01:14:58AM +0100, Francesco Cattoglio wrote:On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:Vector of (relatively small) integers.Another OT thread to pick your brains. :) What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.What kind of key do you have? vector of floats? integers?How about some uberfast compression algorithm (eg: LZ4) to compress pages when you have to offload data to the disk?That's a good idea, actually, since the keys are very similar to each other, so should compress very well when many are placed together in a single page. T -- The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
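A sketch of the compression idea: delta-encode each page of similar integer-vector keys before deflating, so the mostly-zero, mostly-tiny deltas compress very well. The 2-byte signed delta width is an assumption about coordinate ranges, and zlib stands in for whatever fast codec (e.g. LZ4) is actually chosen:

```python
import zlib

def pack_page(keys):
    """Delta-encode a page of similar integer-vector keys, then deflate.

    Consecutive keys differ in few coordinates by small amounts, so
    the delta stream is mostly zero bytes and deflates well.
    """
    out = bytearray()
    prev = [0] * len(keys[0])
    for key in keys:
        for p, c in zip(prev, key):
            out += (c - p).to_bytes(2, "big", signed=True)
        prev = list(key)
    return zlib.compress(bytes(out))

def unpack_page(blob, dim):
    """Inverse of pack_page: inflate, then undo the delta encoding."""
    data = zlib.decompress(blob)
    keys, prev = [], [0] * dim
    for off in range(0, len(data), 2 * dim):
        deltas = [int.from_bytes(data[off + 2 * i: off + 2 * i + 2],
                                 "big", signed=True)
                  for i in range(dim)]
        prev = [p + d for p, d in zip(prev, deltas)]
        keys.append(tuple(prev))
    return keys
```

Sorting the keys within a page (e.g. by a space-filling-curve order) before delta-encoding would shrink the deltas further.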
Dec 17 2013