
digitalmars.D.learn - Associative Array: reasonable limits?

reply Charles Hixson <charleshixsn earthlink.net> writes:
I'm contemplating an associative array that will eventually grow to be 
an estimated 64KB in size, assuming it's about half full.  It would then 
be holding around 90,000 entries.  Is this reasonable, or should I go 
with a sorted array, and do binary searches?  I estimate that binary 
searches would average around 15 accesses/hit, but that's still a lot 
faster than disk access.  The array would be nearly full, so there would 
be less wasted space, but, of course, insertions would become quite 
expensive.  (Deletions aren't expected.)  In either case the keys would 
be fixed length character arrays, and the values would also be simple 
structs without reference types (so the garbage collector could pretty 
much ignore things).
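
For concreteness, the two candidates might be sketched in D roughly like this; the char[8] key and the uint id value here are invented stand-ins, not the actual types:

    import std.algorithm.sorting : sort;

    struct Id { uint id; }                    // simple value struct, no references
    struct Entry { char[8] code; Id val; }    // fixed-length key plus value, for the array variant

    void main()
    {
        char[8] a = "EXT-0001", b = "EXT-0002";

        // Candidate 1: built-in associative array keyed on the fixed-length code.
        Id[char[8]] byCode;
        byCode[a] = Id(42);
        if (auto p = a in byCode)             // the in operator yields a pointer; lookups don't allocate
            assert(p.id == 42);

        // Candidate 2: sorted array plus binary search (~15-17 probes for ~90k entries).
        auto entries = [Entry(b, Id(43)), Entry(a, Id(42))];
        auto sorted  = entries.sort!((x, y) => x.code[] < y.code[]);  // returns a SortedRange
        auto hit     = sorted.equalRange(Entry(a));
        assert(!hit.empty && hit.front.val.id == 42);
    }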

FWIW I'm expecting this array to be around the entire time the program 
is running, but it would probably be rolled out most of the time, as 
frequently accessed items would be pulled into a much smaller structure, 
and if I go with the array rather than the hash table I may only do 
updates as a batch.  (Since it's linear with the size of the array, but 
essentially ignores the number of items being added.)

My temptation is to go with the hash table...but I'm not sure of the 
underlying mechanics.

-- 
Charles Hixson
Nov 02 2013
next sibling parent reply "TheFlyingFiddle" <theflyingfiddle gmail.com> writes:
What are you going to be using the collection for?
Nov 02 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
 What are you going to be using the collection for?
It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally. There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s. The codes are used only during i/o, as they are more intelligible to people. -- Charles Hixson
Nov 02 2013
parent reply "TheFlyingFiddle" <theflyingfiddle gmail.com> writes:
On Saturday, 2 November 2013 at 23:58:07 UTC, Charles Hixson 
wrote:
 On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
 What are you going to be using the collection for?
It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally. There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s. The codes are used only during i/o, as they are more intelligible to people.
Well, if the codes are only used during i/o, the performance of the collection is not really that important, since the bottleneck is the i/o anyway.

That said, my recommendation is to go with the AA, since it's really simple to work with. It also has the benefit that if you have a good hashing function, lookup time will generally be O(1). This also eliminates the need to build a custom data structure. It is my understanding that there are no hard limits on the number of elements in the built-in AA (but I can't say for sure; it's very simple to test, though).

Since the collection is never going to be deleted, if the external codes are known at compile time I would recommend generating the collection at compile time, thus avoiding the runtime cost of building the table. (This works for either the AA or the sorted array.) It also eliminates the need to care about insertion times (unless you get unreasonable build times, of course ^^).

Hope this helped a little. It's hard to say whether the AA or the sorted array is the way to go without profiling the complete application.
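
For the hypothetical case where the codes were known ahead of time, that suggestion might reduce to something like this sketch (codes and ids are invented; either the baked-in sorted array or the AA filled once at startup would do):

    struct Entry { string code; uint id; }

    // The whole table baked into the binary as an immutable, pre-sorted array...
    immutable Entry[] table = [
        Entry("EXT-0001", 42),
        Entry("EXT-0002", 43),
    ];

    // ...or an AA version of it, built once at program start:
    immutable uint[string] idByCode;

    shared static this()
    {
        uint[string] t;
        foreach (e; table)
            t[e.code] = e.id;
        idByCode = cast(immutable) t;   // t never escapes mutably, so this is fine
    }

    void main()
    {
        assert(idByCode["EXT-0001"] == 42);
    }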
Nov 02 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 11/02/2013 05:16 PM, TheFlyingFiddle wrote:
 On Saturday, 2 November 2013 at 23:58:07 UTC, Charles Hixson wrote:
 On 11/02/2013 04:49 PM, TheFlyingFiddle wrote:
 What are you going to be using the collection for?
It's basically a lookup table used for translating external codes (essentially arbitrary) into id#s used internally. There are LOTS of id#s that don't correspond to ANY external code, and nearly all the processing is done only using those id#s. The codes are used only during i/o, as they are more intelligible to people.
Well, if the codes are only used during i/o, the performance of the collection is not really that important, since the bottleneck is the i/o anyway. That said, my recommendation is to go with the AA, since it's really simple to work with. It also has the benefit that if you have a good hashing function, lookup time will generally be O(1). This also eliminates the need to build a custom data structure. It is my understanding that there are no hard limits on the number of elements in the built-in AA (but I can't say for sure; it's very simple to test, though). Since the collection is never going to be deleted, if the external codes are known at compile time I would recommend generating the collection at compile time, thus avoiding the runtime cost of building the table. (This works for either the AA or the sorted array.) It also eliminates the need to care about insertion times (unless you get unreasonable build times, of course ^^). Hope this helped a little. It's hard to say whether the AA or the sorted array is the way to go without profiling the complete application.
Well, I didn't expect any hard limits, but it isn't easy to test this in context. I'm pretty sure an array implementation could be run without impacting the rest of the program (which hasn't yet been written). It's slower, but it's less resource intensive, and it's even smaller. If AAs don't have any hidden penalty, like causing the garbage collector to thrash, then that's the way to go (and then, yes, I'll need to test it, but I'd really be surprised if THAT step caused a problem). IIRC I've used larger AAs in simple programs without problems, but this one will probably run continually for months, and the AA is only one piece, so I'm a bit concerned about things that I can't see any easy way to test. So I thought I'd ask.

P.S.: The external codes are NOT known at compile time. Not even at the start of run time. They are added incrementally, and there is no actual rule about what they will be. (This is a part of the reason for all the "about" and "estimated" in my original statement (unless I edited them out).) And there will actually be two of the structures of slightly differing composition and size, but they're essentially the same in use. There'll also be a couple of others that are rather different in this particular section of the program. Access to all of these will be via a handler class, so I may make all the mentioned tables private and handle them with internal classes. But this is probably more about penguins than you really want to know.

So it sounds like I should plan on using AAs, and just keep the array option around as a fallback position.

-- Charles Hixson
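
Concretely, the incremental-assignment part of such a handler might reduce to something like this minimal sketch (the name CodeRegistry, the char[8] key, and the id scheme are all invented for illustration):

    struct CodeRegistry
    {
        private uint[char[8]] ids;    // external code -> internal id#
        private uint nextId = 1_000;  // ids below this reserved for internal-only entries

        // Returns the id for a code, assigning a fresh one the first time it's seen.
        uint idFor(char[8] code)
        {
            if (auto p = code in ids)
                return *p;
            auto id = nextId++;
            ids[code] = id;
            return id;
        }
    }

    void main()
    {
        CodeRegistry reg;
        char[8] c = "EXT-0001";
        auto first  = reg.idFor(c);
        auto second = reg.idFor(c);
        assert(first == second);      // the same code always maps to the same id
    }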
Nov 02 2013
parent "TheFlyingFiddle" <theflyingfiddle gmail.com> writes:
 If AAs don't have any hidden penalty, like causing the garbage 
 collector to thrash, then that's the way to go
Any time you insert an item into an AA it might allocate, so it might cause a garbage collection cycle. I'm unsure what you mean by "causing the garbage collector to thrash" (memory leaks?). If it's not acceptable that the GC will collect, then either build a custom hash map class with deterministic memory management or implement your sorted array version. Note: lookups into AAs do not allocate.
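
For what it's worth, one way to keep collections from firing at an awkward moment around those inserts is to fence them, roughly like this (illustrative names; GC.disable only suppresses collections normally triggered by allocation, it doesn't stop the allocation itself):

    import core.memory : GC;

    void insertBatch(ref uint[string] table, string[] codes, uint firstId)
    {
        GC.disable();                     // suppress allocation-triggered collections during the inserts
        scope (exit) GC.enable();
        foreach (i, code; codes)
            table[code] = firstId + cast(uint) i;   // may allocate, but won't normally trigger a collection now
    }

    void main()
    {
        uint[string] table;
        insertBatch(table, ["EXT-0001", "EXT-0002"], 42);
        GC.collect();                     // collect at a moment of your choosing instead
        assert("EXT-0001" in table && table["EXT-0002"] == 43);
    }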
 I've used larger AAs in simple programs without problems, but 
 this one will probably run continually for months.
I have never written anything that is supposed to run for months at a time, so take anything I say with a grain of salt.
 is only one piece, so I'm a bit concerned about things that I 
 can't see any easy way to test.  So I thought I'd ask.
 P.S.:  The external codes are NOT known at compile time.  Not 
 even at the start of run time. They are added incrementally, 
 and there is no actual rule about what they will be.  (This is 
 a part of the reason for all the "about" and "estimated" in my 
 original statement (unless I edited them out).)
Since the codes follow no rule, the best you can do for a hash function is to use a general one; this might not be good enough for your purposes. You could test it with the built-in string hash function on random strings of a specific length, but that's no guarantee that it will work well for your specific data. It's still highly likely to be faster than the roughly 15 accesses per lookup of a binary search.
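
A quick-and-dirty version of that test might look like this (the code length of 8, the key count, and the bucket count are invented):

    import std.random : uniform;
    import std.stdio : writefln;

    void main()
    {
        enum nCodes = 90_000, nBuckets = 1 << 18;   // a power of two comfortably above nCodes
        auto counts = new uint[nBuckets];

        foreach (_; 0 .. nCodes)
        {
            char[8] code;
            foreach (ref c; code)
                c = cast(char) ('A' + uniform(0, 26));   // random fixed-length code
            counts[hashOf(code[]) % nBuckets]++;         // built-in hash, via object.hashOf
        }

        uint worst;
        foreach (n; counts)
            if (n > worst) worst = n;
        writefln("%s keys into %s buckets: fullest bucket holds %s", nCodes, nBuckets, worst);
    }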
 So it sounds like I should plan on using AAs, and just keep the 
 array option around as a fallback position.
I would agree. If the extra memory and non-determinism are not acceptable, go with the array.
Nov 02 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Nov-2013 02:37, Charles Hixson writes:
 I'm contemplating an associative array that will eventually grow to be
 an estimated 64KB in size, assuming it's about half full.  It would then
 be holding around 90,000 entries.  Is this reasonable, or should I go
 with a sorted array, and do binary searches?  I estimate that binary
 searches would average around 15 accesses/hit, but that's still a lot
 faster than disk access.  The array would be nearly full, so there would
 be less wasted space, but, of course, insertions would become quite
 expensive.  (Deletions aren't expected.)  In either case the keys would
 be fixed length character arrays, and the values would also be simple
 structs without reference types (so the garbage collector could pretty
 much ignore things).
90k elements is a small AA. It would work just fine with any sane hash function (e.g. the built-in one). At around 10M+ elements it may be worth considering space-saving techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd advise against it, but only because of the GC.
 FWIW I'm expecting this array to be around the entire time the program
 is running, but it would probably be rolled out most of the time, as
 frequently accessed items would be pulled into a much smaller structure,
 and if I go with the array rather than the hash table I may only do
 updates as a batch.  (Since it's linear with the size of the array, but
 essentially ignores the number of items being added.)

 My temptation is to go with the hash table...but I'm not sure of the
 underlying mechanics.
Think of it as a sparse array that has "holes" and takes around 2-4 times the size of a normal array. It's not only because of the holes but because each slot is a bit bigger. In return, lookups/inserts/deletions are cheap: O(1) with high probability. -- Dmitry Olshansky
Nov 03 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
 03-Nov-2013 02:37, Charles Hixson writes:
 I'm contemplating an associative array that will eventually grow to be
 an estimated 64KB in size, assuming it's about half full.  It would then
 be holding around 90,000 entries.  Is this reasonable, or should I go
 with a sorted array, and do binary searches?  I estimate that binary
 searches would average around 15 accesses/hit, but that's still a lot
 faster than disk access.  The array would be nearly full, so there would
 be less wasted space, but, of course, insertions would become quite
 expensive.  (Deletions aren't expected.)  In either case the keys would
 be fixed length character arrays, and the values would also be simple
 structs without reference types (so the garbage collector could pretty
 much ignore things).
90k elements is a small AA. It would work just fine with any sane hash function (e.g. the built-in one). At around 10M+ elements it may be worth considering space-saving techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd advise against it, but only because of the GC.
 FWIW I'm expecting this array to be around the entire time the program
 is running, but it would probably be rolled out most of the time, as
 frequently accessed items would be pulled into a much smaller structure,
 and if I go with the array rather than the hash table I may only do
 updates as a batch.  (Since it's linear with the size of the array, but
 essentially ignores the number of items being added.)

 My temptation is to go with the hash table...but I'm not sure of the
 underlying mechanics.
Think of it as a sparse array that has "holes" and takes around 2-4 times the size of a normal array. It's not only because of the holes but because each slot is a bit bigger. In return, lookups/inserts/deletions are cheap: O(1) with high probability.
Thanks, I'd been assuming that it would just be around 128 bytes larger (i.e., two pointers) and around a 50% fill rate. Sounds like a bit worse than I was estimating...not hugely worse though.

OTOH, that's an argument in favor of using a small cache that is a hash table, and a large array as a backing store, where infrequently accessed items could be kept. Probably if the cache could hold around 2,000 entries it would have over a 90% hit rate (though that would need to be tested). And the array would also provide a sorted access method for traversing it. I'm not sure that's important, but it could be. It could also be saved to an externally editable file (i.e., using a standard text editor) which also might be important.

This will, naturally, slow down insertions. (I don't expect any deletions, as they would render historical data unintelligible.) Perhaps I'll batch the insertions, as the insertion is linear WRT the size of the array, but is essentially independent of the number of insertions. (I.e., no matter how many insertions, it's a one pass job.)

-- Charles Hixson
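
A bare-bones sketch of that cache-in-front-of-a-sorted-array arrangement (types, sizes, and the promotion policy are invented; eviction is left out entirely):

    import std.range : assumeSorted;

    struct Entry { char[8] code; uint id; }

    struct CodeTable
    {
        uint[char[8]] cache;   // small, holds the frequently used entries
        Entry[] backing;       // large, kept sorted by code

        // Looks up a code, promoting hits from the backing store into the cache.
        bool lookup(char[8] code, out uint id)
        {
            if (auto p = code in cache) { id = *p; return true; }
            auto hit = backing.assumeSorted!((a, b) => a.code[] < b.code[])
                              .equalRange(Entry(code));
            if (hit.empty) return false;
            id = hit.front.id;
            cache[code] = id;  // a real cache would also evict something here
            return true;
        }
    }

    void main()
    {
        char[8] c = "EXT-0001";
        auto t = CodeTable(null, [Entry(c, 42)]);
        uint id;
        assert(t.lookup(c, id) && id == 42);
    }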
Nov 04 2013
next sibling parent reply "Froglegs" <barf barf.com> writes:
Why are you obsessing over a 64k hash table... that is tiny.
Nov 04 2013
parent Charles Hixson <charleshixsn earthlink.net> writes:
On 11/04/2013 04:22 PM, Froglegs wrote:
 Why are you obsessing over a 64k hash table... that is tiny.
Probably because I'm an old fogey, and to me 64KB sounds like all the RAM a computer is likely to have. -- Charles Hixson
Nov 04 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Nov-2013 02:20, Charles Hixson writes:
[...]
Thanks, I'd been assuming that it would just be around 128 bytes larger (i.e., two pointers) and around a 50% fill rate. Sounds like a bit worse than I was estimating...not hugely worse though. OTOH, that's an argument in favor of using a small cache that is a hash table, and a large array as a backing store, where infrequently accessed items could be kept. Probably if the cache could hold around 2,000 entries it would have over a 90% hit rate (though that would need to be tested).
What are your requirements for this data structure, anyway?
 And the
 array would also provide a sorted access method for traversing it.  I'm
 not sure that's important, but it could be.  It could also be saved to
 an externally editable file (i.e., using a standard text editor) which
 also might be important.
Oh my, that's a lot of complexity for no gain. Truth be told, I tend to always optimize for speed (= less power consumption) these days. What you describe is sort of like collision pages (as in some DBs), but deployed in such a small scenario I see ZERO gain in any dimension. You know, code also occupies some RAM ;).
 This will, naturally, slow down insertions. (I don't expect any
 deletions, as they would render historical data unintelligible.)

 Perhaps I'll batch the insertions, as the insertion is linear WRT the
 size of the array, but is essentially independent of the number of
 insertions.  (I.e., no matter how many insertions, it's a one pass job.)
Even more code. And space for batching, and for keeping track of whether an item has been batched but not yet added to the collection. -- Dmitry Olshansky
Nov 05 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 11/05/2013 05:34 AM, Dmitry Olshansky wrote:
[...]
Thanks, I'd been assuming that it would just be around 128 bytes larger (i.e., two pointers) and around a 50% fill rate. Sounds like a bit worse than I was estimating...not hugely worse though. OTOH, that's an argument in favor of using a small cache that is a hash table, and a large array as a backing store, where infrequently accessed items could be kept. Probably if the cache could hold around 2,000 entries it would have over a 90% hit rate (though that would need to be tested).
What are your requirements for this data structure, anyway?
 And the
 array would also provide a sorted access method for traversing it.  I'm
 not sure that's important, but it could be.  It could also be saved to
 an externally editable file (i.e., using a standard text editor) which
 also might be important.
Oh my, that's a lot of complexity for no gain. Truth be told, I tend to always optimize for speed (= less power consumption) these days. What you describe is sort of like collision pages (as in some DBs), but deployed in such a small scenario I see ZERO gain in any dimension. You know, code also occupies some RAM ;).
 This will, naturally, slow down insertions. (I don't expect any
 deletions, as they would render historical data unintelligible.)

 Perhaps I'll batch the insertions, as the insertion is linear WRT the
 size of the array, but is essentially independent of the number of
 insertions.  (I.e., no matter how many insertions, it's a one pass job.)
Even more code. And space for batching, and for keeping track of whether an item has been batched but not yet added to the collection.
I'm not sure about the batch adds. I certainly won't do it at first, while the data is small. The thing is, I want to keep as much as feasible rolled out most of the time. That means that while it's active in "RAM" that RAM is actually pages on the disk of virtual memory. With a large hash table, that's not likely to be controllable. What I'm doing here is optimizing for "speed and..." where and includes active memory use (as opposed to rolled out virtual memory). A part of what makes a hash table efficient is that it's always in RAM. If you need to roll it in to access it, then you lose most of the efficiency.

A sorted array is a pretty straightforwards data structure. It's reasonably quick to access, though not as fast as a hash table. The real killer is insertion (or deletion). And it doesn't lose anywhere near as much by being rolled out (for one thing, each page holds about twice as much data, so you don't need as many reads). And it's easy to save to external disk. (You can size it easily so that when starting up the program again, the array is initialized at the correct size.)

I had actually for a long time considered just using a database, but I can't control how the database caches things in RAM. I want the cache to have a long term usage memory as well as recent use, and doing that with a database is just too complex. Yes, this is sort of like a database, but except for some databases that store everything in RAM, it's not the same.

OTOH, if this were the entire program, you would be right that just using a hash table was a better idea. But this is actually almost a side show, rather like GUI's are sideshows to what the program is actually doing. This is what the program needs to interface intelligibly with people. As such, it's not a heavily used part, and it's best if it doesn't consume excess active RAM. (For that matter, I *know* my computer is too small for what I'm trying to do. So using extra RAM is undesirable. I may eventually decide to have the thing that I'm currently calling "the array" live out on disk, but it's my understanding that this isn't much of a gain over having it just rolled out by virtual memory.) I've also seriously considered using a sort of database (I'm thinking of KyotoCabinet), but doing it this way yields fewer external dependencies, and the hash cache is probably a better choice. (Again, with KyotoCabinet I don't have as much control over what it decides to keep in the cache.)

I expect that the total memory needs of the program will edge towards the Terabyte range, so most of it is going to need to live on disk no matter what I do. But I don't know in detail just what's going to be involved there. I've roughed out a be transformed into a file offset, but how to decide what should be removed from RAM isn't something I've figured out yet. You could think of the basic datastructure of the program as a 40 dimensional cyclic graph with one-way links, but that's not quite right. The links carry more information than is typical of graph structures (though logically that's no change...you could model them as nodes along the links that act as filters). OTOH, the file structure that supports this is quite simple. Just a direct access file with fixed length blocks.

As for "to be added" being separate, that's not hard either. That could be either a hash table or an array. though a hash table would probably be simpler. And it's just another place I'd need to check when looking to see if I already recognized the code. Still, that's something that should really be saved until later, as it can be added on at any time. (Which probably means I'll never get around to it.) -- Charles Hixson
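
The batched, one-pass update described above could be sketched like this (the Entry type and codes are invented; a real version might update in place or stream to and from disk):

    import std.algorithm.sorting : sort;

    struct Entry { char[8] code; uint id; }

    // Merges a batch of new entries into the sorted backing array in a single pass,
    // regardless of how many entries the batch holds.
    Entry[] mergeBatch(Entry[] backing, Entry[] batch)
    {
        batch.sort!((a, b) => a.code[] < b.code[]);
        auto merged = new Entry[backing.length + batch.length];
        size_t i, j, k;
        while (i < backing.length && j < batch.length)
            merged[k++] = backing[i].code[] <= batch[j].code[] ? backing[i++] : batch[j++];
        while (i < backing.length) merged[k++] = backing[i++];
        while (j < batch.length)   merged[k++] = batch[j++];
        return merged;
    }

    void main()
    {
        char[8] c1 = "EXT-0001", c2 = "EXT-0002", c3 = "EXT-0003";
        auto merged = mergeBatch([Entry(c1, 1), Entry(c3, 3)], [Entry(c2, 2)]);
        assert(merged.length == 3 && merged[1].id == 2);
    }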
Nov 05 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-Nov-2013 00:36, Charles Hixson writes:
[...]
What are your requirements for this data structure, anyway?
Still no answer. I'm asking it only because designing any data structure without answering this first is an exercise in futility. It seems like the priorities are lookup speed and data size. Search for succinct data structures if you really want to dive that deep; I'd recommend it only for fun, not for the task at hand.
 I'm not sure about the batch adds.  I certainly won't do it at first,
 while the data is small.  The thing is, I want to keep as much as
 feasible rolled out most of the time.  That means that while it's active
 in "RAM" that RAM is actually pages on the disk of virtual memory.  With
 a large hash table, that's not likely to be controllable.  What I'm
 doing here is optimizing for "speed and..." where and includes active
 memory use (as opposed to rolled out virtual memory).  A part of what
 makes a hash table efficient is that it's always in RAM.
The moment you start using the data, it all gets into RAM pretty soon (potentially with a few page faults). After that point you may as well forget about the paging system and virtual memory. The thing is that the number of page faults is environment-dependent anyway and is "a constant cost" compared to any operations on the data.
 If you need to
 roll it in to access it, then you lose most of the efficiency.
 A sorted
 array is a pretty straightforwards data structure.  It's reasonably
 quick to access, though not as fast as a hash table.
 The real killer is
 insertion (or deletion).  And it doesn't lose anywhere near as much by
 being rolled out (for one thing, each page holds about twice as much
 data, so you don't need as many reads).
I doubt you'll ever be able to measure this effect on top of OS-induced noise. The thing is, the OS will keep pages in RAM forever anyway if there is plenty of it. And if it's paging on every access to your tiny dataset, you have far bigger problems anyway.
 And it's easy to save to
 external disk.  (You can size it easily so that when starting up the
 program again, the array is initialized at the correct size.)
*Yawn* Anyhow, you'd serialize it in some format. Maybe even text ;)
 I had actually for a long time considered just using a database, but I
 can't control how the database caches things in RAM.  I want the cache
 to have a long term usage memory as well as recent use, and doing that
 with a database is just too complex.  Yes, this is sort of like a
 database, but except for some databases that store everything in RAM,
 it's not the same.
There is a recurring theme throughout your post: you seem to require certain implementation details but have no high-level goals specified. It's kind of backwards.
 OTOH, if this were the entire program, you would be right that just
 using a hash table was a better idea.  But this is actually almost a
 side show, rather like GUI's are sideshows to what the program is
 actually doing.  This is what the program needs to interface
 intelligibly with people.  As such, it's not a heavily used part, and
 it's best if it doesn't consume excess active RAM.  (For that matter, I
 *know* my computer is too small for what I'm trying to do.  So using
 extra RAM is undesirable. I may eventually decide to have the thing that
 I'm currently calling "the array" live out on disk, but it's my
 understanding that this isn't much of a gain over having it just rolled
 out by virtual memory.)
  I've also seriously considered using a sort of
 database (I'm thinking of KyotoCabinet), but doing it this way yields
 fewer external dependencies, and the hash cache is probably a better
 choice.  (Again, with KyotoCabinet I don't have as much control over
 what it decides to keep in the cache.)
 I expect that the total memory
 needs of the program will edge towards the Terabyte range, so most of it
 is going to need to live on disk no matter what I do.
Quoting you before that:
 I'm contemplating an associative array that will eventually grow to be an estimated 64KB in size, assuming it's about half full. It would then be holding around 90,000 entries.
It's anyway sooo tiny compared to the rest; you'd better focus on getting the main program efficient. Even if you saved all of that RAM (assume the cache weighs 0 bytes), you'd save 64KB. Now compared with at least 1GB of RAM (or whatever you have to work with for said 1TB), that's 64K/1G, or about 0.006% of RAM gained. Anyhow, I'd spend that time learning how memory-mapped I/O works instead of sweating over that 0.006% (which is the absolute maximum, and one you'd never achieve, btw). All in all, the fruit worth optimizing is where the potential gain is large, something that is proportional to your dataset.
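
For reference, memory-mapped I/O over a file of fixed-length blocks is quite short in D using std.mmfile; a rough sketch (the file name and record layout are invented, error handling omitted):

    import std.mmfile : MmFile;
    import std.stdio : writeln;

    struct Record { char[8] code; uint id; uint link; }   // one fixed-length block (16 bytes)

    void main()
    {
        auto mmf  = new MmFile("nodes.dat");     // read-only mapping of the whole file
        auto recs = cast(Record[]) mmf[];        // view the mapping as fixed-length records
        // Pages are faulted in by the OS only as records are actually touched:
        if (recs.length > 0)
            writeln("first record id: ", recs[0].id);
    }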
 But I don't know
 in detail just what's going to be involved there.  I've roughed out a

 be transformed into a file offset, but how to decide what should be
 removed from RAM isn't something I've figured out yet.  You could think
 of the basic datastructure of the program as a 40 dimensional cyclic
 graph with one-way links, but that's not quite right.  The links carry
 more information than is typical of graph structures (though logically
 that's no change...you could model them as nodes along the links that
 act as filters). OTOH, the file structure that supports this is quite
 simple.  Just a direct access file with fixed length blocks.
 As for "to be added" being separate, that's not hard either.  That could
 be either a hash table or an array. though a hash table would probably
 be simpler.  And it's just another place I'd need to check when looking
 to see if I already recognized the code.  Still, that's something that
 should really be saved until later, as it can be added on at any time.
 (Which probably means I'll never get around to it.)
What I can say is that there is never enough time to optimize every bit of a system. Spend time wisely. -- Dmitry Olshansky
Nov 06 2013