
digitalmars.D - RE: Are Gigantic Associative Arrays Now Possible?

dlangPupil <x x989898998.com> writes:
Hello to all!  As a (D, and coding and forum) newbie I'm learning 
about D's associative arrays (AAs), and their tiny latency 
regardless of AA size. Cool!
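
For concreteness, a minimal sketch of the basic AA operations I mean (nothing Optane-specific here, just the ordinary built-in behaviour):

import std.stdio : writeln;

void main()
{
    string[string] capitals;            // built-in AA mapping string -> string
    capitals["Norway"] = "Oslo";
    capitals["Japan"]  = "Tokyo";

    if (auto p = "Japan" in capitals)   // `in` returns a pointer, or null on a miss
        writeln("found: ", *p);

    capitals.remove("Norway");
    writeln(capitals.length);           // prints 1
}

Insertion, lookup and removal are all expected O(1) on average, independent of how many entries the table holds.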

One practical limitation on the maximum AA size is 
memory/disk paging.  But maybe this limit could be overcome with 
the latest SSDs, whose nonvolatile memory can be addressed like 
RAM.

The article below says that Intel Optane SSDs:
	-allow reads and writes on individual bytes.
	-have a latency 10x that of DRAM (but AAs' latency is so low that 
this might not matter in many cases).
	-currently offer 375GB of "RAM" for $1,500.
	-will support up to 3 TB on 2 socket Xeon systems (48TB on 
4-socket).
	-will be supplemented with Optane DIMMs in the future.

Some questions that arise are...

1) Wouldn't using such "RAM" eliminate any paging issue for 
super-gigantic AAs?
2) What other bottlenecks could arise for gigantic AAs, e.g., 
garbage collection?
3) Would an append-only data design mitigate GC or other 
bottlenecks?
4) Has anyone tried this out?

What a coup if D could "be the first" lang to make this 
practical.  Thanks.

https://arstechnica.com/information-technology/2017/03/intels-first-optane-ssd-375gb-that-you-can-also-use-as-ram/
Mar 22 2017
Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Wednesday, 22 March 2017 at 20:00:56 UTC, dlangPupil wrote:
 Hello to all!  As a (D, and coding and forum) newbie I'm 
 learning about D's associative arrays (AAs), and their tiny 
 latency regardless of AA size. Cool!

 One practical limitation on the maximum AA size is 
 memory/disk paging.  But maybe this limit could be overcome 
 with the latest SSDs, whose nonvolatile memory can be addressed 
 like RAM.

 The article below says that Intel Optane SSDs:
 	-allow reads and writes on individual bytes.
 	-have a latency 10x that of DRAM (but AAs' latency is so low that 
 this might not matter in many cases).
 	-currently offer 375GB of "RAM" for $1,500.
 	-will support up to 3 TB on 2 socket Xeon systems (48TB on 
 4-socket).
 	-will be supplemented with Optane DIMMs in the future.

 Some questions that arise are...

 1) Wouldn't using such "RAM" eliminate any paging issue for 
 super-gigantic AAs?
 2) What other bottlenecks could arise for gigantic AAs, e.g., 
 garbage collection?
 3) Would an append-only data design mitigate GC or other 
 bottlenecks?
 4) Has anyone tried this out?

 What a coup if D could "be the first" lang to make this 
 practical.  Thanks.

 https://arstechnica.com/information-technology/2017/03/intels-first-optane-ssd-375gb-that-you-can-also-use-as-ram/
Hi. I am very interested in this topic, although I don't really have answers for your questions. See the ACM article which I mentioned last year, I think, on the forum: https://queue.acm.org/detail.cfm?id=2874238

"For the entire careers of most practicing computer scientists, a fundamental observation has consistently held true: CPUs are significantly more performant and more expensive than I/O devices. The fact that CPUs can process data at extremely high rates, while simultaneously servicing multiple I/O devices, has had a sweeping impact on the design of both hardware and software for systems of all sizes, for pretty much as long as we've been building them.

This assumption, however, is in the process of being completely invalidated. The arrival of high-speed, non-volatile storage devices, typically referred to as Storage Class Memories (SCM), is likely the most significant architectural change that datacenter and software designers will face in the foreseeable future. SCMs are increasingly part of server systems, and they constitute a massive change: the cost of an SCM, at $3-5k, easily exceeds that of a many-core CPU ($1-2k), and the performance of an SCM (hundreds of thousands of I/O operations per second) is such that one or more entire many-core CPUs are required to saturate it. This change has profound effects:

1. The age-old assumption that I/O is slow and computation is fast is no longer true: this invalidates decades of design decisions that are deeply embedded in today's systems.

2. The relative performance of layers in systems has changed by a factor of a thousand times over a very short time: this requires rapid adaptation throughout the systems software stack.

3. Piles of existing enterprise datacenter infrastructure—hardware and software—are about to become useless (or, at least, very inefficient): SCMs require rethinking the compute/storage balance and architecture from the ground up."

It's a massive relative price shock - storage vs CPU - and whole structures around it will need to be utterly transformed. Intel say it's the biggest technological breakthrough since the internet.

I'm good at recognising moments of exhaustion in old trends and the beginning of new ones, and I've thought for some time now that people were complacently squandering CPU performance just as conditions were tending to favour it becoming important again: https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow/answer/Laeeth-Isharc?srid=35gE

I don't know how data structures and file systems should adapt to this. But I do think the prospective return on efficient code just went up a lot - as far as I can see, this shift is very good for D.
Mar 23 2017
Ola Fosheim Grøstad writes:
On Wednesday, 22 March 2017 at 20:00:56 UTC, dlangPupil wrote:
 1) Wouldn't using such "RAM" eliminate any paging issue for 
 super-gigantic AAs?
 2) What other bottlenecks could arise for gigantic AAs, e.g., 
 garbage collection?
Increasing the size of a hash table would be prohibitively expensive. You need a data-structure that can grow gracefully.
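
A rough way to see the growth cost with the built-in AA is to time a bulk insert; the runtime has to re-bucket the whole table each time it outgrows its bucket array (the size below is arbitrary):

import std.stdio : writefln;
import core.time : MonoTime;

void main()
{
    enum long n = 5_000_000;   // arbitrary; push this up to stress the growth path
    long[long] aa;

    auto start = MonoTime.currTime;
    foreach (long i; 0 .. n)
        aa[i] = i;             // growing the table periodically re-buckets all existing entries
    auto elapsed = MonoTime.currTime - start;

    writefln("inserted %s keys in %s", aa.length, elapsed);
}

At terabyte scale a single such re-bucketing pass would touch every entry, which is why a gracefully growing structure matters.
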
 What a coup if D could "be the first" lang to make this 
 practical.  Thanks.
This is more an OS issue than a language issue, although it could boost high-level numeric-oriented languages and functional languages that cache expensive computations. Low-level languages are pretty much in the same boat.
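
For what it's worth, that caching pattern already exists in Phobos: std.functional.memoize stores results in a lookup table (an AA internally, last I checked), so a huge nonvolatile "RAM" would mostly just let such caches grow much larger:

import std.functional : memoize;
import std.stdio : writeln;

ulong slowSquare(ulong n)
{
    // stand-in for an expensive computation
    return n * n;
}

void main()
{
    alias cachedSquare = memoize!slowSquare;

    writeln(cachedSquare(12));   // computed once, then cached
    writeln(cachedSquare(12));   // served from the memoization table
}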
Mar 23 2017
dlangPupil <x x989898998.com> writes:
On Thursday, 23 March 2017 at 10:27:36 UTC, Ola Fosheim Grøstad 
wrote:
 Increasing the size of a hash table would be prohibitively 
 expensive. You need a data-structure that can grow gracefully.
Hi Ola,

Are the hash tables you refer to the ones that D uses in the background to implement associative arrays, or the ones that a programmer might create using AAs?

--If the former, then perhaps the AA's hash function could be tweaked for terabyte-range SSD "RAM". But even if the typical 2:1 bucket/data storage ratio couldn't be improved, then creating 32 TB of buckets would still allow 16 TB of nearly-instantly-addressable data (on a 4-socket Xeon w/ 48 TB Optane SSD).

--If the latter, then my goal is to design something that makes specific items randomly and instantly accessible, like a hash table, but with zero potential collisions.

Thanks!
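
As far as I can tell, D doesn't expose a global knob for the AA's hash function, but a key type can carry its own toHash/opEquals, so per-type tweaking is possible. A sketch with a hypothetical 128-bit GUID key (the hash constant is arbitrary):

struct Guid
{
    ulong hi, lo;

    size_t toHash() const @safe pure nothrow
    {
        // fold the two halves together; the multiplier is an arbitrary odd constant
        return cast(size_t)(hi ^ (lo * 0x9E3779B97F4A7C15UL));
    }

    bool opEquals(const Guid rhs) const @safe pure nothrow
    {
        return hi == rhs.hi && lo == rhs.lo;
    }
}

void main()
{
    string[Guid] docs;
    docs[Guid(1, 42)] = "first document";
    assert(Guid(1, 42) in docs);
}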
Mar 23 2017
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Mar 24, 2017 at 12:27:00AM +0000, dlangPupil via Digitalmars-d wrote:
 On Thursday, 23 March 2017 at 10:27:36 UTC, Ola Fosheim Grøstad wrote:
 
 Increasing the size of a hash table would be prohibitively
 expensive.  You need a data-structure that can grow gracefully.
 Hi Ola,

 Are the hash tables you refer to the ones that D uses in the background to implement associative arrays, or the ones that a programmer might create using AAs?

 --If the former, then perhaps the AA's hash function could be tweaked for terabyte-range SSD "RAM". But even if the typical 2:1 bucket/data storage ratio couldn't be improved, then creating 32 TB of buckets would still allow 16 TB of nearly-instantly-addressable data (on a 4-socket Xeon w/ 48 TB Optane SSD).

 --If the latter, then my goal is to design something that makes specific items randomly and instantly accessible, like a hash table, but with zero potential collisions.

 Thanks!
How do SSD access speeds compare with L1 and L2 cache speeds?

You have to keep in mind that one downside of hashtables is that they tend to be unfriendly towards caching hierarchies, because generally the hash function is designed to scatter hash keys as widely and as randomly as possible (to minimize collisions), meaning that potentially *every* hash lookup will be a cache miss, perhaps more, depending on how collision resolution is handled, which can be a big performance killer.

For certain applications, where key lookups are more-or-less random, this is the best you could do, but for a good number of applications, there tends to be some correlation between lookups, and even things like B-trees could potentially do better because of their cache-friendliness (and locality), even if you assume I/O is much faster than your traditional spindle-based hard drive. The O(log n) could have a much smaller constant than the hashtable O(1) if lookups tend to be correlated (i.e., exhibit locality), and lots more cache misses are incurred in the latter.

Having said that, though, large-capacity low-latency SSDs are very interesting to me because I'm interested in certain applications that involve very large lookup tables. Currently I'm just using traditional hashtables, but once you get to a certain size, I/O overhead dominates and progress grinds to a halt. I've been researching ways of taking advantage of locality to ease this, but so far haven't gotten it to actually work yet.


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Mar 23 2017
dlangPupil <x x989898998.com> writes:
On Friday, 24 March 2017 at 06:30:25 UTC, H. S. Teoh wrote:

 You have to keep in mind that one downside of hashtables is 
 that they tend to be unfriendly towards caching hierarchies
...
 For certain applications, where key lookups are more-or-less 
 random, this is the best you could do, but for a good number of 
 applications, there tends to be some correlation between 
 lookups, and even things like B-trees could potentially do 
 better because of their cache-friendliness (and locality), even 
 if you assume I/O is much faster than your traditional 
 spindle-based hard drive. The O(log n) could have a much 
 smaller constant than the hashtable O(1) if lookups tend to be 
 correlated (i.e., exhibit locality), and lots more cache misses 
 are incurred in the latter.

 Having said that, though, large-capacity low-latency SSDs are 
 very interesting to me because I'm interested in certain 
 applications that involve very large lookup tables.  Currently 
 I'm just using traditional hashtables, but once you get to a 
 certain size, I/O overhead dominates and progress grinds to a 
 halt.  I've been researching ways of taking advantage of 
 locality to ease this, but so far haven't gotten it to actually 
 work yet.


 T
Thanks T for the great insight. Very helpful! Nice to see that you, Laeeth, the ACM, Intel and Micron all agree: this new technology could be very disruptive.

In addition to prevalence of random lookups, certain additional app conditions and designs might make "gigantic AAs" more useful or competitive with alternatives:

1. Use by programmers who need a data structure that's easier to reason about, or faster and safer to implement or prototype, and easier for future users to read and maintain.

2. Apps with large tables that use (or could use) non-natural keys, e.g., crypto-quality GUIDs, which can't benefit from (or impose the extra computational cost of) the sorting that must be performed and maintained on natural key values (e.g., LastName="Smith") in order to use a B-tree search.

3. Tables (and related tables) whose row data (and even aggregate data) could be "clumped" (think "documentized") to achieve a singularity (and not merely some degree of locality), thus consolidating multiple lookups into a single lookup (sketched in code at the end of this post).
	-In other words, a key's value is itself an array of data that can be fetched in a single lookup.
	-The key id for each "next" (updated) version of a document could also be stored with the current version.
	-The next key-value entry to hold the update could be created immediately but not have its values written unless and until the prior version of the document has changed.
	-Better yet, in an append-only design, creation of the next key-value entry could be deferred until a revision actually occurs.
	-Such "chained documentization" would automate histories, and could be further abstracted and adapted (e.g., with columnstore concepts) to accommodate apps in which lookups aren't mostly random.

Finally, re: caches: I haven't found whether it is or isn't possible to combine a server's DRAM and its Optane SSD RAM or DIMMs to form a single pool of RAM. Mixing DIMMs of different speeds is a no-no; but if this could be done, then the hottest data could be "cached" in DRAM and thus spared the 10x latency penalty of the SSD device.
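
A sketch of the "chained documentization" idea in point 3, with every name here made up for illustration: each AA value is a whole document, so one lookup fetches all of it, plus the key of the next revision, filled in only once a revision exists.

import std.typecons : Nullable;

struct Document
{
    string[]       lines;     // the clumped row/aggregate data, fetched in one lookup
    Nullable!ulong nextKey;   // key of the next revision, set only when one exists
}

void main()
{
    Document[ulong] store;

    store[1] = Document(["v1 of the report"]);

    // A later revision gets its own entry, and the old entry is linked forward.
    store[2] = Document(["v2 of the report"]);
    store[1].nextKey = 2;

    assert(store[1].nextKey.get == 2);
    assert(store[2].nextKey.isNull);   // the latest revision has no successor yet
}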
Mar 24 2017
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Mar 24, 2017 at 04:59:00PM +0000, dlangPupil via Digitalmars-d wrote:
[...]
 In addition to prevalence of random lookups, certain additional app
 conditions and designs might make "gigantic AAs" more useful or
 competitive with alternatives:
 
 1.  Use by programmers who need a data structure that's easier to
 reason about or faster and safer to implement or prototype, and easier
 for future users to read and maintain.
This point is arguable, since a B-tree is pretty well-known and thus not difficult to reason about. Generally, whether you use a hashtable, a B-tree, or something else, the high-level view is that it's some opaque data structure that, given some key K, returns some value V. Which underlying algorithm is used is an implementation detail.

And generally, as far as application programmers are concerned, you wouldn't want to implement an advanced data structure from scratch; you'd use an off-the-shelf library that has already implemented it.
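
For example, Phobos has no B-tree, but std.container.rbtree.RedBlackTree is the nearest off-the-shelf ordered container if you want to experiment with correlated lookups and range scans. A sketch, not a claim that it beats the built-in AA for any particular workload:

import std.container.rbtree : redBlackTree;
import std.stdio : writeln;

void main()
{
    auto index = redBlackTree(5L, 1L, 9L, 3L, 7L);   // kept in sorted order

    index.insert(4);

    // Range query: everything in [3, 8) comes out in key order,
    // walking neighbouring nodes rather than random buckets.
    foreach (key; index.upperBound(2))
    {
        if (key >= 8) break;
        writeln(key);   // 3, 4, 5, 7
    }
}
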
 2.  Apps with large tables that use (or could use) non-natural keys,
 e.g., crypto-quality GUIDs, which can't benefit from (or impose the
 extra computational cost of) the sorting that must be performed and
 maintained on natural key values (e.g., LastName="Smith") in order to
 use a B-Tree search.
Encrypted GUIDs is a very good point. You probably can't expect much locality from that. But this point, and the following, presume a rather specific use case, i.e., document retrieval, presumably over the network. The application of hashtables and B-trees, however, is far wider in scope. In particular, what I had in mind when I asked about cache speeds is compute-intensive (as opposed to I/O intensive) processes. These two domains of application are quite different in some ways.

In document retrieval, for example, if you view it as a server serving documents over the network to a set of clients, having 1 I/O roundtrip per lookup is probably acceptable, because the time it takes to send the document over the network to the client is large enough that making lookups any faster probably won't make too much of a difference. Besides, if client requests are essentially unpredictable, then 1 I/O roundtrip is about the best you could do.

In CPU-intensive computations, however, the cache hierarchy becomes very prominent, because you're talking about CPU speeds vs. RAM speeds, which is currently a rather large gap. Being cache-friendly could mean a difference on the scale of orders of magnitude in terms of overall performance. If you're talking about data lookups inside a compute-intensive inner loop, having 1 I/O roundtrip per iteration is unacceptable. Even 1 L1 cache miss per iteration may be unacceptable, depending on the application. Ideally, you want to maximize the number of iterations per L1 cache miss, and then on the next level, squeeze as many L1 misses onto L2 as possible before you have to go to L3 or revert to RAM, because RAM is slow, relatively speaking, even if it's still very fast compared to an I/O roundtrip to the hard disk. In other words, you want as much locality as you can possibly get in your memory accesses -- the entire cache hierarchy, after all, is built on the premise of memory accesses exhibiting locality.

Furthermore, when it comes to CPU-intensive applications, quite often you already know the pattern of lookups beforehand, since it's usually coming from within the application itself, not from external clients that may be completely unpredictable. So you can usually predict the next n memory accesses much more easily, and take advantage of it by organizing your data structure around that.

[...]
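
One concrete way to exploit that predictability, as a sketch: when the next batch of keys is known up front, sort the batch and probe a sorted key array, so successive probes land near each other instead of striding randomly across a hash table:

import std.algorithm.sorting : sort;
import std.range : assumeSorted;
import std.stdio : writefln;

void main()
{
    // Sorted key array standing in for the real lookup structure.
    long[] table = [2, 3, 5, 8, 13, 21, 34, 55, 89];
    auto sortedTable = table.assumeSorted;

    long[] batch = [55, 3, 21, 2];   // lookups known ahead of time
    batch.sort();                    // after sorting, probes sweep left to right

    foreach (key; batch)
        writefln("%s present: %s", key, sortedTable.contains(key));
}
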
 Finally, re: caches: I haven't found whether it is or isn't possible
 to combine a server's DRAM and its Optane SSD RAM or DIMMs to form a
 single pool of RAM.  Mixing DIMMs of different speeds is a no-no; but 
 if this could be done, then the hottest data could be "cached" in DRAM
 and thus spared the 10x latency penalty of the SSD device.
If the SSD device incurs a 10x latency penalty, that's still a far cry from L1/L2 cache latencies! True, it will help in the worst case when you have no choice but to take an I/O roundtrip (a hard drive would be far slower)... but this would hardly be anything revolutionary if DRAM is still faster. At the end of the day, you're still talking about a memory hierarchy with L1/L2 at the top, L3 and RAM in between, and hard drive / SSD at the bottom.

If so, then the cache-unfriendliness of hashtables still applies, and we still have to work with finding ways of improving hashtable locality, e.g., with locality-sensitive hashing, cache-oblivious hashtables, etc., or use a more cache-friendly data structure like B-trees or one of the newer cache-oblivious trees.

(In my case, though, B-trees may not represent much of an improvement, because I'm dealing with high-dimensional data that cannot be easily linearized to take maximum advantage of B-tree locality. So at some level I still need some kind of hash-like structure to work with my data. But it will probably have some tree-like structure to it, because of the (high-dimensional) locality it exhibits.)


T

-- 
Without geometry, life would be pointless. -- VS
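
P.S. On the "hottest data in DRAM" idea: whatever the hardware tiering ends up looking like, the software-side version is just a small in-memory table in front of a slower lookup. A hypothetical sketch (all names made up, and no eviction policy at all):

import std.conv : to;

struct TieredStore(K, V)
{
    V[K] hot;                   // "DRAM" tier
    V delegate(K) coldLookup;   // slow tier: SSD, disk, network, ...

    V get(K key)
    {
        if (auto p = key in hot)
            return *p;              // fast path: served from the hot tier
        auto value = coldLookup(key);
        hot[key] = value;           // promote for next time
        return value;
    }
}

void main()
{
    string fetchFromSlowTier(long k)
    {
        // pretend this goes out to the SSD-resident table
        return "value for key " ~ k.to!string;
    }

    auto store = TieredStore!(long, string)(null, &fetchFromSlowTier);

    assert(store.get(7) == "value for key 7");   // cold miss, promoted
    assert(store.get(7) == "value for key 7");   // now served from the hot AA
}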
Mar 24 2017
dlangPupil <x x989898998.com> writes:
On Friday, 24 March 2017 at 17:48:35 UTC, H. S. Teoh wrote:

 (In my case, though, B-trees may not represent much of an 
 improvement, because I'm dealing with high-dimensional data 
 that cannot be easily linearized to take maximum advantage of 
 B-tree locality. So at some level I still need some kind of 
 hash-like structure to work with my data. But it will probably 
 have some tree-like structure to it, because of the 
 (high-dimensional) locality it exhibits.)


 T
Hi T,

Your problem is intriguing and definitely stretching my mind! I'll be factoring your ideas into my app design as I go along. Some techniques that might be relevant to your app, if only for relative performance comparisons, might be:

	-Using metadata in lieu of actual data to maximize the number of rows "represented" in the caches.
	-Using one or more columnstores, both intra- and extra-cache, to allow transformations of one or more fields of one or more rows with extremely small read, computation and write costs (a minimal sketch follows at the end of this post).
	-Scaling the app horizontally, if possible.
	-Using stored procedures on a SQL, NoSQL or NewSQL DBMS to harness the DBMS's bulk-processing and high-throughput capabilities.

I'd love to hear whatever details you can share about your app. Alternatively, I made a list of a dozen or so questions that would help me think about how to approach your problem. If you're interested in pursuing either avenue, let me know!

Thanks again
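
The columnstore item above, as a minimal struct-of-arrays sketch (field names made up): each field lives in its own contiguous array, so transforming one field across all rows streams through memory without touching the others.

import std.algorithm.iteration : sum;
import std.stdio : writeln;

struct Orders
{
    ulong[]  id;
    double[] amount;
    string[] customer;

    void applyDiscount(double factor)
    {
        foreach (ref a; amount)    // touches only the `amount` column
            a *= factor;
    }
}

void main()
{
    Orders orders;
    orders.id       = [1, 2, 3];
    orders.amount   = [10.0, 20.0, 30.0];
    orders.customer = ["ada", "bob", "eve"];

    orders.applyDiscount(0.9);
    writeln(orders.amount.sum);    // 54
}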
Mar 24 2017
Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 22 March 2017 at 20:00:56 UTC, dlangPupil wrote:
 Hello to all!  As a (D, and coding and forum) newbie I'm 
 learning about D's associative arrays (AAs), and their tiny 
 latency regardless of AA size. Cool!
Where can this be read? Is there a name for this particular implementation of an AA?
Mar 23 2017
dlangPupil <x x989898998.com> writes:
On Thursday, 23 March 2017 at 20:09:46 UTC, Nordlöw wrote:
...
 Where can this be read? Is there a naming for this particular 
 implementation of an AA?
Hi Nordlöw,

Associative arrays are introduced in Ali Çehreli's great book, Programming in D – Tutorial and Reference, Ch. 28, and in Section 1.4 of Andrei Alexandrescu's equally great "The D Programming Language." Ali's book is at http://ddili.org/ders/d.en/aa.html

The particular data structure I'd like to implement is not a simple hash, but rather a mixture of relational, key-value, columnstore and document database structures. I'll share it after I've "hashed" it out and thought of a cool name for it!
Mar 23 2017