www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - AA rehash threshold

reply Steven Schveighoffer <schveiguy yahoo.com> writes:
 From all I remember about hash tables, you are supposed to minimize the 
number of collisions when adding/looking up data. This helps keep the 
O(1) lookup property intact.

Reading wikipedia 
(http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing):

"To keep the load factor under a certain limit, e.g. under 3/4, many 
table implementations expand the table when items are inserted. For 
example, in Java's HashMap class the default load factor threshold for 
table expansion is 0.75 and in Python's dict, table size is resized when 
load factor is greater than 2/3."

Well, I just realized that D AA's "load factor" is 4 (see below). That 
means, for a hash table with 4 buckets, you need to add 17 items to 
generate a rehash. It also means that you are guaranteed to have one 
bucket with at least 4 elements on it before a rehash. And that's on an 
evenly spread hash table. In most cases, we would see buckets with 
around 8-10 elements.

I don't think this is optimal, but I'm nervous about correcting this -- 
going from a load factor of 4 to a load factor of 0.75 would be a 
tremendous jump. I'm not certain how the interaction with the GC and the 
list of primes in the AA runtime will fare with this new load.

Any thoughts from the experts out there?

-Steve

Code that uses factor of 4: 
https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221
Nov 18 2014
next sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/18/14 9:16 PM, Steven Schveighoffer wrote:

 Code that uses factor of 4:
 https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221
BTW, I traced this load factor all the way back to the initial commit of druntime. It's been this way since 2009. -Steve
Nov 18 2014
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
 From all I remember about hash tables, you are supposed to 
 minimize the number of collisions when adding/looking up data. 
 This helps keep the O(1) lookup property intact.
That is the theory, but the real world is sometime different. It depends on how you resolve collisions. If you do it via some kind of linked list, then yes, you must remove all collisions. But if you do it by putting the element in a slot close to its actual position, then thing gets different. In that case, you don't want to minimize collision, but to minimize the distance to the object, in order for it to be in the cache line. After all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable. You may want to read about robin hood hash table for a simple example of this. These hash tables are fast even with a very high collision rate.
Nov 18 2014
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/18/14 9:46 PM, deadalnix wrote:
 On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
 Schveighoffer wrote:
 From all I remember about hash tables, you are supposed to minimize
 the number of collisions when adding/looking up data. This helps keep
 the O(1) lookup property intact.
That is the theory, but the real world is sometime different. It depends on how you resolve collisions. If you do it via some kind of linked list, then yes, you must remove all collisions.
This is what AAs currently do.
 But if you do it by putting the element in a slot close to its
 actual position, then thing gets different. In that case, you
 don't want to minimize collision, but to minimize the distance to
 the object, in order for it to be in the cache line.
Heh, a load factor of 4 wouldn't work too well there :) I do recall reading about all the different cache miss handling. The wikipedia article I referenced has a lot of good info.
 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.
In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup. -Steve
Nov 18 2014
parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/18/14 9:46 PM, deadalnix wrote:
 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.
In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup.
This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array. If you have large objects as the elements, you'll waste space. Probably the implementation needs to be switchable depending on the size of the object. As an aside, I find it invaluable to have statistics about hash table buckets, ala C++11. When building custom hash functions and hash table implementations, it's almost necessary. I'd like to see bucket stats available in some manner for built-in AAs. If they're going away for a library type, we should have such stat info as part of the API. Jerry
Nov 20 2014
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/20/14 5:30 PM, Jerry Quinn wrote:
 Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/18/14 9:46 PM, deadalnix wrote:
 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.
In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup.
This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array.
Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node. Where it gets dicey is when you rehash. Some of the nodes will be moved into the bucket space, some will move out of it. But I don't think it's a big deal, as a rehash doesn't happen very often.
 If you have large objects as the elements, you'll waste space.  Probably
 the implementation needs to be switchable depending on the size of the object.
Yes, this is also a concern. This almost necessitates a minimum load as well as a maximum.
 As an aside, I find it invaluable to have statistics about hash table
 buckets, ala C++11.  When building custom hash functions and hash table
 implementations, it's almost necessary.

 I'd like to see bucket stats available in some manner for built-in AAs.
 If they're going away for a library type, we should have such stat info
 as part of the API.
Absolutely all of this could be added to a library type. -Steve
Nov 20 2014
parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/20/14 5:30 PM, Jerry Quinn wrote:
 This works nicely for small types, but has gotchas.  For example, if
 you've got an AA of ints, what value indicates that this is a value
 folded into the bucket entry vs actually being a pointer?  You'll need
 extra space to make sure it's safe.  Alignment is another concern.
 Let's say you have bool for the entry/element flag.  With ints you've
 doubled the size of the bucket array.
Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node.
Almost. You do need a way to indicate that the bucket is empty. So you need another flag though it can be smaller than a pointer (for 64 bit). But the overhead is low except for things like ints.
 Where it gets dicey is when you rehash. Some of the nodes will be moved into
 the bucket space, some will move out of it. But I don't think it's a big deal,
 as a rehash doesn't happen very often.
True. Umm, that does raise a question - are addresses of AA contents required to be stable? C++11 requires that of its hashtables. But it ties your hands when trying to write customized hash tables.
Nov 20 2014
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/20/14 8:42 PM, Jerry Quinn wrote:
 Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/20/14 5:30 PM, Jerry Quinn wrote:
 This works nicely for small types, but has gotchas.  For example, if
 you've got an AA of ints, what value indicates that this is a value
 folded into the bucket entry vs actually being a pointer?  You'll need
 extra space to make sure it's safe.  Alignment is another concern.
 Let's say you have bool for the entry/element flag.  With ints you've
 doubled the size of the bucket array.
Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node.
Almost. You do need a way to indicate that the bucket is empty. So you need another flag though it can be smaller than a pointer (for 64 bit). But the overhead is low except for things like ints.
It's easy enough to set the "next" pointer to some invalid-but-not-null value. Hackish, but it would work ;)
 Where it gets dicey is when you rehash. Some of the nodes will be moved into
 the bucket space, some will move out of it. But I don't think it's a big deal,
 as a rehash doesn't happen very often.
True. Umm, that does raise a question - are addresses of AA contents required to be stable? C++11 requires that of its hashtables. But it ties your hands when trying to write customized hash tables.
I don't think we make any guarantees about that. As it stands now, whatever is implemented seems to be what people think is the spec. But it's not always true. We have so many more options when AA's become a library type. We could say "the builtin AA is this flavor of hashtable", but provide options for many different flavors with the same template if you want specific ones. -Steve
Nov 20 2014
prev sibling next sibling parent "Martin Nowak" <code dawg.eu> writes:
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven 
Schveighoffer wrote:
 Any thoughts from the experts out there?
We'll throw the current implementation away and implement an open addressing AA once we get to it. So don't worry about the load factor right now. https://github.com/D-Programming-Language/dmd/pull/4088#issuecomment-63152848
Nov 18 2014
prev sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven 
Schveighoffer wrote:
 Well, I just realized that D AA's "load factor" is 4 (see 
 below). That means, for a hash table with 4 buckets, you need 
 to add 17 items to generate a rehash. It also means that you 
 are guaranteed to have one bucket with at least 4 elements on 
 it before a rehash. And that's on an evenly spread hash table. 
 In most cases, we would see buckets with around 8-10 elements.
I'd like this to be configurable once we have the new and improved AA implementation. 4 does seem a bit high though. The default max load factor for C++ unordered containers is 1.0.
Nov 18 2014