## digitalmars.D - AA rehash threshold

• Steven Schveighoffer (24/24) Nov 18 2014 From all I remember about hash tables, you are supposed to minimize the...
• Steven Schveighoffer (4/6) Nov 18 2014 BTW, I traced this load factor all the way back to the initial commit of...
• deadalnix (16/19) Nov 18 2014 That is the theory, but the real world is sometime different. It
• Steven Schveighoffer (12/29) Nov 18 2014 Heh, a load factor of 4 wouldn't work too well there :)
• Jerry Quinn (16/28) Nov 20 2014 This works nicely for small types, but has gotchas. For example, if
• Steven Schveighoffer (11/39) Nov 20 2014 Hm.. the bucket entry will simply be what a node is now. You are
• Jerry Quinn (7/20) Nov 20 2014 Almost. You do need a way to indicate that the bucket is empty. So you
• Steven Schveighoffer (10/31) Nov 20 2014 It's easy enough to set the "next" pointer to some invalid-but-not-null
• Martin Nowak (6/7) Nov 18 2014 We'll throw the current implementation away and implement an open
• Sean Kelly (5/11) Nov 18 2014 I'd like this to be configurable once we have the new and
Steven Schveighoffer <schveiguy yahoo.com> writes:
``` From all I remember about hash tables, you are supposed to minimize the
number of collisions when adding/looking up data. This helps keep the
O(1) lookup property intact.

(http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing):

"To keep the load factor under a certain limit, e.g. under 3/4, many
table implementations expand the table when items are inserted. For
example, in Java's HashMap class the default load factor threshold for
table expansion is 0.75 and in Python's dict, table size is resized when
load factor is greater than 2/3."

Well, I just realized that D AA's "load factor" is 4 (see below). That
means, for a hash table with 4 buckets, you need to add 17 items to
generate a rehash. It also means that you are guaranteed to have one
bucket with at least 4 elements on it before a rehash. And that's on an
evenly spread hash table. In most cases, we would see buckets with
around 8-10 elements.

I don't think this is optimal, but I'm nervous about correcting this --
going from a load factor of 4 to a load factor of 0.75 would be a
tremendous jump. I'm not certain how the interaction with the GC and the
list of primes in the AA runtime will fare with this new load.

Any thoughts from the experts out there?

-Steve

Code that uses factor of 4:
```
Nov 18 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:
```On 11/18/14 9:16 PM, Steven Schveighoffer wrote:

Code that uses factor of 4:

BTW, I traced this load factor all the way back to the initial commit of
druntime. It's been this way since 2009.

-Steve
```
Nov 18 2014
```On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
From all I remember about hash tables, you are supposed to
minimize the number of collisions when adding/looking up data.
This helps keep the O(1) lookup property intact.

That is the theory, but the real world is sometime different. It
depends on how you resolve collisions. If you do it via some kind
of linked list, then yes, you must remove all collisions.

But if you do it by putting the element in a slot close to its
actual position, then thing gets different. In that case, you
don't want to minimize collision, but to minimize the distance to
the object, in order for it to be in the cache line.

After all, unless your hash table is very small, the first hit is
most likely a cache miss, meaning ~300 cycles. at this point the
cache line is in L1, with a 3 cycle access time. So accessing
several element in the same cache line is often preferable.

You may want to read about robin hood hash table for a simple
example of this. These hash tables are fast even with a very high
collision rate.
```
Nov 18 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:
```On 11/18/14 9:46 PM, deadalnix wrote:
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
From all I remember about hash tables, you are supposed to minimize
the number of collisions when adding/looking up data. This helps keep
the O(1) lookup property intact.

That is the theory, but the real world is sometime different. It
depends on how you resolve collisions. If you do it via some kind
of linked list, then yes, you must remove all collisions.

This is what AAs currently do.

But if you do it by putting the element in a slot close to its
actual position, then thing gets different. In that case, you
don't want to minimize collision, but to minimize the distance to
the object, in order for it to be in the cache line.

Heh, a load factor of 4 wouldn't work too well there :)

I do recall reading about all the different cache miss handling. The
wikipedia article I referenced has a lot of good info.

After all, unless your hash table is very small, the first hit is
most likely a cache miss, meaning ~300 cycles. at this point the
cache line is in L1, with a 3 cycle access time. So accessing
several element in the same cache line is often preferable.

In the case of AAs, all bucket elements are pointers, so this point is
moot for our purposes -- one may have to jump outside the cache to find
the first element. A good improvement (this is likely to come with a
full library type) is to instead store an inline element for each bucket
entry instead of just a pointer to an element. I recall when writing
dcollections, this added a significant speedup.

-Steve
```
Nov 18 2014
Jerry Quinn <jlquinn optonline.net> writes:
```Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/18/14 9:46 PM, deadalnix wrote:
After all, unless your hash table is very small, the first hit is
most likely a cache miss, meaning ~300 cycles. at this point the
cache line is in L1, with a 3 cycle access time. So accessing
several element in the same cache line is often preferable.

In the case of AAs, all bucket elements are pointers, so this point is moot
for our purposes -- one may have to jump outside the cache to find the first
element. A good improvement (this is likely to come with a full library type)
is to instead store an inline element for each bucket entry instead of just a
pointer to an element. I recall when writing dcollections, this added a
significant speedup.

This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

If you have large objects as the elements, you'll waste space.  Probably
the implementation needs to be switchable depending on the size of the object.

As an aside, I find it invaluable to have statistics about hash table
buckets, ala C++11.  When building custom hash functions and hash table
implementations, it's almost necessary.

I'd like to see bucket stats available in some manner for built-in AAs.
If they're going away for a library type, we should have such stat info
as part of the API.

Jerry
```
Nov 20 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:
```On 11/20/14 5:30 PM, Jerry Quinn wrote:
Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/18/14 9:46 PM, deadalnix wrote:
After all, unless your hash table is very small, the first hit is
most likely a cache miss, meaning ~300 cycles. at this point the
cache line is in L1, with a 3 cycle access time. So accessing
several element in the same cache line is often preferable.

In the case of AAs, all bucket elements are pointers, so this point is moot
for our purposes -- one may have to jump outside the cache to find the first
element. A good improvement (this is likely to come with a full library type)
is to instead store an inline element for each bucket entry instead of just a
pointer to an element. I recall when writing dcollections, this added a
significant speedup.

This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

Hm.. the bucket entry will simply be what a node is now. You are
actually saving space, because you don't need to store that extra
pointer to the first node.

Where it gets dicey is when you rehash. Some of the nodes will be moved
into the bucket space, some will move out of it. But I don't think it's
a big deal, as a rehash doesn't happen very often.

If you have large objects as the elements, you'll waste space.  Probably
the implementation needs to be switchable depending on the size of the object.

Yes, this is also a concern. This almost necessitates a minimum load as
well as a maximum.

As an aside, I find it invaluable to have statistics about hash table
buckets, ala C++11.  When building custom hash functions and hash table
implementations, it's almost necessary.

I'd like to see bucket stats available in some manner for built-in AAs.
If they're going away for a library type, we should have such stat info
as part of the API.

Absolutely all of this could be added to a library type.

-Steve
```
Nov 20 2014
Jerry Quinn <jlquinn optonline.net> writes:
```Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/20/14 5:30 PM, Jerry Quinn wrote:
This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

Hm.. the bucket entry will simply be what a node is now. You are actually
saving space, because you don't need to store that extra pointer to the first
node.

Almost. You do need a way to indicate that the bucket is empty.  So you
need another flag though it can be smaller than a pointer (for 64 bit).  But
the overhead is low except for things like ints.

Where it gets dicey is when you rehash. Some of the nodes will be moved into
the bucket space, some will move out of it. But I don't think it's a big deal,
as a rehash doesn't happen very often.

True.  Umm, that does raise a question - are addresses of AA contents
required to be stable?  C++11 requires that of its hashtables.  But it
ties your hands when trying to write customized hash tables.
```
Nov 20 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:
```On 11/20/14 8:42 PM, Jerry Quinn wrote:
Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/20/14 5:30 PM, Jerry Quinn wrote:
This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

Hm.. the bucket entry will simply be what a node is now. You are actually
saving space, because you don't need to store that extra pointer to the first
node.

Almost. You do need a way to indicate that the bucket is empty.  So you
need another flag though it can be smaller than a pointer (for 64 bit).  But
the overhead is low except for things like ints.

It's easy enough to set the "next" pointer to some invalid-but-not-null
value. Hackish, but it would work ;)

Where it gets dicey is when you rehash. Some of the nodes will be moved into
the bucket space, some will move out of it. But I don't think it's a big deal,
as a rehash doesn't happen very often.

True.  Umm, that does raise a question - are addresses of AA contents
required to be stable?  C++11 requires that of its hashtables.  But it
ties your hands when trying to write customized hash tables.

I don't think we make any guarantees about that. As it stands now,
whatever is implemented seems to be what people think is the spec. But
it's not always true.

We have so many more options when AA's become a library type. We could
say "the builtin AA is this flavor of hashtable", but provide options
for many different flavors with the same template if you want specific ones.

-Steve
```
Nov 20 2014
"Martin Nowak" <code dawg.eu> writes:
```On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
Any thoughts from the experts out there?

We'll throw the current implementation away and implement an open
factor right now.
https://github.com/D-Programming-Language/dmd/pull/4088#issuecomment-63152848
```
Nov 18 2014
"Sean Kelly" <sean invisibleduck.org> writes:
```On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
Well, I just realized that D AA's "load factor" is 4 (see
below). That means, for a hash table with 4 buckets, you need
to add 17 items to generate a rehash. It also means that you
are guaranteed to have one bucket with at least 4 elements on
it before a rehash. And that's on an evenly spread hash table.
In most cases, we would see buckets with around 8-10 elements.

I'd like this to be configurable once we have the new and
improved AA implementation.  4 does seem a bit high though.  The
default max load factor for C++ unordered containers is 1.0.
```
Nov 18 2014