digitalmars.D.learn - Memory usage of AAs?

Nick Sabalausky (4/4) Mar 29 2011 My understanding of hash tables is that they allocate a fixed size array...

spir (20/24) Mar 29 2011 Each element is a data structure, often called bucket (typically a link ...

Nick Sabalausky (24/47) Mar 29 2011 Right, I know that, but that's not what I was asking. Take this hypothet...

Steven Schveighoffer (27/89) Mar 30 2011 It varies. The hash table size is not constant, the load factor is. Th...

spir (8/92) Mar 30 2011 IIRC, this is because buckets are (minimal) link lists. Cells holding ke...

"Nick Sabalausky" <a a.a> writes:

My understanding of hash tables is that they allocate a fixed size array and 
map keys to indicies within the range 0..predefined_length_of_the_AA.

So I've been wondering, how many elements do D's built-in AAs have? And 
what's the content of each one, just a single pointer?

Mar 29 2011

spir <denis.spir gmail.com> writes:

On 03/30/2011 01:24 AM, Nick Sabalausky wrote:
 My understanding of hash tables is that they allocate a fixed size array and
 map keys to indicies within the range 0..predefined_length_of_the_AA.

 So I've been wondering, how many elements do D's built-in AAs have? And
 what's the content of each one, just a single pointer?

Each element is a data structure, often called bucket (typically a link list), 
storing (key:value) pairs for which the key, once hashed and modulo-ed, maps to 
the given index. That's why the famous O(1) lookup time for hash tables is very 
theoretic: the constant part holds average time for hashing which is very 
variable, plus another average time for linearly traversing the bucket. The 
latter part depends on table load factor (= number of elements / number of 
buckets) and proper statistical repartition of keys into buckets.

The key (!) points are finding a good hash func to "linearize" said repartition 
(which depends on actual key-data domain, not only on their type...), but 
better ones rapidly eat much time (ones used in practice are rather simple & 
fast); and finding optimal load factor, and growth scheme. In practice, all of 
this tends to make hash tables an implementation nightmare (for me). I'd love 
to find practicle alternatives, but efficient binary trees also are complex and 
even more depend on kind of keys, I guess.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 29 2011

"Nick Sabalausky" <a a.a> writes:

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.2909.1301443345.4748.digitalmars-d-learn puremagic.com...
 On 03/30/2011 01:24 AM, Nick Sabalausky wrote:
 My understanding of hash tables is that they allocate a fixed size array 
 and
 map keys to indicies within the range 0..predefined_length_of_the_AA.

 So I've been wondering, how many elements do D's built-in AAs have? And
 what's the content of each one, just a single pointer?

 Each element is a data structure, often called bucket (typically a link 
 list), storing (key:value) pairs for which the key, once hashed and 
 modulo-ed, maps to the given index. That's why the famous O(1) lookup time 
 for hash tables is very theoretic: the constant part holds average time 
 for hashing which is very variable, plus another average time for linearly 
 traversing the bucket. The latter part depends on table load factor (= 
 number of elements / number of buckets) and proper statistical repartition 
 of keys into buckets.

 The key (!) points are finding a good hash func to "linearize" said 
 repartition (which depends on actual key-data domain, not only on their 
 type...), but better ones rapidly eat much time (ones used in practice are 
 rather simple & fast); and finding optimal load factor, and growth scheme. 
 In practice, all of this tends to make hash tables an implementation 
 nightmare (for me). I'd love to find practicle alternatives, but efficient 
 binary trees also are complex and even more depend on kind of keys, I 
 guess.

Right, I know that, but that's not what I was asking. Take this hypothetical 
implementation:

struct Bucket(TKey, TVal)
{
    ...
}

enum hashTableSize = ...;

struct Hash(TKey, TVal)
{
    Bucket!(TKey, TVal)[hashTableSize] data;

    TVal get(TKey key) { ... }
    void set(TKey key, TVal value) { ... }
}

I assume that D's AA are something at least vaguely like that. My questions 
are:

1. What does D use for "hashTableSize"? Or does "hashTableSize" vary? If it 
varies, what's a typical rough ballpark size? (And just out of curiosity, if 
it varies, is it decided at compile-time, or does it change even at 
runtime?)

2. What is "sizeof(Bucket(TKey, TVal))"? And I mean the shallow size, not 
deep size. Is it dependent on TKey or TVal? Or is it just simply a pointer 
to the start of a linked list (and therefore "sizeof(size_t)")?

Mar 29 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 29 Mar 2011 22:20:05 -0400, Nick Sabalausky <a a.a> wrote:

 "spir" <denis.spir gmail.com> wrote in message
 news:mailman.2909.1301443345.4748.digitalmars-d-learn puremagic.com...
 On 03/30/2011 01:24 AM, Nick Sabalausky wrote:
 My understanding of hash tables is that they allocate a fixed size  
 array
 and
 map keys to indicies within the range 0..predefined_length_of_the_AA.

 So I've been wondering, how many elements do D's built-in AAs have? And
 what's the content of each one, just a single pointer?

 Each element is a data structure, often called bucket (typically a link
 list), storing (key:value) pairs for which the key, once hashed and
 modulo-ed, maps to the given index. That's why the famous O(1) lookup  
 time
 for hash tables is very theoretic: the constant part holds average time
 for hashing which is very variable, plus another average time for  
 linearly
 traversing the bucket. The latter part depends on table load factor (=
 number of elements / number of buckets) and proper statistical  
 repartition
 of keys into buckets.

 The key (!) points are finding a good hash func to "linearize" said
 repartition (which depends on actual key-data domain, not only on their
 type...), but better ones rapidly eat much time (ones used in practice  
 are
 rather simple & fast); and finding optimal load factor, and growth  
 scheme.
 In practice, all of this tends to make hash tables an implementation
 nightmare (for me). I'd love to find practicle alternatives, but  
 efficient
 binary trees also are complex and even more depend on kind of keys, I
 guess.

 Right, I know that, but that's not what I was asking. Take this  
 hypothetical
 implementation:

 struct Bucket(TKey, TVal)
 {
     ...
 }

 enum hashTableSize = ...;

 struct Hash(TKey, TVal)
 {
     Bucket!(TKey, TVal)[hashTableSize] data;

     TVal get(TKey key) { ... }
     void set(TKey key, TVal value) { ... }
 }

 I assume that D's AA are something at least vaguely like that. My  
 questions
 are:

 1. What does D use for "hashTableSize"? Or does "hashTableSize" vary? If  
 it
 varies, what's a typical rough ballpark size? (And just out of  
 curiosity, if
 it varies, is it decided at compile-time, or does it change even at
 runtime?)

It varies.  The hash table size is not constant, the load factor is.  The  
load factor is the number of elements in the hash divided by the number of  
buckets.  You never want to fill up all the spaces, because the more full  
you get, the more chances for collisions there are.  Essentially, the  
tricky part about hashing is what to do about collisions (two elements are  
different, but go in the same bucket).

So what happens is when the load factor exceeds a predefined constant  
(e.g. in dcollections the load factor defaults to .75), the table  
"rehashes", or increases (usually logarithmically) the size of the array,  
and re-inserts all its elements.

I believe there is a minimum array size, and things are increased from  
there.  I think also you can do a manual "rehash" which should adjust the  
size of the array to match a certain load factor (below the maximum).

In some implementations, hashes will even shrink when the load factor goes  
below a minimum (dcollections does not do this to avoid invalidating  
ranges).  There are a million different ways to implement the basic hash.   
The most complex part though, is usually the collision handling.  In my  
algo book, there are at least 3 ways to handle collisions, and I think  
there are countless more.  If you look up hashing on wikipedia, you'll get  
a much better explanation.

 2. What is "sizeof(Bucket(TKey, TVal))"? And I mean the shallow size, not
 deep size. Is it dependent on TKey or TVal? Or is it just simply a  
 pointer
 to the start of a linked list (and therefore "sizeof(size_t)")?

Here is the AA implementation:

https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d

 From that page, you can see that AA is your bucket (note this is runtime  
stuff, so there are no templates), and BB is your Hash struct.  It looks  
like BB has an array of AA pointers.

-Steve

Mar 30 2011

spir <denis.spir gmail.com> writes:

On 03/30/2011 03:31 PM, Steven Schveighoffer wrote:
 On Tue, 29 Mar 2011 22:20:05 -0400, Nick Sabalausky <a a.a> wrote:

 "spir" <denis.spir gmail.com> wrote in message
 news:mailman.2909.1301443345.4748.digitalmars-d-learn puremagic.com...
 On 03/30/2011 01:24 AM, Nick Sabalausky wrote:
 My understanding of hash tables is that they allocate a fixed size array
 and
 map keys to indicies within the range 0..predefined_length_of_the_AA.

 So I've been wondering, how many elements do D's built-in AAs have? And
 what's the content of each one, just a single pointer?

 Each element is a data structure, often called bucket (typically a link
 list), storing (key:value) pairs for which the key, once hashed and
 modulo-ed, maps to the given index. That's why the famous O(1) lookup time
 for hash tables is very theoretic: the constant part holds average time
 for hashing which is very variable, plus another average time for linearly
 traversing the bucket. The latter part depends on table load factor (=
 number of elements / number of buckets) and proper statistical repartition
 of keys into buckets.

 The key (!) points are finding a good hash func to "linearize" said
 repartition (which depends on actual key-data domain, not only on their
 type...), but better ones rapidly eat much time (ones used in practice are
 rather simple & fast); and finding optimal load factor, and growth scheme.
 In practice, all of this tends to make hash tables an implementation
 nightmare (for me). I'd love to find practicle alternatives, but efficient
 binary trees also are complex and even more depend on kind of keys, I
 guess.

 Right, I know that, but that's not what I was asking. Take this hypothetical
 implementation:

 struct Bucket(TKey, TVal)
 {
 ...
 }

 enum hashTableSize = ...;

 struct Hash(TKey, TVal)
 {
 Bucket!(TKey, TVal)[hashTableSize] data;

 TVal get(TKey key) { ... }
 void set(TKey key, TVal value) { ... }
 }

 I assume that D's AA are something at least vaguely like that. My questions
 are:

 1. What does D use for "hashTableSize"? Or does "hashTableSize" vary? If it
 varies, what's a typical rough ballpark size? (And just out of curiosity, if
 it varies, is it decided at compile-time, or does it change even at
 runtime?)

 It varies. The hash table size is not constant, the load factor is. The load
 factor is the number of elements in the hash divided by the number of buckets.
 You never want to fill up all the spaces, because the more full you get, the
 more chances for collisions there are. Essentially, the tricky part about
 hashing is what to do about collisions (two elements are different, but go in
 the same bucket).

 So what happens is when the load factor exceeds a predefined constant (e.g. in
 dcollections the load factor defaults to .75), the table "rehashes", or
 increases (usually logarithmically) the size of the array, and re-inserts all
 its elements.

 I believe there is a minimum array size, and things are increased from there. I
 think also you can do a manual "rehash" which should adjust the size of the
 array to match a certain load factor (below the maximum).

Yes, there is a rehash method.

 In some implementations, hashes will even shrink when the load factor goes
 below a minimum (dcollections does not do this to avoid invalidating ranges).
 There are a million different ways to implement the basic hash. The most
 complex part though, is usually the collision handling. In my algo book, there
 are at least 3 ways to handle collisions, and I think there are countless more.
 If you look up hashing on wikipedia, you'll get a much better explanation.

 2. What is "sizeof(Bucket(TKey, TVal))"? And I mean the shallow size, not
 deep size. Is it dependent on TKey or TVal? Or is it just simply a pointer
 to the start of a linked list (and therefore "sizeof(size_t)")?

 Here is the AA implementation:

 https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d

  From that page, you can see that AA is your bucket (note this is runtime
 stuff, so there are no templates), and BB is your Hash struct. It looks like BB
 has an array of AA pointers.

IIRC, this is because buckets are (minimal) link lists. Cells holding key:value 
are list nodes.

 -Steve

-- 
_________________
vita es estrany
spir.wikidot.com

Mar 30 2011

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Memory usage of AAs?