www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Strings hash value memoized

reply bearophile <bearophileHUGS lycos.com> writes:
I presume AAs are used often enough in D. If D 2.x strings are immutable, then
their hash function doesn't change, so it can be useful to store the hash value
inside the data structure of the string (Python works like this), by default it
can be initialized to a value like -1, so it's computed only the first time
it's actually needed, avoiding wasting time to compute the hash value if it's
not used.

Having the hash value inside strings and other immutable data structures allows
quite faster AAs (and sets too, even if they are inside an external library).
Currently strings aren't special structures, they are just immutable dynamical
arrays of chars, so the has value (32 or 64 bit according to the word size of
the CPU, I presume) can't be stored in the dynamic array record, so it may be
stored in the -4/-8 bytes of the data block, but this can't work for string
slices. A possible solution is to create a special string type that's just like
the dynamic array struct, plus the hash value, so this struct becomes bigger
(12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.

Bye,
bearophile
Jul 12 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g5a8h5$gmk$1 digitalmars.com...
I presume AAs are used often enough in D. If D 2.x strings are immutable, 
then their hash function doesn't change, so it can be useful to store the 
hash value inside the data structure of the string (Python works like 
this), by default it can be initialized to a value like -1, so it's 
computed only the first time it's actually needed, avoiding wasting time to 
compute the hash value if it's not used.

 Having the hash value inside strings and other immutable data structures 
 allows quite faster AAs (and sets too, even if they are inside an external 
 library).
 Currently strings aren't special structures, they are just immutable 
 dynamical arrays of chars, so the has value (32 or 64 bit according to the 
 word size of the CPU, I presume) can't be stored in the dynamic array 
 record, so it may be stored in the -4/-8 bytes of the data block, but this 
 can't work for string slices. A possible solution is to create a special 
 string type that's just like the dynamic array struct, plus the hash 
 value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit 
 CPUs), slowing down some operations a little.
Or, you store the precomputed hash in the structure (i.e. AA) that uses them. Which is exactly what the default AA implementation does, actually.
Jul 12 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Jarrett Billingsley:
 Or, you store the precomputed hash in the structure (i.e. AA) that uses 
 them.  Which is exactly what the default AA implementation does, actually.
Having the hash value in the string has some advantages, if have two strings and both of them have their hash value is already computed, you have a way to test quickly if they are different. You can compute intersection and union between two string sets quickly. If the same string is in more AAs/sets you compute and store its hash value only once for its life. Bye, bearophile
Jul 12 2008
prev sibling next sibling parent "Koroskin Denis" <2korden gmail.com> writes:
On Sat, 12 Jul 2008 16:36:53 +0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 I presume AAs are used often enough in D. If D 2.x strings are  
 immutable, then their hash function doesn't change, so it can be useful  
 to store the hash value inside the data structure of the string (Python  
 works like this), by default it can be initialized to a value like -1,  
 so it's computed only the first time it's actually needed, avoiding  
 wasting time to compute the hash value if it's not used.

 Having the hash value inside strings and other immutable data structures  
 allows quite faster AAs (and sets too, even if they are inside an  
 external library).
 Currently strings aren't special structures, they are just immutable  
 dynamical arrays of chars, so the has value (32 or 64 bit according to  
 the word size of the CPU, I presume) can't be stored in the dynamic  
 array record, so it may be stored in the -4/-8 bytes of the data block,  
 but this can't work for string slices. A possible solution is to create  
 a special string type that's just like the dynamic array struct, plus  
 the hash value, so this struct becomes bigger (12 instead of 8 bytes on  
 32 bit CPUs), slowing down some operations a little.

 Bye,
 bearophile
Why distiguish between strings and other data type? Just wrap any invariant data type into a special structure that lazily computes and stores hash, and use it in a AA instead.
Jul 12 2008
prev sibling parent JAnderson <ask me.com> writes:
bearophile wrote:
 I presume AAs are used often enough in D. If D 2.x strings are immutable, then
their hash function doesn't change, so it can be useful to store the hash value
inside the data structure of the string (Python works like this), by default it
can be initialized to a value like -1, so it's computed only the first time
it's actually needed, avoiding wasting time to compute the hash value if it's
not used.
 
 Having the hash value inside strings and other immutable data structures
allows quite faster AAs (and sets too, even if they are inside an external
library).
 Currently strings aren't special structures, they are just immutable dynamical
arrays of chars, so the has value (32 or 64 bit according to the word size of
the CPU, I presume) can't be stored in the dynamic array record, so it may be
stored in the -4/-8 bytes of the data block, but this can't work for string
slices. A possible solution is to create a special string type that's just like
the dynamic array struct, plus the hash value, so this struct becomes bigger
(12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.
 
 Bye,
 bearophile
Sounds like a good case for a standard lib function. Potentially the hashing part could be a generic class for any type with a particular spealization for strings. It could also be lazily computed (ie compute it the first time its needed then cache it). Maybe something like: alias hashed(string) hashed_string; alias hashed(int[]) hashed_array; I guess one downside is that compile time string hashes may have to be figured out at run-time with that approach. -Joel
Jul 12 2008