www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Perfect hashing for string switch

reply bearophile <bearophileHUGS lycos.com> writes:
This is just a possible optimization, it's not a new feature request.

A perfect hash computed at compile time can be good to implement the string
switches, I have created a small benchmark:
http://codepad.org/YVFfGpsM

Timings, ldc, seconds:
  test1: 4.48 // normal switch
  test2: 2.98 // perfect hash
  test3: 2.09
  test4: 2.07
  test5: 5.44 // AA

Notice the very nice thing that I have computed the hash at compile time in
test2 switch cases, you can't do that in a simple way in C :-)

You can't compute the coefficient tables of the perfect hash at compile-time in
D because probably this is currently too much slow to do, unless D compilers
become staged compilers and compile the code they have to run at run time too
:o) (Well, an intermediate and less intrusive solution is to use LLVM to
Just-In-time compile the code that LDC runs at compile time, to speed that up.)

A possible problem is that the compilation time grows if the number of strings
is high.

The version based on the perfect hash (test2) becomes more efficient if the
number of wrong strings gets lower.

In Tango the AA opIn_r is really slow, I think it can be a bug. I have not
found a Tango dev interested in this so far.

Note that the perfect hash I have used in this little example is minimal, so
there's a better way for the compiler to implement that conversion table (as I
have shown in test4 that assumes a minimal perfect hash. It's about as fast as
the test3 version that can tolerate a table a bit sparse), but the code I have
written works equally well if the hash isn't minimal (this is positive to
decrease the compilation time): in that case the table case gets a little more
sparse, but the running time is the same, because it's just a little sparse, so
llvm compiles it still with table (with holes).

GCC is able to compile this code better, I have asked LLVM devs to implement
the same thing (this is not about perfect hashing, it can be a way for the
compiler to use it better in certain situations, where you have a conversion
table or string->int compile-time constant associative array):
http://llvm.org/bugs/show_bug.cgi?id=6009

Bye,
bearophile
Jan 22 2010
next sibling parent reply BCS <none anon.com> writes:
Hello bearophile,

 This is just a possible optimization, it's not a new feature request.
 
 A perfect hash computed at compile time can be good to implement the
 string switches, I have created a small benchmark:

Have you compared it to a decisition tree or lex style state mechine?
Jan 22 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
BCS:
 Have you compared it to a decisition tree or lex style state mechine?

I have not, I am sorry :-) But more comparisons can be added. I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile
Jan 22 2010
parent BCS <none anon.com> writes:
Hello bearophile,

 BCS:
 
 Have you compared it to a decisition tree or lex style state mechine?
 

I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile

I'm not sure how I'd set it up but I expect it would amount to a hard coded radex sort: - do a normal integer switch on string length - for each length group the strings based on common prefixes (if you have aaab, aaac, aabb, bbbb the grouping would be: [[aaab, aaac], aabb], [bbbb]) - walk down the tree using if statements. I guess it's not much different than a lex DFA but for a small list of static strings it might be faster. Also it has the option of using strcmp for tails and long sub spans.
Jan 22 2010
prev sibling parent reply strtr <strtr spam.com> writes:
bearophile Wrote: <>

Wouldn't perfect hashing for compile time AAs be a better starting point?
Jan 22 2010
parent bearophile <bearophileHUGS lycos.com> writes:
strtr:
 Wouldn't perfect hashing for compile time AAs be a better starting point?

AA keys can be all kind of things, while here I was talking just about perfect hashing of strings, it's a bit less complex problem. So I think it's better to start from string switch first and go on CT AAs later. Bye, bearophile
Jan 22 2010