digitalmars.D - Perfect hashing for string switch
- bearophile <bearophileHUGS lycos.com> Jan 22 2010
- BCS <none anon.com> Jan 22 2010
- bearophile <bearophileHUGS lycos.com> Jan 22 2010
- BCS <none anon.com> Jan 22 2010
- strtr <strtr spam.com> Jan 22 2010
- bearophile <bearophileHUGS lycos.com> Jan 22 2010
This is just a possible optimization, it's not a new feature request. A perfect hash computed at compile time can be good to implement the string switches, I have created a small benchmark: http://codepad.org/YVFfGpsM Timings, ldc, seconds: test1: 4.48 // normal switch test2: 2.98 // perfect hash test3: 2.09 test4: 2.07 test5: 5.44 // AA Notice the very nice thing that I have computed the hash at compile time in test2 switch cases, you can't do that in a simple way in C :-) You can't compute the coefficient tables of the perfect hash at compile-time in D because probably this is currently too much slow to do, unless D compilers become staged compilers and compile the code they have to run at run time too :o) (Well, an intermediate and less intrusive solution is to use LLVM to Just-In-time compile the code that LDC runs at compile time, to speed that up.) A possible problem is that the compilation time grows if the number of strings is high. The version based on the perfect hash (test2) becomes more efficient if the number of wrong strings gets lower. In Tango the AA opIn_r is really slow, I think it can be a bug. I have not found a Tango dev interested in this so far. Note that the perfect hash I have used in this little example is minimal, so there's a better way for the compiler to implement that conversion table (as I have shown in test4 that assumes a minimal perfect hash. It's about as fast as the test3 version that can tolerate a table a bit sparse), but the code I have written works equally well if the hash isn't minimal (this is positive to decrease the compilation time): in that case the table case gets a little more sparse, but the running time is the same, because it's just a little sparse, so llvm compiles it still with table (with holes). GCC is able to compile this code better, I have asked LLVM devs to implement the same thing (this is not about perfect hashing, it can be a way for the compiler to use it better in certain situations, where you have a conversion table or string->int compile-time constant associative array): http://llvm.org/bugs/show_bug.cgi?id=6009 Bye, bearophile
Jan 22 2010
Hello bearophile,This is just a possible optimization, it's not a new feature request. A perfect hash computed at compile time can be good to implement the string switches, I have created a small benchmark:
Have you compared it to a decisition tree or lex style state mechine?
Jan 22 2010
BCS:Have you compared it to a decisition tree or lex style state mechine?
I have not, I am sorry :-) But more comparisons can be added. I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile
Jan 22 2010
Hello bearophile,BCS:Have you compared it to a decisition tree or lex style state mechine?
I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile
I'm not sure how I'd set it up but I expect it would amount to a hard coded radex sort: - do a normal integer switch on string length - for each length group the strings based on common prefixes (if you have aaab, aaac, aabb, bbbb the grouping would be: [[aaab, aaac], aabb], [bbbb]) - walk down the tree using if statements. I guess it's not much different than a lex DFA but for a small list of static strings it might be faster. Also it has the option of using strcmp for tails and long sub spans.
Jan 22 2010
bearophile Wrote: <> Wouldn't perfect hashing for compile time AAs be a better starting point?
Jan 22 2010
strtr:Wouldn't perfect hashing for compile time AAs be a better starting point?
AA keys can be all kind of things, while here I was talking just about perfect hashing of strings, it's a bit less complex problem. So I think it's better to start from string switch first and go on CT AAs later. Bye, bearophile
Jan 22 2010









BCS <none anon.com> 