
digitalmars.D - getHash inconsistency

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Is this a bug?

	char[] a = "abc".dup;
	const(char)[] b = "abc";
	string c = "abc";

	writeln(typeid(a).getHash(&a));	// 12914
	writeln(typeid(b).getHash(&b)); // 8503969683799911018
	writeln(typeid(c).getHash(&c));	// 12914


T

-- 
Obviously, some things aren't very obvious.
Mar 19 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.
Mar 19 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.

[...]

Then the question is, what should be the fix?

Currently, char[] and string have getHash defined in
rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
uses rt.util.hash.hashOf. Which one should be used?


T

-- 
Life would be easier if I had the source code. -- YHL
Mar 20 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...
 On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message
 news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.

 [...]

 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?

 T

 -- 
 Life would be easier if I had the source code. -- YHL

I assume the custom hashing routine is better/faster? If so, that one. I'd assume that getHash not working the same for const strings is an oversight.
Mar 20 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...

[...]
 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?


[...]
 
 I assume the custom hashing routine is better/faster?  If so, that
 one.  I'd assume that getHash not working the same for const strings
 is an oversight. 

[...]

Here's the current hashing code for char[] and string:

	foreach (char c; s)
	    hash = hash * 11 + c;

For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
SuperFastHash algorithm with very good hash distribution properties. It
does seem to involve a lot more operations than the simple loop above,
though; so I assume the above simple loop was chosen because hashing
strings is a very common operation and, for the most part, only needs a
simple hash function.

So I'm kinda leaning towards SuperFastHash, but worried about whether it
will cause performance degradation, in which case we should stick with
the simple loop.


T

-- 
"Maybe" is a strange word. When mom or dad says it it means "yes", but
when my big brothers say it it means "no"! -- PJ jr.
Mar 20 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties. It
 does seem to involve a lot more operations than the simple loop above,
 though; so I assume the above simple loop was chosen because hashing
 strings is a very common operation and, for the most part, only needs a
 simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about whether it
 will cause performance degradation, in which case we should stick with
 the simple loop.


 T

 -- 
 "Maybe" is a strange word.  When mom or dad says it it means "yes", but 
 when my big brothers say it it means "no"! -- PJ jr.

Benchmark time!
Mar 20 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties.
 It does seem to involve a lot more operations than the simple loop
 above, though; so I assume the above simple loop was chosen because
 hashing strings is a very common operation and, for the most part,
 only needs a simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about
 whether it will cause performance degradation, in which case we
 should stick with the simple loop.


[...]
 Benchmark time! 

Alright, so after some benchmarking, I found that the above custom hash
function works best for *short* (3 to 10 character) randomized
alphabetic strings (I assumed alphabetic to be the typical use case of
strings). It's faster than SuperFastHash, and even has better
distribution properties, probably because SuperFastHash is tuned for
arbitrary binary data of arbitrary length, whereas the custom function
is tuned for short string-like data.

With longer strings, SuperFastHash beats the custom algorithm, and
distribution properties are approximately the same.

So I'm still on the fence about which algorithm is better. I can see why
the custom hash was adopted for strings, since your typical AA tends to
have short alphabetic keys, and something like SuperFastHash is probably
overkill. But for longer keys, SuperFastHash is better.


T

-- 
Curiosity kills the cat. Moral: don't be the cat.
Mar 22 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/22/12 1:18 PM, H. S. Teoh wrote:
 Alright, so after some benchmarking, I found that the above custom hash
 function works best for *short* (3 to 10 character) randomized
 alphabetic strings (I assumed alphabetic to be the typical use case of
 strings). It's faster than SuperFastHash, and even has better
 distribution properties, probably because SuperFastHash is tuned for
 arbitrary binary data of arbitrary length, whereas the custom function
 is tuned for short string-like data.

 With longer strings, SuperFastHash beats the custom algorithm, and
 distribution properties are approximately the same.

 So I'm still on the fence about which algorithm is better. I can see why
 the custom hash was adopted for strings, since your typical AA tends to
 have short alphabetic keys, and something like SuperFastHash is probably
 overkill. But for longer keys, SuperFastHash is better.

Note that you can switch the actual algorithm depending on string
length.

Andrei
Mar 22 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 22, 2012, at 11:18 AM, H. S. Teoh wrote:

 On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 
 Here's the current hashing code for char[] and string:
 
       foreach (char c; s)
           hash = hash * 11 + c;
 
 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties.
 It does seem to involve a lot more operations than the simple loop
 above, though; so I assume the above simple loop was chosen because
 hashing strings is a very common operation and, for the most part,
 only needs a simple hash function.
 
 So I'm kinda leaning towards SuperFastHash, but worried about
 whether it will cause performance degradation, in which case we
 should stick with the simple loop.


[...]
 Benchmark time! 

 Alright, so after some benchmarking, I found that the above custom hash
 function works best for *short* (3 to 10 character) randomized
 alphabetic strings (I assumed alphabetic to be the typical use case of
 strings). It's faster than SuperFastHash, and even has better
 distribution properties, probably because SuperFastHash is tuned for
 arbitrary binary data of arbitrary length, whereas the custom function
 is tuned for short string-like data.

 With longer strings, SuperFastHash beats the custom algorithm, and
 distribution properties are approximately the same.

 So I'm still on the fence about which algorithm is better. I can see why
 the custom hash was adopted for strings, since your typical AA tends to
 have short alphabetic keys, and something like SuperFastHash is probably
 overkill. But for longer keys, SuperFastHash is better.

Might be worth plugging in MurmurHash for comparison.
Mar 22 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 20, 2012, at 10:10 PM, H. S. Teoh wrote:

 On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message
 news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...

[...]
 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?


[...]

 I assume the custom hashing routine is better/faster?  If so, that
 one.  I'd assume that getHash not working the same for const strings
 is an oversight.

 [...]

 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties. It
 does seem to involve a lot more operations than the simple loop above,
 though; so I assume the above simple loop was chosen because hashing
 strings is a very common operation and, for the most part, only needs a
 simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about whether it
 will cause performance degradation, in which case we should stick with
 the simple loop.

The thing with hashes is that you usually gain more by getting a good
distribution than you lose by computing the hash. Our AA implementation
should probably store the computed hash value per node if it isn't
already though, so we can compare against that before doing opEquals
when looking up a node, and so it's available when rehashing.
Mar 22 2012