
digitalmars.D - getHash inconsistency

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Is this a bug?

	char[] a = "abc".dup;
	const(char)[] b = "abc";
	string c = "abc";

	writeln(typeid(a).getHash(&a));	// 12914
	writeln(typeid(b).getHash(&b)); // 8503969683799911018
	writeln(typeid(c).getHash(&c));	// 12914


T

-- 
Obviously, some things aren't very obvious.
Mar 19 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.
Mar 19 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.

[...]

Then the question is, what should be the fix?

Currently, char[] and string have getHash defined in
rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
uses rt.util.hash.hashOf. Which one should be used?


T

-- 
Life would be easier if I had the source code. -- YHL
Mar 20 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...
 On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message
 news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...
 Is this a bug?

 char[] a = "abc".dup;
 const(char)[] b = "abc";
 string c = "abc";

 writeln(typeid(a).getHash(&a)); // 12914
 writeln(typeid(b).getHash(&b)); // 8503969683799911018
 writeln(typeid(c).getHash(&c)); // 12914

Of course.

 [...]

 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?

 T

 -- 
 Life would be easier if I had the source code. -- YHL

I assume the custom hashing routine is better/faster? If so, that one. I'd assume that getHash not working the same for const strings is an oversight.
Mar 20 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...

[...]
 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?


[...]
 
 I assume the custom hashing routine is better/faster?  If so, that
 one.  I'd assume that getHash not working the same for const strings
 is an oversight. 

[...]

Here's the current hashing code for char[] and string:

	foreach (char c; s)
	    hash = hash * 11 + c;

For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
SuperFastHash algorithm with very good hash distribution properties. It
does seem to involve a lot more operations than the simple loop above,
though; so I assume the above simple loop was chosen because hashing
strings is a very common operation and, for the most part, only needs a
simple hash function.

So I'm kinda leaning towards SuperFastHash, but worried about whether it
will cause performance degradation, in which case we should stick with
the simple loop.


T

-- 
"Maybe" is a strange word. When mom or dad says it it means "yes", but
when my big brothers say it it means "no"! -- PJ jr.
Mar 20 2012
parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties. It
 does seem to involve a lot more operations than the simple loop above,
 though; so I assume the above simple loop was chosen because hashing
 strings is a very common operation and, for the most part, only needs a
 simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about whether it
 will cause performance degradation, in which case we should stick with
 the simple loop.


 T

 -- 
 "Maybe" is a strange word.  When mom or dad says it it means "yes", but 
 when my big brothers say it it means "no"! -- PJ jr.

Benchmark time!
Mar 20 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties.
 It does seem to involve a lot more operations than the simple loop
 above, though; so I assume the above simple loop was chosen because
 hashing strings is a very common operation and, for the most part,
 only needs a simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about
 whether it will cause performance degradation, in which case we
 should stick with the simple loop.


[...]
 Benchmark time! 

Alright, so after some benchmarking, I found that the above custom hash
function works best for *short* (3 to 10 character) randomized
alphabetic strings (I assumed alphabetic to be the typical use case of
strings). It's faster than SuperFastHash, and even has better
distribution properties, probably because SuperFastHash is tuned for
arbitrary binary data of arbitrary length, whereas the custom function
is tuned for short string-like data.

With longer strings, SuperFastHash beats the custom algorithm, and
distribution properties are approximately the same.

So I'm still on the fence about which algorithm is better. I can see why
the custom hash was adopted for strings, since your typical AA tends to
have short alphabetic keys, and something like SuperFastHash is probably
overkill. But for longer keys, SuperFastHash is better.


T

-- 
Curiosity kills the cat. Moral: don't be the cat.
Mar 22 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/22/12 1:18 PM, H. S. Teoh wrote:
 Alright, so after some benchmarking, I found that the above custom hash
 function works best for *short* (3 to 10 character) randomized
 alphabetic strings (I assumed alphabetic to be the typical use case of
 strings). It's faster than SuperFastHash, and even has better
 distribution properties, probably because SuperFastHash is tuned for
 arbitrary binary data of arbitrary length, whereas the custom function
 is tuned for short string-like data.

 With longer strings, SuperFastHash beats the custom algorithm, and
 distribution properties are approximately the same.

 So I'm still on the fence about which algorithm is better. I can see why
 the custom hash was adopted for strings, since your typical AA tends to
 have short alphabetic keys, and something like SuperFastHash is probably
 overkill. But for longer keys, SuperFastHash is better.

Note that you can switch the actual algorithm depending on string
length.

Andrei
Mar 22 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 22, 2012, at 11:18 AM, H. S. Teoh wrote:

 On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...
 
 Here's the current hashing code for char[] and string:
 
       foreach (char c; s)
           hash = hash * 11 + c;
 
 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties.
 It does seem to involve a lot more operations than the simple loop
 above, though; so I assume the above simple loop was chosen because
 hashing strings is a very common operation and, for the most part,
 only needs a simple hash function.
 
 So I'm kinda leaning towards SuperFastHash, but worried about
 whether it will cause performance degradation, in which case we
 should stick with the simple loop.


[...]
 Benchmark time! 

 Alright, so after some benchmarking, I found that the above custom hash
 function works best for *short* (3 to 10 character) randomized
 alphabetic strings (I assumed alphabetic to be the typical use case of
 strings). It's faster than SuperFastHash, and even has better
 distribution properties, probably because SuperFastHash is tuned for
 arbitrary binary data of arbitrary length, whereas the custom function
 is tuned for short string-like data.

 With longer strings, SuperFastHash beats the custom algorithm, and
 distribution properties are approximately the same.

 So I'm still on the fence about which algorithm is better. I can see why
 the custom hash was adopted for strings, since your typical AA tends to
 have short alphabetic keys, and something like SuperFastHash is probably
 overkill. But for longer keys, SuperFastHash is better.

Might be worth plugging in MurmurHash for comparison.
Mar 22 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 20, 2012, at 10:10 PM, H. S. Teoh wrote:

 On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message
 news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...

[...]
 Then the question is, what should be the fix?

 Currently, char[] and string have getHash defined in
 rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
 algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
 uses rt.util.hash.hashOf. Which one should be used?


[...]

 I assume the custom hashing routine is better/faster?  If so, that
 one.  I'd assume that getHash not working the same for const strings
 is an oversight.

 [...]

 Here's the current hashing code for char[] and string:

        foreach (char c; s)
            hash = hash * 11 + c;

 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
 SuperFastHash algorithm with very good hash distribution properties. It
 does seem to involve a lot more operations than the simple loop above,
 though; so I assume the above simple loop was chosen because hashing
 strings is a very common operation and, for the most part, only needs a
 simple hash function.

 So I'm kinda leaning towards SuperFastHash, but worried about whether it
 will cause performance degradation, in which case we should stick with
 the simple loop.

The thing with hashes is that you usually gain more by getting a good
distribution than you lose by computing the hash. Our AA implementation
should probably store the computed hash value per node if it isn't
already though, so we can compare against that before doing opEquals
when looking up a node, and so it's available when rehashing.
Mar 22 2012