www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.announce - MurmurHash3

reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
Here is an implementation of MurmurHash [1] for D.
http://dpaste.dzfl.pl/1b94ed0aa96e

I'll do a proper pull request later on for addition to std.digest 
if the community feels like it's a valuable addition.

Guillaume

--
1 - https://en.wikipedia.org/wiki/MurmurHash
Dec 10 2015
next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 10 December 2015 at 22:25:21 UTC, Guillaume Chatelet 
wrote:
 Here is an implementation of MurmurHash [1] for D.
 http://dpaste.dzfl.pl/1b94ed0aa96e

 I'll do a proper pull request later on for addition to 
 std.digest if the community feels like it's a valuable addition.

 Guillaume

 --
 1 - https://en.wikipedia.org/wiki/MurmurHash
Seems like it'd be good to have it ready and in place as the upcoming containers work starts materializing.
Dec 10 2015
prev sibling next sibling parent Ilya <ilyayaroshenko gmail.com> writes:
On Thursday, 10 December 2015 at 22:25:21 UTC, Guillaume Chatelet 
wrote:
 Here is an implementation of MurmurHash [1] for D.
 http://dpaste.dzfl.pl/1b94ed0aa96e

 I'll do a proper pull request later on for addition to 
 std.digest if the community feels like it's a valuable addition.

 Guillaume

 --
 1 - https://en.wikipedia.org/wiki/MurmurHash
Great! Could you please add an optimized interface to compute hashes for `uint` , `ulong` and `ulong[2]`? It would very good both for upcoming containers and multidimensional slices https://github.com/D-Programming-Language/phobos/pull/3397 .
Dec 10 2015
prev sibling parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Thursday, 10 December 2015 at 22:25:21 UTC, Guillaume Chatelet 
wrote:
 Here is an implementation of MurmurHash [1] for D.
 http://dpaste.dzfl.pl/1b94ed0aa96e

 I'll do a proper pull request later on for addition to 
 std.digest if the community feels like it's a valuable addition.

 Guillaume

 --
 1 - https://en.wikipedia.org/wiki/MurmurHash
http://dpaste.dzfl.pl/1b94ed0aa96e#line-222 - seed is uint, can it be ulong? Mutmur hash has three stages: 1. Computation of hash for blocks (32bit or 128bit) 2. Compitation of hash for tail (remainder) 3. Finalization. I will be very happy, if step 1 will be represented as an output range. Then it can be used directly like reduce aggregator for ranges and multidimensional slices.
Dec 10 2015
parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
On Friday, 11 December 2015 at 01:51:09 UTC, Ilya wrote:
 http://dpaste.dzfl.pl/1b94ed0aa96e#line-222 - seed is uint, can 
 it be ulong?
Done
 Mutmur hash has three stages:
 1. Computation of hash for blocks (32bit or 128bit)
 2. Compitation of hash for tail (remainder)
 3. Finalization.

 I will be very happy, if step 1 will be represented as an 
 output range. Then it can be used directly like reduce 
 aggregator for ranges and multidimensional slices.
Done Not thoroughly tested but updated for range and taking an ulong seed for MurmurHash3_x64_128: http://dpaste.dzfl.pl/1b94ed0aa96e Not sure I got what you meant about the optimized version. For the return value ? I haven't done any benchmarking yet.
Dec 11 2015
parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Friday, 11 December 2015 at 22:43:00 UTC, Guillaume Chatelet 
wrote:
 On Friday, 11 December 2015 at 01:51:09 UTC, Ilya wrote:
 http://dpaste.dzfl.pl/1b94ed0aa96e#line-222 - seed is uint, 
 can it be ulong?
Done
 Mutmur hash has three stages:
 1. Computation of hash for blocks (32bit or 128bit)
 2. Compitation of hash for tail (remainder)
 3. Finalization.

 I will be very happy, if step 1 will be represented as an 
 output range. Then it can be used directly like reduce 
 aggregator for ranges and multidimensional slices.
Done Not thoroughly tested but updated for range and taking an ulong seed for MurmurHash3_x64_128: http://dpaste.dzfl.pl/1b94ed0aa96e Not sure I got what you meant about the optimized version. For the return value ? I haven't done any benchmarking yet.
Current version is suitable for arrays but not ranges or types. Few examples: 1. Compute hash of ulong. 2. Compute hash of all elements in matrix column (element are in different arrays). I have created output range API draft http://dpaste.dzfl.pl/a24050042758 Ilya
Dec 11 2015
parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
On Saturday, 12 December 2015 at 02:59:21 UTC, Ilya wrote:
 Current version is suitable for arrays but not ranges or types.

 Few examples:
 1. Compute hash of ulong.
 2. Compute hash of all elements in matrix column (element are 
 in different arrays).

 I have created output range API draft 
 http://dpaste.dzfl.pl/a24050042758

 Ilya
I created https://github.com/gchatelet/murmurhash3_d and updated the code a bit. It conforms to the digest template interface, allows pushing ulong[2] and accept ranges. PR welcome :)
Dec 12 2015
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Saturday, 12 December 2015 at 20:12:49 UTC, Guillaume Chatelet 
wrote:
 On Saturday, 12 December 2015 at 02:59:21 UTC, Ilya wrote:
 Current version is suitable for arrays but not ranges or types.

 Few examples:
 1. Compute hash of ulong.
 2. Compute hash of all elements in matrix column (element are 
 in different arrays).

 I have created output range API draft 
 http://dpaste.dzfl.pl/a24050042758

 Ilya
I created https://github.com/gchatelet/murmurhash3_d and updated the code a bit. It conforms to the digest template interface, allows pushing ulong[2] and accept ranges. PR welcome :)
AFAICS this doesn't conform to the digest interface. For example, there should be a `finish` method that returns the hash as a static array (see the ExampleDigest [1]). More importantly, I believe your `put()` implementation only works if it is fed the entire data at once. I haven't tested it, but I believe that the following two calls will have a different result, while they should result in the same hash: hash.put([1,2,3,4,5,6]); vs hash.put([1,2,3]); hash.put([4,5,6]);
Dec 13 2015
parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
On Sunday, 13 December 2015 at 12:44:06 UTC, Marc Schütz wrote:
 AFAICS this doesn't conform to the digest interface. For 
 example, there should be a `finish` method that returns the 
 hash as a static array (see the ExampleDigest [1]).
The structs themselves do not but the alias at the beginning of the file make sure they do. alias MurmurHash3_x86_32 = Digester!SMurmurHash3_x86_32; alias MurmurHash3_x86_128 = Digester!SMurmurHash3_x86_128; alias MurmurHash3_x64_128 = Digester!SMurmurHash3_x64_128;
 More importantly, I believe your `put()` implementation only 
 works if it is fed the entire data at once. I haven't tested 
 it, but I believe that the following two calls will have a 
 different result, while they should result in the same hash:

     hash.put([1,2,3,4,5,6]);

 vs

     hash.put([1,2,3]);
     hash.put([4,5,6]);
I suspect this as well although I haven't tested. I'll add more tests and add the missing logic if needed.
Dec 13 2015
parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
On Sunday, 13 December 2015 at 16:24:35 UTC, Guillaume Chatelet 
wrote:
 On Sunday, 13 December 2015 at 12:44:06 UTC, Marc Schütz wrote:
 [...]
The structs themselves do not but the alias at the beginning of the file make sure they do. alias MurmurHash3_x86_32 = Digester!SMurmurHash3_x86_32; alias MurmurHash3_x86_128 = Digester!SMurmurHash3_x86_128; alias MurmurHash3_x64_128 = Digester!SMurmurHash3_x64_128;
     [...]
I suspect this as well although I haven't tested. I'll add more tests and add the missing logic if needed.
Fixed in last commit. Thx for the heads up Marc.
Dec 13 2015
parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
The last version of the code is available here and is feature 
complete AFAICT
https://github.com/gchatelet/murmurhash3_d/blob/master/murmurhash3.d

Last concern, I declared blockSize in bytes where 
std.digest.digest says it should be in bits. Why does it need to 
be bits ? It looks like HMAC (which needs it) is explicitly 
making sure it's always a multiple of 8 bits.
Dec 19 2015
parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Saturday, 19 December 2015 at 22:15:14 UTC, Guillaume Chatelet 
wrote:
 The last version of the code is available here and is feature 
 complete AFAICT
 https://github.com/gchatelet/murmurhash3_d/blob/master/murmurhash3.d

 Last concern, I declared blockSize in bytes where 
 std.digest.digest says it should be in bits. Why does it need 
 to be bits ? It looks like HMAC (which needs it) is explicitly 
 making sure it's always a multiple of 8 bits.
I was the one who introduced it, and I chose bits instead of bytes because I didn't want to exclude the possibility that there are hashing algos that have a block size not divisible by 8. The algorithms are usually described in terms of bits, not bytes, so it's not unconceivable that such hash functions exist, though I don't know any. (Of course, HMAC wouldn't work with those, but that doesn't mean that other algorithms couldn't.) I'd suggest you change `blockSize` to the number of bits, and introduce an enum `blockSizeInBytes` for internal use.
Dec 20 2015