
digitalmars.D.announce - khash associative array / hash map / hash set

James Blachly <james.blachly gmail.com> writes:
In addition to our interval tree implementation ( 
https://forum.dlang.org/thread/rhrgbl$2n1h$1 digitalmars.com ), I wanted 
to share another tool we have been using extensively here: a D template 
port of the khash component of attractivechaos' klib 
(https://github.com/attractivechaos/klib).

https://code.dlang.org/packages/dklib
https://github.com/blachlylab/dklib

I namespaced this as `dklib.khash` because ultimately we will place 
other klib components in this package as we use them (kavl, for 
instance, is already used in our intervaltree implementation).

Although it is a relatively faithful port of the C code, I tried to 
turn attractivechaos' classic C-style generic programming (#define 
hell) into templated code. It looks a bit awkward in places, but I 
think it turns out very nicely in practice.

opAssign, opIndex, .require, .byKey() are all implemented for ergonomics.
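
Roughly, usage looks like this (a minimal sketch; treat the exact 
instantiation syntax, string-key support, and the require() signature 
here as illustrative rather than documented API; see the repository 
above for the real interface):

```d
// Minimal usage sketch. The khash!(KeyType, ValueType) instantiation
// form, string-key support, and the require() signature are assumptions
// here; see the dklib repository for the actual API.
import std.stdio : writeln;
import dklib.khash : khash;

void main()
{
    khash!(string, int) counts;

    counts["chr1"] = 10;            // insert / overwrite via opIndexAssign
    counts.require("chr2", 0);      // insert a default only if the key is absent
    writeln(counts["chr1"]);        // lookup via opIndex

    foreach (k; counts.byKey())     // iterate keys, AA-style
        writeln(k, " => ", counts[k]);
}
```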

As always, as computational biologists rather than professional software 
engineers, we would be glad to hear feedback and hope others will find 
this useful.

OH, I almost forgot the best part. It is crazy fast.

https://attractivechaos.wordpress.com/2018/01/13/revisiting-hash-table-performance/
https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/

My naive benchmark shows -- compared to emsi_containers.HashMap -- 30% 
faster inserts, 80% faster serial retrieval and 70% faster random 
retrieval. Perhaps I am doing something wrong?
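
In outline, such a naive benchmark looks something like this (a 
self-contained sketch using the built-in AA as a stand-in for the 
container under test; the real comparison swaps in dklib.khash and 
emsi_containers.HashMap):

```d
// Sketch of a naive insert/lookup benchmark (not the exact harness
// behind the numbers above). The built-in AA stands in for the container
// under test so the snippet is self-contained.
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum N = 1_000_000;
    int[int] map;    // stand-in for khash / HashMap

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. N)
        map[i] = i;
    writefln("Inserts finished in %s ms", sw.peek.total!"msecs");

    sw.reset();
    long sum = 0;
    foreach (i; 0 .. N)              // serial retrieval
        sum += map[i];
    writefln("Serial lookups finished in %s ms", sw.peek.total!"msecs");

    sw.reset();
    foreach (i; 0 .. N)              // random retrieval
        sum += map[uniform(0, N)];
    writefln("Random lookups finished in %s ms (checksum %s)",
             sw.peek.total!"msecs", sum);
}
```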

James
Aug 23 2020
ikod <igor.khasilev gmail.com> writes:
On Monday, 24 August 2020 at 01:39:26 UTC, James Blachly wrote:
 OH, I almost forgot the best part. It is crazy fast.

 https://attractivechaos.wordpress.com/2018/01/13/revisiting-hash-table-performance/
 https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/

 My naive benchmark shows -- compared to emsi_containers.HashMap 
 -- 30% faster inserts, 80% faster serial retrieval and 70% 
 faster random retrieval. Perhaps I am doing something wrong?

 James
Thanks, nice job!

You also may want to compare performance with 
https://github.com/ikod/ikod-containers: just add a dependency on 
ikod-containers, then `import ikod.containers;` and use 
ikod.containers.hashmap.HashMap (or an alias for it) as the container. 
I squeezed everything I could out of the open-addressing hash map.
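
Concretely, after adding the dependency on ikod-containers in 
dub.sdl/dub.json, a rough sketch (the put/get method names and the 
template parameter order here are illustrative; only the module path is 
as above, so check the package docs):

```d
// Rough sketch of the setup described above. The module path comes from
// the post; the HashMap!(Key, Value) parameter order and the put/get
// method names are assumptions -- check the ikod-containers docs.
import ikod.containers.hashmap : HashMap;

alias Map = HashMap!(int, string);

void main()
{
    Map m;
    m.put(0, "zero");               // assumed insert API
    auto v = m.get(0, "missing");   // assumed lookup-with-default API
    assert(v == "zero");
}
```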
Aug 23 2020
James Blachly <james.blachly gmail.com> writes:
On Monday, 24 August 2020 at 05:51:59 UTC, ikod wrote:
 On Monday, 24 August 2020 at 01:39:26 UTC, James Blachly wrote:
 OH, I almost forgot the best part. It is crazy fast.

 https://attractivechaos.wordpress.com/2018/01/13/revisiting-hash-table-performance/
 https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/

 My naive benchmark shows -- compared to 
 emsi_containers.HashMap -- 30% faster inserts, 80% faster 
 serial retrieval and 70% faster random retrieval. Perhaps I am 
 doing something wrong?

 James
 Thanks, nice job!

 You also may want to compare performance with 
 https://github.com/ikod/ikod-containers: just add a dependency on 
 ikod-containers, then `import ikod.containers;` and use 
 ikod.containers.hashmap.HashMap (or an alias for it) as the container. 
 I squeezed everything I could out of the open-addressing hash map.
Nice, thank you and great job! Performance looks very comparable, and I 
would be happy to use your package as well. Perhaps it is time that 
Dlang had a faster canonical hashmap (phobos.next?).

```
hashmap benchmarks
Inserts for HashMap finished in 518 milliseconds.
Inserts for khash finished in 549 milliseconds.
Serial Lookups for HashMap finished in 21 milliseconds.
Random lookups for HashMap finished in 41 milliseconds.
Confirming stored value of last lookup: 7353ece9-506c-467f-9cb4-7686426fa828
Serial Lookups for khash finished in 12 milliseconds.
Random lookups for khash finished in 36 milliseconds.
Confirming stored value of last lookup: 1164a2f1-e6cb-4072-89d9-23cec5cadd95
```

Repeated tests show that ikod.containers' HashMap is consistently 
faster on insertions, while khash is consistently faster on retrieval.
Aug 24 2020
ikod <igor.khasilev gmail.com> writes:
On Monday, 24 August 2020 at 20:31:42 UTC, James Blachly wrote:
 On Monday, 24 August 2020 at 05:51:59 UTC, ikod wrote:
 On Monday, 24 August 2020 at 01:39:26 UTC, James Blachly wrote:
 OH, I almost forgot the best part. It is crazy fast.

 https://attractivechaos.wordpress.com/2018/01/13/revisiting-hash-table-performance/
 https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/

 My naive benchmark shows -- compared to 
 emsi_containers.HashMap -- 30% faster inserts, 80% faster 
 serial retrieval and 70% faster random retrieval. Perhaps I 
 am doing something wrong?

 James
 Thanks, nice job!

 You also may want to compare performance with 
 https://github.com/ikod/ikod-containers: just add a dependency on 
 ikod-containers, then `import ikod.containers;` and use 
 ikod.containers.hashmap.HashMap (or an alias for it) as the container. 
 I squeezed everything I could out of the open-addressing hash map.
 Nice, thank you and great job! Performance looks very comparable, and 
 I would be happy to use your package as well. Perhaps it is time that 
 Dlang had a faster canonical hashmap (phobos.next?).
Thanks, but no )

This hashmap can't replace the standard AA, for the following reason. 
With the standard AA you can safely do:

string[int] aa;
aa[0] = "null";
auto v = 0 in aa;
aa.remove(0);
assert(*v == "null");
aa[0] = "one";
assert(*v == "null");

This is because the AA allocates memory in the GC area for every value 
it stores (and returns a pointer to it when "in" is used), so even if 
you delete the key from the AA it is still safe to use the pointer to 
the value. But this requires GC allocations.

Correct me if I'm wrong: your and my HashMaps avoid allocations and 
store values inline, so you can't use pointers to values in safe code 
(the value can be deleted, or replaced on the next table 
insertion/deletion). In order to be safe, my hashmap does not support 
the "in" operator and always returns values.

Also, you may find the safe map modification during iteration over map 
items useful (at the cost of creating a temporary table copy).
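
To make the hazard concrete, a self-contained illustration with plain D 
arrays (not code from either hash map): a pointer taken into inline 
storage stops tracking the container's value as soon as the storage 
moves.

```d
// Illustration of the hazard described above, using a plain array as a
// stand-in for a table's inline value storage: growing (rehashing) the
// storage moves the values, so a previously taken pointer goes stale.
import std.stdio : writeln;

void main()
{
    int[] storage = [1, 2, 3];   // stand-in for an open-addressing value array
    int* p = &storage[0];        // like holding "key in map" across operations

    storage.length = 1_000_000;  // like an insert that forces a rehash:
    storage[0] = 42;             // the array is reallocated, then updated

    // p still points into the old block (which the GC keeps alive here),
    // so it sees a stale value; with malloc-backed storage it would dangle.
    writeln(*p, " vs ", storage[0]);   // typically prints "1 vs 42"
}
```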
 ```
 hashmap benchmarks
 Inserts for HashMap finished in 518 milliseconds.
 Inserts for khash finished in 549 milliseconds.
 Serial Lookups for HashMap finished in 21 milliseconds.
 Random lookups for HashMap finished in 41 milliseconds.
 Confirming stored value of last lookup: 
 7353ece9-506c-467f-9cb4-7686426fa828
 Serial Lookups for khash finished in 12 milliseconds.
 Random lookups for khash finished in 36 milliseconds.
 Confirming stored value of last lookup: 
 1164a2f1-e6cb-4072-89d9-23cec5cadd95
 ```

 Repeated tests show that ikod.containers' HashMap is 
 consistently faster on insertions, while khash is consistently 
 faster on retrieval.
Aug 24 2020
James Blachly <james.blachly gmail.com> writes:
On 8/24/20 5:11 PM, ikod wrote:
 Thanks, but no )
 This hashmap can't replace standard AA for next reason:
 with standard AA you can safely do:
 
 string[int] aa;
 aa[0] = "null";
 auto v = 0 in aa;
 aa.remove(0);
 assert(*v == "null");
 aa[0] = "one";
 assert(*v == "null");
Hm, I see what you mean, but I am not sure I love the behavior (or would call it a safe habit) of keeping a pointer to something that was subsequently deleted. (Obviously I understand that in the case of the Dlang AA it is not really a pointer to what was deleted; it is a pointer to a still-existing object, and only the reference was deleted from the AA.) If one is using pointers, one should be mindful of deletes, etc. anyway =)
 This is because AA allocate memory in GC area for every value it 
 store(and return pointer to it when "in" used), so even if you delete 
 key from AA it is still safe to use pointer to value. But this require 
 GC allocations.
 
 Correct me if I'm wrong - your and mine HashMaps avoid allocations and 
 store values inline, so you can't use pointer to values in safe code 
 (value can be deleted, or replaced on next table insertion/deletion). In 
 order to be safe my hashmap do not support "in" operator and always 
 return value.
Correct, mine operates similarly (malloc plus GC.addRange), no opBinaryRight!in
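
For anyone unfamiliar with the pattern, malloc plus GC.addRange in 
isolation looks like this (a generic sketch, not dklib's actual 
allocation code):

```d
// The malloc + GC.addRange pattern in isolation: the bucket array is
// allocated outside the GC heap, but registered with the GC so that any
// GC-managed values stored in it (strings, class references) stay
// reachable while the buckets are alive.
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

void main()
{
    enum nBytes = 64 * string.sizeof;

    void* buckets = malloc(nBytes);   // non-GC backing storage
    GC.addRange(buckets, nBytes);     // ...but scanned by the GC for pointers

    // ... store values into the buckets here ...

    GC.removeRange(buckets);          // unregister before releasing
    free(buckets);
}
```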
 Also you may find useful safe map modification during iteration over map 
 items (at the cost of creating temporary table copy).
In search of absolute speed I am willing to forgo this, but it could certainly be useful in concurrency-type situations.
Aug 24 2020
ikod <igor.khasilev gmail.com> writes:
On Tuesday, 25 August 2020 at 01:04:27 UTC, James Blachly wrote:
 On 8/24/20 5:11 PM, ikod wrote:
 Thanks, but no )
 Also you may find useful safe map modification during 
 iteration over map items (at the cost of creating temporary 
 table copy).
In search of absolute speed I am willing to forego this, but certainly it could be useful in concurrency type situation
I made this copy lazy, kind of COW, since we don't have to copy the kv 
storage while we iterate over it and do not change anything. I also 
added something to the README.

Hope this will be useful. Regards!
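
Conceptually the lazy copy works something like this (a toy sketch of 
the idea only, not the actual ikod-containers code):

```d
// Toy sketch of the lazy copy-on-write idea: iteration hands out the
// current storage; a later write duplicates the storage first, so live
// iterators keep walking the old snapshot, and no copy is paid when
// nothing is modified during iteration.
struct CowMap
{
    private int[] kv;       // stand-in for the key/value storage
    private bool lent;      // true once an iterator references kv

    int[] byValue()
    {
        lent = true;        // mark storage as shared with an iterator
        return kv;
    }

    void put(size_t i, int v)
    {
        if (lent)
        {
            kv = kv.dup;    // the lazy copy happens only on write
            lent = false;
        }
        kv[i] = v;
    }
}

unittest
{
    auto m = CowMap([1, 2, 3]);
    auto snapshot = m.byValue();     // no copy yet
    m.put(0, 42);                    // first write after lending -> copy
    assert(snapshot[0] == 1);        // the iterator's view is unchanged
    assert(m.byValue()[0] == 42);    // the map itself sees the new value
}
```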
Aug 25 2020