digitalmars.D - Python dictionary inspired hash table implementation

Moritz Warning (20/20) Jul 02 2008 I like to share a hash table implementation inspired by Pythons Dictiona...

bearophile (34/37) Jul 02 2008 I think it may need more than just tweakings: the advantage of not using...

Moritz Warning (19/55) Jul 02 2008 True, but these changes can be made easily by changing one of the

bearophile (46/48) Jul 02 2008 Experience tells me that creating benchmarks is a very tricky thing, it'...

Jarrett Billingsley (3/5) Jul 02 2008 Uh, that's in D too.

Jarrett Billingsley (3/8) Jul 02 2008 Nevermind, I cannot read.

Manfred_Nowak (16/17) Jul 02 2008 Yes. And as one may recognize: Moritz designed the benchmarks in such a

Moritz Warning (24/42) Jul 03 2008 There was no intention in designing the "benchmark" to get high speed fo...

Manfred_Nowak (4/6) Jul 04 2008 Meanwhile you might want to know of an already existing generic one:

Moritz Warning (14/23) Jul 04 2008 Good idea - I ported the test to Tango.

Manfred_Nowak (10/13) Jul 04 2008 No. You should reduce the number of operations to 7_500_000.

Moritz Warning (9/30) Jul 04 2008 With these changes, I get these results:

Manfred_Nowak (8/9) Jul 09 2008 This is a hint, that PyDict may be well suited only, when the actual

bearophile (5/10) Jul 09 2008 Python dicts contains only python objects, they are always references, s...

Moritz Warning (32/44) Jul 03 2008 Outputs 0.68 on average.

Moritz Warning <moritzwarning web.de> writes:

I like to share a hash table implementation inspired by Pythons Dictionary.

It's a so called closed hashing table, that is, keys and elements are stored in
the table.
Pointers are avoided when possible.
The table size is 2**n and the used load factor is 0.75.

It works for most key types (also static arrays).
All kind of values types should be supported.

For comparison, I did a simple/stupid benchmark,
adding 10.000.000 uint pairs and looking them up
in consecutive order (see attached file).

The build-in AA uint[uint]:
inserts:  1901966/s (5.26s) lookups: 8065673/s (1.24s)

tango.util.container.HashMap!(uint, uint, ..):
inserts:  6346664/s (1.58s) lookups: 23780047/s (0.42s)

The above implementation, Dict!(uint, uint):
inserts:  8199181/s (1.22s) lookups: 34506306/s (0.29s)

This implementation is far from perfect and has room for improvements.
E.g., it is twice as slow for lookups of ulong keys compared to Tango HashMap
(3 times slower for insertion). It's also slower for arrays.
But that's likely to change with some more tweaking.

The attached code is placed into Public Domain.

Jul 02 2008

bearophile <bearophileHUGS lycos.com> writes:

Moritz Warning Wrote:
 E.g., it is twice as slow for lookups of ulong keys compared to Tango HashMap
 (3 times slower for insertion). It's also slower for arrays.
 But that's likely to change with some more tweaking.

I think it may need more than just tweakings: the advantage of not using
pointers may get lost when the data is too much long. So you may switch to
using pointers when the data requires more than 4 or 8 bytes.


How much work does it need to be ported to Phobos?


Instead of this:
import tango.stdc.string; //for memset
You probably want to write something like:
import tango.stdc.string: memset;


I suggest you to compact the code vertically some, to improve readability.


This may enjoy a better name:
template Helper(T)
In my libs I have named it DeconstArrayType.


If you have a Python, like a 2.5, you may try it with similar code, to compare
the speeds. The simpler code you may try are (in pairs):

def main():
    d = {}
    for i in xrange(500000):
        d["key_" + str(i)] = 0
import psyco; psyco.bind(main)
main()


import std.string: format;
void main() {
    int[string] d;
    for (int i = 0; i < 500_000; ++i)
        d["key_" ~ format(i)] = 0;
}



def main():
    d = dict.fromkeys(xrange(10000000), 0)
main()



void main() {
    int[int] d;
    for (int i; i < 10_000_000; ++i)
        d[i] = 0;
}

Bye,
bearophile

Jul 02 2008

Moritz Warning <moritzwarning web.de> writes:

On Wed, 02 Jul 2008 17:11:43 -0400, bearophile wrote:

 Moritz Warning Wrote:
 E.g., it is twice as slow for lookups of ulong keys compared to Tango
 HashMap (3 times slower for insertion). It's also slower for arrays.
 But that's likely to change with some more tweaking.

 
 I think it may need more than just tweakings: the advantage of not using
 pointers may get lost when the data is too much long. So you may switch
 to using pointers when the data requires more than 4 or 8 bytes.

True, but these changes can be made easily by changing one of the 
wrappers for the targeted key type or by adding a new wrapper.
I haven't tested much which approach is most efficient for each data type 
so far.

 
 How much work does it need to be ported to Phobos?

I guess it's a matter of ~10min for those used to Phobos.
The only thing that would need to be changed
in the implementation itself is the traits stuff.


 
 This may enjoy a better name:
 template Helper(T)
 In my libs I have named it DeconstArrayType.

The name was just a dummy introduced a few hours ago until a better name 
arises (until now :P).

 
 If you have a Python, like a 2.5, you may try it with similar code, to
 compare the speeds. The simpler code you may try are (in pairs):

I have Python 2.5.2 (for Debian).

 def main():
     d = {}
     for i in xrange(500000):
         d["key_" + str(i)] = 0
 import psyco; psyco.bind(main)
 main()
 
 
 import std.string: format;
 void main() {
     int[string] d;
     for (int i = 0; i < 500_000; ++i)
         d["key_" ~ format(i)] = 0;
 }

I have my doubts that these methods are useful for comparison
since both string formating function may overrule the computation done by 
the dictionary.
Can you provide a Python example that pre-computes an array with strings 
in a separate function?

 
 
 def main():
     d = dict.fromkeys(xrange(10000000), 0)
 main()

cProfile tells me that Python spend 2.497s (best from 5 tries) with this 
build-in function.
That's about 6680027 insertions per second.

Jul 02 2008

bearophile <bearophileHUGS lycos.com> writes:

Moritz Warning:
 Can you provide a Python example that pre-computes an array with strings 
 in a separate function?

Experience tells me that creating benchmarks is a very tricky thing, it's
really easy to do something wrong. Here I have created a new pair for you, but
I am ready to fix them if someone spots a problem. If you need help just ask me.
You can create two more versions for your data structure and for the hash map
of Tango, if you want.

Your timings seem quite good. Python 2.6b may be a bit faster, but 2.5.2 is
plenty enough.

------------------

from timeit import default_timer as clock

def main():
    strings = ["key_" + str(i) for i in xrange(1000000)]

    t = clock()
    d = dict.fromkeys(strings, 0)
    print round(clock() - t, 2)
    if len(d) < 200: print d

import psyco; psyco.bind(main)
main()

------------------

import std.stdio: putr = writefln;
import std.string: format;

version (Win32) {
    import std.c.windows.windows: GetTickCount;
    double clock() {
        return cast(double)GetTickCount() * 1.0e-03;
    }
}
version (linux) {
    import std.c.linux.linux: time;
    double clock() {
        return cast(double)time(null);
    }
}

void main() {
    const uint N = 1_000_000;

    // this doesn't save much time
    //auto strings = (cast(string*)malloc(string.sizeof * N))[0 .. N];

    auto strings = new string[N];
    for (uint i; i < N; ++i)
        strings[i] = "key_" ~ format(i);

    auto t = clock();
    int[string] d;
    foreach(s; strings)
        d[s] = 0;
    putr(clock() - t);

    if (N < 200) putr(d);
}


One of the things I miss in Python is the ability to write numbers as 1_000_000.

Bye,
bearophile

Jul 02 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g4hbj1$1jjj$1 digitalmars.com...

 One of the things I miss in Python is the ability to write numbers as 
 1_000_000.

Uh, that's in D too.

Jul 02 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Jarrett Billingsley" <kb3ctd2 yahoo.com> wrote in message 
news:g4hdkd$1pbv$1 digitalmars.com...
 "bearophile" <bearophileHUGS lycos.com> wrote in message 
 news:g4hbj1$1jjj$1 digitalmars.com...

 One of the things I miss in Python is the ability to write numbers as 
 1_000_000.

 Uh, that's in D too.

Nevermind, I cannot read.

Jul 02 2008

"Manfred_Nowak" <svv1999 hotmail.com> writes:

bearophile wrote:

 creating benchmarks is a very tricky thing

Yes. And as one may recognize: Moritz designed the benchmarks in such a 
way,
- that collisions at least aren't obvious
- that the locality of two consecutive inserts/acesses seems to be 
close to maximum

In addition: the space effciency of as low as 25% which Moritz' attempt 
allows may be inacceptable.   


Whithout some agreed scenario, i.e. specifications

- about space effciency
- about static properties of the key spaces
- about dynamic properties of the actually used set of keys and its 
insert/access sequences

every user of a dictionary will be able to claim, that his preferred 
implementation is better than any other.

-manfred

Jul 02 2008

Moritz Warning <moritzwarning web.de> writes:

On Thu, 03 Jul 2008 06:49:03 +0000, Manfred_Nowak wrote:

 bearophile wrote:
 
 creating benchmarks is a very tricky thing

 
 Yes. And as one may recognize: Moritz designed the benchmarks in such a
 way,
 - that collisions at least aren't obvious - that the locality of two
 consecutive inserts/acesses seems to be close to maximum

There was no intention in designing the "benchmark" to get high speed for 
this particular implementation.
It's just the simplest approach to test for errors most people would come 
up to.
It's main aim is to detect bigger errors (and it already did).

 In addition: the space effciency of as low as 25% which Moritz' attempt
 allows may be inacceptable.

The build-in AA runs out of space on my machine (1GB)
when other implementations keep on working (GC was enabled, 20m uint 
pairs).
Apart from that, the Dictionary implementation is open for improvement.

 
 Whithout some agreed scenario, i.e. specifications
 
 - about space effciency
 - about static properties of the key spaces - about dynamic properties
 of the actually used set of keys and its insert/access sequences

That's a todo.
Don't expect me to begin testing with an iso certificated benchmark 
scenario from start up. ;-)


 every user of a dictionary will be able to claim, that his preferred
 implementation is better than any other.

As already was said and most people know, designing hash tables is a 
tricky thing.
You can mostly/always design them to be the fastest in the world for a 
particular testing method.

The main goal is to perform well on most everyday use cases.
Since the implementation is based on Pythons Dictionary
(which is aimed for a general use),
There is justified hope that it doesn't perform much worse
for other use cases as well.
But that's up to testing and and further improvements!


- mwarning

Jul 03 2008

"Manfred_Nowak" <svv1999 hotmail.com> writes:

Moritz Warning wrote:

 Don't expect me to begin testing with an iso certificated benchmark 
 scenario from start up. ;-)

Meanwhile you might want to know of an already existing  generic one:
http://www.digitalmars.com/d/archives/digitalmars/D/30773.html#N30849

-manfred

Jul 04 2008

Moritz Warning <moritzwarning web.de> writes:

On Fri, 04 Jul 2008 07:15:02 +0000, Manfred_Nowak wrote:

 Moritz Warning wrote:
 
 Don't expect me to begin testing with an iso certificated benchmark
 scenario from start up. ;-)

 
 Meanwhile you might want to know of an already existing  generic one:
 http://www.digitalmars.com/d/archives/digitalmars/D/30773.html#N30849
 
 -manfred

Good idea - I ported the test to Tango.
Here are the results for uint=>uint maps on average (ten runs):

(GC disabled)
PyDict: 6.73s
Tango HashMap: 9.27s
build-in AA: 12.20s

(GC enabled)
PyDict: 6.85s
Tango HashMap: 9.43s
build-in AA: 70.xys (I've got lazy)

The tests for the build-in AA had to be started ten times since the test 
loop causes out of memory problems. The effect on the test result
should be negligible.

Jul 04 2008

"Manfred_Nowak" <svv1999 hotmail.com> writes:

Moritz Warning wrote:

 The tests for the build-in AA had to be started ten times since
 the test loop causes out of memory problems. The effect on the
 test result should be negligible.

No. You should reduce the number of operations to 7_500_000.

Sorrily the PyDict seems to be bad suited for adresses out of the 2G-
range. My tests show, that the builtin AA only needs 45% of the time of 
PyDict.

To verify this on your machine set

  ttotal= 512*1024*1024; // 512 MB

and replace "rand()" by "rand() << 2",

which effectively gives 4 byte alligned adresses out of the 2G range.

-manfred

Jul 04 2008

Moritz Warning <moritzwarning web.de> writes:

On Fri, 04 Jul 2008 19:45:47 +0000, Manfred_Nowak wrote:

 Moritz Warning wrote:
 
 The tests for the build-in AA had to be started ten times since the
 test loop causes out of memory problems. The effect on the test result
 should be negligible.

 
 No. You should reduce the number of operations to 7_500_000.
 
 Sorrily the PyDict seems to be bad suited for adresses out of the 2G-
 range. My tests show, that the builtin AA only needs 45% of the time of
 PyDict.
 
 To verify this on your machine set
 
   ttotal= 512*1024*1024; // 512 MB
 
 and replace "rand()" by "rand() << 2",
 
 which effectively gives 4 byte alligned adresses out of the 2G range.
 
 -manfred

With these changes, I get these results:

PyDict:7.73s
AA: 9.76s
HashMap: 7.36s

PyDict decreases in performance for this case.
But I can't see any significant speedup for AA.



Btw.: I had the 4x table grow for small tables changed to 2x in PyDict 
for all previous tests, since there wasn't any speed decrease to measure.

Jul 04 2008

"Manfred_Nowak" <svv1999 hotmail.com> writes:

Moritz Warning wrote:

 PyDict decreases in performance for this case.

This is a hint, that PyDict may be well suited only, when the actual 
distribution of keys over the keyspace is dense. 

-manfred
-- 
Maybe some knowledge of some types of disagreeing and their relation 
can turn out to be useful:
http://blog.createdebate.com/2008/04/07/writing-strong-arguments/

Jul 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Manfred_Nowak:
 Moritz Warning wrote:
 PyDict decreases in performance for this case.

 
 This is a hint, that PyDict may be well suited only, when the actual 
 distribution of keys over the keyspace is dense.

Python dicts contains only python objects, they are always references, so they
contain only pointers, this means Python dicts are probably optimized only for
keys/values 4 (and probably 8 too) bytes long each.
So the AA implementation may have to contain two versions, one for large keys
and one for smaller (4-8 bytes) ones. Builtin AAs are used very often (if you
are used to program in scripting languages), so this may be worth.

Bye,
bearophile

Jul 09 2008

Moritz Warning <moritzwarning web.de> writes:

On Wed, 02 Jul 2008 21:55:45 -0400, bearophile wrote:

[..]
 from timeit import default_timer as clock
 
 def main():
     strings = ["key_" + str(i) for i in xrange(1000000)]
 
     t = clock()
     d = dict.fromkeys(strings, 0)
     print round(clock() - t, 2)
     if len(d) < 200: print d
 
 import psyco; psyco.bind(main)
 main()

Outputs 0.68 on average.


I ported your Phobos code to Tango, these are the results:

Dict!(char[], uint)
10 x 1000000 iterations
inserts:  1292566/s (0.77s)
lookups: 3492603/s (0.29s)

HashMap!(char[], uint, ..)
10 x 1000000 iterations
inserts:  1673673/s (0.60s)
lookups: 6483940/s (0.15s)

uint[char[]]
10 x 1000000 iterations
inserts:  2088894/s (0.48s)
lookups: 3886346/s (0.26s)


There might be some bottleneck in Dict.
Array hashing for Dict only accounts for 0.03s.

btw.:
There is a bug in the array hash method.
It doesn't iterate up to the last array element.
Here is the fixed version used for the test.

//hash function
ubyte[] a = cast(ubyte[]) data;
int len = a.length;
ubyte* p = cast(ubyte *) a.ptr;
hash = *p << 7;
while (--len >= 0)
{
	hash = (1000003 * hash) ^ *p++;
}
hash ^= a.length;

Jul 03 2008

D Programming

C/C++ Programming

Other

digitalmars.D - Python dictionary inspired hash table implementation