
digitalmars.D - Complexity in docs

reply Tomás Rossi <Tomás_member pathlink.com> writes:
Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

IMO (as Stroustrup did with stl) it's a MUST to specify complexity of
non-trivial operations and since D has complex (non-constant-time) features
inside the language (e.g. associative arrays), it should stipulate time
complexity of operations (and space complexity as well if relevant).

Regards 

Tom
Nov 18 2005
next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
Is there somewhere in the docs something about the complexity of operations?

Not as far as I know.
For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search probably reduces the performance when you have a good hash function.)

/Oskar
Nov 18 2005
next sibling parent Tomás Rossi <Tomás_member pathlink.com> writes:
In article <dll7fq$1nn2$1 digitaldaemon.com>, Oskar Linde says...
In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
Is there somewhere in the docs something about the complexity of operations?

Not as far as I know.
For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search probably reduces the performance when you have a good hash function.)

Ok, a hash table. Don't know why I didn't think of it before. Certainly I write without thinking first :). But still, my concern could be other people's concern, so it should be documented, as should other non-trivial features/operations (if there are more).

Tom
Nov 18 2005
prev sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message
news:dll7fq$1nn2$1 digitaldaemon.com...
 Associative arrays are implemented as hash-tables, i.e. O(1).
 (IIRC, the hash table nodes are sorted for each bucket, so even if all
 indices hash to the same value, you get O(log n). But doing binary search
 probably reduces the performance when you have a good hash function.)

The binary tree part is only for dealing with collisions. The rest is O(1), and as you say, the better the hash function, the more O(1) it will be.
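The scheme described above (hash to a bucket, ordered entries within each bucket so that even a degenerate hash still gives O(log n) lookup) can be sketched roughly like this. This is a toy Python illustration with sorted lists standing in for the binary trees, not the actual internal/aaA.d code:

```python
import bisect

class ChainedTable:
    """Toy chaining hash table in the spirit described above: each bucket
    keeps its (key, value) pairs sorted, so even if every key lands in
    one bucket, lookup stays O(log n) by binary search."""

    def __init__(self, nbuckets=97):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        b = self._bucket(key)
        i = bisect.bisect_left(b, (key,))   # binary search within the bucket
        if i < len(b) and b[i][0] == key:
            b[i] = (key, value)             # overwrite an existing entry
        else:
            b.insert(i, (key, value))

    def lookup(self, key):
        b = self._bucket(key)
        i = bisect.bisect_left(b, (key,))
        if i < len(b) and b[i][0] == key:
            return b[i][1]
        raise KeyError(key)
```

With a good hash function the bucket search is over one or two entries, which is the "the better the hash function, the more O(1)" behavior; constructing the table with a single bucket forces the worst case and still degrades only to binary search.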
Nov 19 2005
prev sibling next sibling parent reply Chris <ctlajoie yahoo.com> writes:
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Chris
Nov 18 2005
parent reply Tomás Rossi <Tomás_member pathlink.com> writes:
In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't they be documented (or am I getting this all wrong?)

Tom
Nov 18 2005
next sibling parent reply Chris <ctlajoie yahoo.com> writes:
On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:
Ok, but why should I be looking at the implementation when searching for
complexities? Shouldn't they be documented (or am I getting this all wrong?)

I just mentioned it as an FYI. I do agree with you that complexity should be documented. It would give you a general idea of what to stay away from when writing high-performance code.

Chris
Nov 18 2005
parent J C Calvarese <technocrat7 gmail.com> writes:
In article <uhfsn1p10c1jan97n6lv7f8kd8eqp1t9k0 4ax.com>, Chris says...
On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:
Ok, but why should I be looking at the implementation when searching for
complexities? Shouldn't they be documented (or am I getting this all wrong?)

I just mentioned it as an fyi.

Unfortunately, that wasn't the first time someone got yelled at for trying to help. ;)
I do agree with you that complexity
should be documented. It would give you a general idea of what to stay
away from when writing high performance code. 

Chris

It's not in the official docs yet (and it'll probably be quite a while before it is), but it's already in the unofficial docs (due to the generous contributions from this newsgroup):

http://www.prowiki.org/wiki4d/wiki.cgi?DocComments/Arrays#ComplexityofAssociativeArrayOperations

jcc7
Nov 18 2005
prev sibling parent J C Calvarese <technocrat7 gmail.com> writes:
In article <dll9ga$1pmq$1 digitaldaemon.com>, Tomás Rossi says...
In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't they be documented (or am I getting this all wrong?) Tom

If you're reading the docs and it occurs to you that some important information is lacking, you can click on the "Comments" link at the top of the page:

Home | Search | D | Comments <---

In the wiki, you can add a comment about what you think needs to be added (whether it's something you know or something you'd like to know). Walter knows these wiki comment pages are there and he integrates new material from them from time to time.

Or you can ask here; maybe someone has already figured out where the answer is by studying the code of Phobos or the front end.

jcc7
Nov 18 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Tomás Rossi wrote:
 Is there somewhere in the docs something about the complexity of operations?
 
 For example: 
 What is the complexity for associative array operations? (I could guess it's
 implemented as a map (upon a tree) so it has map complexities but thas's a
 guess). 

AA's are implemented as a chaining hash table with binary trees replacing the slists (thus the need for opCmp instead of opEquals for stored objects). So AA operations should be O(1).

Personally, I don't understand the choice of binary trees over slists, as it seems to expect that the hashing algorithm won't provide a good distribution. I'd rather lose the need for an extra pointer or two per node if possible. Though I suppose the tree could be implemented in an array to save memory as well.
 IMO (as Stroustrup did with stl) it's a MUST to specify complexity of
 non-trivial operations and since D has complex (non-constant-time) features
 inside the language (e.g. associative arrays), it should stipulate time
 complexity of operations (and space complexity as well if relevant).

Agreed. Complexity constraints should be present in documentation whenever possible. This is just another thing that has been ignored in favor of more substantial updates.

Sean
Nov 18 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 So AA operations should be O(1). 

A clear maybe.
 Personally, I don't understand the choice of btrees over slists
 as it seems to expect that the hashing algorithm won't provide a
 good distribution.

The hash function Walter uses is the address of the element inserted, which is as fast as can be, and bad in terms of distribution over the table. But although the distribution is bad, the probability is close to zero that any bin contains more than 16 elements. Although the unbalanced binary tree Walter has chosen for the resolution of collisions is bad as hell in theory, it performs very well under the constraints given.

Let a be the number of accesses to the n elements in the AA and let r be the number of rehashes of the AA. Then the runtime in total is O(a + r*n), the runtime for a single access is O(n) because an automatic rehash might occur, and the amortized runtime for a single access is O(1 + r/ae), where ae is the minimal number of accesses to an element in the AA; of course ae >= 1.

Therefore the time bound for an access to an element in the AA is amortized O(1) if and only if the number of rehashes of the AA can be bounded by a constant.

-manfred
Nov 18 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 So AA operations should be O(1). 

A clear maybe.

Oops, I forgot the "amortized."
 Therefore the time bound for an access to an element in the AA is 
 amortized O(1) if and only if the number of rehashes to the AA can be 
 bound by a constant.

The number of rehashes/reallocations/whatever in any amortized constant time algorithm tends to be a (slowly growing) function of N (which isn't a constant), so by that rationale they can't be amortized O(1). What am I missing here?

Sean
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Sean Kelly wrote:
 
 The number of rehashes/reallocations/whatever in any amortized constant 
 time algorithm tend to be a (slowly growing) function of N (which isn't 
 a constant), so by that rationale they can't be amortized O(1).  What am 
 I missing here?

I take it back :p There is a constant bounding those functions. I should have given it some thought before posting.

Sean
Nov 18 2005
next sibling parent reply Munchgreeble <Munchgreeble_member pathlink.com> writes:
In article <dlmhh2$2o35$1 digitaldaemon.com>, Sean Kelly says...
Sean Kelly wrote:
 
 The number of rehashes/reallocations/whatever in any amortized constant 
 time algorithm tend to be a (slowly growing) function of N (which isn't 
 a constant), so by that rationale they can't be amortized O(1).  What am 
 I missing here?

I take it back :p There is a constant bounding those functions. Should have given it some thought before posting. Sean

Hi,

I could be missing something here, but complexity information gives you an idea of how *scalable* an implementation is, not how *fast* it is. It's not uncommon for lower order solutions to a problem to be really slow and higher order solutions to be very quick; it just depends how big N is. I might have a great O(1) time algorithm for some application, but maybe it takes four days to run. If you have an O(N^2) algorithm for the same thing that runs in, let's say, 10ms + 2ms * N^2, N is going to have to get pretty big before you decide to switch over to the O(1) implementation.

That's why you mostly only see the first order term given. The only point in going into more detail is to let you know about corner cases: it might just be that your particular application is going to push the wrong buttons and not end up with the typical performance (e.g. you might get O(N) instead of O(1)). Extra terms are not useful for comparing execution speeds between two algorithms.

If raw runtime execution speed is what you're interested in, you're just going to have to profile the code and see what works fastest for your application/target platform. Take a look at this performance comparison for example; it demonstrates that even for hashmaps, the performance varies quite widely among implementations:

http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html

Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is the slowest even though it has the potential to be fastest.

Hope that's useful

Cheers

Munch
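The four-day example above can be made concrete. The numbers below are the hypothetical ones from the post, not measurements; the sketch just finds the N where the O(N^2) algorithm stops winning:

```python
# Crossover point for the example above: a constant-time algorithm that
# always takes 4 days versus a quadratic one taking 10 ms + 2 ms * N^2.
# (These figures are the post's hypothetical numbers, not benchmarks.)

CONSTANT_MS = 4 * 24 * 60 * 60 * 1000   # 4 days in milliseconds

def quadratic_ms(n):
    return 10 + 2 * n * n

# Find the smallest N where the quadratic algorithm becomes slower.
n = 0
while quadratic_ms(n) <= CONSTANT_MS:
    n += 1

print(n)  # 13146 -- the "slow" O(N^2) code wins for every N below this
```

So even with an absurdly slow constant-time competitor, the quadratic algorithm is the right choice up to about thirteen thousand elements, which is the point being made about first-order terms.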
Nov 19 2005
next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Munchgreeble wrote:
[...]
 If raw runtime execution speed is what you're interested in,
 you're just going to have to profile the code and see what works
 fastest for your application/target platform.

Very true, but only if you know that platform. This is not the case with a language that is intended to provide bare metal access to the underlying hardware, where even the types of the hardware in general are unknown.

Therefore there might be only two choices:
1) restrict the target hardware, or
2) provide a mechanism to adapt the implementation of the critical language features to the hardware.

Number two is the way to go, but this feature is missing in D.

-manfred
Nov 19 2005
prev sibling parent Sean Kelly <sean f4.ca> writes:
Munchgreeble wrote:
 
 Take a look at this performance comparision for example, it demonstrates that
 even for hashmaps, the performance varies quite widely among implementations:
 http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html
 
 Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is
 the slowest even though it has the potential to be fastest.

Actually, the C++ map is implemented as a balanced binary tree, which has O(log N) find/insert performance. The hash-based associative container, unordered_map, is in the C++ Technical Report #1 but won't be an official part of the standard until around 2009 when the next C++ standard is published.

For what it's worth, I've written an implementation of std::tr1::unordered_map which beats the Dinkumware implementation of std::map quite soundly in memory usage, cache miss, and test execution speed tests given some average to large-sized datasets. I can't remember the exact proportion, but I think it was about twice as good for all 3 metrics, which would put it on par with the C# dictionary in that test you linked to. There are a few other C++ TR1 implementations floating around and I expect them all to perform approximately as well as mine, given that I didn't do anything too terribly fancy (std::vector on top of a single std::list).

Sean
Nov 19 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 There is a constant bounding those functions.

I do not agree.

The amortized upper bound for a single access to an element in the AA is O(1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm.

From the implementation no bound can be shown for rm. But even if it can be assumed that rm == 0, from the implementation it can be deduced that ra is lower bounded by Omega(log n).

From the implementation also no lower bound can be shown for ae, except that an element must at least be inserted; therefore ae is lower bounded by Omega(1).

Because a quotient is upper bounded by the upper bound of its dividend divided by the lower bound of its divisor, the quotient ra/ae is upper bounded by O(log n). To reach an amortized upper bound of O(1), the minimal number ae of accesses to an element must be shown to be lower bounded by Omega(log n), which is impossible from the implementation.

-manfred
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 There is a constant bounding those functions.

I do not agree. The amortized upper bound for a single access to an element in the AA is O( 1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm. From the implementation no bound can be shown for rm. But even if it can be assumed that rm==0 from the implementation can be deduced that ra is lower bounded by Omega( log n).

Oops, I was unclear. I meant that there is a constant bound for amortized constant time functions in general. And while there are hashing algorithms that I'm fairly certain could be bounded, I don't know enough about the DMD implementation to say. My question was about the concept in general, not DMD's AAs in particular.

So what controls when a rehashing occurs in this implementation?

Sean
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 And while there are hashing algorithms that I'm fairly certain
 could be bounded,

Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.
 So what controls when a rehashing occurs in this implementation?

An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.

-manfred
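Under that rule the number of automatic rehashes grows only logarithmically with the element count. A rough sketch; the initial table size (97) and the 4x growth per rehash are assumptions for illustration, since the real internal/aaA.d steps through its own size sequence (hence the figure of 14 above rather than the 12 this toy model gives):

```python
# Counting automatic rehashes under the rule quoted above: rehash whenever
# the element count exceeds four times the table size.  Initial size and
# growth factor are assumed, not taken from the D runtime.

def count_rehashes(max_elements, initial_size=97, growth=4):
    size = initial_size
    rehashes = 0
    while max_elements > 4 * size:   # load factor of 4 exceeded -> rehash
        size *= growth
        rehashes += 1
    return rehashes

print(count_rehashes(2**32))  # 12 with these assumed parameters
```

The point either way: for a 2**32 address space the rehash count is a small constant, which is what makes the amortized O(1) argument later in the thread go through.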
Nov 19 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 And while there are hashing algorithms that I'm fairly certain
 could be bounded,

Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.

I was thinking that if the size of the hash table doubles every time a rehash occurs (i.e. if there are log N rehashes for a table of size N), then the frequency of rehashes should approach zero as N approaches infinity. Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert, just as push_back operations on std::vector in C++ are said to occur in amortized constant time.

Sean
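The doubling argument can be checked numerically. A small sketch counting how many element copies N appends cost when capacity doubles on overflow:

```python
# Demonstrating the point above: with capacity doubling, total copy work
# over N appends stays under 2N, so each append is amortized O(1) -- the
# same argument usually made for push_back on a C++ std::vector.

def copies_for_appends(n):
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:        # table full: double it and copy everything
            copies += size
            capacity *= 2
        size += 1
    return copies

n = 1_000_000
print(copies_for_appends(n) / n)   # about 1.05 here; the ratio never reaches 2
```

The copy costs form the geometric series 1 + 2 + 4 + ... < 2N, which is the constant bound being discussed.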
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 Amortizing the cost of these rehashes over all N insertions seems
 like it should yield an amortized constant time insert

Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over the 2**-i (from i=1 to infinity) is bounded by a constant.

Sometimes facts generalized without thought turn out to be plain preconception :-(

-manfred
Nov 19 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 Amortizing the cost of these rehashes over all N insertions seems
 like it should yield an amortized constant time insert

Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over the 2**-i (from i=1 to infinity) is bounded by a constant.

I don't think so, though my math is a bit rusty in this area. My conclusions regarding all this were more from a general sense of the concepts than because I worked it out on paper.
 Sometimes facts generalized without thought turn out to be plain 
 preconception :-(  

Agreed. Sean
Nov 19 2005
prev sibling parent reply "Lionello Lunesu" <lio remove.lunesu.com> writes:
 An automatic rehash occurs whenever the number of elements exceeds
 the table size by the factor four. Currently an adress space of 2**32
 is assumed, which results in maximal 14 automatic rehashes.

 -manfred

OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table. If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info?

In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance" which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?

* space efficient: I don't need to allocate buckets but simply use empty slots in the array
* time efficient: I have direct control over "distance" and can force a rehash if it gets too high.

(Interestingly, it happens that after a rehash, the "distance" turns out higher than before! Since the hash-function makes no guarantees I'd guess it's just a case of bad luck, as opposed to buggy code)

I probably got a wrong idea about D's AA. I'll wait for more info : )

Lionello.
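The single-array scheme described above (open addressing with linear probing, plus a tracked maximum "distance") might look roughly like this. The class and names are invented for illustration and the sketch omits rehashing/resizing entirely, so it is not the poster's actual code:

```python
# Open addressing with linear probing and a tracked maximum displacement,
# in the spirit of the scheme described above.  No resizing: a real table
# would rehash when max_distance grows too large or the table fills up.

class ProbedTable:
    def __init__(self, log2_size=4):
        self.size = 1 << log2_size          # always a power of two
        self.slots = [None] * self.size
        self.max_distance = 0

    def _home(self, key):
        return hash(key) & (self.size - 1)  # expected ("home") slot

    def insert(self, key, value):
        i, d = self._home(key), 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size         # walk to the next free slot
            d += 1
        self.slots[i] = (key, value)
        self.max_distance = max(self.max_distance, d)

    def lookup(self, key):
        i = self._home(key)
        # no probe ever needs to walk further than the worst insert did
        for _ in range(self.max_distance + 1):
            slot = self.slots[i]
            if slot is not None and slot[0] == key:
                return slot[1]
            i = (i + 1) % self.size
        raise KeyError(key)
```

This also shows the downside raised later in the thread: a lookup for an absent key always walks the full max_distance before giving up.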
Nov 21 2005
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hi,

In article <dlske4$2k2d$1 digitaldaemon.com>, Lionello Lunesu says...
 An automatic rehash occurs whenever the number of elements exceeds
 the table size by the factor four. Currently an adress space of 2**32
 is assumed, which results in maximal 14 automatic rehashes.

 -manfred

OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table.

Looking at the source (internal/aaA.d), the only place where an automatic rehash gets done is when the average bucket contains more than 4 elements.
If so, what happens if after a rehash the bucket is still (too) full? Does 
it rehash again? Where can I find more info?

In my hash-table I use a single array of slots (always 2^n) and keep track 
of a "distance" which is the highest distance from an element to its 
expected slot (if an element wants to be on [0] but [0..3] is full, it will 
end up on [3] and the distance is 3). Isn't this safer and more efficient* 
than a fixed bucket size of 16?

What do you mean by a fixed bucket size of 16?
* space efficient: I don't need to allocate buckets but simply use empty 
slots in the array

Yes, this is obviously a more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non-constant) overhead, if I understand you correctly.

*) This is a time/space trade-off. In order to be as fast as a traditional hash table, you may need to make the table so much larger that the memory requirements are even higher than for the traditional implementation. You are also much more reliant on a good hash function.

It would be very interesting to see time/space comparisons of such a hash table to a more traditional one for a number of real life cases.
* time efficient: I have direct control over "distance" and can force a 
rehash if it gets too high.

I'm hardly an expert on hash tables, but in my experience, the only way to tell if it is faster for a particular application is to compare it to the alternative implementation. Some points though:

- You are more sensitive to collisions, since you have not only the collisions within a bucket, but also collisions with nearby buckets.
- You have no guarantee that a rehashing (and resizing) will improve the maximum distance. To get around this, you will need an additional condition that limits rehashing to, for instance, when the hash table size < a*n. But adding such a condition will remove your guarantee of a maximum highest distance.
- If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table that keeps balanced trees for each bucket.
- Deletes are more expensive and need to move more data.

In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.

I think D's hash table implementation is safer for general purpose, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.
(Interestingly, it happens that after a rehash, the "distance" turns out 
higher than before! Since the hash-function makes no guarantees I'd guess 
it's just a case of bad luck, as opposed to buggy code)

Yes, I would assume such cases could happen. Regards, /Oskar
Nov 21 2005
parent "Lionello Lunesu" <lio remove.lunesu.com> writes:
Hi,

I found this: http://en.wikipedia.org/wiki/Hashtable#Chaining describes a 
hash-table with balanced trees, like the one used by D. For everyone's 
information.

 What do you mean by a fixed bucket size of 16?

Sorry, I understood from some other post that the bucket size is fixed. But indeed, it makes no sense to fix it if it's a tree.
* space efficient: I don't need to allocate buckets but simply use empty
slots in the array

Yes, this is obviously a more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non constant) overhead if I understand you correctly.

Exactly. The implementation is extremely simple. That's what I love about it. But yes, it's O(n) if you have many collisions.
 - You are more sensitive to collisions, since you have not only the 
 collisions
 within a bucket, but also collisions with nearby buckets.

This is true. I've tested a couple of different hash functions, and the better hash function nearly always wins (even if it's slower).

For example, for a string-map I've compared CRC32 against Java's string hash (hash=7; foreach(c) hash=hash*31+c;) and the CRC32 won, even though it's much slower. The entropy of Java's string hash was improved by multiplying with some magic number (sqrt(5)-1, read about it somewhere) and that's the one I'm using ATM.
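For reference, the two ingredients being compared might look like this in Python. The exact 32-bit constant (derived from sqrt(5)-1, i.e. the golden ratio, as in Fibonacci hashing) is an assumption about which "magic number" is meant:

```python
# Java-style multiply-by-31 string hash (seeded with 7, as described
# above), plus a golden-ratio multiplicative mix that spreads consecutive
# hash values more evenly over power-of-two table sizes.

MASK32 = 0xFFFFFFFF
MAGIC = int((5 ** 0.5 - 1) / 2 * 2 ** 32)   # 2654435769, golden-ratio constant

def java_string_hash(s):
    h = 7
    for c in s:
        h = (h * 31 + ord(c)) & MASK32      # hash = hash*31 + c, kept to 32 bits
    return h

def fibonacci_mix(h):
    return (h * MAGIC) & MASK32             # scrambles the low bits before masking

print(java_string_hash("ab"))               # 9832
```

The mix step matters precisely because a power-of-two table uses only the low bits of the hash; multiplying by the golden-ratio constant pushes entropy from the high bits down into them.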
 - You have no guarantee that a rehashing (and resizing) will improve the
 maximum distance. To get around this, you will need an additional 
 condition
 that limits rehashing to, for instance, when the hash table size < a*n. 
 But
 adding such a condition will remove your guarantee of a maximum highest
 distance.

This is true. However, in practice the distance never exceeds 8 (for <50% occupation of the hash-table) so the impact of the linear search is kept to a minimum.
 - If m is your maximum distance, lookup is O(m) vs O(log m) for a hash 
 table
 that keeps balanced trees for each bucket.

Yeah, you're right. Indeed, only a specific example can compare two hash implementations.
 - Deletes are more expensive and needs to move more data.

Why exactly are deletes more expensive? The operation is the same as with insertion: jump to the (hash) position, walk some (up to 'distance'), delete the (key,value) pair (if any). The down side is that a look-up for a key not in the table always walks up to the max 'distance'.
 In conclusion, I think (guess) you are more sensitive to the performance 
 of the
 hash function. I would be very interested in seeing a comparison with 
 other
 methods on realistic data.

That sounds like nice D exercise!
 I think D's hash table implementaion is safer for general purpose, because 
 it
 gracefully degrades to O(log n) when the hash function is poor. It also
 minimizes the number of calls to comparison and hash functions, which may,
 depending on the implementations, be a good thing.

Indeed, I can imagine it's faster for general cases, but it's certainly more complex (the code, I mean). Good thing it's built-in then ; ) L.
Nov 22 2005
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...
 Although the
 unbalanced binary tree Walter has choosen for the resolution of
 collisions is bad as hell in theory it performs very well under the
 constraints given.

I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 "Manfred Nowak" <svv1999 hotmail.com> wrote in message
 news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...
 Although the
 unbalanced binary tree Walter has choosen for the resolution of
 collisions is bad as hell in theory it performs very well under the
 constraints given.

I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.

It does. And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.

Out of curiosity, is the binary tree implementation balanced in any way, or is it a plain old binary tree? Given the relatively small expected chain size I'd expect the latter, but I figured I'd ask anyway.

Sean
Nov 19 2005
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:dloh13$1glb$1 digitaldaemon.com...
 Walter Bright wrote:
 I've been using variations of that for 25 years. Every once in a while, I
 try out some new table method that is theoretically supposed to be faster.
 They always turn out to be slower. So back to the binary tree!
 I've also found it scales very well, from a few entries to tens of
 thousands of entries.

It does. And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.

D AA's are general purpose, which means they are meant to do a decent job for a wide range of applications. For very specific or extreme cases, it's likely you can do better with a custom implementation.
 Out of
 curiosity, is the binary tree implementation balanced in any way, or is
 it a plain old btree?  Given the relatively small expected chain size
 I'd expect the latter, but I figured I'd ask anyway.

It gets rebalanced when a rehash happens. Rebalancing it on the fly is one of those theoretical improvements that in practice makes things much worse.
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright wrote:

[...]
 It gets rebalanced when a rehash happens.

I do not see any balancing in the implementation. It's plain inserting.

-manfred
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Walter Bright wrote:
 
 [...]
 It gets rebalanced when a rehash happens.

I do not see any balancing in the implementation. It's plain inserting.

I think that can be considered a rebalancing so long as the insertions don't occur in the same order as they did originally. Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?

Sean
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
  Also, the contents of each tree may change on a rehash, since
 some elements may end up in different buckets, correct?

Correct. But this leads only to a randomly balanced tree, and I doubt that the factor of 3 for the space requirement of the tree against the space requirement of a 16 element array is justified by the time gain. However, I assume that Walter has already tested that.

-manfred
Nov 20 2005
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:Xns9714624647748svv1999hotmailcom 63.105.9.61...
 Sean Kelly wrote:

 [...]
  Also, the contents of each tree may change on a rehash, since
 some elements may end up in different buckets, correct?

Correct. But this leads only to a randomly balanced tree

Yes, that's true.
 and I doubt,
 that the factor of 3 for the space requirement of the tree against
 the space requirement of a 16 element array is justified by the time
 gain.

AA's are usually used when performance matters more than memory consumption. Regular arrays are used where memory consumption is paramount. Thus, it is appropriate to trade memory for speed in AA algorithm design.
 However, I assume, that Walter has already tested that.

It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavily dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available.

The AA algorithm is all contained in internal\aaA.d. None of the rest of the language or library has any dependency on it. This means anyone can plug in another algorithm and give it a try.
Nov 23 2005
parent Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 
 It's easy to contrive test cases designed to make any particular scheme look
 good and any other particular scheme look bad. The current AA scheme has
 worked well for me for many years on all kinds of data. For example, it's
 used in DMDScript, and the performance of DMDScript is heavily dependent on
 the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a
 factor of 2, the fastest Javascript interpreter available.

Word occurrence tests seem to be a pretty reasonable way to measure AA performance. Out of curiosity, I think I'll extend my C++ test to the D AA and see how it compares.

Sean
Nov 23 2005