
digitalmars.D - Complexity in docs

reply Tomás Rossi <Tomás_member pathlink.com> writes:
Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

IMO (as Stroustrup did with stl) it's a MUST to specify complexity of
non-trivial operations and since D has complex (non-constant-time) features
inside the language (e.g. associative arrays), it should stipulate time
complexity of operations (and space complexity as well if relevant).

Regards 

Tom
Nov 18 2005
next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
Is there somewhere in the docs something about the complexity of operations?

Not as far as I know.
For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search probably reduces the performance when you have a good hash function.)

/Oskar
Nov 18 2005
next sibling parent Tomás Rossi <Tomás_member pathlink.com> writes:
In article <dll7fq$1nn2$1 digitaldaemon.com>, Oskar Linde says...
In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
Is there somewhere in the docs something about the complexity of operations?

Not as far as I know.
For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search probably reduces the performance when you have a good hash function.)

Ok, a hash table. Don't know why I didn't think of it before. Certainly I write without thinking first :). But still, my concern could be other people's concern, so it should be documented, as should other non-trivial features/operations (if there are more).

Tom
Nov 18 2005
prev sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message
news:dll7fq$1nn2$1 digitaldaemon.com...
 Associative arrays are implemented as hash-tables, i.e. O(1).
 (IIRC, the hash table nodes are sorted for each bucket, so even if all
 indices hash to the same value, you get O(log n). But doing binary search
 probably reduces the performance when you have a good hash function.)

The binary tree part is only for dealing with collisions. The rest is O(1), and as you say, the better the hash function, the more O(1) it will be.
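The scheme described above (hash to a bucket, ordered entries within each bucket so that even a degenerate hash still gives O(log n) lookup) can be sketched roughly like this. This is a toy Python illustration with sorted lists standing in for the binary trees, not the actual internal/aaA.d code:

```python
import bisect

class ChainedTable:
    """Toy chaining hash table in the spirit described above: each bucket
    keeps its (key, value) pairs sorted, so even if every key lands in
    one bucket, lookup stays O(log n) by binary search."""

    def __init__(self, nbuckets=97):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        b = self._bucket(key)
        i = bisect.bisect_left(b, (key,))   # binary search within the bucket
        if i < len(b) and b[i][0] == key:
            b[i] = (key, value)             # overwrite an existing entry
        else:
            b.insert(i, (key, value))

    def lookup(self, key):
        b = self._bucket(key)
        i = bisect.bisect_left(b, (key,))
        if i < len(b) and b[i][0] == key:
            return b[i][1]
        raise KeyError(key)
```

With a good hash function the bucket search is over one or two entries, which is the "the better the hash function, the more O(1)" behavior; constructing the table with a single bucket forces the worst case and still degrades only to binary search.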
Nov 19 2005
prev sibling next sibling parent reply Chris <ctlajoie yahoo.com> writes:
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Chris
Nov 18 2005
parent reply Tomás Rossi <Tomás_member pathlink.com> writes:
In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't they be documented (or am I getting this all wrong?)

Tom
Nov 18 2005
next sibling parent reply Chris <ctlajoie yahoo.com> writes:
On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:
Ok, but why should I be looking at the implementation when searching for
complexities? Shouldn't they be documented (or am I getting this all wrong?)

I just mentioned it as an FYI. I do agree with you that complexity should be documented. It would give you a general idea of what to stay away from when writing high-performance code.

Chris
Nov 18 2005
parent J C Calvarese <technocrat7 gmail.com> writes:
In article <uhfsn1p10c1jan97n6lv7f8kd8eqp1t9k0 4ax.com>, Chris says...
On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:
Ok, but why should I be looking at the implementation when searching for
complexities? Shouldn't they be documented (or am I getting this all wrong?)

I just mentioned it as an fyi.

Unfortunately, that wasn't the first time someone got yelled at for trying to help. ;)
I do agree with you that complexity
should be documented. It would give you a general idea of what to stay
away from when writing high performance code. 

Chris

It's not in the official docs yet (and it'll probably be quite a while before it is), but it's already in the unofficial docs (due to the generous contributions from this newsgroup):

http://www.prowiki.org/wiki4d/wiki.cgi?DocComments/Arrays#ComplexityofAssociativeArrayOperations

jcc7
Nov 18 2005
prev sibling parent J C Calvarese <technocrat7 gmail.com> writes:
In article <dll9ga$1pmq$1 digitaldaemon.com>, Tomás Rossi says...
In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi
<Tomás_member pathlink.com> wrote:

Is there somewhere in the docs something about the complexity of operations?

For example: 
What is the complexity for associative array operations? (I could guess it's
implemented as a map (upon a tree) so it has map complexities, but that's a
guess). 

Like Oskar said, it's a hashtable.

Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't they be documented (or am I getting this all wrong?) Tom

If you're reading the docs and it occurs to you that some important information is lacking, you can click on the "Comments" link at the top of the page:

Home | Search | D | Comments <---

In the wiki, you can add a comment about what you think needs to be added (whether it's something you know or something you'd like to know). Walter knows these wiki comment pages are there and he integrates new material from them from time to time.

Or you can ask here; maybe someone has already figured out where the answer is by studying the code of Phobos or the front end.

jcc7
Nov 18 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Tomás Rossi wrote:
 Is there somewhere in the docs something about the complexity of operations?
 
 For example: 
 What is the complexity for associative array operations? (I could guess it's
 implemented as a map (upon a tree) so it has map complexities but thas's a
 guess). 

AA's are implemented as a chaining hash table with binary trees replacing the slists (thus the need for opCmp instead of opEquals for stored objects). So AA operations should be O(1).

Personally, I don't understand the choice of binary trees over slists, as it seems to expect that the hashing algorithm won't provide a good distribution. I'd rather lose the need for an extra pointer or two per node if possible. Though I suppose the tree could be implemented in an array to save memory as well.
 IMO (as Stroustrup did with stl) it's a MUST to specify complexity of
 non-trivial operations and since D has complex (non-constant-time) features
 inside the language (e.g. associative arrays), it should stipulate time
 complexity of operations (and space complexity as well if relevant).

Agreed. Complexity constraints should be present in documentation whenever possible. This is just another thing that has been ignored in favor of more substantial updates.

Sean
Nov 18 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 So AA operations should be O(1). 

A clear maybe.
 Personally, I don't understand the choice of btrees over slists
 as it seems to expect that the hashing algorithm won't provide a
 good distribution.

The hash function Walter uses is the address of the element inserted, which is as fast as can be, and bad in terms of distribution over the table. But although the distribution is bad, the probability is close to zero that any bin contains more than 16 elements. Although the unbalanced binary tree Walter has chosen for the resolution of collisions is bad as hell in theory, it performs very well under the constraints given.

Let a be the number of accesses to the n elements in the AA and let r be the number of rehashes of the AA. Then the runtime in total is O(a + r*n), the runtime for a single access is O(n) because an automatic rehash might occur, and the amortized runtime for a single access is O(1 + r/ae), where ae is the minimal number of accesses to an element in the AA; of course ae >= 1.

Therefore the time bound for an access to an element in the AA is amortized O(1) if and only if the number of rehashes of the AA can be bounded by a constant.

-manfred
Nov 18 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 So AA operations should be O(1). 

A clear maybe.

Oops, I forgot the "amortized."
 Therefore the time bound for an access to an element in the AA is 
 amortized O(1) if and only if the number of rehashes to the AA can be 
 bound by a constant.

The number of rehashes/reallocations/whatever in any amortized constant time algorithm tends to be a (slowly growing) function of N (which isn't a constant), so by that rationale they can't be amortized O(1). What am I missing here?

Sean
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Sean Kelly wrote:
 
 The number of rehashes/reallocations/whatever in any amortized constant 
 time algorithm tend to be a (slowly growing) function of N (which isn't 
 a constant), so by that rationale they can't be amortized O(1).  What am 
 I missing here?

I take it back :p There is a constant bounding those functions. I should have given it some thought before posting.

Sean
Nov 18 2005
next sibling parent reply Munchgreeble <Munchgreeble_member pathlink.com> writes:
In article <dlmhh2$2o35$1 digitaldaemon.com>, Sean Kelly says...
Sean Kelly wrote:
 
 The number of rehashes/reallocations/whatever in any amortized constant 
 time algorithm tend to be a (slowly growing) function of N (which isn't 
 a constant), so by that rationale they can't be amortized O(1).  What am 
 I missing here?

I take it back :p There is a constant bounding those functions. Should have given it some thought before posting. Sean

Hi,

I could be missing something here, but complexity information gives you an idea of how *scalable* an implementation is, not how *fast* it is. It's not uncommon for lower order solutions to a problem to be really slow and higher order solutions to be very quick; it just depends how big N is. I might have a great O(1) time algorithm for some application, but maybe it takes four days to run. If you have an O(N^2) algorithm for the same thing that runs in, let's say, 10ms + 2ms * N^2, N is going to have to get pretty big before you decide to switch over to the O(1) implementation.

That's why you mostly only see the first order term given. The only point in going into more detail is to let you know about corner cases: it might just be that your particular application is going to push the wrong buttons and not end up with the typical performance (e.g. you might get O(N) instead of O(1)). Extra terms are not useful for comparing execution speeds between two algorithms.

If raw runtime execution speed is what you're interested in, you're just going to have to profile the code and see what works fastest for your application/target platform. Take a look at this performance comparison for example; it demonstrates that even for hashmaps, the performance varies quite widely among implementations:

http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html

Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is the slowest even though it has the potential to be fastest.

Hope that's useful

Cheers

Munch
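The four-day example above can be made concrete. The numbers below are the hypothetical ones from the post, not measurements; the sketch just finds the N where the O(N^2) algorithm stops winning:

```python
# Crossover point for the example above: a constant-time algorithm that
# always takes 4 days versus a quadratic one taking 10 ms + 2 ms * N^2.
# (These figures are the post's hypothetical numbers, not benchmarks.)

CONSTANT_MS = 4 * 24 * 60 * 60 * 1000   # 4 days in milliseconds

def quadratic_ms(n):
    return 10 + 2 * n * n

# Find the smallest N where the quadratic algorithm becomes slower.
n = 0
while quadratic_ms(n) <= CONSTANT_MS:
    n += 1

print(n)  # 13146 -- the "slow" O(N^2) code wins for every N below this
```

So even with an absurdly slow constant-time competitor, the quadratic algorithm is the right choice up to about thirteen thousand elements, which is the point being made about first-order terms.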
Nov 19 2005
next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Munchgreeble wrote:
[...]
 If raw runtime execution speed is what you're interested in,
 you're just going to have to profile the code and see what works
 fastest for your application/target platform.

Very true, but only if you know that platform. This is not the case with a language that is intended to provide bare metal access to the underlying hardware, where even the types of the hardware in general are unknown.

Therefore there might be only two choices:
1) restrict the target hardware, or
2) provide a mechanism to adapt the implementation of the critical language features to the hardware.

Number two is the way to go, but this feature is missing in D.

-manfred
Nov 19 2005
prev sibling parent Sean Kelly <sean f4.ca> writes:
Munchgreeble wrote:
 
 Take a look at this performance comparision for example, it demonstrates that
 even for hashmaps, the performance varies quite widely among implementations:
 http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html
 
 Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is
 the slowest even though it has the potential to be fastest.

Actually, the C++ map is implemented as a balanced binary tree, which has O(log N) find/insert performance. The hash-based associative container, unordered_map, is in the C++ Technical Report #1 but won't be an official part of the standard until around 2009 when the next C++ standard is published.

For what it's worth, I've written an implementation of std::tr1::unordered_map which beats the Dinkumware implementation of std::map quite soundly in memory usage, cache miss, and test execution speed tests given some average to large-sized datasets. I can't remember the exact proportion, but I think it was about twice as good for all 3 metrics, which would put it on par with the C# dictionary in that test you linked to. There are a few other C++ TR1 implementations floating around and I expect them all to perform approximately as well as mine, given that I didn't do anything too terribly fancy (std::vector on top of a single std::list).

Sean
Nov 19 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 There is a constant bounding those functions.

I do not agree.

The amortized upper bound for a single access to an element in the AA is O(1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm.

From the implementation no bound can be shown for rm. But even if it can be assumed that rm == 0, from the implementation it can be deduced that ra is lower bounded by Omega(log n).

From the implementation also no lower bound can be shown for ae, except that an element must at least be inserted; therefore ae is lower bounded by Omega(1).

Because a quotient is upper bounded by the upper bound of its dividend divided by the lower bound of its divisor, the quotient ra/ae is upper bounded by O(log n). To reach an amortized upper bound of O(1), the minimal number ae of accesses to an element must be shown to be lower bounded by Omega(log n), which is impossible from the implementation.

-manfred
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 There is a constant bounding those functions.

I do not agree. The amortized upper bound for a single access to an element in the AA is O( 1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm. From the implementation no bound can be shown for rm. But even if it can be assumed that rm==0 from the implementation can be deduced that ra is lower bounded by Omega( log n).

Oops, I was unclear. I meant that there is a constant bound for amortized constant time functions in general. And while there are hashing algorithms that I'm fairly certain could be bounded, I don't know enough about the DMD implementation to say. My question was about the concept in general, not DMD's AAs in particular.

So what controls when a rehashing occurs in this implementation?

Sean
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 And while there are hashing algorithms that I'm fairly certain
 could be bounded,

Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.
 So what controls when a rehashing occurs in this implementation?

An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.

-manfred
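Under that rule the number of automatic rehashes grows only logarithmically with the element count. A rough sketch; the initial table size (97) and the 4x growth per rehash are assumptions for illustration, since the real internal/aaA.d steps through its own size sequence (hence the figure of 14 above rather than the 12 this toy model gives):

```python
# Counting automatic rehashes under the rule quoted above: rehash whenever
# the element count exceeds four times the table size.  Initial size and
# growth factor are assumed, not taken from the D runtime.

def count_rehashes(max_elements, initial_size=97, growth=4):
    size = initial_size
    rehashes = 0
    while max_elements > 4 * size:   # load factor of 4 exceeded -> rehash
        size *= growth
        rehashes += 1
    return rehashes

print(count_rehashes(2**32))  # 12 with these assumed parameters
```

The point either way: for a 2**32 address space the rehash count is a small constant, which is what makes the amortized O(1) argument later in the thread go through.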
Nov 19 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 And while there are hashing algorithms that I'm fairly certain
 could be bounded,

Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.

I was thinking that if the size of the hash table doubles every time a rehash occurs (i.e. if there are log N rehashes for a table of size N), then the frequency of rehashes should approach zero as N approaches infinity. Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert, just as push_back operations on std::vector in C++ are said to occur in amortized constant time.

Sean
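The doubling argument can be checked numerically. A small sketch counting how many element copies N appends cost when capacity doubles on overflow:

```python
# Demonstrating the point above: with capacity doubling, total copy work
# over N appends stays under 2N, so each append is amortized O(1) -- the
# same argument usually made for push_back on a C++ std::vector.

def copies_for_appends(n):
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:        # table full: double it and copy everything
            copies += size
            capacity *= 2
        size += 1
    return copies

n = 1_000_000
print(copies_for_appends(n) / n)   # about 1.05 here; the ratio never reaches 2
```

The copy costs form the geometric series 1 + 2 + 4 + ... < 2N, which is the constant bound being discussed.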
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 Amortizing the cost of these rehashes over all N insertions seems
 like it should yield an amortized constant time insert

Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over the 2**-i (from i=1 to infinity) is bounded by a constant.

Sometimes facts generalized without thought turn out to be plain preconception :-(

-manfred
Nov 19 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 Amortizing the cost of these rehashes over all N insertions seems
 like it should yield an amortized constant time insert

Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over the 2**-i (from i=1 to infinity) is bounded by a constant.

I don't think so, though my math is a bit rusty in this area. My conclusions regarding all this were more from a general sense of the concepts than because I worked it out on paper.
 Sometimes facts generalized without thought turn out to be plain 
 preconception :-(  

Agreed. Sean
Nov 19 2005
prev sibling parent reply "Lionello Lunesu" <lio remove.lunesu.com> writes:
 An automatic rehash occurs whenever the number of elements exceeds
 the table size by the factor four. Currently an adress space of 2**32
 is assumed, which results in maximal 14 automatic rehashes.

 -manfred

OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table. If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info?

In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance" which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?

* space efficient: I don't need to allocate buckets but simply use empty slots in the array
* time efficient: I have direct control over "distance" and can force a rehash if it gets too high.

(Interestingly, it happens that after a rehash, the "distance" turns out higher than before! Since the hash-function makes no guarantees I'd guess it's just a case of bad luck, as opposed to buggy code)

I probably got a wrong idea about D's AA. I'll wait for more info : )

Lionello.
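The single-array scheme described above (open addressing with linear probing, plus a tracked maximum "distance") might look roughly like this. The class and names are invented for illustration and the sketch omits rehashing/resizing entirely, so it is not the poster's actual code:

```python
# Open addressing with linear probing and a tracked maximum displacement,
# in the spirit of the scheme described above.  No resizing: a real table
# would rehash when max_distance grows too large or the table fills up.

class ProbedTable:
    def __init__(self, log2_size=4):
        self.size = 1 << log2_size          # always a power of two
        self.slots = [None] * self.size
        self.max_distance = 0

    def _home(self, key):
        return hash(key) & (self.size - 1)  # expected ("home") slot

    def insert(self, key, value):
        i, d = self._home(key), 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size         # walk to the next free slot
            d += 1
        self.slots[i] = (key, value)
        self.max_distance = max(self.max_distance, d)

    def lookup(self, key):
        i = self._home(key)
        # no probe ever needs to walk further than the worst insert did
        for _ in range(self.max_distance + 1):
            slot = self.slots[i]
            if slot is not None and slot[0] == key:
                return slot[1]
            i = (i + 1) % self.size
        raise KeyError(key)
```

This also shows the downside raised later in the thread: a lookup for an absent key always walks the full max_distance before giving up.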
Nov 21 2005
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hi,

In article <dlske4$2k2d$1 digitaldaemon.com>, Lionello Lunesu says...
 An automatic rehash occurs whenever the number of elements exceeds
 the table size by the factor four. Currently an adress space of 2**32
 is assumed, which results in maximal 14 automatic rehashes.

 -manfred

OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table.

Looking at the source (internal/aaA.d), the only place where an automatic rehash gets done is when the average bucket contains more than 4 elements.
If so, what happens if after a rehash the bucket is still (too) full? Does 
it rehash again? Where can I find more info?

In my hash-table I use a single array of slots (always 2^n) and keep track 
of a "distance" which is the highest distance from an element to its 
expected slot (if an element wants to be on [0] but [0..3] is full, it will 
end up on [3] and the distance is 3). Isn't this safer and more efficient* 
than a fixed bucket size of 16?

What do you mean by a fixed bucket size of 16?
* space efficient: I don't need to allocate buckets but simply use empty 
slots in the array

Yes, this is obviously a more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non-constant) overhead, if I understand you correctly.

*) This is a time/space trade-off. In order to be as fast as a traditional hash table, you may need to make the table so much larger that the memory requirements are even higher than for the traditional implementation. You are also much more reliant on a good hash function.

It would be very interesting to see time/space comparisons of such a hash table to a more traditional one for a number of real life cases.
* time efficient: I have direct control over "distance" and can force a 
rehash if it gets too high.

I'm hardly an expert on hash tables, but in my experience, the only way to tell if it is faster for a particular application is to compare it to the alternative implementation. Some points though:

- You are more sensitive to collisions, since you have not only the collisions within a bucket, but also collisions with nearby buckets.
- You have no guarantee that a rehashing (and resizing) will improve the maximum distance. To get around this, you will need an additional condition that limits rehashing to, for instance, when the hash table size < a*n. But adding such a condition will remove your guarantee of a maximum highest distance.
- If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table that keeps balanced trees for each bucket.
- Deletes are more expensive and need to move more data.

In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.

I think D's hash table implementation is safer for general purpose, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.
(Interestingly, it happens that after a rehash, the "distance" turns out 
higher than before! Since the hash-function makes no guarantees I'd guess 
it's just a case of bad luck, as opposed to buggy code)

Yes, I would assume such cases could happen. Regards, /Oskar
Nov 21 2005
parent "Lionello Lunesu" <lio remove.lunesu.com> writes:
Hi,

I found this: http://en.wikipedia.org/wiki/Hashtable#Chaining describes a 
hash-table with balanced trees, like the one used by D. For everyone's 
information.

 What do you mean by a fixed bucket size of 16?

Sorry, I understood from some other post that the bucket size is fixed. But indeed, it makes no sense to fix it if it's a tree.
* space efficient: I don't need to allocate buckets but simply use empty
slots in the array

Yes, this is obviously a more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non constant) overhead if I understand you correctly.

Exactly. The implementation is extremely simple. That's what I love about it. But yes, it's O(n) if you have many collisions.
 - You are more sensitive to collisions, since you have not only the 
 collisions
 within a bucket, but also collisions with nearby buckets.

This is true. I've tested a couple of different hash functions, and the better hash function nearly always wins (even if it's slower).

For example, for a string-map I've compared CRC32 against Java's string hash (hash=7; foreach(c) hash=hash*31+c;) and the CRC32 won, even though it's much slower. The entropy of Java's string hash was improved by multiplying with some magic number (sqrt(5)-1, read about it somewhere) and that's the one I'm using ATM.
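For reference, the two ingredients being compared might look like this in Python. The exact 32-bit constant (derived from sqrt(5)-1, i.e. the golden ratio, as in Fibonacci hashing) is an assumption about which "magic number" is meant:

```python
# Java-style multiply-by-31 string hash (seeded with 7, as described
# above), plus a golden-ratio multiplicative mix that spreads consecutive
# hash values more evenly over power-of-two table sizes.

MASK32 = 0xFFFFFFFF
MAGIC = int((5 ** 0.5 - 1) / 2 * 2 ** 32)   # 2654435769, golden-ratio constant

def java_string_hash(s):
    h = 7
    for c in s:
        h = (h * 31 + ord(c)) & MASK32      # hash = hash*31 + c, kept to 32 bits
    return h

def fibonacci_mix(h):
    return (h * MAGIC) & MASK32             # scrambles the low bits before masking

print(java_string_hash("ab"))               # 9832
```

The mix step matters precisely because a power-of-two table uses only the low bits of the hash; multiplying by the golden-ratio constant pushes entropy from the high bits down into them.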
 - You have no guarantee that a rehashing (and resizing) will improve the
 maximum distance. To get around this, you will need an additional 
 condition
 that limits rehashing to, for instance, when the hash table size < a*n. 
 But
 adding such a condition will remove your guarantee of a maximum highest
 distance.

This is true. However, in practice the distance never exceeds 8 (for <50% occupation of the hash-table) so the impact of the linear search is kept to a minimum.
 - If m is your maximum distance, lookup is O(m) vs O(log m) for a hash 
 table
 that keeps balanced trees for each bucket.

Yeah, you're right. Indeed, only a specific example can compare two hash implementations.
 - Deletes are more expensive and needs to move more data.

Why exactly are deletes more expensive? The operation is the same as with insertion: jump to the (hash) position, walk some (up to 'distance'), delete the (key,value) pair (if any). The down side is that a look-up for a key not in the table always walks up to the max 'distance'.
 In conclusion, I think (guess) you are more sensitive to the performance 
 of the
 hash function. I would be very interested in seeing a comparison with 
 other
 methods on realistic data.

That sounds like nice D exercise!
 I think D's hash table implementaion is safer for general purpose, because 
 it
 gracefully degrades to O(log n) when the hash function is poor. It also
 minimizes the number of calls to comparison and hash functions, which may,
 depending on the implementations, be a good thing.

Indeed, I can imagine it's faster for general cases, but it's certainly more complex (the code, I mean). Good thing it's built-in then ; ) L.
Nov 22 2005
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...
 Although the
 unbalanced binary tree Walter has choosen for the resolution of
 collisions is bad as hell in theory it performs very well under the
 constraints given.

I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 "Manfred Nowak" <svv1999 hotmail.com> wrote in message
 news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...
 Although the
 unbalanced binary tree Walter has choosen for the resolution of
 collisions is bad as hell in theory it performs very well under the
 constraints given.

I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.

It does. And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.

Out of curiosity, is the binary tree implementation balanced in any way, or is it a plain old binary tree? Given the relatively small expected chain size I'd expect the latter, but I figured I'd ask anyway.

Sean
Nov 19 2005
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:dloh13$1glb$1 digitaldaemon.com...
 Walter Bright wrote:
 I've been using variations of that for 25 years. Every once in a while, I
 try out some new table method that is theoretically supposed to be faster.
 They always turn out to be slower. So back to the binary tree!
 I've also found it scales very well, from a few entries to tens of
 thousands of entries.

It does. And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.

D AA's are general purpose, which means they are meant to do a decent job for a wide range of applications. For very specific or extreme cases, it's likely you can do better with a custom implementation.
 Out of
 curiosity, is the binary tree implementation balanced in any way, or is
 it a plain old btree?  Given the relatively small expected chain size
 I'd expect the latter, but I figured I'd ask anyway.

It gets rebalanced when a rehash happens. Rebalancing it on the fly is one of those theoretical improvements that in practice makes things much worse.
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright wrote:

[...]
 It gets rebalanced when a rehash happens.

I do not see any balancing in the implementation. It's plain inserting.

-manfred
Nov 19 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Walter Bright wrote:
 
 [...]
 It gets rebalanced when a rehash happens.

I do not see any balancing in the implementation. It's plain inserting.

I think that can be considered a rebalancing so long as the insertions don't occur in the same order as they did originally. Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?

Sean
Nov 19 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
  Also, the contents of each tree may change on a rehash, since
 some elements may end up in different buckets, correct?

Correct. But this leads only to a randomly balanced tree, and I doubt that the factor of 3 for the space requirement of the tree against the space requirement of a 16 element array is justified by the time gain. However, I assume that Walter has already tested that.

-manfred
Nov 20 2005
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:Xns9714624647748svv1999hotmailcom 63.105.9.61...
 Sean Kelly wrote:

 [...]
  Also, the contents of each tree may change on a rehash, since
 some elements may end up in different buckets, correct?

Correct. But this leads only to a randomly balanced tree

Yes, that's true.
 and I doubt,
 that the factor of 3 for the space requirement of the tree against
 the space requirement of a 16 element array is justified by the time
 gain.

AA's are usually used when performance matters more than memory consumption. Regular arrays are used where memory consumption is paramount. Thus, it is appropriate to trade memory for speed in AA algorithm design.
 However, I assume, that Walter has already tested that.

It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavily dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available.

The AA algorithm is all contained in internal\aaA.d. None of the rest of the language or library has any dependency on it. This means anyone can plug in another algorithm and give it a try.
Nov 23 2005
parent Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 
 It's easy to contrive test cases designed to make any particular scheme look
 good and any other particular scheme look bad. The current AA scheme has
 worked well for me for many years on all kinds of data. For example, it's
 used in DMDScript, and the performance of DMDScript is heavily dependent on
 the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a
 factor of 2, the fastest Javascript interpreter available.

Word occurrence tests seem to be a pretty reasonable way to measure AA performance. Out of curiosity, I think I'll extend my C++ test to the D AA and see how it compares.

Sean
Nov 23 2005