
digitalmars.D - About built-in AAs

reply bearophile <bearophileHUGS lycos.com> writes:
I have some comments, that I see as problems, about the built-in AAs.


1) I have written code similar to this, do you think it's correct?

import std.stdio;

struct F {
    uint x;

    hash_t toHash() {
        puts("*");
        return x;
    }

    const bool opEquals(ref const(F) f) {
        puts("#");
        return x == f.x;
    }
}
void main() {
    auto data = [F(1), F(2), F(3), F(4), F(5), F(6)];
    int[F] aa;
    foreach (int i, f; data)
        aa[f] = i;
}


After a good amount of debugging I added those puts calls, and I saw that
those toHash/opEquals are never called. This is the second time I have wasted
time on this problem.

In my opinion this is not acceptable in a language like D. If a user wants to
add the hashing protocol to a class/struct, then he/she will probably start by
adding a toHash. So if a toHash() is present, the compiler should give a
compile-time error if the other parts of the hash protocol are missing or
incorrect (currently that means a correct opEquals and opCmp). This is quite
important.
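As a user-side approximation of the requested check, a sketch along these lines could be written today with __traits and static assert (the template name hasHashProtocol is mine, not part of D or druntime):

```d
import std.stdio;

// Hypothetical helper (the name is illustrative): true only when the full
// 2011-era hash protocol is present on T.
template hasHashProtocol(T)
{
    enum bool hasHashProtocol =
        __traits(hasMember, T, "toHash") &&
        __traits(hasMember, T, "opEquals") &&
        __traits(hasMember, T, "opCmp");
}

struct F
{
    uint x;

    hash_t toHash() { return x; }
    const bool opEquals(ref const(F) f) { return x == f.x; }
    const int opCmp(ref const F f) { return (x > f.x) - (x < f.x); }
}

// The compile-time error being asked for could be approximated today with a
// static assert at the point of use:
static assert(hasHashProtocol!F,
    "F defines part of the hash protocol but not all of it");

void main()
{
    int[F] aa;
    aa[F(1)] = 1;
    writeln(aa.length); // 1
}
```

This only simulates the diagnostic; the request in the thread is that the compiler itself emit it whenever toHash is present but the rest of the protocol is not.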

See also:
http://d.puremagic.com/issues/show_bug.cgi?id=3844
http://d.puremagic.com/issues/show_bug.cgi?id=4290

-------------------

2) Originally built-in AAs used a search tree to resolve collisions, but I
think that for some time now they have used a simple linked list (this has the
advantage of simplifying the code, but it has the disadvantage that hashing can
now degrade to O(n^2), allowing certain attacks on D code that were impossible
before).

So why do D AAs still require an opCmp on user-defined hashable
structs/classes? Why don't we remove this requirement (and update the D docs)?
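For illustration, here is a sketch of a key type this would enable: equality and hashing are well defined, but there is no meaningful total order to base an opCmp on (the struct and its hash formula are made up):

```d
import std.stdio;

// A set-like key with no natural ordering. If the opCmp requirement were
// dropped, toHash + opEquals alone would suffice for use as an AA key.
struct Point
{
    int x, y;

    hash_t toHash() const
    {
        return cast(hash_t)(x * 31 + y); // illustrative mixing, not canonical
    }

    const bool opEquals(ref const(Point) p)
    {
        return x == p.x && y == p.y;
    }

    // No opCmp: 2-D points have no meaningful total order, so with the
    // current protocol this type cannot cleanly serve as an AA key.
}

void main()
{
    Point a = Point(1, 2), b = Point(1, 2);
    writeln(a == b); // true
}
```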

-------------------

3) This program finally uses the user-defined hash:


import std.stdio;

struct F {
    uint x;

    hash_t toHash() {
        puts("*");
        return x;
    }

    const bool opEquals(ref const(F) f) {
        puts("#");
        return x == f.x;
    }

    const int opCmp(ref const F f) {
        puts("$");
        // note: plain (x - f.x) on uints can wrap and report the wrong sign
        return (x > f.x) - (x < f.x);
    }
}
void main() {
    auto data = [F(1), F(2), F(3), F(3), F(5), F(1)];
    int[F] aa;
    foreach (int i, f; data)
        aa[f] = i;
    writeln(aa);
}


It outputs:

$
$
F:1 F:3 F:5 F:4


The output shows that the AA is using opCmp() instead of opEquals(). If the AA
is hash based and resolves collisions with an unordered linked list, then why
isn't it calling opEquals() instead? opEquals() is generally faster than
opCmp().

-------------------

4) Now and then I compare the running time of C++0x code that uses
<unordered_map> with the running time of the same operations done with D AAs,
and the timings for D are not good. Are D AAs true templates, or do they go
through some runtime typeid function? If they aren't true templates, wouldn't
it be better to change this?

Bye,
bearophile
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 3:09 PM, bearophile wrote:
 2) Originally built-in AAs used a search tree to resolve collisions, but I
 think since some time they use a simple linked list (this has the advantage
 of simplifying the code, but it has the disadvantage that now hashing is able
 to become O(n^2), allowing certain attacks on D code that were impossible
 before).

It wasn't changed to simplify code, it was changed because it was faster. As for "attacks", if your code is subject to people sending it specially crafted input in the hopes of making it run slowly, you can make specially crafted input to make trees run slowly, too.
Aug 16 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 It wasn't changed to simplify code, it was changed because it was faster.

Right.
 As for "attacks", if your code is subject to people sending it specially
crafted 
 input in the hopes of making it run slowly, you can make specially crafted
input 
 to make trees run slowly, too.

I think there are search trees, like the Red-Black ones, that guarantee an O(n ln n) worst case. Am I wrong?
Since operations such as inserting, deleting, and finding values require
worst-case time proportional to the height of the tree, this theoretical upper
bound on the height allows red–black trees to be efficient in the worst-case,
unlike ordinary binary search trees.<

From: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Properties

Bye,
bearophile
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 5:06 PM, bearophile wrote:
 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.
Aug 16 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile
Aug 16 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Thanks,
Andrei
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
 Let's please stop this. Many of us, including yourself, noticed the relatively
 poor performance of D's previous hashtables compared to other languages.
 Switching to singly-linked-list collision handling marked an improvement. Now a lot of
 data structure designs have a worst-case that makes them perform worse than
 others. If you worry about attacks, please implement your own hashtable. If we
 switch back to the old implementation, you'll complain again about D's
 hashtables being slower than Python's, thus closing a years-long cycle.

Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or who believes they can come up with a faster hash, can swap in their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.
Aug 16 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/17/11 12:43 AM, Walter Bright wrote:
 On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
 Let's please stop this. Many of us, including yourself, noticed the
 relatively
 poor performance of D's previous hashtables compared to other languages.
 Switching to singly-linked-list collision handling marked an improvement. Now
 a lot of
 data structure designs have a worst-case that makes them perform worse
 than
 others. If you worry about attacks, please implement your own
 hashtable. If we
 switch back to the old implementation, you'll complain again about D's
 hashtables being slower than Python's, thus closing a years-long cycle.

Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or who believes they can come up with a faster hash, can swap in their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.

Though that's a good point, people would usually recompile projects when they relink, so the point is not as strong as it might have been. I still very strongly believe we need to continue aggressively migrating the built-in hashtables from pluggable to templated. Pluggable hashtables incur indirect calls in core functions - the bane of all speed optimizers. We can't afford that cost for such a useful data structure.

Andrei
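The cost being described can be sketched like this (this is not the actual druntime code; the helper names are mine, and only illustrate the indirect-call versus inlined-call distinction):

```d
import std.stdio;

// Runtime-style lookup: every comparison goes through a TypeInfo virtual
// call, which the optimizer generally cannot inline.
bool equalsRuntime(const(void)* a, const(void)* b, TypeInfo ti)
{
    return ti.equals(a, b); // indirect call per comparison
}

// Template-style lookup: the concrete type is visible, so the compiler can
// inline the comparison and specialize the whole code path.
bool equalsTemplated(T)(ref const T a, ref const T b)
{
    return a == b;
}

void main()
{
    int x = 1, y = 1;
    writeln(equalsRuntime(&x, &y, typeid(int))); // true
    writeln(equalsTemplated(x, y));              // true
}
```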
Aug 17 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Let's please stop this.

OK. This sub-thread about hash resilience against attacks is a minor detail of my original post. kennytm has answered my point 4, but I'd like some answers to the first three points: the error messages for an incomplete hash protocol, the possible removal of opCmp from the hash protocol, and the current protocol's use of opCmp instead of opEquals.

Bye,
bearophile
Aug 17 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei
Aug 17 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 
 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that
 guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to. - Jonathan M Davis
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu

 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that
 guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

 If you feed it more data, even if all items produce collisions because they
 all hash to the same bucket, using Red-Black trees to handle the items
 within a bucket keeps the behaviour O(n ln n), which is usually fast enough.
 With Python and the new D AAs you instead get a quadratic one. This
 quadratic behaviour gives trouble way before physical RAM is exhausted.

 Bye,
 bearophile

 Let's please stop this. Many of us, including yourself, noticed the
 relatively poor performance of D's previous hashtables compared to other
 languages. Switching to singly-linked-list collision handling marked an
 improvement. Now a lot of data structure designs have a worst case that
 makes them perform worse than others. If you worry about attacks, please
 implement your own hashtable. If we switch back to the old implementation,
 you'll complain again about D's hashtables being slower than Python's, thus
 closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.

I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one, though :). The precedent is to require toHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types they can. Also, currently, if opCmp doesn't exist, the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals; that is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering!

-Steve
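A minimal sketch of the hazard being described, assuming the 2011-era behaviour reported earlier in this thread (the struct and its field names are made up):

```d
import std.stdio;

// Hypothetical key type: only 'id' is part of the key's logical identity;
// 'timestamp' is incidental data.
struct K
{
    uint id;
    uint timestamp;

    hash_t toHash() const { return id; }

    // Under the behaviour described above, this opEquals is silently ignored
    // because no opCmp is defined; the runtime falls back to a field-wise
    // default comparison instead.
    const bool opEquals(ref const(K) k) { return id == k.id; }
}

void main()
{
    int[K] aa;
    aa[K(1, 100)] = 1;
    aa[K(1, 200)] = 2; // logically the same key as the line above
    // With the user's opEquals honored this would print 1; with the made-up
    // field-wise comparison the two keys stay distinct.
    writeln(aa.length);
}
```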
Aug 17 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I'm all for whatever fixes hashes and makes them properly CTFE'able,
IOW no more "expression is not constant" for hash declarations please!
Aug 17 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Let's please stop this.

OK, but I'd like to note that what I have written comes from some experiments too: http://is.gd/L4U3Kp

Hugs,
bearophile
Aug 17 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2011 11:33 AM, bearophile wrote:
 Andrei Alexandrescu:

 Let's please stop this.

OK, but I'd like to note that what I have written comes from some experiments too: http://is.gd/L4U3Kp

I don't think it's a sensible point of view. If you supply your own hash function, the onus is on you to produce a suitable one. It is not the language's fault if the hash tables run poorly because your hash function always returns the same value.

Secondly, I seriously doubt this is any sort of viable "attack" vector. If you are allowing users to write system code (which D is) and remotely run it on your system, a slow hash map is the least of your security concerns. Heck, you could just write for(;;){} and call it a day. Nothing clever required.

And, the linear version is much FASTER and uses significantly LESS memory in real world situations, which is why we switched to it.

Bottom line, I don't think there's an actual problem here.
Aug 17 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is overall better than the previous one.

My original first post of this thread was about other, quite more practical problems, like receiving help from the compiler if I am using the hash protocol badly, etc. :-)

Bye,
bearophile
Aug 17 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2011 7:00 AM, Steven Schveighoffer wrote:
 On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com>
wrote:

 Walter Bright:

 It wasn't changed to simplify code, it was changed because it was faster.


And uses much less space -- only one pointer per node vs. two.

Actually, that was a large factor in the speed increase.
Aug 17 2011
prev sibling next sibling parent Andrew Wiley <wiley.andrew.j gmail.com> writes:

On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com> wrote:

 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a
 O(n ln n) worst case. Am I wrong?

 Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.
Aug 16 2011
prev sibling next sibling parent Josh Simmons <simmons.44 gmail.com> writes:
On Wed, Aug 17, 2011 at 12:40 PM, Andrew Wiley <wiley.andrew.j gmail.com> wrote:
 On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com>
 wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.

There's a lot more to making a hash-map robust than changing the lookup scheme. There's also a large variety of collision resolution methods that all have their pros and cons depending on the data-set and lookup patterns you're dealing with. For a built in AA system I feel simplicity is a much more achievable goal than trying to solve everybody's specific problems. Hence I feel that the current solution is probably a good choice. Just my two cents anyway. Josh.
Aug 16 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Walter Bright:

 It wasn't changed to simplify code, it was changed because it was  
 faster.


And uses much less space -- only one pointer per node vs. two.
 Right.

I'd add that removing the requirement for opCmp is a reasonable request -- this opens up keys to significantly more data types (those which don't have a defined order).
 As for "attacks", if your code is subject to people sending it  
 specially crafted
 input in the hopes of making it run slowly, you can make specially  
 crafted input
 to make trees run slowly, too.

I think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. Am I wrong?

Red-black trees would be overkill here. The ideal hash has only one element per bucket, and at most 2 or 3. This is why you expand the bucket array even though you only have an element-to-bucket ratio of .75 or so. Consider that for one or two (or even sometimes three) elements, a linked list has identical topology, and its insertion/removal algorithm is much less complex.

-Steve
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee  
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve
Aug 17 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Aug 17, 2011, at 7:04 AM, Steven Schveighoffer wrote:
 Yes, but let's not forget the one valid request out of all of this -- if
 trees are no longer being used, opEquals should be used instead of opCmp.
 This allows more possible key types (which don't define an ordering). I
 think this would be a simple druntime change.

Yup. It would make the AAs faster too. I'd love to deprecate the requirement for opCmp with AAs.
Aug 17 2011
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

 Thank you for your answers. And I agree that the current situation is
 overall better than the previous one.

 My original first post of this thread was about other, quite more practical
 problems, like receiving help from the compiler if I am using the hash
 protocol badly, etc. :-)

This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc. (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
Aug 17 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Sean Kelly:

This would be a run-time issue, unless you're asking the compiler to verify
your hash algorithm at compile-time :-p  I'd actually like to have some
introspection functionality so I could find out the average chain length, max
chain length, etc (basically what's provided by the unordered containers from
C++11), but the user would still have to query this stuff to know that
something was wrong.<

What you are talking about is interesting. But what I was asking in point 1 of my original post is something different and simpler; it's a fully static thing. This was my original post of this thread:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=142709

What I was asking in that point 1 is:

1) If the user writes wrong signatures for each of the methods needed by the hash protocol (currently toHash(), opEquals() and opCmp()), then I'd like the compiler to tell me with a compile-time error. I think DMD is already mostly doing this.

2) If the user adds a toHash method to a class or struct, then the compiler must statically require the presence of all the other methods needed by the hash protocol. Currently DMD is not doing this: if you add correct toHash() and opEquals() methods but no opCmp(), the compiler will silently not use them. I think this isn't acceptable in a good language.

--------------

In the successive points of my original post I asked two other things: pushing AAs to a fully templated implementation, and completely removing the need for opCmp from the hash protocol.

--------------

Extra: I presume there is no way to ask (on request) the language to use a stack-style memory allocator for the built-in AAs instead of the normal GC. But once built-in AAs are fully templated, I presume it's not too hard to give an interface to them from Phobos functions too, and then it becomes possible to let them use a different memory allocator.

Bye,
bearophile
Aug 18 2011
prev sibling next sibling parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:

 On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei

Added. Do we also want to add "don't have the compiler generate opCmp
automatically"?

http://www.prowiki.org/wiki4d/wiki.cgi?LanguageDevel
Aug 17 2011
prev sibling next sibling parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 17 Aug 2011 15:40:25 -0700, Sean Kelly wrote:

 This would be a run-time issue, unless you're asking the compiler to
 verify your hash algorithm at compile-time :-p  I'd actually like to
 have some introspection functionality so I could find out the average
 chain length, max chain length, etc (basically what's provided by the
 unordered containers from C++11), but the user would still have to query
 this stuff to know that something was wrong.

What I got from it was that it didn't use his hash until he provided opCmp too, thus the compiler should complain when it isn't going to use the hash you provide.
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 21:07:29 -0400, Jesse Phillips
<jessekphillips+d gmail.com> wrote:

 On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:

 On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items in the same bucket keeps the behaviour at O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour causes trouble way before the physical RAM is exhausted. Bye, bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change. -Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei

Added. Do we also want to add "don't have the compiler generate opCmp automatically"?

So actually, it's technically a joint effort of the compiler and druntime. The compiler doesn't *reject* creating AA's using keys that don't define opCmp, and it doesn't actually generate a function. But druntime *does* do a "default" opCmp if no opCmp is defined. See here: https://github.com/D-Programming-Language/druntime/blob/master/src/object_.d#L955

In essence, this "bug" is fixed by one of two things:

1. Using a fully templated AA implementation instead of an RTTI-based one (which avoids using the TypeInfo.compare function).
2. Not using opCmp anymore.

Since 2 is already happening, I think we are good. I don't know how we should "fix" the TypeInfo_Struct function. Should it just throw an exception if xOpCmp isn't set? -Steve
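Fix #1 above can be sketched as follows; all names here are hypothetical, not actual druntime symbols. In a fully templated bucket walk, `==` binds statically to the key's own opEquals, so TypeInfo.compare (and therefore opCmp) never enters the picture:

```d
// Hypothetical templated collision chain -- illustrative only.
struct Node(K, V) {
    K key;
    V value;
    Node* next;   // singly-linked collision chain
}

// Statically typed lookup: n.key == key resolves at compile time to
// K's own opEquals, with no TypeInfo/RTTI indirection involved.
V* find(K, V)(Node!(K, V)* head, K key) {
    for (auto n = head; n !is null; n = n.next)
        if (n.key == key)
            return &n.value;
    return null;
}

void main() {
    auto c = new Node!(string, int)("c", 3, null);
    auto b = new Node!(string, int)("b", 2, c);
    auto a = new Node!(string, int)("a", 1, b);
    assert(*find(a, "b") == 2);
    assert(find(a, "missing") is null);
}
```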
Aug 17 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?

Sent from my iPhone

On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:

 On Thu, Aug 18, 2011 at 8:40 AM, Sean Kelly <sean invisibleduck.org> wrote:

 On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is ov…

 My original first post of this thread was about other problems, quite mo… …hash protocol badly, etc. :-)

 This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.

 The security issue is basically a DoS one, for example if you know a
 web server is using a specific hash and collision resolution method to
 store message headers you can pass headers that all hash to buckets
 that provide worst-case behavior. In this instance universal hashing
 where a hash function is chosen randomly from a pool of hashes
 combined with good algorithmic complexity means the attacker is unable
 to do this reliably.

 Unrelated though, I'm quite a fan of hopscotch hashing at the moment,
 in theory at least. It'd be interesting to get a few different
 resolution schemes working and compare their performance on various
 workloads.
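The universal-hashing defence described above could look roughly like this in D; SeededHasher and its mixing constants are illustrative assumptions, not anything druntime provides:

```d
import std.random : unpredictableSeed;

// Seed the hash randomly per process so a precomputed set of colliding
// keys does not transfer between runs -- the core of the DoS defence.
struct SeededHasher {
    ulong seed;

    size_t hash(uint x) const nothrow {
        ulong h = x ^ seed;
        h *= 0x9E3779B97F4A7C15UL;   // 64-bit Fibonacci-hashing multiplier
        return cast(size_t)(h ^ (h >> 32));
    }
}

void main() {
    auto hasher = SeededHasher(unpredictableSeed);
    // Deterministic within one run...
    assert(hasher.hash(42) == hasher.hash(42));
    // ...but a differently seeded table disagrees, so an attacker cannot
    // reliably craft keys that all land in one bucket.
    auto other = SeededHasher(hasher.seed + 1);
    assert(hasher.hash(42) != other.hash(42));
}
```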

Aug 18 2011
prev sibling parent Josh Simmons <simmons.44 gmail.com> writes:
On Fri, Aug 19, 2011 at 1:43 AM, Sean Kelly <sean invisibleduck.org> wrote:
 This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?

 Sent from my iPhone

 On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:

 On Thu, Aug 18, 2011 at 8:40 AM, Sean Kelly <sean invisibleduck.org> wrote:


 On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is ov…

 My original first post of this thread was about other problems, quite mo… …hash protocol badly, etc. :-)
 This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
 The security issue is basically a DoS one, for example if you know a
 web server is using a specific hash and collision resolution method to
 store message headers you can pass headers that all hash to buckets
 that provide worst-case behavior. In this instance universal hashing
 where a hash function is chosen randomly from a pool of hashes
 combined with good algorithmic complexity means the attacker is unable
 to do this reliably.

 Unrelated though, I'm quite a fan of hopscotch hashing at the moment,
 in theory at least. It'd be interesting to get a few different
 resolution schemes working and compare their performance on various
 workloads.


I doubt it; I was just noting the issue, not suggesting it should be fixed.

Cheers,
Josh.
Aug 18 2011
prev sibling next sibling parent kennytm <kennytm gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:
 
 Now and then I compare the running time of C++0x code that uses
 <unordered_map> with the running time of the same operations done with D
 AAs and I don't see good timings for D. Are D AAs true templates or do
 they use some runtime typeid function? If they aren't true templates
 isn't it better to change this?
 
 Bye,
 bearophile

They use runtime typeid functions. A V[K] type and its methods are converted to AssociativeArray!(K, V), which itself is a proxy to those extern(C) aaGetX() functions; some methods resolve to aaGetX directly in DMD. These functions can only know what types they're working on via the typeid.

IMO in DMD V[K]'s methods shouldn't be special. They should be treated just as a syntactic substitution for the AssociativeArray!(K, V) type, and the latter shall implement opIndex etc. for its normal operation. Druntime gets to decide whether to continue using the C interface or turn it into a concrete templated structure. This also solves bug 5350.
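The lowering proposed above could be sketched like this; AssocArray and its internals are hypothetical stand-ins (a real implementation would of course not store a built-in V[K] inside -- that is only so the sketch runs today):

```d
// Hypothetical library type the compiler would rewrite V[K] syntax into,
// implementing the operators itself instead of funneling through
// extern(C) functions plus TypeInfo.
struct AssocArray(K, V) {
    private V[K] impl;   // stand-in storage, purely for the sketch

    ref V opIndex(K k)           { return impl[k]; }
    void opIndexAssign(V v, K k) { impl[k] = v; }
    size_t length() const        { return impl.length; }
}

void main() {
    AssocArray!(string, int) aa;
    aa["one"] = 1;   // would be lowered to aa.opIndexAssign(1, "one")
    assert(aa["one"] == 1);
    assert(aa.length == 1);
}
```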
Aug 16 2011
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 17, 2011 08:59 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com>
 
 wrote:
 On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 
 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

 If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items in the same bucket keeps the behaviour at O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour causes trouble way before the physical RAM is exhausted.

 Bye,
 bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.

I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one though :). The precedent is to require toHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types they can.

Also, currently, if opCmp doesn't exist the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals, which is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering!

I'm not necessarily arguing that we should keep requiring opCmp (and certainly having the compiler generate one is not good IMHO). I'm just pointing out that removing it would place further limitations on our ability to change the implementation later if we decide that we need to. As long as we think that the benefits outweigh the costs, then removing the need for opCmp in AA's probably is what we should do. But regardless, having the compiler create an opCmp for you seems like highly broken behavior.

- Jonathan M Davis
Aug 17 2011