
digitalmars.D - About built-in AAs

reply bearophile <bearophileHUGS lycos.com> writes:
I have some comments, that I see as problems, about the built-in AAs.


1) I have written code similar to this, do you think it's correct?

import std.stdio;

struct F {
    uint x;

    hash_t toHash() {
        puts("*");
        return x;
    }

    const bool opEquals(ref const(F) f) {
        puts("#");
        return x == f.x;
    }
}
void main() {
    auto data = [F(1), F(2), F(3), F(4), F(5), F(6)];
    int[F] aa;
    foreach (int i, f; data)
        aa[f] = i;
}


After a good amount of debugging I added those puts calls, and I saw that
those toHash/opEquals are never called. This is the second time I have wasted
time on this problem.

In my opinion this is not acceptable in a language like D. If a user wants to
add the hashing protocol to a class/struct, then he/she will probably start by
adding a toHash. So if a toHash() is present, the compiler should give a
compile-time error if the other parts of the hash protocol are missing or
incorrect (currently that means a correct opEquals and opCmp). This is quite
important.
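As a user-side approximation of the requested check, a sketch along these lines could be written today with __traits and static assert (the template name hasHashProtocol is mine, not part of D or druntime):

```d
import std.stdio;

// Hypothetical helper (the name is illustrative): true only when the full
// 2011-era hash protocol is present on T.
template hasHashProtocol(T)
{
    enum bool hasHashProtocol =
        __traits(hasMember, T, "toHash") &&
        __traits(hasMember, T, "opEquals") &&
        __traits(hasMember, T, "opCmp");
}

struct F
{
    uint x;

    hash_t toHash() { return x; }
    const bool opEquals(ref const(F) f) { return x == f.x; }
    const int opCmp(ref const F f) { return (x > f.x) - (x < f.x); }
}

// The compile-time error being asked for could be approximated today with a
// static assert at the point of use:
static assert(hasHashProtocol!F,
    "F defines part of the hash protocol but not all of it");

void main()
{
    int[F] aa;
    aa[F(1)] = 1;
    writeln(aa.length); // 1
}
```

This only simulates the diagnostic; the request in the thread is that the compiler itself emit it whenever toHash is present but the rest of the protocol is not.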

See also:
http://d.puremagic.com/issues/show_bug.cgi?id=3844
http://d.puremagic.com/issues/show_bug.cgi?id=4290

-------------------

2) Originally built-in AAs used a search tree to resolve collisions, but I
think that for some time now they have used a simple linked list (this has the
advantage of simplifying the code, but it has the disadvantage that hashing can
now degrade to O(n^2), allowing certain attacks on D code that were impossible
before).

So why do D AAs still require an opCmp on user-defined hashable
structs/classes? Why don't we remove this requirement (and update the D docs)?
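For illustration, here is a sketch of a key type this would enable: equality and hashing are well defined, but there is no meaningful total order to base an opCmp on (the struct and its hash formula are made up):

```d
import std.stdio;

// A set-like key with no natural ordering. If the opCmp requirement were
// dropped, toHash + opEquals alone would suffice for use as an AA key.
struct Point
{
    int x, y;

    hash_t toHash() const
    {
        return cast(hash_t)(x * 31 + y); // illustrative mixing, not canonical
    }

    const bool opEquals(ref const(Point) p)
    {
        return x == p.x && y == p.y;
    }

    // No opCmp: 2-D points have no meaningful total order, so with the
    // current protocol this type cannot cleanly serve as an AA key.
}

void main()
{
    Point a = Point(1, 2), b = Point(1, 2);
    writeln(a == b); // true
}
```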

-------------------

3) This program finally uses the user-defined hash:


import std.stdio;

struct F {
    uint x;

    hash_t toHash() {
        puts("*");
        return x;
    }

    const bool opEquals(ref const(F) f) {
        puts("#");
        return x == f.x;
    }

    const int opCmp(ref const F f) {
        puts("$");
        // note: plain (x - f.x) on uints can wrap and report the wrong sign
        return (x > f.x) - (x < f.x);
    }
}
void main() {
    auto data = [F(1), F(2), F(3), F(3), F(5), F(1)];
    int[F] aa;
    foreach (int i, f; data)
        aa[f] = i;
    writeln(aa);
}


It outputs:

$
$
F:1 F:3 F:5 F:4


The output shows that the AA is using opCmp() instead of opEquals(). If the AA
is hash based and resolves collisions with an unordered linked list, then why
isn't it calling opEquals() instead? opEquals() is generally faster than
opCmp().

-------------------

4) Now and then I compare the running time of C++0x code that uses
<unordered_map> with the running time of the same operations done with D AAs,
and the timings for D are not good. Are D AAs true templates, or do they go
through some runtime typeid function? If they aren't true templates, wouldn't
it be better to change this?

Bye,
bearophile
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 3:09 PM, bearophile wrote:
 2) Originally built-in AAs used a search tree to resolve collisions, but I
 think since some time they use a simple linked list (this has the advantage
 of simplifying the code, but it has the disadvantage that now hashing is able
 to become O(n^2), allowing certain attacks on D code that were impossible
 before).

It wasn't changed to simplify code, it was changed because it was faster. As for "attacks", if your code is subject to people sending it specially crafted input in the hopes of making it run slowly, you can make specially crafted input to make trees run slowly, too.
Aug 16 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 It wasn't changed to simplify code, it was changed because it was faster.

Right.
 As for "attacks", if your code is subject to people sending it specially
crafted 
 input in the hopes of making it run slowly, you can make specially crafted
input 
 to make trees run slowly, too.

I think there are search trees, like the Red-Black ones, that guarantee an O(n ln n) worst case. Am I wrong?
Since operations such as inserting, deleting, and finding values require
worst-case time proportional to the height of the tree, this theoretical upper
bound on the height allows red–black trees to be efficient in the worst-case,
unlike ordinary binary search trees.<

From: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Properties

Bye,
bearophile
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 5:06 PM, bearophile wrote:
 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.
Aug 16 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile
Aug 16 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a O(n ln
n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Thanks,
Andrei
Aug 16 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
 Let's please stop this. Many of us, including yourself, noticed the relatively
 poor performance of D's previous hashtables compared to other languages.
 Switching to singly-linked-list collision handling marked an improvement. Now a lot of
 data structure designs have a worst-case that makes them perform worse than
 others. If you worry about attacks, please implement your own hashtable. If we
 switch back to the old implementation, you'll complain again about D's
 hashtables being slower than Python's, thus closing a years-long cycle.

Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or who believes they can come up with a faster hash, can swap in their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.
Aug 16 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/17/11 12:43 AM, Walter Bright wrote:
 On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
 Let's please stop this. Many of us, including yourself, noticed the
 relatively
 poor performance of D's previous hashtables compared to other languages.
 Switching to singly-linked-list collision handling marked an improvement. Now
 a lot of
 data structure designs have a worst-case that makes them perform worse
 than
 others. If you worry about attacks, please implement your own
 hashtable. If we
 switch back to the old implementation, you'll complain again about D's
 hashtables being slower than Python's, thus closing a years-long cycle.

Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or who believes they can come up with a faster hash, can swap in their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.

Though that's a good point, people would usually recompile projects when they relink, so the point is not as strong as it might have been. I still very strongly believe we need to continue aggressively migrating the built-in hashtables from pluggable to templated. Pluggable hashtables incur indirect calls in core functions - the bane of all speed optimizers. We can't afford that cost for such a useful data structure.

Andrei
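The cost being described can be sketched like this (this is not the actual druntime code; the helper names are mine, and only illustrate the indirect-call versus inlined-call distinction):

```d
import std.stdio;

// Runtime-style lookup: every comparison goes through a TypeInfo virtual
// call, which the optimizer generally cannot inline.
bool equalsRuntime(const(void)* a, const(void)* b, TypeInfo ti)
{
    return ti.equals(a, b); // indirect call per comparison
}

// Template-style lookup: the concrete type is visible, so the compiler can
// inline the comparison and specialize the whole code path.
bool equalsTemplated(T)(ref const T a, ref const T b)
{
    return a == b;
}

void main()
{
    int x = 1, y = 1;
    writeln(equalsRuntime(&x, &y, typeid(int))); // true
    writeln(equalsTemplated(x, y));              // true
}
```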
Aug 17 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Let's please stop this.

OK. This sub-thread about hash resilience against attacks is a minor detail of my original post. kennytm has answered my point 4, but I'd like some answers to the first three points: the error messages for an incomplete hash protocol, the possible removal of opCmp from the hash protocol, and the current protocol's use of opCmp instead of opEquals.

Bye,
bearophile
Aug 17 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei
Aug 17 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 
 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that
 guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to. - Jonathan M Davis
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu

 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that
 guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

 If you feed it more data, even if all items produce collisions because they
 all hash to the same bucket, using Red-Black trees to handle the items
 within a bucket keeps the behaviour O(n ln n), which is usually fast enough.
 With Python and the new D AAs you instead get a quadratic one. This
 quadratic behaviour gives trouble way before physical RAM is exhausted.

 Bye,
 bearophile

 Let's please stop this. Many of us, including yourself, noticed the
 relatively poor performance of D's previous hashtables compared to other
 languages. Switching to singly-linked-list collision handling marked an
 improvement. Now a lot of data structure designs have a worst case that
 makes them perform worse than others. If you worry about attacks, please
 implement your own hashtable. If we switch back to the old implementation,
 you'll complain again about D's hashtables being slower than Python's, thus
 closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.

I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one, though :). The precedent is to require toHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types they can. Also, currently, if opCmp doesn't exist, the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals; that is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering!

-Steve
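A minimal sketch of the hazard being described, assuming the 2011-era behaviour reported earlier in this thread (the struct and its field names are made up):

```d
import std.stdio;

// Hypothetical key type: only 'id' is part of the key's logical identity;
// 'timestamp' is incidental data.
struct K
{
    uint id;
    uint timestamp;

    hash_t toHash() const { return id; }

    // Under the behaviour described above, this opEquals is silently ignored
    // because no opCmp is defined; the runtime falls back to a field-wise
    // default comparison instead.
    const bool opEquals(ref const(K) k) { return id == k.id; }
}

void main()
{
    int[K] aa;
    aa[K(1, 100)] = 1;
    aa[K(1, 200)] = 2; // logically the same key as the line above
    // With the user's opEquals honored this would print 1; with the made-up
    // field-wise comparison the two keys stay distinct.
    writeln(aa.length);
}
```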
Aug 17 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I'm all for whatever fixes hashes and makes them properly CTFE'able,
IOW no more "expression is not constant" for hash declarations please!
Aug 17 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Let's please stop this.

OK, but I'd like to note that what I have written comes from some experiments too: http://is.gd/L4U3Kp

Hugs,
bearophile
Aug 17 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2011 11:33 AM, bearophile wrote:
 Andrei Alexandrescu:

 Let's please stop this.

OK, but I'd like to note that what I have written comes from some experiments too: http://is.gd/L4U3Kp

I don't think it's a sensible point of view. If you supply your own hash function, the onus is on you to produce a suitable one. It is not the language's fault if the hash tables run poorly because your hash function always returns the same value.

Secondly, I seriously doubt this is any sort of viable "attack" vector. If you are allowing users to write system code (which D is) and remotely run it on your system, a slow hash map is the least of your security concerns. Heck, you could just write for(;;){} and call it a day. Nothing clever required.

And, the linear version is much FASTER and uses significantly LESS memory in real world situations, which is why we switched to it.

Bottom line, I don't think there's an actual problem here.
Aug 17 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is overall better than the previous one.

My original first post of this thread was about other, quite more practical problems, like receiving help from the compiler if I am using the hash protocol badly, etc. :-)

Bye,
bearophile
Aug 17 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2011 7:00 AM, Steven Schveighoffer wrote:
 On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com>
wrote:

 Walter Bright:

 It wasn't changed to simplify code, it was changed because it was faster.


And uses much less space -- only one pointer per node vs. two.

Actually, that was a large factor in the speed increase.
Aug 17 2011
prev sibling next sibling parent Andrew Wiley <wiley.andrew.j gmail.com> writes:

On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com> wrote:

 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee a
 O(n ln n) worst case. Am I wrong?

 Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.
Aug 16 2011
prev sibling next sibling parent Josh Simmons <simmons.44 gmail.com> writes:
On Wed, Aug 17, 2011 at 12:40 PM, Andrew Wiley <wiley.andrew.j gmail.com> wrote:
 On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com>
 wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.

There's a lot more to making a hash-map robust than changing the lookup scheme. There's also a large variety of collision resolution methods that all have their pros and cons depending on the data-set and lookup patterns you're dealing with. For a built in AA system I feel simplicity is a much more achievable goal than trying to solve everybody's specific problems. Hence I feel that the current solution is probably a good choice. Just my two cents anyway. Josh.
Aug 16 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Walter Bright:

 It wasn't changed to simplify code, it was changed because it was  
 faster.


And uses much less space -- only one pointer per node vs. two.
 Right.

I'd add that removing the requirement for opCmp is a reasonable request -- this opens up keys to significantly more data types (those which don't have a defined order).
 As for "attacks", if your code is subject to people sending it  
 specially crafted
 input in the hopes of making it run slowly, you can make specially  
 crafted input
 to make trees run slowly, too.

I think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. Am I wrong?

Red-black trees would be overkill here. The ideal hash has only one element per bucket, and at most 2 or 3. This is why you expand the bucket array even though you only have an element-to-bucket ratio of .75 or so. Consider that for one or two (or even sometimes three) elements, a linked list has identical topology, and its insertion/removal algorithm is much less complex.

-Steve
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that guarantee  
 a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve
Aug 17 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Aug 17, 2011, at 7:04 AM, Steven Schveighoffer wrote:
 Yes, but let's not forget the one valid request out of all of this -- if
 trees are no longer being used, opEquals should be used instead of opCmp.
 This allows more possible key types (which don't define an ordering). I
 think this would be a simple druntime change.

Yup. It would make the AAs faster too. I'd love to deprecate the requirement for opCmp with AAs.
Aug 17 2011
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

 Thank you for your answers. And I agree that the current situation is
 overall better than the previous one.

 My original first post of this thread was about other, quite more practical
 problems, like receiving help from the compiler if I am using the hash
 protocol badly, etc. :-)

This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc. (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
Aug 17 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Sean Kelly:

This would be a run-time issue, unless you're asking the compiler to verify
your hash algorithm at compile-time :-p  I'd actually like to have some
introspection functionality so I could find out the average chain length, max
chain length, etc (basically what's provided by the unordered containers from
C++11), but the user would still have to query this stuff to know that
something was wrong.<

What you are talking about is interesting. But what I was asking in point 1 of my original post is something different and simpler; it's a fully static thing. This was my original post of this thread:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=142709

What I was asking in that point 1 is:

1) If the user writes wrong signatures for each of the methods needed by the hash protocol (currently toHash(), opEquals() and opCmp()), then I'd like the compiler to tell me with a compile-time error. I think DMD is already mostly doing this.

2) If the user adds a toHash method to a class or struct, then the compiler must statically require the presence of all the other methods needed by the hash protocol. Currently DMD is not doing this: if you add correct toHash() and opEquals() methods but no opCmp(), the compiler will silently not use them. I think this isn't acceptable in a good language.

--------------

In the successive points of my original post I asked two other things: pushing AAs to a fully templated implementation, and completely removing the need for opCmp from the hash protocol.

--------------

Extra: I presume there is no way to ask (on request) the language to use a stack-style memory allocator for the built-in AAs instead of the normal GC. But once built-in AAs are fully templated, I presume it's not too hard to give an interface to them from Phobos functions too, and then it becomes possible to let them use a different memory allocator.

Bye,
bearophile
Aug 18 2011
prev sibling next sibling parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:

 On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items within a bucket keeps the behaviour O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before physical RAM is exhausted.

Bye,
bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei

Added. Do we also want to add "don't have the compiler generate opCmp
automatically"?

http://www.prowiki.org/wiki4d/wiki.cgi?LanguageDevel
Aug 17 2011
prev sibling next sibling parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 17 Aug 2011 15:40:25 -0700, Sean Kelly wrote:

 This would be a run-time issue, unless you're asking the compiler to
 verify your hash algorithm at compile-time :-p  I'd actually like to
 have some introspection functionality so I could find out the average
 chain length, max chain length, etc (basically what's provided by the
 unordered containers from C++11), but the user would still have to query
 this stuff to know that something was wrong.

What I got from it was that it didn't use his hash until he provided opCmp too, thus the compiler should complain when it isn't going to use the hash you provide.
Aug 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 17 Aug 2011 21:07:29 -0400, Jesse Phillips
<jessekphillips+d gmail.com> wrote:

 On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:

 On 8/17/11 9:04 AM, Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:

 I think there are search trees like the Red-Black ones that
 guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items in the same bucket keeps the behaviour at O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour causes trouble way before the physical RAM is exhausted. Bye, bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change. -Steve

Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) Andrei

Added. Do we also want to add "don't have the compiler generate opCmp automatically"?

So actually, it's technically a joint effort of the compiler and druntime. The compiler doesn't *reject* creating AA's using keys that don't define opCmp, and it doesn't actually generate a function. But druntime *does* do a "default" opCmp if no opCmp is defined. See here: https://github.com/D-Programming-Language/druntime/blob/master/src/object_.d#L955

In essence, this "bug" is fixed by one of two things:

1. Using a fully templated AA implementation instead of an RTTI-based one (which avoids using the TypeInfo.compare function).
2. Not using opCmp anymore.

Since 2 is already happening, I think we are good. I don't know how we should "fix" the TypeInfo_Struct function. Should it just throw an exception if xOpCmp isn't set? -Steve
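Fix #1 above can be sketched as follows; all names here are hypothetical, not actual druntime symbols. In a fully templated bucket walk, `==` binds statically to the key's own opEquals, so TypeInfo.compare (and therefore opCmp) never enters the picture:

```d
// Hypothetical templated collision chain -- illustrative only.
struct Node(K, V) {
    K key;
    V value;
    Node* next;   // singly-linked collision chain
}

// Statically typed lookup: n.key == key resolves at compile time to
// K's own opEquals, with no TypeInfo/RTTI indirection involved.
V* find(K, V)(Node!(K, V)* head, K key) {
    for (auto n = head; n !is null; n = n.next)
        if (n.key == key)
            return &n.value;
    return null;
}

void main() {
    auto c = new Node!(string, int)("c", 3, null);
    auto b = new Node!(string, int)("b", 2, c);
    auto a = new Node!(string, int)("a", 1, b);
    assert(*find(a, "b") == 2);
    assert(find(a, "missing") is null);
}
```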
Aug 17 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?

Sent from my iPhone

On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:

 On Thu, Aug 18, 2011 at 8:40 AM, Sean Kelly <sean invisibleduck.org> wrote:

 On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is ov…

 My original first post of this thread was about other problems, quite mo… …hash protocol badly, etc. :-)

 This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.

 The security issue is basically a DoS one, for example if you know a
 web server is using a specific hash and collision resolution method to
 store message headers you can pass headers that all hash to buckets
 that provide worst-case behavior. In this instance universal hashing
 where a hash function is chosen randomly from a pool of hashes
 combined with good algorithmic complexity means the attacker is unable
 to do this reliably.

 Unrelated though, I'm quite a fan of hopscotch hashing at the moment,
 in theory at least. It'd be interesting to get a few different
 resolution schemes working and compare their performance on various
 workloads.
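The universal-hashing defence described above could look roughly like this in D; SeededHasher and its mixing constants are illustrative assumptions, not anything druntime provides:

```d
import std.random : unpredictableSeed;

// Seed the hash randomly per process so a precomputed set of colliding
// keys does not transfer between runs -- the core of the DoS defence.
struct SeededHasher {
    ulong seed;

    size_t hash(uint x) const nothrow {
        ulong h = x ^ seed;
        h *= 0x9E3779B97F4A7C15UL;   // 64-bit Fibonacci-hashing multiplier
        return cast(size_t)(h ^ (h >> 32));
    }
}

void main() {
    auto hasher = SeededHasher(unpredictableSeed);
    // Deterministic within one run...
    assert(hasher.hash(42) == hasher.hash(42));
    // ...but a differently seeded table disagrees, so an attacker cannot
    // reliably craft keys that all land in one bucket.
    auto other = SeededHasher(hasher.seed + 1);
    assert(hasher.hash(42) != other.hash(42));
}
```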

Aug 18 2011
prev sibling parent Josh Simmons <simmons.44 gmail.com> writes:
On Fri, Aug 19, 2011 at 1:43 AM, Sean Kelly <sean invisibleduck.org> wrote:
 This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?

 Sent from my iPhone

 On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:

 On Thu, Aug 18, 2011 at 8:40 AM, Sean Kelly <sean invisibleduck.org> wrote:


 On Aug 17, 2011, at 2:36 PM, bearophile wrote:

 Walter:

 Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is ov…

 My original first post of this thread was about other problems, quite mo… …hash protocol badly, etc. :-)
 This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p  I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
 The security issue is basically a DoS one, for example if you know a
 web server is using a specific hash and collision resolution method to
 store message headers you can pass headers that all hash to buckets
 that provide worst-case behavior. In this instance universal hashing
 where a hash function is chosen randomly from a pool of hashes
 combined with good algorithmic complexity means the attacker is unable
 to do this reliably.

 Unrelated though, I'm quite a fan of hopscotch hashing at the moment,
 in theory at least. It'd be interesting to get a few different
 resolution schemes working and compare their performance on various
 workloads.


I doubt it; I was just noting the issue, not suggesting it should be fixed.

Cheers,
Josh.
Aug 18 2011
prev sibling next sibling parent kennytm <kennytm gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:
 
 Now and then I compare the running time of C++0x code that uses
 <unordered_map> with the running time of the same operations done with D
 AAs and I don't see good timings for D. Are D AAs true templates or do
 they use some runtime typeid function? If they aren't true templates
 isn't it better to change this?
 
 Bye,
 bearophile

They use runtime typeid functions. A V[K] type and its methods are converted to AssociativeArray!(K, V), which itself is a proxy to those extern(C) aaGetX() functions; some methods resolve to aaGetX directly in DMD. These functions can only know what types they're working on via the typeid.

IMO in DMD V[K]'s methods shouldn't be special. They should be treated just as a syntactic substitution for the AssociativeArray!(K, V) type, and the latter shall implement opIndex etc. for its normal operation. Druntime gets to decide whether to continue using the C interface or turn it into a concrete templated structure. This also solves bug 5350.
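The lowering proposed above could be sketched like this; AssocArray and its internals are hypothetical stand-ins (a real implementation would of course not store a built-in V[K] inside -- that is only so the sketch runs today):

```d
// Hypothetical library type the compiler would rewrite V[K] syntax into,
// implementing the operators itself instead of funneling through
// extern(C) functions plus TypeInfo.
struct AssocArray(K, V) {
    private V[K] impl;   // stand-in storage, purely for the sketch

    ref V opIndex(K k)           { return impl[k]; }
    void opIndexAssign(V v, K k) { impl[k] = v; }
    size_t length() const        { return impl.length; }
}

void main() {
    AssocArray!(string, int) aa;
    aa["one"] = 1;   // would be lowered to aa.opIndexAssign(1, "one")
    assert(aa["one"] == 1);
    assert(aa.length == 1);
}
```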
Aug 16 2011
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 17, 2011 08:59 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com>
 
 wrote:
 On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
 On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
 
 <SeeWebsiteForEmail erdani.org> wrote:
 On 8/16/11 9:29 PM, bearophile wrote:
 Walter Bright:
 I think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. Am I wrong?

Just feed it more data.

 If you feed it more data, even if all items produce collisions because they all hash to the same bucket, using Red-Black trees to handle the items in the same bucket keeps the behaviour at O(n ln n), which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour causes trouble way before the physical RAM is exhausted.

 Bye,
 bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.

I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one though :). The precedent is to require toHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types they can.

Also, currently, if opCmp doesn't exist the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals, which is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering!

I'm not necessarily arguing that we should keep requiring opCmp (and certainly having the compiler generate one is not good IMHO). I'm just pointing out that removing it would place further limitations on our ability to change the implementation later if we decide that we need to. As long as we think that the benefits outweigh the costs, then removing the need for opCmp in AA's probably is what we should do. But regardless, having the compiler create an opCmp for you seems like highly broken behavior.

- Jonathan M Davis
Aug 17 2011