www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - False matches in AAs revisited

reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Using DMD 0.128, Windows 98SE.

I'd got the impression that associative array lookup was a three-stage 
process:
(i) toHash - to see which hash slot to look in
(ii) opCmp - to do a binary search within the hash slot
(iii) opEquals - to check that the key it's found really does match

However, my recent experiments have shown that it skips stage (iii).  At 
first this seemed to be a problem only with structs, but at least now 
it's happening with classes just the same.  As this program shows:

----------
import std.stdio;
import std.ctype;

class IChar {
     char data;

     this(char d) { data = d; }

     int opEquals(Object o)
     in {
         assert (cast(IChar) o !is null);
     }
     body {
         writefln("Entered IChar.opEquals(Object)");
         return opEquals(cast(IChar) o);
     }

     bit opEquals(IChar i) {
         writefln("Entered IChar.opEquals(IChar)");
         return data == i.data;
     }

     int opCmp(Object o)
     in {
         assert (cast(IChar) o !is null);
     }
     body {
         writefln("Entered IChar.opCmp(Object)");
         return opCmp(cast(IChar) o);
     }

     int opCmp(IChar i) {
         writefln("Entered IChar.opCmp(IChar)");
         return toupper(data) - toupper(i.data);
     }

     uint toHash() {
         writefln("Entered IChar.toHash()");
         return toupper(data);
     }

     unittest {
         IChar c = new IChar('c');
         IChar d = new IChar('c');
         IChar C = new IChar('C');
         IChar e = new IChar('e');
         assert (c == d);
         assert (c == C);
         assert (c != e);

         puts("Units tested");
     }
}

void main() {
     int[IChar] aa;

     IChar c1 = new IChar('c');
     IChar c2 = new IChar('c');
     IChar c3 = new IChar('C');

     aa[c1] = 4;
     writefln("Added aa[c1]");
     aa.rehash;
     writefln("Rehashed, looking up aa[c1]");

     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }

     writefln("looking up aa[c2]");
     if (c2 in aa) {
         writefln("aa[c2]:");
         writefln(aa[c2]);
     } else {
         writefln("aa[c2] doesn't exist");
     }

     writefln("looking up aa[c3]");
     if (c3 in aa) {
         writefln("aa[c3]:");
         writefln(aa[c3]);
     } else {
         writefln("aa[c3] doesn't exist");
     }

     aa[c3] = 7;
     writefln("Added aa[c3]");
     aa.rehash;
     writefln("Rehashed, looking up aa[c1]");

     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }

     writefln("looking up aa[c2]");
     if (c2 in aa) {
         writefln("aa[c2]:");
         writefln(aa[c2]);
     } else {
         writefln("aa[c2] doesn't exist");
     }

     writefln("looking up aa[c3]");
     if (c3 in aa) {
         writefln("aa[c3]:");
         writefln(aa[c3]);
     } else {
         writefln("aa[c3] doesn't exist");
     }

     writefln("removing aa[c3]");
     aa.remove(c3);

     writefln("removed aa[c3], looking up aa[c1]");
     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }
}
----------

As you see if you try running it, opEquals is never called.  It just 
relies on opCmp to compare the objects for equality.  This doesn't work 
if, as in this case, unequal objects can rank equally in order.

It would be trivial to make it call opEquals after digging down the 
binary search tree to check for true equality.  It also wouldn't be 
difficult to dig further down by a combination of opCmp and opEquals 
calls if the equally ranking key it's found doesn't turn out to be 
equal.  But see also

digitalmars.D.bugs/4555

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB  P+ L E  W++  N+++ o K- w++  O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.
Jul 18 2005
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
Lets say that opCmp() said that the object it found was equal, BUT, opEqual()
thought it was not.  What should the AA insertion / search code do?

So, I think this is intentional.  It is probably better to use opCmp() to test
equality, because it can tell you which branch of the binary tree to take if it
is not equal.  opEqual only says 1 or 0.

This problem shows up in Java, and C++ too.  The C++ containers use a similar
technique to D -- they call "less<T>" to determine order, and to test equality I
think they assume that (! (less<T>(a,b) || less<T>(b,a)) means that a==b.

I think this is a general issue in object oriented languages though -- opEqual()
and opCmp() can be defined to implement inconsistent comparisons.  You can fix
this in various ways (always use opCmp() for example), but they all seem to
impede performance.  But Object comparisons really do have to be fast, so it
becomes a case of caveat programmor.

Kevin

In article <dbfu2v$2kiv$1 digitaldaemon.com>, Stewart Gordon says...
Using DMD 0.128, Windows 98SE.

I'd got the impression that associative array lookup was a three-stage 
process:
(i) toHash - to see which hash slot to look in
(ii) opCmp - to do a binary search within the hash slot
(iii) opEquals - to check that the key it's found really does match

However, my recent experiments have shown that it skips stage (iii).  At 
first this seemed to be a problem only with structs, but at least now 
it's happening with classes just the same.  As this program shows:

----------
import std.stdio;
import std.ctype;

class IChar {
     char data;

     this(char d) { data = d; }

     int opEquals(Object o)
     in {
         assert (cast(IChar) o !is null);
     }
     body {
         writefln("Entered IChar.opEquals(Object)");
         return opEquals(cast(IChar) o);
     }

     bit opEquals(IChar i) {
         writefln("Entered IChar.opEquals(IChar)");
         return data == i.data;
     }

     int opCmp(Object o)
     in {
         assert (cast(IChar) o !is null);
     }
     body {
         writefln("Entered IChar.opCmp(Object)");
         return opCmp(cast(IChar) o);
     }

     int opCmp(IChar i) {
         writefln("Entered IChar.opCmp(IChar)");
         return toupper(data) - toupper(i.data);
     }

     uint toHash() {
         writefln("Entered IChar.toHash()");
         return toupper(data);
     }

     unittest {
         IChar c = new IChar('c');
         IChar d = new IChar('c');
         IChar C = new IChar('C');
         IChar e = new IChar('e');
         assert (c == d);
         assert (c == C);
         assert (c != e);

         puts("Units tested");
     }
}

void main() {
     int[IChar] aa;

     IChar c1 = new IChar('c');
     IChar c2 = new IChar('c');
     IChar c3 = new IChar('C');

     aa[c1] = 4;
     writefln("Added aa[c1]");
     aa.rehash;
     writefln("Rehashed, looking up aa[c1]");

     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }

     writefln("looking up aa[c2]");
     if (c2 in aa) {
         writefln("aa[c2]:");
         writefln(aa[c2]);
     } else {
         writefln("aa[c2] doesn't exist");
     }

     writefln("looking up aa[c3]");
     if (c3 in aa) {
         writefln("aa[c3]:");
         writefln(aa[c3]);
     } else {
         writefln("aa[c3] doesn't exist");
     }

     aa[c3] = 7;
     writefln("Added aa[c3]");
     aa.rehash;
     writefln("Rehashed, looking up aa[c1]");

     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }

     writefln("looking up aa[c2]");
     if (c2 in aa) {
         writefln("aa[c2]:");
         writefln(aa[c2]);
     } else {
         writefln("aa[c2] doesn't exist");
     }

     writefln("looking up aa[c3]");
     if (c3 in aa) {
         writefln("aa[c3]:");
         writefln(aa[c3]);
     } else {
         writefln("aa[c3] doesn't exist");
     }

     writefln("removing aa[c3]");
     aa.remove(c3);

     writefln("removed aa[c3], looking up aa[c1]");
     if (c1 in aa) {
         writefln("aa[c1]:");
         writefln(aa[c1]);
     } else {
         writefln("aa[c1] doesn't exist");
     }
}
----------

As you see if you try running it, opEquals is never called.  It just 
relies on opCmp to compare the objects for equality.  This doesn't work 
if, as in this case, unequal objects can rank equally in order.

It would be trivial to make it call opEquals after digging down the 
binary search tree to check for true equality.  It also wouldn't be 
difficult to dig further down by a combination of opCmp and opEquals 
calls if the equally ranking key it's found doesn't turn out to be 
equal.  But see also

digitalmars.D.bugs/4555

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB  P+ L E  W++  N+++ o K- w++  O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Jul 20 2005
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Kevin Bealer wrote:
 Lets say that opCmp() said that the object it found was equal, BUT, opEqual() 
 thought it was not.  What should the AA insertion / search code do?

The spec doesn't refer to opEqual. But I'm guessing you mean opEquals. It should use it.
 So, I think this is intentional.  It is probably better to use opCmp() to test 
 equality, because it can tell you which branch of the binary tree to take if
it 
 is not equal.  opEqual only says 1 or 0.

Up to a point....
 This problem shows up in Java, and C++ too.  The C++ containers use a similar 
 technique to D -- they call "less<T>" to determine order, and to test equality
I 
 think they assume that (! (less<T>(a,b) || less<T>(b,a)) means that a==b.

How is less different from the < operator? Or is it just something that the STL provides so that you can pretend incomparable classes are comparable?
 I think this is a general issue in object oriented languages though --
opEqual() 
 and opCmp() can be defined to implement inconsistent comparisons.

This can sometimes be by design. Look at BigDecimal in Java for example.
 You can fix this in various ways (always use opCmp() for example), 
 but they all seem to impede performance.  But Object comparisons 
 really do have to be fast, so it becomes a case of caveat programmor.

Using only opCmp impedes generality more than it impedes performance. The same goes as for many other long-pending issues surrounding AAs and opCmp: if this is intended behaviour, it would need to be documented. Stewart. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/M d- s:- a->--- UB P+ L E W++ N+++ o K- w++ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y ------END GEEK CODE BLOCK------ My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Jul 21 2005
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <dbnq4l$1tfh$1 digitaldaemon.com>, Stewart Gordon says...
Kevin Bealer wrote:
 Lets say that opCmp() said that the object it found was equal, BUT, opEqual() 
 thought it was not.  What should the AA insertion / search code do?

The spec doesn't refer to opEqual. But I'm guessing you mean opEquals. It should use it.

Let me explain what I mean. It has to use opCmp() to find the answer in the binary tree. Once it finds an answer that way, you would like it to use opEquals()? If so, what should it do with it? If it returns true, obviously we found the answer. If it returns false, what should the action be?
 So, I think this is intentional.  It is probably better to use opCmp() to test 
 equality, because it can tell you which branch of the binary tree to take if
it 
 is not equal.  opEqual only says 1 or 0.

Up to a point....
 This problem shows up in Java, and C++ too.  The C++ containers use a similar 
 technique to D -- they call "less<T>" to determine order, and to test equality
I 
 think they assume that (! (less<T>(a,b) || less<T>(b,a)) means that a==b.

How is less different from the < operator? Or is it just something that the STL provides so that you can pretend incomparable classes are comparable?

No difference, really. I think less<T> is a template argument, but if you don't provide it, the sorting in STL uses operator<. In any case, the same argument applies to operator<.
 I think this is a general issue in object oriented languages though --
opEqual() 
 and opCmp() can be defined to implement inconsistent comparisons.

This can sometimes be by design. Look at BigDecimal in Java for example.
 You can fix this in various ways (always use opCmp() for example), 
 but they all seem to impede performance.  But Object comparisons 
 really do have to be fast, so it becomes a case of caveat programmor.

Using only opCmp impedes generality more than it impedes performance. The same goes as for many other long-pending issues surrounding AAs and opCmp: if this is intended behaviour, it would need to be documented. Stewart.

Documentation is good, but I don't understand the generality argument you are making. Kevin
-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB  P+ L E  W++  N+++ o K- w++  O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Jul 22 2005
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Kevin Bealer wrote:
 In article <dbnq4l$1tfh$1 digitaldaemon.com>, Stewart Gordon says...
 Kevin Bealer wrote:
 Lets say that opCmp() said that the object it found was equal, BUT, opEqual() 
 thought it was not.  What should the AA insertion / search code do?

The spec doesn't refer to opEqual. But I'm guessing you mean opEquals. It should use it.

Let me explain what I mean. It has to use opCmp() to find the answer in the binary tree. Once it finds an answer that way, you would like it to use opEquals()? If so, what should it do with it? If it returns true, obviously we found the answer. If it returns false, what should the action be?

That given in the last paragraph of my original post. <snip>
 This problem shows up in Java, and C++ too.  The C++ containers use a similar 
 technique to D -- they call "less<T>" to determine order, and to test equality
I 
 think they assume that (! (less<T>(a,b) || less<T>(b,a)) means that a==b.

How is less different from the < operator? Or is it just something that the STL provides so that you can pretend incomparable classes are comparable?

No difference, really. I think less<T> is a template argument, but if you don't provide it, the sorting in STL uses operator<. In any case, the same argument applies to operator<.

I see.... <snip>
 Using only opCmp impedes generality more than it impedes performance. 
 The same goes as for many other long-pending issues surrounding AAs and 
 opCmp: if this is intended behaviour, it would need to be documented.

Documentation is good, but I don't understand the generality argument you are making.

Have you not been listening? 1. It doesn't work on types that have no ordering. 2. It doesn't work on types where unequal objects can rank equally in order. Stewart. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/M d- s:- a->--- UB P+ L E W++ N+++ o K- w++ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y ------END GEEK CODE BLOCK------ My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Jul 25 2005
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <dc2e89$1h10$1 digitaldaemon.com>, Stewart Gordon says...
Kevin Bealer wrote:
 In article <dbnq4l$1tfh$1 digitaldaemon.com>, Stewart Gordon says...
 Kevin Bealer wrote:
 Lets say that opCmp() said that the object it found was equal, BUT, opEqual() 
 thought it was not.  What should the AA insertion / search code do?

The spec doesn't refer to opEqual. But I'm guessing you mean opEquals. It should use it.

Let me explain what I mean. It has to use opCmp() to find the answer in the binary tree. Once it finds an answer that way, you would like it to use opEquals()? If so, what should it do with it? If it returns true, obviously we found the answer. If it returns false, what should the action be?

That given in the last paragraph of my original post. <snip>
 This problem shows up in Java, and C++ too.  The C++ containers use a similar 
 technique to D -- they call "less<T>" to determine order, and to test equality
I 
 think they assume that (! (less<T>(a,b) || less<T>(b,a)) means that a==b.

How is less different from the < operator? Or is it just something that the STL provides so that you can pretend incomparable classes are comparable?

No difference, really. I think less<T> is a template argument, but if you don't provide it, the sorting in STL uses operator<. In any case, the same argument applies to operator<.

I see.... <snip>
 Using only opCmp impedes generality more than it impedes performance. 
 The same goes as for many other long-pending issues surrounding AAs and 
 opCmp: if this is intended behaviour, it would need to be documented.

Documentation is good, but I don't understand the generality argument you are making.

Have you not been listening? 1. It doesn't work on types that have no ordering. 2. It doesn't work on types where unequal objects can rank equally in order. Stewart.

I see your point. However, I would prefer the current, faster, bin-tree list to changing the behavior for 1. I can code up an arbitrary order if necessary for objects that I need. I usually think its bad form to code a type where opEquals() != (0 == opCmp()). I know people do this, but I think equality should be equality. To me the fact that two methods exist is a performance thing, no more. In other words, having opCmp() follow a looser standard for returning "0" than opEquals() has for returning "1" seems wrong to me. The opCmp() (in my opinion) should check all the fields that opEquals() uses and impose some kind of directionality. BTW, I've heard arguments that some common cases of "inexact equality" are considered poor design now -- ie comparing floating point numbers with an "epsilon". This is tangential to the discussion though. Kevin
Jul 27 2005
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Kevin Bealer wrote:
<snip>
 I see your point.  However, I would prefer the current, faster, bin-tree list
to
 changing the behavior for 1. 

There seems to be an assumption polluting some of the discussions about AAs and opCmp - that there should be only one AA implementation. As I've said a few times, if we reimplemented AAs using templates and got rid of Object.opCmp, it would be simple to choose one implementation for comparable classes and another for incomparable ones. This is already possible with structs and unions without any changes to the language. Determinining whether opEquals(o) and opCmp(o) == 0 are equivalent is of course harder. But should it be a default behaviour to assume this equivalence? If we didn't, then we could have a pragma to persuade the compiler to assume that opEquals(o) == (opCmp(o) == 0) and hence hook up an implementation that skips the opEquals stage of the lookup. Of course, anyone could forget to specify this pragma. But that's a bit like "forgetting" to define an opCmp at all.
 I can code up an arbitrary order if necessary for objects that I need.

True, but this effectively cancels out the statement that "For some objects, testing for less or greater makes no sense". Because the object that you "need" might be part of a third-party library.
 I usually think its bad form to code a type where opEquals() != (0 == opCmp()).
 I know people do this, but I think equality should be equality.  To me the fact
 that two methods exist is a performance thing, no more.  In other words, having
 opCmp() follow a looser standard for returning "0" than opEquals() has for
 returning "1" seems wrong to me.  The opCmp() (in my opinion) should check all
 the fields that opEquals() uses and impose some kind of directionality.

Here's an idea. When I've finished catching up after my week away, I'll start a vote over on d.D. Stewart. -- My e-mail is valid but not my primary mailbox. Please keep replies on on the 'group where everyone may benefit.
Aug 06 2005