
digitalmars.D - Robustness of the built in associative array

reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hello,

The latest "Implicit fn template instantiation" thread contains code 
with a hidden and possibly hard-to-find bug. Rewriting the code with a 
non-templated version for clarity:

void increment(int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;
	else
		map[key] = 1;
}

To understand the code, one first has to know if the associative array 
is a value type or a reference type.

It turns out it is neither. It is definitely not a value type. The above 
function will in most cases alter the original associative array and 
work as expected. But it is not a reference type either -- in the few 
cases where adding a new key to the AA invokes a rehash, all hell breaks 
loose. The original associative array is suddenly *corrupted*! (The way 
to make the function work as expected is to make the map an inout 
argument, but that is beside my point.)
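For reference, the inout fix mentioned above can be sketched in a few 
lines (D1-era syntax, matching the thread; inout passes the parameter by 
reference, so the caller sees the updated AA handle even when an 
insertion triggers a rehash):

```d
// The same increment function, but taking the map by reference.
// With inout, a rehash inside the callee cannot strand the caller
// with a stale, corrupted view of the AA.
void increment(inout int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;			// key present: bump the value in place
	else
		map[key] = 1;	// key absent: insert; may trigger a rehash
}
```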

Internally, an AA is implemented as an aaA*[] -- a dynamic array of 
pointers to aaA-nodes. The first node of the array contains metadata 
(the total number of keys), and the remaining [1..$] nodes contain the 
hash buckets. When passing an AA to a function, the aaA*[] gets passed. 
I assume the reason for this implementation is efficiency. All you need 
to do an efficient lookup is the array of buckets. But this 
implementation comes at a cost in robustness. To make use of AAs, you 
have to follow a special rule:

- Never add an element to or rehash the AA unless you know you hold the 
only reference.

This means that even though the AA behaves as a reference type, the data 
cannot be shared unless all references are readonly. I feel I must regard 
this bastardization, being neither value type nor reference type, as a 
design mistake.

A simple fix is to change the AA implementation so that AAs become MyAA* 
instead of aaA*[], with MyAA defined as:

struct MyAA {
	AANode[] buckets;
	uint numberOfNodes;
}

You get a very slight additional lookup cost (double dereferencing to 
get to the buckets), but make the AA a pure reference type without any 
caveats and problems.

My questions are:

1) Why is the D built in array type designed in such a fragile way? 
Should D not be a robust language? Do users have to resort to writing 
library wrappers around D's types to get robustness? If so:

2) Why even keep the builtin AA? With the addition of an opIn() 
overload, you could get *exactly* the same functionality from a library 
implementation:

Map!(char[],int) map;

rather than:

int[char[]] map;

Cons:
  - This slight syntax difference.

Pros:
  - The library version would be transparently replaceable by other 
implementations when the data or other conditions have different 
algorithmic requirements.
  - The library implementation could easily be more efficient than the 
current built in version by taking advantage of templates.
  - The library implementation could without problems be made to work 
with types that lack opCompare.
  - The library version would not have any strange quirks and gotchas. 
(Dangerous usage cases like above)
  - Greater library/language decoupling.
  - The possibility of implementing an UnsafeMap type with high 
performance but less safety for the few cases where the performance 
really matters.
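To make the comparison concrete, here is a minimal sketch of what such a 
library type could look like (the name Map, its methods, and the use of 
opIn_r for the `in` operator are assumptions for illustration, not an 
existing library):

```d
// Hypothetical library AA. Being a class, it is a true reference
// type: every copy of the handle sees inserts and rehashes.
class Map(K, V) {
	private V[K] impl;	// stand-in storage for this sketch; a real
						// Map would own and manage its own buckets

	V    opIndex(K key)              { return impl[key]; }
	void opIndexAssign(V val, K key) { impl[key] = val; }
	V*   opIn_r(K key)               { return key in impl; }
	size_t length()                  { return impl.length; }
}
```

Usage would be `auto map = new Map!(char[], int); map["x"] = 1;` -- 
close to, but as the follow-ups note, not exactly the builtin syntax.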

In my opinion, D's lack of protection (const arrays, etc) and robustness 
against programming mistakes is D's greatest weakness.

My contribution to the slogan thread:

"D - Programming without a lifeline" :)

/Oskar
Mar 24 2006
next sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
Hello,

The latest "Implicit fn template instantiation" thread contains code 
that contains a hidden and possibly hard to find bug.
[...]

Good catch!
2) Why even keep the builtin AA? With the addition of an opIn() 
overload, you could get *exactly* the same functionality from a library 
implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

I find it rather confusing and frustrating. In my current project I 
ended up making my own template AA to get around some of the 
deficiencies of the builtin type. Mostly I wanted to avoid double 
lookups, and easily allow custom hashers without encapsulating the key 
in a struct/class (think case-insensitive char[] keys.) The result was 
also faster and more memory efficient than the builtin AA.

Nick
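Nick's double-lookup point can be made concrete. Since opIndex cannot 
return a reference, a custom type cannot express `ii[10]++`; what a 
library AA can do instead is expose a single-lookup helper that returns 
a pointer to the slot (the name getOrInsert is made up for illustration):

```d
// One logical lookup instead of two: find the slot for key, insert
// def if absent, and hand back a pointer so the caller can do (*p)++.
V* getOrInsert(K, V)(inout V[K] aa, K key, V def) {
	if (auto p = key in aa)
		return p;		// found: done after one hash lookup
	aa[key] = def;		// absent: insert the default...
	return key in aa;	// ...and fetch the freshly inserted slot
}
```

(The second `in` after the insert is needed in this sketch because 
insertion may rehash; a real library AA would return the node pointer 
directly from its own insert routine.)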
Mar 24 2006
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!

Indeed. It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not. But the original should never be rendered unusable.
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Sean
Mar 24 2006
next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:
 Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...


 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a 
 library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p

Consider the consequences of using a pointer in the above example. ;)
 I find it rather confusing and frustrating. In my current project I 
 ended up
 making my own template AA to get around some of the deficiencies of 
 the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom 
 hashers
 without encapsulating the key in a struct/class (think 
 case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the 
 builtin AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Is there any way to retain the syntax and ease of use of the built in AA 
with a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

/Oskar
Mar 24 2006
parent reply Nick <Nick_member pathlink.com> writes:
In article <e01a21$te2$5 digitaldaemon.com>, Oskar Linde says...
Is there any way to retain the syntax and ease of use of the built in AA 
with a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

I think the STL containers do a very good job at being both flexible and 
usable, covering most normal cases of use. The thing missing from D to 
make an STL-like library would be the reference return type. Some of the 
syntax would still be impossible to duplicate, eg.

int[int] ii;
ii[1]++;         // This inserts the key '1'
int i = ii[2];   // This does NOT insert '2'

but I always thought this was the wrong way to do it anyway. (Eg. what 
should happen when the value is a struct and you use ii[3].foo?)

Nick
Mar 25 2006
next sibling parent Wolfgang Draxinger <wdraxinger darkstargames.de> writes:
Nick wrote:

 but I always thought this was the wrong way to do it anyway. (Eg. what
 should happen when the value is a struct and you use ii[3].foo?)

Raise an exception? Wolfgang Draxinger
Mar 28 2006
prev sibling parent Wolfgang Draxinger <wdraxinger darkstargames.de> writes:
Nick wrote:

 I think the STL containers do a very good job at being both flexible and
 usable, covering most normal cases of use. The thing missing from D to
 make a STL-like library would be the reference return type.

Well, there's just the little problem that C++ '&' references returned 
from functions are not safe either. A C++ reference is IMHO nothing 
better than a strongly typed pointer.

example

class foo {
	foo(int bar) { a = new int[bar]; }
	int *a; // ignore the possible bounds error ;-)
	int & operator [] (int i) { return a[i]; }
	int & something () {
		int hello = 123;
		return hello; // dangling reference to a local!
	}
};

This isn't a bit better than using a pointer instead of a reference.

The real problem currently is that the index expression can be used as 
an lvalue, but opIndex does not yet have support for it. But why not 
solve it this way: the code

int[] aa;
aa[i]++;

compiles into execution of

aa.opIndexAssign(aa.opIndex(i)+1, i);

Now for the support of user defined AAs I'd like to suggest a new 
overload of opIndex and opIndexAssign, or new operator functions opKey 
and opKeyAssign, which take a parameter of the key type as first 
parameter. Also I'd introduce a base AA type, so that one can easily 
derive from it. To use such a user defined AA class I'd like to suggest 
the following syntax:

int[char[]]!MyAA an_AA;

The idea is that any sort of AA is in fact an implicit instantiation of 
the builtin AA template, i.e.

int[char[]] an_AA;

is the same as

int[char[]]!__builtinAA an_AA;

Instead of a struct I'd use classes for this. AAs are dynamic and thus 
allocated on the heap anyway. But making them a class would easily allow 
catching such ambiguous things as Oskar had found. So logically an AA 
class would be declared as

class __builtinAA(T, K) { };

And a derived class

class MyAA(T, K) : __builtinAA!(T, K) { };

or even

class IntToStringAA : __builtinAA!(char[], int) { };

Eventually derive __builtinAA from a common base, non-templated AA type, 
so that RTTI clearly recognizes it as an AA, and not derive from a 
templated class.

How about this?

Wolfgang Draxinger
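Wolfgang's proposed rewrite can be spelled out by hand today; the 
compiler support he asks for would simply generate the second form from 
the first. A small sketch with a made-up indexable struct:

```d
// A toy indexable type; `s[i]++` does not compile for it in D1,
// but the proposed lowering can be written manually.
struct S {
	int[16] data;
	int  opIndex(size_t i)              { return data[i]; }
	void opIndexAssign(int v, size_t i) { data[i] = v; }
}

void bump(inout S s, size_t i) {
	// manual expansion of the proposed lowering of `s[i]++`:
	s.opIndexAssign(s.opIndex(i) + 1, i);
}
```

Note the cost: this read-modify-write performs two lookups, which is 
exactly what a reference-returning opIndex would avoid.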
Mar 28 2006
prev sibling next sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
Indeed.  It seems roughly similar to how arrays are handled: in place 
modifications persist, but anything causing a realloc does not.  But the 
original should never be rendered unusable.

True. Except that adding an element to an array (when no realloc occurs) does not change the original either.
This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice.  I 
definately do like having built in AAs, but I do find some of the 
consequences irritating.

Sort of OT, but I've studied the builtin AA code and have come to the 
conclusion that opCmp() isn't needed NOW either. In the binary tree 
implementation that Walter used, he first compares the hash values, and 
only if those match does he call opCmp(). Since actual hash collisions 
should be pretty uncommon(*), opCmp is almost never called(**). In the 
rare case of a hash collision, one could make a special case and eg. 
always place it in the left node.

(*) It is uncommon for decent hash functions. It seems the builtin AA is 
optimized to also work with very bad hashers (in which case it 
approaches a pure binary tree.) I guess this isn't a bad feature for a 
generic AA, but it is also some of the reason why my custom hash map 
outperformed it (I didn't use btrees at all.)

(**) Just did a test with an english dictionary of 173528 words and the 
standard D string hasher. opCmp() was called 1840 times, slightly over 1%.

Nick
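The hash-first comparison Nick describes is a standard hash-table trick: 
cache each entry's full hash in its node and fall back to the expensive 
key comparison only on a genuine collision. A sketch (the node layout is 
illustrative, not the actual runtime's):

```d
// Illustrative bucket node: the cached hash is checked first, so the
// key comparison (the opCmp/opEquals analogue) runs only when two
// different keys happen to share a hash.
struct Node {
	uint   hash;	// full hash of the key, cached at insert time
	char[] key;
	int    value;
}

bool matches(Node* n, uint h, char[] k) {
	if (n.hash != h)
		return false;	// cheap reject; the overwhelmingly common case
	return n.key == k;	// rare: hashes equal, compare the keys
}
```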
Mar 25 2006
parent reply Dave <Dave_member pathlink.com> writes:
In article <e0360o$28ep$1 digitaldaemon.com>, Nick says...
(*) It is uncommon for decent hash functions. It seems the builtin AA is
optimized to also work with very bad hashers (in which case it approaches a pure
binary tree.) I guess this isn't a bad feature for a generic AA, but it is also
some of the reason why my custom hash map outperformed it (I didn't use btrees
at all.)

Just curious - what portion of the perf. diff. was that? What data 
types? And in your estimation what accounts for the rest of the diff. 
(e.g. are you pre-allocating a bunch of space for the data (as opposed 
to the key) or not using the GC)?

Thanks,

- Dave
Mar 26 2006
parent reply Nick <Nick_member pathlink.com> writes:
In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
Just curious - what portion of the perf. diff. was that?

I don't remember exactly. I did a btree implementation, but when I saw 
that it was slightly slower than the (simpler) list implementation (and 
definitely not faster), I just scrapped it.

Note that from a theoretical viewpoint, the btrees only make sense when 
the average bucket size is larger than 2. This practically only happens 
when the lookup table is too small compared to the number of elements, 
or when you use a bad hash function. Both probably will occur "in the 
wild", though. The DMD AA should keep the btrees. I knew my data 
beforehand, so I didn't need them.
What data types?

The speed tests were done with char[] keys and int values.
 And in your estimation what accounts for the rest of the diff.

For lookups, hard to say. I found that using a simple power-of-two 
table size and a bitmask for lookups was just as good as messing about 
with prime numbers, but only gave a minuscule performance increase. 
Perhaps templated code just means less overhead. In any case, the 
difference was very small, and the builtin AA is very fast.

Inserts, however, are a different story. The builtin AA starts with a 
small table size (97), and resizes (rehashes) at regular intervals as 
you insert elements. Being able to specify the approximate number of 
elements _in advance_ (thus avoiding the rehashes) gave me a 3x-5x 
performance increase when inserting.
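The 3x-5x insertion speedup is easy to picture: if the table is sized 
for the expected element count up front, no insert ever triggers a 
rehash. A sketch of the kind of pre-sized table Nick describes (all 
names here are hypothetical; his actual implementation is not shown in 
the thread):

```d
// Hypothetical pre-sized chained hash table. The power-of-two bucket
// count lets lookups use a bitmask instead of a modulo.
struct Node(K, V) {
	uint hash;
	K key;
	V value;
	Node!(K, V)* next;	// collision chain
}

class PresizedMap(K, V) {
	private Node!(K, V)*[] buckets;

	this(size_t expected) {
		// keep the load factor below 1/2 so chains stay short
		size_t n = 16;
		while (n < expected * 2)
			n *= 2;
		buckets.length = n;	// sized once; inserts never rehash
	}

	size_t bucketOf(uint hash) { return hash & (buckets.length - 1); }
}
```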
(e.g. are you pre-allocating a bunch of space for the data (as opposed to the
 key) or not using the GC)?

Nope, the default allocator is the GC, and one new node struct is 
new'ed for each insert. (But with a malloc or region allocator it 
becomes much faster.)

Nick
Mar 27 2006
parent Dave <Dave_member pathlink.com> writes:
In article <e08fi6$ouf$1 digitaldaemon.com>, Nick says...
In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
Just curious - what portion of the perf. diff. was that?


Nick

Thanks for that info.
Mar 27 2006
prev sibling next sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Sean Kelly wrote:
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a 
 library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p

There are two distinct things here, which I hope you already know:

* Reference types (as in vs. value types)
* Reference types or references (as in C++ references)

I don't know any unambiguous terminology for these two terms; there may 
not even be one (correct me if wrong). However, nowadays, most of the 
time when one says "reference types" it means the first thing (myself 
included). So please, when you speak of the C++ references, please 
disambiguate the term. (usually called just "references", although this 
is still very close to the other term)

As for your point: yes, this demonstrates a good usage scenario for a 
"C++ reference". Should there be such a feature in D? Or can this 
problem be solved in an alternate way? Is the added complexity of this 
feature worth it? I don't know, I cannot come to a conclusion. :/

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 26 2006
prev sibling parent Hong <Hong_member pathlink.com> writes:
This is just a perfect demonstration of the problem with D value types. 
A way to solve this would be to extend the inout keyword to function 
returns and variables, such that [] would look like:

inout T opIndex(size_t i)

Hong

In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!

Indeed. It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not. But the original should never be rendered unusable.
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Sean

Mar 28 2006
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

You're correct. I don't know how I could have forgotten that. D's lack 
of a reference return type has implications for properties too.
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

I have similar experiences implementing a custom templated hash-table. /Oskar
Mar 24 2006
prev sibling next sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Oskar Linde wrote:
 Hello,
 
 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug. Rewriting the code 
 with a non-templated version for clarity:
 
 void increment(int[char[]] map, char[] key) {
     if (auto t = (key in map))
         (*t)++;
     else
         map[key] = 1;
 }
 
 To understand the code, one first has to know if the associative array 
 is a value type or a reference type.
 
 It turns out it is neither. It is definitely not a value type. The above 
 function will in most cases alter the original associative array and 
 work as expected. But it is not a reference type either -- in the few 
 cases where adding a new key to the AA invokes a rehash, all hell breaks 
 loose. The original associative array is suddenly *corrupted*! (The way 
 to make the function work as expected is to make the map an inout 
 argument, but that is beside my point.)
 
 Internally, an AA is implemented as an aaA*[] -- a dynamic array of 
 pointers to aaA-nodes. The first node of the array contains metadata 
 (the total number of keys), and the remaining [1..$] nodes contains the 
 hash buckets. When passing an aa to a function, the aaA*[] gets passed. 
 I assume the reason for this implementations is efficiency. All you need 
 to do an efficient lookup is the array of buckets. But is this 
 implementation comes at a cost of robustness. To make use of AA's, you 
 have to follow a special rule:
 
 - Never add an element to or rehash the AA unless you know you hold the 
 only reference.
 
 This means that even though the AA behaves as a reference type, the data 
 can not be shared unless all references are readonly. I feel I must see 
 this bastardization of being neither value or reference type a design 
 mistake.
 

Oskar, thanks for pointing that out. That was one issue that I hadn't 
noticed yet, and I agree wholeheartedly that it is quite a design 
problem.

As a side note: I would prefer saying that these are more like broken 
reference types, rather than saying they are not reference types, since 
that may not be the best way to describe it (not being reference or 
value types is more the nature of the problem of static arrays). 
Anyway, in any case your description of the problem is quite clear.

Also, this illustrates another issue, which is the tricky reference 
semantics of dynamic arrays. Dynamic arrays are reference types, but 
they have a "catch", in that unlike all other reference types, there is 
an operation other than assignment that changes the identity of the 
array: the length change operation. But what is *really bad* is that it 
is not "deterministic"[1] whether the resulting array from a length 
change will be a brand new one or not. The identity (or perhaps better 
said, the contents) can be all new, or partially the same.

[1] "deterministic" is not the 100% correct word (as it is more the 
opposite of random), but I'm missing a better term.
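Bruno's point about the length change operation can be seen in a few 
lines; whether the two arrays still alias afterwards depends on spare 
capacity in the underlying memory block, which the caller cannot 
observe:

```d
// After growing b, whether a and b still share data is up to the
// allocator: in-place growth keeps them aliased, a reallocation
// silently splits them apart.
void lengthChangeDemo() {
	int[] a = new int[4];
	int[] b = a;				// two references to the same data
	b.length = b.length + 1;	// may grow in place, or may reallocate
	b[0] = 42;
	// a[0] is 42 only if growth happened in place: the
	// "non-deterministic" identity change described above.
}
```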
 A simple fix is to change the AA implementation so that AAs become MyAA* 
  instead of aaA*[], with MyAA defined as:
 
 struct MyAA {
     AANode[] buckets;
     uint numberOfNodes;
 }
 
 You get a very slight additional lookup cost (double dereferencing to 
 get to the buckets), but make the AA a pure reference type without any 
 caveats and problems.
 
 My questions are:
 
 1) Why is the D built in array type designed in such a fragile way? 
 Should D not be robust language? Do users have to resort to writing 
 library wrappers around D's types to get robustness? If so:
 

I'd like to know that too. One more problem to the list, and yet people 
are already thinking about 1.0 and D slogans and whatnot. :/

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 26 2006
prev sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message 
news:e00v0m$te2$3 digitaldaemon.com...
 It turns out it is neither. It is definitely not a value type. The above 
 function will in most cases alter the original associative array and work 
 as expected. But it is not a reference type either -- in the few cases 
 where adding a new key to the AA invokes a rehash, all hell breaks loose. 
 The original associative array is suddenly *corrupted*! (The way to make 
 the function work as expected is to make the map an inout argument, but 
 that is beside my point.)

I hadn't realized that could happen, and it clearly needs to be fixed.
 A simple fix is to change the AA implementation so that AAs become MyAA* 
 instead of aaA*[], with MyAA defined as:

 struct MyAA {
     AANode[] buckets;
     uint numberOfNodes;
 }

 You get a very slight additional lookup cost (double dereferencing to get 
 to the buckets), but make the AA a pure reference type without any caveats 
 and problems.

I'll make that change.
Mar 27 2006