
digitalmars.D - Robustness of the built in associative array

reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hello,

The latest "Implicit fn template instantiation" thread contains code 
with a hidden and possibly hard-to-find bug. Rewriting the code with a 
non-templated version for clarity:

void increment(int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;
	else
		map[key] = 1;
}

To understand the code, one first has to know if the associative array 
is a value type or a reference type.

It turns out it is neither. It is definitely not a value type. The above 
function will in most cases alter the original associative array and 
work as expected. But it is not a reference type either -- in the few 
cases where adding a new key to the AA invokes a rehash, all hell breaks 
loose. The original associative array is suddenly *corrupted*! (The way 
to make the function work as expected is to make the map an inout 
argument, but that is beside my point.)
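For reference, the inout fix mentioned above can be sketched in a few 
lines (D1-era syntax, matching the thread; inout passes the parameter by 
reference, so the caller sees the updated AA handle even when an 
insertion triggers a rehash):

```d
// The same increment function, but taking the map by reference.
// With inout, a rehash inside the callee cannot strand the caller
// with a stale, corrupted view of the AA.
void increment(inout int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;			// key present: bump the value in place
	else
		map[key] = 1;	// key absent: insert; may trigger a rehash
}
```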

Internally, an AA is implemented as an aaA*[] -- a dynamic array of 
pointers to aaA-nodes. The first node of the array contains metadata 
(the total number of keys), and the remaining [1..$] nodes contain the 
hash buckets. When passing an AA to a function, the aaA*[] gets passed. 
I assume the reason for this implementation is efficiency. All you need 
to do an efficient lookup is the array of buckets. But this 
implementation comes at a cost in robustness. To make use of AAs, you 
have to follow a special rule:

- Never add an element to or rehash the AA unless you know you hold the 
only reference.

This means that even though the AA behaves as a reference type, the data 
cannot be shared unless all references are readonly. I feel I must regard 
this bastardization, being neither value type nor reference type, as a 
design mistake.

A simple fix is to change the AA implementation so that AAs become MyAA* 
instead of aaA*[], with MyAA defined as:

struct MyAA {
	AANode[] buckets;
	uint numberOfNodes;
}

You get a very slight additional lookup cost (double dereferencing to 
get to the buckets), but make the AA a pure reference type without any 
caveats and problems.

My questions are:

1) Why is the D built in array type designed in such a fragile way? 
Should D not be a robust language? Do users have to resort to writing 
library wrappers around D's types to get robustness? If so:

2) Why even keep the builtin AA? With the addition of an opIn() 
overload, you could get *exactly* the same functionality from a library 
implementation:

Map!(char[],int) map;

rather than:

int[char[]] map;

Cons:
  - This slight syntax difference.

Pros:
  - The library version would be transparently replaceable by other 
implementations when the data or other conditions have different 
algorithmic requirements.
  - The library implementation could easily be more efficient than the 
current built in version by taking advantage of templates.
  - The library implementation could without problems be made to work 
with types that lack opCompare.
  - The library version would not have any strange quirks and gotchas. 
(Dangerous usage cases like above)
  - Greater library/language decoupling.
  - The possibility of implementing an UnsafeMap type with high 
performance but less safety for the few cases where the performance 
really matters.
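To make the comparison concrete, here is a minimal sketch of what such a 
library type could look like (the name Map, its methods, and the use of 
opIn_r for the `in` operator are assumptions for illustration, not an 
existing library):

```d
// Hypothetical library AA. Being a class, it is a true reference
// type: every copy of the handle sees inserts and rehashes.
class Map(K, V) {
	private V[K] impl;	// stand-in storage for this sketch; a real
						// Map would own and manage its own buckets

	V    opIndex(K key)              { return impl[key]; }
	void opIndexAssign(V val, K key) { impl[key] = val; }
	V*   opIn_r(K key)               { return key in impl; }
	size_t length()                  { return impl.length; }
}
```

Usage would be `auto map = new Map!(char[], int); map["x"] = 1;` -- 
close to, but as the follow-ups note, not exactly the builtin syntax.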

In my opinion, D's lack of protection (const arrays, etc) and robustness 
against programming mistakes is D's greatest weakness.

My contribution to the slogan thread:

"D - Programming without a lifeline" :)

/Oskar
Mar 24 2006
next sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
Hello,

The latest "Implicit fn template instantiation" thread contains code 
that contains a hidden and possibly hard to find bug.
[...]

Good catch!
2) Why even keep the builtin AA? With the addition of an opIn() 
overload, you could get *exactly* the same functionality from a library 
implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

I find it rather confusing and frustrating. In my current project I 
ended up making my own template AA to get around some of the 
deficiencies of the builtin type. Mostly I wanted to avoid double 
lookups, and easily allow custom hashers without encapsulating the key 
in a struct/class (think case-insensitive char[] keys.) The result was 
also faster and more memory efficient than the builtin AA.

Nick
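Nick's double-lookup point can be made concrete. Since opIndex cannot 
return a reference, a custom type cannot express `ii[10]++`; what a 
library AA can do instead is expose a single-lookup helper that returns 
a pointer to the slot (the name getOrInsert is made up for illustration):

```d
// One logical lookup instead of two: find the slot for key, insert
// def if absent, and hand back a pointer so the caller can do (*p)++.
V* getOrInsert(K, V)(inout V[K] aa, K key, V def) {
	if (auto p = key in aa)
		return p;		// found: done after one hash lookup
	aa[key] = def;		// absent: insert the default...
	return key in aa;	// ...and fetch the freshly inserted slot
}
```

(The second `in` after the insert is needed in this sketch because 
insertion may rehash; a real library AA would return the node pointer 
directly from its own insert routine.)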
Mar 24 2006
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!

Indeed. It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not. But the original should never be rendered unusable.
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Sean
Mar 24 2006
next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:
 Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...


 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a 
 library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p

Consider the consequences of using a pointer in the above example. ;)
 I find it rather confusing and frustrating. In my current project I 
 ended up
 making my own template AA to get around some of the deficiencies of 
 the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom 
 hashers
 without encapsulating the key in a struct/class (think 
 case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the 
 builtin AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Is there any way to retain the syntax and ease of use of the built in AA 
with a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

/Oskar
Mar 24 2006
parent reply Nick <Nick_member pathlink.com> writes:
In article <e01a21$te2$5 digitaldaemon.com>, Oskar Linde says...
Is there any way to retain the syntax and ease of use of the built in AA 
with a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

I think the STL containers do a very good job at being both flexible and 
usable, covering most normal cases of use. The thing missing from D to 
make an STL-like library would be the reference return type. Some of the 
syntax would still be impossible to duplicate, eg.

int[int] ii;
ii[1]++;         // This inserts the key '1'
int i = ii[2];   // This does NOT insert '2'

but I always thought this was the wrong way to do it anyway. (Eg. what 
should happen when the value is a struct and you use ii[3].foo?)

Nick
Mar 25 2006
next sibling parent Wolfgang Draxinger <wdraxinger darkstargames.de> writes:
Nick wrote:

 but I always thought this was the wrong way to do it anyway. (Eg. what
 should happen when the value is a struct and you use ii[3].foo?)

Raise an exception? Wolfgang Draxinger
Mar 28 2006
prev sibling parent Wolfgang Draxinger <wdraxinger darkstargames.de> writes:
Nick wrote:

 I think the STL containers do a very good job at being both flexible and
 usable, covering most normal cases of use. The thing missing from D to
 make a STL-like library would be the reference return type.

Well, there's just the little problem that C++ '&' references returned 
from functions are not safe either. A C++ reference is IMHO nothing 
better than a strongly typed pointer.

example

class foo {
	foo(int bar) { a = new int[bar]; }
	int *a; // ignore the possible bounds error ;-)
	int & operator [] (int i) { return a[i]; }
	int & something () {
		int hello = 123;
		return hello; // dangling reference to a local!
	}
};

This isn't a bit better than using a pointer instead of a reference.

The real problem currently is that the index expression can be used as 
an lvalue, but opIndex does not yet have support for it. But why not 
solve it this way: the code

int[] aa;
aa[i]++;

compiles into execution of

aa.opIndexAssign(aa.opIndex(i)+1, i);

Now for the support of user defined AAs I'd like to suggest a new 
overload of opIndex and opIndexAssign, or new operator functions opKey 
and opKeyAssign, which take a parameter of the key type as first 
parameter. Also I'd introduce a base AA type, so that one can easily 
derive from it. To use such a user defined AA class I'd like to suggest 
the following syntax:

int[char[]]!MyAA an_AA;

The idea is that any sort of AA is in fact an implicit instantiation of 
the builtin AA template, i.e.

int[char[]] an_AA;

is the same as

int[char[]]!__builtinAA an_AA;

Instead of a struct I'd use classes for this. AAs are dynamic and thus 
allocated on the heap anyway. But making them a class would easily allow 
catching such ambiguous things as Oskar had found. So logically an AA 
class would be declared as

class __builtinAA(T, K) { };

And a derived class

class MyAA(T, K) : __builtinAA!(T, K) { };

or even

class IntToStringAA : __builtinAA!(char[], int) { };

Eventually derive __builtinAA from a common base, non-templated AA type, 
so that RTTI clearly recognizes it as an AA, and not derive from a 
templated class.

How about this?

Wolfgang Draxinger
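Wolfgang's proposed rewrite can be spelled out by hand today; the 
compiler support he asks for would simply generate the second form from 
the first. A small sketch with a made-up indexable struct:

```d
// A toy indexable type; `s[i]++` does not compile for it in D1,
// but the proposed lowering can be written manually.
struct S {
	int[16] data;
	int  opIndex(size_t i)              { return data[i]; }
	void opIndexAssign(int v, size_t i) { data[i] = v; }
}

void bump(inout S s, size_t i) {
	// manual expansion of the proposed lowering of `s[i]++`:
	s.opIndexAssign(s.opIndex(i) + 1, i);
}
```

Note the cost: this read-modify-write performs two lookups, which is 
exactly what a reference-returning opIndex would avoid.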
Mar 28 2006
prev sibling next sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
Indeed.  It seems roughly similar to how arrays are handled: in place 
modifications persist, but anything causing a realloc does not.  But the 
original should never be rendered unusable.

True. Except that adding an element to an array (when no realloc occurs) does not change the original either.
This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice.  I 
definately do like having built in AAs, but I do find some of the 
consequences irritating.

Sort of OT, but I've studied the builtin AA code and have come to the 
conclusion that opCmp() isn't needed NOW either. In the binary tree 
implementation that Walter used, he first compares the hash values, and 
only if those match does he call opCmp(). Since actual hash collisions 
should be pretty uncommon(*), opCmp is almost never called(**). In the 
rare case of a hash collision, one could make a special case and eg. 
always place it in the left node.

(*) It is uncommon for decent hash functions. It seems the builtin AA is 
optimized to also work with very bad hashers (in which case it 
approaches a pure binary tree.) I guess this isn't a bad feature for a 
generic AA, but it is also some of the reason why my custom hash map 
outperformed it (I didn't use btrees at all.)

(**) Just did a test with an english dictionary of 173528 words and the 
standard D string hasher. opCmp() was called 1840 times, slightly over 1%.

Nick
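The hash-first comparison Nick describes is a standard hash-table trick: 
cache each entry's full hash in its node and fall back to the expensive 
key comparison only on a genuine collision. A sketch (the node layout is 
illustrative, not the actual runtime's):

```d
// Illustrative bucket node: the cached hash is checked first, so the
// key comparison (the opCmp/opEquals analogue) runs only when two
// different keys happen to share a hash.
struct Node {
	uint   hash;	// full hash of the key, cached at insert time
	char[] key;
	int    value;
}

bool matches(Node* n, uint h, char[] k) {
	if (n.hash != h)
		return false;	// cheap reject; the overwhelmingly common case
	return n.key == k;	// rare: hashes equal, compare the keys
}
```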
Mar 25 2006
parent reply Dave <Dave_member pathlink.com> writes:
In article <e0360o$28ep$1 digitaldaemon.com>, Nick says...
(*) It is uncommon for decent hash functions. It seems the builtin AA is
optimized to also work with very bad hashers (in which case it approaches a pure
binary tree.) I guess this isn't a bad feature for a generic AA, but it is also
some of the reason why my custom hash map outperformed it (I didn't use btrees
at all.)

Just curious - what portion of the perf. diff. was that? What data 
types? And in your estimation what accounts for the rest of the diff. 
(e.g. are you pre-allocating a bunch of space for the data (as opposed 
to the key) or not using the GC)?

Thanks,

- Dave
Mar 26 2006
parent reply Nick <Nick_member pathlink.com> writes:
In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
Just curious - what portion of the perf. diff. was that?

I don't remember exactly. I did a btree implementation, but when I saw 
that it was slightly slower than the (simpler) list implementation (and 
definitely not faster), I just scrapped it.

Note that from a theoretical viewpoint, the btrees only make sense when 
the average bucket size is larger than 2. This practically only happens 
when the lookup table is too small compared to the number of elements, 
or when you use a bad hash function. Both probably will occur "in the 
wild", though. The DMD AA should keep the btrees. I knew my data 
beforehand, so I didn't need them.
What data types?

The speed tests were done with char[] keys and int values.
 And in your estimation what accounts for the rest of the diff.

For lookups, hard to say. I found that using a simple power-of-two 
table size and a bitmask for lookups was just as good as messing about 
with prime numbers, but only gave a minuscule performance increase. 
Perhaps templated code just means less overhead. In any case, the 
difference was very small, and the builtin AA is very fast.

Inserts, however, are a different story. The builtin AA starts with a 
small table size (97), and resizes (rehashes) at regular intervals as 
you insert elements. Being able to specify the approximate number of 
elements _in advance_ (thus avoiding the rehashes) gave me a 3x-5x 
performance increase when inserting.
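The 3x-5x insertion speedup is easy to picture: if the table is sized 
for the expected element count up front, no insert ever triggers a 
rehash. A sketch of the kind of pre-sized table Nick describes (all 
names here are hypothetical; his actual implementation is not shown in 
the thread):

```d
// Hypothetical pre-sized chained hash table. The power-of-two bucket
// count lets lookups use a bitmask instead of a modulo.
struct Node(K, V) {
	uint hash;
	K key;
	V value;
	Node!(K, V)* next;	// collision chain
}

class PresizedMap(K, V) {
	private Node!(K, V)*[] buckets;

	this(size_t expected) {
		// keep the load factor below 1/2 so chains stay short
		size_t n = 16;
		while (n < expected * 2)
			n *= 2;
		buckets.length = n;	// sized once; inserts never rehash
	}

	size_t bucketOf(uint hash) { return hash & (buckets.length - 1); }
}
```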
(e.g. are you pre-allocating a bunch of space for the data (as opposed to the
 key) or not using the GC)?

Nope, the default allocator is the GC, and one new node struct is 
new'ed for each insert. (But with a malloc or region allocator it 
becomes much faster.)

Nick
Mar 27 2006
parent Dave <Dave_member pathlink.com> writes:
In article <e08fi6$ouf$1 digitaldaemon.com>, Nick says...
In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
Just curious - what portion of the perf. diff. was that?


Nick

Thanks for that info.
Mar 27 2006
prev sibling next sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Sean Kelly wrote:
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a 
 library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p

There are two distinct things here, which I hope you already know:

* Reference types (as in vs. value types)
* Reference types or references (as in C++ references)

I don't know any unambiguous terminology for these two terms; there may 
not even be one (correct me if wrong). However, nowadays, most of the 
time when one says "reference types" it means the first thing (myself 
included). So please, when you speak of the C++ references, please 
disambiguate the term. (usually called just "references", although this 
is still very close to the other term)

As for your point: yes, this demonstrates a good usage scenario for a 
"C++ reference". Should there be such a feature in D? Or can this 
problem be solved in an alternate way? Is the added complexity of this 
feature worth it? I don't know, I cannot come to a conclusion. :/

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 26 2006
prev sibling parent Hong <Hong_member pathlink.com> writes:
This is just a perfect demonstration of the problem with D value types. 
A way to solve this would be to extend the inout keyword to function 
returns and variables, such that [] would look like:

inout T opIndex(size_t i)

Hong

In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!

Indeed. It seems roughly similar to how arrays are handled: in place modifications persist, but anything causing a realloc does not. But the original should never be rendered unusable.
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

Yup, not having a bona fide reference type in D is problematic. Though you could always use pointers :-p
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

This would also *almost* allow us to get rid of opCmp(Object) and 
opEquals(Object) in the Object base class, which would be very nice. I 
definitely do like having built in AAs, but I do find some of the 
consequences irritating.

Sean

Mar 28 2006
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Nick wrote:
 In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
 Hello,

 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug.
 [...]

Good catch!
 2) Why even keep the builtin AA? With the addition of an opIn() 
 overload, you could get *exactly* the same functionality from a library 
 implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some 
syntax that is currently _impossible_ to duplicate in custom types. Eg. 
the following:

# int[int] ii;
# ii[10]++;

You're correct. I don't know how I could have forgotten that. D's lack 
of a reference return type has implications for properties too.
 I find it rather confusing and frustrating. In my current project I ended up
 making my own template AA to get around some of the deficiencies of the builtin
 type. Mostly I wanted to avoid double lookups, and easily allow custom hashers
 without encapsulating the key in a struct/class (think case-insensitive char[]
 keys.) The result was also faster and more memory efficient than the builtin
AA.

I have similar experiences implementing a custom templated hash-table. /Oskar
Mar 24 2006
prev sibling next sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Oskar Linde wrote:
 Hello,
 
 The latest "Implicit fn template instantiation" thread contains code 
 that contains a hidden and possibly hard to find bug. Rewriting the code 
 with a non-templated version for clarity:
 
 void increment(int[char[]] map, char[] key) {
     if (auto t = (key in map))
         (*t)++;
     else
         map[key] = 1;
 }
 
 To understand the code, one first has to know if the associative array 
 is a value type or a reference type.
 
 It turns out it is neither. It is definitely not a value type. The above 
 function will in most cases alter the original associative array and 
 work as expected. But it is not a reference type either -- in the few 
 cases where adding a new key to the AA invokes a rehash, all hell breaks 
 loose. The original associative array is suddenly *corrupted*! (The way 
 to make the function work as expected is to make the map an inout 
 argument, but that is beside my point.)
 
 Internally, an AA is implemented as an aaA*[] -- a dynamic array of 
 pointers to aaA-nodes. The first node of the array contains metadata 
 (the total number of keys), and the remaining [1..$] nodes contains the 
 hash buckets. When passing an aa to a function, the aaA*[] gets passed. 
 I assume the reason for this implementations is efficiency. All you need 
 to do an efficient lookup is the array of buckets. But is this 
 implementation comes at a cost of robustness. To make use of AA's, you 
 have to follow a special rule:
 
 - Never add an element to or rehash the AA unless you know you hold the 
 only reference.
 
 This means that even though the AA behaves as a reference type, the data 
 can not be shared unless all references are readonly. I feel I must see 
 this bastardization of being neither value or reference type a design 
 mistake.
 

Oskar, thanks for pointing that out. That was one issue that I hadn't 
noticed yet, and I agree wholeheartedly that it is quite a design 
problem.

As a side note: I would prefer saying that these are more like broken 
reference types, rather than saying they are not reference types, since 
that may not be the best way to describe it (not being reference or 
value types is more the nature of the problem of static arrays). 
Anyway, in any case your description of the problem is quite clear.

Also, this illustrates another issue, which is the tricky reference 
semantics of dynamic arrays. Dynamic arrays are reference types, but 
they have a "catch", in that unlike all other reference types, there is 
an operation other than assignment that changes the identity of the 
array: the length change operation. But what is *really bad* is that it 
is not "deterministic"[1] whether the resulting array from a length 
change will be a brand new one or not. The identity (or perhaps better 
said, the contents) can be all new, or partially the same.

[1] "deterministic" is not the 100% correct word (as it is more the 
opposite of random), but I'm missing a better term.
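Bruno's point about the length change operation can be seen in a few 
lines; whether the two arrays still alias afterwards depends on spare 
capacity in the underlying memory block, which the caller cannot 
observe:

```d
// After growing b, whether a and b still share data is up to the
// allocator: in-place growth keeps them aliased, a reallocation
// silently splits them apart.
void lengthChangeDemo() {
	int[] a = new int[4];
	int[] b = a;				// two references to the same data
	b.length = b.length + 1;	// may grow in place, or may reallocate
	b[0] = 42;
	// a[0] is 42 only if growth happened in place: the
	// "non-deterministic" identity change described above.
}
```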
 A simple fix is to change the AA implementation so that AAs become MyAA* 
  instead of aaA*[], with MyAA defined as:
 
 struct MyAA {
     AANode[] buckets;
     uint numberOfNodes;
 }
 
 You get a very slight additional lookup cost (double dereferencing to 
 get to the buckets), but make the AA a pure reference type without any 
 caveats and problems.
 
 My questions are:
 
 1) Why is the D built in array type designed in such a fragile way? 
 Should D not be robust language? Do users have to resort to writing 
 library wrappers around D's types to get robustness? If so:
 

I'd like to know that too. One more problem to the list, and yet people 
are already thinking about 1.0 and D slogans and whatnot. :/

-- 
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 26 2006
prev sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message 
news:e00v0m$te2$3 digitaldaemon.com...
 It turns out it is neither. It is definitely not a value type. The above 
 function will in most cases alter the original associative array and work 
 as expected. But it is not a reference type either -- in the few cases 
 where adding a new key to the AA invokes a rehash, all hell breaks loose. 
 The original associative array is suddenly *corrupted*! (The way to make 
 the function work as expected is to make the map an inout argument, but 
 that is beside my point.)

I hadn't realized that could happen, and it clearly needs to be fixed.
 A simple fix is to change the AA implementation so that AAs become MyAA* 
 instead of aaA*[], with MyAA defined as:

 struct MyAA {
     AANode[] buckets;
     uint numberOfNodes;
 }

 You get a very slight additional lookup cost (double dereferencing to get 
 to the buckets), but make the AA a pure reference type without any caveats 
 and problems.

I'll make that change.
Mar 27 2006