www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Revamping associative arrays

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Associative arrays are today quite problematic because they don't offer 
any true iteration. Furthermore, the .keys and .values properties create 
new arrays, which is wasteful.

Another issue with associative arrays is that ++a[k] is hacked, which 
reflects a grave language limitation. That needs to be replaced with a 
true facility.

Any other issues with AAs that you want to see fixed, and ideas guiding 
a redesign?


Thanks,

Andrei
Oct 17 2009
next sibling parent Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Associative arrays are today quite problematic because they don't offer 
 any true iteration. Furthermore, the .keys and .values properties create 
 new arrays, which is wasteful.
 
 Another issue with associative arrays is that ++a[k] is hacked, which 
 reflects a grave language limitation. That needs to be replaced with a 
 true facility.
 
 Any other issues with AAs that you want to see fixed, and ideas guiding 
 a redesign?
 
 
 Thanks,
 
 Andrei

I've heard that D's AA's are not GC friendly. I'd recommend looking at "high scale lib". It's a java container library designed for high throughput and high concurrency. I was going to try my hand at writing it, but the whole shared debacle kind of squashed that. With all the polishing of D2, when will shared be both well documented and well implemented. Shared updates haven't been in any of the recent changelogs.
Oct 17 2009
prev sibling next sibling parent reply Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Andrei Alexandrescu wrote:
 Associative arrays are today quite problematic because they don't offer 
 any true iteration. Furthermore, the .keys and .values properties create 
 new arrays, which is wasteful.
 
 Another issue with associative arrays is that ++a[k] is hacked, which 
 reflects a grave language limitation. That needs to be replaced with a 
 true facility.
 
 Any other issues with AAs that you want to see fixed, and ideas guiding 
 a redesign?
 
 
 Thanks,
 
 Andrei

Idea: the .keys and .values properties, rather than creating arrays, could create iterable ranges with the smallest possible footprint, internally walking the tree structure.
Oct 17 2009
next sibling parent reply BCS <none anon.com> writes:
Hello Chris Nicholson-Sauls,

 Andrei Alexandrescu wrote:
 
 Associative arrays are today quite problematic because they don't
 offer any true iteration. Furthermore, the .keys and .values
 properties create new arrays, which is wasteful.
 
 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with
 a true facility.
 
 Any other issues with AAs that you want to see fixed, and ideas
 guiding a redesign?
 
 Thanks,
 
 Andrei
 

could create iterable ranges with the smallest possible footprint, internally walking the tree structure.

what will this do? foreach(key; aa.keys) if(Test(key)) aa.remove(key);
Oct 17 2009
next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Moritz Warning wrote:
 On Sat, 17 Oct 2009 18:58:08 +0000, BCS wrote:
 what will this do?

 foreach(key; aa.keys)
    if(Test(key))
       aa.remove(key);

It's undefined behavior. You shouldn't try to mutate the aa while iterating. I hope that will be fixed. It took me some time to find this out.

That's really annoying, and it's true of most (all?) C# base class library collections. When coding in C#, I often find myself doing: var remove = new HashedSet<Foo>(); foreach (var foo in foos) if (Test(foo)) remove.Add(foo); foos.RemoveAll(remove); It's more allocation than necessary, but more than that, it's an unnecessarily complicated way of interacting with the collection.
Oct 18 2009
prev sibling parent Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
BCS wrote:
 Hello Chris Nicholson-Sauls,
 
 Andrei Alexandrescu wrote:

 Associative arrays are today quite problematic because they don't
 offer any true iteration. Furthermore, the .keys and .values
 properties create new arrays, which is wasteful.

 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with
 a true facility.

 Any other issues with AAs that you want to see fixed, and ideas
 guiding a redesign?

 Thanks,

 Andrei

could create iterable ranges with the smallest possible footprint, internally walking the tree structure.

what will this do? foreach(key; aa.keys) if(Test(key)) aa.remove(key);

Pre-cache a pointer to the next node in sequence; two pointers is still better than an arbitrarily long on-the-fly array. Its my understanding that AA trees do not automatically rebalance, so the sequence isn't going to (otherwise) change on a removal. End of the sequence means a null "next" pointer. For that matter: how common is looping over keys just to remove select ones versus other purposes in looping just the keys? -- Chris Nicholson-Sauls
Oct 18 2009
prev sibling next sibling parent reply Moritz Warning <moritzwarning web.de> writes:
On Sat, 17 Oct 2009 18:58:08 +0000, BCS wrote:

 Hello Chris Nicholson-Sauls,
 
 Andrei Alexandrescu wrote:
 
 Associative arrays are today quite problematic because they don't
 offer any true iteration. Furthermore, the .keys and .values
 properties create new arrays, which is wasteful.
 
 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with a
 true facility.
 
 Any other issues with AAs that you want to see fixed, and ideas
 guiding a redesign?
 
 Thanks,
 
 Andrei
 

could create iterable ranges with the smallest possible footprint, internally walking the tree structure.

foreach(key; aa.keys) if(Test(key)) aa.remove(key);

It's undefined behavior. You shouldn't try to mutate the aa while iterating. I hope that will be fixed. It took me some time to find this out.
Oct 17 2009
parent BCS <none anon.com> writes:
Hello Moritz,

 On Sat, 17 Oct 2009 18:58:08 +0000, BCS wrote:
 
 Hello Chris Nicholson-Sauls,
 
 Idea: the .keys and .values properties, rather than creating arrays,
 could create iterable ranges with the smallest possible footprint,
 internally walking the tree structure.
 

foreach(key; aa.keys) if(Test(key)) aa.remove(key);

It's undefined behavior. You shouldn't try to mutate the aa while iterating. I hope that will be fixed. It took me some time to find this out.

That's the reason for the above idiom. Under the current spec, the above is totally valid and, as far as I know, the only valid way to remove all elements fitting some test with only a single pass. Whatever is chosen, the above code had better work or at a minimum, requiter nothing more than the addition of a .dup OTOH building .keys must do a pass of it's own so all that's gained is sugar. But still, switching to a .keys that gets invalidated on updating the AA will break existing code so making the fix simple should be kept in mind.
Oct 17 2009
prev sibling next sibling parent reply Moritz Warning <moritzwarning web.de> writes:
On Sat, 17 Oct 2009 19:06:36 +0000, Moritz Warning wrote:

 On Sat, 17 Oct 2009 18:58:08 +0000, BCS wrote:
 
 Hello Chris Nicholson-Sauls,
 
 Andrei Alexandrescu wrote:
 
 Associative arrays are today quite problematic because they don't
 offer any true iteration. Furthermore, the .keys and .values
 properties create new arrays, which is wasteful.
 
 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with
 a true facility.
 
 Any other issues with AAs that you want to see fixed, and ideas
 guiding a redesign?
 
 Thanks,
 
 Andrei
 

could create iterable ranges with the smallest possible footprint, internally walking the tree structure.

foreach(key; aa.keys) if(Test(key)) aa.remove(key);

It's undefined behavior. You shouldn't try to mutate the aa while iterating. I hope that will be fixed. It took me some time to find this out.

Uh, sorry. You are iterating over an fresh allocated array of the keys. That's ok.
Oct 17 2009
parent BCS <none anon.com> writes:
Hello Moritz,

 On Sat, 17 Oct 2009 19:06:36 +0000, Moritz Warning wrote:
 
 On Sat, 17 Oct 2009 18:58:08 +0000, BCS wrote:
 
 what will this do?
 
 foreach(key; aa.keys)
 if(Test(key))
 aa.remove(key);

It's undefined behavior. You shouldn't try to mutate the aa while iterating. I hope that will be fixed. It took me some time to find this out.

Uh, sorry. You are iterating over an fresh allocated array of the keys. That's ok.

As it happens, your comment would be correct if .keys were, as proposed, made to be a range that walks the AA.
Oct 17 2009
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 17 Oct 2009 14:58:08 -0400, BCS <none anon.com> wrote:

 Hello Chris Nicholson-Sauls,

 Andrei Alexandrescu wrote:

 Associative arrays are today quite problematic because they don't
 offer any true iteration. Furthermore, the .keys and .values
 properties create new arrays, which is wasteful.
  Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with
 a true facility.
  Any other issues with AAs that you want to see fixed, and ideas
 guiding a redesign?
  Thanks,
  Andrei

could create iterable ranges with the smallest possible footprint, internally walking the tree structure.

what will this do? foreach(key; aa.keys) if(Test(key)) aa.remove(key);

http://www.dsource.org/projects/dcollections/docs/current/dcollections.model.Iterator.html Search for Purgeable. I always hated that limitation of not being able to remove elements while iterating. It's one of the things I always despised about Java and C# iteration compared to C++. -Steve
Oct 20 2009
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Associative arrays are today quite problematic because they don't offer
 any true iteration. Furthermore, the .keys and .values properties create
 new arrays, which is wasteful.
 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with a
 true facility.
 Any other issues with AAs that you want to see fixed, and ideas guiding
 a redesign?
 Thanks,
 Andrei

I've thought about this a lot, here's a laundry list. 1. Huge AAs (over about 1 million elements) create massive amounts of false pointers. Whether this should be fixed in the AA design or whether the AA design should assume a good GC design and hope we get one is a matter of debate. 2. Every insertion to an AA requires a memory allocation (read: requires taking a lock in a multithreaded program). Again, it's debatable whether an AA design should consider this or whether the real answer is to fix the GC. 3. The structure of AAs means that the GC must scan everything. There is no part of the data structure that can be marked as not having pointers, so that the GC can skip scanning it. This takes forever and a day. 4. AAs are really, really, *really* space inefficient. For *every element* in the array, you have a struct that looks like the following: struct aaA { aaA *left; aaA *right; hash_t hash; /* key */ /* value */ } Assuming 32-bit ints and pointers and zero overhead from other sources, this means you have *at least* 12 bytes per element in wasted space. 5. Everything in the current design is based on RTTI instead of templates. I haven't benchmarked this, but I'd imagine it has to slow things down a little. 6. The array of aaA structs can only have the following sizes: immutable size_t[] prime_list = [ 97UL, 389UL, 1_543UL, 6_151UL, 24_593UL, 98_317UL, 393_241UL, 1_572_869UL, 6_291_469UL, 25_165_843UL, 100_663_319UL, 402_653_189UL, 1_610_612_741UL, 4_294_967_291UL, // 8_589_934_513UL, 17_179_869_143UL ]; I understand that there is theoretical justification for this, but in practice I'd rather have O(N) performance in a few more corner cases in exchange for having my AA not be horribly space-inefficient. I've worked a little on a design that's based simply on two parallel arrays, one for keys and one for values. I used parallel arrays instead of structs to cut down on alignment overhead and so that if one slot has ptrs and the other doesn't, only the one that does can be scanned by the GC. Collisions are resolved by probing in an order defined by a linear congruential random number generator seeded with the hash before computing the modulus. This means that, even if two elements have the same hash in modulus space, as long as their hashes are different in full 32-bit space, the probing sequences for resolving collisions will be completely different. I might have mentioned it here a while back, but I'll post it here for comment. AFAIK it solves every one of the problems listed above.
Oct 17 2009
next sibling parent reply language_fan <foo bar.com.invalid> writes:
Sat, 17 Oct 2009 19:16:32 +0000, dsimcha thusly wrote:

 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 Associative arrays are today quite problematic because they don't offer
 any true iteration. Furthermore, the .keys and .values properties
 create new arrays, which is wasteful.
 Another issue with associative arrays is that ++a[k] is hacked, which
 reflects a grave language limitation. That needs to be replaced with a
 true facility.
 Any other issues with AAs that you want to see fixed, and ideas guiding
 a redesign?
 Thanks,
 Andrei

I've thought about this a lot, here's a laundry list.

[snip] It would be great if the underlying implementation could be switched based on the task at hand. The AAs could be completely library provided, with only a small built-in syntactical sugaring on top of it.
Oct 17 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
language_fan wrote:
 It would be great if the underlying implementation could be switched 
 based on the task at hand. The AAs could be completely library provided, 
 with only a small built-in syntactical sugaring on top of it.

They already are. In D1, the complete implementation is in phobos/internal/aaA.d, and the compiler knows nothing about it other than the function signatures.
Oct 18 2009
prev sibling parent BCS <none anon.com> writes:
Hello dsimcha,

 6.  The array of aaA structs can only have the following sizes:
 
 immutable size_t[] prime_list = [
 97UL,            389UL,
 1_543UL,          6_151UL,
 24_593UL,         98_317UL,
 393_241UL,      1_572_869UL,
 6_291_469UL,     25_165_843UL,
 100_663_319UL,    402_653_189UL,
 1_610_612_741UL,  4_294_967_291UL,
 //  8_589_934_513UL, 17_179_869_143UL
 ];
 
 I understand that there is theoretical justification for this, but in
 practice I'd rather have O(N) performance in a few more corner cases
 in exchange for having my AA not be horribly space-inefficient.
 

Adding more primes to that list should resolve that issue nicely while retaining the advantages of using mod-prime in the hashing. for example: 1 97 2 197 3 397 4 797 5 1597 6 3203 7 6421 8 12853 9 25717 10 51437 11 102877 12 205759 13 411527 14 823117 15 1646237 16 3292489 17 6584983 18 13169977 19 26339969 20 52679969 21 105359939 22 ... heres some code to generate a list at whatever rate you want: import std.stdio; import std.math; ulong[] primes; // running time O(intagral(pi(sqrt(x)), dx, 0, n)) void main() { real rate = 2.0; uint from = 97; int count = 1; writef("%s\t%s\n", count, from); primes.length = 10_000; // stop somewhere north of 4_294_967_291UL WARNING: this might not end correctly long prime_n = 0; ulong sqrt_n = 0; ulong p = 1; outer: while(sqrt_n < primes.length - 5) { p += 2; ulong sqrt_p = 1+cast(ulong)sqrt(cast(real)p); sqrt_n = 0; while(sqrt_n < prime_n && primes[sqrt_n] <= sqrt_p) if(p % primes[sqrt_n] == 0) continue outer; else sqrt_n++; if(prime_n < primes.length) primes[prime_n] = p; prime_n++; if(p >= from * rate) { count++; from = p; writef("%s\t%s\n", count, from); } } }
Oct 17 2009
prev sibling next sibling parent reply Max Samukha <spambox d-coding.com> writes:
On Sat, 17 Oct 2009 13:28:51 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

Associative arrays are today quite problematic because they don't offer 
any true iteration. Furthermore, the .keys and .values properties create 
new arrays, which is wasteful.

Another issue with associative arrays is that ++a[k] is hacked, which 
reflects a grave language limitation. That needs to be replaced with a 
true facility.

Any other issues with AAs that you want to see fixed, and ideas guiding 
a redesign?


Thanks,

Andrei

They should be true reference types. Now we have oddities like this: int[int] a; auto b = a; a[1] = 2; assert(b[1] == 2); // Range violation int[int] a; a[1] = 2; auto b = a; a[2] = 3; assert(b[2] == 3); // Ok
Oct 17 2009
parent reply grauzone <none example.net> writes:
Max Samukha wrote:
 On Sat, 17 Oct 2009 13:28:51 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Associative arrays are today quite problematic because they don't offer 
 any true iteration. Furthermore, the .keys and .values properties create 
 new arrays, which is wasteful.

 Another issue with associative arrays is that ++a[k] is hacked, which 
 reflects a grave language limitation. That needs to be replaced with a 
 true facility.

 Any other issues with AAs that you want to see fixed, and ideas guiding 
 a redesign?


 Thanks,

 Andrei

They should be true reference types. Now we have oddities like this: int[int] a; auto b = a; a[1] = 2; assert(b[1] == 2); // Range violation int[int] a; a[1] = 2; auto b = a; a[2] = 3; assert(b[2] == 3); // Ok

That's because the actual AA is allocated on demand (when you add the first element). The new array type T[new] is going to have the same "problem". I like it, because if you don't need it, it doesn't allocate memory. Of course it's possible that the complication in semantics isn't it worth. But D is still a systems programming language with a *very* bad GC, which is why I think it should continue to work this way.
Oct 17 2009
parent reply Max Samukha <spambox d-coding.com> writes:
On Sat, 17 Oct 2009 22:47:21 +0200, grauzone <none example.net> wrote:

Max Samukha wrote:
 On Sat, 17 Oct 2009 13:28:51 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Associative arrays are today quite problematic because they don't offer 
 any true iteration. Furthermore, the .keys and .values properties create 
 new arrays, which is wasteful.

 Another issue with associative arrays is that ++a[k] is hacked, which 
 reflects a grave language limitation. That needs to be replaced with a 
 true facility.

 Any other issues with AAs that you want to see fixed, and ideas guiding 
 a redesign?


 Thanks,

 Andrei

They should be true reference types. Now we have oddities like this: int[int] a; auto b = a; a[1] = 2; assert(b[1] == 2); // Range violation int[int] a; a[1] = 2; auto b = a; a[2] = 3; assert(b[2] == 3); // Ok

That's because the actual AA is allocated on demand (when you add the first element). The new array type T[new] is going to have the same "problem".

I agree. But the problem is the lack of a decent way to instantiate an empty AA. Now you either have to do idiotic things like: int[int] a; a[1] = 1; a.remove(1); Or wrap the array in another type. And, talking about T[new], we should be able to instantiate them eagerly as well.
I like it, because if you don't need it, it doesn't allocate memory. Of 
course it's possible that the complication in semantics isn't it worth. 
But D is still a systems programming language with a *very* bad GC, 
which is why I think it should continue to work this way.

Oct 18 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Max Samukha:

 I agree. But the problem is the lack of a decent way to instantiate an
 empty AA. Now you either have to do idiotic things like:
 
 int[int] a;
 a[1] = 1;
 a.remove(1);
 
 Or wrap the array in another type.

This is an empty AA, you need nothing else: int[int] a; In my post I have also said I'd like to have a built-in expression literal for empty AA, I use this in my dlibs: private template AA_impl(KeyType, ValueType) { ValueType[KeyType] result; const ValueType[KeyType] res = result; } template AA(KeyType, ValueType) { const ValueType[KeyType] AA = AA_impl!(KeyType, ValueType).res; } That you can use for example like this, to give an empty AA to a foo() function: foo(AA!(int, float)); Something built-in with a better syntax may be found, example: That can be used like: foo([int:float]); Bye, bearophile
Oct 18 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 That you can use for example like this, to give an empty AA to a foo()
function:
 foo(AA!(int, float));

I have just found that you can also use "null" there. It's less explicit because if you take a look at the call site only, you don't know that the receiver will need an AA. That's why I'd like to disallow "null" to be used where an empty array can be given (and force the programmer to use [] there). This is also how Delight language works. Bye, bearophile
Oct 18 2009
prev sibling parent Max Samukha <spambox d-coding.com> writes:
On Sun, 18 Oct 2009 04:29:57 -0400, bearophile
<bearophileHUGS lycos.com> wrote:

Max Samukha:

 I agree. But the problem is the lack of a decent way to instantiate an
 empty AA. Now you either have to do idiotic things like:
 
 int[int] a;
 a[1] = 1;
 a.remove(1);
 
 Or wrap the array in another type.

This is an empty AA, you need nothing else: int[int] a;

I showed in my original post it is not enough to use AA as a reference type.
Oct 18 2009
prev sibling next sibling parent reply Robert Clipsham <robert octarineparrot.com> writes:
Andrei Alexandrescu wrote:
 Any other issues with AAs that you want to see fixed, and ideas guiding 
 a redesign?

Their speed. I did a benchmark a while ago comparing tango's HashMap implementation and built in AAs, tango's hashmaps were twice as fast... It'd be nice if the built in language feature was faster than an implementation done within the language! :)
Oct 17 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Robert Clipsham wrote:
 Andrei Alexandrescu wrote:
 Any other issues with AAs that you want to see fixed, and ideas 
 guiding a redesign?

Their speed. I did a benchmark a while ago comparing tango's HashMap implementation and built in AAs, tango's hashmaps were twice as fast... It'd be nice if the built in language feature was faster than an implementation done within the language! :)

I hear you, but fortunately that doesn't concern the interface. Andrei
Oct 17 2009
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 17 Oct 2009 14:28:51 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Associative arrays are today quite problematic because they don't offer  
 any true iteration. Furthermore, the .keys and .values properties create  
 new arrays, which is wasteful.

 Another issue with associative arrays is that ++a[k] is hacked, which  
 reflects a grave language limitation. That needs to be replaced with a  
 true facility.

 Any other issues with AAs that you want to see fixed, and ideas guiding  
 a redesign?

Do not require opCmp for AAs. There are some good hashmap implementations do not require using a tree for collisions. This would also eliminate at least one pointer in the element struct. It also causes some problems with using arbitrary classes as keys. It is easy to define a default hash and default opEquals for a class, but it is difficult to define a default opCmp. In fact, the default opCmp in object simply throws an exception, making it a nuisance to have the compiler allow using a class as a key, and then throwing an exception at runtime when you use it. Removing the requirement for opCmp would also eliminate the requirement for opCmp to be in object (it currently by default throws an exception), so it could be in an interface instead. -Steve
Oct 20 2009