www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - A case for valueless AA's?

reply Andrej Mitrovic <none none.none> writes:
Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm showing
traversal and lookup speed. The `static this()` is used to create a random
collection of words which get stored in the two arrays:

https://gist.github.com/893340

Hash lookups are naturally very fast. So I am inclined to use hashes for some
of my code, but there's a few problems:

1. I'm wasting memory. I don't need the values, I only need the keys. But I
can't declare a void[string] hash.

2. AA literals are awkward to write if you don't need the values:
Aa = ["foo":0, "bar":0, "car":0,
         "dar":0, "mar":0, "doo":0];

So there's redundancy there.

Anyhow, would this be a good case for valueless AA's? Is that even possible?

I could just wrap an AA in a custom type that has a nice constructor, which
takes care of issue #2. But I don't know how to get rid of the values, which
leaves the issue #1.
Mar 29 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrej Mitrovic (none none.none)'s article
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm showing

collection of words which get stored in the two arrays:
 https://gist.github.com/893340
 Hash lookups are naturally very fast. So I am inclined to use hashes for some
of

 1. I'm wasting memory. I don't need the values, I only need the keys. But I

 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
          "dar":0, "mar":0, "doo":0];
 So there's redundancy there.
 Anyhow, would this be a good case for valueless AA's? Is that even possible?

They're called sets. In Java I think they're named HashSet and TreeSet. In Python they're just named Set. Like AAs, the common ways to implement them are via hashing and balanced binary trees. They are usually fleshed out with operations like union, intersection and difference. I'm not sure whether these belong in the core language as an extension of AAs but if not they definitely belong in std.container.
Mar 29 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-03-29 14:34, dsimcha wrote:
 == Quote from Andrej Mitrovic (none none.none)'s article
 
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm
 showing

traversal and lookup speed. The `static this()` is used to create a random collection of words which get stored in the two arrays:
 https://gist.github.com/893340
 Hash lookups are naturally very fast. So I am inclined to use hashes for
 some of

my code, but there's a few problems:
 1. I'm wasting memory. I don't need the values, I only need the keys. But
 I

can't declare a void[string] hash.
 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
 
          "dar":0, "mar":0, "doo":0];
 
 So there's redundancy there.
 Anyhow, would this be a good case for valueless AA's? Is that even
 possible?

They're called sets. In Java I think they're named HashSet and TreeSet. In Python they're just named Set. Like AAs, the common ways to implement them are via hashing and balanced binary trees. They are usually fleshed out with operations like union, intersection and difference. I'm not sure whether these belong in the core language as an extension of AAs but if not they definitely belong in std.container.

We now have std.container.RedBlackTree, which can be used as a set, but it is a tree rather than a hash table. - Jonathan M Davis
Mar 29 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but it is
 a tree rather than a hash table.

 - Jonathan M Davis

This isn't a very good set. It doesn't even support basic set primitives like intersection, difference, union, etc. Furthermore, most programs will probably want a hash set. A tree set is only better when you don't have a good hash function or worst case performance is more important than average case. I'm not saying there's anything wrong with RedBlackTree. It's a perfectly good data structure to implement a set on top of. It's just that it's not a "real" set type because it's not meant to be.
Mar 29 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-03-29 20:11, Andrej Mitrovic wrote:
 I had a look at dcollections, it seems to have a few different
 implementations of a set. And it has intersection (not sure about
 difference or union). It's boost licensed, I wonder if Steve will make
 a push of his library to Phobos when its done..

RedBlackTree came from dcollections in the first place. Its internals use pretty much the same backend implementation as dcollecitions' red-black tree does, but the front-end is very different (since dcollections and std.container are very different in philosophy). But as I understand it, he has no intention of getting any of the rest of dcollections in Phobos. His opinions on containers differ from Andrei's, and he said that the other container types are simple enough to reimplement, that there's no need to port them. - Jonathan M Davis
Mar 29 2011
prev sibling parent reply David Nadlinger <see klickverbot.at> writes:
On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

First of all, I agree that we really want to have a hash set in Phobos. I think that a first implementation wouldn't even need to support advanced operations like union, intersection, etc., but we should really have some kind of set in the standard library which doesn't look plain ugly (recently, I quickly needed some set implementation and abused AAs for it, but this isn't a serious option). I vaguely remember you having an optimized AA library, David. Does it contain something of interest for a pure set implementation as well, or is it strictly tailored towards hash maps? David
Mar 30 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from David Nadlinger (see klickverbot.at)'s article
 On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

I think that a first implementation wouldn't even need to support advanced operations like union, intersection, etc., but we should really have some kind of set in the standard library which doesn't look plain ugly (recently, I quickly needed some set implementation and abused AAs for it, but this isn't a serious option).

I was thinking an easy implementation of sets would be to abuse AAs using the void[0][KeyType] trick under the hood. We could even make a type that wraps any AA conforming to the canonical compile time interface and makes a set, using this trick.
 I vaguely remember you having an optimized AA library, David. Does it
 contain something of interest for a pure set implementation as well, or
 is it strictly tailored towards hash maps?

Strictly hash maps (though see above). This is mostly optimized for very large AAs that will be created and destroyed frequently. The optimizations are with regard to memory management. For lookups it's actually slower than the builtin AA b/c it's not as cache efficient.
Mar 30 2011
next sibling parent David Nadlinger <see klickverbot.at> writes:
On 3/30/11 3:59 PM, dsimcha wrote:
 == Quote from David Nadlinger (see klickverbot.at)'s article
 On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

I think that a first implementation wouldn't even need to support advanced operations like union, intersection, etc., but we should really have some kind of set in the standard library which doesn't look plain ugly (recently, I quickly needed some set implementation and abused AAs for it, but this isn't a serious option).

I was thinking an easy implementation of sets would be to abuse AAs using the void[0][KeyType] trick under the hood. We could even make a type that wraps any AA conforming to the canonical compile time interface and makes a set, using this trick.

Yes, that's what I was thinking about as well, and I really think just using AAs under the hood would be fine for an initial implementation. I'm considering writing up a patch for std.container myself (at least I certainly will if my Thrift project was accepted, because I'd rather target a »canonical« HashSet implementation there), but as I am unsure about what design to use (final classes?), I refrained from it so far. David
Mar 30 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-03-30 07:42, David Nadlinger wrote:
 On 3/30/11 3:59 PM, dsimcha wrote:
 =3D=3D Quote from David Nadlinger (see klickverbot.at)'s article
=20
 On 3/30/11 5:02 AM, dsimcha wrote:
 [=E2=80=A6] Furthermore, most programs will probably want a hash set.

First of all, I agree that we really want to have a hash set in Phobos. I think that a first implementation wouldn't even need to support advanced operations like union, intersection, etc., but we should real=



 have some kind of set in the standard library which doesn't look plain
 ugly (recently, I quickly needed some set implementation and abused AAs
 for it, but this isn't a serious option).

I was thinking an easy implementation of sets would be to abuse AAs usi=


 the void[0][KeyType] trick under the hood.  We could even make a type
 that wraps any AA conforming to the canonical compile time interface and
 makes a set, using this trick.

Yes, that's what I was thinking about as well, and I really think just using AAs under the hood would be fine for an initial implementation. =20 I'm considering writing up a patch for std.container myself (at least I certainly will if my Thrift project was accepted, because I'd rather target a =C2=BBcanonical=C2=AB HashSet implementation there), but as I am=

 about what design to use (final classes?), I refrained from it so far.

The containers in std.container are supposed to be final classes and implem= ent=20 any of the functions in the table which they can implement within the requi= red=20 Big O complexity. The exact set of functions may change over time as usage= =20 shows things that need to improve, but what's there is what is currently=20 expected. Currently (in git), RedBlackTree is the only one which is a class= ,=20 and glancing at it, I see that I forgot to make it final when I turned it i= nto=20 a class, so I'm going to have to go and fix that (but at least it's a class= =20 now). The others still need to be turned into classes. =2D Jonathan M Davis
Mar 30 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I had a look at dcollections, it seems to have a few different
implementations of a set. And it has intersection (not sure about
difference or union). It's boost licensed, I wonder if Steve will make
a push of his library to Phobos when its done..
Mar 29 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 29 Mar 2011 23:02:35 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but  
 it is
 a tree rather than a hash table.

 - Jonathan M Davis

This isn't a very good set. It doesn't even support basic set primitives like intersection, difference, union, etc.

std.container.RedBlackTree supports the standard methods, and not much else. I am not the architect for std.container, so I did not want to add methods that might some day be superseded by further additions to the std.container regime. dcollections does support these primitives, so the code already exists, I just am not sure of the interface Andrei wants.
 Furthermore, most programs will probably want a hash set.  A tree set is  
 only better when you don't have a good hash function or worst case  
 performance is more important than average case.

A tree-based set's advantage over hash-based is that it's ordered. This means things like insertions and deletions do not affect ranges, whereas an insertion in a hash set can cause a rehash. In addition to that, an ordered set provides excellent slicing ability (dcollections supports this). So it depends on your needs. Tree sets are definitely lower performing, I think Hashes win out pretty much there.
 I'm not saying there's anything wrong with RedBlackTree.  It's a  
 perfectly good data structure to implement a set on top of.  It's just  
 that it's not a "real" set type because it's not meant to be.

I found that the primitives in RedBlackTree suffice to make a set except for intersection, which cannot be done in a generic fashion with high performance. This function exists in dcollections, and can easily be added to std.container.RedBlackTree. -Steve
Mar 30 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 29 Mar 2011 23:11:53 -0400, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I had a look at dcollections, it seems to have a few different
 implementations of a set. And it has intersection (not sure about
 difference or union). It's boost licensed, I wonder if Steve will make
 a push of his library to Phobos when its done..

set1.difference(set2) => set1.remove(set2) set1.union(set2) => set1.add(set2) So they are there, but they just aren't called union and difference. intersection isn't a very generic function, so it is called intersect. In fact, I had to add a special implementation function to both Hash and RBTree to get it to work correctly. And yes, I will add intersection to std.container when the API for it becomes clear. -Steve
Mar 30 2011
prev sibling next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Andrej Mitrovic wrote:
 1. I'm wasting memory. I don't need the values, I only need the keys. But I
can't declare a void[string] hash.
 But I don't know how to get rid of the values, which leaves the issue #1.

Maybe static array of zero length? From documentation: "A static array with a dimension of 0 is allowed, but no space is allocated for it. It's useful as the last member of a variable length struct, or as the degenerate case of a template expansion." And this code works: void[0][string] aa; aa["key"] = []; assert("key" in aa); But anyway, I feel that dedicated HashSet container would be more appropriate.
Mar 29 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Piotr Szturmaj:

 void[0][string] aa;
 aa["key"] = [];
 assert("key" in aa);

Nice. I have done a benchmark, and it seems a void[0][int] uses the same amount of memory as a int[int]. I have not tried a void[0][string]. Bye, bearophile
Mar 29 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 29 Mar 2011 18:42:59 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Piotr Szturmaj:

 void[0][string] aa;
 aa["key"] = [];
 assert("key" in aa);

Nice. I have done a benchmark, and it seems a void[0][int] uses the same amount of memory as a int[int]. I have not tried a void[0][string].

Each node in the hash has a pointer (for the next node), a hash (size_t) and the key and value. Given the granularity of allocation (16 bytes), anything less than 16 bytes will be 16 bytes. If the above trick works correctly, a void[0][string] should be the same size as int[int]. (8 bytes for pointer and hash, 8 bytes for string key, and hopefully 0 bytes for the value). BTW, you should not use [] in your code, as this calls a runtime array allocation function. Better to use something else, not sure what though. Maybe an enum? -Steve
Mar 30 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
A use case for this could be a file system search:

    void[string] names;  // unique names to find
    string[] results;

    foreach (string name; dirEntries(curdir, SpanMode.deep))
    {
        if (name.basename in names)
            results ~= name;
    }

With string arrays the `if` check might slow things down.
Mar 29 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Oh there's a container? I'll take a look, thanks guys.
Mar 29 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Looks like not. I've misread Piotr'es post. :)
Mar 29 2011
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"Andrej Mitrovic" <none none.none> wrote in message 
news:imtirq$292g$1 digitalmars.com...
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm 
 showing traversal and lookup speed. The `static this()` is used to create 
 a random collection of words which get stored in the two arrays:

 https://gist.github.com/893340

 Hash lookups are naturally very fast. So I am inclined to use hashes for 
 some of my code, but there's a few problems:

 1. I'm wasting memory. I don't need the values, I only need the keys. But 
 I can't declare a void[string] hash.

 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
         "dar":0, "mar":0, "doo":0];

 So there's redundancy there.

 Anyhow, would this be a good case for valueless AA's? Is that even 
 possible?

 I could just wrap an AA in a custom type that has a nice constructor, 
 which takes care of issue #2. But I don't know how to get rid of the 
 values, which leaves the issue #1.

I frequently find use for such a thing. I've always used "bool[whatever]" for that purpose. Piotr's suggestion of "void[0][whatever]" never occurred to me, I may try that sometime.
Mar 29 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/29/2011 11:43 PM, Andrej Mitrovic wrote:
 A use case for this could be a file system search:

      void[string] names;  // unique names to find
      string[] results;

      foreach (string name; dirEntries(curdir, SpanMode.deep))
      {
          if (name.basename in names)
              results ~= name;
      }

 With string arrays the `if` check might slow things down.

There are tons of use cases for sets. Unfortunately, implementation (either as hashed collections or binary trees) is far more complicated than for sequential collections (array, list). Especially, it highly depends on the element type. Thus, sets are rarely a builtin type. For this reason, probably, people often wrongly use arrays/lists where sets are appropriate. This means, mainly, where the primary operation is a test of membership/containment. (Also set operations like set equality, union, intersection...) The pseudo-value trick you used stil a good one: people often build sets from AAs with a value of true for each key/element (thus, set[x] returns true if x is present in set -- but this works only in implementations where absence of x does not throw). Denis -- _________________ vita es estrany spir.wikidot.com
Mar 29 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/29/2011 11:47 PM, Andrej Mitrovic wrote:
 Looks like not. I've misread Piotr'es post. :)

There is one in Steven's dcollections. Denis -- _________________ vita es estrany spir.wikidot.com
Mar 29 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 3/30/11, spir <denis.spir gmail.com> wrote:
 On 03/29/2011 11:47 PM, Andrej Mitrovic wrote:
 Looks like not. I've misread Piotr'es post. :)

There is one in Steven's dcollections. Denis

Oh yes, I've been meaning to look at dcollections. The last time I saw it I was still confused as to how templates work, but now it's time to have some fun. :)
Mar 29 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/30/2011 05:02 AM, dsimcha wrote:
 On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but it is
 a tree rather than a hash table.

 - Jonathan M Davis

This isn't a very good set. It doesn't even support basic set primitives like intersection, difference, union, etc. Furthermore, most programs will probably want a hash set. A tree set is only better when you don't have a good hash function or worst case performance is more important than average case. I'm not saying there's anything wrong with RedBlackTree. It's a perfectly good data structure to implement a set on top of. It's just that it's not a "real" set type because it's not meant to be.

Yep, but with balanced binary search trees you get sorting for free ;-) Denis -- _________________ vita es estrany spir.wikidot.com
Mar 30 2011