digitalmars.D - A case for valueless AA's?

Andrej Mitrovic (10/10) Mar 29 2011 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm ...

dsimcha (11/20) Mar 29 2011 traversal and lookup speed. The `static this()` is used to create a rand...

Jonathan M Davis (4/37) Mar 29 2011 We now have std.container.RedBlackTree, which can be used as a set, but ...

dsimcha (9/12) Mar 29 2011 This isn't a very good set. It doesn't even support basic set

Andrej Mitrovic (4/4) Mar 29 2011 I had a look at dcollections, it seems to have a few different

Steven Schveighoffer (11/15) Mar 30 2011 set1.difference(set2) => set1.remove(set2)

Jonathan M Davis (10/14) Mar 29 2011 RedBlackTree came from dcollections in the first place. Its internals us...
spir (7/19) Mar 30 2011 Yep, but with balanced binary search trees you get sorting for free ;-)
Steven Schveighoffer (19/33) Mar 30 2011 std.container.RedBlackTree supports the standard methods, and not much
David Nadlinger (11/12) Mar 30 2011 First of all, I agree that we really want to have a hash set in Phobos.

dsimcha (9/20) Mar 30 2011 I was thinking an easy implementation of sets would be to abuse AAs usin...

David Nadlinger (8/21) Mar 30 2011 Yes, that's what I was thinking about as well, and I really think just

Jonathan M Davis (19/44) Mar 30 2011 ng

Piotr Szturmaj (11/13) Mar 29 2011 Maybe static array of zero length? From documentation:

bearophile (4/7) Mar 29 2011 Nice. I have done a benchmark, and it seems a void[0][int] uses the same...

Steven Schveighoffer (12/18) Mar 30 2011 Each node in the hash has a pointer (for the next node), a hash (size_t)...

Andrej Mitrovic (9/9) Mar 29 2011 A use case for this could be a file system search:
Andrej Mitrovic (1/1) Mar 29 2011 Oh there's a container? I'll take a look, thanks guys.
Andrej Mitrovic (1/1) Mar 29 2011 Looks like not. I've misread Piotr'es post. :)
Nick Sabalausky (5/22) Mar 29 2011 I frequently find use for such a thing. I've always used "bool[whatever]...
spir (18/27) Mar 29 2011 There are tons of use cases for sets. Unfortunately, implementation (eit...
spir (7/8) Mar 29 2011 There is one in Steven's dcollections.
Andrej Mitrovic (4/8) Mar 29 2011 Oh yes, I've been meaning to look at dcollections. The last time I saw

Andrej Mitrovic <none none.none> writes:

Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm showing
traversal and lookup speed. The `static this()` is used to create a random
collection of words which get stored in the two arrays:

https://gist.github.com/893340

Hash lookups are naturally very fast. So I am inclined to use hashes for some
of my code, but there's a few problems:

1. I'm wasting memory. I don't need the values, I only need the keys. But I
can't declare a void[string] hash.

2. AA literals are awkward to write if you don't need the values:
Aa = ["foo":0, "bar":0, "car":0,
         "dar":0, "mar":0, "doo":0];

So there's redundancy there.

Anyhow, would this be a good case for valueless AA's? Is that even possible?

I could just wrap an AA in a custom type that has a nice constructor, which

Mar 29 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Andrej Mitrovic (none none.none)'s article
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm showing

traversal and lookup speed. The `static this()` is used to create a random
collection of words which get stored in the two arrays:
 https://gist.github.com/893340
 Hash lookups are naturally very fast. So I am inclined to use hashes for some
of

my code, but there's a few problems:
 1. I'm wasting memory. I don't need the values, I only need the keys. But I

can't declare a void[string] hash.
 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
          "dar":0, "mar":0, "doo":0];
 So there's redundancy there.
 Anyhow, would this be a good case for valueless AA's? Is that even possible?

They're called sets.  In Java I think they're named HashSet and TreeSet.  In
Python they're just named Set.  Like AAs, the common ways to implement them are
via hashing and balanced binary trees.  They are usually fleshed out with
operations like union, intersection and difference.  I'm not sure whether these
belong in the core language as an extension of AAs but if not they definitely
belong in std.container.

Mar 29 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-03-29 14:34, dsimcha wrote:
 == Quote from Andrej Mitrovic (none none.none)'s article
 
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm
 showing

 
 traversal and lookup speed. The `static this()` is used to create a random
 
 collection of words which get stored in the two arrays:
 https://gist.github.com/893340
 Hash lookups are naturally very fast. So I am inclined to use hashes for
 some of

 
 my code, but there's a few problems:
 1. I'm wasting memory. I don't need the values, I only need the keys. But
 I

 
 can't declare a void[string] hash.
 
 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
 
          "dar":0, "mar":0, "doo":0];
 
 So there's redundancy there.
 Anyhow, would this be a good case for valueless AA's? Is that even
 possible?

 
 They're called sets.  In Java I think they're named HashSet and TreeSet. 
 In Python they're just named Set.  Like AAs, the common ways to implement
 them are via hashing and balanced binary trees.  They are usually fleshed
 out with operations like union, intersection and difference.  I'm not sure
 whether these belong in the core language as an extension of AAs but if
 not they definitely belong in std.container.

We now have std.container.RedBlackTree, which can be used as a set, but it is 
a tree rather than a hash table.

- Jonathan M Davis

Mar 29 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but it is
 a tree rather than a hash table.

 - Jonathan M Davis

This isn't a very good set.  It doesn't even support basic set 
primitives like intersection, difference, union, etc.  Furthermore, most 
programs will probably want a hash set.  A tree set is only better when 
you don't have a good hash function or worst case performance is more 
important than average case.

I'm not saying there's anything wrong with RedBlackTree.  It's a 
perfectly good data structure to implement a set on top of.  It's just 
that it's not a "real" set type because it's not meant to be.

Mar 29 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I had a look at dcollections, it seems to have a few different
implementations of a set. And it has intersection (not sure about
difference or union). It's boost licensed, I wonder if Steve will make
a push of his library to Phobos when its done..

Mar 29 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 29 Mar 2011 23:11:53 -0400, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I had a look at dcollections, it seems to have a few different
 implementations of a set. And it has intersection (not sure about
 difference or union). It's boost licensed, I wonder if Steve will make
 a push of his library to Phobos when its done..

set1.difference(set2) => set1.remove(set2)
set1.union(set2) => set1.add(set2)

So they are there, but they just aren't called union and difference.

intersection isn't a very generic function, so it is called intersect.  In  
fact, I had to add a special implementation function to both Hash and  
RBTree to get it to work correctly.

And yes, I will add intersection to std.container when the API for it  
becomes clear.

-Steve

Mar 30 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-03-29 20:11, Andrej Mitrovic wrote:
 I had a look at dcollections, it seems to have a few different
 implementations of a set. And it has intersection (not sure about
 difference or union). It's boost licensed, I wonder if Steve will make
 a push of his library to Phobos when its done..

RedBlackTree came from dcollections in the first place. Its internals use 
pretty much the same backend implementation as dcollecitions' red-black tree 
does, but the front-end is very different (since dcollections and 
std.container are very different in philosophy).

But as I understand it, he has no intention of getting any of the rest of 
dcollections in Phobos. His opinions on containers differ from Andrei's, and 
he said that the other container types are simple enough to reimplement, that 
there's no need to port them.

- Jonathan M Davis

Mar 29 2011

spir <denis.spir gmail.com> writes:

On 03/30/2011 05:02 AM, dsimcha wrote:
 On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but it is
 a tree rather than a hash table.

 - Jonathan M Davis

 This isn't a very good set. It doesn't even support basic set primitives like
 intersection, difference, union, etc. Furthermore, most programs will probably
 want a hash set. A tree set is only better when you don't have a good hash
 function or worst case performance is more important than average case.

 I'm not saying there's anything wrong with RedBlackTree. It's a perfectly good
 data structure to implement a set on top of. It's just that it's not a "real"
 set type because it's not meant to be.

Yep, but with balanced binary search trees you get sorting for free ;-)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 30 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 29 Mar 2011 23:02:35 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 3/29/2011 8:37 PM, Jonathan M Davis wrote:
 We now have std.container.RedBlackTree, which can be used as a set, but  
 it is
 a tree rather than a hash table.

 - Jonathan M Davis

 This isn't a very good set.  It doesn't even support basic set  
 primitives like intersection, difference, union, etc.

std.container.RedBlackTree supports the standard methods, and not much  
else.  I am not the architect for std.container, so I did not want to add  
methods that might some day be superseded by further additions to the  
std.container regime.

dcollections does support these primitives, so the code already exists, I  
just am not sure of the interface Andrei wants.

 Furthermore, most programs will probably want a hash set.  A tree set is  
 only better when you don't have a good hash function or worst case  
 performance is more important than average case.

A tree-based set's advantage over hash-based is that it's ordered.  This  
means things like insertions and deletions do not affect ranges, whereas  
an insertion in a hash set can cause a rehash.

In addition to that, an ordered set provides excellent slicing ability  
(dcollections supports this).

So it depends on your needs.  Tree sets are definitely lower performing, I  
think Hashes win out pretty much there.

 I'm not saying there's anything wrong with RedBlackTree.  It's a  
 perfectly good data structure to implement a set on top of.  It's just  
 that it's not a "real" set type because it's not meant to be.

I found that the primitives in RedBlackTree suffice to make a set except  
for intersection, which cannot be done in a generic fashion with high  
performance.  This function exists in dcollections, and can easily be  
added to std.container.RedBlackTree.

-Steve

Mar 30 2011

David Nadlinger <see klickverbot.at> writes:

On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

First of all, I agree that we really want to have a hash set in Phobos. 
I think that a first implementation wouldn't even need to support 
advanced operations like union, intersection, etc., but we should really 
have some kind of set in the standard library which doesn't look plain 
ugly (recently, I quickly needed some set implementation and abused AAs 
for it, but this isn't a serious option).

I vaguely remember you having an optimized AA library, David. Does it 
contain something of interest for a pure set implementation as well, or 
is it strictly tailored towards hash maps?

David

Mar 30 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from David Nadlinger (see klickverbot.at)'s article
 On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

 First of all, I agree that we really want to have a hash set in Phobos.
 I think that a first implementation wouldn't even need to support
 advanced operations like union, intersection, etc., but we should really
 have some kind of set in the standard library which doesn't look plain
 ugly (recently, I quickly needed some set implementation and abused AAs
 for it, but this isn't a serious option).

I was thinking an easy implementation of sets would be to abuse AAs using the
void[0][KeyType] trick under the hood.  We could even make a type that wraps any
AA conforming to the canonical compile time interface and makes a set, using
this
trick.

 I vaguely remember you having an optimized AA library, David. Does it
 contain something of interest for a pure set implementation as well, or
 is it strictly tailored towards hash maps?

Strictly hash maps (though see above).  This is mostly optimized for very large
AAs that will be created and destroyed frequently.  The optimizations are with
regard to memory management.  For lookups it's actually slower than the builtin
AA
b/c it's not as cache efficient.

Mar 30 2011

David Nadlinger <see klickverbot.at> writes:

On 3/30/11 3:59 PM, dsimcha wrote:
 == Quote from David Nadlinger (see klickverbot.at)'s article
 On 3/30/11 5:02 AM, dsimcha wrote:
 […] Furthermore, most programs will probably want a hash set.

 First of all, I agree that we really want to have a hash set in Phobos.
 I think that a first implementation wouldn't even need to support
 advanced operations like union, intersection, etc., but we should really
 have some kind of set in the standard library which doesn't look plain
 ugly (recently, I quickly needed some set implementation and abused AAs
 for it, but this isn't a serious option).

 I was thinking an easy implementation of sets would be to abuse AAs using the
 void[0][KeyType] trick under the hood.  We could even make a type that wraps
any
 AA conforming to the canonical compile time interface and makes a set, using
this
 trick.

Yes, that's what I was thinking about as well, and I really think just 
using AAs under the hood would be fine for an initial implementation.

I'm considering writing up a patch for std.container myself (at least I 
certainly will if my Thrift project was accepted, because I'd rather 
target a »canonical« HashSet implementation there), but as I am unsure 
about what design to use (final classes?), I refrained from it so far.

David

Mar 30 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-03-30 07:42, David Nadlinger wrote:
 On 3/30/11 3:59 PM, dsimcha wrote:
 =3D=3D Quote from David Nadlinger (see klickverbot.at)'s article
=20
 On 3/30/11 5:02 AM, dsimcha wrote:
 [=E2=80=A6] Furthermore, most programs will probably want a hash set.

=20
 First of all, I agree that we really want to have a hash set in Phobos.
 I think that a first implementation wouldn't even need to support
 advanced operations like union, intersection, etc., but we should real=



ly
 have some kind of set in the standard library which doesn't look plain
 ugly (recently, I quickly needed some set implementation and abused AAs
 for it, but this isn't a serious option).

=20
 I was thinking an easy implementation of sets would be to abuse AAs usi=


ng
 the void[0][KeyType] trick under the hood.  We could even make a type
 that wraps any AA conforming to the canonical compile time interface and
 makes a set, using this trick.

=20
 Yes, that's what I was thinking about as well, and I really think just
 using AAs under the hood would be fine for an initial implementation.
=20
 I'm considering writing up a patch for std.container myself (at least I
 certainly will if my Thrift project was accepted, because I'd rather
 target a =C2=BBcanonical=C2=AB HashSet implementation there), but as I am=

 unsure
 about what design to use (final classes?), I refrained from it so far.

The containers in std.container are supposed to be final classes and implem=
ent=20
any of the functions in the table which they can implement within the requi=
red=20
Big O complexity. The exact set of functions may change over time as usage=
=20
shows things that need to improve, but what's there is what is currently=20
expected. Currently (in git), RedBlackTree is the only one which is a class=
,=20
and glancing at it, I see that I forgot to make it final when I turned it i=
nto=20
a class, so I'm going to have to go and fix that (but at least it's a class=
=20
now). The others still need to be turned into classes.

=2D Jonathan M Davis

Mar 30 2011

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Andrej Mitrovic wrote:
 1. I'm wasting memory. I don't need the values, I only need the keys. But I
can't declare a void[string] hash.


Maybe static array of zero length? From documentation:

"A static array with a dimension of 0 is allowed, but no space is 
allocated for it. It's useful as the last member of a variable length 
struct, or as the degenerate case of a template expansion."

And this code works:

void[0][string] aa;
aa["key"] = [];
assert("key" in aa);

But anyway, I feel that dedicated HashSet container would be more 
appropriate.

Mar 29 2011

bearophile <bearophileHUGS lycos.com> writes:

Piotr Szturmaj:

 void[0][string] aa;
 aa["key"] = [];
 assert("key" in aa);

Nice. I have done a benchmark, and it seems a void[0][int] uses the same amount
of memory as a int[int]. I have not tried a void[0][string].

Bye,
bearophile

Mar 29 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 29 Mar 2011 18:42:59 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Piotr Szturmaj:

 void[0][string] aa;
 aa["key"] = [];
 assert("key" in aa);

 Nice. I have done a benchmark, and it seems a void[0][int] uses the same  
 amount of memory as a int[int]. I have not tried a void[0][string].

Each node in the hash has a pointer (for the next node), a hash (size_t)  
and the key and value.  Given the granularity of allocation (16 bytes),  
anything less than 16 bytes will be 16 bytes.

If the above trick works correctly, a void[0][string] should be the same  
size as int[int]. (8 bytes for pointer and hash, 8 bytes for string key,  
and hopefully 0 bytes for the value).

BTW, you should not use [] in your code, as this calls a runtime array  
allocation function.  Better to use something else, not sure what though.   
Maybe an enum?

-Steve

Mar 30 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

A use case for this could be a file system search:

    void[string] names;  // unique names to find
    string[] results;

    foreach (string name; dirEntries(curdir, SpanMode.deep))
    {
        if (name.basename in names)
            results ~= name;
    }

With string arrays the `if` check might slow things down.

Mar 29 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Oh there's a container? I'll take a look, thanks guys.

Mar 29 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Looks like not. I've misread Piotr'es post. :)

Mar 29 2011

"Nick Sabalausky" <a a.a> writes:

"Andrej Mitrovic" <none none.none> wrote in message 
news:imtirq$292g$1 digitalmars.com...
 Here's a little benchmark of Associative Arrays and Dynamic Arrays. I'm 
 showing traversal and lookup speed. The `static this()` is used to create 
 a random collection of words which get stored in the two arrays:

 https://gist.github.com/893340

 Hash lookups are naturally very fast. So I am inclined to use hashes for 
 some of my code, but there's a few problems:

 1. I'm wasting memory. I don't need the values, I only need the keys. But 
 I can't declare a void[string] hash.

 2. AA literals are awkward to write if you don't need the values:
 Aa = ["foo":0, "bar":0, "car":0,
         "dar":0, "mar":0, "doo":0];

 So there's redundancy there.

 Anyhow, would this be a good case for valueless AA's? Is that even 
 possible?

 I could just wrap an AA in a custom type that has a nice constructor, 



I frequently find use for such a thing. I've always used "bool[whatever]" 
for that purpose. Piotr's suggestion of "void[0][whatever]" never occurred 
to me, I may try that sometime.

Mar 29 2011

spir <denis.spir gmail.com> writes:

On 03/29/2011 11:43 PM, Andrej Mitrovic wrote:
 A use case for this could be a file system search:

      void[string] names;  // unique names to find
      string[] results;

      foreach (string name; dirEntries(curdir, SpanMode.deep))
      {
          if (name.basename in names)
              results ~= name;
      }

 With string arrays the `if` check might slow things down.

There are tons of use cases for sets. Unfortunately, implementation (either as 
hashed collections or binary trees) is far more complicated than for sequential 
collections (array, list). Especially, it highly depends on the element type. 
Thus, sets are rarely a builtin type.
For this reason, probably, people often wrongly use arrays/lists where sets are 
appropriate. This means, mainly, where the primary operation is a test of 
membership/containment. (Also set operations like set equality, union, 
intersection...)
The pseudo-value trick you used stil a good one: people often build sets from 
AAs with a value of true for each key/element (thus, set[x] returns true if x 
is present in set -- but this works only in implementations where absence of x 
does not throw).

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 29 2011

spir <denis.spir gmail.com> writes:

On 03/29/2011 11:47 PM, Andrej Mitrovic wrote:
 Looks like not. I've misread Piotr'es post. :)

There is one in Steven's dcollections.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 29 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 3/30/11, spir <denis.spir gmail.com> wrote:
 On 03/29/2011 11:47 PM, Andrej Mitrovic wrote:
 Looks like not. I've misread Piotr'es post. :)

 There is one in Steven's dcollections.

 Denis

Oh yes, I've been meaning to look at dcollections. The last time I saw
it I was still confused as to how templates work, but now it's time to
have some fun. :)

Mar 29 2011

D Programming

C/C++ Programming

Other

digitalmars.D - A case for valueless AA's?