digitalmars.D - Associative arrays with void values

Doctor J (7/7) Apr 12 2009 Sometimes you want to use an associative array just to track keys you've...

torhu (34/41) Apr 12 2009 You can just use an associative array to make a set type, which has the
bearophile (7/7) Apr 12 2009 Doctor J:
dsimcha (28/35) Apr 12 2009 or count distinct keys, and you don't care about the values. The langua...

bearophile (4/6) Apr 12 2009 I think it never wastes one byte. The GC allocates blocks of memory with...

dsimcha (4/10) Apr 12 2009 length as a power of 2, for small sizes. You can see this very well if y...

bearophile (9/10) Apr 12 2009 At the moment an AA like this:

Steven Schveighoffer (14/25) Apr 13 2009 The AA node types are currently implemented as a binary tree for collisi...

Benji Smith (10/14) Apr 12 2009 Especially since an associative array should have a .keys property that

bearophile (7/12) Apr 12 2009 It's less semantically clean to define certain set operations on AAs, be...

Michel Fortin (10/18) Apr 13 2009 You could pass an alias to a function that would perform the merge
Benji Smith (14/24) Apr 13 2009 Actually I think we do agree. From an API perspective (rather than an

Doctor J <nobody nowhere.com> writes:

Sometimes you want to use an associative array just to track keys you've seen,
or count distinct keys, and you don't care about the values.  The language will
let you declare an array such as void[int]; but you can't actually do anything
with it:

void main()
{
    void[int] voidmap;    // This compiles
    voidmap[1] = void;  // This doesn't
}

My question is, is this a common or useful enough special case to warrant
inclusion in the language?  Or should I just go quietly and use a bool or a
byte or something?

Apr 12 2009

torhu <no spam.invalid> writes:

On 12.04.2009 17:50, Doctor J wrote:
 Sometimes you want to use an associative array just to track keys you've seen,
or count distinct keys, and you don't care about the values.  The language will
let you declare an array such as void[int]; but you can't actually do anything
with it:

 void main()
 {
      void[int] voidmap;    // This compiles
      voidmap[1] = void;  // This doesn't
 }

 My question is, is this a common or useful enough special case to warrant
inclusion in the language?  Or should I just go quietly and use a bool or a
byte or something?

You can just use an associative array to make a set type, which has the 
operations you need.  Tango has a HashSet, but I assume you're using 
phobos.  The implementation below is very basic, but it did what I 
needed at the time.  No difference, union, or intersection operations, 
though.

struct Set(T) {
	private bool[T] data_;

	void add(T val) { data_[val] = true; }  ///
	void remove(T val) { data_.remove(val); } ///
	bool opIn_r(T val) { return (val in data_) != null; } ///
	size_t length() { return data_.length; } ///
	void rehash() { data_.rehash; } ///
	T[] toArray() { return data_.keys; } ///

	///
	static Set!(T) opCall(T[] values=null)
	{
		Set!(T) set;
		foreach (T val; values)
			set.add(val);
		return set;
	}

	///
	int opApply(int delegate(ref T) dg)
	{
		int result = 0;
		foreach (key, val; data_) {
			result = dg(key);
			if (result)
				break;
		}
		return result;
	}
}

Apr 12 2009

bearophile <bearophileHUGS lycos.com> writes:

Doctor J:
In my libs there is a Set(), but it maps to the built-in AAs:
http://www.fantascienza.net/leonardo/so/dlibs/sets.html
(and adds set methods).

Adding sets to the language as AAs with void values, plus set methods is nice.

Bye,
bearophile

Apr 12 2009

dsimcha <dsimcha yahoo.com> writes:

== Quote from Doctor J (nobody nowhere.com)'s article
 Sometimes you want to use an associative array just to track keys you've seen,

or count distinct keys, and you don't care about the values.  The language will
let you declare an array such as void[int]; but you can't actually do anything
with it:
 void main()
 {
     void[int] voidmap;    // This compiles
     voidmap[1] = void;  // This doesn't
 }
 My question is, is this a common or useful enough special case to warrant

inclusion in the language?  Or should I just go quietly and use a bool or a byte
or something?


IMHO using AAs as makeshift sets is a terrible solution for a few reasons:

1.  Without the void[someType] hack, it wastes at least one byte per element,
sometimes a lot more depending on alignment issues.  If you use the
void[SomeType]
hack, you get really screwy syntax, to the point where a library type would have
much better syntax.  This means you gain nothing except maybe usability at
compile
time by having it builtin.
2.  A proper set type needs to have some extra methods like intersect, etc.  If
we're going to add all this, we should step back and figure out how to add sets
to
the language the right way instead of doing it the most immediately expedient
way
and being stuck with it.
3.  D's current AA implementation kind of sucks anyhow, at least when it
interacts
with conservative GC.  You probably don't want to build too much more stuff on
top
of it.

That said, sets should be in either the language or in the standard lib.  There
are at least a few good implementations (Tango, Dcollections, probably a bunch
more out there). Including one with an appropriate license and maybe some
consistency tweaks in Phobos wouldn't be too hard technically, though it might
be
politically.

On the other hand, I'm not sure if it makes sense from a consistency perspective
to have AAs as a builtin, first class type and sets as a library type.  I'm not
sure whether this argues more for AAs being a library type or sets being
builtin,
but the inconsistency is just weird.

Apr 12 2009

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:
 1.  Without the void[someType] hack, it wastes at least one byte per element,
 sometimes a lot more depending on alignment issues.

I think it never wastes one byte. The GC allocates blocks of memory with a
length as a power of 2, for small sizes. You can see this very well if you do
few experiments with big AAs.

Bye,
bearophile

Apr 12 2009

dsimcha <dsimcha yahoo.com> writes:

== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 1.  Without the void[someType] hack, it wastes at least one byte per element,
 sometimes a lot more depending on alignment issues.

 I think it never wastes one byte. The GC allocates blocks of memory with a

length as a power of 2, for small sizes. You can see this very well if you do
few
experiments with big AAs.
 Bye,
 bearophile

What if the key is a long?

Apr 12 2009

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:
 What if the key is a long?

At the moment an AA like this:
byte[long] aa;
allocates 16 or more bytes/pair, so it needs the same memory as:
long[long] aa;

A built-in set can of course use only about 8 bytes (plus few empty cells in
the hash) for a:
HashSet!(long)

Bye,
bearophile

Apr 12 2009

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 13 Apr 2009 01:45:00 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 dsimcha:
 What if the key is a long?

 At the moment an AA like this:
 byte[long] aa;
 allocates 16 or more bytes/pair, so it needs the same memory as:
 long[long] aa;

 A built-in set can of course use only about 8 bytes (plus few empty  
 cells in the hash) for a:
 HashSet!(long)

 Bye,
 bearophile

The AA node types are currently implemented as a binary tree for collision  
resolution, so each node also needs at least 2 pointers for the child  
nodes.  Possibly a parent pointer, but I don't think so.

So T[long] aa will be 16 bytes + sizeof(T) for each node, assuming a  
pointer is 4 bytes.

But technically, there is always going to be cases where waste occurs,  
even without the void hack, depending on the resulting size of your  
nodes.  Imagine a 33 byte node, which will use up 64 bytes of space per  
node.  This is one of the reasons Tango and Dcollections have chunk  
allocators where a large chunk of nodes is allocated at once from the GC  
(or from malloc for non-GC types).

-Steve

Apr 13 2009

Benji Smith <dlanguage benjismith.net> writes:

dsimcha wrote:
 On the other hand, I'm not sure if it makes sense from a consistency
perspective
 to have AAs as a builtin, first class type and sets as a library type.  I'm not
 sure whether this argues more for AAs being a library type or sets being
builtin,
 but the inconsistency is just weird.

Especially since an associative array should have a .keys property that 
returns a set.

(Incidentally, I also think the natural set operations, like 
intersection and mutual exclusion, are just as handy for maps as for sets.)

The natural conclusion is that AAs should be library types.

I like the fact that D provides literal syntax for AAs, but I think the 
correct implementation is for the compiler to pass the values from those 
   literal expressions into a library type constructor.

--benji

Apr 12 2009

bearophile <bearophileHUGS lycos.com> writes:

Benji Smith:

 Especially since an associative array should have a .keys property that 
 returns a set.

I don't agree.  I think associative arrays should have .keys/.values/.items
that return a lazy view that acts like a .set/.list/.list of pairs. Such "lazy
views" don't actually store anything, they are very light. This design is now
present in Python3, Java and I have done very similar things in my dlibs (named
xkeys/xvalues/xitems in my dlibs, but xkeys isn't a set-like thing yet).


 (Incidentally, I also think the natural set operations, like 
 intersection and mutual exclusion, are just as handy for maps as for sets.)

It's less semantically clean to define certain set operations on AAs, because
for example you have to decide what to do when keys are equal but their values
are not. You can avoid such semantic troubles altogether performing set
operations just only on the lazy view of the keys.


 The natural conclusion is that AAs should be library types.

I agree that probably now it's better to keep built-in only the syntax of AAs
and keep their semantics into D modules code. But I think what we/I have
discussed here doesn't imply such conclusion :-) I think you can do all the
things I have explained here even with a fully built-in.

I'm waiting for a more community-driven design of D ;-)

Bye,
bearophile

Apr 12 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-04-13 01:53:41 -0400, bearophile <bearophileHUGS lycos.com> said:

 (Incidentally, I also think the natural set operations, like
 intersection and mutual exclusion, are just as handy for maps as for sets.)

 
 It's less semantically clean to define certain set operations on AAs, 
 because for example you have to decide what to do when keys are equal 
 but their values are not.

You could pass an alias to a function that would perform the merge 
between values. That way, there's not ambiguity.


 You can avoid such semantic troubles altogether performing set 
 operations just only on the lazy view of the keys.

That'd require two passes: one for the keys, a second for getting the 
values from both sides (merging the values). It's going to be less 
efficient.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Apr 13 2009

Benji Smith <dlanguage benjismith.net> writes:

bearophile wrote:
 Benji Smith:
 
 Especially since an associative array should have a .keys property that 
 returns a set.

 
 I don't agree.  I think associative arrays should have .keys/.values/.items
that return a lazy view that acts like a .set/.list/.list of pairs. Such "lazy
views" don't actually store anything, they are very light. This design is now
present in Python3, Java and I have done very similar things in my dlibs (named
xkeys/xvalues/xitems in my dlibs, but xkeys isn't a set-like thing yet).

Actually I think we do agree. From an API perspective (rather than an 
implementation perspective), I think the .keys property should generally 
return a lazily constructed result (object? struct? I don't really 
care). But I think it should conform to some standardized notion of 
"set-ness" (interface? concept? again, I don't care).

HashSets are a perfectly acceptable implementation for me, as are Set 
interfaces, but I know some people won't like them, and those impl 
details aren't a big deal to me.

But whatever notion the language uses for its "Set" construct should be 
the same dohickey used by the AA .keys property.

 (Incidentally, I also think the natural set operations, like 
 intersection and mutual exclusion, are just as handy for maps as for sets.)

 
 It's less semantically clean to define certain set operations on AAs, because
for example you have to decide what to do when keys are equal but their values
are not. You can avoid such semantic troubles altogether performing set
operations just only on the lazy view of the keys.

You just have to define those operations on pairs rather than just on 
single values (for example, the union of two maps is naturally a multimap).

--benji

Apr 13 2009

D Programming

C/C++ Programming

Other

digitalmars.D - Associative arrays with void values