www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - AA's and mutating keys

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
There isn't any documentation that I can tell on the Arrays page that 
determines whether a mutating key can be used for an AA.  For example, if I 
use char[] as a key:

char[][char[]] map;

char[] keyvalue = "hello";

map[keyvalue] = "world";
keyvalue[] = "cello";

char[] *val = "hello" in map;

So does val get set to non-null?

-Steve 
Oct 31 2007
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:fga4ah$kov$1 digitalmars.com...
 There isn't any documentation that I can tell on the Arrays page that 
 determines whether a mutating key can be used for an AA.  For example, if 
 I use char[] as a key:

 char[][char[]] map;

 char[] keyvalue = "hello";

 map[keyvalue] = "world";
 keyvalue[] = "cello";

 char[] *val = "hello" in map;

 So does val get set to non-null?

 -Steve
Never ever ever ever ever ever ever do this. Really. I'm pretty sure what you get from that 'in' is undefined. Some languages go so far as to disallow using mutable types as map keys (i.e. Python). It's just bad news.
Oct 31 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
It's ok if the AA makes a copy of the key, or uses something based on the 
key (like a hashcode).

My code is less likely to be encountered in real life, but I have a more 
probable example:

Let's say you were counting the number of occurrences of each word in a 
file.  You may do it by reading lines into a buffer.  Then for each word in 
the line, increment a counter in an AA that uses strings as the key and 
integers as the value.

If you re-use the buffer, you might overwrite the key that you used to 
insert an element into an AA!  But it's not obvious to someone who normally 
codes in C++ or Java, because strings are either pass by value or immutable. 
I've used D for a few months now, and it isn't even obvious to me :)

I am running into this very situation, so I want to know whether an AA will 
dup the key automatically or if I have to do it myself.

Plus, if I have to dup the key every time I just increment the count (not 
the first insertion), that is a lot of wasted memory and copying.  So I have 
to do something like:

int *x = key in map;
if(x)
  (*x)++;
else
  map[key.dup] = 1;

which would be nice if it was done automatically for me :)

Also, if I do have to dup it myself, then the website should specify that 
keys should not be modified once they are used to insert nodes into an AA.

-Steve

"Jarrett Billingsley" wrote
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:fga4ah$kov$1 digitalmars.com...
 There isn't any documentation that I can tell on the Arrays page that 
 determines whether a mutating key can be used for an AA.  For example, if 
 I use char[] as a key:

 char[][char[]] map;

 char[] keyvalue = "hello";

 map[keyvalue] = "world";
 keyvalue[] = "cello";

 char[] *val = "hello" in map;

 So does val get set to non-null?

 -Steve
Never ever ever ever ever ever ever do this. Really. I'm pretty sure what you get from that 'in' is undefined. Some languages go so far as to disallow using mutable types as map keys (i.e. Python). It's just bad news.
Oct 31 2007
next sibling parent BCS <BCS pathlink.com> writes:
Steven Schveighoffer wrote:

 I am running into this very situation, so I want to know whether an AA will 
 dup the key automatically or if I have to do it myself.
 
AA's do NOT dup the string. AA's should requirer that keys be invariant. It's a bit expensive but I try to so this int[char[]] a; char[] s = something(); a[a.dup] = 5l;
Oct 31 2007
prev sibling next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:fgaf4j$1d18$1 digitalmars.com...
 It's ok if the AA makes a copy of the key, or uses something based on the 
 key (like a hashcode).
Not really. Consider: int[char[]] map; char[] key = new char[5]; key[] = "hello"[]; map[key] = 8; key[0] = 'c'; map["cello"] = 2; What happens is that you insert ("hello" => 8) into the table. But then you go and modify the key so it's "cello" instead. However now you have an inconsistent key: the key says "cello" but its hash is the hash of "hello". So, when you try to change "cello", the new "cello" maps to another slot and ("cello" => 2) is inserted. In fact, if you run this code and then print out the key value pairs, you'll get something like: cello 8 cello 2 In fact it's now impossible to modify or remove the original key-value pair because even if you can get something to hash to the right slot, the keys won't compare equal, and so the AA implementation won't find it.
 Let's say you were counting the number of occurrences of each word in a 
 file.  You may do it by reading lines into a buffer.  Then for each word 
 in the line, increment a counter in an AA that uses strings as the key and 
 integers as the value.

 If you re-use the buffer, you might overwrite the key that you used to 
 insert an element into an AA!  But it's not obvious to someone who 
 normally codes in C++ or Java, because strings are either pass by value or 
 immutable. I've used D for a few months now, and it isn't even obvious to 
 me :)

 I am running into this very situation, so I want to know whether an AA 
 will dup the key automatically or if I have to do it myself.

 Plus, if I have to dup the key every time I just increment the count (not 
 the first insertion), that is a lot of wasted memory and copying.  So I 
 have to do something like:

 int *x = key in map;
 if(x)
  (*x)++;
 else
  map[key.dup] = 1;

 which would be nice if it was done automatically for me :)
Easy, nested functions to the rescue: char[80] buffer; int[char[]] map; void inc(char[] key) { int* x = key in map; if(x) (*x)++; else map[key.dup] = 1; } foreach(thing; input) { fill buffer; inc(buffer[0 .. end]); } The nice thing about it not doing it automatically for you is performance. Most of the time you _don't_ need this behavior, and when you do it's easy to implement it on top of the existing behavior.
 Also, if I do have to dup it myself, then the website should specify that 
 keys should not be modified once they are used to insert nodes into an AA.
Agreed.
Oct 31 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Jarrett Billingsley" wrote
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:fgaf4j$1d18$1 digitalmars.com...
 It's ok if the AA makes a copy of the key, or uses something based on the 
 key (like a hashcode).
Not really. Consider: int[char[]] map; char[] key = new char[5]; key[] = "hello"[]; map[key] = 8; key[0] = 'c'; map["cello"] = 2; What happens is that you insert ("hello" => 8) into the table. But then you go and modify the key so it's "cello" instead. However now you have an inconsistent key: the key says "cello" but its hash is the hash of "hello". So, when you try to change "cello", the new "cello" maps to another slot and ("cello" => 2) is inserted. In fact, if you run this code and then print out the key value pairs, you'll get something like: cello 8 cello 2
You are half right :) The first part of my statement says "if the AA makes a copy of the key", so I think that is ok (and I think you agree). But you are right about the hash table. I understand that portion of it, but if the AA had a near-perfect hash function (like for an extreme example, md5sum of the text), then the AA would not even need to store the key, it would just store the hash. That's besides the point. I probably shouldn't have said the second part in my original statement.
 In fact it's now impossible to modify or remove the original key-value 
 pair because even if you can get something to hash to the right slot, the 
 keys won't compare equal, and so the AA implementation won't find it.
<splitting hairs> In fact it is possible :) key[0] = 'h'; map.remove(key); </splitting hairs>
 Easy, nested functions to the rescue:
This is a good point, I absolutely love nested functions, and didn't think of that.
 The nice thing about it not doing it automatically for you is performance. 
 Most of the time you _don't_ need this behavior, and when you do it's easy 
 to implement it on top of the existing behavior.
The performance hit of copying the string on first insertion is very minimal in most cases. I think the benefit far outweighs the cost. C++ std::map does this, and nobody has ever complained about the performance hit. It would be nice to have a class/struct that wraps an AA with this behavior. -Steve
Nov 01 2007
parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:fgctpf$1acr$1 digitalmars.com...
 It's ok if the AA makes a copy of the key, or uses something based on 
 the key (like a hashcode).
You are half right :) The first part of my statement says "if the AA makes a copy of the key", so I think that is ok (and I think you agree).
I must have interpreted the 'or' in your first statement as 'xor' ;)
 <splitting hairs>
 In fact it is possible :)

 key[0] = 'h';
 map.remove(key);
 </splitting hairs>
PSSSSSHHHHH
 The performance hit of copying the string on first insertion is very 
 minimal in most cases.  I think the benefit far outweighs the cost.  C++ 
 std::map does this, and nobody has ever complained about the performance 
 hit.

 It would be nice to have a class/struct that wraps an AA with this 
 behavior.
struct DupAA(V, K) { V[K] data; static DupAA!(V, K) opCall(V[K] data) { DupAA!(V, K) ret; ret.data = data; return ret; } size_t length() { return data.length; } K[] keys() { return data.keys; } V[] values() { return data.values; } void rehash() { data.rehash; } V opIndex(K k) { return data[k]; } void opIndexAssign(V v, K k) { if(auto val = k in data) *val = v; else { static if(is(typeof(k.dup))) data[k.dup] = v; else data[k] = v; } } int opApply(int delegate(inout V) dg) { foreach(v; data) if(auto result = dg(v)) return result; return 0; } int opApply(int delegate(inout K, inout V) dg) { foreach(k, v; data) if(auto result = dg(k, v)) return result; return 0; } } void main() { DupAA!(int, char[]) map; char[] key = new char[5]; key[] = "hello"[]; map[key] = 8; key[0] = 'c'; map["cello"] = 2; foreach(k, v; map) Stdout.formatln("{}: {}", k, v); } WHAM BAM
Nov 02 2007
prev sibling parent Paul Findlay <r.lph50+d gmail.com> writes:
Steven Schveighoffer wrote:
 int *x = key in map;
 if(x)
   (*x)++;
 else
   map[key.dup] = 1;
That's exactly what I was doing when trying Tim Bray's "Wide Finder" thing in D. I didn't mind it so much, but I found myself wishing it could do something like dup the key automatically and insert a default value. - Paul
Oct 31 2007
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Paul Findlay Wrote:
 That's exactly what I was doing when trying Tim Bray's "Wide Finder" thing
 in D.
Much slower than the psyco version still, I presume because I know Python more than D still (I have used functional-style coding in D 1.x, and with the new closures of D 2.x it may improve) and because Python AAs are quite more optimized: import std.stream, d.func, d.time, std.string; import std.regexp: search; void main() { auto t0 = clock(); string patt = `GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) `; int[string] count; foreach(string line; new BufferedFile("o1000k.ap")) if (line.find("GET /ongoing/When") != -1) if (auto m = search(line, patt)) count[m.match(1)]++; foreach(key; sortedAA(count, &Vgetter!(string, int))[0 .. 10]) { // writefln("%40s = %s", key, count[key]); } writefln(clock()-t0, " s"); foreach(key; sortedAA(count, &Vgetter!(string, int))[$-10 .. $]) writefln("%40s = %s", key, count[key]); } Psyco: import sys, time, re, collections, psyco timer = time.clock if sys.platform == "win32" else time.time def main(filenamein): t0 = timer() search = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ").search count = collections.defaultdict(int) for line in open(filenamein, "rb"): if "GET /ongoing/When" in line: match = search(line) if match: count[match.group(1)] += 1 for key in sorted(count, key=count.get)[:10]: print round(timer() - t0, 2), "s" for key in sorted(count, key=count.get)[-10:]: print "%40s = %s" % (key, count[key]) psyco.full() main("o1000k.ap") Bear hugs, bearophile
Nov 02 2007