digitalmars.D.learn - hash design problem: both string and int[] keys

Charles Hixson (14/14) Nov 20 2012 I'm trying to figure out how to construct an associative array whose

=?UTF-8?B?QWxpIMOHZWhyZWxp?= (24/42) Nov 20 2012 A list of dchars is more accurate of course, but yes, dchar can be

Charles Hixson (30/73) Nov 20 2012 Sorry, I'm still thinking about how to design it. BBFMI has definite

Jonathan M Davis (7/15) Nov 20 2012 Um, no. int is 32 bits. dchar is 32 bits. long is 64 bits. So, both int ...
Charles Hixson (15/107) Nov 20 2012 After a bit more thought, I couldn't use a hash function to merge the

Charles Hixson (32/141) Nov 22 2012 This is the approach I've decided to use, i.e., generate string keys

Charles Hixson <charleshixsn earthlink.net> writes:

I'm trying to figure out how to construct an associative array whose 
keys will be a combination of strings and immutable int[]'s, but every 
approach I've looked at has run into problems.  It should be relatively 
easy, as strings are really just an immutable list of ints, but I 
haven't been able to find how strings are hashed.  (It's looking as if 
the details are handled in C...which makes it difficult.)  I could 
define my own hash code, but I don't find it at all clear what would be 
appropriate given that I don't know the size.  The data is rather 
sparse, so a hash table seems appropriate.  (My only other alternative 
is a database, and that imposes the heavy slowdown of I/O ... even 
though it does automate persistence it doesn't seem like a good tradeoff.)

So far my best idea is to build two tables, and then look at the key 
when a retrieval is attempted to figure out which table it's in.  That 
would probably work, but it feels like a kludge.

Nov 20 2012

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 11/20/2012 11:39 AM, Charles Hixson wrote:

 I'm trying to figure out how to construct an associative array whose
 keys will be a combination of strings and immutable int[]'s, but every
 approach I've looked at has run into problems.

Can you show some code?

 It should be relatively
 easy, as strings are really just an immutable list of ints,

A list of dchars is more accurate of course, but yes, dchar can be 
casted to int.

 but I
 haven't been able to find how strings are hashed. (It's looking as if
 the details are handled in C...which makes it difficult.)

I bet the algorithm considers all of the characters of the strings. The 
following has been mentioned in a recent forum thread:

   http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

 I could define
 my own hash code, but I don't find it at all clear what would be
 appropriate given that I don't know the size.

I wonder how much slow iterating over string would be, compared to 
directly accessing the elements of a dstring. (After all, strings must 
be decoded to produce dchars.)

In any case, you don't need the length of the string; a loop would do:

     while (!s.empty) {
         // ... use s.front and s.popFront() ...
     }

Alternatively:

     foreach (c; stride(s, 1)) {
         // ... use c ...
     }

UFCS looks better: :)

     foreach (c; s.stride(1)) {
         // ... use c ...
     }

 The data is rather sparse,
 so a hash table seems appropriate. (My only other alternative is a
 database, and that imposes the heavy slowdown of I/O ... even though it
 does automate persistence it doesn't seem like a good tradeoff.)

 So far my best idea is to build two tables, and then look at the key
 when a retrieval is attempted to figure out which table it's in. That
 would probably work, but it feels like a kludge.

How do you differentiate between the key type being int[] vs. string?

Ali

Nov 20 2012

Charles Hixson <charleshixsn earthlink.net> writes:

Ali Çehreli wrote:
 On 11/20/2012 11:39 AM, Charles Hixson wrote:

  > I'm trying to figure out how to construct an associative array whose
  > keys will be a combination of strings and immutable int[]'s, but every
  > approach I've looked at has run into problems.

 Can you show some code?

Sorry, I'm still thinking about how to design it.  BBFMI has definite 
limitations, and that's all I could use to write code right now.
  > It should be relatively
  > easy, as strings are really just an immutable list of ints,

 A list of dchars is more accurate of course, but yes, dchar can be
 casted to int.

Well, longs anyway.  Which is a minor problem, as I'd prefer to use 
ints, which are smaller.
  > but I
  > haven't been able to find how strings are hashed. (It's looking as if
  > the details are handled in C...which makes it difficult.)

 I bet the algorithm considers all of the characters of the strings. The
 following has been mentioned in a recent forum thread:

 http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

For unicode strings?  This causes a problem, though "considers" gives a 
lot of room for alternatives.  For ASCII there are several plausible 
approaches, since all one needs to do is minimize collisions.  But there 
are too many plausible unicode chars for most of them to be reasonable. 
  I'd probably go for multiply by a medium size prime, add to the 
accumulator, mod the accumulator by a large prime, stepping through the 
loop.  And I'd initialize the accumulator by the length of the current 
string.  (Note that this would apply equally to strings or int[]'s.)  So 
if that's what you mean by "considers all the characters", then, yes.
  > I could define
  > my own hash code, but I don't find it at all clear what would be
  > appropriate given that I don't know the size.

The above was a pseudo-code of the hash function I was considering.
 I wonder how much slow iterating over string would be, compared to
 directly accessing the elements of a dstring. (After all, strings must
 be decoded to produce dchars.)

 In any case, you don't need the length of the string; a loop would do:

The length isn't a problem.  It just further differentiates the 
different strings, and it's a convenient built-in function.
 while (!s.empty) {
 // ... use s.front and s.popFront() ...
 }

 Alternatively:

 foreach (c; stride(s, 1)) {
 // ... use c ...
 }

 UFCS looks better: :)

 foreach (c; s.stride(1)) {
 // ... use c ...
 }

Those are all good, but I might just use:
foreach (char c; lst)  {  ...  }
A hash code doesn't need to reflect the chars of the string, merely to 
separate different cases.
  > The data is rather sparse,
  > so a hash table seems appropriate. (My only other alternative is a
  > database, and that imposes the heavy slowdown of I/O ... even though it
  > does automate persistence it doesn't seem like a good tradeoff.)
  >
  > So far my best idea is to build two tables, and then look at the key
  > when a retrieval is attempted to figure out which table it's in. That
  > would probably work, but it feels like a kludge.

 How do you differentiate between the key type being int[] vs. string?

 Ali

I'd define the same access method (opIndex, opIndexAssign) with two 
different parameter types.

But as I said, that feels like a kludge.  I'm just contemplating it 
because I want rehash to work properly.  So I'd have two indexes:
Items[string]  stringKeys;
Items[int]    listKeys;
This lets the compiler handle resizing the arrays properly, at the cost 
of doubly hashing the listKeys variable.  This isn't too bad, as I 
expect all the lists to be 5 items or shorter, but it feels like a kludge.

Nov 20 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, November 20, 2012 12:42:50 Charles Hixson wrote:
  > It should be relatively
  > easy, as strings are really just an immutable list of ints,

 A list of dchars is more accurate of course, but yes, dchar can be
 casted to int.

 Well, longs anyway.  Which is a minor problem, as I'd prefer to use
 ints, which are smaller.

Um, no. int is 32 bits. dchar is 32 bits. long is 64 bits. So, both int and 
dchar will implicitly convert to long, but long will not implicitly to them. 
int and dchar however _do_ implicitly convert both ways. But if you're using 
string, then you're dealing with char (8 bits), which means that it's byte or 
ubyte (also 8 bits) that it implicitly converts to, not int.

- Jonathan M Davis

Nov 20 2012

Charles Hixson <charleshixsn earthlink.net> writes:

Charles Hixson wrote:
 Ali Çehreli wrote:
 On 11/20/2012 11:39 AM, Charles Hixson wrote:

 I'm trying to figure out how to construct an associative array whose
 keys will be a combination of strings and immutable int[]'s, but every
 approach I've looked at has run into problems.

 Can you show some code?

 Sorry, I'm still thinking about how to design it. BBFMI has definite
 limitations, and that's all I could use to write code right now.
 It should be relatively
 easy, as strings are really just an immutable list of ints,

 A list of dchars is more accurate of course, but yes, dchar can be
 casted to int.

 Well, longs anyway. Which is a minor problem, as I'd prefer to use ints,
 which are smaller.
 but I
 haven't been able to find how strings are hashed. (It's looking as if
 the details are handled in C...which makes it difficult.)

 I bet the algorithm considers all of the characters of the strings. The
 following has been mentioned in a recent forum thread:

 http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

 For unicode strings? This causes a problem, though "considers" gives a
 lot of room for alternatives. For ASCII there are several plausible
 approaches, since all one needs to do is minimize collisions. But there
 are too many plausible unicode chars for most of them to be reasonable.
 I'd probably go for multiply by a medium size prime, add to the
 accumulator, mod the accumulator by a large prime, stepping through the
 loop. And I'd initialize the accumulator by the length of the current
 string. (Note that this would apply equally to strings or int[]'s.) So
 if that's what you mean by "considers all the characters", then, yes.
 I could define
 my own hash code, but I don't find it at all clear what would be
 appropriate given that I don't know the size.


 The above was a pseudo-code of the hash function I was considering.
 I wonder how much slow iterating over string would be, compared to
 directly accessing the elements of a dstring. (After all, strings must
 be decoded to produce dchars.)

 In any case, you don't need the length of the string; a loop would do:

 The length isn't a problem. It just further differentiates the different
 strings, and it's a convenient built-in function.
 while (!s.empty) {
 // ... use s.front and s.popFront() ...
 }

 Alternatively:

 foreach (c; stride(s, 1)) {
 // ... use c ...
 }

 UFCS looks better: :)

 foreach (c; s.stride(1)) {
 // ... use c ...
 }

 Those are all good, but I might just use:
 foreach (char c; lst) { ... }
 A hash code doesn't need to reflect the chars of the string, merely to
 separate different cases.
 The data is rather sparse,
 so a hash table seems appropriate. (My only other alternative is a
 database, and that imposes the heavy slowdown of I/O ... even though it
 does automate persistence it doesn't seem like a good tradeoff.)

 So far my best idea is to build two tables, and then look at the key
 when a retrieval is attempted to figure out which table it's in. That
 would probably work, but it feels like a kludge.

 How do you differentiate between the key type being int[] vs. string?

 Ali

 I'd define the same access method (opIndex, opIndexAssign) with two
 different parameter types.

 But as I said, that feels like a kludge. I'm just contemplating it
 because I want rehash to work properly. So I'd have two indexes:
 Items[string] stringKeys;
 Items[int] listKeys;
 This lets the compiler handle resizing the arrays properly, at the cost
 of doubly hashing the listKeys variable. This isn't too bad, as I expect
 all the lists to be 5 items or shorter, but it feels like a kludge.

After a bit more thought, I couldn't use a hash function to merge the 
ints of the list, because if I did then any collision would cause the 
old value to be replaces...or would it if they didn't compare to equal? 
  Which different items wouldn't, but the *keys* would compare to equal, 
so they probably would replace, which isn't at all what I desire.  But 
if I just try to combine the ints, I can't guarantee unique 
keys...certainly not for any list longer than 2 items, as int.max * int 
would require a long, but adding a third term is a "can't do".  BigInt 
defines opCmp, but doesn't seem to define toHash.  So that seems to mean 
I need to convert the list to a string.  UGH!!

On the one hand, it allows me to get away with only using one hashtable, 
but on the other I've got this ugly key conversion...but maybe I won't 
need to convert both ways.  But it's still ugly.  Even the two table 
approach was nicer.

Nov 20 2012

Charles Hixson <charleshixsn earthlink.net> writes:

Charles Hixson wrote:
 Charles Hixson wrote:
 Ali Çehreli wrote:
 On 11/20/2012 11:39 AM, Charles Hixson wrote:

 I'm trying to figure out how to construct an associative array whose
 keys will be a combination of strings and immutable int[]'s, but every
 approach I've looked at has run into problems.

 Can you show some code?

 Sorry, I'm still thinking about how to design it. BBFMI has definite
 limitations, and that's all I could use to write code right now.
 It should be relatively
 easy, as strings are really just an immutable list of ints,

 A list of dchars is more accurate of course, but yes, dchar can be
 casted to int.

 Well, longs anyway. Which is a minor problem, as I'd prefer to use ints,
 which are smaller.
 but I
 haven't been able to find how strings are hashed. (It's looking as if
 the details are handled in C...which makes it difficult.)

 I bet the algorithm considers all of the characters of the strings. The
 following has been mentioned in a recent forum thread:

 http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

 For unicode strings? This causes a problem, though "considers" gives a
 lot of room for alternatives. For ASCII there are several plausible
 approaches, since all one needs to do is minimize collisions. But there
 are too many plausible unicode chars for most of them to be reasonable.
 I'd probably go for multiply by a medium size prime, add to the
 accumulator, mod the accumulator by a large prime, stepping through the
 loop. And I'd initialize the accumulator by the length of the current
 string. (Note that this would apply equally to strings or int[]'s.) So
 if that's what you mean by "considers all the characters", then, yes.
 I could define
 my own hash code, but I don't find it at all clear what would be
 appropriate given that I don't know the size.


 The above was a pseudo-code of the hash function I was considering.
 I wonder how much slow iterating over string would be, compared to
 directly accessing the elements of a dstring. (After all, strings must
 be decoded to produce dchars.)

 In any case, you don't need the length of the string; a loop would do:

 The length isn't a problem. It just further differentiates the different
 strings, and it's a convenient built-in function.
 while (!s.empty) {
 // ... use s.front and s.popFront() ...
 }

 Alternatively:

 foreach (c; stride(s, 1)) {
 // ... use c ...
 }

 UFCS looks better: :)

 foreach (c; s.stride(1)) {
 // ... use c ...
 }

 Those are all good, but I might just use:
 foreach (char c; lst) { ... }
 A hash code doesn't need to reflect the chars of the string, merely to
 separate different cases.
 The data is rather sparse,
 so a hash table seems appropriate. (My only other alternative is a
 database, and that imposes the heavy slowdown of I/O ... even

 though it
 does automate persistence it doesn't seem like a good tradeoff.)

 So far my best idea is to build two tables, and then look at the key
 when a retrieval is attempted to figure out which table it's in. That
 would probably work, but it feels like a kludge.

 How do you differentiate between the key type being int[] vs. string?

 Ali

 I'd define the same access method (opIndex, opIndexAssign) with two
 different parameter types.

 But as I said, that feels like a kludge. I'm just contemplating it
 because I want rehash to work properly. So I'd have two indexes:
 Items[string] stringKeys;
 Items[int] listKeys;
 This lets the compiler handle resizing the arrays properly, at the cost
 of doubly hashing the listKeys variable. This isn't too bad, as I expect
 all the lists to be 5 items or shorter, but it feels like a kludge.

 After a bit more thought, I couldn't use a hash function to merge the
 ints of the list, because if I did then any collision would cause the
 old value to be replaces...or would it if they didn't compare to equal?
 Which different items wouldn't, but the *keys* would compare to equal,
 so they probably would replace, which isn't at all what I desire. But if
 I just try to combine the ints, I can't guarantee unique
 keys...certainly not for any list longer than 2 items, as int.max * int
 would require a long, but adding a third term is a "can't do". BigInt
 defines opCmp, but doesn't seem to define toHash. So that seems to mean
 I need to convert the list to a string. UGH!!

 On the one hand, it allows me to get away with only using one hashtable,
 but on the other I've got this ugly key conversion...but maybe I won't
 need to convert both ways. But it's still ugly. Even the two table
 approach was nicer.

This is the approach I've decided to use, i.e., generate string keys 
from the list of ints.
/** Convert a list of non-negative ints into an invalid utf-8
  *  string.
  * \todo The generated string both begins and ends with 0xff ...
  * these aren't needed as a hash code, and could be trimmed.  Think
  * about removing them.	*/
string	lst2sbytes (int[] lst)
{  char[] retc;
    char[] num;
    retc	~=	0xff;
    foreach	(int i; lst)
    {  enforce (i >= 0, "only non-negative ints are allowed in an sbytes");
       int	val	=	i;
       num.length	=	0;
       while (val > 0)
       {  num	~=	val % 128;
          val	=	val / 128;
       }
       // N.B.:  the reverse is needed if you wish to decode the list
       // later, but is excess baggage when generating a hash code.
       // reverse (num);
       retc	~=	num ~ 0xff;
    }
    return	retc.idup;
}

A bit of a kludge, but it seems to work, and should be fairly fast and 
reasonably compact.  And it allows the components of the key to be any 
reasonable number.  (Well, reasonable for my purposes.  I don't want to 
handle anything but positive ints.  Even allowing 0 isn't really 
reasonable, it just doesn't cost anything.)

Nov 22 2012

D Programming

C/C++ Programming

Other

digitalmars.D.learn - hash design problem: both string and int[] keys