digitalmars.D - Fast hashtable

Andrei Alexandrescu (3/3) Feb 28 2017 This is of possible interest:

bauss (3/5) Feb 28 2017 That's really interesting.
deadalnix (8/13) Feb 28 2017 In which case you don't need powers of 2 either.
H. S. Teoh via Digitalmars-d (28/30) Feb 28 2017 Related to this, recently I found some interesting papers for large

Cecil Ward (19/56) Feb 28 2017 I liked that article. I didn't really understand the point about

Daniel Kozak (3/21) Mar 01 2017 Yes :D, this is something compiler should do.

Cecil Ward (5/32) Mar 02 2017 Does anyone know if the compilers could use this for code

deadalnix (2/14) Mar 01 2017 The lower slot will be twice as crowded as the higher ones.

Cecil Ward (4/22) Mar 02 2017 Sorry, I think I was unclear, I was suggesting the author should

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

This is of possible interest: 
https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- 
Andrei

Feb 28 2017

bauss <jj_1337 live.dk> writes:

On Tuesday, 28 February 2017 at 17:57:14 UTC, Andrei Alexandrescu 
wrote:
 This is of possible interest: 
 https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- Andrei

That's really interesting.

Feb 28 2017

deadalnix <deadalnix gmail.com> writes:

On Tuesday, 28 February 2017 at 17:57:14 UTC, Andrei Alexandrescu 
wrote:
 This is of possible interest: 
 https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- Andrei

 But let’s say you know that your hash function returns numbers 
 that are well distributed and that you’re rarely going to get 
 hash collisions even if you use powers of two.

In which case you don't need powers of 2 either.

ucent h = hash64(key);
ulong slot = (h * slotCount) >> 64;

And you get what you want, with no constraint on the number of 
slots. Note that on most architectures, this actually lowers to 
one mulhi operation, which is typically 3 cycles.

Feb 28 2017

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Feb 28, 2017 at 12:57:14PM -0500, Andrei Alexandrescu via Digitalmars-d
wrote:
 This is of possible interest:
 https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ --

Related to this, recently I found some interesting papers for large
external (i.e., on-disk) hash tables:

	http://www.itu.dk/people/pagh/papers/linear.pdf
	http://www.weizhewei.com/papers/pods020-wei.pdf

AIUI, blocked probing (1st paper) is similar to Robin Hood hashing, in
that inserting an entry may cause an existing entry to be moved out of
its occupied slot to a different one, but blocked probing also has
additional interesting characteristics:

- Scanning is done in blocks of size 2^i with starting slots of index
  2^i for incrementing i. This structure makes it cache-oblivious, and
  thus able to take advantage of the modern cache hierarchy without
  needing cache-specific tuning.

- Something special about the 2^i block sizes makes the math just work
  out (see 2nd paper), so that lookups have expected average 1 +
  1/2^Ω(b) I/O operations, where b is the block size (provided the load
  factor is bounded away from 1), which is a marked improvement over
  plain linear probing which has expected 1 + O(α/b) I/O operations,
  where α is the load factor.

I didn't look closely enough at the analysis to know for sure, but it
seems that since the analysis is cache-oblivious, the O(1 + 1/2^Ω(b))
I/O operations should also generalize to cache misses as well (as part
of the memory hierarchy, if you regard secondary storage as the lowest
level of the hierarchy).  So I'm expecting this might be even faster
than the Robin Hood hashing in your linked blog.


T

-- 
Sometimes the best solution to morale problems is just to fire all of the
unhappy people. -- despair.com

Feb 28 2017

Cecil Ward <d cecilward.com> writes:

On Tuesday, 28 February 2017 at 21:00:05 UTC, H. S. Teoh wrote:
 On Tue, Feb 28, 2017 at 12:57:14PM -0500, Andrei Alexandrescu 
 via Digitalmars-d wrote:
 This is of possible interest: 
 https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ --

 Related to this, recently I found some interesting papers for 
 large external (i.e., on-disk) hash tables:

 	http://www.itu.dk/people/pagh/papers/linear.pdf
 	http://www.weizhewei.com/papers/pods020-wei.pdf

 AIUI, blocked probing (1st paper) is similar to Robin Hood 
 hashing, in that inserting an entry may cause an existing entry 
 to be moved out of its occupied slot to a different one, but 
 blocked probing also has additional interesting characteristics:

 - Scanning is done in blocks of size 2^i with starting slots of 
 index
   2^i for incrementing i. This structure makes it 
 cache-oblivious, and
   thus able to take advantage of the modern cache hierarchy 
 without
   needing cache-specific tuning.

 - Something special about the 2^i block sizes makes the math 
 just work
   out (see 2nd paper), so that lookups have expected average 1 +
   1/2^Ω(b) I/O operations, where b is the block size (provided 
 the load
   factor is bounded away from 1), which is a marked improvement 
 over
   plain linear probing which has expected 1 + O(α/b) I/O 
 operations,
   where α is the load factor.

 I didn't look closely enough at the analysis to know for sure, 
 but it seems that since the analysis is cache-oblivious, the 
 O(1 + 1/2^Ω(b)) I/O operations should also generalize to cache 
 misses as well (as part of the memory hierarchy, if you regard 
 secondary storage as the lowest level of the hierarchy).  So 
 I'm expecting this might be even faster than the Robin Hood 
 hashing in your linked blog.


 T

I liked that article. I didn't really understand the point about 
implementation of modulo primes, maybe I missed something. Given 
that our man is doing modulo a 'known' value (he had a switch 
statement to get to them), why not do something rather cheaper 
than a compiler-expanded constant div/mod made up of multiplies 
and shifts

     const uint power2 = 512; // say, some 1 << n anyway
     const uint prime = 509; // some prime just below the power, 
some prime > power2/2

     static assert( power2 - 1 - prime < prime );

     x = x & ( power2 - 1 );
     x = ( x >= prime ) ? x - prime : x;

which is good news on my x86 with GDC -O3 (only 3 operations, and 
sub cmovx ) - all well provided you make sure that you are 
getting CMOVx not branches. I could work out the power from the 
prime using CTFE given a bit of thought. Maybe CTFE could even do 
the reverse?

Have I finally gone mad?

Feb 28 2017

Daniel Kozak <kozzi11 gmail.com> writes:

On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
 I liked that article. I didn't really understand the point 
 about implementation of modulo primes, maybe I missed 
 something. Given that our man is doing modulo a 'known' value 
 (he had a switch statement to get to them), why not do 
 something rather cheaper than a compiler-expanded constant 
 div/mod made up of multiplies and shifts

     const uint power2 = 512; // say, some 1 << n anyway
     const uint prime = 509; // some prime just below the power, 
 some prime > power2/2

     static assert( power2 - 1 - prime < prime );

     x = x & ( power2 - 1 );
     x = ( x >= prime ) ? x - prime : x;

 which is good news on my x86 with GDC -O3 (only 3 operations, 
 and sub cmovx ) - all well provided you make sure that you are 
 getting CMOVx not branches. I could work out the power from the 
 prime using CTFE given a bit of thought. Maybe CTFE could even 
 do the reverse?

 Have I finally gone mad?

Yes :D, this is something compiler should do.

btw: https://github.com/dlang/phobos/pull/1452

Mar 01 2017

Cecil Ward <d cecilward.com> writes:

On Wednesday, 1 March 2017 at 09:45:23 UTC, Daniel Kozak wrote:
 On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
 I liked that article. I didn't really understand the point 
 about implementation of modulo primes, maybe I missed 
 something. Given that our man is doing modulo a 'known' value 
 (he had a switch statement to get to them), why not do 
 something rather cheaper than a compiler-expanded constant 
 div/mod made up of multiplies and shifts

     const uint power2 = 512; // say, some 1 << n anyway
     const uint prime = 509; // some prime just below the 
 power, some prime > power2/2

     static assert( power2 - 1 - prime < prime );

     x = x & ( power2 - 1 );
     x = ( x >= prime ) ? x - prime : x;

 which is good news on my x86 with GDC -O3 (only 3 operations, 
 and sub cmovx ) - all well provided you make sure that you are 
 getting CMOVx not branches. I could work out the power from 
 the prime using CTFE given a bit of thought. Maybe CTFE could 
 even do the reverse?

 Have I finally gone mad?

 Yes :D, this is something compiler should do.

 btw: https://github.com/dlang/phobos/pull/1452

Does anyone know if the compilers could use this for code 
generation? I did later CTFE the prime from the power, can do the 
reverse more easily which is the way compilers would need to do 
it for division by known x.

Mar 02 2017

deadalnix <deadalnix gmail.com> writes:

On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
     const uint power2 = 512; // say, some 1 << n anyway
     const uint prime = 509; // some prime just below the power, 
 some prime > power2/2

     static assert( power2 - 1 - prime < prime );

     x = x & ( power2 - 1 );
     x = ( x >= prime ) ? x - prime : x;

 which is good news on my x86 with GDC -O3 (only 3 operations, 
 and sub cmovx ) - all well provided you make sure that you are 
 getting CMOVx not branches. I could work out the power from the 
 prime using CTFE given a bit of thought. Maybe CTFE could even 
 do the reverse?

 Have I finally gone mad?

The lower slot will be twice as crowded as the higher ones.

Mar 01 2017

Cecil Ward <d cecilward.com> writes:

On Wednesday, 1 March 2017 at 12:59:30 UTC, deadalnix wrote:
 On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
     const uint power2 = 512; // say, some 1 << n anyway
     const uint prime = 509; // some prime just below the 
 power, some prime > power2/2

     static assert( power2 - 1 - prime < prime );

     x = x & ( power2 - 1 );
     x = ( x >= prime ) ? x - prime : x;

 which is good news on my x86 with GDC -O3 (only 3 operations, 
 and sub cmovx ) - all well provided you make sure that you are 
 getting CMOVx not branches. I could work out the power from 
 the prime using CTFE given a bit of thought. Maybe CTFE could 
 even do the reverse?

 Have I finally gone mad?

 The lower slot will be twice as crowded as the higher ones.

Sorry, I think I was unclear, I was suggesting the author should 
use modulo the prime. The power of two is irrelevant, it's just a 
quick(er?) way of computing modulo. Are we on the same wavelength?

Mar 02 2017

D Programming

C/C++ Programming

Other

digitalmars.D - Fast hashtable