digitalmars.D - "Faster than Rust and C++: the PERFECT hash table"

Christopher Katko (9/9) Dec 10 2023 Has anyone seen this video?

Siarhei Siamashka (16/19) Dec 11 2023 The "faster than Rust" achievement doesn't mean much in reality,

Christopher Katko (36/40) Dec 11 2023 Holy crap! I was thinking about this kind of thing a few months

Christopher Katko (17/17) Dec 11 2023 Hmm, so it looks like a part of what I'm thinking of, is called a

Basile.B (5/11) Dec 11 2023 To fork on that subject... I was very

Siarhei Siamashka (4/7) Dec 11 2023 If the whole set of the string identifiers to be parsed is known

Basile.B (4/12) Dec 11 2023 Sorry but that sound a bit like "let's rely on a library, let's
Basile B. (15/23) Jan 05 2024 Sorry for the inaccuracy. This is part of what a compiler uses to

Christopher Katko <ckatko gmail.com> writes:

Has anyone seen this video?

https://www.youtube.com/watch?v=DMQ_HcNSOAI

additional reddit discussion:

https://www.reddit.com/r/rust/comments/11r26x1/faster_than_rust_and_c_the_perfect_hash_table/

Custom hash aside, I had no idea about gperf or that it was so 
blazing fast.

  I wonder how easy it would be to slap it into D to test. Do you 
just override the toHash function in your objects or is there 
more to it?

Dec 10 2023

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Monday, 11 December 2023 at 03:25:15 UTC, Christopher Katko 
wrote:
  I wonder how easy it would be to slap it into D to test. Do 
 you just override the toHash function in your objects or is 
 there more to it?

The "faster than Rust" achievement doesn't mean much in reality, 
because the hash function in Rust is DoS-resistant by default. 
There's an old bug about a related issue in D: 
https://issues.dlang.org/show_bug.cgi?id=7179

D language lacks flexibility in this area and replacing the 
built-in associative arrays hash function doesn't seem like an 
easy task. Moreover, there's now the 
https://dlang.org/changelog/2.106.0.html#dmd.static-assoc-array 
feature and CTFE makes everything even more tricky. Reminds me of 
the old prelink vs. ASLR dilemma: https://lwn.net/Articles/190139/

One the other hand, it's possible to leave the built-in 
associative arrays alone and design a custom container class with 
a custom hash function. Tailored either for maximum performance 
or for DoS-resistance.

Dec 11 2023

Christopher Katko <ckatko gmail.com> writes:

 If the whole set of the string identifiers to be parsed is 
 known at compile time, then another approach for mapping them 
 to numbers or whatever else is to use something like 
 https://re2c.org

Holy crap! I was thinking about this kind of thing a few months 
ago. My situation is this:

  - I have textures (filename = string texturename) listed in a 
manfiest.json file.
  - all textures are loaded and placed into an texture[string] 
associative array
  - they're used in code that ends up being like 
drawBitmap(bitmaps["grass"], ...)

It lets me immediately add a new texture, add one line to the 
manifest, and I can call it from my code. (I could do filename = 
stringname automatically but that doesn't let me do things like 
pack textures into atlases to reduce texture swaps so that kind 
of automatic filename scanning isn't applicable.)

Noting that:
  - technically, a finite list of all potentially used texture 
names in code should be knowable to the compiler.
  - the list does not change at run time.

I had thought about trying some sort of D-based compile-time, or, 
manual tool-based lexing to build say, an enum, and replacing all 
those names from "grass01" to GRASS01=17.

But I was also wondering if simply getting a faster container 
would improve things. Because one flaw there is if I ever support 
mods, then I __do not__ at compile time know the total maximum 
set of texture name entries.

Still, once all my texture entries are received they don't change 
after loading (even if I load-in assets based on scene 
requirements, the names will never swap/reorder. Even if grass01 
texture isn't loaded, it can still have the name grass01 
reserved. There are no key deletions or changes once set. So 
grass01 could simply be a null pointer until loaded.)

Awhile back, during some profiling, D Associative Arrays were 
__way__ higher up in total CPU usage in my game than they should 
have been compared to the act of drawing entire textures. Like in 
the top 7. BUT, as I'm thinking right now, I'm wondering if now 
the profiling measuring code itself was inflating those AA 
accesses (a tiny real code compared to overhead of logging code).

Dec 11 2023

Christopher Katko <ckatko gmail.com> writes:

Hmm, so it looks like a part of what I'm thinking of, is called a 
"perfect hash function".

https://en.wikipedia.org/wiki/Perfect_hash_function

And I probably could still generate one, based on known a 
load-time constant such as all the names/set of all textures in 
game+[loaded mods].

I wonder if using const strings keys (e.g. "grass01") in my code 
is a significant slowdown compared to an int key. I should 
probably do some benchmarks. Because if they are, maybe I could 
do some tokenization pass converting all strings into ints. But 
if I'm adding an entire parsing copy-paste step to my makefile, 
it better be for a significant improvement to justify the 
complexity, maintainability, etc.

And of course, what to do about modding adding stuff without 
recompiling.

I'm just looking into what all my options are given various 
constraints.

Dec 11 2023

Basile.B <b2.temp gmx.com> writes:

On Monday, 11 December 2023 at 03:25:15 UTC, Christopher Katko 
wrote:
 Has anyone seen this video?

 https://www.youtube.com/watch?v=DMQ_HcNSOAI

 additional reddit discussion:

 https://www.reddit.com/r/rust/comments/11r26x1/faster_than_rust_and_c_the_perfect_hash_table/

 Custom hash aside, I had no idea about gperf or that it was so 
 blazing fast.

To fork on that subject... I was very 
"gperf-like-perfecthash-friendly" for years but nowadays I use 
[lookuptables](https://gitlab.com/styx-lang/styx/-/blob/master/src/styx/string_switch
sx?ref_type=heads). Very fast too. You have Trie or suffix arrays that are very
fast too but they tend not to be memory friendly.

Dec 11 2023

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Monday, 11 December 2023 at 11:36:10 UTC, Basile.B wrote:
 To fork on that subject... I was very 
 "gperf-like-perfecthash-friendly" for years but nowadays I use 
 [lookuptables](https://gitlab.com/styx-lang/styx/-/blob/master/src/styx/string_switch
sx?ref_type=heads). Very fast too. You have Trie or suffix arrays that are very
fast too but they tend not to be memory friendly.

If the whole set of the string identifiers to be parsed is known 
at compile time, then another approach for mapping them to 
numbers or whatever else is to use something like https://re2c.org

Dec 11 2023

Basile.B <b2.temp gmx.com> writes:

On Monday, 11 December 2023 at 12:37:56 UTC, Siarhei Siamashka 
wrote:
 On Monday, 11 December 2023 at 11:36:10 UTC, Basile.B wrote:
 To fork on that subject... I was very 
 "gperf-like-perfecthash-friendly" for years but nowadays I use 
 [lookuptables](https://gitlab.com/styx-lang/styx/-/blob/master/src/styx/string_switch
sx?ref_type=heads). Very fast too. You have Trie or suffix arrays that are very
fast too but they tend not to be memory friendly.

 If the whole set of the string identifiers to be parsed is 
 known at compile time, then another approach for mapping them 
 to numbers or whatever else is to use something like 
 https://re2c.org

Sorry but that sound a bit like "let's rely on a library, let's 
not know how that works" ;)

Dec 11 2023

Basile B. <b2.temp gmx.com> writes:

On Monday, 11 December 2023 at 12:37:56 UTC, Siarhei Siamashka 
wrote:
 On Monday, 11 December 2023 at 11:36:10 UTC, Basile.B wrote:
 To fork on that subject... I was very 
 "gperf-like-perfecthash-friendly" for years but nowadays I use 
 [lookuptables](https://gitlab.com/styx-lang/styx/-/blob/master/src/styx/string_switch
sx?ref_type=heads). Very fast too. You have Trie or suffix arrays that are very
fast too but they tend not to be memory friendly.

 If the whole set of the string identifiers to be parsed is 
 known at compile time, then another approach for mapping them 
 to numbers or whatever else is to use something like 
 https://re2c.org

Sorry for the inaccuracy. This is part of what a compiler uses to 
generate the code to switch over a string. This is not used at 
run-time.

Generated code is more like: https://godbolt.org/z/Y54bKeWYe

What is used at run-time is: 
https://gitlab.com/styx-lang/styx/-/blob/master/library/rtl.sx?ref_type=heads#L426

In fine you have decimated a string into a unique number and you 
just take the path of a normal switch. The decimation is not very 
different of hashing. Just it's guaranteed to give unique indexes 
for a set of unique strings.

I believe D uses a different strategy for that. Binary search 
IIRC. not quite sure.

Anyway just to clarify.

Jan 05 2024

D Programming

C/C++ Programming

Other

digitalmars.D - "Faster than Rust and C++: the PERFECT hash table"