
digitalmars.D.learn - Re: Help optimize D solution to phone encoding problem: extremely

reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
 Ok, last time I'm running this for someone else :D
 
 ```
 Proc,Run,Memory(bytes),Time(ms)
 ===> ./rust
 ./rust,23920640,30
 ./rust,24018944,147
 ./rust,24068096,592
 ./rust,24150016,1187
 ./rust,7766016,4972
 ./rust,8011776,46101
 ===> src/d/dencoder
 src/d/dencoder,44154880,42
 src/d/dencoder,51347456,87
 src/d/dencoder,51380224,273
 src/d/dencoder,51462144,441
 src/d/dencoder,18644992,4414
 src/d/dencoder,18710528,43548
 ```
OK, this piqued my interest enough that I decided to install rust using rustup instead of my distro's package manager. Here are the numbers I got for my machine:

```
===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867
```

Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this:

```
===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471
```

Notice the significant discrepancy between the two runs; this seems to show that the benchmark is only accurate to within about ±1.5 seconds.

Anyway, oddly enough, Java seems to beat Rust on larger inputs. Maybe my Java compiler has a better JIT implementation? :-P
 Congratulations on beating Rust :D but remember: you're using a much
 more efficient algorithm! I must conclude that the Rust translation of
 the Trie algorithm would be much faster still, unfortunately (you may
 have noticed that I am on D's side here!).
At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal.

Surprisingly, at least on my machine, Lisp actually performed the worst. I'd have thought it would at least beat Java, but I was quite wrong. :-D Perhaps the Lisp implementation I'm using is suboptimal, I don't know. Or perhaps modern JVMs have really overtaken Lisp.

Now I'm really curious how a Rust version of the trie algorithm would perform. Unfortunately I don't know Rust, so I wouldn't be able to write it myself. (Hint, hint, nudge, nudge ;-)).

As far as the performance of my D version is concerned, I still haven't squeezed out all the performance I could yet. Going into this, my intention was to take the lazy way of optimizing only what the profiler points out to me, with the slight ulterior motive of proving that a relatively small amount of targeted optimizations can go a long way toward making the GC a non-problem in your typical D code. ;-) I haven't pulled out all the optimization guns at my disposal yet.

If I were to go the next step, I'd split up the impl() function so that I get a better profile of where it's spending most of its time, and then optimize that. My current suspicion is that the traversal of the trie could be improved by caching intermediate results to eliminate a good proportion of recursive calls in impl().

Also, the `print` mode of operation is quite slow, probably because writefln() allocates. (It allocates less than if I had used .format like I did before, but it nevertheless still allocates.) To alleviate this cost, I'd allocate an output buffer and write to that, flushing only once it filled up.

Another thing I could do is to use std.parallelism.parallel to run searches on batches of phone numbers in parallel. This is kinda cheating, though, since it's the same algorithm with the same cost; we're just putting more CPU cores to work. :-P But in D this is quite easy to do, often as easy as simply adding .parallel to your outer foreach loop. In this particular case it will need some additional refactoring, due to the fact that the input is being read line by line. But it's relatively easy to load the input into a buffer by chunks instead, and just run the searches on all the numbers found in the buffer in parallel.

On Thu, Jan 18, 2024 at 04:25:45PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
 BTW here's you main function so it can run on the benchmark:
[...]

Thanks, I've adapted my code accordingly and pushed it to my github repo.

T

-- 
This is a tpyo.
Jan 18
parent reply Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:
 On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via 
 Digitalmars-d-learn wrote: [...]
 Ok, last time I'm running this for someone else :D
 
 ```
 Proc,Run,Memory(bytes),Time(ms)
 ===> ./rust
 ./rust,23920640,30
 ./rust,24018944,147
 ./rust,24068096,592
 ./rust,24150016,1187
 ./rust,7766016,4972
 ./rust,8011776,46101
 ===> src/d/dencoder
 src/d/dencoder,44154880,42
 src/d/dencoder,51347456,87
 src/d/dencoder,51380224,273
 src/d/dencoder,51462144,441
 src/d/dencoder,18644992,4414
 src/d/dencoder,18710528,43548
 ```
 OK, this piqued my interest enough that I decided to install 
 rust using rustup instead of my distro's package manager. Here 
 are the numbers I got for my machine:
 
 ===> ./rust
 ./rust,22896640,35
 ./rust,22896640,137
 ./rust,22384640,542
 ./rust,22896640,1034
 ./rust,8785920,2489
 ./rust,8785920,12157
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,72
 src/d/dencoder,1066799104,198
 src/d/dencoder,1066799104,344
 src/d/dencoder,1035292672,2372
 src/d/dencoder,1035292672,13867
 
 Looks like we lost out to Rust for larger inputs. :-D Probably 
 due to environmental factors (and the fact that std.stdio is 
 slow). I re-ran it again and got this:
 
 ===> ./rust
 ./rust,22896640,30
 ./rust,22896640,131
 ./rust,22896640,511
 ./rust,22896640,983
 ./rust,8785920,3102
 ./rust,8785920,9912
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,71
 src/d/dencoder,1066799104,197
 src/d/dencoder,1066799104,355
 src/d/dencoder,1035292672,3441
 src/d/dencoder,1035292672,9471
 
 Notice the significant discrepancy between the two runs; this 
 seems to show that the benchmark is only accurate to within 
 about ±1.5 seconds.
That's not correct. The discrepancy is due to the benchmark always generating different input on each run - and the characteristics of the input affect runs significantly. This affects the last two runs much more, as they use the more challenging dictionary.

The important thing is that the relative performance between languages is reliable. If you stop re-generating the phone numbers and just run the benchmark multiple times on the same input, you'll see the runs are very close.
 Anyway, oddly enough, Java seems to beat Rust on larger inputs. 
  Maybe my Java compiler has a better JIT implementation? :-P
Again, in case I haven't made it clear by repeating this multiple times: The Java code is using a Trie, the Rust code is using a numeric solution. They are completely different algorithms. Much more important than which language is being used is the algorithm, as has been shown again and again.
 Congratulations on beating Rust :D but remember: you're using 
 a much more efficient algorithm! I must conclude that the Rust 
 translation of the Trie algorithm would be much faster still, 
 unfortunately (you may have noticed that I am on D's side 
 here!).
At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal.
Java has been very fast for over a decade now, this is not new at all.
 Surprisingly, at least on my machine, Lisp actually performed 
 the worst. I'd have thought it would at least beat Java, but I 
 was quite wrong. :-D Perhaps the Lisp implementation I'm using 
 is suboptimal, I don't know. Or perhaps modern JVMs have really 
 overtaken Lisp.
Please don't compare different algorithms in different languages and make conclusions about each language's speed.
 Now I'm really curious how a Rust version of the trie algorithm 
 would perform.  Unfortunately I don't know Rust so I wouldn't 
 be able to write it myself. (Hint, hint, nudge, nudge ;-)).

 As far as the performance of my D version is concerned, I still 
 haven't squeezed out all the performance I could yet.  Going 
 into this, my intention was to take the lazy way of optimizing 
 only what the profiler points out to me, with the slight 
 ulterior motive of proving that a relatively small amount of 
 targeted optimizations can go a long way at making the GC a 
 non-problem in your typical D code. ;-)  I haven't pulled out 
 all the optimization guns at my disposal yet.
You don't really have to... ssvb's solution is incredibly fast at the "count" problem and I really don't think anyone can beat that implementation. The only problem is that the implementation is very C-like and nothing like the D I would write.
 If I were to go the next step, I'd split up the impl() function 
 so that I get a better profile of where it's spending most of 
 its time, and then optimize that.  My current suspicion is that 
 the traversal of the trie could be improved by caching 
 intermediate results to eliminate a good proportion of 
 recursive calls in impl().

 Also, the `print` mode of operation is quite slow, probably 
 because writefln() allocates. (It allocates less than if I had 
 used .format like I did before, but it nevertheless still 
 allocates.) To alleviate this cost, I'd allocate an output 
 buffer and write to that, flushing only once it filled up.

 Another thing I could do is to use std.parallelism.parallel to 
 run searches on batches of phone numbers in parallel. This is 
 kinda cheating, though, since it's the same algorithm with the 
 same cost, we're just putting more CPU cores to work. :-P  But 
 in D this is quite easy to do, often as easy as simply adding 
 .parallel to your outer foreach loop. In this particular case 
 it will need some additional refactoring due to the fact that 
 the input is being read line by line. But it's relatively easy 
 to load the input into a buffer by chunks instead, and just run 
 the searches on all the numbers found in the buffer in parallel.
I don't have any interest in doing that, because what I was trying to accomplish was to compare the exact same algorithm in different languages, to the extent possible. This is why I ported the Common Lisp code to D, then tried to optimize from there, just like I did for Rust. I find the numeric implementation much more "valuable" for comparing performance than the Trie solution, because it exercises quite a few things in the language that are common to most programs, like hashing, recursion, and "search" in general.

Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread; it would be a huge shame to lose all that!
Jan 19
next sibling parent reply Renato <renato athaydes.com> writes:
 Looks like we lost out to Rust for larger inputs. :-D  
 Probably due to environmental factors (and the fact that 
 std.stdio is slow).  I re-ran it again and got this:

 ===> ./rust
 ./rust,22896640,30
 ./rust,22896640,131
 ./rust,22896640,511
 ./rust,22896640,983
 ./rust,8785920,3102
 ./rust,8785920,9912
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,71
 src/d/dencoder,1066799104,197
 src/d/dencoder,1066799104,355
 src/d/dencoder,1035292672,3441
 src/d/dencoder,1035292672,9471
I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.

This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there are small differences in the implementations - they are different languages, after all - but not enough, IMO, to justify anything like a 3x difference from Rust.

If you really want to show me that D is as fast as Rust, please spend time on my solution (because that's a direct algorithmic comparison to the Rust implementation) and remove any unnecessary overhead from it (without changing the overall algorithm, of course - unless that's to make it closer to the Rust implementation).
 Do you know why the whole thread seems to have disappeared?? 
 There's a lot of good stuff in the thread, it would be a huge 
 shame to lose all that!
Nevermind, looks like I was unable to open any other answers in this thread while writing my own answer!?
Jan 19
parent reply evilrat <evilrat666 gmail.com> writes:
On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and it 
 consistently beats the Rust numeric algorithm (which means it's 
 still faster than your D solution), but the Java version that's 
 equivalent to Rust's implementation is around 3x slower... i.e. 
 it runs at about the same speed as my current fastest numeric 
 algorithm in D as well.

 This is what I would like to be discussing in this thread: why 
 is D running at Java speeds and not at D speeds when using the 
 same algorithm? I know there's small differences in the 
 implementations, they are different languages after all, but 
 not enough IMO to justify anything like 3x difference from Rust.
My guess is that it's because int128 is not that well optimized, due to being a not-so-popular type; answering what's wrong would require looking at the assembly code produced for both D and Rust.

Additionally, if you are comparing D by measuring DMD performance - don't. It is valuable during development for fast iterations, but it lacks many modern optimization techniques; for that we have LDC and GDC.
Jan 19
next sibling parent Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.

 This is what I would like to be discussing in this thread: why 
 is D running at Java speeds and not at D speeds when using the 
 same algorithm? I know there's small differences in the 
 implementations, they are different languages after all, but 
 not enough IMO to justify anything like 3x difference from 
 Rust.
My guess is that's because int128 is not that much optimized due to being not so popular type, though to answer what's wrong would require to look at assembly code produced for both D and Rust. Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I am not using int128 anymore; I explained why a few posts back. I am using a byte array and computing the hash incrementally when trying different inputs, so that partially computed hashes are re-used on each try (this is a bit of cheating, as Rust is not doing that, but I consider it acceptable as it's still computing hashes and looking up entries in the associative array).

I used all D compilers and picked the fastest one (GDC in the case of int128, but LDC2 in the current case).
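A minimal sketch of the incremental-hashing idea, assuming an FNV-1a-style step (hypothetical names chosen for illustration; this is not the actual dencoder code):

```d
import std.stdio;

// Sketch: the hash of a key prefix is extended one digit at a time,
// so each longer candidate key reuses the work already done for its
// prefix instead of rehashing the whole key from scratch.
// (FNV-1a step used here purely as an example hash function.)
ulong extendHash(ulong prefixHash, ubyte digit)
{
    enum ulong fnvPrime = 1_099_511_628_211UL;
    return (prefixHash ^ digit) * fnvPrime;
}

void main()
{
    enum ulong fnvOffset = 14_695_981_039_346_656_037UL;

    // Hash the prefixes "5", "56", "562" - each step is O(1),
    // not O(prefix length).
    ulong h = fnvOffset;
    foreach (ubyte d; [ubyte(5), ubyte(6), ubyte(2)])
    {
        h = extendHash(h, d);
        writeln(h);
    }
}
```

The trade-off discussed in this thread is exactly this: each lookup reuses a stored partial hash, at the cost of keeping that hash alongside the key.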
Jan 19
prev sibling parent reply Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.
Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.

If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.

Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed, as it doesn't necessarily "print" anything like it did earlier) - the rest of the code almost doesn't matter.

Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```
Jan 19
next sibling parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Jan 19, 2024 at 01:40:39PM +0000, Renato via Digitalmars-d-learn wrote:
 On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
 Additionally if you comparing D by measuring DMD performance -
 don't.  It is valuable in developing for fast iterations, but it
 lacks many modern optimization techniques, for that we have LDC and
 GDC.
I tried with DMD again, and yeah, it's much slower.
For anything where performance is even remotely important, I wouldn't even consider DMD. It's a well-known fact that it produces suboptimal executables; its only redeeming quality is its fast turnaround time. If fast turnaround is not important, I would always use LDC or GDC instead.
 Here's the [current implementation in
 D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
 and the roughly [equivalent Rust
 implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).
Taking a look at this code:

One of the most important things I found is that every call to printTranslations allocates a new array (`auto keyValue = new ubyte[...];`). Given the large number of recursions involved in this algorithm, this will add up to quite a lot. If I were optimizing this code, I'd look into ways of reducing, if not eliminating, this allocation. Observe that this allocation is needed each time printTranslations recurses, so instead of making separate allocations, you could put it on a stack: either with alloca, or with my appendPath() trick in my version of the code - preallocate a reasonably large buffer and take slices of it each time you need a new keyValue array.

Secondly, your hash function looks suspicious. Why are you chaining your hash per digit? That's a lot of hash computations. Shouldn't you just hash the entire key each time? That would eliminate the need to store a custom hash in your key; you could just look up the entire key at once.

Next, what stood out is ISolutionHandler. If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection. I'd just return a delegate instead (single indirection, no object lookup). This is a relatively small matter, but when it's being used inside a hot inner loop, it could be important.

Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine. Take a look at the implementation of Array and you'll see lots of function calls and locks and GC root-adding and all that stuff. Most of it doesn't apply here, of course, and is compiled out. Nevertheless, it uses wrapped integer operations and range checks, etc. Again, these are all minor issues, but in a hot inner loop they do add up. Built-in arrays let you literally just bump the pointer when adding an element - just a couple of instructions, as opposed to several function calls. An important difference when you're on the hot path.

Now, as I mentioned earlier w.r.t. my own code, appending to built-in arrays comes with a cost. So here's where you'd optimize by creating your own buffer and custom push/pop operations. Something like appendPath() in my version of the code would do the job.

Finally, a very small point: in loadDictionary, you do an AA lookup with `n in result`, and then if that returns null, you create a new entry. This does two AA lookups: once unsuccessfully, and a second time to insert the missing key. You could use the .require operation with a delegate instead of `in` followed by `if (... is null)`, which only requires a single lookup. Probably not an important point, but for a large dictionary this might make a difference.
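A minimal sketch of the `.require` point, using a toy dictionary (not the actual loadDictionary code):

```d
import std.stdio;

void main()
{
    string[][string] dict;

    // Two-lookup pattern: a failed `in` probe followed by an insert.
    // if (auto p = "5624" in dict) *p ~= "java";
    // else dict["5624"] = ["java"];

    // Single-lookup pattern: .require returns a ref to the existing
    // value, or inserts (and returns) the lazily evaluated default
    // when the key is missing - here the default is an empty array.
    dict.require("5624") ~= "java";
    dict.require("5624") ~= "kava";

    writeln(dict["5624"]); // ["java", "kava"]
}
```

The first `.require` call inserts an empty entry and appends to it; the second finds the same entry with a single probe.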
 The only "significant" difference is that in Rust, an enum
 `WordOrDigit` is used to represent currently known "words"... I [did
 try using that in
 D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
 but it made the algorithm slower.
 
 If you see anything in D that's not as efficient as it should be, or
 somehow "inferior" to what the Rust version is doing , please let me
 know.
Couldn't tell you, I don't know Rust. :-D
 Notice that almost all of the time is spent in the for-loop inside
 `printTranslations` (which is misnamed as it doesn't necessarily
 "print" anything, like it did earlier) - the rest of the code almost
 doesn't matter.
[...]

Of course, that's where your hot path is. And that loop makes recursive calls to printTranslations, so the entire body of the function could use some optimization. ;-)

Try addressing the points I wrote above and see if it makes a difference.

T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL
Jan 19
prev sibling next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote:
 On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.
Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.

If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.

Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed, as it doesn't necessarily "print" anything like it did earlier) - the rest of the code almost doesn't matter.

Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```
You do a hash map lookup for every character in D, which is slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.

Yet another reason to advocate for pattern matching in D and switch as an expression.
Jan 19
parent reply evilrat <evilrat666 gmail.com> writes:
On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching


 Yet another reason to advocate for pattern matching in D and 
 switch as expression
There is another important difference: I quickly looked up the D associative array implementation, and the search looks like n·log(n) complexity with a plain loop iteration over hashes, whereas Rust's implementation uses the "swisstable" algorithm, which has packed SIMD-optimized lookups; this is likely where the 3x speed difference comes from.

I tried to look up the Rust implementation, but it is SOOO generic that I was unable to decipher it to find the actual key searches and stores. Anyway, here is an interesting article about the Rust implementation: https://faultlore.com/blah/hashbrown-tldr/
Jan 19
parent ryuukk_ <ryuukk.dev gmail.com> writes:
On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching


 Yet another reason to advocate for pattern matching in D and 
 switch as expression
There is another important difference, i quickly looked up D associative array implementation and the search looks like nlog(n) complexity with plain loop iteration of hashes, whereas rust's implementation is using "swisstable" algorithm implementation that has packed SIMD optimized lookups, this is likely where the 3x speed difference comes from. Tried to look up rust implementation and it is SOOO generic that I was unable to decipher it to find the actual key search and stores. Anyway here is an interesting article about rust implementation https://faultlore.com/blah/hashbrown-tldr/
I'm not talking about the difference between the hash map implementations, but about the difference between the algorithms used to look up the characters:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134

vs

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49

If D had pattern matching and switch as an expression, the faster method would be:

1. the most obvious choice
2. the fastest by default
3. the cleanest

To save himself from having to write an old-school verbose `switch`, I suspect he went with a hash map, which is slower in that case. Hence why I keep advocating for that feature - or, I should say, that upgrade to `switch` - which Java has adopted, as well as Rust:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168
Jan 19
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 Taking a look at this code:
 ...
 Try addressing the points I wrote above and see if it makes a
 difference.
I have tried it (all of it) even before you wrote it here, because I had exactly the same ideas, but to be fair it has almost zero effect on speed.

Here is my version (it still uses OOP, but I have tried it with Printer and Counter as structs and it has no effect at all): https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY

The only difference in speed in the end is caused by the hash implementation of dlang associative arrays vs. Rust's HashMap; actually, if you modify the Rust version to not use ahash, it has almost the same speed as D.
Jan 19
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn
wrote:
[...]
    > Try addressing the points I wrote above and see if it makes a
    > difference.
 
    I have tried it (all of it) even before you wrote it here, because
    I have completely the same ideas, but to be fair it has almost zero
    effect on speed.
    There is my version (It still use OOP, but I have try it wit
    Printer and Counter to be structs and it has no effect at
    all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
    The only difference in speed in the end is caused by hash
    implementation of dlang associative arrays and rust HashMap,
    actually if you modify rust to not used ahash it has almost same
    speed as D
[...]

I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.

I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly. Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost.

T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Jan 19
next sibling parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
 On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...]
    > Try addressing the points I wrote above and see if it makes a
    > difference.

    I have tried it (all of it) even before you wrote it here, because
    I have completely the same ideas, but to be fair it has almost zero
    effect on speed.
    There is my version (It still use OOP, but I have try it wit
    Printer and Counter to be structs and it has no effect at
    all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
    The only difference in speed in the end is caused by hash
    implementation of dlang associative arrays and rust HashMap,
    actually if you modify rust to not used ahash it has almost same
    speed as D
[...] I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.
Hash which key? The problem with the naive hashtable-based algorithm is that we don't know the length of the key, so all lengths are tried, starting from 1 and going up to the length of the remaining part of the phone number.

An additional optimization would be to limit this search and terminate it early. For example, stop after we exceed the length of the longest word in the dictionary. Or stop the search upon encountering certain "leaf" words - but this effectively morphs the algorithm into something that partially resembles a Trie, and evolving it further we may end up with a full-fledged Trie. The only practical value of this algorithm is that it's easy to implement and has a low LOC count.

My initial non-optimized naive version was also similar: https://gist.github.com/ssvb/abe821b3cdba7fcb7f3c3cecde864153 (66 lines including blank lines and some comments). This is somewhat comparable to the Lisp results https://flownet.com/ron/papers/lisp-java/raw-results.html or to Peter Norvig's solution http://www.norvig.com/java-lisp.html

The purpose of the initial "prototype" implementation is just to have some sort of starting point. If runtime performance doesn't matter, then we are already done. But if runtime performance does matter, then a lot of things can be rapidly improved by spending a bit of time on it. Eventually one runs out of ideas and further optimization efforts yield only diminishing returns; that's a good time to clean up the code, add comments, etc. and label it the "final" version.

I think that a good study intending to compare different programming languages against each other could take all these factors into consideration: time to implement the prototype, runtime performance of the prototype, time to implement the final version, runtime performance of the final version, the number of lines of code, and its readability/maintainability.
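The length-search described above can be sketched as follows (`dict`, `longestWord`, and `candidates` are illustrative names, not taken from any of the actual implementations):

```d
// Naive hashtable lookup for the phone-encoding problem: the key length
// is unknown, so try every prefix of the remaining digits, capped by the
// longest dictionary word (the early-termination idea described above).
import std.algorithm : min;

string[][string] dict;   // digit string -> words that encode to it
size_t longestWord;      // length of the longest dictionary word

string[][] candidates(string digits)
{
    string[][] found;
    foreach (len; 1 .. min(digits.length, longestWord) + 1)
    {
        // One hash lookup per candidate prefix length
        if (auto words = digits[0 .. len] in dict)
            found ~= *words;
    }
    return found;
}

void main()
{
    dict = ["2": ["foo"], "22": ["bar"]];
    longestWord = 2;
    assert(candidates("225") == [["foo"], ["bar"]]);
}
```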
Jan 20
prev sibling parent reply Renato <renato athaydes.com> writes:
Wow, fantastic feedback from lots of people, thanks so much!


I will try to answer all points raised in the several answers 
I've got here since yesterday.

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
 I'm confused by the chained hashing of the digits. Why is that 
 necessary?  I would have thought it'd be faster to hash the 
 entire key instead of computing the hash of each digit and 
 chaining them together.

 I looked up Rust's ahash algorithm. Apparently they leverage 
 the CPU's hardware AES instruction to compute a 
 collision-resistant hash very quickly.

 Somebody should file a bug on druntime to implement this where 
 the hardware supports it, instead of the current hashOf. For 
 relatively small keys this would be a big performance boost.


 T
[This is the commit](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/e757f34c3ddd16240e351bc91564f2a134659ff) where I added "incremental hashing". What you're suggesting I also tried, but it was much slower.

The reason incremental hashing is much faster than computing the hash of the whole key on each iteration is that you're doing a lot less work. To compute the hash of an array fully, you need to do something like this (it's just how hashing works):

```d
size_t currentHash = 0; // some seed
foreach (item; array) {
    currentHash = hashOf(item, currentHash); // fold each item into the running hash
}
```

But by the structure of the problem, we know that we keep adding items to the `array` and re-computing the hash (in the `printTranslations` function, which is the hot path... so ignore the dictionary loading for now). What you're suggesting is to do the above for the full array on each iteration; what I did was optimise that so that we only re-compute the hash for the newly added item on each iteration - which is the only work that is required.

As Siarhei Siamashka mentioned: you could keep optimising this using heuristics and knowledge about the problem, at which point you might converge on a Trie implementation! But for the purpose of this comparison, I'm happy to stop at a good-enough, kind-of general-purpose algorithm which is comparable across languages... I might compare the D Trie impl with a Rust Trie impl in the future, but for now, I've had enough of that.

Hope that makes sense now... try to undo it and I believe it will run around 30% slower.
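The difference can be sketched in a few lines (names are illustrative, not the actual dencoder code): rehashing the whole key is O(n) per appended digit, while carrying the running hash down the recursion costs O(1) per digit, and both produce the same value:

```d
// Fold one new digit into a running hash - O(1) per appended digit,
// using druntime's seeded hashOf overload.
size_t extend(size_t runningHash, ubyte nextDigit)
{
    return hashOf(nextDigit, runningHash);
}

void main()
{
    ubyte[] key = [1, 2, 3];

    // Full rehash of the key, as done when hashing from scratch
    size_t full = 0;
    foreach (d; key)
        full = hashOf(d, full);

    // Incremental version: each recursion level extends the parent's hash
    size_t inc = extend(extend(extend(0, 1), 2), 3);

    assert(full == inc); // same value, far less work per step
}
```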
 One of the most important thing I found is that every call to 
 printTranslations allocates a new array (`auto keyValue = new 
 ubyte[...];`).  Given the large number of recursions involved 
 in this algorithm, this will add up to quite a lot.
Bingo! I found that before reading your suggestion :) You can see the commit where I changed that [here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/4e8508c5d16a2b623a46739c056a68e03b3bd87). This one-line change was the most effective of all: it made the D implementation not only consume a lot less memory, but also become quite a bit faster (I will double-check later, but I believe it made the code run almost 50% faster in the longest runs)!
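The shape of that one-line change, sketched with hypothetical names (the real code is in the commit linked above): allocate the scratch key buffer once, up front, and reuse it across the recursive calls instead of doing `new ubyte[...]` inside each call:

```d
// Before: each recursive call did `auto keyValue = new ubyte[...];`,
// allocating on every step of the recursion.
// After: one buffer sized up front, with `depth` tracking how much of
// it forms the current key.
void search(ubyte[] scratch, size_t depth, const ubyte[] digits)
{
    if (depth == digits.length) return; // base case (lookup/print elided)
    scratch[depth] = digits[depth];     // extend the key in place
    search(scratch, depth + 1, digits); // no allocation on recursion
}

void main()
{
    const ubyte[] digits = [5, 6, 2];
    auto scratch = new ubyte[digits.length]; // single allocation
    search(scratch, 0, digits);
    ubyte[] expected = [5, 6, 2];
    assert(scratch == expected);
}
```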
 Secondly, your hash function looks suspicious. Why are you 
 chaining your hash per digit?
I've answered that above... but I went further: instead of using the built-in `hashOf` function, I decided to use [my own hashing function](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3), which takes advantage of the input characteristics... I mentioned before the trick I did to make the int128 solution faster (by the way, the int128 solution was not acceptable, as ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and it also boosted performance noticeably.
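The linked commit isn't shown inline here, so the following is only a guess at the function's shape, based on the description later in this thread (a shift-left per digit that can only "overflow" on very long keys); the name `digitHash` is hypothetical:

```d
// Fold each phone digit (0-9 fits in 4 bits) into the accumulator.
// Distinct keys collide only once their combined width exceeds size_t,
// which matches the "almost perfect hash" claim later in the thread.
size_t digitHash(size_t h, ubyte digit)
{
    return (h << 4) | digit;
}

void main()
{
    assert(digitHash(0, 5) == 5);
    assert(digitHash(digitHash(0, 5), 3) == 0x53); // 5 then 3 -> 0b0101_0011
}
```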
 Then a minor point: I wouldn't use Array in printTranslations. 
 It's overkill for what you need to do; a built-in array would 
 work just fine.
Hm, sorry to break it to you, but [changing that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba45d8abc3bccbc17c7f1e) to an `Array` was the single greatest improvement I made. IIRC it made the code 2x faster. Maybe I was using native arrays wrongly before... but it's unclear to me how I could have done better with them (and the docs themselves suggest using `Array` if performance matters, which I seem to have confirmed is good advice).
 Next, what stood out is ISolutionHandler.  If I were to write 
 this, I wouldn't use OO for this at all, and especially not 
 interfaces, because they involve a double indirection.
As Daniel Kozak mentioned, this doesn't matter. I tried rewriting it so that there was no indirection to the extent possible (I used delegate fields within a final class, so the functions to use are selected at startup and there's no overhead choosing which functions to call at runtime), and it was actually slightly slower (but too close to call, really). My solution with interfaces used two final classes, so I think there was very little overhead, as no vtable lookup was needed, right? You can check [my change here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d96a73f95b8e204d9325ed0e2ad19cece5e0c18); let me know if you can find a flaw in it.
 Finally, a very a small point: in loadDictionary, you do an AA 
 lookup with `n in result`, and then if that returns null, you 
 create a new entry. This does two AA lookups, once 
 unsuccessfully, and the second time to insert the missing key.  
 You could use the .require operation with a delegate instead of 
 `in` followed by `if (... is null)`, which only requires a 
 single lookup.
Thanks for pointing that out. I [changed that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/1ee3d858466904a74468eb139675ec7cf2c690d0) to use `require`, but it made no perceptible difference... still, I kept it in my "final" implementation because it was much cleaner!
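For reference, the `require`-based insertion looks roughly like this (the dictionary shape is illustrative; in the Prechelt mapping both "mir" and "Mix" encode to "562"):

```d
// One AA lookup instead of `in` followed by an insert: `require` returns
// a ref to the slot, creating a default-initialized entry if the key is
// missing, so appending works whether or not the key already existed.
void main()
{
    string[][string] dict;
    dict.require("562") ~= "mir"; // creates the entry
    dict.require("562") ~= "Mix"; // reuses it - still a single lookup
    assert(dict["562"] == ["mir", "Mix"]);
}
```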
 ryuukk_:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching
I [rewrote that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b8c072f113f5a36b54551051181ef877a8821e8) to use a switch. The speed seems to have increased very slightly for the very small inputs; it's difficult to tell. But I kept the change anyway, because it was simpler and closer to the Rust implementation. Notice that this table lookup is not in the hot path, so, as expected, it only seems to speed up loading the dictionary slightly (which is good, because D was behind for small inputs - and even this tiny improvement could help D catch up).
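The switch version of that table lookup looks roughly like this, using the standard Prechelt letter-to-digit mapping (the function name is illustrative):

```d
// Letter -> phone digit for the Prechelt encoding (e->0, jnq->1, rwx->2,
// dsy->3, ft->4, am->5, civ->6, bku->7, lop->8, ghz->9). The compiler
// can lower this to a jump table, so no hashing happens at all.
ubyte charToDigit(char c)
{
    switch (c)
    {
        case 'e', 'E': return 0;
        case 'j', 'n', 'q', 'J', 'N', 'Q': return 1;
        case 'r', 'w', 'x', 'R', 'W', 'X': return 2;
        case 'd', 's', 'y', 'D', 'S', 'Y': return 3;
        case 'f', 't', 'F', 'T': return 4;
        case 'a', 'm', 'A', 'M': return 5;
        case 'c', 'i', 'v', 'C', 'I', 'V': return 6;
        case 'b', 'k', 'u', 'B', 'K', 'U': return 7;
        case 'l', 'o', 'p', 'L', 'O', 'P': return 8;
        case 'g', 'h', 'z', 'G', 'H', 'Z': return 9;
        default: assert(0, "not a letter");
    }
}

void main()
{
    assert(charToDigit('M') == 5);
    assert(charToDigit('i') == 6);
    assert(charToDigit('r') == 2); // "Mir" -> 562
}
```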
 To save from having to write a old-school verbose switch, i 
 suspect he went with a hashmap, wich is slower in that case, 
 hence why i keep advocate for that feature, or i should say, 
 that upgrade to switch, wich java has adopted, as well as rust:
The reason I used a hash table was that I believed it would take advantage of D's compile-time programming and be even faster than a switch. To be completely sure that's not the case, I would need to benchmark that function individually, but I'm not so interested in knowing that, as it made almost no difference to the implementation's performance anyway.
 Yet another reason to advocate for pattern matching in D and 
 switch as expression
Hm... not sure your argument holds here, given the above :/
 evilrat:
 There is another important difference, i quickly looked up D 
 associative array implementation and the search looks like 
 nlog(n) complexity with plain loop iteration of hashes, whereas 
 rust's implementation is using "swisstable" algorithm 
 implementation that has packed SIMD optimized lookups, this is 
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown, as you've found out). That's too slow (because it's DoS-resistant - though Java's hash is too, and it's still faster). I had to replace it with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code.

So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general-purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.

One more thing I did was to avoid allocations in the print function, so that D could have fewer memory allocations (which had already proved to be a big slowdown) and perform better on the print problem. This [change did make the code faster](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/8937b80ec9c851ca196780a708c2ef84c499f8c), and it actually caught up with Rust for the smallest input.
 Jordan Wilson:
 I agree! Thanks for posting your benchmarks, I thought your 
 whole benching setup was pretty good, and learnt alot from your 
 code and the resulting contributions in the thread and others.
Thanks for that :) I am used to being mercilessly roasted in threads like this, because no one likes to be told their favourite language is not so fast :D so any support is highly appreciated!

Anyway, here's the fastest implementation I came up with, based on all the feedback here, that still resembles the Common Lisp algorithm and the Rust implementation: https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/1ee3d858466904a74468eb139675ec7cf2c690d0/src/d/src/dencoder.d (branch name is `dlang-key-hash-incremental-avoid-gc`).

Performance on Mac M1:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,36
./rust,24100864,144
./rust,23986176,581
./rust,24100864,1152
./rust,7815168,6644
./rust,7766016,11752
===> src/d/dencoder
src/d/dencoder,13385728,31
src/d/dencoder,24477696,262
src/d/dencoder,24477696,1173
src/d/dencoder,24395776,2313
src/d/dencoder,5701632,7108
src/d/dencoder,5767168,13692
```

Performance on Linux x86_64 (compiled with LDC 1.36.0 and dub `-b release-nobounds`):

```
Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23138304,62
./rust,23138304,236
./rust,23138304,932
./rust,23138304,1828
./rust,9027584,6108
./rust,9027584,13759
===> src/d/dencoder
src/d/dencoder,229851136,62
src/d/dencoder,229851136,403
src/d/dencoder,229851136,1789
src/d/dencoder,229851136,3573
src/d/dencoder,218996736,8254
src/d/dencoder,218996736,21412
```

Performance on Linux x86_64 using GDC:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,25178112,57
./rust,25178112,262
./rust,25178112,1082
./rust,25178112,2084
./rust,11067392,6328
./rust,11067392,35849
===> src/d/dencoder
src/d/dencoder,231968768,72
src/d/dencoder,231968768,635
src/d/dencoder,231968768,2875
src/d/dencoder,231968768,5739
src/d/dencoder,221110272,18195
src/d/dencoder,221110272,128987
```

D overhead with the fastest compiler, LDC, compared with Rust:

```
1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748
```

If anyone ever asks me, I will say that D can be as fast as Rust/C, but only if you're very careful about using the D features that perform well... if you just write clean D code (use the GC, native arrays, etc.), you're likely to get code that is at least 50% slower. Thanks everyone for all the input.
Jan 20
next sibling parent Renato <renato athaydes.com> writes:
On Saturday, 20 January 2024 at 13:07:39 UTC, Renato wrote:
 D overhead with the fastest compiler, LDC, compared with Rust:

 ```
 1.0
 1.707627119
 1.919527897
 1.954595186
 1.351342502
 1.556217748
 ```
Oh sorry, I only posted the ratios for the Linux benchmark... On the Mac M1, for some reason, D was a bit closer to Rust in some of the runs:

```
0.8611111111
1.819444444
2.018932874
2.0078125
1.069837447
1.165078285
```
Jan 20
prev sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much!
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes, whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code. So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
 This function is not good for the general purpose case, but it's probably
 as good as it gets for this problem (see my int128 trick in a previous post
 on this thread to understand why). I did try a few variations of hashing
 before settling on this one... Rust, if anything, is at a disadvantage here.

And here you are wrong: it is the hashing that the slowdown comes from. It is
not only about the speed of the hashing function. It is about the quality of
the hashing function for this particular problem, and about how the hash
table (associative array) is implemented.
Jan 20
parent reply Renato <renato athaydes.com> writes:
On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
 On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn 
 < digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much! 
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes, 
 whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this 
 is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code. So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here]( https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.
And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.
I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions, except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferably you should provide a faster implementation. I will even accept a theoretical explanation, if you can give one. What I will not accept is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for in a technical discussion.
Jan 20
parent Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, 20 Jan 2024 at 21:21, Renato via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
 On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much!
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes,
 whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this
 is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code)
about optimising the Rust code.
 So, no, that's not where the difference is coming from... in
 fact, I am using a faster hashing function in D.

 You can [see my custom hashing function
 here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
 This function is not good for the general purpose case, but
 it's probably
 as good as it gets for this problem (see my int128 trick in a
 previous post
 on this thread to understand why). I did try a few variations
 of hashing
 before settling on this one... Rust, if anything, is at a
 disadvantage here.
And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.
I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferrably you should provide a faster implementation. I will even accept a theoretical explanation if you can give one. What I will not accept, is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for on a technical discussion.
If you had provided info that the hash is perfect, then I am sorry - I have probably missed that in this thread, my fault. Then I will need to look into this more carefully and do much more than just make assumptions.

Jan 21
prev sibling parent Jordan Wilson <wilsonjord gmail.com> writes:
On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote:
 Do you know why the whole thread seems to have disappeared?? 
 There's a lot of good stuff in the thread, it would be a huge 
 shame to lose all that!
I agree! Thanks for posting your benchmarks. I thought your whole benching setup was pretty good, and I learnt a lot from your code and the resulting contributions in this thread and others. Jordan
Jan 19