
digitalmars.D.learn - Re: Help optimize D solution to phone encoding problem: extremely

reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
 Ok, last time I'm running this for someone else :D
 
 ```
 Proc,Run,Memory(bytes),Time(ms)
 ===> ./rust
 ./rust,23920640,30
 ./rust,24018944,147
 ./rust,24068096,592
 ./rust,24150016,1187
 ./rust,7766016,4972
 ./rust,8011776,46101
 ===> src/d/dencoder
 src/d/dencoder,44154880,42
 src/d/dencoder,51347456,87
 src/d/dencoder,51380224,273
 src/d/dencoder,51462144,441
 src/d/dencoder,18644992,4414
 src/d/dencoder,18710528,43548
 ```
OK, this piqued my interest enough that I decided to install rust using rustup instead of my distro's package manager. Here are the numbers I got for my machine:

```
===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867
```

Looks like we lost out to Rust for larger inputs. :-D Probably due to environmental factors (and the fact that std.stdio is slow). I re-ran it again and got this:

```
===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471
```

Notice the significant discrepancy between the two runs; this seems to show that the benchmark is only accurate to within about ±1.5 seconds.

Anyway, oddly enough, Java seems to beat Rust on larger inputs. Maybe my Java compiler has a better JIT implementation? :-P
 Congratulations on beating Rust :D but remember: you're using a much
 more efficient algorithm! I must conclude that the Rust translation of
 the Trie algorithm would be much faster still, unfortunately (you may
 have noticed that I am on D's side here!).
At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal.

Surprisingly, at least on my machine, Lisp actually performed the worst. I'd have thought it would at least beat Java, but I was quite wrong. :-D Perhaps the Lisp implementation I'm using is suboptimal, I don't know. Or perhaps modern JVMs have really overtaken Lisp.

Now I'm really curious how a Rust version of the trie algorithm would perform. Unfortunately I don't know Rust, so I wouldn't be able to write it myself. (Hint, hint, nudge, nudge ;-)).

As far as the performance of my D version is concerned, I still haven't squeezed out all the performance I could yet. Going into this, my intention was to take the lazy way of optimizing only what the profiler points out to me, with the slight ulterior motive of proving that a relatively small amount of targeted optimizations can go a long way toward making the GC a non-problem in your typical D code. ;-) I haven't pulled out all the optimization guns at my disposal yet.

If I were to go the next step, I'd split up the impl() function so that I get a better profile of where it's spending most of its time, and then optimize that. My current suspicion is that the traversal of the trie could be improved by caching intermediate results to eliminate a good proportion of recursive calls in impl().

Also, the `print` mode of operation is quite slow, probably because writefln() allocates. (It allocates less than if I had used .format like I did before, but it nevertheless still allocates.) To alleviate this cost, I'd allocate an output buffer and write to that, flushing only once it filled up.

Another thing I could do is to use std.parallelism.parallel to run searches on batches of phone numbers in parallel. This is kinda cheating, though, since it's the same algorithm with the same cost; we're just putting more CPU cores to work. :-P But in D this is quite easy to do, often as easy as simply adding .parallel to your outer foreach loop. In this particular case it will need some additional refactoring, due to the fact that the input is being read line by line. But it's relatively easy to load the input into a buffer by chunks instead, and just run the searches on all the numbers found in the buffer in parallel.

On Thu, Jan 18, 2024 at 04:25:45PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
 BTW here's you main function so it can run on the benchmark:
[...]

Thanks, I've adapted my code accordingly and pushed it to my github repo.

T

-- 
This is a tpyo.
Jan 18
parent reply Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:
 On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via 
 Digitalmars-d-learn wrote: [...]
 Ok, last time I'm running this for someone else :D
 
 ```
 Proc,Run,Memory(bytes),Time(ms)
 ===> ./rust
 ./rust,23920640,30
 ./rust,24018944,147
 ./rust,24068096,592
 ./rust,24150016,1187
 ./rust,7766016,4972
 ./rust,8011776,46101
 ===> src/d/dencoder
 src/d/dencoder,44154880,42
 src/d/dencoder,51347456,87
 src/d/dencoder,51380224,273
 src/d/dencoder,51462144,441
 src/d/dencoder,18644992,4414
 src/d/dencoder,18710528,43548
 ```
 OK, this piqued my interest enough that I decided to install 
 rust using rustup instead of my distro's package manager. Here 
 are the numbers I got for my machine:
 
 ===> ./rust
 ./rust,22896640,35
 ./rust,22896640,137
 ./rust,22384640,542
 ./rust,22896640,1034
 ./rust,8785920,2489
 ./rust,8785920,12157
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,72
 src/d/dencoder,1066799104,198
 src/d/dencoder,1066799104,344
 src/d/dencoder,1035292672,2372
 src/d/dencoder,1035292672,13867
 
 Looks like we lost out to Rust for larger inputs. :-D Probably 
 due to environmental factors (and the fact that std.stdio is 
 slow). I re-ran it again and got this:
 
 ===> ./rust
 ./rust,22896640,30
 ./rust,22896640,131
 ./rust,22896640,511
 ./rust,22896640,983
 ./rust,8785920,3102
 ./rust,8785920,9912
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,71
 src/d/dencoder,1066799104,197
 src/d/dencoder,1066799104,355
 src/d/dencoder,1035292672,3441
 src/d/dencoder,1035292672,9471
 
 Notice the significant discrepancy between the two runs; this 
 seems to show that the benchmark is only accurate to within 
 about ±1.5 seconds.
That's not correct. The discrepancy is due to the benchmark always generating different input on each run - and the characteristics of the input affect runs significantly. This affects the last two runs much more, as they use the more challenging dictionary.

The important thing is that the relative performance between languages is reliable. If you stop re-generating the phone numbers and just run the benchmark multiple times on the same input, you'll see the runs are very close.
 Anyway, oddly enough, Java seems to beat Rust on larger inputs. 
  Maybe my Java compiler has a better JIT implementation? :-P
Again, in case I haven't made it clear by repeating this multiple times: The Java code is using a Trie, the Rust code is using a numeric solution. They are completely different algorithms. Much more important than which language is being used is the algorithm, as has been shown again and again.
 Congratulations on beating Rust :D but remember: you're using 
 a much more efficient algorithm! I must conclude that the Rust 
 translation of the Trie algorithm would be much faster still, 
 unfortunately (you may have noticed that I am on D's side 
 here!).
At this point, it's not really about the difference between languages anymore; it's about the programmer's skill at optimizing his code. Traditionally Java is thought to be the slowest, because it runs in a VM and generally tends to use more heap allocations. In recent times, however, JIT and advanced GC implementations have significantly levelled that out, so you're probably not going to see the difference unless you hand-tweak your code down to the bare metal.
Java has been very fast for over a decade now, this is not new at all.
 Surprisingly, at least on my machine, Lisp actually performed 
 the worst. I'd have thought it would at least beat Java, but I 
 was quite wrong. :-D Perhaps the Lisp implementation I'm using 
 is suboptimal, I don't know. Or perhaps modern JVMs have really 
 overtaken Lisp.
Please don't compare different algorithms in different languages and make conclusions about each language's speed.
 Now I'm really curious how a Rust version of the trie algorithm 
 would perform.  Unfortunately I don't know Rust so I wouldn't 
 be able to write it myself. (Hint, hint, nudge, nudge ;-)).

 As far as the performance of my D version is concerned, I still 
 haven't squeezed out all the performance I could yet.  Going 
 into this, my intention was to take the lazy way of optimizing 
 only what the profiler points out to me, with the slight 
 ulterior motive of proving that a relatively small amount of 
 targeted optimizations can go a long way at making the GC a 
 non-problem in your typical D code. ;-)  I haven't pulled out 
 all the optimization guns at my disposal yet.
You don't really have to... ssvb's solution is incredibly fast at the "count" problem and I really don't think anyone can beat that implementation. The only problem is that the implementation is very C-like and nothing like the D I would write.
 If I were to go the next step, I'd split up the impl() function 
 so that I get a better profile of where it's spending most of 
 its time, and then optimize that.  My current suspicion is that 
 the traversal of the trie could be improved by caching 
 intermediate results to eliminate a good proportion of 
 recursive calls in impl().

 Also, the `print` mode of operation is quite slow, probably 
 because writefln() allocates. (It allocates less than if I had 
 used .format like I did before, but it nevertheless still 
 allocates.) To alleviate this cost, I'd allocate an output 
 buffer and write to that, flushing only once it filled up.

 Another thing I could do is to use std.parallelism.parallel to 
 run searches on batches of phone numbers in parallel. This is 
 kinda cheating, though, since it's the same algorithm with the 
 same cost, we're just putting more CPU cores to work. :-P  But 
 in D this is quite easy to do, often as easy as simply adding 
 .parallel to your outer foreach loop. In this particular case 
 it will need some additional refactoring due to the fact that 
 the input is being read line by line. But it's relatively easy 
 to load the input into a buffer by chunks instead, and just run 
 the searches on all the numbers found in the buffer in parallel.
I don't have any interest in doing that, because what I was trying to accomplish was to compare the exact same algorithm in different languages, to the extent possible. This is why I ported the Common Lisp code to D, then tried to optimize from there, just like I did for Rust. I find the numeric implementation much more "valuable" for comparing performance than the Trie solution, because it exercises quite a few things in the language that are common to most programs, like hashing, recursion, and "search" in general.

Do you know why the whole thread seems to have disappeared?? There's a lot of good stuff in the thread; it would be a huge shame to lose all that!
Jan 19
next sibling parent reply Renato <renato athaydes.com> writes:
 Looks like we lost out to Rust for larger inputs. :-D  
 Probably due to environmental factors (and the fact that 
 std.stdio is slow).  I re-ran it again and got this:

 ===> ./rust
 ./rust,22896640,30
 ./rust,22896640,131
 ./rust,22896640,511
 ./rust,22896640,983
 ./rust,8785920,3102
 ./rust,8785920,9912
 ===> src/d/dencoder
 src/d/dencoder,1066799104,36
 src/d/dencoder,1066799104,71
 src/d/dencoder,1066799104,197
 src/d/dencoder,1066799104,355
 src/d/dencoder,1035292672,3441
 src/d/dencoder,1035292672,9471
I forgot to mention: the Java version is using a Trie... and it consistently beats the Rust numeric algorithm (which means it's still faster than your D solution), but the Java version that's equivalent to Rust's implementation is around 3x slower... i.e. it runs at about the same speed as my current fastest numeric algorithm in D as well.

This is what I would like to be discussing in this thread: why is D running at Java speeds and not at D speeds when using the same algorithm? I know there are small differences in the implementations - they are different languages, after all - but not enough, IMO, to justify anything like a 3x difference from Rust.

If you really want to show me that D is as fast as Rust, please spend time on my solution (because that's a direct algorithmic comparison to the Rust implementation) and remove any unnecessary overhead from it (without changing the overall algorithm, of course - unless that's to make it closer to the Rust implementation).
 Do you know why the whole thread seems to have disappeared?? 
 There's a lot of good stuff in the thread, it would be a huge 
 shame to lose all that!
Nevermind, looks like I was unable to open any other answers in this thread while writing my own answer!?
Jan 19
parent reply evilrat <evilrat666 gmail.com> writes:
On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and it 
 consistently beats the Rust numeric algorithm (which means it's 
 still faster than your D solution), but the Java version that's 
 equivalent to Rust's implementation is around 3x slower... i.e. 
 it runs at about the same speed as my current fastest numeric 
 algorithm in D as well.

 This is what I would like to be discussing in this thread: why 
 is D running at Java speeds and not at D speeds when using the 
 same algorithm? I know there's small differences in the 
 implementations, they are different languages after all, but 
 not enough IMO to justify anything like 3x difference from Rust.
My guess is that it's because int128 is not that well optimized, due to being a not-so-popular type; answering what's wrong would require looking at the assembly code produced for both D and Rust.

Additionally, if you are comparing D by measuring DMD performance - don't. It is valuable during development for fast iterations, but it lacks many modern optimization techniques; for that we have LDC and GDC.
Jan 19
next sibling parent Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.

 This is what I would like to be discussing in this thread: why 
 is D running at Java speeds and not at D speeds when using the 
 same algorithm? I know there's small differences in the 
 implementations, they are different languages after all, but 
 not enough IMO to justify anything like 3x difference from 
 Rust.
My guess is that's because int128 is not that much optimized due to being not so popular type, though to answer what's wrong would require to look at assembly code produced for both D and Rust. Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I am not using int128 anymore; I explained why a few posts back. I am using a byte array and computing the hash incrementally when trying different inputs, so that partially computed hashes are re-used on each try (this is a bit of cheating, as Rust is not doing that, but I consider it acceptable as it's still computing hashes and looking up entries in the associative array).

I used all D compilers and picked the fastest one (GDC in the case of int128, but LDC2 in the current case).
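A minimal sketch of the incremental-hashing idea, assuming an FNV-1a-style step (hypothetical names chosen for illustration; this is not the actual dencoder code):

```d
import std.stdio;

// Sketch: the hash of a key prefix is extended one digit at a time,
// so each longer candidate key reuses the work already done for its
// prefix instead of rehashing the whole key from scratch.
// (FNV-1a step used here purely as an example hash function.)
ulong extendHash(ulong prefixHash, ubyte digit)
{
    enum ulong fnvPrime = 1_099_511_628_211UL;
    return (prefixHash ^ digit) * fnvPrime;
}

void main()
{
    enum ulong fnvOffset = 14_695_981_039_346_656_037UL;

    // Hash the prefixes "5", "56", "562" - each step is O(1),
    // not O(prefix length).
    ulong h = fnvOffset;
    foreach (ubyte d; [ubyte(5), ubyte(6), ubyte(2)])
    {
        h = extendHash(h, d);
        writeln(h);
    }
}
```

The trade-off discussed in this thread is exactly this: each lookup reuses a stored partial hash, at the cost of keeping that hash alongside the key.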
Jan 19
prev sibling parent reply Renato <renato athaydes.com> writes:
On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.
Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.

If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.

Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed, as it doesn't necessarily "print" anything like it did earlier) - the rest of the code almost doesn't matter.

Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```
Jan 19
next sibling parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Jan 19, 2024 at 01:40:39PM +0000, Renato via Digitalmars-d-learn wrote:
 On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
 Additionally if you comparing D by measuring DMD performance -
 don't.  It is valuable in developing for fast iterations, but it
 lacks many modern optimization techniques, for that we have LDC and
 GDC.
I tried with DMD again, and yeah, it's much slower.
For anything where performance is even remotely important, I wouldn't even consider DMD. It's a well-known fact that it produces suboptimal executables; its only redeeming quality is its fast turnaround time. If fast turnaround is not important, I would always use LDC or GDC instead.
 Here's the [current implementation in
 D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
 and the roughly [equivalent Rust
 implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).
Taking a look at this code:

One of the most important things I found is that every call to printTranslations allocates a new array (`auto keyValue = new ubyte[...];`). Given the large number of recursions involved in this algorithm, this will add up to quite a lot. If I were optimizing this code, I'd look into ways of reducing, if not eliminating, this allocation. Observe that this allocation is needed each time printTranslations recurses, so instead of making separate allocations, you could put it on a stack: either with alloca, or with my appendPath() trick in my version of the code - preallocate a reasonably large buffer and take slices of it each time you need a new keyValue array.

Secondly, your hash function looks suspicious. Why are you chaining your hash per digit? That's a lot of hash computations. Shouldn't you just hash the entire key each time? That would eliminate the need to store a custom hash in your key; you could just look up the entire key at once.

Next, what stood out is ISolutionHandler. If I were to write this, I wouldn't use OO for this at all, and especially not interfaces, because they involve a double indirection. I'd just return a delegate instead (single indirection, no object lookup). This is a relatively small matter, but when it's being used inside a hot inner loop, it could be important.

Then a minor point: I wouldn't use Array in printTranslations. It's overkill for what you need to do; a built-in array would work just fine. Take a look at the implementation of Array and you'll see lots of function calls and locks and GC root-adding and all that stuff. Most of it doesn't apply here, of course, and is compiled out. Nevertheless, it uses wrapped integer operations and range checks, etc. Again, these are all minor issues, but in a hot inner loop they do add up. Built-in arrays let you literally just bump the pointer when adding an element - just a couple of instructions, as opposed to several function calls. An important difference when you're on the hot path.

Now, as I mentioned earlier w.r.t. my own code, appending to built-in arrays comes with a cost. So here's where you'd optimize by creating your own buffer and custom push/pop operations. Something like appendPath() in my version of the code would do the job.

Finally, a very small point: in loadDictionary, you do an AA lookup with `n in result`, and then if that returns null, you create a new entry. This does two AA lookups: once unsuccessfully, and a second time to insert the missing key. You could use the .require operation with a delegate instead of `in` followed by `if (... is null)`, which only requires a single lookup. Probably not an important point, but for a large dictionary this might make a difference.
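A minimal sketch of the `.require` point, using a toy dictionary (not the actual loadDictionary code):

```d
import std.stdio;

void main()
{
    string[][string] dict;

    // Two-lookup pattern: a failed `in` probe followed by an insert.
    // if (auto p = "5624" in dict) *p ~= "java";
    // else dict["5624"] = ["java"];

    // Single-lookup pattern: .require returns a ref to the existing
    // value, or inserts (and returns) the lazily evaluated default
    // when the key is missing - here the default is an empty array.
    dict.require("5624") ~= "java";
    dict.require("5624") ~= "kava";

    writeln(dict["5624"]); // ["java", "kava"]
}
```

The first `.require` call inserts an empty entry and appends to it; the second finds the same entry with a single probe.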
 The only "significant" difference is that in Rust, an enum
 `WordOrDigit` is used to represent currently known "words"... I [did
 try using that in
 D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
 but it made the algorithm slower.
 
 If you see anything in D that's not as efficient as it should be, or
 somehow "inferior" to what the Rust version is doing , please let me
 know.
Couldn't tell you, I don't know Rust. :-D
 Notice that almost all of the time is spent in the for-loop inside
 `printTranslations` (which is misnamed as it doesn't necessarily
 "print" anything, like it did earlier) - the rest of the code almost
 doesn't matter.
[...]

Of course, that's where your hot path is. And that loop makes recursive calls to printTranslations, so the entire body of the function could use some optimization. ;-)

Try addressing the points I wrote above and see if it makes a difference.

T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL
Jan 19
prev sibling next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
On Friday, 19 January 2024 at 13:40:39 UTC, Renato wrote:
 On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 09:08:17 UTC, Renato wrote:
 I forgot to mention: the Java version is using a Trie... and 
 it consistently beats the Rust numeric algorithm (which means 
 it's still faster than your D solution), but the Java version 
 that's equivalent to Rust's implementation is around 3x 
 slower... i.e. it runs at about the same speed as my current 
 fastest numeric algorithm in D as well.
Additionally if you comparing D by measuring DMD performance - don't. It is valuable in developing for fast iterations, but it lacks many modern optimization techniques, for that we have LDC and GDC.
I tried with DMD again, and yeah, it's much slower.

Here's the [current implementation in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d), and the roughly [equivalent Rust implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs). The only "significant" difference is that in Rust, an enum `WordOrDigit` is used to represent currently known "words"... I [did try using that in D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d), but it made the algorithm slower.

If you see anything in D that's not as efficient as it should be, or somehow "inferior" to what the Rust version is doing, please let me know.

Notice that almost all of the time is spent in the for-loop inside `printTranslations` (which is misnamed, as it doesn't necessarily "print" anything like it did earlier) - the rest of the code almost doesn't matter.

Current performance comparison:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,31
./rust,24018944,149
./rust,24084480,601
./rust,24248320,1176
./rust,7798784,2958
./rust,7815168,15009
===> src/d/dencoder
src/d/dencoder,14254080,36
src/d/dencoder,24477696,368
src/d/dencoder,24510464,1740
src/d/dencoder,24559616,3396
src/d/dencoder,11321344,6740
src/d/dencoder,11321344,36787
```

So, it's not really 3x slower anymore; here's the "D overhead" considering Rust as the baseline:

```
1.161290323
2.469798658
2.895174709
2.887755102
2.278566599
2.450996069
```
You do a hash map lookup for every character in D, which is slow, whereas in Rust you do it via pattern matching; Java does the same, pattern matching.

Yet another reason to advocate for pattern matching in D and switch as an expression.
Jan 19
parent reply evilrat <evilrat666 gmail.com> writes:
On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching


 Yet another reason to advocate for pattern matching in D and 
 switch as expression
There is another important difference: I quickly looked up the D associative array implementation, and the search looks like n·log(n) complexity with a plain loop iteration over hashes, whereas Rust's implementation uses the "swisstable" algorithm, which has packed SIMD-optimized lookups; this is likely where the 3x speed difference comes from.

I tried to look up the Rust implementation, but it is SOOO generic that I was unable to decipher it to find the actual key searches and stores. Anyway, here is an interesting article about the Rust implementation: https://faultlore.com/blah/hashbrown-tldr/
Jan 19
parent ryuukk_ <ryuukk.dev gmail.com> writes:
On Friday, 19 January 2024 at 17:18:36 UTC, evilrat wrote:
 On Friday, 19 January 2024 at 16:55:25 UTC, ryuukk_ wrote:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching


 Yet another reason to advocate for pattern matching in D and 
 switch as expression
There is another important difference, i quickly looked up D associative array implementation and the search looks like nlog(n) complexity with plain loop iteration of hashes, whereas rust's implementation is using "swisstable" algorithm implementation that has packed SIMD optimized lookups, this is likely where the 3x speed difference comes from. Tried to look up rust implementation and it is SOOO generic that I was unable to decipher it to find the actual key search and stores. Anyway here is an interesting article about rust implementation https://faultlore.com/blah/hashbrown-tldr/
I'm not talking about the difference between the hash map implementations, but about the difference between the algorithms used to look up the characters:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/9b1d7f026943841638a2729922cf000b1b3ce655/src/java/Main2.java#L106-L134

vs

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/d/src/dencoder.d#L10-L49

If D had pattern matching and switch as an expression, the faster method would be:

1. the most obvious choice
2. the fastest by default
3. the cleanest

To save himself from having to write an old-school verbose `switch`, I suspect he went with a hash map, which is slower in that case. Hence why I keep advocating for that feature - or, I should say, that upgrade to `switch` - which Java has adopted, as well as Rust:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/40cd423fc9dd1b1b47f02d8ab66ca03420820e84/src/rust/phone_encoder/src/main.rs#L146-L168
Jan 19
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Fri, Jan 19, 2024 at 4:44 PM H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 Taking a look at this code:
 ...
 Try addressing the points I wrote above and see if it makes a
 difference.
I have tried it (all of it) even before you wrote it here, because I had exactly the same ideas, but to be fair it has almost zero effect on speed.

Here is my version (it still uses OOP, but I have tried it with Printer and Counter as structs and it has no effect at all): https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY

The only difference in speed in the end is caused by the hash implementation of dlang associative arrays vs. Rust's HashMap; actually, if you modify the Rust version to not use ahash, it has almost the same speed as D.
Jan 19
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn
wrote:
[...]
    > Try addressing the points I wrote above and see if it makes a
    > difference.
 
    I have tried it (all of it) even before you wrote it here, because
    I have completely the same ideas, but to be fair it has almost zero
    effect on speed.
    There is my version (It still use OOP, but I have try it wit
    Printer and Counter to be structs and it has no effect at
    all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
    The only difference in speed in the end is caused by hash
    implementation of dlang associative arrays and rust HashMap,
    actually if you modify rust to not used ahash it has almost same
    speed as D
[...]

I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.

I looked up Rust's ahash algorithm. Apparently they leverage the CPU's hardware AES instruction to compute a collision-resistant hash very quickly. Somebody should file a bug on druntime to implement this where the hardware supports it, instead of the current hashOf. For relatively small keys this would be a big performance boost.

T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Jan 19
next sibling parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
 On Sat, Jan 20, 2024 at 01:35:44AM +0100, Daniel Kozak via Digitalmars-d-learn wrote: [...]
    > Try addressing the points I wrote above and see if it makes a
    > difference.

    I have tried it (all of it) even before you wrote it here, because
    I have completely the same ideas, but to be fair it has almost zero
    effect on speed.
    There is my version (It still use OOP, but I have try it wit
    Printer and Counter to be structs and it has no effect at
    all) [2]https://paste.ofcode.org/38vKWLS8DHRazpv6MTidRJY
    The only difference in speed in the end is caused by hash
    implementation of dlang associative arrays and rust HashMap,
    actually if you modify rust to not used ahash it has almost same
    speed as D
[...] I'm confused by the chained hashing of the digits. Why is that necessary? I would have thought it'd be faster to hash the entire key instead of computing the hash of each digit and chaining them together.
Hash which key? The problem with the naive hashtable-based algorithm is that we don't know the length of the key, so all lengths are tried, starting from 1 and going up to the length of the remaining part of the phone number.

An additional optimization would be to limit this search and terminate it early. For example, stop after we exceed the length of the longest word in the dictionary. Or stop the search upon encountering certain "leaf" words - but this effectively morphs the algorithm into something that partially resembles a Trie, and evolving it further we may end up with a full-fledged Trie. The only practical value of this algorithm is that it's easy to implement and has a low LOC count.

My initial non-optimized naive version was also similar: https://gist.github.com/ssvb/abe821b3cdba7fcb7f3c3cecde864153 (66 lines including blank lines and some comments). This is somewhat comparable to the Lisp results https://flownet.com/ron/papers/lisp-java/raw-results.html or to Peter Norvig's solution http://www.norvig.com/java-lisp.html

The purpose of the initial "prototype" implementation is just to have some sort of starting point. If runtime performance doesn't matter, then we are already done. But if runtime performance does matter, then a lot of things can be rapidly improved by spending a bit of time on it. Eventually one runs out of ideas and further optimization efforts yield only diminishing returns; that's a good time to clean up the code, add comments, etc. and label it the "final" version.

I think that a good study intending to compare different programming languages against each other could take all these factors into consideration: time to implement the prototype, runtime performance of the prototype, time to implement the final version, runtime performance of the final version, the number of lines of code, and its readability/maintainability.
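The length-search described above can be sketched as follows (`dict`, `longestWord`, and `candidates` are illustrative names, not taken from any of the actual implementations):

```d
// Naive hashtable lookup for the phone-encoding problem: the key length
// is unknown, so try every prefix of the remaining digits, capped by the
// longest dictionary word (the early-termination idea described above).
import std.algorithm : min;

string[][string] dict;   // digit string -> words that encode to it
size_t longestWord;      // length of the longest dictionary word

string[][] candidates(string digits)
{
    string[][] found;
    foreach (len; 1 .. min(digits.length, longestWord) + 1)
    {
        // One hash lookup per candidate prefix length
        if (auto words = digits[0 .. len] in dict)
            found ~= *words;
    }
    return found;
}

void main()
{
    dict = ["2": ["foo"], "22": ["bar"]];
    longestWord = 2;
    assert(candidates("225") == [["foo"], ["bar"]]);
}
```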
Jan 20
prev sibling parent reply Renato <renato athaydes.com> writes:
Wow, fantastic feedback from lots of people, thanks so much!


I will try to answer all points raised in the several answers 
I've got here since yesterday.

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
 I'm confused by the chained hashing of the digits. Why is that 
 necessary?  I would have thought it'd be faster to hash the 
 entire key instead of computing the hash of each digit and 
 chaining them together.

 I looked up Rust's ahash algorithm. Apparently they leverage 
 the CPU's hardware AES instruction to compute a 
 collision-resistant hash very quickly.

 Somebody should file a bug on druntime to implement this where 
 the hardware supports it, instead of the current hashOf. For 
 relatively small keys this would be a big performance boost.


 T
[This is the commit](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/e757f34c3ddd16240e351bc91564f2a134659ff) where I added "incremental hashing". What you're suggesting I also tried, but it was much slower.

The reason incremental hashing is much faster than computing the hash of the whole key on each iteration is that you're doing a lot less work. To compute the hash of an array fully, you need to do something like this (it's just how hashing works):

```d
size_t currentHash = 0; // some seed
foreach (item; array) {
    currentHash = hashOf(item, currentHash); // fold each item into the running hash
}
```

But by the structure of the problem, we know that we keep adding items to the `array` and re-computing the hash (in the `printTranslations` function, which is the hot path... so ignore the dictionary loading for now). What you're suggesting is to do the above for the full array on each iteration; what I did was optimise that so that we only re-compute the hash for the newly added item on each iteration - which is the only work that is required.

As Siarhei Siamashka mentioned: you could keep optimising this using heuristics and knowledge about the problem, at which point you might converge on a Trie implementation! But for the purpose of this comparison, I'm happy to stop at a good-enough, kind-of general-purpose algorithm which is comparable across languages... I might compare the D Trie impl with a Rust Trie impl in the future, but for now, I've had enough of that.

Hope that makes sense now... try to undo it and I believe it will run around 30% slower.
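The difference can be sketched in a few lines (names are illustrative, not the actual dencoder code): rehashing the whole key is O(n) per appended digit, while carrying the running hash down the recursion costs O(1) per digit, and both produce the same value:

```d
// Fold one new digit into a running hash - O(1) per appended digit,
// using druntime's seeded hashOf overload.
size_t extend(size_t runningHash, ubyte nextDigit)
{
    return hashOf(nextDigit, runningHash);
}

void main()
{
    ubyte[] key = [1, 2, 3];

    // Full rehash of the key, as done when hashing from scratch
    size_t full = 0;
    foreach (d; key)
        full = hashOf(d, full);

    // Incremental version: each recursion level extends the parent's hash
    size_t inc = extend(extend(extend(0, 1), 2), 3);

    assert(full == inc); // same value, far less work per step
}
```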
 One of the most important thing I found is that every call to 
 printTranslations allocates a new array (`auto keyValue = new 
 ubyte[...];`).  Given the large number of recursions involved 
 in this algorithm, this will add up to quite a lot.
Bingo! I found that before reading your suggestion :) You can see the commit where I changed that [here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/4e8508c5d16a2b623a46739c056a68e03b3bd87). This one-line change was the most effective of all: it made the D implementation not only consume a lot less memory, but also become quite a bit faster (I will double-check later, but I believe it made the code run almost 50% faster in the longest runs)!
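The shape of that one-line change, sketched with hypothetical names (the real code is in the commit linked above): allocate the scratch key buffer once, up front, and reuse it across the recursive calls instead of doing `new ubyte[...]` inside each call:

```d
// Before: each recursive call did `auto keyValue = new ubyte[...];`,
// allocating on every step of the recursion.
// After: one buffer sized up front, with `depth` tracking how much of
// it forms the current key.
void search(ubyte[] scratch, size_t depth, const ubyte[] digits)
{
    if (depth == digits.length) return; // base case (lookup/print elided)
    scratch[depth] = digits[depth];     // extend the key in place
    search(scratch, depth + 1, digits); // no allocation on recursion
}

void main()
{
    const ubyte[] digits = [5, 6, 2];
    auto scratch = new ubyte[digits.length]; // single allocation
    search(scratch, 0, digits);
    ubyte[] expected = [5, 6, 2];
    assert(scratch == expected);
}
```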
 Secondly, your hash function looks suspicious. Why are you 
 chaining your hash per digit?
I've answered that above... but I went further: instead of using the built-in `hashOf` function, I decided to use [my own hashing function](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3), which takes advantage of the input characteristics... I mentioned before the trick I did to make the int128 solution faster (by the way, the int128 solution was not acceptable, as ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and it also boosted performance noticeably.
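The linked commit isn't shown inline here, so the following is only a guess at the function's shape, based on the description later in this thread (a shift-left per digit that can only "overflow" on very long keys); the name `digitHash` is hypothetical:

```d
// Fold each phone digit (0-9 fits in 4 bits) into the accumulator.
// Distinct keys collide only once their combined width exceeds size_t,
// which matches the "almost perfect hash" claim later in the thread.
size_t digitHash(size_t h, ubyte digit)
{
    return (h << 4) | digit;
}

void main()
{
    assert(digitHash(0, 5) == 5);
    assert(digitHash(digitHash(0, 5), 3) == 0x53); // 5 then 3 -> 0b0101_0011
}
```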
 Then a minor point: I wouldn't use Array in printTranslations. 
 It's overkill for what you need to do; a built-in array would 
 work just fine.
Hm, sorry to break it to you, but [changing that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba45d8abc3bccbc17c7f1e) to an `Array` was the single greatest improvement I made. IIRC it made the code 2x faster. Maybe I was using native arrays wrongly before... but it's unclear to me how I could have done better with them (and the docs themselves suggest using `Array` if performance matters, which I seem to have confirmed is good advice).
 Next, what stood out is ISolutionHandler.  If I were to write 
 this, I wouldn't use OO for this at all, and especially not 
 interfaces, because they involve a double indirection.
As Daniel Kozak mentioned, this doesn't matter. I tried rewriting it so that there was no indirection to the extent possible (I used delegate fields within a final class, so the functions to use are selected at startup and there's no overhead choosing which functions to call at runtime), and it was actually slightly slower (but too close to call, really). My solution with interfaces used two final classes, so I think there was very little overhead, as no vtable lookup was needed, right? You can check [my change here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d96a73f95b8e204d9325ed0e2ad19cece5e0c18); let me know if you can find a flaw in it.
 Finally, a very a small point: in loadDictionary, you do an AA 
 lookup with `n in result`, and then if that returns null, you 
 create a new entry. This does two AA lookups, once 
 unsuccessfully, and the second time to insert the missing key.  
 You could use the .require operation with a delegate instead of 
 `in` followed by `if (... is null)`, which only requires a 
 single lookup.
Thanks for pointing that out. I [changed that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/1ee3d858466904a74468eb139675ec7cf2c690d0) to use `require`, but it made no perceptible difference... still, I kept it in my "final" implementation because it was much cleaner!
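For reference, the `require`-based insertion looks roughly like this (the dictionary shape is illustrative; in the Prechelt mapping both "mir" and "Mix" encode to "562"):

```d
// One AA lookup instead of `in` followed by an insert: `require` returns
// a ref to the slot, creating a default-initialized entry if the key is
// missing, so appending works whether or not the key already existed.
void main()
{
    string[][string] dict;
    dict.require("562") ~= "mir"; // creates the entry
    dict.require("562") ~= "Mix"; // reuses it - still a single lookup
    assert(dict["562"] == ["mir", "Mix"]);
}
```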
 ryuukk_:
 You do hash map lookup for every character in D, it's slow, 
 whereas in Rust you do it via pattern matching, java does the 
 same, pattern matching
I [rewrote that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b8c072f113f5a36b54551051181ef877a8821e8) to use a switch. The speed seems to have increased very slightly for the very small inputs; it's difficult to tell. But I kept the change anyway, because it was simpler and closer to the Rust implementation. Notice that this table lookup is not in the hot path, so, as expected, it only seems to speed up loading the dictionary slightly (which is good, because D was behind for small inputs - and even this tiny improvement could help D catch up).
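The switch version of that table lookup looks roughly like this, using the standard Prechelt letter-to-digit mapping (the function name is illustrative):

```d
// Letter -> phone digit for the Prechelt encoding (e->0, jnq->1, rwx->2,
// dsy->3, ft->4, am->5, civ->6, bku->7, lop->8, ghz->9). The compiler
// can lower this to a jump table, so no hashing happens at all.
ubyte charToDigit(char c)
{
    switch (c)
    {
        case 'e', 'E': return 0;
        case 'j', 'n', 'q', 'J', 'N', 'Q': return 1;
        case 'r', 'w', 'x', 'R', 'W', 'X': return 2;
        case 'd', 's', 'y', 'D', 'S', 'Y': return 3;
        case 'f', 't', 'F', 'T': return 4;
        case 'a', 'm', 'A', 'M': return 5;
        case 'c', 'i', 'v', 'C', 'I', 'V': return 6;
        case 'b', 'k', 'u', 'B', 'K', 'U': return 7;
        case 'l', 'o', 'p', 'L', 'O', 'P': return 8;
        case 'g', 'h', 'z', 'G', 'H', 'Z': return 9;
        default: assert(0, "not a letter");
    }
}

void main()
{
    assert(charToDigit('M') == 5);
    assert(charToDigit('i') == 6);
    assert(charToDigit('r') == 2); // "Mir" -> 562
}
```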
 To save from having to write a old-school verbose switch, i 
 suspect he went with a hashmap, wich is slower in that case, 
 hence why i keep advocate for that feature, or i should say, 
 that upgrade to switch, wich java has adopted, as well as rust:
The reason I used a hash table was that I believed it would take advantage of D's compile-time programming and be even faster than a switch. To be completely sure that's not the case, I would need to benchmark that function individually, but I'm not so interested in knowing that, as it made almost no difference to the implementation's performance anyway.
 Yet another reason to advocate for pattern matching in D and 
 switch as expression
Hm... not sure your argument holds here, given the above :/
 evilrat:
 There is another important difference, i quickly looked up D 
 associative array implementation and the search looks like 
 nlog(n) complexity with plain loop iteration of hashes, whereas 
 rust's implementation is using "swisstable" algorithm 
 implementation that has packed SIMD optimized lookups, this is 
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown, as you've found out). That's too slow (because it's DoS-resistant - though Java's hash is too, and it's still faster). I had to replace it with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code.

So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general-purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.

One more thing I did was to avoid allocations in the print function, so that D could have fewer memory allocations (which had already proved to be a big slowdown) and perform better on the print problem. This [change did make the code faster](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/8937b80ec9c851ca196780a708c2ef84c499f8c), and it actually caught up with Rust for the smallest input.
 Jordan Wilson:
 I agree! Thanks for posting your benchmarks, I thought your 
 whole benching setup was pretty good, and learnt alot from your 
 code and the resulting contributions in the thread and others.
Thanks for that :) I am used to being mercilessly roasted in threads like this, because no one likes to be told their favourite language is not so fast :D so any support is highly appreciated!

Anyway, here's the fastest implementation I came up with, based on all the feedback here, that still resembles the Common Lisp algorithm and the Rust implementation: https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/1ee3d858466904a74468eb139675ec7cf2c690d0/src/d/src/dencoder.d (branch name is `dlang-key-hash-incremental-avoid-gc`).

Performance on Mac M1:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,36
./rust,24100864,144
./rust,23986176,581
./rust,24100864,1152
./rust,7815168,6644
./rust,7766016,11752
===> src/d/dencoder
src/d/dencoder,13385728,31
src/d/dencoder,24477696,262
src/d/dencoder,24477696,1173
src/d/dencoder,24395776,2313
src/d/dencoder,5701632,7108
src/d/dencoder,5767168,13692
```

Performance on Linux x86_64 (compiled with LDC 1.36.0 and dub `-b release-nobounds`):

```
Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23138304,62
./rust,23138304,236
./rust,23138304,932
./rust,23138304,1828
./rust,9027584,6108
./rust,9027584,13759
===> src/d/dencoder
src/d/dencoder,229851136,62
src/d/dencoder,229851136,403
src/d/dencoder,229851136,1789
src/d/dencoder,229851136,3573
src/d/dencoder,218996736,8254
src/d/dencoder,218996736,21412
```

Performance on Linux x86_64 using GDC:

```
Proc,Run,Memory(bytes),Time(ms)
===> ./rust
./rust,25178112,57
./rust,25178112,262
./rust,25178112,1082
./rust,25178112,2084
./rust,11067392,6328
./rust,11067392,35849
===> src/d/dencoder
src/d/dencoder,231968768,72
src/d/dencoder,231968768,635
src/d/dencoder,231968768,2875
src/d/dencoder,231968768,5739
src/d/dencoder,221110272,18195
src/d/dencoder,221110272,128987
```

D overhead with the fastest compiler, LDC, compared with Rust:

```
1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748
```

If anyone ever asks me, I will say that D can be as fast as Rust/C, but only if you're very careful about using the D features that perform well... if you just write clean D code (use the GC, native arrays, etc.), you're likely to get code that is at least 50% slower. Thanks everyone for all the input.
Jan 20
next sibling parent Renato <renato athaydes.com> writes:
On Saturday, 20 January 2024 at 13:07:39 UTC, Renato wrote:
 D overhead with the fastest compiler, LDC, compared with Rust:

 ```
 1.0
 1.707627119
 1.919527897
 1.954595186
 1.351342502
 1.556217748
 ```
Oh sorry, I only posted the ratios for the Linux benchmark... On the Mac M1, for some reason, D was a bit closer to Rust in some of the runs:

```
0.8611111111
1.819444444
2.018932874
2.0078125
1.069837447
1.165078285
```
Jan 20
prev sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much!
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes, whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code. So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
 This function is not good for the general purpose case, but it's probably
 as good as it gets for this problem (see my int128 trick in a previous post
 on this thread to understand why). I did try a few variations of hashing
 before settling on this one... Rust, if anything, is at a disadvantage here.

And here you are wrong: it is the hashing that the slowdown comes from. It is
not only about the speed of the hashing function. It is about the quality of
the hashing function for this particular problem, and about how the hash
table (associative array) is implemented.
Jan 20
parent reply Renato <renato athaydes.com> writes:
On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
 On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn 
 < digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much! 
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes, 
 whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this 
 is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code. So, no, that's not where the difference is coming from... in fact, I am using a faster hashing function in D. You can [see my custom hashing function here]( https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.
And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.
I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions, except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferably you should provide a faster implementation. I will even accept a theoretical explanation, if you can give one. What I will not accept is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for in a technical discussion.
Jan 20
parent Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, 20 Jan 2024 at 21:21, Renato via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 On Saturday, 20 January 2024 at 19:45:19 UTC, Daniel Kozak wrote:
 On Sat, Jan 20, 2024 at 2:11 PM Renato via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 Wow, fantastic feedback from lots of people, thanks so much!
 ...

 evilrat:
 There is another important difference, i quickly looked up D
 associative array implementation and the search looks like
 nlog(n) complexity with plain loop iteration of hashes,
 whereas
 rust's implementation is using "swisstable" algorithm
 implementation that has packed SIMD optimized lookups, this
 is
 likely where the 3x speed difference comes from.
I am not using the default hash implementation in Rust (which is hashbrown as you've found out). That's too slow (because it's ddos secure - still Java's hash also is and it's still faster). I had to replace that with a library called `ahash`. If you're interested in knowing more about that, please [read my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code)
about optimising the Rust code.
 So, no, that's not where the difference is coming from... in
 fact, I am using a faster hashing function in D.

 You can [see my custom hashing function
 here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3).
 This function is not good for the general purpose case, but
 it's probably
 as good as it gets for this problem (see my int128 trick in a
 previous post
 on this thread to understand why). I did try a few variations
 of hashing
 before settling on this one... Rust, if anything, is at a
 disadvantage here.
And here you are wrong, it is the hashing when the slowdown comes. It is not only about the speed of hashing function. It is about the quality of hashing functiuon for this particular problem and it is about how hashing table (associative array) is implemented.
I've explained why this function is almost a perfect hash function for this problem (there will be almost no collisions except for very large inputs where the shift-left will "overflow", and even then the probability of collisions should be very low). If you're going to claim I'm wrong, you must show exactly where I'm wrong, and preferrably you should provide a faster implementation. I will even accept a theoretical explanation if you can give one. What I will not accept, is for you to call me "wrong" just because you say so. That's childish behaviour and uncalled for on a technical discussion.
If you had provided info that the hash is perfect, then I am sorry - I have probably missed that in this thread, my fault. Then I will need to look into this more carefully and do much more than just make assumptions.

Jan 21
prev sibling parent Jordan Wilson <wilsonjord gmail.com> writes:
On Friday, 19 January 2024 at 08:57:40 UTC, Renato wrote:
 Do you know why the whole thread seems to have disappeared?? 
 There's a lot of good stuff in the thread, it would be a huge 
 shame to lose all that!
I agree! Thanks for posting your benchmarks. I thought your whole benching setup was pretty good, and I learnt a lot from your code and the resulting contributions in this thread and others. Jordan
Jan 19