
digitalmars.D - Humble benchmark (fisher's exact test)

reply Ki Rill <rill.ki yahoo.com> writes:
It's a simple benchmark examining:
* execution time (sec)
* memory consumption (kb)
* binary size (kb)
* conciseness of a programming language (lines of code)

[Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
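For readers unfamiliar with the workload: Fisher's exact test on a 2x2 table sums hypergeometric probabilities over all tables with the same margins. A minimal Python sketch of the two-sided test (an illustrative implementation for reference, not the repo's benchmark code):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    row/column totals that is no more likely than the observed table.
    """
    n = a + b + c + d
    r1, c1 = a + b, a + c                  # row-1 and column-1 totals
    denom = comb(n, c1)

    def p(x):                              # P(top-left cell == x) given the margins
        return comb(r1, x) * comb(n - r1, c1 - x) / denom

    p_obs = p(a)
    lo = max(0, c1 - (n - r1))             # smallest feasible top-left cell
    hi = min(r1, c1)                       # largest feasible top-left cell
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs + 1e-12)

p = fisher_exact_2x2(3, 1, 1, 3)           # 34/70 for this margin-balanced table
```

The benchmarked programs repeat essentially this computation many times; the compiled versions differ mainly in how the factorials/log-gamma terms are evaluated.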
Aug 13 2021
next sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/13/21 7:19 PM, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)
 
 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
 
The most obvious improvement I can see is removing the dynamic array creations in a loop. Since that array seems to be very short, using a static array would improve performance. (Ok, now I see that you already do that for the betterC and nogc versions.) Also, I wonder how disabling GC collections would affect execution time and memory consumption: https://dlang.org/spec/garbage.html#gc_config Ali
Aug 13 2021
prev sibling next sibling parent zjh <fqbqrr 163.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
`Das betterC` is very competitive. `D` should invest resources in this.
Aug 13 2021
prev sibling next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Lots of things to improve there. https://github.com/rillki/humble-benchmarks/pull/4 A nice quick morning exercise :)
Aug 14 2021
next sibling parent Ki Rill <rill.ki yahoo.com> writes:
On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Lots of things to improve there. https://github.com/rillki/humble-benchmarks/pull/4 A nice quick morning exercise :)
Thanks!
Aug 14 2021
prev sibling parent reply Ki Rill <rill.ki yahoo.com> writes:
On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Lots of things to improve there. https://github.com/rillki/humble-benchmarks/pull/4 A nice quick morning exercise :)
I have added the proposed changes. The performance of D increased to almost that of C, with a ~1-2 second difference when using LDC! The betterC version is still slightly faster though. To sum up:
```
Clang C          9.1 s
Clang C++        9.4 s
LDC Das betterC  10.3 s
LDC D libC math  12.2 s
Rust             13 s
```
Thank you, John, for your invaluable help! I didn't know that Phobos math is twice as slow as libC math.
Aug 15 2021
parent max haughton <maxhaton gmail.com> writes:
On Sunday, 15 August 2021 at 09:20:56 UTC, Ki Rill wrote:
 On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Lots of things to improve there. https://github.com/rillki/humble-benchmarks/pull/4 A nice quick morning exercise :)
I have added the proposed changes. The performance of D increased to almost that of C, with a ~1-2 second difference when using LDC! The betterC version is still slightly faster though. To sum up:
```
Clang C          9.1 s
Clang C++        9.4 s
LDC Das betterC  10.3 s
LDC D libC math  12.2 s
Rust             13 s
```
Thank you, John, for your invaluable help! I didn't know that Phobos math is twice as slow as libC math.
I could be wrong, but I think our routines internally use the maximum precision when they can, so they are slower, but they are also more precise in the internals (where allowed by the platform). You could probably test this by running these benchmarks on ARM or similar.
Aug 15 2021
prev sibling next sibling parent Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 * binary size (kb)
Regarding the binary size: please make sure that you're using dynamic linking for the D package, as by default it always links statically, while libc and libc++ are always linked dynamically.
Aug 14 2021
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
I'm skeptical that you're measuring what you think you're measuring. I say this because the R version shouldn't be that much slower than the C version. All that happens when you call `fisher.test` is that it checks which case it's handling and then calls the builtin C function. For example, [this line](https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/fisher.test.R#L120).

More likely, a chunk of your C code is being eliminated by the optimizer. Another thing is that printing to the screen is much slower in R than in C. You shouldn't benchmark printing to the screen, since that is not something you would ever do in practice.

If you really want performance, you can determine which case applies to your code and then make the underlying `.Call` yourself. If you don't do that, you're comparing Fisher's exact test against a routine that does a lot more than Fisher's exact test. In any event, you're not comparing against an R implementation of this test.
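The point about convenience wrappers can be illustrated with a toy Python sketch (hypothetical names, not R's actual code): timing a user-facing function also times its argument checking and case dispatch, not just the compiled kernel it eventually calls.

```python
import timeit

def kernel(x):
    # stand-in for the compiled routine that does the real work,
    # analogous to the builtin C function behind R's fisher.test
    return x * x

def convenience_wrapper(x):
    # stand-in for a user-facing function: validation and case
    # dispatch happen before the kernel is ever reached
    if not isinstance(x, int):
        raise TypeError("integer expected")
    if x < 0:
        raise ValueError("non-negative value expected")
    return kernel(x)

# Both compute the same result, but benchmarking the wrapper
# measures everything around the kernel call as well.
assert convenience_wrapper(7) == kernel(7) == 49
t_kernel = timeit.timeit(lambda: kernel(7), number=50_000)
t_wrapper = timeit.timeit(lambda: convenience_wrapper(7), number=50_000)
```

The gap between `t_wrapper` and `t_kernel` is pure wrapper overhead; making the low-level call directly (as `.Call` does in R) is what removes it.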
Aug 14 2021
parent reply Ki Rill <rill.ki yahoo.com> writes:
On Saturday, 14 August 2021 at 12:48:05 UTC, bachmeier wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 More likely a chunk of your C code is being eliminated by the 
 optimizer. Another thing is that printing to the screen is much 
 slower in R than in C. You shouldn't benchmark printing to the 
 screen since that is not something you would ever do in 
 practice.
It happens at the end of the program only once and takes a fraction of a second. I consider it to be irrelevant here.
 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's exact 
 test against a routine that does a lot more than Fisher's exact 
 test. In any event, you're not comparing against an R 
 implementation of this test.
That is the point of this benchmark: to test it against the Python/R implementations, irrespective of what they do additionally. And to test compiled languages in general.
Aug 14 2021
parent reply bachmeier <no spam.net> writes:
On Saturday, 14 August 2021 at 14:08:05 UTC, Ki Rill wrote:

 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's 
 exact test against a routine that does a lot more than 
 Fisher's exact test. In any event, you're not comparing 
 against an R implementation of this test.
That is the point of this benchmark, to test it against Python/R implementation irrespective of what it does additionally. And to test compiled languages in general.
That might have been the point of your benchmark, but that doesn't mean the benchmark is meaningful, in this case for at least three reasons:

1. You're measuring the performance of completely different tasks in R and C, where the R task is much bigger.
2. What you've done is only one way to use R. Anyone that wanted performance would use `.Call` rather than what you're doing.
3. R has a JIT compiler, and you're likely not making use of it.

The comparison against R is not what you're after anyway. If you don't want to do it in a way that's meaningful - and that's perfectly understandable - it's best to delete it.
Aug 23 2021
parent reply russhy <russhy gmail.com> writes:
On Monday, 23 August 2021 at 13:12:21 UTC, bachmeier wrote:
 On Saturday, 14 August 2021 at 14:08:05 UTC, Ki Rill wrote:

 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's 
 exact test against a routine that does a lot more than 
 Fisher's exact test. In any event, you're not comparing 
 against an R implementation of this test.
That is the point of this benchmark, to test it against Python/R implementation irrespective of what it does additionally. And to test compiled languages in general.
That might have been the point of your benchmark, but that doesn't mean the benchmark is meaningful, in this case for at least three reasons: 1. You're measuring the performance of completely different tasks in R and C, where the R task is much bigger. 2. What you've done is only one way to use R. Anyone that wanted performance would use .Call rather than what you're doing. 3. R has a JIT compiler, and you're likely not making use of it. The comparison against R is not what you're after anyway. If you don't want to do it in a way that's meaningful - and that's perfectly understandable - it's best to delete it.
JIT isn't something you want if you need fast execution time. And nobody is going to warm the JIT 1000000 times before calling a task; you want the result immediately. JITs are only reliable if the program is calling the same code 100000000 times, which never happens, except under heavy load, which also almost never happens for most use cases other than webdev; and even then you get crappy execution time because of cold startup. This benchmark even mentions it:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)
Aug 23 2021
parent reply bachmeier <no spam.net> writes:
On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
Aug 23 2021
next sibling parent reply russhy <russhy gmail.com> writes:
On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
that's why they are now spending their time writing an AOT compiler after Go started to eat their cake ;)
Aug 23 2021
next sibling parent Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:
On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 that's why they are now spending their time writing an AOT 
 compiler after GO started to ate their cake ;)
What does Go have to do with AOT development?
Aug 23 2021
prev sibling parent reply Bienlein <ffm2002 web.de> writes:
On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
that's why they are now spending their time writing an AOT compiler after GO started to ate their cake ;)
anything faster, see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html
Aug 24 2021
next sibling parent russhy <russhy gmail.com> writes:
On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
 On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
that's why they are now spending their time writing an AOT compiler after GO started to ate their cake ;)
make anything faster, see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html
R2R is not true AOT: it still ships the JIT and IL and recompiles code at runtime, so this benchmarks game entry is flawed and a pure lie. For true AOT you need to use NativeAOT (formerly CoreRT).
Aug 24 2021
prev sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
 On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
that's why they are now spending their time writing an AOT compiler after GO started to ate their cake ;)
make anything faster, see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html
and Burst compiler, UWP .NET Native, Windows 8 Bartok, and several community projects. Full AOT on regular .NET is coming in .NET 6, with final touches in .NET 7.

Regarding Java, like C and C++, there are plenty of implementations to choose from, including a couple of commercial JDKs with proper AOT. Then both OpenJDK and OpenJ9 do JIT caching between runs, which gets improved each time the application runs thanks to PGO profiles. OpenJ9 and Azul go one step further by having AOT/JIT compiler daemons that generate native code with PGO data from the whole cluster.

Finally Android, despite not being really Java, uses a hand-written assembly interpreter for fast startup, then JIT, and when the device is idle, the JIT code gets AOT-compiled with PGO gathered during each run. Starting with Android 10, the PGO profiles are shared across devices via the Play Store, so that AOT compilation can be done right away, skipping the whole interpreter/JIT step.

Really, the benchmarks game is a joke, because they only use the basic FOSS tooling available to them. And after 25/20 years, apparently plenty still don't know the Java and .NET ecosystems as they should.
Aug 24 2021
parent russhy <russhy gmail.com> writes:
On Tuesday, 24 August 2021 at 20:00:06 UTC, Paulo Pinto wrote:
 On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
 On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
that's why they are now spending their time writing an AOT compiler after GO started to ate their cake ;)
make anything faster, see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html
and Burst compiler, UWP .NET Native, Windows 8 Bartok, and several community projects. Full AOT on regular .NET is coming in .NET 6 with final touches in .NET 7. Regarding Java, like C and C++, there are plenty of implementations to choose from, a couple of commercial JDKs with proper AOT. Then both OpenJDK and OpenJ9 do JIT cache between runs, which gets improved each time the application runs thanks PGO profiles. OpenJ9 and Azul go one step further by having AOT/JIT compiler daemons that generates native code with PGO data from the whole cluster. Finally Android, despite not being really Java, uses an hand written Assembly interpreter for fast startup, then JIT, and when the device is idle, the JIT code gets AOT compiled with PGO gathered during each run. Starting with Android 10 the PGO profiles are shared across devices via the play store, so that AOT compilation can be done right away skipping the whole interpreter/JIT step. Really, the benchmarks game is a joke, because they only use the basic FOSS tooling available to them. And after 25/20 years apparently plenty still don't know Java and .NET ecosystems as they should.
Android is the worst of all: a compiler on a resource-constrained device that runs on a limited power source (battery). Good on Apple to forbid JIT in their store; one of their best moves.
Aug 24 2021
prev sibling parent Bienlein <ffm2002 web.de> writes:
On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time
? I suppose they spent all those hours writing their JIT compilers because they had nothing else to do with their time.
JIT is a very good compromise for programming languages that are intended for application development and also server-side development, while also making calls to C functions easy. For application development, using a jitter and moving performance-intensive features into C programs is an approach that was already shown to work well in the old times, about 20 years ago, when Smalltalk had a period where it was doing commercially well.
Aug 24 2021
prev sibling next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Using the `intel-intrinsics` package you can do 4x exp or log operations at once.
Aug 14 2021
parent reply Tejas <notrealemail gmail.com> writes:
On Saturday, 14 August 2021 at 14:14:08 UTC, Guillaume Piolat 
wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Using the `intel-intrinsics` package you can do 4x exp or log operations at once.
I know both D and C can theoretically reach the same level of performance, but why does C **always** lead by a few milliseconds? What is it that we aren't doing? Is it the implementation's fault? The optimizer? What can we do for those precious few milliseconds? It's so frustrating to see C/C++ always being the winners in the **absolute** sense, and we always end up making the argument about how much more painstaking it is to actually create a complete program in those languages only for negligibly better performance. Do these benchmarks even matter if it's all about the quality of implementation? Sorry if I'm sounding a little bitter.
Aug 14 2021
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 14 August 2021 at 16:20:21 UTC, Tejas wrote:
 It's so frustrating to see C/C++ always being the winners in 
 the **absolute** sense, and we always end up making the 
 argument about how much more painstaking it is to actually 
 create a complete program in those languages only for 
 negligibly better performance.

 Do these benchmarks even matter if it's all about the quality 
 of implementation?
If you pay me I can produce a faster D version of whatever small program you want. But the reality is that no one really _needs_ those benchmark programs, and thinking about optimizing them is an adequate punishment for writing them. The only thing I can think of where C++ could win a bit against D was that the ICC compiler could auto-vectorize transcendentals, like logf in a loop, which LLVM doesn't do. But the ICC compiler has been moving to LLVM recently. When your compiler sees the same IR from different front-end languages, in the end it is the same codegen.
Aug 14 2021
parent max haughton <maxhaton gmail.com> writes:
On Saturday, 14 August 2021 at 19:28:42 UTC, Guillaume Piolat 
wrote:
 On Saturday, 14 August 2021 at 16:20:21 UTC, Tejas wrote:
 [...]
If you pay me I can produce a faster D version of whatever small program you want. But the reality is that noone really _needs_ those benchmark programs, and thinking about optimizing them is an adequate punishment for writing them. The only things I can think of where C++ could wins a bit against D was that the ICC compiler could auto-vectorize transcendentals. Like logf in a loop, which LLVM doesn't do. But the ICC compiler has been moving to LLVM recently. When your compiler see the same IR from different front-end language, in the end it is the same codegen.
ICC *has* moved to LLVM. Past tense now, sadly.
Aug 14 2021
prev sibling next sibling parent reply max haughton <maxhaton gmail.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
If anyone is wondering why the GDC results look a bit weird: It's because GDC doesn't actually inline unless you compile with LTO or enable whole program optimization (The rationale is due to the interaction of linking with templates). https://godbolt.org/z/Gj8hMjEch play with removing the '-fwhole-program' flag on that link.
Aug 14 2021
parent max haughton <maxhaton gmail.com> writes:
On Saturday, 14 August 2021 at 14:29:16 UTC, max haughton wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
If anyone is wondering why the GDC results look a bit weird: It's because GDC doesn't actually inline unless you compile with LTO or enable whole program optimization (The rationale is due to the interaction of linking with templates). https://godbolt.org/z/Gj8hMjEch play with removing the '-fwhole-program' flag on that link.
A little more: I got the performance down to be less awful by using LTO (and found an LTO ICE in the process...), but as far as I can tell the limiting factor for GDC is that its standard library by default doesn't seem to be compiled with either inlining or LTO support enabled, so cycles are being wasted on (say) calling isNaN, sadly. I also note that x87 code is generated in Phobos, which could hypothetically be required for the necessary precision on a generic target, but is probably quite slow.
Aug 14 2021
prev sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Interesting. I know people say benchmarks aren't important, but I disagree. I think it's healthy to compare from time to time 👍
Aug 23 2021
parent reply russhy <russhy gmail.com> writes:
On Monday, 23 August 2021 at 07:52:01 UTC, Imperatorn wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
Interesting. I know people say benchmarks aren't important, but I disagree. I think it's healthy to compare from time to time 👍
I agree
Aug 23 2021
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Aug 23, 2021 at 05:37:01PM +0000, russhy via Digitalmars-d wrote:
 On Monday, 23 August 2021 at 07:52:01 UTC, Imperatorn wrote:
[...]
 Interesting. I know people say benchmarks aren't important, but I
 disagree. I think it's healthy to compare from time to time 👍
I agree
I wouldn't say benchmarks aren't *important*; I think the issue is how you interpret the results. Sometimes people read too much into them, i.e., over-generalize far beyond the scope of the actual test, forgetting that a benchmark is simply just that: a benchmark. It's a rough estimation of performance in a contrived use case that may or may not reflect real-world usage.

Many of the simpler benchmarks consist of repeating a task 100000... times, which can be useful to measure, but hardly represents real-world usage. You can optimize the heck out of memcpy (substitute your favorite function to measure here) until it beats everybody else in the benchmark, but that does not necessarily mean it will actually make your real-life program run faster. It may, it may not. Because after running the 100th time, all the code/data you're using has gone into cache, so it will run much faster than the first few iterations. But in a real program, you usually don't need to call memcpy (etc.) 100000 times in a row. Usually you need to do something else in between, and that something else may cause the memcpy-related data to be evicted from cache, change the state of the branch predictor, etc. So your ultra-optimized code may not behave the same way in a real-life program vs. the benchmark, and may turn out to be actually slower than expected.

Note that this does not mean optimizing for a benchmark is completely worthless; usually the first few iterations represent actual bottlenecks in the code, and improving those will improve performance in a real-life scenario. But only up to a certain point. Trying to optimize beyond that, you risk optimizing for your specific benchmark instead of the general use case, and therefore may end up pessimizing the code instead of optimizing it.

Real-life example: GNU wc, which counts the number of lines in a text file. Some time ago I did a comparison with several different D implementations of the same program that I wrote, and discovered that because GNU wc uses glibc's memchr, which was ultra-optimized for scanning large buffers, if your text file contains long lines then wc will run faster; but if the text file contains many short lines, a naïve File.byLine / foreach (ch; line) D implementation will actually beat wc. The reason is that glibc's memchr implementation is optimized for large buffers, and has a rather expensive setup overhead for the main fast-scanning loop. For long lines, the fast scan more than outweighs this overhead, but for short lines, the overhead dominates the running time, so a naïve char-by-char implementation works faster.

I don't know the history behind glibc's memchr implementation, but I'll bet that at some point somebody came up with a benchmark that tests scanning of large buffers and said, look, this complicated bit-hack fast-scanning loop improves the benchmark by X%! So the code change was accepted, without realizing that this fast scanning of large buffers comes at the cost of pessimizing the small-buffer case.

Past a certain point, optimization becomes a trade-off, and which route you go depends on your specific application, not some general benchmarks that do not represent real-world usage patterns.

T

--
Famous last words: I *think* this will work...
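The wc vs. naïve-loop trade-off described above can be sketched in Python, where `bytes.count` plays the role of the optimized bulk scan and a byte-by-byte loop plays the naïve implementation (a loose analogy for the two scanning styles, not glibc's code):

```python
# Two ways to count newlines in a buffer: an optimized bulk routine
# vs. a naive per-byte loop. Which wins depends on line length.

long_lines = (b"x" * 5000 + b"\n") * 200   # a few very long lines
short_lines = b"ab\n" * 200_000            # many short lines

def count_bulk(buf):
    # bulk scan (memchr-style): any setup cost is amortized over
    # a long run of fast scanning
    return buf.count(b"\n")

def count_naive(buf):
    # simple byte-by-byte loop, like foreach (ch; line) in D
    n = 0
    for byte in buf:
        if byte == 0x0A:   # b"\n"
            n += 1
    return n

assert count_bulk(long_lines) == count_naive(long_lines) == 200
assert count_bulk(short_lines) == count_naive(short_lines) == 200_000
```

Both give identical answers; the argument in the post is purely about where the per-call setup overhead of the optimized routine stops paying for itself.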
Aug 23 2021