digitalmars.D - Humble benchmark (fisher's exact test)

Ki Rill <rill.ki yahoo.com> writes:

It's a simple benchmark examining:
* execution time (sec)
* memory consumption (kb)
* binary size (kb)
* conciseness of a programming language (lines of code)

[Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

Aug 13 2021

Ali Çehreli <acehreli yahoo.com> writes:

On 8/13/21 7:19 PM, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)
 
 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)
 

The most obvious improvement I can see is removing the dynamic array 
creations in a loop. Since that array seems to be very short, using a 
static array would improve performance. (Ok, now I see that you already 
do that for the betterC and nogc versions.)

Also, I wonder how disabling GC collections would affect execution time 
and memory consumption:

   https://dlang.org/spec/garbage.html#gc_config

Ali

Aug 13 2021

zjh <fqbqrr 163.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:

`Das betterC` is very competitive. `D` should invest 'resources' 
in this.

Aug 13 2021

John Colvin <john.loughran.colvin gmail.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

Lots of things to improve there.

https://github.com/rillki/humble-benchmarks/pull/4

A nice quick morning exercise :)

Aug 14 2021

Ki Rill <rill.ki yahoo.com> writes:

On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 Lots of things to improve there.

 https://github.com/rillki/humble-benchmarks/pull/4

 A nice quick morning exercise :)

Thanks!

Aug 14 2021

Ki Rill <rill.ki yahoo.com> writes:

On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 Lots of things to improve there.

 https://github.com/rillki/humble-benchmarks/pull/4

 A nice quick morning exercise :)

I have added the proposed changes. The performance of D increased 
to almost that of C with ~1-2 seconds difference if using LDC!

The betterC version is still slightly faster though.

To sum up:
```
Clang C           9.1 s
Clang C++         9.4 s
LDC Das betterC   10.3 s
LDC D libC math   12.2 s
Rust              13 s
```

Thank you John for your invaluable help! I didn't know that Phobos 
math is twice as slow as libC math.

Aug 15 2021

max haughton <maxhaton gmail.com> writes:

On Sunday, 15 August 2021 at 09:20:56 UTC, Ki Rill wrote:
 On Saturday, 14 August 2021 at 10:26:52 UTC, John Colvin wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 Lots of things to improve there.

 https://github.com/rillki/humble-benchmarks/pull/4

 A nice quick morning exercise :)

 I have added the proposed changes. The performance of D 
 increased to almost that of C with ~1-2 seconds difference if 
 using LDC!

 The betterC version is still slightly faster though.

 To sum up:
 ```
 Clang C           9.1 s
 Clang C++         9.4 s
 LDC Das betterC   10.3 s
 LDC D libC math   12.2 s
 Rust              13 s
 ```

 Thank you John for you invaluable help! I didn't know that 
 Phobos math is twice as slow as libC math.

I could be wrong but I think our routines internally use the max 
precision when they can, so they are slower but they are also more 
precise in the internals (where allowed by the platform). You 
could probably test this by running these benchmarks on ARM or 
similar.

Aug 15 2021

Jacob Shtokolov <jacob.100205 gmail.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 * binary size (kb)

Regarding the binary size: please make sure that you're using 
dynamic linking for the D package, as by default it always links 
statically, while libc and libc++ are always linked dynamically.

Aug 14 2021

bachmeier <no spam.net> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

I'm skeptical that you're measuring what you think you're 
measuring. I say this because the R version shouldn't be that 
much slower than the C version. All that happens when you call 
`fisher.test` is that it checks which case it's handling and then 
calls the builtin C function. For example, [this 
line](https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/fisher.test.R#L120).

More likely a chunk of your C code is being eliminated by the 
optimizer. Another thing is that printing to the screen is much 
slower in R than in C. You shouldn't benchmark printing to the 
screen since that is not something you would ever do in practice. 
If you really want performance, you can determine which case 
applies to your code and then make the underlying `.Call` 
yourself. If you don't do that, you're comparing Fisher's exact 
test against a routine that does a lot more than Fisher's exact 
test. In any event, you're not comparing against an R 
implementation of this test.

Aug 14 2021

Ki Rill <rill.ki yahoo.com> writes:

On Saturday, 14 August 2021 at 12:48:05 UTC, bachmeier wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 More likely a chunk of your C code is being eliminated by the 
 optimizer. Another thing is that printing to the screen is much 
 slower in R than in C. You shouldn't benchmark printing to the 
 screen since that is not something you would ever do in 
 practice.

It happens at the end of the program only once and takes a 
fraction of a second. I consider it to be irrelevant here.

 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's exact 
 test against a routine that does a lot more than Fisher's exact 
 test. In any event, you're not comparing against an R 
 implementation of this test.

That is the point of this benchmark, to test it against Python/R 
implementation irrespective of what it does additionally. And to 
test compiled languages in general.

Aug 14 2021

bachmeier <no spam.net> writes:

On Saturday, 14 August 2021 at 14:08:05 UTC, Ki Rill wrote:

 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's 
 exact test against a routine that does a lot more than 
 Fisher's exact test. In any event, you're not comparing 
 against an R implementation of this test.

 That is the point of this benchmark, to test it against 
 Python/R implementation irrespective of what it does 
 additionally. And to test compiled languages in general.

That might have been the point of your benchmark, but that 
doesn't mean the benchmark is meaningful, in this case for at 
least three reasons:

1. You're measuring the performance of completely different tasks 
in R and C, where the R task is much bigger.
2. What you've done is only one way to use R. Anyone that wanted 
performance would use .Call rather than what you're doing.
3. R has a JIT compiler, and you're likely not making use of it.

The comparison against R is not what you're after anyway. If you 
don't want to do it in a way that's meaningful - and that's 
perfectly understandable - it's best to delete it.

Aug 23 2021

russhy <russhy gmail.com> writes:

On Monday, 23 August 2021 at 13:12:21 UTC, bachmeier wrote:
 On Saturday, 14 August 2021 at 14:08:05 UTC, Ki Rill wrote:

 If you really want performance, you can determine which case 
 applies to your code and then make the underlying `.Call` 
 yourself. If you don't do that, you're comparing Fisher's 
 exact test against a routine that does a lot more than 
 Fisher's exact test. In any event, you're not comparing 
 against an R implementation of this test.

 That is the point of this benchmark, to test it against 
 Python/R implementation irrespective of what it does 
 additionally. And to test compiled languages in general.

 That might have been the point of your benchmark, but that 
 doesn't mean the benchmark is meaningful, in this case for at 
 least three reasons:

 1. You're measuring the performance of completely different 
 tasks in R and C, where the R task is much bigger.
 2. What you've done is only one way to use R. Anyone that 
 wanted performance would use .Call rather than what you're 
 doing.
 3. R has a JIT compiler, and you're likely not making use of it.

 The comparison against R is not what you're after anyway. If 
 you don't want to do it in a way that's meaningful - and that's 
 perfectly understandable - it's best to delete it.

JIT isn't something you want if you need fast execution time.

Nobody is going to warm up a JIT 1000000 times to call a task; you 
want the result immediately.

They are only reliable if the program is calling the same code 
100000000 times, which never happens, except under heavy load, which 
also almost never happens for most use cases other than webdev; 
and even then you have crappy execution time because of cold 
startup.

This benchmark even mentions it:

 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

Aug 23 2021

bachmeier <no spam.net> writes:

On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

?

I suppose they spent all those hours writing their JIT compilers 
because they had nothing else to do with their time.

Aug 23 2021

russhy <russhy gmail.com> writes:

On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

 ?

 I suppose they spent all those hours writing their JIT 
 compilers because they had nothing else to do with their time.

that's why they are now spending their time writing an AOT 
compiler after Go started to eat their cake ;)

Aug 23 2021

Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:

On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 that's why they are now spending their time writing an AOT 
 compiler after GO started to ate their cake ;)

What does Go have to do with AOT development?

Aug 23 2021

Bienlein <ffm2002 web.de> writes:

On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

 ?

 I suppose they spent all those hours writing their JIT 
 compilers because they had nothing else to do with their time.

 that's why they are now spending their time writing an AOT 
 compiler after GO started to ate their cake ;)


AOT in C#/Java is only to speed up startup times. It doesn't make 
anything faster, see 
https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html

Aug 24 2021

russhy <russhy gmail.com> writes:

On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
 On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

 ?

 I suppose they spent all those hours writing their JIT 
 compilers because they had nothing else to do with their time.

 that's why they are now spending their time writing an AOT 
 compiler after GO started to ate their cake ;)


 AOT in C#/Java is only to speed up startup times. It doesn't 
 make anything faster, see 
 https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html

R2R is not true AOT: it still ships the JIT and IL and recompiles code 
at runtime. This benchmarks game is flawed and a pure lie.

For true AOT you need to use NativeAOT (formerly CoreRT).

Aug 24 2021

Paulo Pinto <pjmlp progtools.org> writes:

On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

JIT isn't something you want if you need fast execution time

I suppose they spent all those hours writing their JIT
compilers because they had nothing else to do with their time.

that's why they are now spending their time writing an AOT
compiler after GO started to ate their cake ;)

AOT in C#/Java is only to speed up startup times. It doesn't
make anything faster, see
https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html

That doesn't use all AOT options available to C# and Java.

[...] and Burst compiler, UWP .NET Native, Windows 8 Bartok, and
several community projects.

Full AOT on regular .NET is coming in .NET 6 with final touches
in .NET 7.

Regarding Java, like C and C++, there are plenty of
implementations to choose from, a couple of commercial JDKs with
proper AOT.

Then both OpenJDK and OpenJ9 do JIT caching between runs, which
gets improved each time the application runs thanks to PGO profiles.

OpenJ9 and Azul go one step further by having AOT/JIT compiler
daemons that generate native code with PGO data from the whole
cluster.

Finally Android, despite not being really Java, uses a hand-written
assembly interpreter for fast startup, then JIT, and when
the device is idle, the JIT code gets AOT compiled with PGO data
gathered during each run. Starting with Android 10 the PGO
profiles are shared across devices via the Play Store, so that
AOT compilation can be done right away, skipping the whole
interpreter/JIT step.

Really, the benchmarks game is a joke, because they only use the
basic FOSS tooling available to them.

And after 25/20 years apparently plenty of people still don't know
the Java and .NET ecosystems as they should.

Aug 24 2021

russhy <russhy gmail.com> writes:

On Tuesday, 24 August 2021 at 20:00:06 UTC, Paulo Pinto wrote:
 On Tuesday, 24 August 2021 at 09:29:03 UTC, Bienlein wrote:
 On Monday, 23 August 2021 at 22:27:02 UTC, russhy wrote:
 On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

 ?

 I suppose they spent all those hours writing their JIT 
 compilers because they had nothing else to do with their 
 time.

 that's why they are now spending their time writing an AOT 
 compiler after GO started to ate their cake ;)


 AOT in C#/Java is only to speed up startup times. It doesn't 
 make anything faster, see 
 https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharpaot.html





 and Burst compiler, UWP .NET Native, Windows 8 Bartok, and 
 several community projects.

 Full AOT on regular .NET is coming in .NET 6 with final touches 
 in .NET 7.

 Regarding Java, like C and C++, there are plenty of 
 implementations to choose from, a couple of commercial JDKs 
 with proper AOT.

 Then both OpenJDK and OpenJ9 do JIT cache between runs, which 
 gets improved each time the application runs thanks PGO 
 profiles.

 OpenJ9 and Azul go one step further by having AOT/JIT compiler 
 daemons that generates native code with PGO data from the whole 
 cluster.

 Finally Android, despite not being really Java, uses an hand 
 written Assembly interpreter for fast startup, then JIT, and 
 when the device is idle, the JIT code gets AOT compiled with 
 PGO gathered during each run. Starting with Android 10 the PGO 
 profiles are shared across devices via the play store, so that 
 AOT compilation can be done right away skipping the whole 
 interpreter/JIT step.

 Really, the benchmarks game is a joke, because they only use 
 the basic FOSS tooling available to them.

 And after 25/20 years apparently plenty still don't know Java 
 and .NET ecosystems as they should.

Android is the worst of all.

A compiler on a supposedly resource-constrained device that runs on 
a limited power source (battery). Good on Apple for forbidding JIT in 
their store; it's one of their best moves.

Aug 24 2021

Bienlein <ffm2002 web.de> writes:

On Monday, 23 August 2021 at 22:06:39 UTC, bachmeier wrote:
 On Monday, 23 August 2021 at 17:35:59 UTC, russhy wrote:

 JIT isn't something you want if you need fast execution time

 ?

 I suppose they spent all those hours writing their JIT 
 compilers because they had nothing else to do with their time.

JIT is a very good compromise for programming languages that are 
intended for application development and also server side 
development.


also working on making calls to C functions easy.

For application development, using a jitter and moving 
performance-intensive features into C programs is an approach that 
already proved to work well about 20 years ago, back when 
Smalltalk was doing well commercially.

Aug 24 2021

Guillaume Piolat <first.last gmail.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

Using the `intel-intrinsics` package you can do 4x exp or log 
operations at once.

Aug 14 2021

Tejas <notrealemail gmail.com> writes:

On Saturday, 14 August 2021 at 14:14:08 UTC, Guillaume Piolat 
wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 Using the `intel-intrinsics` package you can do 4x exp or log 
 operations at once.

I know both D and C can theoretically reach the same level of 
performance, but why does C **always** lead by a few 
milliseconds? What is it that we aren't doing? Is it the 
implementation's fault? The optimizer? What can we do for those 
precious few milliseconds?

It's so frustrating to see C/C++ always being the winners in the 
**absolute** sense, and we always end up making the argument 
about how much more painstaking it is to actually create a 
complete program in those languages only for negligibly better 
performance.

Do these benchmarks even matter if it's all about the quality of 
implementation?

Sorry if I'm sounding a little bitter.

Aug 14 2021

Guillaume Piolat <first.last gmail.com> writes:

On Saturday, 14 August 2021 at 16:20:21 UTC, Tejas wrote:
 It's so frustrating to see C/C++ always being the winners in 
 the **absolute** sense, and we always end up making the 
 argument about how much more painstaking it is to actually 
 create a complete program in those languages only for 
 negligibly better performance.

 Do these benchmarks even matter if it's all about the quality 
 of implementation?

If you pay me I can produce a faster D version of whatever small 
program you want.
But the reality is that no one really _needs_ those benchmark 
programs, and thinking about optimizing them is an adequate 
punishment for writing them.

The only thing I can think of where C++ could win a bit against 
D is that the ICC compiler could auto-vectorize transcendentals, 
like logf in a loop, which LLVM doesn't do.
But the ICC compiler has been moving to LLVM recently.

When your compiler sees the same IR from different front-end 
languages, in the end it is the same codegen.

Aug 14 2021

max haughton <maxhaton gmail.com> writes:

On Saturday, 14 August 2021 at 19:28:42 UTC, Guillaume Piolat 
wrote:
 On Saturday, 14 August 2021 at 16:20:21 UTC, Tejas wrote:
 [...]

 If you pay me I can produce a faster D version of whatever 
 small program you want.
 But the reality is that noone really _needs_ those benchmark 
 programs, and thinking about optimizing them is an adequate 
 punishment for writing them.

 The only things I can think of where C++ could wins a bit 
 against D was that the ICC compiler could auto-vectorize 
 transcendentals. Like logf in a loop, which LLVM doesn't do.
 But the ICC compiler has been moving to LLVM recently.

 When your compiler see the same IR from different front-end 
 language, in the end it is the same codegen.

ICC *has* moved to LLVM. Past tense now, sadly.

Aug 14 2021

max haughton <maxhaton gmail.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

If anyone is wondering why the GDC results look a bit weird: It's 
because GDC doesn't actually inline unless you compile with LTO 
or enable whole program optimization (The rationale is due to the 
interaction of linking with templates).

https://godbolt.org/z/Gj8hMjEch play with removing the 
'-fwhole-program' flag on that link.

Aug 14 2021

max haughton <maxhaton gmail.com> writes:

On Saturday, 14 August 2021 at 14:29:16 UTC, max haughton wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 If anyone is wondering why the GDC results look a bit weird: 
 It's because GDC doesn't actually inline unless you compile 
 with LTO or enable whole program optimization (The rationale is 
 due to the interaction of linking with templates).

 https://godbolt.org/z/Gj8hMjEch play with removing the 
 '-fwhole-program' flag on that link.

A little more: I got the performance to be less awful by 
using LTO (and found an LTO ICE in the process...), but as far as 
I can tell the limiting factor for GDC is that its standard 
library by default doesn't seem to be compiled with either inlining 
or LTO support enabled, so cycles are being wasted on (say) 
calling isNaN, sadly.

I also note that x87 code is generated in Phobos, which could 
hypothetically be required for the necessary precision on a generic 
target, but is probably quite slow.

Aug 14 2021

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

Interesting. I know people say benchmarks aren't important, but I 
disagree. I think it's healthy to compare from time to time 👍

Aug 23 2021

russhy <russhy gmail.com> writes:

On Monday, 23 August 2021 at 07:52:01 UTC, Imperatorn wrote:
 On Saturday, 14 August 2021 at 02:19:02 UTC, Ki Rill wrote:
 It's a simple benchmark examining:
 * execution time (sec)
 * memory consumption (kb)
 * binary size (kb)
 * conciseness of a programming language (lines of code)

 [Link](https://github.com/rillki/humble-benchmarks/tree/main/fishers-exact-test)

 Interesting. I know people say benchmarks aren't important, but 
 I disagree. I think it's healthy to compare from time to time 👍

I agree

Aug 23 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Aug 23, 2021 at 05:37:01PM +0000, russhy via Digitalmars-d wrote:
 On Monday, 23 August 2021 at 07:52:01 UTC, Imperatorn wrote:

[...]
 Interesting. I know people say benchmarks aren't important, but I
 disagree. I think it's healthy to compare from time to time 👍

 
 I agree

I wouldn't say benchmarks aren't *important*, I think the issue is how
you interpret the results.  Sometimes people read too much into it,
i.e., over-generalize it far beyond the scope of the actual test,
forgetting that a benchmark is simply just that: a benchmark. It's a
rough estimation of performance in a contrived use case that may or may
not reflect real-world usage.

Many of the simpler benchmarks consist of repeating a task 100000...
times, which can be useful to measure, but hardly represents real-world
usage.  You can optimize the heck out of memcpy (substitute your
favorite function to measure here) until it beats everybody else in the
benchmark, but that does not necessarily mean it will actually make your
real-life program run faster. It may, it may not.  Because after running
the 100th time, all the code/data you're using has gone into cache, so
it will run much faster than the first few iterations.  But in a real
program, you usually don't need to call memcpy (etc) 100000 times in a
row. Usually you need to do something else in between, and that
something else may cause the memcpy-related data to be evicted from
cache, change the state of the branch predictor, etc..  So your
ultra-optimized code may not behave the same way in a real-life program
vs. the benchmark, and may turn out to be actually slower than expected.

Note that this does not mean optimizing for a benchmark is completely
worthless; usually the first few iterations represent actual
bottlenecks in the code, and improving those will improve performance in
a real-life scenario.  But only up to a certain point.  Try to optimize
beyond that, and you risk optimizing for your specific
benchmark instead of the general use-case, and therefore may end up
pessimizing the code instead of optimizing it.

Real-life example: GNU wc, which counts the number of lines in a text
file.  Some time ago I did a comparison with several different D
implementations of the same program that I wrote, and discovered that
because GNU wc uses glibc's memchr, which was ultra-optimized for
scanning large buffers, if your text file contains long lines then wc
will run faster; but if the text file contains many short lines, a naïve
File.byLine / foreach (ch; line) D implementation will actually beat wc.
The reason is that glibc's memchr implementation is optimized for large
buffers, and has a rather expensive setup overhead for the main
fast-scanning loop.  For long lines, the fast-scan more than outweighs
this overhead, but for short lines, the overhead dominates the running
time, so a naïve char-by-char implementation works faster.

I don't know the history behind glibc's memchr implementation, but I'll
bet that at some point somebody came up with a benchmark that tests
scanning of large buffers and said, look, this complicated bit-hack
fast-scanning loop improves the benchmark by X%!  So the code change was
accepted, without anyone realizing that this fast scanning of large
buffers comes at the cost of pessimizing the small-buffer case.

Past a certain point, optimization becomes a trade-off, and which route
you go depends on your specific application, not some general benchmarks
that do not represent real-world usage patterns.


T

-- 
Famous last words: I *think* this will work...

Aug 23 2021
