digitalmars.D - Interesting performance results on Crc port from zlib


Igor <stojkovic.igor gmail.com> writes:

I ported zlib's way of calculating CRC [1] to D and then compared 
its performance to Phobos std.digest.crc and to calling the actual 
zlib implementation through etc.c.zlib (crc32_z). I got some 
interesting results:

build type   |  zlib  | phobos | zlibdport |
--------------------------------------------
dmd debug    |  240ms |  272ms |     211ms |
dmd release  |  237ms |  279ms |     202ms |
ldc debug    |  238ms |  136ms |     264ms |
ldc release  |  238ms |  132ms |     125ms |

Calling the actual zlib is pretty constant, I guess because it uses 
a precompiled static library. What is really surprising is how much 
faster the port is than the original zlib. Does anyone have any idea 
how that can be?

I tested on:
Operating System: Manjaro Linux
OS Type: 64-bit
Processors: 12 × Intel® Core™ i7-8700K CPU @ 3.70GHz
Memory: 15.6 GiB of RAM

[1] 
https://github.com/madler/zlib/blob/d71dc66fa8a153fb6e7c626847095d9697a6cf42/crc32.c
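
For readers unfamiliar with the technique being ported, here is a minimal C sketch of a table-driven CRC-32 in its basic byte-at-a-time form. It is not the code from zlib's crc32.c (which uses more elaborate multi-byte table variants) nor the D port from this thread; the names `crc32_bytewise`, `make_crc_table` and `crc_table` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Lookup table: crc_table[n] is the CRC-32 remainder of the single byte n. */
static uint32_t crc_table[256];
static int crc_table_ready = 0;

/* Build the table for the reflected CRC-32 polynomial 0xEDB88320. */
static void make_crc_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[n] = c;
    }
    crc_table_ready = 1;
}

/* Update a running CRC with len bytes from buf; pass crc = 0 to start.
 * Illustrative name, not a zlib function. */
uint32_t crc32_bytewise(uint32_t crc, const uint8_t *buf, size_t len)
{
    if (!crc_table_ready)
        make_crc_table();
    crc = ~crc;                       /* pre-invert */
    for (size_t i = 0; i < len; i++)  /* one table lookup per input byte */
        crc = crc_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;                      /* post-invert */
}
```

Calling `crc32_bytewise(0, data, n)` on the nine ASCII bytes "123456789" should give the standard CRC-32 check value 0xCBF43926.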

Jan 24 2021

welkam <wwwelkam gmail.com> writes:

On Sunday, 24 January 2021 at 15:26:08 UTC, Igor wrote:
> What is really surprising is how much faster is the port from 
> original zlib. Anyone has any idea how can that be?

How do you link those libraries? If you statically link the D 
version and the compiler can see the source, then it can do better 
optimizations.

Jan 24 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/24/21 10:26 AM, Igor wrote:
> I ported zlib way of CRC calculation [1] to D and then I compared its 
> performance to Phobos std.digest.crc and to calling actual zlib 
> implementation through etc.c.zlib: crc32_z and I got some interesting 
> results:
>
> build type   |  zlib  | phobos | zlibdport |
> --------------------------------------------
> dmd debug    |  240ms |  272ms |     211ms |
> dmd release  |  237ms |  279ms |     202ms |
> ldc debug    |  238ms |  136ms |     264ms |
> ldc release  |  238ms |  132ms |     125ms |
>
> Calling actual zlib is pretty constant, I guess because it uses 
> precompiled static library. What is really surprising is how much faster 
> is the port from original zlib. Anyone has any idea how can that be?

Probably inlining. If you are using a compiled library, the compiler 
can't inline the code; it has to be an opaque call.

-Steve

Jan 24 2021

Igor <stojkovic.igor gmail.com> writes:

On Sunday, 24 January 2021 at 15:47:38 UTC, Steven Schveighoffer 
wrote:
> Probably inlining. If you are using a compiled library, it 
> can't inline the code, it has to be an opaque call.
>
> -Steve

I tried to modify the dub build to exclude inlining. This is how it 
says it is compiling and linking:

ldc2 -c 
-of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0F7D4/dzlibtestapp.o 
-release -O3 -w -oq -od=.dub/obj -d-version=Have_dzlib 
-Isource/ source/crc32.d test/test.d -vcolumns
Linking...
ldc2 
-of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD100F7D4/dzlibtestapp 
.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0F7D4/dzlibtestapp.o 
-L--no-as-needed

I still get the same numbers in the LDC release build. I forgot to 
mention that I use this to measure the execution:

auto implDurs = benchmark!(zlibCrc, phobosCrc, dzlibCrc)(100_000);

And I am processing the same global 4K buffer in each.
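
The measurement setup described above (a fixed number of calls over one global 4K buffer) can be sketched in C roughly as follows. This is only an analogue of the D `benchmark!` call, not a reproduction of it: `bench_ms`, `checksum_fn`, `g_buffer`, `g_sink` and the stand-in workload `sum_bytes` are all hypothetical names, and `clock()` measures CPU time rather than wall-clock time.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef uint32_t (*checksum_fn)(const uint8_t *data, size_t len);

/* Stand-in workload: a plain byte sum, NOT a CRC. Swap in the real routine. */
static uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += data[i];
    return s;
}

/* One global 4K buffer shared by every run (zero-initialized here). */
static uint8_t g_buffer[4096];

/* Volatile sink so the optimizer cannot discard the calls being timed. */
static volatile uint32_t g_sink;

/* Call fn `iterations` times on the global buffer; return elapsed CPU ms. */
static double bench_ms(checksum_fn fn, unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++)
        g_sink = fn(g_buffer, sizeof g_buffer);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as `bench_ms(sum_bytes, 100000)` mirrors the 100_000 iterations used in the thread. Because the routine is passed as a function pointer, the call is typically not inlined into the timing loop, which is worth keeping in mind given how much of this thread is about inlining.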

Jan 24 2021

welkam <wwwelkam gmail.com> writes:

On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
> ldc2 -c -O3 source/crc32.d test/test.d

Edited the command for clarity. As I suspected, the compiler can 
see the whole program's source code and because of that it can 
perform better optimizations. If you compiled the C version 
with link-time optimization you might get similar results.

Jan 24 2021

Igor <stojkovic.igor gmail.com> writes:

On Sunday, 24 January 2021 at 17:55:20 UTC, welkam wrote:
> On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
>> ldc2 -c -O3 source/crc32.d test/test.d
>
> Edited the command for clarity. As I suspected the compiler can 
> see the whole programs source code and because of that it can 
> perform better optimizations. If you tried to compile C version 
> with link time optimizations you might get similar results.

It seems you are right. I edited the example.c that comes with 
zlib to just do the CRC calculation 100_000 times on a 4K buffer and 
compiled it with its makefile. That first built a libz.a static 
lib, then compiled example.c and linked the two together. The 
resulting executable still calculates the CRC in 262ms. But 
then I included crc32.c directly into example.c and compiled it 
as a single compilation unit, and now it does the job in 96ms. The 
same happened when I linked it with the static lib using gcc -flto, 
so that LTO is used.

I had no idea this could have such a big impact on performance. 
That's it, I am just using source libraries from now on :D
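
As a rough illustration of the single-compilation-unit layout described above, the C sketch below keeps a small helper in the same file as the hot loop, so the optimizer can see its body and may inline it. It is not taken from zlib or example.c; `crc_step` and `crc32_loop` are hypothetical names, and the helper uses a simple bit-at-a-time CRC-32 update to stay short.

```c
#include <stddef.h>
#include <stdint.h>

/* Helper visible in this translation unit: the compiler can inline it.
 * If it lived in a separately compiled library built without LTO, each
 * use would have to remain an opaque call. */
static inline uint32_t crc_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int k = 0; k < 8; k++)  /* process one bit per iteration */
        crc = (crc & 1u) ? (0xEDB88320u ^ (crc >> 1)) : (crc >> 1);
    return crc;
}

/* Hot loop: calls the helper once per input byte. */
uint32_t crc32_loop(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;       /* standard CRC-32 initial value */
    for (size_t i = 0; i < len; i++)
        crc = crc_step(crc, buf[i]);
    return crc ^ 0xFFFFFFFFu;         /* standard CRC-32 final XOR */
}
```

If the helper were instead placed in its own source file, passing `-flto` to gcc at both the compile and link steps is the usual way to give the optimizer the same cross-file visibility. How much that gains depends on the code and has to be measured, as was done above.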

Jan 25 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Jan 25, 2021 at 07:10:51PM +0000, Igor via Digitalmars-d wrote:
[...]
> I edited the example.c that comes with zlib to just do CRC calculation
> 100_000 times on 4K buffer and compiled it with its make file. But it
> first made a libz.a static lib and then it compiled example.c and
> linked them together. This resulted in executable that still
> calculates CRC in 262ms. But then I included crc32.c directly into
> example.c and compiled it as single compilation unit and now it does
> the job in 96ms. Same happened when I linked it with static lib using
> gcc -flto so LTO is used.
>
> I had no idea this can have such big impact on performance. That's it,
> I am just using source libraries from now on :D

Inlining of functions in hot inner loops makes a big difference to
performance. This is especially important for CPU-intensive tasks like
computing CRCs.


--T

Jan 25 2021

James Blachly <james.blachly gmail.com> writes:

On 1/24/21 10:26 AM, Igor wrote:
> build type   |  zlib  | phobos | zlibdport |
> --------------------------------------------
> dmd debug    |  240ms |  272ms |     211ms |
> dmd release  |  237ms |  279ms |     202ms |
> ldc debug    |  238ms |  136ms |     264ms |
> ldc release  |  238ms |  132ms |     125ms |
>
> Calling actual zlib is pretty constant, I guess because it uses 
> precompiled static library. What is really surprising is how much faster 
> is the port from original zlib. Anyone has any idea how can that be?

If you did not compile zlib yourself, it is possible that your ported 
version uses advanced native instructions that are far more efficient, 
e.g. handling data in parallel, whereas the zlib distributed with your 
operating system may be compiled for the least common denominator CPU.

Jan 24 2021

welkam <wwwelkam gmail.com> writes:

On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
> it is possible that your ported version uses advanced native 
> instructions

No, it is not possible. The compiler won't generate them unless you 
explicitly ask for it. For LDC it's -mcpu=native
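
For C code built with GCC or Clang, one way to check which instruction-set extensions the compiler was allowed to assume is to look at its predefined macros. This is a small sketch, not something from the thread: the function names are hypothetical, while `__AVX2__` and `__SSE4_2__` are the macros those compilers define when the corresponding extension is enabled.

```c
/* Returns 1 if this file was compiled with AVX2 code generation enabled
 * (for example via -mavx2, or -march=native on a CPU that has AVX2). */
int built_with_avx2(void)
{
#if defined(__AVX2__)
    return 1;
#else
    return 0;
#endif
}

/* Same check for SSE4.2. */
int built_with_sse42(void)
{
#if defined(__SSE4_2__)
    return 1;
#else
    return 0;
#endif
}
```

On a default x86-64 build without any target flags, both functions would normally return 0, which matches the point above that the compiler only targets a baseline CPU unless asked otherwise.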

Jan 24 2021

James Blachly <james.blachly gmail.com> writes:

On 1/24/21 1:02 PM, welkam wrote:
> On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
>> it is possible that your ported version uses advanced native instructions
>
> No, it is not possible. The compiler won't generate them unless you 
> explicitly ask for it. For LDC it's -mcpu=native

Yes, I know; I was asking Igor how he compiled.

Jan 24 2021
