
digitalmars.D - Interesting performance results on Crc port from zlib

reply Igor <stojkovic.igor gmail.com> writes:
I ported zlib's CRC calculation [1] to D and compared its 
performance with Phobos' std.digest.crc and with calling the 
actual zlib implementation through etc.c.zlib (crc32_z). I got 
some interesting results:

build type   |  zlib  | phobos | zlibdport |
--------------------------------------------
dmd debug    |  240ms |  272ms |     211ms |
dmd release  |  237ms |  279ms |     202ms |
ldc debug    |  238ms |  136ms |     264ms |
ldc release  |  238ms |  132ms |     125ms |

Calling the actual zlib is pretty constant, I guess because it 
uses a precompiled static library. What is really surprising is 
how much faster the port of the original zlib is. Does anyone 
have any idea how that can be?

I tested on:
Operating System: Manjaro Linux
OS Type: 64-bit
Processors: 12 × Intel® Core™ i7-8700K CPU   3.70GHz
Memory: 15.6 GiB of RAM

[1] 
https://github.com/madler/zlib/blob/d71dc66fa8a153fb6e7c626847095d9697a6cf42/crc32.c
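
Before timing the implementations it is worth checking that they agree on the result. The sketch below is not the test code from this post (that code is not shown); it is a minimal cross-check under a few assumptions: it calls crc32 from etc.c.zlib rather than crc32_z, and the helper names phobosValue and zlibValue are made up for illustration. Phobos returns the digest as four bytes in little-endian order, so it is converted to a uint before comparing.

```d
import etc.c.zlib : crc32;
import std.bitmanip : littleEndianToNative;
import std.digest.crc : crc32Of;

// CRC-32 via Phobos (hypothetical helper name).
// crc32Of returns ubyte[4] holding the CRC in little-endian byte order.
uint phobosValue(ubyte[] data)
{
    ubyte[4] digest = crc32Of(data);
    return littleEndianToNative!uint(digest);
}

// CRC-32 via the zlib C binding that ships with Phobos (hypothetical helper name).
// The cast keeps this independent of whether the binding returns uint or c_ulong.
uint zlibValue(ubyte[] data)
{
    return cast(uint) crc32(0, data.ptr, cast(uint) data.length);
}
```

Both helpers should return the standard CRC-32 check value 0xCBF43926 for the ASCII input "123456789".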
Jan 24
next sibling parent welkam <wwwelkam gmail.com> writes:
On Sunday, 24 January 2021 at 15:26:08 UTC, Igor wrote:
 What is really surprising is how much faster is the port from 
 original zlib. Anyone has any idea how can that be?
How do you link those libraries? If you link the D version statically and the compiler can see the source, then it can optimize better.
Jan 24
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/24/21 10:26 AM, Igor wrote:
 I ported zlib way of CRC calculation [1] to D and then I compared its 
 performance to Phobos std.digest.crc and to calling actual zlib 
 implementation through etc.c.zlib: crc32_z and I got some interesting 
 results:
 
 build type   |  zlib  | phobos | zlibdport |
 --------------------------------------------
 dmd debug    |  240ms |  272ms |     211ms |
 dmd release  |  237ms |  279ms |     202ms |
 ldc debug    |  238ms |  136ms |     264ms |
 ldc release  |  238ms |  132ms |     125ms |
 
 Calling actual zlib is pretty constant, I guess because it uses 
 precompiled static library. What is really surprising is how much faster 
 is the port from original zlib. Anyone has any idea how can that be?
Probably inlining. If you are using a precompiled library, the compiler can't inline its code; every call has to be an opaque call.

-Steve
Jan 24
parent reply Igor <stojkovic.igor gmail.com> writes:
On Sunday, 24 January 2021 at 15:47:38 UTC, Steven Schveighoffer 
wrote:
 Probably inlining. If you are using a compiled library, it 
 can't inline the code, it has to be an opaque call.

 -Steve
I tried to modify the dub build to exclude inlining. This is how it says it is compiling and linking:

ldc2 -c -of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0 F7D4/dzlibtestapp.o -release -O3 -w -oq -od=.dub/obj -d-version=Have_dzlib -Isource/ source/crc32.d test/test.d -vcolumns

Linking...
ldc2 -of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1 00F7D4/dzlibtestapp .dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0 F7D4/dzlibtestapp.o -L--no-as-needed

I still get the same numbers in the LDC release build.

I forgot to mention that I use this to measure the execution:

auto implDurs = benchmark!(zlibCrc, phobosCrc, dzlibCrc)(100_000);

And I am processing the same global 4K buffer in each.
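
For readers who want to reproduce the measurement, here is a rough sketch of what such a harness could look like. It is an assumption about the setup, not the actual test.d: the names buffer, sink and runBenchmark are hypothetical, the buffer contents are arbitrary, the D port (dzlibCrc) is left out because its source is not shown here, and crc32 from etc.c.zlib stands in for crc32_z.

```d
import etc.c.zlib : crc32;
import std.datetime.stopwatch : benchmark;
import std.digest.crc : crc32Of;
import std.stdio : writefln;

// Hypothetical global 4K buffer shared by all benchmarked functions.
__gshared ubyte[4096] buffer;
// Results are folded into this global so the optimizer cannot drop the work.
__gshared uint sink;

// Calls the precompiled zlib through the C binding.
void zlibCrc()
{
    sink ^= cast(uint) crc32(0, buffer.ptr, cast(uint) buffer.length);
}

// Calls the Phobos implementation, which the compiler sees as D source.
void phobosCrc()
{
    const ubyte[4] digest = crc32Of(buffer[]);
    sink ^= digest[0];
}

// Fills the buffer with a simple byte pattern, then times both functions.
void runBenchmark(uint iterations)
{
    foreach (i, ref b; buffer)
        b = cast(ubyte) i;

    // benchmark returns one Duration per function, in argument order.
    auto durs = benchmark!(zlibCrc, phobosCrc)(iterations);
    writefln("zlib:   %s ms", durs[0].total!"msecs");
    writefln("phobos: %s ms", durs[1].total!"msecs");
}
```

Calling runBenchmark(100_000) from main would match the iteration count above. The sink variable is there because a benchmark whose result is never used can be optimized away entirely in a release build.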
Jan 24
parent reply welkam <wwwelkam gmail.com> writes:
On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
 ldc2 -c -O3 source/crc32.d test/test.d
Edited the command for clarity. As I suspected, the compiler can see the whole program's source code, and because of that it can perform better optimizations. If you compiled the C version with link-time optimization, you might get similar results.
Jan 24
parent reply Igor <stojkovic.igor gmail.com> writes:
On Sunday, 24 January 2021 at 17:55:20 UTC, welkam wrote:
 On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
 ldc2 -c -O3 source/crc32.d test/test.d
Edited the command for clarity. As I suspected the compiler can see the whole programs source code and because of that it can perform better optimizations. If you tried to compile C version with link time optimizations you might get similar results.
It seems you are right. I edited the example.c that comes with zlib to just do the CRC calculation 100_000 times on a 4K buffer and compiled it with its makefile. It first built a libz.a static library, then compiled example.c and linked the two together. The resulting executable still calculates the CRC in 262ms. But when I included crc32.c directly into example.c and compiled it as a single compilation unit, it did the job in 96ms. The same happened when I linked it with the static lib using gcc -flto, so that LTO is used.

I had no idea this could have such a big impact on performance. That's it, I am just using source libraries from now on :D
Jan 25
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 25, 2021 at 07:10:51PM +0000, Igor via Digitalmars-d wrote:
[...]
 I edited the example.c that comes with zlib to just do CRC calculation
 100_000 times on 4K buffer and compiled it with its make file. But it
 first made a libz.a static lib and then it compiled example.c and
 linked them together. This resulted in executable that still
 calculates CRC in 262ms. But then I included crc32.c directly into
 example.c and compiled it as single compilation unit and now it does
 the job in 96ms. Same happened when I linked it with static lib using
 gcc -flto so LTO is used.
 
 I had no idea this can have such big impact on performance. That's it,
 I am just using source libraries from now on :D
Inlining of functions in hot inner loops makes a big difference to performance. This is especially important for CPU-intensive tasks like computing CRCs.

--T
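
To make the "hot inner loop" concrete, here is a small table-driven CRC-32 in D. This is the textbook byte-at-a-time variant, not the zlib code linked in [1] and not the Phobos implementation; the names makeTable, updateByte and crc32Simple are made up for this sketch. The per-byte step is a tiny function called once per input byte, which is the kind of call that only stays cheap when the optimizer can see its body and inline it into the loop.

```d
// Builds the 256-entry lookup table for the reflected polynomial 0xEDB88320.
uint[256] makeTable() pure nothrow @nogc @safe
{
    uint[256] table;
    foreach (uint n; 0 .. 256)
    {
        uint c = n;
        foreach (_; 0 .. 8)
            c = (c & 1) ? (0xEDB88320 ^ (c >> 1)) : (c >> 1);
        table[n] = c;
    }
    return table;
}

// Module-level initializer, so the table is computed at compile time.
immutable uint[256] crcTable = makeTable();

// One table lookup per input byte; a candidate for inlining into the loop below.
uint updateByte(uint crc, ubyte b) pure nothrow @nogc @safe
{
    return crcTable[(crc ^ b) & 0xFF] ^ (crc >> 8);
}

// Byte-at-a-time CRC-32 over a slice, with the usual initial and final inversion.
uint crc32Simple(const(ubyte)[] data) pure nothrow @nogc @safe
{
    uint crc = 0xFFFF_FFFF;
    foreach (b; data)
        crc = updateByte(crc, b);
    return crc ^ 0xFFFF_FFFF;
}
```

If updateByte lived in a separately compiled library and the loop in the caller, every byte would cost a real function call unless LTO were enabled, which is a scaled-down picture of what the thread above measured.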
Jan 25
prev sibling parent reply James Blachly <james.blachly gmail.com> writes:
On 1/24/21 10:26 AM, Igor wrote:
 build type   |  zlib  | phobos | zlibdport |
 --------------------------------------------
 dmd debug    |  240ms |  272ms |     211ms |
 dmd release  |  237ms |  279ms |     202ms |
 ldc debug    |  238ms |  136ms |     264ms |
 ldc release  |  238ms |  132ms |     125ms |
 
 Calling actual zlib is pretty constant, I guess because it uses 
 precompiled static library. What is really surprising is how much faster 
 is the port from original zlib. Anyone has any idea how can that be?
 
If you did not compile zlib yourself, it is possible that your ported version uses advanced native instructions that are far more efficient, e.g. handling data in parallel, whereas the zlib distributed with your operating system may be compiled for a least-common-denominator CPU.
Jan 24
parent reply welkam <wwwelkam gmail.com> writes:
On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
 it is possible that your ported version uses advanced native 
 instructions
No, it is not possible. The compiler won't generate them unless you explicitly ask for it. For LDC it's -march=native
Jan 24
parent James Blachly <james.blachly gmail.com> writes:
On 1/24/21 1:02 PM, welkam wrote:
 On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
 it is possible that your ported version uses advanced native instructions
No it is not possible. The compiler wont generate them unless you explicitly ask for it. For LDC its -march=native
Yes I know, I was asking Igor how he compiled.
Jan 24