digitalmars.D - Interesting performance results on Crc port from zlib
- Igor (21/21) Jan 24 2021 I ported zlib way of CRC calculation [1] to D and then I compared
- welkam (4/6) Jan 24 2021 How do you link those libraries? If you do static linking of D
- Steven Schveighoffer (4/19) Jan 24 2021 Probably inlining. If you are using a compiled library, it can't inline
- Igor (13/16) Jan 24 2021 I tried to modify dub build to exclude inlining. This how it says
- welkam (5/6) Jan 24 2021 Edited the command for clarity. As I suspected the compiler can
- Igor (12/18) Jan 25 2021 It seems you are right. I edited the example.c that comes with
- H. S. Teoh (6/17) Jan 25 2021 Inlining of functions in hot inner loops makes a big difference to
- James Blachly (5/16) Jan 24 2021 If you did not compile zlib yourself, it is possible that your ported
- welkam (3/5) Jan 24 2021 No it is not possible. The compiler wont generate them unless you
- James Blachly (2/7) Jan 24 2021 Yes I know, I was asking Igor how he compiled.
I ported zlib's way of CRC calculation [1] to D and then compared its performance to Phobos std.digest.crc and to calling the actual zlib implementation through etc.c.zlib (crc32_z), and I got some interesting results:

build type  | zlib  | phobos | zlibdport |
--------------------------------------------
dmd debug   | 240ms | 272ms  | 211ms     |
dmd release | 237ms | 279ms  | 202ms     |
ldc debug   | 238ms | 136ms  | 264ms     |
ldc release | 238ms | 132ms  | 125ms     |

Calling the actual zlib is pretty constant, I guess because it uses a precompiled static library. What is really surprising is how much faster the port of the original zlib is. Does anyone have any idea how that can be?

I tested on:
Operating System: Manjaro Linux
OS Type: 64-bit
Processors: 12 × Intel® Core™ i7-8700K CPU @ 3.70GHz
Memory: 15.6 GiB of RAM

[1] https://github.com/madler/zlib/blob/d71dc66fa8a153fb6e7c626847095d9697a6cf42/crc32.c
Jan 24 2021
On Sunday, 24 January 2021 at 15:26:08 UTC, Igor wrote:
> What is really surprising is how much faster is the port from
> original zlib. Anyone has any idea how can that be?

How do you link those libraries? If you do static linking of the D version and the compiler can see the source, then it can do better optimizations.
Jan 24 2021
On 1/24/21 10:26 AM, Igor wrote:
> I ported zlib way of CRC calculation [1] to D and then I compared
> its performance to Phobos std.digest.crc and to calling actual
> zlib implementation through etc.c.zlib: crc32_z and I got some
> interesting results:
>
> build type  | zlib  | phobos | zlibdport |
> --------------------------------------------
> dmd debug   | 240ms | 272ms  | 211ms     |
> dmd release | 237ms | 279ms  | 202ms     |
> ldc debug   | 238ms | 136ms  | 264ms     |
> ldc release | 238ms | 132ms  | 125ms     |
>
> Calling actual zlib is pretty constant, I guess because it uses
> precompiled static library. What is really surprising is how much
> faster is the port from original zlib. Anyone has any idea how
> can that be?

Probably inlining. If you are using a compiled library, it can't inline the code; it has to be an opaque call.

-Steve
Jan 24 2021
On Sunday, 24 January 2021 at 15:47:38 UTC, Steven Schveighoffer wrote:
> Probably inlining. If you are using a compiled library, it can't
> inline the code, it has to be an opaque call.
>
> -Steve

I tried to modify the dub build to exclude inlining. This is how it says it is compiling and linking:

ldc2 -c -of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0F7D4/dzlibtestapp.o -release -O3 -w -oq -od=.dub/obj -d-version=Have_dzlib -Isource/ source/crc32.d test/test.d -vcolumns
Linking...
ldc2 -of.dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD100F7D4/dzlibtestapp .dub/build/testapp-release-linux.posix-x86_64-ldc_2094-828693E6DA608B0503B8ABCD1F0F7D4/dzlibtestapp.o -L--no-as-needed

I still get the same numbers in the LDC release build. I forgot to mention I use this to measure the execution:

auto implDurs = benchmark!(zlibCrc, phobosCrc, dzlibCrc)(100_000);

And I am processing the same global 4K buffer in each.
Jan 24 2021
On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
> ldc2 -c -O3 source/crc32.d test/test.d

Edited the command for clarity. As I suspected, the compiler can see the whole program's source code, and because of that it can perform better optimizations. If you tried to compile the C version with link-time optimization, you might get similar results.
Jan 24 2021
On Sunday, 24 January 2021 at 17:55:20 UTC, welkam wrote:
> On Sunday, 24 January 2021 at 16:23:24 UTC, Igor wrote:
>> ldc2 -c -O3 source/crc32.d test/test.d
>
> Edited the command for clarity. As I suspected the compiler can
> see the whole programs source code and because of that it can
> perform better optimizations. If you tried to compile C version
> with link time optimizations you might get similar results.

It seems you are right. I edited the example.c that comes with zlib to just do the CRC calculation 100_000 times on a 4K buffer and compiled it with its makefile. But it first made a libz.a static lib and then compiled example.c and linked them together. This resulted in an executable that still calculates the CRC in 262ms. But then I included crc32.c directly into example.c and compiled it as a single compilation unit, and now it does the job in 96ms. The same happened when I linked it with the static lib using gcc -flto, so LTO is used. I had no idea this could have such a big impact on performance. That's it, I am just using source libraries from now on :D
Jan 25 2021
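[Editor's note: the three builds Igor describes might look roughly like this with a GNU toolchain. File names and flags are assumptions based on his description, not his actual makefile invocations.]

```shell
# 1. Separate compilation: crc32() is an opaque call at the call site.
gcc -O3 -c crc32.c -o crc32.o
gcc -O3 example.c crc32.o -o bench_separate      # ~262ms in Igor's test

# 2. Single compilation unit: example.c does `#include "crc32.c"`,
#    so the optimizer sees the CRC loop and can inline it.
gcc -O3 example_scu.c -o bench_scu               # ~96ms in Igor's test

# 3. Link-time optimization recovers cross-TU inlining without
#    merging the source files.
gcc -O3 -flto -c crc32.c -o crc32.o
gcc -O3 -flto example.c crc32.o -o bench_lto     # comparable to SCU
```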
On Mon, Jan 25, 2021 at 07:10:51PM +0000, Igor via Digitalmars-d wrote:
[...]
> But then I included crc32.c directly into example.c and compiled
> it as single compilation unit and now it does the job in 96ms.
> Same happened when I linked it with static lib using gcc -flto so
> LTO is used. I had no idea this can have such big impact on
> performance. That's it, I am just using source libraries from now
> on :D

Inlining of functions in hot inner loops makes a big difference to performance. This is especially important for CPU-intensive tasks like computing CRCs.

--T
Jan 25 2021
On 1/24/21 10:26 AM, Igor wrote:
> build type  | zlib  | phobos | zlibdport |
> --------------------------------------------
> dmd debug   | 240ms | 272ms  | 211ms     |
> dmd release | 237ms | 279ms  | 202ms     |
> ldc debug   | 238ms | 136ms  | 264ms     |
> ldc release | 238ms | 132ms  | 125ms     |
>
> Calling actual zlib is pretty constant, I guess because it uses
> precompiled static library. What is really surprising is how much
> faster is the port from original zlib. Anyone has any idea how
> can that be?

If you did not compile zlib yourself, it is possible that your ported version uses advanced native instructions that are far more efficient, e.g. handling data in parallel, whereas the zlib distributed with your operating system may be compiled for a least-common-denominator CPU.
Jan 24 2021
On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
> it is possible that your ported version uses advanced native
> instructions

No, it is not possible. The compiler won't generate them unless you explicitly ask for it. For LDC it's -mcpu=native.
Jan 24 2021
On 1/24/21 1:02 PM, welkam wrote:
> On Sunday, 24 January 2021 at 16:49:46 UTC, James Blachly wrote:
>> it is possible that your ported version uses advanced native
>> instructions
>
> No it is not possible. The compiler wont generate them unless you
> explicitly ask for it. For LDC its -march=native

Yes, I know; I was asking Igor how he compiled.
Jan 24 2021