www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - "BOLT" post-link optimizer gives 15% speed boost to Clang

reply Johan Engelen <j j.nl> writes:
Hi all,
   "Post-link optimization", what? Yes, indeed, optimization 
_after_ the _linker_ has generated a _binary_.
Read about this interesting work and the discussion here: 
https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html

When applied to clang, the performance gain is 15% faster 
execution (note, that's the result of applying BOLT on a clang 
binary that was already built with PGO and LTO !)

Cheers,
   Johan
Oct 23 2018
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 23 October 2018 at 22:07:21 UTC, Johan Engelen wrote:
 Hi all,
   "Post-link optimization", what? Yes, indeed, optimization 
 _after_ the _linker_ has generated a _binary_.
 Read about this interesting work and the discussion here: 
 https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html

 When applied to clang, the performance gain is 15% faster 
 execution (note, that's the result of applying BOLT on a clang 
 binary that was already built with PGO and LTO !)

 Cheers,
   Johan
Ahh, another thing to add to the list of cool stuff to integrate into LDC. Its getting rather long.
Oct 23 2018
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2018 3:07 PM, Johan Engelen wrote:
 Hi all,
    "Post-link optimization", what? Yes, indeed, optimization _after_ the 
 _linker_ has generated a _binary_.
 Read about this interesting work and the discussion here: 
 https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
 
 When applied to clang, the performance gain is 15% faster execution (note, 
 that's the result of applying BOLT on a clang binary that was already built
with 
 PGO and LTO !)
 
 Cheers,
    Johan
 
Digital Mars C++ had this in the 1990s. https://www.digitalmars.com/ctg/trace.html How it works is -gt switch is thrown to generate a runtime profile of the graph of how functions call each other. Then, a module definition file https://www.digitalmars.com/ctg/ctgDefFiles.html is generated that directs the linker to order the layout of functions in the executable to minimize cache misses. A 15% speedup is about right for this sort of optimization. Note that Digital Mars didn't invent this, it was invented by (I forgot who) who productized it as "The Segmentor". From https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md : "BOLT (Binary Optimization and Layout Tool) is designed to improve the application performance by laying out code in a manner that helps CPU better utilize its caching and branch predicting resources. "Before we can run BOLT optimizations, we need to collect the profile for Clang, Yup, it's reinvention of the same thing.
Oct 23 2018
next sibling parent reply Johan <j j.nl> writes:
On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright 
wrote:
 On 10/23/2018 3:07 PM, Johan Engelen wrote:
 Hi all,
    "Post-link optimization", what? Yes, indeed, optimization 
 _after_ the _linker_ has generated a _binary_.
 Read about this interesting work and the discussion here: 
 https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
 
 When applied to clang, the performance gain is 15% faster 
 execution (note, that's the result of applying BOLT on a clang 
 binary that was already built with PGO and LTO !)
 
 Cheers,
    Johan
 
Digital Mars C++ had this in the 1990s. https://www.digitalmars.com/ctg/trace.html How it works is -gt switch is thrown to generate a runtime profile of the graph of how functions call each other. Then, a module definition file https://www.digitalmars.com/ctg/ctgDefFiles.html is generated that directs the linker to order the layout of functions in the executable to minimize cache misses. A 15% speedup is about right for this sort of optimization. Note that Digital Mars didn't invent this, it was invented by (I forgot who) who productized it as "The Segmentor". From https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md : "BOLT (Binary Optimization and Layout Tool) is designed to improve the application performance by laying out code in a manner that helps CPU better utilize its caching and branch predicting resources. "Before we can run BOLT optimizations, we need to collect the profile for Clang, Yup, it's reinvention of the same thing.
Sorry, but such statements show a large lack of insight in what is going on here. This isn't about who invented something first, the basic idea is of course very old. DMC++ most certainly didn't and doesn't do what is being discussed. I suggest you reread about it without your preconceptions. BOLT does not use compiler-inserted instrumentation for execution tracing (which btw is more than just function call tracing), that's what (usually) PGO and LTO use. Instead, the profile that BOLT uses is a sampling-based profile obtained e.g. using `perf` and hardware counters. The code layout optimization performed is not just putting functions in a better order, but moreso the reordering of basic blocks inside functions. PGO and LTO already do that too (and can also work with sampling based profiling but have trouble using all profile data information inside functions). Because the sampling profile data is more accurate than compiler-inserted instrumentation added before many optimization transformations, and because the optimization happens on machinecode level (meaning that more of the sampling profile data can be used well) substantial additional gains are obtained by post-link optimization on top of PGO+LTO. - Johan
Oct 24 2018
parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/24/2018 3:26 AM, Johan wrote:
 On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright wrote:
 On 10/23/2018 3:07 PM, Johan Engelen wrote:
    "Post-link optimization", what? Yes, indeed, optimization _after_ the 
 _linker_ has generated a _binary_.
Yup, it's reinvention of the same thing.
Sorry, but such statements show a large lack of insight in what is going on here. This isn't about who invented something first, the basic idea is of course very old. DMC++ most certainly didn't and doesn't do what is being discussed.
 "Post-link optimization", what? Yes, indeed, optimization _after_ the 
_linker_ has generated a _binary_. That was the summary, implying it is some new idea. It's an incremental improvement. DMC++ does indeed do post-link optimization. If it was written as "BOLT improves on profile guided optimization after the linker has generated a binary by reordering basic blocks" I wouldn't have had any issue with that. clang is an excellent implementation of a lot of ideas.
Oct 24 2018
prev sibling parent welkam <wwwelkam gmail.com> writes:
On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright 
wrote:
 Yup, it's reinvention of the same thing.
The authors never claimed that they invented anything. What they claim is that doing reordering after linking has benefits that compiler cant easily do. Bolt improves performance on top of PGO+LTO by: 1. Using more accurate profile data from intel`s last branch record(LBR) 2. Inlining function some times changes hotness of code blocks based on where it was inlined so doing profiling after inlining gives better decision making for Bolt. The negative of Bolt`s approach is that is limited on what allowed transformations it can do so people should use PGO+LTO+Bolt for maximum performance. Second negative is that it works with ELF format so windows users wont benefit from it.
Oct 24 2018