digitalmars.D.ldc - Profile-guided optimization (PGO)

reply Johan Engelen <j j.nl> writes:
Hi all,
   I have been working on getting rudimentary PGO going in LDC. 
It's pretty much ready! [1]
(does not work on Windows yet... I have to fix LLVM's compiler-rt 
code)

I've implemented something very similar to Clang: LDC uses 
profile information (generated by an instrumented executable 
built by LDC) to tag each branch in the code with branch weights. 
The actual optimizations are done by LLVM; at the moment LDC only 
adds metadata to the IR.
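
To make the idea concrete, here is a minimal Python sketch of how counter values from an instrumented run could be turned into branch weights for a single if/else. It is not LDC's or Clang's actual code: the function names and the counter layout (one counter for reaching the branch, one for taking the `then` side, with the `else` count derived by subtraction) are assumptions made purely for illustration.

```python
from typing import Tuple


def branch_weights(reached: int, then_taken: int) -> Tuple[int, int]:
    """Return (then_weight, else_weight) for one if/else.

    `reached` is how often execution arrived at the branch and
    `then_taken` is how often the `then` side ran. Deriving the
    `else` count by subtraction is an assumption of this sketch.
    """
    if not 0 <= then_taken <= reached:
        raise ValueError("inconsistent counters")
    return then_taken, reached - then_taken


def taken_probability(weights: Tuple[int, int]) -> float:
    """Estimated probability of the `then` side (0.5 if never reached)."""
    total = sum(weights)
    return weights[0] / total if total else 0.5


# Hypothetical counter values from one instrumented run.
weights = branch_weights(reached=1000, then_taken=990)
print(weights)                     # (990, 10)
print(taken_probability(weights))  # 0.99
```

In the real pipeline the weights would be attached to the branch as metadata and the optimizer decides what to do with them; the sketch only shows the bookkeeping step.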

At this point, I want your input on: command-line option naming; 
ease of use (llvm-profdata is needed...); whether you get 
substantial performance boosts; whether the code that writes the 
profile data file should be part of the runtime library or a 
separate lib; bugs; uninstrumented branches/switches; etc.
All comments are welcome (please be kind ;-).

Before I announce it in the "Announce" forum, I want to hear your 
thoughts first.

Thanks!
   Johan


[1]
http://wiki.dlang.org/LDC_LLVM_profiling_instrumentation#Profile-Guided_Optimization_.28PGO.29_status_in_LDC
Dec 08 2015
next sibling parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
Hi Johan,

On 8 Dec 2015, at 20:13, Johan Engelen via digitalmars-d-ldc wrote:
 I've implemented something very similar to Clang: LDC uses profile 
 information (generated by an instrumented executable built by LDC)
Did you also try using it with sample profiles acquired by an external profiler yet, as described in the Clang page on PGO? — David
Dec 08 2015
next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger 
wrote:
 
 Did you also try using it with sample profiles acquired by an 
 external profiler yet, as described in the Clang page on PGO?
Hi David,
  No, I have not looked at that yet.
Thanks a lot for the testcase you posted on Github. Will sink my teeth in fixing that first.
Dec 08 2015
next sibling parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc wrote:
 Thanks a lot for the testcase you posted on Github. Will sink my teeth 
 in fixing that first.
You're welcome – I hope it's enough information to reproduce it, but I don't have a debug build of LLVM on this machine right now. — David
Dec 08 2015
parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 8 December 2015 at 22:41:22 UTC, David Nadlinger 
wrote:
 On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc 
 wrote:
 Thanks a lot for the testcase you posted on Github. Will sink 
 my teeth in fixing that first.
You're welcome – I hope it's enough information to reproduce it, but I don't have a debug build of LLVM on this machine right now.
After two more bug fixes: the regexp microbench now works.
Results with the regexp bench (bench.d):

    time ldc2 bench.d -O3 -of=bench_normal --> 52s
    time ./bench_normal --> 2.55s 98%cpu
    time ldc2 bench.d -fprofile-instr-generate -of=bench_instr --> 11s
    time ./bench_instr --> 6.72s 99%cpu
    llvm-profdata merge default.profraw -o bench.profdata
    time ldc2 bench.d -O3 -fprofile-instr-use=bench.profdata -of=bench_pgo --> 48.35s
    time ./bench_pgo --> 2.48s 98%cpu

(timing numbers for bench_normal and bench_pgo are +- 0.01)

So PGO brings it from 2.55 to 2.48 sec, ~3% improvement. Disappointing, but well... it works!
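
For reference, the arithmetic behind the "~3%" figure can be checked with a few lines of Python. This is only a worked example on the timings quoted above; the helper name `improvement_percent` is made up here, and treating the stated +- 0.01 as a hard bound on each measurement is an assumption.

```python
# Timings quoted in the post above, in seconds.
normal_run = 2.55  # built with -O3
pgo_run = 2.48     # built with -O3 -fprofile-instr-use=bench.profdata
instr_run = 6.72   # instrumented build (no -O3 on its command line)
noise = 0.01       # stated +- for the normal and PGO runs


def improvement_percent(before: float, after: float) -> float:
    """Relative run-time reduction, in percent of the original time."""
    return (before - after) / before * 100.0


print(round(improvement_percent(normal_run, pgo_run), 2))  # 2.75

# Range of the improvement if both timings are off by the stated noise.
low = improvement_percent(normal_run - noise, pgo_run + noise)
high = improvement_percent(normal_run + noise, pgo_run - noise)
print(round(low, 2), round(high, 2))  # 1.97 3.52

# Ratio of the instrumented run to the -O3 run. The two builds use
# different optimization levels, so this is not pure counter overhead.
print(round(instr_run / normal_run, 2))  # 2.64
```

So the measured gain is about 2.7%, and with the stated noise it lies somewhere between roughly 2% and 3.5%.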
Dec 10 2015
next sibling parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 11 Dec 2015, at 1:26, Johan Engelen via digitalmars-d-ldc wrote:
 After two more bug fixes: the regexp microbench now works.
 […]
 Disappointing, but well... it works!
Don't forget that this was just a random program I pulled from the Rosettacode compilation, though – I didn't have a benchmark ready where I know that branch prediction or inlining improvements would make a difference. — David
Dec 10 2015
parent reply Johan Engelen <j j.nl> writes:
What do you think about "llvm-profdata"? Should we ship that with 
LDC?
Dec 10 2015
parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc wrote:
 What do you think about "llvm-profdata"? Should we ship that with LDC?
Yes, we should probably ship it with the binary packages. For distro packages, we are of course dependent on the LLVM packages to include the tools, but at least the Homebrew package actually does.

What is left to do before we can merge a first version into the main repository? A partial list:

- Deal with the remaining FIXME comments (at least open separate GitHub issues for them), as well as with commented-out fragments from the Clang implementation.
- Find some way to avoid ICE-type regressions on real-world D code, for example by building the druntime/Phobos unit tests with instrumentation on.
- Decide on a name for the command line switches. The GCC-style "-f" prefix isn't currently used for most of the options, but that's not necessarily much of an argument.

On a rather unrelated note, did you try whether the profile data also gives sensible results with llvm-cov? If yes, that might be something nice to mention in that upcoming announcement, even though we also have DMD-style -cov support, of course.

— David
Dec 13 2015
next sibling parent reply Johan Engelen <j j.nl> writes:
On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger 
wrote:
 What is left to do before we can merge a first version into the 
 main repository? A partial list:

  - Deal with the remaining FIXME comments (at least open 
 separate GitHub issues for them), as well as with commented-out 
 fragments from the Clang implementation.
Yep :-) That's what the FIXME comments are there for: so I don't forget :-) The plan is to deal with all FIXME's, and remove the unused commented-out Clang fragments.
  - Find some way to avoid ICE-type regressions on real-world D 
 code, for example by building the druntime/Phobos unit tests 
 with instrumentation on.
I just fixed two more ICEs, and now the dmd-testsuite succeeds with -fprofile-instr-generate. Running the druntime/Phobos unittests now with -fprofile-instr-generate.

Also, I added a pragma(LDC_profile_instr, true|false) to enable/disable instrumentation codegen for specific functions. My main reason for this is to help people speed up instrumented binaries, and it also helps to circumvent ICEs. See tests/ir/profile/pragma.d. (Perhaps you can think of a better name for the pragma.)
  - Decide on a name for the command line switches. The 
 GCC-style "-f" prefix isn't currently used for most of the 
 options, but that's not necessarily much of an argument.
I have absolutely no preference here. I think we should do what the world is already familiar with. Iirc, -fprofile-instr-generate is a Clang option, and Clang is moving towards / will support GCC's option naming (-fprofile-generate, -fprofile-use).

DMD has a -profile option, but I have not read up on what that will do.

I guess we will not add any option for PGO to ldmd2?
 On a rather unrelated note, did you try whether the profile 
 data also gives sensible results with llvm-cov? If yes, that 
 might be something nice to mention in that upcoming 
 announcement, even though we also have DMD-style -cov support, 
 of course.
Did not look into this at all yet. Clang's PGOGen code has some extra functions for gcov support and more. It's all commented out for now, but it looks like we can support more tools relatively easily with the current implementation. I also see hints of sampling-based PGO in the code, for example.

Another important TODO item: remove the profiling runtime from druntime, and instead add a separate runtime-profiling lib (suggestions for a name? ldc-profile.lib?).
Dec 13 2015
parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc wrote:
 I just fixed two more ICEs, and now the dmd-testsuite succeeds with 
 -fprofile-instr-generate. Running the druntime/Phobos unittests now 
 with -fprofile-instr-generate.
Nice!
 I have absolutely no preference here. I think we should do what the 
 world is already familiar with. Iirc, -fprofile-instr-generate is a 
 Clang option, and that Clang is moving towards / will support GCC's 
 option naming (-fprofile-generate, -fprofile-use).
Yeah, me neither.
 DMD has a -profile option, but I have not read up on what that will 
 do.
It makes DMD's druntime emit some profiling info as text files (trace.def/trace.log) at program exit. This is for manual analysis only, no PGO-type functionality in sight.
 I guess we will not add any option for PGO to ldmd2?
Yeah, as DMD does not have any PGO functionality. Of course it will still pass the ldc2 options through.
 Another important TODO item: remove the profiling runtime from 
 druntime, and instead add a separate runtime-profiling lib 
 (suggestions for a name?  ldc-profile.lib?).
Maybe add a "rt" suffix to make clear that this is the actual program runtime part? Ultimately does not really matter, though. — David
Dec 13 2015
parent Johan Engelen <j j.nl> writes:
On Sunday, 13 December 2015 at 14:16:53 UTC, David Nadlinger 
wrote:
 On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc 
 wrote:
 I guess we will not add any option for PGO to ldmd2?
Yeah, as DMD does not have any PGO functionality. Of course it will still pass the ldc2 options through.
Lol, I should try to read up on these simple things first... To do the tests, I had modified ldmd to recognize -fprofile-instr-generate...
Dec 13 2015
prev sibling parent Johan Engelen <j j.nl> writes:
Hi David,
   Could you have a look at the PR again?
I think it is almost ready for merging. (two FIXME's left that I 
hope to address soon)

On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger 
wrote:
 On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc 
 wrote:
 What do you think about "llvm-profdata"? Should we ship that 
 with LDC?
Yes, we should probably ship it with the binary packages.
I have added llvm-profdata to the repo, and renamed the built executable to ldc-profdata (it has to be in-sync with the LLVM version that LDC was built with, and so the renaming prevents potential name clashing with a system installed profdata version).
 What is left to do before we can merge a first version into the 
 main repository? A partial list:

  - Deal with the remaining FIXME comments (at least open 
 separate GitHub issues for them), as well as with commented-out 
 fragments from the Clang implementation.
I think I will just remove all unused Clang implementation code (it is related to coverage stuff, which I think we should implement in a different branch after PGO is merged).

Thanks!
  Johan
Jan 10 2016
prev sibling parent reply Kagamin <spam here.lot> writes:
On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen wrote:
 So PGO brings it from 2.55 to 2.48 sec, ~3% improvement.
 Disappointing, but well... it works!
PGO can also reduce physical memory consumption due to less code loaded into memory.
Dec 11 2015
parent Johan Engelen <j j.nl> writes:
On Friday, 11 December 2015 at 14:07:56 UTC, Kagamin wrote:
 On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen 
 wrote:
 So PGO brings it from 2.55 to 2.48 sec, ~3% improvement.
 Disappointing, but well... it works!
PGO can also reduce physical memory consumption due to less code loaded into memory.
Yes, indeed. I'd like to find a good example of code where this shows up in practice, so that PGO really does improve performance significantly (say, >10%).
Dec 11 2015
prev sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Tuesday, 8 December 2015 at 22:35:11 UTC, Johan Engelen wrote:
 Thanks a lot for the testcase you posted on Github. Will sink 
 my teeth in fixing that first.
Speaking of test cases: This might be an obvious and/or stupid suggestion, but did you try building the Phobos unit tests (and maybe also dmd-testsuite/runnable) with PGO? I'd suspect it would give you quite a broad coverage of basic language constructs. - David
Dec 10 2015
parent Johan Engelen <j j.nl> writes:
On Thursday, 10 December 2015 at 14:27:59 UTC, David Nadlinger 
wrote:
 Speaking of test cases: This might be an obvious and/or stupid 
 suggestion, but did you try building the Phobos unit tests (and 
 maybe also dmd-testsuite/runnable) with PGO? I'd suspect it 
 would give you quite a broad coverage of basic language 
 constructs.
Nope didn't do that yet :S :S
Looks like it is needed to iron out some remaining bugs. I underestimated the complexity of D's AST (some objects are placed in multiple locations in the AST?), which gave rise to an assertion failure in your testcase; plus I forgot to add throw statements to the AST tree walker, leading to another assertion failure. Those issues have been fixed now, and now it breaks with the same error you found. It is confusing because I did not (mean to) change any of the codegen, other than adding counter increment instructions and branch instruction metadata (both trivial additions). But I did have to add extra basic blocks for switch statements... perhaps I can search there first. Hope to have a resolution for your test case quickly.

I also have not tested at all how this works with multiple object files linked together, or other possibly more complicated things.

I thought a fun testcase would be to compile DDMD with PGO enabled, compile itself as a profiling run, rebuild with PGO and test if compiling, say, Phobos is quicker/slower. I am very curious to see what constructs will see a significant performance boost, if any at all.
Dec 10 2015
prev sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger 
wrote:
 Did you also try using it with sample profiles acquired by an 
 external profiler yet, as described in the Clang page on PGO?
Let me add that this would probably be something nice to have for the initial release, as users could fall back to using perf, etc. if the instrumentation part is still buggy or incomplete for their code. - David
Dec 10 2015
parent reply Liran Zvibel via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
Also, for use cases like ours, where the system runs for extended periods of
time, and optimizing the init time, which may be minutes, is not interesting at
all, just being able to run perf while the system is doing something
interesting to improve is a big plus.

Liran

 On Dec 10, 2015, at 16:30, David Nadlinger via digitalmars-d-ldc
<digitalmars-d-ldc puremagic.com> wrote:
 
 On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger wrote:
 Did you also try using it with sample profiles acquired by an external
profiler yet, as described in the Clang page on PGO?
Let me add that this would probably be something nice to have for the initial release, as users could fall back to using perf, etc. if the instrumentation part is still buggy or incomplete for their code. - David
Dec 10 2015
parent reply Kagamin <spam here.lot> writes:
On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel wrote:
 Also, for use cases like ours, where the system runs for 
 extended periods of time, and optimizing the init time, which 
 may be minutes, is not interesting at all, just being able to 
 run perf while the system is doing something interesting to 
 improve is a big plus.
As I understand, if the profiling runs long enough, the long-running statistics will dominate startup statistics?
Dec 23 2015
parent Johan Engelen <j j.nl> writes:
On Wednesday, 23 December 2015 at 13:10:33 UTC, Kagamin wrote:
 On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel 
 wrote:
 Also, for use cases like ours, where the system runs for 
 extended periods of time, and optimizing the init time, which 
 may be minutes, is not interesting at all, just being able to 
 run perf while the system is doing something interesting to 
 improve is a big plus.
As I understand, if the profiling runs long enough, the long-running statistics will dominate startup statistics?
The profiling is pretty simple: it is just a bunch of counters wherever the code branches. So indeed, long runs will have long-running statistics dominate startup statistics.
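
A small Python illustration of why that happens, using made-up numbers: because the counters simply accumulate over the whole run, a phase that executes a branch a thousand times more often contributes a thousand times more to the totals. The counter values and the function name `then_probability` below are hypothetical and only meant to show the arithmetic.

```python
def then_probability(phases):
    """Estimated probability that a branch is taken.

    `phases` is a list of (reached, then_taken) counter pairs, one per
    program phase. They are summed, as a single run's counters would be.
    """
    reached = sum(r for r, _ in phases)
    taken = sum(t for _, t in phases)
    return taken / reached if reached else 0.5


# Hypothetical counts for one branch.
startup = (1_000, 900)          # mostly taken during initialization
steady = (1_000_000, 100_000)   # mostly not taken during the long run

print(then_probability([startup]))                    # 0.9
print(round(then_probability([startup, steady]), 3))  # 0.101
```

With these numbers the startup behaviour (90% taken) is almost invisible in the combined profile (about 10% taken), which matches the steady-state behaviour.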
Dec 23 2015
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 (does not work on Windows yet... I have to fix LLVM's 
 compiler-rt code)
I fixed a nasty [*] bug in compiler-rt's profile writing code, and now it also works on Windows. (The IR tests fail on Windows because running a compiled executable from LIT fails for some reason on Windows.)

[*] https://stackoverflow.com/questions/5537066/strange-0x0d-being-added-to-my-binary-file
Now I know what to look for first if I see 0x0D's in my files...
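
The linked question is about binary data written through a file opened in text mode on Windows, where every 0x0A byte gets a 0x0D inserted in front of it. Assuming that is the class of bug meant here, the effect can be reproduced on any platform with a short Python sketch; it uses `newline="\r\n"` to mimic Windows text-mode translation and is not the compiler-rt code itself.

```python
import os
import tempfile

# Binary data that happens to contain a 0x0A byte.
payload = bytes([0x01, 0x0A, 0x02])

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text_mode.bin")
    binary_path = os.path.join(tmp, "binary_mode.bin")

    # Text mode with newline="\r\n" imitates Windows newline translation:
    # every "\n" written becomes "\r\n". latin-1 maps bytes 1:1 to characters.
    with open(text_path, "w", encoding="latin-1", newline="\r\n") as f:
        f.write(payload.decode("latin-1"))

    # Binary mode writes the bytes unchanged.
    with open(binary_path, "wb") as f:
        f.write(payload)

    with open(text_path, "rb") as f:
        text_bytes = f.read()
    with open(binary_path, "rb") as f:
        binary_bytes = f.read()

print(text_bytes.hex())    # 010d0a02
print(binary_bytes.hex())  # 010a02
```

The extra 0d in the first output is the stray byte; opening the file in binary mode avoids it.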
Dec 08 2015
prev sibling next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 Before I announce it in the "Announce" forum, I want to hear 
 your thoughts first.
Clearly I was too optimistic about the quality of my work so far, hehe.
Dec 10 2015
parent David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
On 10 Dec 2015, at 19:43, Johan Engelen via digitalmars-d-ldc wrote:
 Clearly I was too optimistic about the quality of my work so far, 
 hehe.
Quite the contrary – you chose to start with the hard part (instrumentation-based instead of sampling-based), and DMD's AST is notoriously, uh, fluid in meaning and under-documented. — David
Dec 10 2015
prev sibling parent Johan Engelen <j j.nl> writes:
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 Hi all,
   I have been working on getting rudimentary PGO going in LDC. 
 It's pretty much ready!
It now works with LLVM 3.7 and LLVM 3.8 (trunk), on Mac OS X, Linux (tested on Ubuntu), and Windows! I have not tested other platforms.
Dec 23 2015