digitalmars.D.ldc - Profile-guided optimization (PGO)


Johan Engelen <j j.nl> writes:

Hi all,
   I have been working on getting rudimentary PGO going in LDC.
It's pretty much ready! [1]
(It does not work on Windows yet... I have to fix LLVM's compiler-rt
code.)

I've implemented something very similar to Clang: LDC uses 
profile information (generated by an instrumented executable 
built by LDC) to tag each branch in the code with branch weights. 
The actual optimizations are done by LLVM; at the moment LDC only 
adds metadata to the IR.
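
As a rough illustration of that scheme, here is a toy sketch in Python. It is not LDC or LLVM code, and the names `counters`, `classify`, and `branch_weights` are made up for this example. The instrumented run bumps a counter on each arm of a branch, and the counts are then turned into relative weights of the kind a compiler could attach to that branch.

```python
from collections import Counter

# Hypothetical stand-in for the counters an instrumented build maintains;
# real instrumentation is emitted into the compiled code, not written by hand.
counters = Counter()

def classify(n):
    """Toy function with one branch; each arm bumps its own counter."""
    if n % 10 == 0:
        counters["then"] += 1   # rarely taken arm
        return "round"
    counters["else"] += 1       # frequently taken arm
    return "other"

def branch_weights(counts):
    """Turn raw execution counts into relative weights for the two arms."""
    total = counts["then"] + counts["else"]
    return {arm: counts[arm] / total for arm in ("then", "else")}

# "Profiling run": exercise the instrumented function on representative input.
for i in range(1000):
    classify(i)

weights = branch_weights(counters)
print(counters["then"], counters["else"])  # 100 900
print(weights)                             # {'then': 0.1, 'else': 0.9}
```

With weights like these, an optimizer knows the "else" arm is the hot path and can lay out or inline code accordingly.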

At this point, I want your input on: command-line option naming;
ease of use (llvm-profdata is needed...); whether you get
substantial performance boosts; whether the code that writes the
profile data file should be included in the runtime library or
live in a separate lib; bugs; uninstrumented branches/switches;
etc.
All comments are welcome (please be kind ;-).

Before I announce it in the "Announce" forum, I want to hear your 
thoughts first.

Thanks!
   Johan


[1]
http://wiki.dlang.org/LDC_LLVM_profiling_instrumentation#Profile-Guided_Optimization_.28PGO.29_status_in_LDC

Dec 08 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

Hi Johan,

On 8 Dec 2015, at 20:13, Johan Engelen via digitalmars-d-ldc wrote:
 I've implemented something very similar to Clang: LDC uses profile 
 information (generated by an instrumented executable built by LDC)

Did you also try using it with sample profiles acquired by an external 
profiler yet, as described in the Clang page on PGO?

  — David

Dec 08 2015

Johan Engelen <j j.nl> writes:

On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger 
wrote:
 
 Did you also try using it with sample profiles acquired by an 
 external profiler yet, as described in the Clang page on PGO?

Hi David,
   No, I have not looked at that yet.
Thanks a lot for the testcase you posted on Github. Will sink my 
teeth in fixing that first.

Dec 08 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc wrote:
 Thanks a lot for the testcase you posted on Github. Will sink my teeth 
 in fixing that first.

You're welcome – I hope it's enough information to reproduce it, but I 
don't have a debug build of LLVM on this machine right now.

  — David

Dec 08 2015

Johan Engelen <j j.nl> writes:

On Tuesday, 8 December 2015 at 22:41:22 UTC, David Nadlinger 
wrote:
 On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc 
 wrote:
 Thanks a lot for the testcase you posted on Github. Will sink 
 my teeth in fixing that first.

 You're welcome – I hope it's enough information to reproduce 
 it, but I don't have a debug build of LLVM on this machine 
 right now.

After two more bug fixes: the regexp microbench now works.

Results with the regexp bench (bench.d):

  time ldc2 bench.d -O3 -of=bench_normal
    --> 52s
  time ./bench_normal
    --> 2.55s 98%cpu

  time ldc2 bench.d -fprofile-instr-generate -of=bench_instr
    --> 11s
  time ./bench_instr
    --> 6.72s 99%cpu
  llvm-profdata merge default.profraw -o bench.profdata
  time ldc2 bench.d -O3 -fprofile-instr-use=bench.profdata -of=bench_pgo
    --> 48.35s
  time ./bench_pgo
    --> 2.48s 98%cpu

(timing numbers for bench_normal and bench_pgo are ±0.01 s)

So PGO brings it from 2.55 to 2.48 sec, ~3% improvement.
Disappointing, but well... it works!
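
For reference, the arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the timings quoted in this post; the variable and function names are made up here. Note that the instrumented binary was built without -O3, so the slowdown figure mixes instrumentation overhead with the missing optimization.

```python
# Timings reported in the post, in seconds.
t_normal = 2.55   # -O3 build
t_instr = 6.72    # instrumented build (no -O3 on that command line)
t_pgo = 2.48      # -O3 build using the profile
err = 0.01        # stated uncertainty for t_normal and t_pgo

def improvement(before, after):
    """Relative run-time reduction, in percent."""
    return 100.0 * (before - after) / before

nominal = improvement(t_normal, t_pgo)
worst = improvement(t_normal - err, t_pgo + err)
best = improvement(t_normal + err, t_pgo - err)
slowdown = t_instr / t_normal

print(f"PGO improvement: {nominal:.1f}% (range {worst:.1f}% to {best:.1f}%)")
print(f"Instrumented binary vs. -O3 binary: {slowdown:.1f}x slower")
```

So the measured gain is real but small: even at the pessimistic end of the stated error it stays above zero.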

Dec 10 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 11 Dec 2015, at 1:26, Johan Engelen via digitalmars-d-ldc wrote:
 After two more bug fixes: the regexp microbench now works.
 […]
 Disappointing, but well... it works!

Don't forget that this was just a random program I pulled from the 
Rosettacode compilation, though – I didn't have a benchmark ready 
where I know that branch prediction or inlining improvements would make 
a difference.

  — David

Dec 10 2015

Johan Engelen <j j.nl> writes:

What do you think about "llvm-profdata"? Should we ship that with 
LDC?

Dec 10 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc wrote:
 What do you think about "llvm-profdata"? Should we ship that with LDC?

Yes, we should probably ship it with the binary packages. For distro 
packages, we are of course dependent on the LLVM packages to include the 
tools, but at least the Homebrew package actually does.

What is left to do before we can merge a first version into the main 
repository? A partial list:

  - Deal with the remaining FIXME comments (at least open separate 
GitHub issues for them), as well as with commented-out fragments from 
the Clang implementation.

  - Find some way to avoid ICE-type regressions on real-world D code, 
for example by building the druntime/Phobos unit tests with 
instrumentation on.

  - Decide on a name for the command line switches. The GCC-style "-f" 
prefix isn't currently used for most of the options, but that's not 
necessarily much of an argument.

On a rather unrelated note, did you try whether the profile data also 
gives sensible results with llvm-cov? If yes, that might be something 
nice to mention in that upcoming announcement, even though we also have 
DMD-style -cov support, of course.

  — David

Dec 13 2015

Johan Engelen <j j.nl> writes:

On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger 
wrote:
 What is left to do before we can merge a first version into the 
 main repository? A partial list:

  - Deal with the remaining FIXME comments (at least open 
 separate GitHub issues for them), as well as with commented-out 
 fragments from the Clang implementation.

Yep :-)  That's what the FIXME comments are there for: so I don't 
forget :-)
The plan is to deal with all FIXME's, and remove the unused 
commented-out Clang fragments.

  - Find some way to avoid ICE-type regressions on real-world D 
 code, for example by building the druntime/Phobos unit tests 
 with instrumentation on.

I just fixed two more ICEs, and now the dmd-testsuite succeeds 
with -fprofile-instr-generate. Running the druntime/Phobos 
unittests now with -fprofile-instr-generate.

Also, I added a pragma(LDC_profile_instr, true|false) to 
enable/disable instrumentation codegen for specific functions. My 
main reason for this is to help people speed up instrumented 
binaries, and it also helps to circumvent ICEs.
See tests/ir/profile/pragma.d.
(perhaps you can think of a better name for the pragma)

  - Decide on a name for the command line switches. The 
 GCC-style "-f" prefix isn't currently used for most of the 
 options, but that's not necessarily much of an argument.

I have absolutely no preference here. I think we should do what 
the world is already familiar with. IIRC, 
-fprofile-instr-generate is a Clang option, and Clang is 
moving towards / will support GCC's option naming 
(-fprofile-generate, -fprofile-use).
DMD has a -profile option, but I have not read up on what that 
will do.
I guess we will not add any option for PGO to ldmd2?

 On a rather unrelated note, did you try whether the profile 
 data also gives sensible results with llvm-cov? If yes, that 
 might be something nice to mention in that upcoming 
 announcement, even though we also have DMD-style -cov support, 
 of course.

Did not look into this at all yet. Clang's PGOGen code has some 
extra functions for gcov support and more. It's all commented out 
for now, but it looks like we can support more tools relatively 
easily with the current implementation. I also see hints of 
sampling-based PGO in the code, for example.

Another important TODO item: remove the profiling runtime from 
druntime, and instead add a separate runtime-profiling lib 
(suggestions for a name?  ldc-profile.lib?).

Dec 13 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc wrote:
 I just fixed two more ICEs, and now the dmd-testsuite succeeds with 
 -fprofile-instr-generate. Running the druntime/Phobos unittests now 
 with -fprofile-instr-generate.

Nice!

 I have absolutely no preference here. I think we should do what the 
 world is already familiar with. IIRC, -fprofile-instr-generate is a 
 Clang option, and Clang is moving towards / will support GCC's 
 option naming (-fprofile-generate, -fprofile-use).

Yeah, me neither.

 DMD has a -profile option, but I have not read up on what that will 
 do.

It makes DMD's druntime emit some profiling info as text files 
(trace.def/trace.log) at program exit. This is for manual analysis only, 
no PGO-type functionality in sight.

 I guess we will not add any option for PGO to ldmd2?

Yeah, as DMD does not have any PGO functionality. Of course it will 
still pass the ldc2 options through.

 Another important TODO item: remove the profiling runtime from 
 druntime, and instead add a separate runtime-profiling lib 
 (suggestions for a name?  ldc-profile.lib?).

Maybe add a "rt" suffix to make clear that this is the actual program 
runtime part? Ultimately does not really matter, though.

  — David

Dec 13 2015

Johan Engelen <j j.nl> writes:

On Sunday, 13 December 2015 at 14:16:53 UTC, David Nadlinger 
wrote:
 On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc 
 wrote:
 I guess we will not add any option for PGO to ldmd2?

 Yeah, as DMD does not have any PGO functionality. Of course it 
 will still pass the ldc2 options through.

Lol, I should try to read up on these simple things first...
To do the tests, I had modified ldmd to recognize 
-fprofile-instr-generate...

Dec 13 2015

Johan Engelen <j j.nl> writes:

Hi David,
   Could you have a look at the PR again?
I think it is almost ready for merging. (two FIXME's left that I 
hope to address soon)

On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger 
wrote:
 On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc 
 wrote:
 What do you think about "llvm-profdata"? Should we ship that 
 with LDC?

 Yes, we should probably ship it with the binary packages.

I have added llvm-profdata to the repo, and renamed the built 
executable to ldc-profdata (it has to be in sync with the LLVM 
version that LDC was built with, so the renaming prevents a 
potential name clash with a system-installed llvm-profdata).

 What is left to do before we can merge a first version into the 
 main repository? A partial list:

  - Deal with the remaining FIXME comments (at least open 
 separate GitHub issues for them), as well as with commented-out 
 fragments from the Clang implementation.

I think I will just remove all unused Clang implementation code 
(it is related to coverage stuff, which I think we should 
implement in a different branch after PGO is merged).

Thanks!
   Johan

Jan 10 2016

Kagamin <spam here.lot> writes:

On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen wrote:
 So PGO brings it from 2.55 to 2.48 sec, ~3% improvement.
 Disappointing, but well... it works!

PGO can also reduce physical memory consumption due to less code 
loaded into memory.

Dec 11 2015

Johan Engelen <j j.nl> writes:

On Friday, 11 December 2015 at 14:07:56 UTC, Kagamin wrote:
 On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen 
 wrote:
 So PGO brings it from 2.55 to 2.48 sec, ~3% improvement.
 Disappointing, but well... it works!

 PGO can also reduce physical memory consumption due to less 
 code loaded into memory.

Yes, indeed. I'd like to find a good example of code where this 
shows up in practice, so that PGO really does improve performance 
significantly (say, >10%).

Dec 11 2015

David Nadlinger <code klickverbot.at> writes:

On Tuesday, 8 December 2015 at 22:35:11 UTC, Johan Engelen wrote:
 Thanks a lot for the testcase you posted on Github. Will sink 
 my teeth in fixing that first.

Speaking of test cases: This might be an obvious and/or stupid 
suggestion, but did you try building the Phobos unit tests (and 
maybe also dmd-testsuite/runnable) with PGO? I'd suspect it would 
give you quite a broad coverage of basic language constructs.

  - David

Dec 10 2015

Johan Engelen <j j.nl> writes:

On Thursday, 10 December 2015 at 14:27:59 UTC, David Nadlinger 
wrote:
 Speaking of test cases: This might be an obvious and/or stupid 
 suggestion, but did you try building the Phobos unit tests (and 
 maybe also dmd-testsuite/runnable) with PGO? I'd suspect it 
 would give you quite a broad coverage of basic language 
 constructs.

Nope didn't do that yet :S :S   Looks like it is needed to iron 
out some remaining bugs.

I underestimated the complexity of D's AST (some objects are 
placed in multiple locations in the AST?), which gave rise to an 
assertion failure in your testcase; plus I forgot to add throw 
statements to the AST tree walker, leading to another assertion 
failure. Those issues have been fixed now, and now it breaks with 
the same error you found. It is confusing because I did not (mean 
to) change any of the codegen, other than adding counter 
increment instructions and branch instruction metadata (both 
trivial additions).  But I did have to add extra basicblocks for 
switch statements... perhaps I can search there first.
Hope to have a resolution for your test case quickly.

I also have not tested at all how this works with multiple object 
files linked together, or other possibly more complicated things. 
I thought a fun testcase would be to compile DDMD with 
instrumentation enabled, have it compile itself as a profiling 
run, rebuild it with the resulting profile, and test whether 
compiling, say, Phobos gets quicker or slower.

I am very curious to see what constructs will see a significant 
performance boost, if any at all.

Dec 10 2015

David Nadlinger <code klickverbot.at> writes:

On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger 
wrote:
 Did you also try using it with sample profiles acquired by an 
 external profiler yet, as described in the Clang page on PGO?

Let me add that this would probably be something nice to have for 
the initial release, as users could fall back to using perf, etc. 
if the instrumentation part is still buggy or incomplete for 
their code.

  - David

Dec 10 2015

Liran Zvibel via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

Also, for use cases like ours, where the system runs for extended periods of
time, and optimizing the init time, which may be minutes, is not interesting at
all, just being able to run perf while the system is doing something
interesting to improve is a big plus.

Liran

 On Dec 10, 2015, at 16:30, David Nadlinger via digitalmars-d-ldc
<digitalmars-d-ldc puremagic.com> wrote:
 
 On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger wrote:
 Did you also try using it with sample profiles acquired by an external
profiler yet, as described in the Clang page on PGO?

 
 Let me add that this would probably be something nice to have for the initial
release, as users could fall back to using perf, etc. if the instrumentation
part is still buggy or incomplete for their code.
 
 - David

Dec 10 2015

Kagamin <spam here.lot> writes:

On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel wrote:
 Also, for use cases like ours, where the system runs for 
 extended periods of time, and optimizing the init time, which 
 may be minutes is not interesting at all, just being able to 
 run perf while the system is doing something interesting to 
 improve is a big plus.

As I understand, if the profiling runs long enough, the 
long-running statistics will dominate startup statistics?

Dec 23 2015

Johan Engelen <j j.nl> writes:

On Wednesday, 23 December 2015 at 13:10:33 UTC, Kagamin wrote:
 On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel 
 wrote:
 Also, for use cases like ours, where the system runs for 
 extended periods of time, and optimizing the init time, which 
 may be minutes is not interesting at all, just being able to 
 run perf while the system is doing something interesting to 
 improve is a big plus.

 As I understand, if the profiling runs long enough, the 
 long-running statistics will dominate startup statistics?

The profiling is pretty simple: it is just a bunch of counters 
wherever the code branches. So indeed, long runs will have 
long-running statistics dominate startup statistics.
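
A back-of-the-envelope sketch of that effect, in Python and with made-up numbers: the function `startup_share` and both rates are hypothetical, chosen only to show the trend. Counters touched only during init stop growing once startup is over, while counters in the steady-state code keep accumulating.

```python
def startup_share(startup_count, steady_rate, hours):
    """Fraction of all recorded branch executions that come from startup."""
    total = startup_count + steady_rate * hours
    return startup_count / total

# Hypothetical numbers: 1,000,000 branch executions during init,
# then 10,000,000 per hour while the system does its real work.
for hours in (1, 10, 100):
    share = startup_share(1_000_000, 10_000_000, hours)
    print(f"{hours:>3} h run: startup is {share:.2%} of the recorded counts")
```

The startup share shrinks roughly in proportion to the length of the run, so a long enough profiling run makes the init phase negligible in the profile.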

Dec 23 2015

Johan Engelen <j j.nl> writes:

On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 (does not work on Windows yet... I have to fix LLVM's 
 compiler-rt code)

I fixed a nasty [*] bug in compiler-rt's profile writing code, and 
now it also works on Windows. (The IR tests fail on Windows 
because running a compiled executable from LIT fails for some 
reason on Windows.)

[*] 
https://stackoverflow.com/questions/5537066/strange-0x0d-being-added-to-my-binary-file
Now I know what to look for first if I see 0x0D's in my files...
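
For readers who have not hit this before: the linked question is about binary data written through a file opened in text mode on Windows, where every 0x0A byte is expanded to 0x0D 0x0A. The Python sketch below emulates that translation on any platform. It is an illustration only, not the compiler-rt code, and the payload and function names are invented for the example.

```python
import os
import tempfile

# Hypothetical payload: a little-endian 64-bit counter with value 10,
# which contains the byte 0x0A (ASCII line feed), plus two more bytes.
payload = (10).to_bytes(8, "little") + b"\x01\x02"

def write_like_windows_text_mode(path, data):
    """Emulate C's fopen(path, "w") on Windows: each 0x0A becomes 0x0D 0x0A."""
    with open(path, "w", encoding="latin-1", newline="\r\n") as f:
        f.write(data.decode("latin-1"))

def write_binary(path, data):
    """Equivalent of fopen(path, "wb"): bytes are written unchanged."""
    with open(path, "wb") as f:
        f.write(data)

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text.profraw")
    bin_path = os.path.join(tmp, "bin.profraw")
    write_like_windows_text_mode(text_path, payload)
    write_binary(bin_path, payload)
    with open(text_path, "rb") as f:
        text_bytes = f.read()
    with open(bin_path, "rb") as f:
        bin_bytes = f.read()

print(len(payload), len(bin_bytes), len(text_bytes))  # 10 10 11
print(text_bytes[:2].hex())                           # 0d0a
```

The text-mode file is one byte longer and every field after the inserted 0x0D is shifted, which is enough to make a binary profile unreadable.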

Dec 08 2015

Johan Engelen <j j.nl> writes:

On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 Before I announce it in the "Announce" forum, I want to hear 
 your thoughts first.

Clearly I was too optimistic about the quality of my work so far, 
hehe.

Dec 10 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

On 10 Dec 2015, at 19:43, Johan Engelen via digitalmars-d-ldc wrote:
 Clearly I was too optimistic about the quality of my work so far, 
 hehe.

Quite the contrary – you chose to start with the hard part 
(instrumentation-based instead of sampling-based), and DMD's AST is 
notoriously, uh, fluid in meaning and under-documented.

  — David

Dec 10 2015

Johan Engelen <j j.nl> writes:

On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
 Hi all,
   I have been working on getting rudimentary PGO going in LDC. 
 It's pretty much ready!

It now works with LLVM 3.7 and LLVM 3.8 (trunk), on Mac OS X, 
Linux (tested on Ubuntu), and Windows! I have not tested other 
platforms.

Dec 23 2015
