
digitalmars.D - [hackathon] My and Walter's ideas

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I've been on this project at work that took the "functionality first, 
performance later" approach. It has a Java-style approach of using class 
objects throughout and allocating objects casually.

So now we have a project that works but is kinda slow. Profiling shows 
it spends a fair amount of time collecting garbage (which is easily 
visible by just looking at code). Yet there is no tooling that tells 
where most allocations happen.

Since it's trivial to make D applications a lot faster by avoiding 
big-ticket allocations and leaving only the peanuts for the heap, there 
should be a simple tool to e.g. count how many objects of each type were 
allocated by the end of a run. This is the kind of tool that should be 
embarrassingly easy to turn on and use to draw great insights about the 
allocation behavior of any application.

First shot is a really simple proof of concept at 
http://dpaste.dzfl.pl/8baf3a2c4a38. I manually replaced all "new 
T(args)" with "make!T(args)" and all "new T[n]" with "makeArray!T(n)". I 
didn't even worry about concatenations and array literals in the first 
approximation.
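
To give an idea of the shape of the wrappers, here's a simplified sketch 
(the names mirror the paste, but the bodies are illustrative rather than 
the actual dpaste code):

module allocprof;

// Identifies one allocation site plus the type allocated there.
struct Locus { string file; size_t line; string func; string type; }

// Thread-local tally of bytes per allocation site (module-level
// variables are thread-local by default in D).
private size_t[Locus] perThreadBytes;

void recordAllocation(string file, size_t line, string func,
                      string type, size_t bytes)
{
    perThreadBytes[Locus(file, line, func, type)] += bytes;
}

// Drop-in replacement for `new T(args)` that records the caller's locus.
T make(T, string file = __FILE__, size_t line = __LINE__,
       string func = __FUNCTION__, Args...)(Args args)
    if (is(T == class))
{
    recordAllocation(file, line, func, T.stringof,
                     __traits(classInstanceSize, T));
    return new T(args);
}

// Drop-in replacement for `new T[n]`.
T[] makeArray(T, string file = __FILE__, size_t line = __LINE__,
              string func = __FUNCTION__)(size_t n)
{
    recordAllocation(file, line, func, T.stringof ~ "[]", T.sizeof * n);
    return new T[n];
}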

The support code collects in a thread-local table the locus of each 
allocation (file, line, and function of the caller) along with the 
type created. Total bytes allocated for each locus are tallied.

When a thread exits, its table is dumped wholesale into a global table, 
which is synchronized. It's fine to use a global lock because the global 
table is only updated when a thread exits, not with each increment.

When the process exits, the global table is printed out.
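
Sketched out, those two hooks look roughly like this (again a sketch, 
continuing the illustrative module above rather than reproducing the 
actual paste):

// Global table shared by all threads, guarded by a global lock.
private __gshared size_t[Locus] globalBytes;

static ~this()   // runs once in each thread as it exits
{
    synchronized  // a single global lock is fine; this runs once per thread
    {
        foreach (locus, bytes; perThreadBytes)
            globalBytes[locus] += bytes;
    }
}

shared static ~this()   // runs once at process shutdown
{
    import std.algorithm.sorting : sort;
    import std.stdio : writefln;

    auto loci = globalBytes.keys;
    loci.sort!((a, b) => globalBytes[a] > globalBytes[b]);
    foreach (locus; loci)
        writefln("%s(%s): %s -- %s bytes of %s", locus.file, locus.line,
                 locus.func, globalBytes[locus], locus.type);
}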

This was extraordinarily informative, essentially taking us from "well, 
let's grep for new and reduce those, and replace class with struct where 
sensible" to a much more focused approach that targeted the top 
allocation sites. The distribution is Pareto: the locus with the most 
allocations accounts for four times more bytes than the second, and the 
top few are responsible for statistically all allocations that matter. 
I'll post some sample output soon.

Walter will help me with hooking the places that allocate in the runtime 
(the new operator, concatenations, array literals, etc.) to allow building 
this into druntime. At the end we'll write an article about all this.


Andrei
Apr 25 2015
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu 
wrote:
 Since it's trivial to make D applications a lot faster by 
 avoiding big-ticket allocations and leaving only the peanuts 
 for the heap, there should be a simple tool to e.g. count how 
 many objects of each type were allocated by the end of a run. 
 This is the kind of tool that should be embarrassingly easy to 
 turn on and use to draw great insights about the allocation 
 behavior of any application.
https://github.com/CyberShadow/Diamond

Among other features:
 can display "top allocators" - call stacks that allocated most 
 bytes
Unfortunately still D1-only.
Apr 25 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/25/15 8:49 PM, Vladimir Panteleev wrote:
 On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu wrote:
 Since it's trivial to make D applications a lot faster by avoiding
 big-ticket allocations and leaving only the peanuts for the heap, there
 should be a simple tool to e.g. count how many objects of each type
 were allocated by the end of a run. This is the kind of tool that
 should be embarrassingly easy to turn on and use to draw great
 insights about the allocation behavior of any application.
https://github.com/CyberShadow/Diamond

Among other features:
 can display "top allocators" - call stacks that allocated most bytes
(Enthusiasm rises)
 Unfortunately still D1-only.
(Enthusiasm decreases)

Andrei
Apr 25 2015
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 26 April 2015 at 03:58:44 UTC, Andrei Alexandrescu 
wrote:
 On 4/25/15 8:49 PM, Vladimir Panteleev wrote:
 On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu 
 wrote:
 Since it's trivial to make D applications a lot faster by 
 avoiding big-ticket allocations and leaving only the peanuts 
 for the heap, there should be a simple tool to e.g. count how 
 many objects of each type were allocated by the end of a run. 
 This is the kind of tool that should be embarrassingly easy to 
 turn on and use to draw great insights about the allocation 
 behavior of any application.
https://github.com/CyberShadow/Diamond

Among other features:
 can display "top allocators" - call stacks that allocated 
 most bytes
(Enthusiasm rises)
 Unfortunately still D1-only.
(Enthusiasm decreases)
Maybe I should work on it for this hackathon. But I also have two other interesting D projects in the pipeline, much closer to being ready (or at least, announce-ready).
Apr 25 2015
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 4/25/15 11:40 PM, Andrei Alexandrescu wrote:
 I've been on this project at work that took the "functionality first,
 performance later" approach. It has a Java-style approach of using class
 objects throughout and allocating objects casually.

 So now we have a project that works but is kinda slow. Profiling shows
 it spends a fair amount of time collecting garbage (which is easily
 visible by just looking at code). Yet there is no tooling that tells
 where most allocations happen.

 Since it's trivial to make D applications a lot faster by avoiding
 big-ticket allocations and leaving only the peanuts for the heap, there
 should be a simple tool to e.g. count how many objects of each type
 were allocated by the end of a run. This is the kind of tool that should
 be embarrassingly easy to turn on and use to draw great insights about
 the allocation behavior of any application.

 First shot is a really simple proof of concept at
 http://dpaste.dzfl.pl/8baf3a2c4a38. I manually replaced all "new
 T(args)" with "make!T(args)" and all "new T[n]" with "makeArray!T(n)". I
 didn't even worry about concatenations and array literals in the first
 approximation.

 The support code collects in a thread-local table the locus of each
 allocation (file, line, and function of the caller) along with the
 type created. Total bytes allocated for each locus are tallied.

 When a thread exits, its table is dumped wholesale into a global table,
 which is synchronized. It's fine to use a global lock because the global
 table is only updated when a thread exits, not with each increment.

 When the process exits, the global table is printed out.

 This was extraordinarily informative, essentially taking us from "well,
 let's grep for new and reduce those, and replace class with struct where
 sensible" to a much more focused approach that targeted the top
 allocation sites. The distribution is Pareto: the locus with the most
 allocations accounts for four times more bytes than the second, and the
 top few are responsible for statistically all allocations that matter.
 I'll post some sample output soon.

 Walter will help me with hooking the places that allocate in the runtime
 (the new operator, concatenations, array literals, etc.) to allow building
 this into druntime. At the end we'll write an article about all this.
Everything to alter is in lifetime.d. It would be trivial to create this. 
The only thing is to have a malloc-based AA for the tracking, so that the 
tracking doesn't track itself (as that would likely be the biggest source 
of allocations!). Where's that std.allocator?

I think it's something that can easily be turned on via a runtime 
variable: allocating is so expensive that the cost of checking a bool to 
decide whether to track would be nonexistent performance-wise. Doing it 
by altering calls to new would be very invasive.

However, note that this wouldn't track allocations that the compiler does 
for closures. I don't know how that works, as there's no appropriate 
lifetime.d function for it. If we want a generic, comprehensive solution, 
that would need to be added.

-Steve
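
P.S. A rough sketch of the kind of guard I mean (the names here are 
hypothetical, including the flag; the real check would sit next to the 
allocation entry points in lifetime.d):

// Set once at startup from a runtime switch (think of a hypothetical
// --DRT-style flag); checking it is noise next to the cost of allocating.
__gshared bool trackAllocations;

// Thread-local re-entrancy guard so the bookkeeping never tracks itself.
private bool insideTracker;

void trackAllocation(size_t bytes, string type,
                     string file = __FILE__, size_t line = __LINE__)
{
    if (!trackAllocations || insideTracker)
        return;
    insideTracker = true;
    scope (exit) insideTracker = false;

    // The table itself must stay off the GC heap (malloc-backed),
    // otherwise it would dominate its own statistics.
    recordInMallocTable(file, line, type, bytes);
}

// Stub for illustration only; a real version would update a malloc-based
// hash table keyed on (file, line, type).
private void recordInMallocTable(string file, size_t line,
                                 string type, size_t bytes)
{
    // ... malloc-backed bookkeeping goes here ...
}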
Apr 27 2015
parent reply "Martin Nowak" <code dawg.eu> writes:
On Monday, 27 April 2015 at 10:56:17 UTC, Steven Schveighoffer 
wrote:
 Everything to alter is in lifetime.d. It would be trivial to 
 create this.
https://issues.dlang.org/show_bug.cgi?id=13988
 The only thing is to have a malloc-based AA for tracking
https://github.com/D-Programming-Language/druntime/blob/18d57ffe3eed8674ca2052656bb3f410084379f6/src/rt/util/container/hashtab.d
 However, note that this wouldn't track allocations that the 
 compiler did for closures.
Plain _d_allocmemory.
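
E.g. something like this is enough to trigger it:

// The returned delegate captures `count`, so the compiler allocates the
// enclosing stack frame on the GC heap; that heap allocation is the
// _d_allocmemory call.
int delegate() makeCounter()
{
    int count;
    return () => ++count;
}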
Apr 27 2015
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 4/27/15 7:10 AM, Martin Nowak wrote:
 On Monday, 27 April 2015 at 10:56:17 UTC, Steven Schveighoffer wrote:
 The only thing is to have a malloc-based AA for tracking
https://github.com/D-Programming-Language/druntime/blob/18d57ffe3eed8674ca2052656bb3f410084379f6/src/rt/util/container/hashtab.d
sweet, that makes things REALLY trivial :)
 However, note that this wouldn't track allocations that the compiler
 did for closures.
Plain _d_allocmemory.
OK, I wasn't sure how it worked. But this really doesn't help much for 
fine-grained statistics gathering. In what situations does the compiler 
call this function?

If it's just for closures, we can lump all closure allocations together 
in one stat. If you see that closures are your big nemesis, it may be 
time to redesign :)

-Steve
Apr 27 2015