
digitalmars.D - Templates are slow.

reply Stefan Koch <uplink.coder googlemail.com> writes:
Hi Guys,

I have just hit a barrier trying to optimize the compile-time in 
binderoo.

Roughly 90% of the compile-time is spent instantiating templates.
The 10% for CTFE are small in comparison.

I will write an article about why templates are slow.

The gist, however, will be: "Templates being slow is an inherent 
property of templates." (We are only talking about templates as 
defined by C++ and D.)

That said: Templates are great!
But you have to use them sanely.

If you are instantiating a template inside another template, 
think very hard about the reason; often you can "inline" the 
template body of the inner template and get an instant speed win 
right there.
(Don't do this preemptively, ONLY when you know that this 
template is a problem!)

Phobos has many templates inside templates, in its constraints 
for example.

I have no idea how to cut down on template instantiations in 
phobos while still maintaining the same user-friendliness.

Of course I and others will continue fighting on the 
compiler front, to give you the fastest implementation possible!

Cheers,
Stefan
Sep 07 2016
next sibling parent reply Sebastien Alaiwan <ace17 free.fr> writes:
On Thursday, 8 September 2016 at 05:02:38 UTC, Stefan Koch wrote:
 (Don't do this preemptively, ONLY when you know that this 
 template is a problem!)
How would you measure such things? Is there such a thing as a "compilation time profiler"? (Running oprofile on a dmd with debug info comes first to mind; however, this would only give me statistics on dmd's source code, not mine.)
Sep 07 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 8 September 2016 at 06:34:58 UTC, Sebastien Alaiwan 
wrote:
 On Thursday, 8 September 2016 at 05:02:38 UTC, Stefan Koch 
 wrote:
 (Don't do this preemptively, ONLY when you know that this 
 template is a problem!)
How would you measure such things? Is there such a thing as a "compilation time profiler"? (Running oprofile on a dmd with debug info comes first to mind; however, this would only give me statistics on dmd's source code, not mine.)
I use a special profiling build of dmd. oprofile on dmd can give you a good first impression of where you run into problems, and then you can write special profiling code for this. If you do not want to write such code, send me a message and I will look into it for you :)
Sep 07 2016
prev sibling next sibling parent reply Ethan Watson <gooberman gmail.com> writes:
On Thursday, 8 September 2016 at 05:02:38 UTC, Stefan Koch wrote:
 I have just hit a barrier trying to optimize the compile-time 
 in binderoo.
I did a double take when Stefan told me the representative sample code I gave him to run with Binderoo instantiated ~20,000 templates and resulted in ~10,000,000 hash map look ups inside the compiler. I can certainly write it to be more optimised, but one of the goals I have is to make the codebase human readable so that it's not just Manu and myself that can understand the code. As a result, I figure this could be representative of how an ordinary user would write templated code.
Sep 08 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, September 08, 2016 07:43:10 Ethan Watson via Digitalmars-d wrote:
 On Thursday, 8 September 2016 at 05:02:38 UTC, Stefan Koch wrote:
 I have just hit a barrier trying to optimize the compile-time
 in binderoo.
I did a double take when Stefan told me the representative sample code I gave him to run with Binderoo instantiated ~20,000 templates and resulted in ~10,000,000 hash map look ups inside the compiler. I can certainly write it to be more optimised, but one of the goals I have is to make the codebase human readable so that it's not just Manu and myself that can understand the code. As a result, I figure this could be representative of how an ordinary user would write templated code.
IIRC, Don posted at one point about how the std.algorithm unit tests ended up with over a million template instantiations. All of the eponymous templates that we use for template constraints really add up, and having heavily range-based code is going to rack up the number of template instantiations as well. It's critical that we do what we can to make templates fast. And if we can't make them fast enough, we'll definitely have to come up with techniques/guidelines for reducing their usage when they're not really needed. Improvements to CTFE have really helped though. std.metastrings was dog slow in comparison to using CTFE, and at this point, we don't need std.metastrings anymore, and so it's gone. That's definitely not going to work in all cases though - just in cases where something is being done with templates that could be done with a function that could run at runtime now that CTFE can do most things that can be done at runtime. And we've probably gotten most everything we can out of that transition already - at least in Phobos. - Jonathan M Davis
Sep 08 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 8 September 2016 at 15:45:53 UTC, Jonathan M Davis 
wrote:
 It's critical that we do what we can to make templates fast. 
 And if we can't make them fast enough, we'll definitely have to 
 come up with techniques/guidelines for reducing their usage 
 when they're not really needed.

 - Jonathan M Davis
I agree. We need to make templates faster. But it will be like squeezing water out of stones. A few more obvious optimizations I have tried did not have the desired effect at all. Andrei: also, we need to special-case ranges in general, and try harder to inline calls to range functions, maybe even early in the frontend. Secondly, we need to inline range chains into each other whenever possible. If we do this the right way early on, we can reduce the symbol-name length as well. All we need for this is pattern matching on a type-resolved call graph, which is something I am working on as part of my CTFE work.
Sep 08 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Sep 08, 2016 at 04:37:36PM +0000, Stefan Koch via Digitalmars-d wrote:
[...]
 Also we need to special case ranges in general.
 And try harder to inline calls to range functions.
 Maybe even early in the frontend.
[...] Yeah, dmd's inliner is really pessimistic. It gives up too easily, and as a result often misses a whole series of further opportunities that could have opened up, had it decided to inline. Having range-specific optimizations is a good idea, I think. Since it's one of D's big selling points, it needs to be as high-performance as we can possibly make it. I think there is a lot we can do in this area that we can't do as general optimizations; range-based code has certain recurring structures that should be highly-exploitable in terms of optimization opportunities. IME I've found that GDC is much better at optimizing range-based code because of its aggressive inliner (and also better loop optimization algorithms); but there's probably still room for range-specific optimizations that a general optimizer like the gcc backend wouldn't have. T -- Democracy: The triumph of popularity over principle. -- C.Bond
Sep 08 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/8/16 7:02 AM, Stefan Koch wrote:
 I will write an article about why templates are slow.

 The gist will be however : "Templates being slow is an inherent property
 of templates." (We are only talking about templates as defined by (C++
 and D)).
That would be a great article. Are there any situations that we can special-case away? Raising the roof. -- Andrei
Sep 08 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 8 September 2016 at 12:23:35 UTC, Andrei 
Alexandrescu wrote:
 Are there any situations that we can special-case away? Raising 
 the roof. -- Andrei
The range-ifying functions in std.array come to mind. That will give a huge boost to everyone (everyone who uses arrays, anyway :))
Sep 08 2016
prev sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2016-09-08 at 14:23 +0200, Andrei Alexandrescu via 
Digitalmars-d wrote:
 On 9/8/16 7:02 AM, Stefan Koch wrote:
 I will write an article about why templates are slow.
 The gist will be however : "Templates being slow is an inherent 
 property of templates." (We are only talking about templates as 
 defined by (C++ and D)).
 That would be a great article. Are there any situations that we 
 can special-case away? Raising the roof. -- Andrei
One that could get published in CVu or Overload. https://accu.org/index.php/journal
--
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200    voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077    xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk   skype: russel_winder
Sep 09 2016
prev sibling next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 8 September 2016 at 05:02:38 UTC, Stefan Koch wrote:
 If you are instantiating a template inside another template 
 think very hard about the reason, often you can "inline" the 
 template body of the inner template and get an instant speed 
 win right there.
 (Don't do this preemptively, ONLY when you know that this 
 template is a problem!)
What about functions in std.algorithm, like reduce, whose templates contain multiple functions, where the headline function calls private impl functions within them?
Sep 08 2016
prev sibling next sibling parent reply Lewis <musicaljelly gmail.com> writes:
I recently went through the process of optimizing the build time 
on one of my projects. I started at ~3.08s, and got it down to 
~1.6s. The project is around 7000 non-comment-non-whitespace LOC. 
I timed the build in a pretty non-rigorous fashion (I just timed 
the python script that kicks off a command-line call to dmd and 
waits for it to finish), however I only cared about making 
changes that resulted in large improvements to the build time, so 
this was good enough for my purposes.

I too found that template instantiation was responsible for a lot 
of the extra build time. I found running dmd -v very helpful in 
tracking down excessive template instantiations or other places 
where the compiler was doing a lot of work that could be avoided.

The steps I took were as follows:

- Start (3.08s)
- I was using a custom assert function that grabbed __LINE__ and 
__FILE__ as template arguments, meaning each of the ~130 assert 
calls required a separate instantiation. I switched to passing 
those in as run-time arguments (2.85s)
- I had a similar wrapper around some logging functions in 
std.experimental.logger. I made a small change to 
std.experimental.logger to allow a call path with no template 
instantiations, and similarly fixed my own wrapper. I had ~70 
logging calls (2.7s)
- Recompiled DMD with VS2015 (2.5s)
- Overclocked my CPU :D (2.3s)
- Created a file called heavytemplates.d that built to a .lib in 
a separate build step. The first templates I pulled out were a 
couple std.regex calls and instantiations (1.9s)
- Changed some tuples into structs (negligible improvement)
- Pulled several templates into heavytemplates.d that instantiate 
recursively over the Gamestate (a very large struct) (1.75s)
- Pulled out template instantiations used by msgpack-d, which 
also instantiate recursively over the Gamestate for save/load of 
the game (1.6s)

Of all of these, I was most surprised by the gain I got from 
pulling out std.regex calls into a separate build (0.4s). 
Whether or not I used compile-time regexes didn't seem to affect 
build time substantially, only that I used std.regex at all. Also, 
whether I had one regex call or five didn't seem to matter, 
likely because std.regex instantiates using the string type as a 
parameter, and I just used plain old 'string' for all regex uses.

There's still work I could do, but at some point I start to get 
diminishing returns, and have to actually work on features 
instead of just optimizing my build  :D
Sep 08 2016
parent reply Lewis <musicaljelly gmail.com> writes:
It's true that templates are inherently slow, and there isn't a 
ton we can do about that. However, almost every time I compile 
the project (hundreds of times per day), the overwhelming 
majority of the time, the same templates are being 
re-instantiated in exactly the same way. I can't help but wonder 
if there were some way to automatically cache template 
instantiations between runs of dmd?

The heavytemplates.d workaround I've used kind of accomplishes 
this as a total hack job. However...

- It adds complexity to the build process
- It adds a small overhead of linking in an extra .lib (although 
this is dwarfed by the win from no longer rebuilding expensive 
templates every build)
- It means that when heavytemplates.d changes, my rebuild is 
significantly longer than before since I'm running dmd twice
- It means extra work to implement that we don't want every 
developer to do themselves

Am I crazy in wondering about caching template instantiations? I 
understand that an incremental build would kind of accomplish 
this goal, but that comes with its own set of problems. I can't 
help but think that there's some way to make dmd smarter about 
not redoing the exact same work build after build, when the 
templates and their instantiations only change very rarely.
Sep 08 2016
next sibling parent reply Ethan Watson <gooberman gmail.com> writes:
On Thursday, 8 September 2016 at 19:17:42 UTC, Lewis wrote:
 I can't help but wonder if there were some way to automatically 
 cache template instantiations between runs of dmd?
I'm running with Visual D, which has a "COMPILE ALL THE THINGS" mentality as the default. As part of the rapid iteration part of Binderoo, I plan on doing incremental linking. Of course, if all template instantiations go into one object file, that really ruins it. Each template instantiation going into a separate object file will actually make life significantly easier, as each compile will have less output. The only time those template instantiations need to recompile is if the invoking module changes; the template's dependencies change; or the module the template lives in changes. My opinion is that splitting up object files will do more to reduce compile time for me than anything else. The pipeline we had for Quantum Break was to compile and link in separate steps, so it's not much effort at all for me to keep that idea running in Binderoo and make it incrementally link. But I don't know the DMD code and I'm not a compiler writer, so I cannot say that authoritatively. It sounds very reasonable to me at least.
Sep 08 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 8 September 2016 at 19:49:38 UTC, Ethan Watson wrote:
 On Thursday, 8 September 2016 at 19:17:42 UTC, Lewis wrote:
 I can't help but wonder if there were some way to 
 automatically cache template instantiations between runs of 
 dmd?
I'm running with Visual D, which has a "COMPILE ALL THE THINGS" mentality as the default. As part of the rapid iteration part of Binderoo, I plan on doing incremental linking. Of course, if all template instantiations go in to one object file, that really ruins it. Each template instantiation going in to a separate object file will actually make life significantly easier, as each compile will have less output. The only time those template instantiations need to recompile is if the invoking module changes; the template's dependencies change; or the module the template lives in changes. My opinion is that splitting up object files will do more to reduce compile time for me than anything else, the pipeline we had for Quantum Break was to compile and link in separate steps so it's not much effort at all for me to keep that idea running in Binderoo and make it incrementally link. But I don't know the DMD code and I'm not a compiler writer, so I cannot say that authoritatively. It sounds very reasonable to me at least.
Generating separate object files for each template instantiation and then only regenerating on change will only be effective if they do not change much from one build to the next. For Binderoo's purposes this could be rather effective, as long as no one adds fields at the beginning of the structs :) Without incremental linking, however, your compile times will shoot through the roof. And will probably damage the moon as well.
Sep 08 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 8 September 2016 at 20:10:01 UTC, Stefan Koch wrote:
 Generating separate object files for each template 
 instantiation and then only regenerating on change will 
 only be effective if they do not change much from one 
 build to the next.
You'd have tens of thousands of files and a big I/O problem.
Sep 08 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 9 September 2016 at 01:38:40 UTC, deadalnix wrote:
 On Thursday, 8 September 2016 at 20:10:01 UTC, Stefan Koch 
 wrote:
 Generating separate object files for each template 
 instantiation and then only regenerating on change will 
 only be effective if they do not change much from one 
 build to the next.
You'd have tens of thousands of files and a big I/O problem.
I already thought about that. The idea is to stuff the object code of all templates into one file with a bit of metadata, do a hash lookup at instantiation, and write out the cached code if the instantiation is found. I agree, file I/O would kill any speed win many times over!
Sep 08 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 8 September 2016 at 19:17:42 UTC, Lewis wrote:
 Am I crazy in wondering about caching template instantiations? 
 I understand that an incremental build would kind of accomplish 
 this goal, but that comes with its own set of problems.
Not as good as what you propose, but: LDC 1.1.0 can do _codegen_ caching which I guess is some intermediate form of incremental building. My testcase is a unittest piece from Weka.io that instantiates, oh, I don't remember exactly, 100.000+ templates. It takes about 65 seconds to compile. With codegen caching, the re(!)compile time on cache-hit is reduced to 39s. Note that the front-end still instantiates all those templates, but LDC's codegen at -O0 is not as fast as DMD's (calculating the hash also takes time and could be optimized further). In summary, for LDC -O3 builds, you can expect a large speed boost by just adding `-ir2obj-cache=<cache dir>` to the commandline (LDC >= 1.1.0-alpha1). -Johan
Sep 08 2016
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
Hi Guys,

I have some more data.
In the binderoo example the main time is spent in the backend, 
generating code and writing object files.

The front-end spends most of its time comparing strings of 
unique type-names :)
One particular outlier in the backend code is the function ecom, 
which eliminates common subexpressions.
We would potentially save some time by not emitting those in the 
first place.
Sep 08 2016
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 8 September 2016 at 22:57:07 UTC, Stefan Koch wrote:

 The front-end spends most of its time comparing strings of 
 unique type-names :)
(Waits for Walter to say, "Use a pool Luke!")
Sep 08 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 9/8/16 6:57 PM, Stefan Koch wrote:
 Hi Guys,

 I have some more data.
 In the binderoo example the main time is spent in the backend,
 generating code and writing object files.
If we ever get Rainer's patch to collapse repetitive templates, we may help this problem. https://github.com/dlang/dmd/pull/5855
 The front-end spends most of its time comparing strings of unique
 type-names :)
I thought the front end was changed to use the string pointer for symbol names as the match so string comparisons aren't done? Hm... maybe to intern the string? That kind of makes sense. I just had a thought. If you hash the string, and then compare the length of the string and first and last character along with the hash, what are the chances of it being a false positive? -Steve
Sep 09 2016
next sibling parent pineapple <meapineapple gmail.com> writes:
On Friday, 9 September 2016 at 12:09:32 UTC, Steven Schveighoffer 
wrote:
 I just had a thought. If you hash the string, and then compare 
 the length of the string and first and last character along 
 with the hash, what are the chances of it being a false 
 positive?
Any chance of a false positive is nontrivial when hashes are being compared so often. Never, never ever in production code are you to assume that hashes and other non-definitive data can be used to guarantee uniqueness.
Sep 09 2016
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 9 September 2016 at 12:09:32 UTC, Steven Schveighoffer 
wrote:
 On 9/8/16 6:57 PM, Stefan Koch wrote:
 Hi Guys,

 I have some more data.
 In the binderoo example the main time is spent in the backend,
 generating code and writing object files.
If we ever get Rainer's patch to collapse repetitive templates, we may help this problem. https://github.com/dlang/dmd/pull/5855
 The front-end spends most of its time comparing strings of 
 unique type-names :)
I thought the front end was changed to use the string pointer for symbol names as the match so string comparisons aren't done?
In this case the string is freshly built and not available as a reference to an already-lexed string.
 Hm... maybe to intern the string? That kind of makes sense.
Yes that would be the way to go.
 I just had a thought. If you hash the string, and then compare 
 the length of the string and first and last character along 
 with the hash, what are the chances of it being a false 
 positive?
This depends entirely on the distribution of strings. It's probably quite high :)
Sep 09 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 9 September 2016 at 12:09:32 UTC, Steven Schveighoffer 
wrote:
 On 9/8/16 6:57 PM, Stefan Koch wrote:
 Hi Guys,

 I have some more data.
 In the binderoo example the main time is spent in the backend,
 generating code and writing object files.
If we ever get Rainer's patch to collapse repetitive templates, we may help this problem. https://github.com/dlang/dmd/pull/5855
 The front-end spends most of its time comparing strings of 
 unique type-names :)
I thought the front end was changed to use the string pointer for symbol names as the match so string comparisons aren't done? Hm... maybe to intern the string? That kind of makes sense. I just had a thought. If you hash the string, and then compare the length of the string and first and last character along with the hash, what are the chances of it being a false positive? -Steve
You need to compare the strings to unique them, so it doesn't change anything.
Sep 09 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 9 September 2016 at 18:17:02 UTC, deadalnix wrote:
 You need to compare the string to unique them, so it doesn't 
 change anything.
It changes the frequency of comparisons.
Sep 09 2016