digitalmars.D - A modest proposal: eliminate template code bloat

Dmitry Olshansky (45/45) Apr 08 2012 I think it's been ages since I meant to ask why nobody (as in compiler

Marco Leise (6/61) Apr 08 2012 Thoughts? Nothing much. I thought of that a while ago, but as an externa...

Dmitry Olshansky (11/13) Apr 08 2012 I think I this idea largely formed years ago when I was working with c++...

Marco Leise (5/20) Apr 08 2012 I don't know compilers, but doesn't the IR level still distinguish betwe...

H. S. Teoh (37/82) Apr 08 2012 This would be incompatible with how current (non-dmd) linkers work. But

Marco Leise (6/8) Apr 08 2012 Executables (and object files) are made up mostly of sections, some of w...
Dmitry Olshansky (20/66) Apr 08 2012 Easy the symbol size is in object file (obviously) but it surely not

H. S. Teoh (13/51) Apr 08 2012 I agree!

Dmitry Olshansky (44/61) Apr 08 2012 Right the advantage is that one doesn't have to use tricks. One can

Andrei Alexandrescu (3/4) Apr 08 2012 You may always do.
Dmitry Olshansky (7/18) Apr 08 2012 Mm btw there is one step from here to use it on arbitrary common slices
Daniel Murphy (14/17) Apr 08 2012 I think you'll find that this is better done in the compiler instead of ...

Dmitry Olshansky (20/39) Apr 09 2012 "Easy": just add a hidden pointer argument to functions that have merged...

Somedude (5/19) Apr 08 2012 Actually, in C++ (as well as D), the added benefit would be a greatly

Artur Skawina (22/55) Apr 09 2012 They already do.
H. S. Teoh (6/26) Apr 09 2012 Exactly my point. I *want* to give incentive to toolchain devs to add

Artur Skawina (6/9) Apr 08 2012 Don't forget that this needs to work:

Dmitry Olshansky (4/13) Apr 08 2012 A reference to spec page plz.

Artur Skawina (18/30) Apr 08 2012 A reference to *a* D spec, please...

Andrei Alexandrescu (3/19) Apr 08 2012 Doesn't apply to C++.

Artur Skawina (16/40) Apr 08 2012 Well, let's say it isn't specified. "Functions-are-not-objects" allows

Dmitry Olshansky (13/42) Apr 08 2012 There is common sense and that is (in my book):
Daniel Murphy (5/9) Apr 08 2012 Or use a nop slide before the start of the function. Since we're modify...

H. S. Teoh (6/20) Apr 08 2012 [...]

Marco Leise (5/26) Apr 08 2012 I would like to know that use case as well, especially since I learned c...
Daniel Murphy (5/6) Apr 09 2012 Just because I can't think of a use case doesn't mean nobody is relying ...

H. S. Teoh (10/20) Apr 09 2012 [...]

Marco Leise (9/21) Apr 08 2012 Do you actually rely on that behavior? It is the same as asking this to ...

Walter Bright (6/8) Apr 08 2012 I worked out how to do it a while ago, but there's been no time to imple...

Dmitry Olshansky (7/18) Apr 08 2012 I'm thinking what if we kind of do it as an extension so that your

H. S. Teoh (15/24) Apr 08 2012 [...]

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

I think it's been ages since I meant to ask why nobody (as in compiler 
vendors) does what I think is rather simple optimization.

In the short term the plan is to introduce a "link-time" flavored 
optimization at code generation or (better) link step.

For simplicity let's assume compiler does all of the following during 
generation of an object file.

1. Every time a function is generated (or pretty much any symbol) not 
only a size calculated but also a checksum* of it's data.
(If we go for link-time optimization we should find a place to stick it 
to in the object file)

2. Compiler maintains a hash-table of symbol_size ---> list( ~ array) of 
pairs (references to data, checksum) of all symbols with given size. 
Call it a duplicate table. Every function generated and more generally 
global immutable data should end up there.

3. After any function was generated compiler checks an entry in the 
duplicate table that matches size, followed by matching checksum and 
only then (if required) doing a straight memcmp. If it happens that 
there is a match compiler just throws generated code away and _aliases_ 
it's symbol to that of a matched entry.
(so there has to be an alias table if there isn't one already)

*Yes, checksum. I think it should be real simple and easy to parallel 
hash function. The original checksum is no reliable but some amount 
balancing and profiling are obviously required when picking this function.

Applicability:
It's not only const-immutable bloat, it can be alleviated with inout. 
Yet there are plenty of places the exact same code is being generated: 
e.g. sealed containers of int vs uint, std.find on dchar[] vs 
uint[]/int[] an so on.
In general, the coarse grained parametrization is the root of all evil 
and it is inevitable since we are just humans after all.

Notes:
1. If we do checksum calculation on the fly during codegen it gets at 
virtually no cost as the data is in CPU data cache. Preliminary version 
can avoid hacking this part of backend though.

2. By _alias_ I mean the ability of compiler to emit references to a 
given symbol as if it was some other symbol (should be really straight 
forward).

3. Linker have more data and is able to achieve colossal size savings, 
essentially running through the same algorithm before actually linking 
things. Again it's easily separable step and can be an opt-in.

4. Ironically the same exact thing works with any kind of immutable data 
structures. It looks like string pooling is superseded by this proposal.

Thoughts?

-- 
Dmitry Olshansky

Apr 08 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Sun, 08 Apr 2012 15:01:56 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 I think it's been ages since I meant to ask why nobody (as in compiler 
 vendors) does what I think is rather simple optimization.
 
 In the short term the plan is to introduce a "link-time" flavored 
 optimization at code generation or (better) link step.
 
 For simplicity let's assume compiler does all of the following during 
 generation of an object file.
 
 1. Every time a function is generated (or pretty much any symbol) not 
 only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick it 
 to in the object file)
 
 2. Compiler maintains a hash-table of symbol_size ---> list( ~ array) of 
 pairs (references to data, checksum) of all symbols with given size. 
 Call it a duplicate table. Every function generated and more generally 
 global immutable data should end up there.
 
 3. After any function was generated compiler checks an entry in the 
 duplicate table that matches size, followed by matching checksum and 
 only then (if required) doing a straight memcmp. If it happens that 
 there is a match compiler just throws generated code away and _aliases_ 
 it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)
 
 *Yes, checksum. I think it should be real simple and easy to parallel 
 hash function. The original checksum is no reliable but some amount 
 balancing and profiling are obviously required when picking this function.
 
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with inout. 
 Yet there are plenty of places the exact same code is being generated: 
 e.g. sealed containers of int vs uint, std.find on dchar[] vs 
 uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all evil 
 and it is inevitable since we are just humans after all.
 
 Notes:
 1. If we do checksum calculation on the fly during codegen it gets at 
 virtually no cost as the data is in CPU data cache. Preliminary version 
 can avoid hacking this part of backend though.
 
 2. By _alias_ I mean the ability of compiler to emit references to a 
 given symbol as if it was some other symbol (should be really straight 
 forward).
 
 3. Linker have more data and is able to achieve colossal size savings, 
 essentially running through the same algorithm before actually linking 
 things. Again it's easily separable step and can be an opt-in.
 
 4. Ironically the same exact thing works with any kind of immutable data 
 structures. It looks like string pooling is superseded by this proposal.
 
 Thoughts?

Thoughts? Nothing much. I thought of that a while ago, but as an external
program, that finds function calls by disassembling and removing dead/duplicate
code. So I agree with you. A similar feature is a CTFE cache (or general code
cache) that checksums a function's source code and gets the compiled version
from a cache.
Template bloat could be especially important to 'fix' on embedded systems. But
I don't consider it important enough at the moment. :/   Let's wait till the
bugs and important features are implemented or hack the compiler ourselves.

-- 
Marco

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 16:37, Marco Leise wrote:
[snip]
 Template bloat could be especially important to 'fix' on embedded systems.

I think I this idea largely formed years ago when I was working with c++ 
on 8bit micros. You won't believe the amount of code size one can save 
by using one separate generic save-it-all-then-load-it-all prolog/epilog 
for all functions (and esp ISRs).


  Let's wait till the bugs and important features are implemented or hack the
compiler ourselves.

Let's hack the compiler ;)
BTW I think it should be possible to apply the idea on the IR level not 
on the actual machine code.

-- 
Dmitry Olshansky

Apr 08 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Sun, 08 Apr 2012 20:58:15 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 On 08.04.2012 16:37, Marco Leise wrote:
 [snip]
 Template bloat could be especially important to 'fix' on embedded systems.

 
 I think I this idea largely formed years ago when I was working with c++ 
 on 8bit micros. You won't believe the amount of code size one can save 
 by using one separate generic save-it-all-then-load-it-all prolog/epilog 
 for all functions (and esp ISRs).
 
 
  Let's wait till the bugs and important features are implemented or hack the
compiler ourselves.

 
 Let's hack the compiler ;)
 BTW I think it should be possible to apply the idea on the IR level not 
 on the actual machine code.

I don't know compilers, but doesn't the IR level still distinguish between
signed and unsigned for example, even though the generated machine code will be
the same? Anyway that's what I had in mind with the caching of CTFE as well. To
create a hash over the IR (or AST?) of a function (or anything else) and store
the hash and the serialized IR into a file next to the object files.

-- 
Marco

Apr 08 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.
 
 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

This would be incompatible with how current (non-dmd) linkers work. But
I do like the idea. Perhaps if it works well, other linkers will adopt
it? (Just like how the gcc linker adopted duplicate template code
elimination due to C++ templates.)


 For simplicity let's assume compiler does all of the following
 during generation of an object file.
 
 1. Every time a function is generated (or pretty much any symbol)
 not only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick
 it to in the object file)

We'd have to make sure the checksum doesn't end up in the final
executable though, otherwise the bloat may negate any gains we've made.


 2. Compiler maintains a hash-table of symbol_size ---> list( ~
 array) of pairs (references to data, checksum) of all symbols with
 given size. Call it a duplicate table. Every function generated and
 more generally global immutable data should end up there.
 
 3. After any function was generated compiler checks an entry in the
 duplicate table that matches size, followed by matching checksum and
 only then (if required) doing a straight memcmp. If it happens that
 there is a match compiler just throws generated code away and
 _aliases_ it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)

I think you don't even need an alias table; IIRC the OS dynamic linker
can easily handle symbols that have the same value (i.e. that point to
the same place). All you have to do is to change the value of the
duplicated symbols so that they all point to the same address.


[...]
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with
 inout. Yet there are plenty of places the exact same code is being
 generated: e.g. sealed containers of int vs uint, std.find on
 dchar[] vs uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all
 evil and it is inevitable since we are just humans after all.

I'm not sure I understand the last sentence there, but duplicate code
elimination is definitely a big plus. It will also mean that we can use
templates more freely without having to worry about template bloat.


 Notes:
 1. If we do checksum calculation on the fly during codegen it gets
 at virtually no cost as the data is in CPU data cache. Preliminary
 version can avoid hacking this part of backend though.
 
 2. By _alias_ I mean the ability of compiler to emit references to a
 given symbol as if it was some other symbol (should be really
 straight forward).

Like I said, I think this isn't even necessary, if the compiler can just
generate the same value for the duplicated symbols.


 3. Linker have more data and is able to achieve colossal size
 savings, essentially running through the same algorithm before
 actually linking things. Again it's easily separable step and can be
 an opt-in.

This assumes the (maybe external) linker knows how to take advantage of
the info. But IMO, linkers *should* be a lot smarter than they currently
are, so I don't see a problem with this as long as a dumb linker will
still produce a working executable (just without the space savings).

Alternatively we can have an external pre-link tool that scans a given
set of object files and eliminates duplicated code (by turning
duplicated symbols into external references in all but one of the
instances), before we hand the files off to the OS's native linker.
Come to think of it, this might be a good way to experiment with this
idea before we commit lots of effort into integrating it with a real
linker.


 4. Ironically the same exact thing works with any kind of immutable
 data structures. It looks like string pooling is superseded by this
 proposal.

[...]

Not really... string pooling can take advantage of overlapping
(sub)strings, but I don't think you can do that with code. But I think
your idea has a lot of merit. I'm for making linkers smarter than they
currently are.


T

-- 
It always amuses me that Windows has a Safe Mode during bootup. Does
that mean that Windows is normally unsafe?

Apr 08 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Sun, 8 Apr 2012 07:18:26 -0700
schrieb "H. S. Teoh" <hsteoh quickfur.ath.cx>:

 We'd have to make sure the checksum doesn't end up in the final
 executable though, otherwise the bloat may negate any gains we've made.

Executables (and object files) are made up mostly of sections, some of which
are 'special cased' to contain the code, zero initialized data, thread local
storage etc. and some user defined. The checksums would most probably end up in
their own section, like is already happening for debug info or comments. Using
a tool like strip you can remove any section by its name.

Once a linker knows how to use the checksums, it would strip them by default.

-- 
Marco

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 18:18, H. S. Teoh wrote:
[snip]
 1. Every time a function is generated (or pretty much any symbol)
 not only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick
 it to in the object file)

 We'd have to make sure the checksum doesn't end up in the final
 executable though, otherwise the bloat may negate any gains we've made.

Easy the symbol size is in object file (obviously) but it surely not 
present in the executable. If I worth anything legacy formats have a 
plenty of slack space reserved for future. Nwo is the future :)
[snip]
 3. After any function was generated compiler checks an entry in the
 duplicate table that matches size, followed by matching checksum and
 only then (if required) doing a straight memcmp. If it happens that
 there is a match compiler just throws generated code away and
 _aliases_ it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)

 I think you don't even need an alias table; IIRC the OS dynamic linker
 can easily handle symbols that have the same value (i.e. that point to
 the same place). All you have to do is to change the value of the
 duplicated symbols so that they all point to the same address.

Nice to know.
 [...]
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with
 inout. Yet there are plenty of places the exact same code is being
 generated: e.g. sealed containers of int vs uint, std.find on
 dchar[] vs uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all
 evil and it is inevitable since we are just humans after all.

 I'm not sure I understand the last sentence there, but duplicate code
 elimination is definitely a big plus. It will also mean that we can use
 templates more freely without having to worry about template bloat.

It's easy - define a template on type T. Code it up.
Now how many times you did consider that e.g. you can parametrize on the 
size of the type instead of the type itself? I'm making a point that 
it's almost always the case with sealed containers of PODs for instance.
(now multiply the argument for total number of parameters)

 Notes:
 1. If we do checksum calculation on the fly during codegen it gets
 at virtually no cost as the data is in CPU data cache. Preliminary
 version can avoid hacking this part of backend though.

 2. By _alias_ I mean the ability of compiler to emit references to a
 given symbol as if it was some other symbol (should be really
 straight forward).

 Like I said, I think this isn't even necessary, if the compiler can just
 generate the same value for the duplicated symbols.

OK, I'm not much into how *exactly* linker works these days. I know the 
basics though.


 4. Ironically the same exact thing works with any kind of immutable
 data structures. It looks like string pooling is superseded by this
 proposal.

 [...]

 Not really... string pooling can take advantage of overlapping
 (sub)strings, but I don't think you can do that with code. But I think
 your idea has a lot of merit. I'm for making linkers smarter than they
 currently are.

Sorry, it's just me running ahead of train somewhat. Basically once this 
initial version is in place I have one cool refinement for it in mind. 
For now we just need to keep the hash function transitive and 
associative for the gods sake. 128/64bit checksum please ;)

-- 
Dmitry Olshansky

Apr 08 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Apr 08, 2012 at 08:45:19PM +0400, Dmitry Olshansky wrote:
 On 08.04.2012 18:18, H. S. Teoh wrote:
 [snip]
We'd have to make sure the checksum doesn't end up in the final
executable though, otherwise the bloat may negate any gains we've
made.

 
 Easy the symbol size is in object file (obviously) but it surely not
 present in the executable. If I worth anything legacy formats have a
 plenty of slack space reserved for future. Nwo is the future :)

I agree!


[...]
Applicability:
It's not only const-immutable bloat, it can be alleviated with
inout. Yet there are plenty of places the exact same code is being
generated: e.g. sealed containers of int vs uint, std.find on
dchar[] vs uint[]/int[] an so on.
In general, the coarse grained parametrization is the root of all
evil and it is inevitable since we are just humans after all.

I'm not sure I understand the last sentence there, but duplicate code
elimination is definitely a big plus. It will also mean that we can
use templates more freely without having to worry about template
bloat.

 
 It's easy - define a template on type T. Code it up.
 Now how many times you did consider that e.g. you can parametrize on
 the size of the type instead of the type itself? I'm making a point
 that it's almost always the case with sealed containers of PODs for
 instance.
 (now multiply the argument for total number of parameters)

Yeah, that's what I was thinking of. This would be a very big gain for
the new AA implementation, for example. I wouldn't have to worry so much
about template bloat if most of the instantiations are going to get
merged anyway. :-)


[...]
Not really... string pooling can take advantage of overlapping
(sub)strings, but I don't think you can do that with code. But I
think your idea has a lot of merit. I'm for making linkers smarter
than they currently are.

 
 Sorry, it's just me running ahead of train somewhat. Basically once
 this initial version is in place I have one cool refinement for it in
 mind. For now we just need to keep the hash function transitive and
 associative for the gods sake. 128/64bit checksum please ;)

[...]

And what would the cool refinement be? :-)


T

-- 
It's bad luck to be superstitious. -- YHL

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 21:24, H. S. Teoh wrote:
 Yeah, that's what I was thinking of. This would be a very big gain for
 the new AA implementation, for example. I wouldn't have to worry so much
 about template bloat if most of the instantiations are going to get
 merged anyway. :-)

Right the advantage is that one doesn't have to use tricks. One can
simply assume the compiler is doing it's job. That's what happened with 
inlining some 10+ years ago.

 [...]
 Not really... string pooling can take advantage of overlapping
 (sub)strings, but I don't think you can do that with code. But I
 think your idea has a lot of merit. I'm for making linkers smarter
 than they currently are.

 Sorry, it's just me running ahead of train somewhat. Basically once
 this initial version is in place I have one cool refinement for it in
 mind. For now we just need to keep the hash function transitive and
 associative for the gods sake. 128/64bit checksum please ;)

 [...]

 And what would the cool refinement be? :-)

OK, you sort of forced my hand.
I hope you have been warned about spoilers before ;)


The refinement is merging prefixes and suffixes of course.
And for that one needs to calculate hashes for all of prefixes and all 
of suffixes. I will define _all_ later on.

First observation is that if you calculated partial checksums for 
prefixes you have sums for all suffixes and vise-versa.
Namely:
//prefix ends on i, suffix start on i
sum_prefix[i] = total - sum_suffix[i];

that is derived from the fact that:
total = sum_prefix[i] + sum_suffix[i];
(taking into account the properties of +/- in the finite field)

Now take the original idea, substitute checksum with an array of partial 
sums and lengths (btw lengths follow the same rule of prefix<->suffix) 
and you know what this refinement all about.

In fact this all was easy, the trick is fitting this into the time frame 
of compilation.

To that end one should do full duplicate elimination first then mess 
with merging.

Then we see that generally we are about to do O((M1*...*Mn)^2) work 
where n - is number of functions, Mi - number of prefixes (= suffixes) 
we defined for each. The constant C in front of (M1...Mn)^2 here is not 
that big - it's comparing checksums & lengths but *sometimes* memcmp 
still keep in mind that the number of all functions is big.

So we have to get around this monstrous workload. At the moment I see 
the only way is doing this is use coarse grained prefixes and introduce 
some threshold on maximum number of prefixes we account for. That about 
defining above mentioned _all_ for prefixes.

Coarse grained means we do store partial checksums on a fixed block 
boundary (say every 16 or 32 bytes) if that aligns properly with 
instructions if not we just skip this prefix and move on.
(this also hopefully limits memory usage on partial sums array)

Another threshold is that we don't mess with partial sums for really BIG 
functions. (might be not needed)

I think there is some space for other heuristics to try out but they are 
mostly along the same lines.

P.S. Damn, I could have done a nice paper on that... too late :)

-- 
Dmitry Olshansky

Apr 08 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 4/8/12 1:49 PM, Dmitry Olshansky wrote:
 P.S. Damn, I could have done a nice paper on that... too late :)

You may always do.

Andrei

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 22:49, Dmitry Olshansky wrote:
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all
 of suffixes. I will define _all_ later on.

 First observation is that if you calculated partial checksums for
 prefixes you have sums for all suffixes and vise-versa.
 Namely:
 //prefix ends on i, suffix start on i
 sum_prefix[i] = total - sum_suffix[i];

 that is derived from the fact that:
 total = sum_prefix[i] + sum_suffix[i];
 (taking into account the properties of +/- in the finite field)

Mm btw there is one step from here to use it on arbitrary common slices 
(i, j) where i < j:
slice_sum(i, j) = sum_prefix[j] - sum_prefix[i]

P.S. (Goes for walk citing a dynamic programming maxim)

-- 
Dmitry Olshansky

Apr 08 2012

"Daniel Murphy" <yebblies nospamgmail.com> writes:

"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message 
news:jlsmka$22ce$1 digitalmars.com...
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all of 
 suffixes. I will define _all_ later on.

I think you'll find that this is better done in the compiler instead of the 
linker.  Merging prefixes is problematic because at some point you will need 
to work out which tail to execute, so you'll always need to modify the 
generated code.

Merging suffixes is easier, you can merge all returning blocks with 
identical code, and then merge all blocks that always jump to the same 
blocks, etc.
This will need to happen after code generation if you want to merge int/uint 
code, which would be difficult in dmd's back end.

Merging functions with identical bodies is of course much easier, and can be 
done in the linker without needing to modify any code (just the 
relocations).

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 09.04.2012 5:11, Daniel Murphy wrote:
 "Dmitry Olshansky"<dmitry.olsh gmail.com>  wrote in message
 news:jlsmka$22ce$1 digitalmars.com...
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all of
 suffixes. I will define _all_ later on.

 I think you'll find that this is better done in the compiler instead of the
 linker.  Merging prefixes is problematic because at some point you will need
 to work out which tail to execute, so you'll always need to modify the
 generated code.

"Easy": just add a hidden pointer argument to functions that have merged 
prefix (call it dispatch). The prefix part of code is followed by an 
indirect jump to this pointer.
Compiler arranges so that every time function is called the correct 
dispatch address is passed behind the scenes.

BTW there are no extra checks and such it's one naked indirect jump, and 
it's totally predictable unlike say switch jump.
(well unless you use a few copy-paste-susceptible functions in the same 
loop that turn out to have prefixes merged)

It still implies that prefix merging should be applied with more care 
then suffix merging. Once this tested and working even merging arbitrary 
parts of functions is doable with this approach.

 Merging suffixes is easier, you can merge all returning blocks with
 identical code, and then merge all blocks that always jump to the same
 blocks, etc.
 This will need to happen after code generation if you want to merge int/uint
 code, which would be difficult in dmd's back end.

Any chance to fit this into IR-->CodeGen step? Like use alternative 
comparison then memcmp. Or better do basically the same algorithm but on 
map!(....)(IR) that morphs things so that some identical ops (e.g. 
uint/int == and !=) are considered the same. In fact this might be even 
faster then generating useless machine code!

 Merging functions with identical bodies is of course much easier, and can be
 done in the linker without needing to modify any code (just the
 relocations).


-- 
Dmitry Olshansky

Apr 09 2012

Somedude <lovelydear mailmetrash.com> writes:

Le 08/04/2012 16:18, H. S. Teoh a �crit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

 
 This would be incompatible with how current (non-dmd) linkers work. But
 I do like the idea. Perhaps if it works well, other linkers will adopt
 it? (Just like how the gcc linker adopted duplicate template code
 elimination due to C++ templates.)
 
 T
 

Actually, in C++ (as well as D), the added benefit would be a greatly
improved compilation speed, wouldn't it ?
I bet if the idea works in D and proves increased compilation, compiler
writers would be very compelled to implement it in C++.

Apr 08 2012

Artur Skawina <art.08.09 gmail.com> writes:

On 04/09/12 08:21, Somedude wrote:
 Le 08/04/2012 16:18, H. S. Teoh a écrit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

 This would be incompatible with how current (non-dmd) linkers work. But
 I do like the idea. Perhaps if it works well, other linkers will adopt
 it? (Just like how the gcc linker adopted duplicate template code
 elimination due to C++ templates.)

 T

 
 Actually, in C++ (as well as D), the added benefit would be a greatly
 improved compilation speed, wouldn't it ?
 I bet if the idea works in D and proves increased compilation, compiler
 writers would be very compelled to implement it in C++.
 

They already do.

It's a very simple and trivial optimization, the question is only about
programmer expectations. Every (memory) object having an unique address
*is* a valuable feature with clear benefits. (C++ has functions as
non-objects, that's why the compilers can get away with the optimization)
Note that that does not actually mean that everything has to be placed
at an unique address -- it only needs to behave *AS IF*, as long as the
program can't tell the difference.


On 04/09/12 02:59, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

 
 Or use a nop slide before the start of the function.  Since we're modifying 
 the object file format anyway, it would be trivial for the compiler to mark 
 functions which have their address taken as needing a unique address. 

Nice idea. Given todays amounts of alignment noops emitted it would usually
be completely free.

But I now think the optimization would be ok, and should even on by default
for the case where the identical code sequence was generated from an
identical token sequence. That would handle the template bloat issue while
avoiding most of the problems; having non-unique addresses for this case
should be harmless and would just need to be properly documented.

It's only the random-completely-unrelated-function-replacement that is 
problematic - think such functions randomly appearing in the call chain,
confusing both downstream code and programmers looking at backtraces or
perf profiles, and breakpoints that magically appear out of nowhere at random.

artur

Apr 09 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Apr 09, 2012 at 08:21:08AM +0200, Somedude wrote:
 Le 08/04/2012 16:18, H. S. Teoh a �crit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

 
 This would be incompatible with how current (non-dmd) linkers work. But
 I do like the idea. Perhaps if it works well, other linkers will adopt
 it? (Just like how the gcc linker adopted duplicate template code
 elimination due to C++ templates.)
 
 T
 

 
 Actually, in C++ (as well as D), the added benefit would be a greatly
 improved compilation speed, wouldn't it ?
 I bet if the idea works in D and proves increased compilation, compiler
 writers would be very compelled to implement it in C++.

Exactly my point. I *want* to give incentive to toolchain devs to add
these kinds of enhancements to linkers in general.


T

-- 
Why is it that all of the instruments seeking intelligent life in the universe
are pointed away from Earth? -- Michael Beibl

Apr 09 2012

Artur Skawina <art.08.09 gmail.com> writes:

On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

[...]
 Thoughts?

Don't forget that this needs to work:   

   static auto f(T)(T a) { return a; }
   assert(cast(void*)&f!int!=cast(void*)&f!uint);

artur

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 Don't forget that this needs to work:

     static auto f(T)(T a) { return a; }
     assert(cast(void*)&f!int!=cast(void*)&f!uint);

 artur

A reference to spec page plz.

-- 
Dmitry Olshansky

Apr 08 2012

Artur Skawina <art.08.09 gmail.com> writes:

On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 Don't forget that this needs to work:

     static auto f(T)(T a) { return a; }
     assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

A reference to *a* D spec, please...

There isn't one, but that does not mean that common sense does
not need to apply.

Do you really suggest making different template instantiations
identical, just because the compiler happened to emit the same
code? The situation is not very different from 

   const int a = 1; const uint b = 1; assert(&a!=&b);

The user can take the addresses, compare them, use as AA keys, 
set breakpoints etc.

Note that my point is just that the compiler needs to emit a dummy
so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

that only is used when taking the address. Even calling f!int()
instead of the uint version could be acceptable, as long as there
is a compiler option to turn this optimization off (think breakpoints).

artur

Apr 08 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 4/8/12 10:59 AM, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 Don't forget that this needs to work:

      static auto f(T)(T a) { return a; }
      assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

 A reference to *a* D spec, please...

 There isn't one, but that does not mean that common sense does
 not need to apply.

Doesn't apply to C++.

Andrei

Apr 08 2012

Artur Skawina <art.08.09 gmail.com> writes:

On 04/08/12 18:14, Andrei Alexandrescu wrote:
 On 4/8/12 10:59 AM, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 Don't forget that this needs to work:

      static auto f(T)(T a) { return a; }
      assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

 A reference to *a* D spec, please...

 There isn't one, but that does not mean that common sense does
 not need to apply.

 
 Doesn't apply to C++.

Well, let's say it isn't specified. "Functions-are-not-objects" allows
the compilers that merge the bodies to get away with it, because it's
not explicitly illegal. But that does not mean that returning a different
random pointer, which just happens to point to another identical code
sequence, is a good idea.


On 04/08/12 18:30, Dmitry Olshansky wrote:
 There is common sense and that is (in my book):
 don't tie up compiler's hands for no benefit.

In this case the benefit from not requiring unique function addresses
isn't very large (it only matters iff the address is taken).

Hmm, as long as the returned pointer points to functionally identical
code, often everything will still be fine. In other cases doing this
optimization *at all* will cause trouble. And i think it should be
done (the benefits are obvious), An "unique" function attribute would
be the solution; that would let us eat the cake, except when we want
to have it. :)

artur

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 19:59, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 Don't forget that this needs to work:

      static auto f(T)(T a) { return a; }
      assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

 A reference to *a* D spec, please...

Yeah, sorry.
 There isn't one, but that does not mean that common sense does
 not need to apply.

There is common sense and that is (in my book):
don't tie up compiler's hands for no benefit.

 Do you really suggest making different template instantiations
 identical, just because the compiler happened to emit the same
 code? The situation is not very different from

     const int a = 1; const uint b = 1; assert(&a!=&b);

I wouldn't expect this assert to always hold. Moreover (all things being 
equal) I would expect taking address of constant integers a poor practice.

 The user can take the addresses, compare them, use as AA keys,
 set breakpoints etc.

Yes, that's what I call poor practice ( I mean function pointer as AA 
_key_, seriously?). As for breakpoints, obviously one debugs 
non-optimized program (at least most of the time).

 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

     module.f!uint:
         jmp module.f!int

Could work as a conservative option. But I don't think it's justified.

 that only is used when taking the address. Even calling f!int()
 instead of the uint version could be acceptable, as long as there
 is a compiler option to turn this optimization off (think breakpoints).

Yup, that's what optimizations are.

-- 
Dmitry Olshansky

Apr 08 2012

"Daniel Murphy" <yebblies nospamgmail.com> writes:

"Artur Skawina" <art.08.09 gmail.com> wrote in message 
news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

Or use a nop slide before the start of the function.  Since we're modifying 
the object file format anyway, it would be trivial for the compiler to mark 
functions which have their address taken as needing a unique address.

Apr 08 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Apr 09, 2012 at 10:59:26AM +1000, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

 
 Or use a nop slide before the start of the function.  Since we're
 modifying the object file format anyway, it would be trivial for the
 compiler to mark functions which have their address taken as needing a
 unique address. 

[...]

Why is it so important to have unique addresses for functions?


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL

Apr 08 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Sun, 8 Apr 2012 19:14:22 -0700
schrieb "H. S. Teoh" <hsteoh quickfur.ath.cx>:

 On Mon, Apr 09, 2012 at 10:59:26AM +1000, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

 
 Or use a nop slide before the start of the function.  Since we're
 modifying the object file format anyway, it would be trivial for the
 compiler to mark functions which have their address taken as needing a
 unique address. 

 [...]
 
 Why is it so important to have unique addresses for functions?
 
 
 T

I would like to know that use case as well, especially since I learned coding
with the "copy&paste code is bad" phrase. It just seems odd to me to expect any
constant data/code to have unique addresses when they can be merged or
overlapped.

-- 
Marco

Apr 08 2012

"Daniel Murphy" <yebblies nospamgmail.com> writes:

"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.1518.1333937643.4860.digitalmars-d puremagic.com...
 Why is it so important to have unique addresses for functions?

Just because I can't think of a use case doesn't mean nobody is relying on 
it!

But I guess there really isn't one.

Apr 09 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Apr 09, 2012 at 11:58:01PM +1000, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.1518.1333937643.4860.digitalmars-d puremagic.com...
 Why is it so important to have unique addresses for functions?

 
 Just because I can't think of a use case doesn't mean nobody is
 relying on it!
 
 But I guess there really isn't one. 

[...]

Somebody brought up the matter of stacktraces. Which could be a valid
concern, I suppose, although I'm tempted to just say, use a
non-optimized build for debugging purposes. (But I suppose that is
arguable.)


T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs,
they'd be algorithms.

Apr 09 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Sun, 08 Apr 2012 16:21:14 +0200
schrieb Artur Skawina <art.08.09 gmail.com>:

 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 [...]
 Thoughts?

 
 Don't forget that this needs to work:   
 
    static auto f(T)(T a) { return a; }
    assert(cast(void*)&f!int!=cast(void*)&f!uint);
 
 artur

Do you actually rely on that behavior? It is the same as asking this to work, I
think:

string a = "abc";
string b = "abcdef";
assert(a.ptr !is b.ptr);

There should be ways to logically check the unequality of your two functions,
not by comparing their memory addresses.

-- 
Marco

Apr 08 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in compiler vendors)
 does what I think is rather simple optimization.

I worked out how to do it a while ago, but there's been no time to implement it.

(You can't do a memcmp because of all the relocations.)

The main difficulty is not being able to modify the linker. So you're pretty 
much limited to what the compiler is able to do before linking. D does allow
the 
compiler to deal with all the modules at compile time, so this helps a lot.

Apr 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08.04.2012 22:51, Walter Bright wrote:
 On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in compiler
 vendors)
 does what I think is rather simple optimization.

 I worked out how to do it a while ago, but there's been no time to
 implement it.

 (You can't do a memcmp because of all the relocations.)

Yes, I thought there must have been some problem with it.

 The main difficulty is not being able to modify the linker. So you're
 pretty much limited to what the compiler is able to do before linking. D
 does allow the compiler to deal with all the modules at compile time, so
 this helps a lot.

I'm thinking what if we kind of do it as an extension so that your 
linker (e.g. optlink) knows about these checksums (or whatever you use 
instead) while normal linker just works the way it always did.

-- 
Dmitry Olshansky

Apr 08 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Apr 08, 2012 at 10:56:43PM +0400, Dmitry Olshansky wrote:
 On 08.04.2012 22:51, Walter Bright wrote:

[...]
The main difficulty is not being able to modify the linker. So you're
pretty much limited to what the compiler is able to do before
linking. D does allow the compiler to deal with all the modules at
compile time, so this helps a lot.

 
 I'm thinking what if we kind of do it as an extension so that your
 linker (e.g. optlink) knows about these checksums (or whatever you use
 instead) while normal linker just works the way it always did.

[...]

I agree. Either that, or have an external utility for doing it.

I must say that IMO current linker technology is quite dumb, and no harm
could come from experimenting with smarter linkers.  Things like
duplicate template instantiation elimination have been introduced due to
the influence of C++; I don't see why D shouldn't introduce its own
advancements to linkers too. :-)


T

-- 
Laissez-faire is a French term commonly interpreted by Conservatives to
mean 'lazy fairy,' which is the belief that if governments are lazy
enough, the Good Fairy will come down from heaven and do all their work
for them.

Apr 08 2012

D Programming

C/C++ Programming

Other

digitalmars.D - A modest proposal: eliminate template code bloat