www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - A modest proposal: eliminate template code bloat

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
I think it's been ages since I meant to ask why nobody (as in compiler 
vendors) does what I think is rather simple optimization.

In the short term the plan is to introduce a "link-time" flavored 
optimization at code generation or (better) link step.

For simplicity let's assume compiler does all of the following during 
generation of an object file.

1. Every time a function is generated (or pretty much any symbol) not 
only a size calculated but also a checksum* of it's data.
(If we go for link-time optimization we should find a place to stick it 
to in the object file)

2. Compiler maintains a hash-table of symbol_size ---> list( ~ array) of 
pairs (references to data, checksum) of all symbols with given size. 
Call it a duplicate table. Every function generated and more generally 
global immutable data should end up there.

3. After any function was generated compiler checks an entry in the 
duplicate table that matches size, followed by matching checksum and 
only then (if required) doing a straight memcmp. If it happens that 
there is a match compiler just throws generated code away and _aliases_ 
it's symbol to that of a matched entry.
(so there has to be an alias table if there isn't one already)

*Yes, checksum. I think it should be real simple and easy to parallel 
hash function. The original checksum is no reliable but some amount 
balancing and profiling are obviously required when picking this function.

Applicability:
It's not only const-immutable bloat, it can be alleviated with inout. 
Yet there are plenty of places the exact same code is being generated: 
e.g. sealed containers of int vs uint, std.find on dchar[] vs 
uint[]/int[] an so on.
In general, the coarse grained parametrization is the root of all evil 
and it is inevitable since we are just humans after all.

Notes:
1. If we do checksum calculation on the fly during codegen it gets at 
virtually no cost as the data is in CPU data cache. Preliminary version 
can avoid hacking this part of backend though.

2. By _alias_ I mean the ability of compiler to emit references to a 
given symbol as if it was some other symbol (should be really straight 
forward).

3. Linker have more data and is able to achieve colossal size savings, 
essentially running through the same algorithm before actually linking 
things. Again it's easily separable step and can be an opt-in.

4. Ironically the same exact thing works with any kind of immutable data 
structures. It looks like string pooling is superseded by this proposal.

Thoughts?

-- 
Dmitry Olshansky
Apr 08 2012
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 08 Apr 2012 15:01:56 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 I think it's been ages since I meant to ask why nobody (as in compiler 
 vendors) does what I think is rather simple optimization.
 
 In the short term the plan is to introduce a "link-time" flavored 
 optimization at code generation or (better) link step.
 
 For simplicity let's assume compiler does all of the following during 
 generation of an object file.
 
 1. Every time a function is generated (or pretty much any symbol) not 
 only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick it 
 to in the object file)
 
 2. Compiler maintains a hash-table of symbol_size ---> list( ~ array) of 
 pairs (references to data, checksum) of all symbols with given size. 
 Call it a duplicate table. Every function generated and more generally 
 global immutable data should end up there.
 
 3. After any function was generated compiler checks an entry in the 
 duplicate table that matches size, followed by matching checksum and 
 only then (if required) doing a straight memcmp. If it happens that 
 there is a match compiler just throws generated code away and _aliases_ 
 it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)
 
 *Yes, checksum. I think it should be real simple and easy to parallel 
 hash function. The original checksum is no reliable but some amount 
 balancing and profiling are obviously required when picking this function.
 
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with inout. 
 Yet there are plenty of places the exact same code is being generated: 
 e.g. sealed containers of int vs uint, std.find on dchar[] vs 
 uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all evil 
 and it is inevitable since we are just humans after all.
 
 Notes:
 1. If we do checksum calculation on the fly during codegen it gets at 
 virtually no cost as the data is in CPU data cache. Preliminary version 
 can avoid hacking this part of backend though.
 
 2. By _alias_ I mean the ability of compiler to emit references to a 
 given symbol as if it was some other symbol (should be really straight 
 forward).
 
 3. Linker have more data and is able to achieve colossal size savings, 
 essentially running through the same algorithm before actually linking 
 things. Again it's easily separable step and can be an opt-in.
 
 4. Ironically the same exact thing works with any kind of immutable data 
 structures. It looks like string pooling is superseded by this proposal.
 
 Thoughts?

Thoughts? Nothing much. I thought of that a while ago, but as an external program, that finds function calls by disassembling and removing dead/duplicate code. So I agree with you. A similar feature is a CTFE cache (or general code cache) that checksums a function's source code and gets the compiled version from a cache. Template bloat could be especially important to 'fix' on embedded systems. But I don't consider it important enough at the moment. :/ Let's wait till the bugs and important features are implemented or hack the compiler ourselves. -- Marco
Apr 08 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 16:37, Marco Leise wrote:
[snip]
 Template bloat could be especially important to 'fix' on embedded systems.

I think I this idea largely formed years ago when I was working with c++ on 8bit micros. You won't believe the amount of code size one can save by using one separate generic save-it-all-then-load-it-all prolog/epilog for all functions (and esp ISRs).
  Let's wait till the bugs and important features are implemented or hack the
compiler ourselves.

Let's hack the compiler ;) BTW I think it should be possible to apply the idea on the IR level not on the actual machine code. -- Dmitry Olshansky
Apr 08 2012
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.
 
 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

This would be incompatible with how current (non-dmd) linkers work. But I do like the idea. Perhaps if it works well, other linkers will adopt it? (Just like how the gcc linker adopted duplicate template code elimination due to C++ templates.)
 For simplicity let's assume compiler does all of the following
 during generation of an object file.
 
 1. Every time a function is generated (or pretty much any symbol)
 not only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick
 it to in the object file)

We'd have to make sure the checksum doesn't end up in the final executable though, otherwise the bloat may negate any gains we've made.
 2. Compiler maintains a hash-table of symbol_size ---> list( ~
 array) of pairs (references to data, checksum) of all symbols with
 given size. Call it a duplicate table. Every function generated and
 more generally global immutable data should end up there.
 
 3. After any function was generated compiler checks an entry in the
 duplicate table that matches size, followed by matching checksum and
 only then (if required) doing a straight memcmp. If it happens that
 there is a match compiler just throws generated code away and
 _aliases_ it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)

I think you don't even need an alias table; IIRC the OS dynamic linker can easily handle symbols that have the same value (i.e. that point to the same place). All you have to do is to change the value of the duplicated symbols so that they all point to the same address. [...]
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with
 inout. Yet there are plenty of places the exact same code is being
 generated: e.g. sealed containers of int vs uint, std.find on
 dchar[] vs uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all
 evil and it is inevitable since we are just humans after all.

I'm not sure I understand the last sentence there, but duplicate code elimination is definitely a big plus. It will also mean that we can use templates more freely without having to worry about template bloat.
 Notes:
 1. If we do checksum calculation on the fly during codegen it gets
 at virtually no cost as the data is in CPU data cache. Preliminary
 version can avoid hacking this part of backend though.
 
 2. By _alias_ I mean the ability of compiler to emit references to a
 given symbol as if it was some other symbol (should be really
 straight forward).

Like I said, I think this isn't even necessary, if the compiler can just generate the same value for the duplicated symbols.
 3. Linker have more data and is able to achieve colossal size
 savings, essentially running through the same algorithm before
 actually linking things. Again it's easily separable step and can be
 an opt-in.

This assumes the (maybe external) linker knows how to take advantage of the info. But IMO, linkers *should* be a lot smarter than they currently are, so I don't see a problem with this as long as a dumb linker will still produce a working executable (just without the space savings). Alternatively we can have an external pre-link tool that scans a given set of object files and eliminates duplicated code (by turning duplicated symbols into external references in all but one of the instances), before we hand the files off to the OS's native linker. Come to think of it, this might be a good way to experiment with this idea before we commit lots of effort into integrating it with a real linker.
 4. Ironically the same exact thing works with any kind of immutable
 data structures. It looks like string pooling is superseded by this
 proposal.

Not really... string pooling can take advantage of overlapping (sub)strings, but I don't think you can do that with code. But I think your idea has a lot of merit. I'm for making linkers smarter than they currently are. T -- It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?
Apr 08 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 18:18, H. S. Teoh wrote:
[snip]
 1. Every time a function is generated (or pretty much any symbol)
 not only a size calculated but also a checksum* of it's data.
 (If we go for link-time optimization we should find a place to stick
 it to in the object file)

We'd have to make sure the checksum doesn't end up in the final executable though, otherwise the bloat may negate any gains we've made.

Easy the symbol size is in object file (obviously) but it surely not present in the executable. If I worth anything legacy formats have a plenty of slack space reserved for future. Nwo is the future :) [snip]
 3. After any function was generated compiler checks an entry in the
 duplicate table that matches size, followed by matching checksum and
 only then (if required) doing a straight memcmp. If it happens that
 there is a match compiler just throws generated code away and
 _aliases_ it's symbol to that of a matched entry.
 (so there has to be an alias table if there isn't one already)

I think you don't even need an alias table; IIRC the OS dynamic linker can easily handle symbols that have the same value (i.e. that point to the same place). All you have to do is to change the value of the duplicated symbols so that they all point to the same address.

Nice to know.
 [...]
 Applicability:
 It's not only const-immutable bloat, it can be alleviated with
 inout. Yet there are plenty of places the exact same code is being
 generated: e.g. sealed containers of int vs uint, std.find on
 dchar[] vs uint[]/int[] an so on.
 In general, the coarse grained parametrization is the root of all
 evil and it is inevitable since we are just humans after all.

I'm not sure I understand the last sentence there, but duplicate code elimination is definitely a big plus. It will also mean that we can use templates more freely without having to worry about template bloat.

It's easy - define a template on type T. Code it up. Now how many times you did consider that e.g. you can parametrize on the size of the type instead of the type itself? I'm making a point that it's almost always the case with sealed containers of PODs for instance. (now multiply the argument for total number of parameters)
 Notes:
 1. If we do checksum calculation on the fly during codegen it gets
 at virtually no cost as the data is in CPU data cache. Preliminary
 version can avoid hacking this part of backend though.

 2. By _alias_ I mean the ability of compiler to emit references to a
 given symbol as if it was some other symbol (should be really
 straight forward).

Like I said, I think this isn't even necessary, if the compiler can just generate the same value for the duplicated symbols.

OK, I'm not much into how *exactly* linker works these days. I know the basics though.
 4. Ironically the same exact thing works with any kind of immutable
 data structures. It looks like string pooling is superseded by this
 proposal.

Not really... string pooling can take advantage of overlapping (sub)strings, but I don't think you can do that with code. But I think your idea has a lot of merit. I'm for making linkers smarter than they currently are.

Sorry, it's just me running ahead of train somewhat. Basically once this initial version is in place I have one cool refinement for it in mind. For now we just need to keep the hash function transitive and associative for the gods sake. 128/64bit checksum please ;) -- Dmitry Olshansky
Apr 08 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 21:24, H. S. Teoh wrote:
 Yeah, that's what I was thinking of. This would be a very big gain for
 the new AA implementation, for example. I wouldn't have to worry so much
 about template bloat if most of the instantiations are going to get
 merged anyway. :-)

Right the advantage is that one doesn't have to use tricks. One can simply assume the compiler is doing it's job. That's what happened with inlining some 10+ years ago.
 [...]
 Not really... string pooling can take advantage of overlapping
 (sub)strings, but I don't think you can do that with code. But I
 think your idea has a lot of merit. I'm for making linkers smarter
 than they currently are.

Sorry, it's just me running ahead of train somewhat. Basically once this initial version is in place I have one cool refinement for it in mind. For now we just need to keep the hash function transitive and associative for the gods sake. 128/64bit checksum please ;)

And what would the cool refinement be? :-)

I hope you have been warned about spoilers before ;) The refinement is merging prefixes and suffixes of course. And for that one needs to calculate hashes for all of prefixes and all of suffixes. I will define _all_ later on. First observation is that if you calculated partial checksums for prefixes you have sums for all suffixes and vise-versa. Namely: //prefix ends on i, suffix start on i sum_prefix[i] = total - sum_suffix[i]; that is derived from the fact that: total = sum_prefix[i] + sum_suffix[i]; (taking into account the properties of +/- in the finite field) Now take the original idea, substitute checksum with an array of partial sums and lengths (btw lengths follow the same rule of prefix<->suffix) and you know what this refinement all about. In fact this all was easy, the trick is fitting this into the time frame of compilation. To that end one should do full duplicate elimination first then mess with merging. Then we see that generally we are about to do O((M1*...*Mn)^2) work where n - is number of functions, Mi - number of prefixes (= suffixes) we defined for each. The constant C in front of (M1...Mn)^2 here is not that big - it's comparing checksums & lengths but *sometimes* memcmp still keep in mind that the number of all functions is big. So we have to get around this monstrous workload. At the moment I see the only way is doing this is use coarse grained prefixes and introduce some threshold on maximum number of prefixes we account for. That about defining above mentioned _all_ for prefixes. Coarse grained means we do store partial checksums on a fixed block boundary (say every 16 or 32 bytes) if that aligns properly with instructions if not we just skip this prefix and move on. (this also hopefully limits memory usage on partial sums array) Another threshold is that we don't mess with partial sums for really BIG functions. (might be not needed) I think there is some space for other heuristics to try out but they are mostly along the same lines. P.S. Damn, I could have done a nice paper on that... too late :) -- Dmitry Olshansky
Apr 08 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/8/12 1:49 PM, Dmitry Olshansky wrote:
 P.S. Damn, I could have done a nice paper on that... too late :)

You may always do. Andrei
Apr 08 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 22:49, Dmitry Olshansky wrote:
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all
 of suffixes. I will define _all_ later on.

 First observation is that if you calculated partial checksums for
 prefixes you have sums for all suffixes and vise-versa.
 Namely:
 //prefix ends on i, suffix start on i
 sum_prefix[i] = total - sum_suffix[i];

 that is derived from the fact that:
 total = sum_prefix[i] + sum_suffix[i];
 (taking into account the properties of +/- in the finite field)

Mm btw there is one step from here to use it on arbitrary common slices (i, j) where i < j: slice_sum(i, j) = sum_prefix[j] - sum_prefix[i] P.S. (Goes for walk citing a dynamic programming maxim) -- Dmitry Olshansky
Apr 08 2012
prev sibling parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message 
news:jlsmka$22ce$1 digitalmars.com...
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all of 
 suffixes. I will define _all_ later on.

I think you'll find that this is better done in the compiler instead of the linker. Merging prefixes is problematic because at some point you will need to work out which tail to execute, so you'll always need to modify the generated code. Merging suffixes is easier, you can merge all returning blocks with identical code, and then merge all blocks that always jump to the same blocks, etc. This will need to happen after code generation if you want to merge int/uint code, which would be difficult in dmd's back end. Merging functions with identical bodies is of course much easier, and can be done in the linker without needing to modify any code (just the relocations).
Apr 08 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 09.04.2012 5:11, Daniel Murphy wrote:
 "Dmitry Olshansky"<dmitry.olsh gmail.com>  wrote in message
 news:jlsmka$22ce$1 digitalmars.com...
 The refinement is merging prefixes and suffixes of course.
 And for that one needs to calculate hashes for all of prefixes and all of
 suffixes. I will define _all_ later on.

I think you'll find that this is better done in the compiler instead of the linker. Merging prefixes is problematic because at some point you will need to work out which tail to execute, so you'll always need to modify the generated code.

"Easy": just add a hidden pointer argument to functions that have merged prefix (call it dispatch). The prefix part of code is followed by an indirect jump to this pointer. Compiler arranges so that every time function is called the correct dispatch address is passed behind the scenes. BTW there are no extra checks and such it's one naked indirect jump, and it's totally predictable unlike say switch jump. (well unless you use a few copy-paste-susceptible functions in the same loop that turn out to have prefixes merged) It still implies that prefix merging should be applied with more care then suffix merging. Once this tested and working even merging arbitrary parts of functions is doable with this approach.
 Merging suffixes is easier, you can merge all returning blocks with
 identical code, and then merge all blocks that always jump to the same
 blocks, etc.
 This will need to happen after code generation if you want to merge int/uint
 code, which would be difficult in dmd's back end.

Any chance to fit this into IR-->CodeGen step? Like use alternative comparison then memcmp. Or better do basically the same algorithm but on map!(....)(IR) that morphs things so that some identical ops (e.g. uint/int == and !=) are considered the same. In fact this might be even faster then generating useless machine code!
 Merging functions with identical bodies is of course much easier, and can be
 done in the linker without needing to modify any code (just the
 relocations).

-- Dmitry Olshansky
Apr 09 2012
prev sibling next sibling parent Somedude <lovelydear mailmetrash.com> writes:
Le 08/04/2012 16:18, H. S. Teoh a écrit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

This would be incompatible with how current (non-dmd) linkers work. But I do like the idea. Perhaps if it works well, other linkers will adopt it? (Just like how the gcc linker adopted duplicate template code elimination due to C++ templates.) T

Actually, in C++ (as well as D), the added benefit would be a greatly improved compilation speed, wouldn't it ? I bet if the idea works in D and proves increased compilation, compiler writers would be very compelled to implement it in C++.
Apr 08 2012
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/09/12 08:21, Somedude wrote:
 Le 08/04/2012 16:18, H. S. Teoh a écrit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

This would be incompatible with how current (non-dmd) linkers work. But I do like the idea. Perhaps if it works well, other linkers will adopt it? (Just like how the gcc linker adopted duplicate template code elimination due to C++ templates.) T

Actually, in C++ (as well as D), the added benefit would be a greatly improved compilation speed, wouldn't it ? I bet if the idea works in D and proves increased compilation, compiler writers would be very compelled to implement it in C++.

They already do. It's a very simple and trivial optimization, the question is only about programmer expectations. Every (memory) object having an unique address *is* a valuable feature with clear benefits. (C++ has functions as non-objects, that's why the compilers can get away with the optimization) Note that that does not actually mean that everything has to be placed at an unique address -- it only needs to behave *AS IF*, as long as the program can't tell the difference. On 04/09/12 02:59, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

Or use a nop slide before the start of the function. Since we're modifying the object file format anyway, it would be trivial for the compiler to mark functions which have their address taken as needing a unique address.

Nice idea. Given todays amounts of alignment noops emitted it would usually be completely free. But I now think the optimization would be ok, and should even on by default for the case where the identical code sequence was generated from an identical token sequence. That would handle the template bloat issue while avoiding most of the problems; having non-unique addresses for this case should be harmless and would just need to be properly documented. It's only the random-completely-unrelated-function-replacement that is problematic - think such functions randomly appearing in the call chain, confusing both downstream code and programmers looking at backtraces or perf profiles, and breakpoints that magically appear out of nowhere at random. artur
Apr 09 2012
prev sibling next sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint); artur
Apr 08 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint); artur

A reference to spec page plz. -- Dmitry Olshansky
Apr 08 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/8/12 10:59 AM, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

A reference to *a* D spec, please... There isn't one, but that does not mean that common sense does not need to apply.

Doesn't apply to C++. Andrei
Apr 08 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 19:59, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

A reference to *a* D spec, please...

Yeah, sorry.
 There isn't one, but that does not mean that common sense does
 not need to apply.

There is common sense and that is (in my book): don't tie up compiler's hands for no benefit.
 Do you really suggest making different template instantiations
 identical, just because the compiler happened to emit the same
 code? The situation is not very different from

     const int a = 1; const uint b = 1; assert(&a!=&b);

I wouldn't expect this assert to always hold. Moreover (all things being equal) I would expect taking address of constant integers a poor practice.
 The user can take the addresses, compare them, use as AA keys,
 set breakpoints etc.

Yes, that's what I call poor practice ( I mean function pointer as AA _key_, seriously?). As for breakpoints, obviously one debugs non-optimized program (at least most of the time).
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

     module.f!uint:
         jmp module.f!int

Could work as a conservative option. But I don't think it's justified.
 that only is used when taking the address. Even calling f!int()
 instead of the uint version could be acceptable, as long as there
 is a compiler option to turn this optimization off (think breakpoints).

Yup, that's what optimizations are. -- Dmitry Olshansky
Apr 08 2012
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/08/12 18:14, Andrei Alexandrescu wrote:
 On 4/8/12 10:59 AM, Artur Skawina wrote:
 On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

A reference to *a* D spec, please... There isn't one, but that does not mean that common sense does not need to apply.

Doesn't apply to C++.

Well, let's say it isn't specified. "Functions-are-not-objects" allows the compilers that merge the bodies to get away with it, because it's not explicitly illegal. But that does not mean that returning a different random pointer, which just happens to point to another identical code sequence, is a good idea. On 04/08/12 18:30, Dmitry Olshansky wrote:
 There is common sense and that is (in my book):
 don't tie up compiler's hands for no benefit.

In this case the benefit from not requiring unique function addresses isn't very large (it only matters iff the address is taken). Hmm, as long as the returned pointer points to functionally identical code, often everything will still be fine. In other cases doing this optimization *at all* will cause trouble. And i think it should be done (the benefits are obvious), An "unique" function attribute would be the solution; that would let us eat the cake, except when we want to have it. :) artur
Apr 08 2012
prev sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 04/08/12 17:20, Dmitry Olshansky wrote:
 On 08.04.2012 18:21, Artur Skawina wrote:
 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint);


 A reference to spec page plz.

A reference to *a* D spec, please... There isn't one, but that does not mean that common sense does not need to apply. Do you really suggest making different template instantiations identical, just because the compiler happened to emit the same code? The situation is not very different from const int a = 1; const uint b = 1; assert(&a!=&b); The user can take the addresses, compare them, use as AA keys, set breakpoints etc. Note that my point is just that the compiler needs to emit a dummy so that the addresses remain unique, eg module.f!uint: jmp module.f!int that only is used when taking the address. Even calling f!int() instead of the uint version could be acceptable, as long as there is a compiler option to turn this optimization off (think breakpoints). artur
Apr 08 2012
parent "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Artur Skawina" <art.08.09 gmail.com> wrote in message 
news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

Or use a nop slide before the start of the function. Since we're modifying the object file format anyway, it would be trivial for the compiler to mark functions which have their address taken as needing a unique address.
Apr 08 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 8 Apr 2012 07:18:26 -0700
schrieb "H. S. Teoh" <hsteoh quickfur.ath.cx>:

 We'd have to make sure the checksum doesn't end up in the final
 executable though, otherwise the bloat may negate any gains we've made.

Executables (and object files) are made up mostly of sections, some of which are 'special cased' to contain the code, zero initialized data, thread local storage etc. and some user defined. The checksums would most probably end up in their own section, like is already happening for debug info or comments. Using a tool like strip you can remove any section by its name. Once a linker knows how to use the checksums, it would strip them by default. -- Marco
Apr 08 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 08 Apr 2012 16:21:14 +0200
schrieb Artur Skawina <art.08.09 gmail.com>:

 On 04/08/12 13:01, Dmitry Olshansky wrote:
 3. After any function was generated compiler checks an entry in the duplicate
table that matches size, followed by matching checksum and only then (if
required) doing a straight memcmp. If it happens that there is a match compiler
just throws generated code away and _aliases_ it's symbol to that of a matched
entry.
 (so there has to be an alias table if there isn't one already)

 Thoughts?

Don't forget that this needs to work: static auto f(T)(T a) { return a; } assert(cast(void*)&f!int!=cast(void*)&f!uint); artur

Do you actually rely on that behavior? It is the same as asking this to work, I think: string a = "abc"; string b = "abcdef"; assert(a.ptr !is b.ptr); There should be ways to logically check the unequality of your two functions, not by comparing their memory addresses. -- Marco
Apr 08 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Apr 08, 2012 at 08:45:19PM +0400, Dmitry Olshansky wrote:
 On 08.04.2012 18:18, H. S. Teoh wrote:
 [snip]
We'd have to make sure the checksum doesn't end up in the final
executable though, otherwise the bloat may negate any gains we've
made.

Easy the symbol size is in object file (obviously) but it surely not present in the executable. If I worth anything legacy formats have a plenty of slack space reserved for future. Nwo is the future :)

I agree! [...]
Applicability:
It's not only const-immutable bloat, it can be alleviated with
inout. Yet there are plenty of places the exact same code is being
generated: e.g. sealed containers of int vs uint, std.find on
dchar[] vs uint[]/int[] an so on.
In general, the coarse grained parametrization is the root of all
evil and it is inevitable since we are just humans after all.

I'm not sure I understand the last sentence there, but duplicate code elimination is definitely a big plus. It will also mean that we can use templates more freely without having to worry about template bloat.

It's easy - define a template on type T. Code it up. Now how many times you did consider that e.g. you can parametrize on the size of the type instead of the type itself? I'm making a point that it's almost always the case with sealed containers of PODs for instance. (now multiply the argument for total number of parameters)

Yeah, that's what I was thinking of. This would be a very big gain for the new AA implementation, for example. I wouldn't have to worry so much about template bloat if most of the instantiations are going to get merged anyway. :-) [...]
Not really... string pooling can take advantage of overlapping
(sub)strings, but I don't think you can do that with code. But I
think your idea has a lot of merit. I'm for making linkers smarter
than they currently are.

Sorry, it's just me running ahead of train somewhat. Basically once this initial version is in place I have one cool refinement for it in mind. For now we just need to keep the hash function transitive and associative for the gods sake. 128/64bit checksum please ;)

And what would the cool refinement be? :-) T -- It's bad luck to be superstitious. -- YHL
Apr 08 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 08 Apr 2012 20:58:15 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 On 08.04.2012 16:37, Marco Leise wrote:
 [snip]
 Template bloat could be especially important to 'fix' on embedded systems.

I think I this idea largely formed years ago when I was working with c++ on 8bit micros. You won't believe the amount of code size one can save by using one separate generic save-it-all-then-load-it-all prolog/epilog for all functions (and esp ISRs).
  Let's wait till the bugs and important features are implemented or hack the
compiler ourselves.

Let's hack the compiler ;) BTW I think it should be possible to apply the idea on the IR level not on the actual machine code.

I don't know compilers, but doesn't the IR level still distinguish between signed and unsigned for example, even though the generated machine code will be the same? Anyway that's what I had in mind with the caching of CTFE as well. To create a hash over the IR (or AST?) of a function (or anything else) and store the hash and the serialized IR into a file next to the object files. -- Marco
Apr 08 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in compiler vendors)
 does what I think is rather simple optimization.

I worked out how to do it a while ago, but there's been no time to implement it. (You can't do a memcmp because of all the relocations.) The main difficulty is not being able to modify the linker. So you're pretty much limited to what the compiler is able to do before linking. D does allow the compiler to deal with all the modules at compile time, so this helps a lot.
Apr 08 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08.04.2012 22:51, Walter Bright wrote:
 On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in compiler
 vendors)
 does what I think is rather simple optimization.

I worked out how to do it a while ago, but there's been no time to implement it. (You can't do a memcmp because of all the relocations.)

Yes, I thought there must have been some problem with it.
 The main difficulty is not being able to modify the linker. So you're
 pretty much limited to what the compiler is able to do before linking. D
 does allow the compiler to deal with all the modules at compile time, so
 this helps a lot.

I'm thinking what if we kind of do it as an extension so that your linker (e.g. optlink) knows about these checksums (or whatever you use instead) while normal linker just works the way it always did. -- Dmitry Olshansky
Apr 08 2012
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Apr 08, 2012 at 10:56:43PM +0400, Dmitry Olshansky wrote:
 On 08.04.2012 22:51, Walter Bright wrote:

The main difficulty is not being able to modify the linker. So you're
pretty much limited to what the compiler is able to do before
linking. D does allow the compiler to deal with all the modules at
compile time, so this helps a lot.

I'm thinking what if we kind of do it as an extension so that your linker (e.g. optlink) knows about these checksums (or whatever you use instead) while normal linker just works the way it always did.

I agree. Either that, or have an external utility for doing it. I must say that IMO current linker technology is quite dumb, and no harm could come from experimenting with smarter linkers. Things like duplicate template instantiation elimination have been introduced due to the influence of C++; I don't see why D shouldn't introduce its own advancements to linkers too. :-) T -- Laissez-faire is a French term commonly interpreted by Conservatives to mean 'lazy fairy,' which is the belief that if governments are lazy enough, the Good Fairy will come down from heaven and do all their work for them.
Apr 08 2012
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Apr 09, 2012 at 10:59:26AM +1000, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

Or use a nop slide before the start of the function. Since we're modifying the object file format anyway, it would be trivial for the compiler to mark functions which have their address taken as needing a unique address.

Why is it so important to have unique addresses for functions? T -- BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
Apr 08 2012
parent "Daniel Murphy" <yebblies nospamgmail.com> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.1518.1333937643.4860.digitalmars-d puremagic.com...
 Why is it so important to have unique addresses for functions?

Just because I can't think of a use case doesn't mean nobody is relying on it! But I guess there really isn't one.
Apr 09 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 8 Apr 2012 19:14:22 -0700
schrieb "H. S. Teoh" <hsteoh quickfur.ath.cx>:

 On Mon, Apr 09, 2012 at 10:59:26AM +1000, Daniel Murphy wrote:
 "Artur Skawina" <art.08.09 gmail.com> wrote in message 
 news:mailman.1480.1333900846.4860.digitalmars-d puremagic.com...
 Note that my point is just that the compiler needs to emit a dummy
 so that the addresses remain unique, eg

   module.f!uint:
       jmp module.f!int

Or use a nop slide before the start of the function. Since we're modifying the object file format anyway, it would be trivial for the compiler to mark functions which have their address taken as needing a unique address.

Why is it so important to have unique addresses for functions? T

I would like to know that use case as well, especially since I learned coding with the "copy&paste code is bad" phrase. It just seems odd to me to expect any constant data/code to have unique addresses when they can be merged or overlapped. -- Marco
Apr 08 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Apr 09, 2012 at 08:21:08AM +0200, Somedude wrote:
 Le 08/04/2012 16:18, H. S. Teoh a écrit :
 On Sun, Apr 08, 2012 at 03:01:56PM +0400, Dmitry Olshansky wrote:
 I think it's been ages since I meant to ask why nobody (as in
 compiler vendors) does what I think is rather simple optimization.

 In the short term the plan is to introduce a "link-time" flavored
 optimization at code generation or (better) link step.

This would be incompatible with how current (non-dmd) linkers work. But I do like the idea. Perhaps if it works well, other linkers will adopt it? (Just like how the gcc linker adopted duplicate template code elimination due to C++ templates.) T

Actually, in C++ (as well as D), the added benefit would be a greatly improved compilation speed, wouldn't it ? I bet if the idea works in D and proves increased compilation, compiler writers would be very compelled to implement it in C++.

Exactly my point. I *want* to give incentive to toolchain devs to add these kinds of enhancements to linkers in general. T -- Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth? -- Michael Beibl
Apr 09 2012
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Apr 09, 2012 at 11:58:01PM +1000, Daniel Murphy wrote:
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.1518.1333937643.4860.digitalmars-d puremagic.com...
 Why is it so important to have unique addresses for functions?

Just because I can't think of a use case doesn't mean nobody is relying on it! But I guess there really isn't one.

Somebody brought up the matter of stacktraces. Which could be a valid concern, I suppose, although I'm tempted to just say, use a non-optimized build for debugging purposes. (But I suppose that is arguable.) T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Apr 09 2012