
digitalmars.D - reduce mangled name sizes via link-time symbol renaming

Timothee Cour <thelastmammoth gmail.com> writes:
Could a solution like the one proposed below be adapted to automatically
reduce the size of long symbol names?

It would allow final object files to be smaller. For examples of the
problems that long names cause, see:

* String Switch Lowering:
http://forum.dlang.org/thread/p4d777$1vij$1 digitalmars.com
(caution: NSFW! contains huge mangled symbol name!)
* http://lists.llvm.org/pipermail/lldb-dev/2018-January/013180.html
("[lldb-dev] Huge mangled names are causing long delays when loading
symbol table symbols")


```
main.d:
void foo_test1(){ }
void main(){ foo_test1(); }

# build a static library from main.d
dmd -lib -of=libmain.a main.d

# alias the long mangled name to a short one and hide the original
# (-alias, -alias_list and -unexported_symbol are options of the macOS linker)
ld -r libmain.a -o libmain2.a -alias _D4main9foo_test1FZv _foobar \
  -unexported_symbol _D4main9foo_test1FZv
# or: via `-alias_list filename`

# NOTE: dummy.d is only needed because dmd somehow needs at least one
# object file or source file; a static library is somehow not enough
# (dmd bug?)
dmd -of=main2 libmain2.a dummy.d

nm main2 | grep _foobar # ok

./main2 # ok
```

NOTE: to automate this process, a tool could find all symbol names longer
than a threshold and apply a mapping from long mangled names to short
aliases (e.g. object_file_name + incremented_counter). The file with all
the mappings can then be supplied to a demangler (e.g. for lldb/gdb
debugging).
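
The automation described in the note could look roughly like the following. This is only a sketch in D: the function names `buildAliasMap` and `toAliasList`, the `_<object>_<counter>` alias shape, and the "long name, whitespace, alias" line layout are assumptions made for illustration, and the linker's actual alias-list format should be checked before relying on it.

```d
import std.algorithm : sort;
import std.conv : to;
import std.stdio : write;

/// Hypothetical helper: gives every symbol longer than `threshold`
/// a short alias built from the object file name and a counter.
string[string] buildAliasMap(string[] symbols, string objName, size_t threshold)
{
    string[string] aliases; // long mangled name -> short alias
    size_t counter = 0;
    foreach (sym; symbols)
    {
        if (sym.length > threshold && (sym !in aliases))
        {
            aliases[sym] = "_" ~ objName ~ "_" ~ counter.to!string;
            ++counter;
        }
    }
    return aliases;
}

/// Renders the map as one "long short" pair per line (assumed layout),
/// sorted by long name so the output is reproducible.
string toAliasList(string[string] aliases)
{
    auto longNames = aliases.keys;
    sort(longNames);
    string result;
    foreach (name; longNames)
        result ~= name ~ " " ~ aliases[name] ~ "\n";
    return result;
}

void main()
{
    auto symbols = ["_D4main9foo_test1FZv", "_Dmain"];
    auto aliases = buildAliasMap(symbols, "main", 10);
    write(toAliasList(aliases)); // prints: _D4main9foo_test1FZv _main_0
}
```

The same text serves both purposes mentioned in the note: it is the input for the renaming step and the lookup table a demangler would need later.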
Jan 25
Johannes Pfau <nospam example.com> writes:
On Thu, 25 Jan 2018 14:24:12 -0800,
Timothee Cour <thelastmammoth gmail.com> wrote:

 [...]
What is the benefit of using link-time renaming (a linker-specific feature) instead of directly renaming the symbol in the compiler?

We could be quite radical and hash all symbols longer than a certain threshold. As long as we have a hash function with strong enough collision resistance, there shouldn't be any problem.

AFAICS we only need the mapping hashed_name ==> full name for debugging. So maybe we can simply stuff the full mangled name into the DWARF debug information somehow? We can even keep DWARF debug information in external files, and support for this is just being added to GCC's libbacktrace, so even stack traces could work fine.

-- Johannes
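
A minimal sketch of the compiler-side hashing idea, written in D. Everything specific here is an assumption for illustration: the `SymbolHasher` type, the `_H` prefix, and the choice of SHA-256 from Phobos as the collision-resistant hash. The side table stands in for the hashed_name ==> full name mapping that would have to be stored somewhere for debugging.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;
import std.stdio : writeln;

/// Hypothetical renamer: hashes every mangled name above a threshold.
struct SymbolHasher
{
    size_t threshold;
    string[string] fullNames; // hashed name -> full mangled name

    string shorten(string mangled)
    {
        if (mangled.length <= threshold)
            return mangled; // short names are left untouched

        auto hex = toHexString(sha256Of(mangled)); // char[64], hex digits
        string hashed = "_H" ~ hex[].idup;         // "_H" prefix is made up
        fullNames[hashed] = mangled;               // remember it for debugging
        return hashed;
    }
}

void main()
{
    import std.array : replicate;

    auto hasher = SymbolHasher(500);
    string longName = "_D4main" ~ "x".replicate(1000) ~ "FZv";
    string hashed = hasher.shorten(longName);

    writeln(hashed.length);                        // 2 + 64 = 66
    writeln(hasher.fullNames[hashed] == longName); // true
    writeln(hasher.shorten("_Dmain"));             // _Dmain
}
```

Because the hash depends only on the full mangled name, separately compiled modules would agree on the short name without sharing any state.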
Jan 25
Seb <seb wilzba.ch> writes:
On Friday, 26 January 2018 at 07:34:50 UTC, Johannes Pfau wrote:
 On Thu, 25 Jan 2018 14:24:12 -0800,
 Timothee Cour <thelastmammoth gmail.com> wrote:

 [...]
I thought LDC is already doing this with -hashthres? https://github.com/ldc-developers/ldc/pull/1445
Jan 26
timotheecour <timothee.cour2 gmail.com> writes:
On Friday, 26 January 2018 at 08:44:26 UTC, Seb wrote:
 What is the benefit of using link-time renaming (a linker 
 specific feature) instead of directly renaming the symbol in 
 the compiler? We could be quite radical and hash all symbols > 
 a certain threshold. As long as we have a hash function with 
 strong enough collision resistance there shouldn't be any 
 problem.
 -- Johannes
 I thought LDC is already doing this with -hashthres? https://github.com/ldc-developers/ldc/pull/1445
* What I suggested doesn't require any hashing, so it can produce minimal symbol sizes with zero risk of collision; in fact, optimally small symbols if we wanted (using an incremented counter i to remap the i'th symbol).
* -hashthres is still experimental, doesn't work with Phobos, and has a lower bound on symbol size since it uses a hash; it has other limitations, as you can see in https://github.com/ldc-developers/ldc/pull/1445#issue-149189001
* A potential extension of this proposal is to do the renaming at compile time rather than at link time: we'd maintain (in memory) the mapping long_mangle=>short_mangle and serialize it to a file in case we'd like to support separate compilation.
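
The serialized mapping from the last point could be read back by a demangling tool roughly as follows. This is a sketch in D under stated assumptions: the file holds one "long_mangle short_mangle" pair per line, lines starting with `#` are comments, and the names `parseAliasList` and `restoreName` are hypothetical.

```d
import std.array : split;
import std.stdio : writeln;
import std.string : splitLines;

/// Parses "long short" pairs and returns the reverse table
/// (short alias -> long mangled name).
string[string] parseAliasList(string text)
{
    string[string] shortToLong;
    foreach (line; text.splitLines())
    {
        auto fields = line.split(); // split on whitespace
        if (fields.length != 2 || fields[0][0] == '#')
            continue;               // skip blanks, comments, malformed lines
        shortToLong[fields[1]] = fields[0];
    }
    return shortToLong;
}

/// Returns the long name for a known alias, or the input unchanged.
string restoreName(string[string] shortToLong, string name)
{
    if (auto p = name in shortToLong)
        return *p;
    return name;
}

void main()
{
    auto table = parseAliasList("# aliases for main.o\n_D4main9foo_test1FZv _main_0\n");
    writeln(restoreName(table, "_main_0")); // _D4main9foo_test1FZv
    writeln(restoreName(table, "_Dmain"));  // _Dmain
}
```

The restored name can then be handed to the regular D demangler, so the debugger side needs nothing beyond this lookup.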
Jan 26
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 26, 2018 at 08:34:50AM +0100, Johannes Pfau via Digitalmars-d wrote:
[...]
 What is the benefit of using link-time renaming (a linker specific
 feature) instead of directly renaming the symbol in the compiler? We
 could be quite radical and hash all symbols > a certain threshold. As
 long as we have a hash function with strong enough collision
 resistance there shouldn't be any problem.
I think this is something worthwhile to implement, or at least try out. Huge symbols have been an ongoing source of trouble in D code, esp. when there's heavy template usage. Even after Rainer's symbol backref PR was merged, which largely alleviated the recursive symbol bloat problem, we still have cases like object.__switch that need to be addressed.
 AFAICS we only need the mapping hashed_name ==> full name for
 debugging. So maybe we can simply stuff the full, mangled name somehow
 into dwarf debug information? We can even keep dwarf debug information
 in external files and support for this is just being added to GCCs
 libbacktrace, so even stack traces could work fine.
[...]

I dunno, I'm skeptical that a 10,000-character symbol is of any use to anyone, even for debugging. I mean, what are you going to do with it? Visually scan 10,000 characters to see if it's the same symbol as another 10,000-character symbol in the program? If the only way to make practical use of it is to use a program to compare it, then substituting it with a hash is not any different.

It seems to me that the most useful parts of a long symbol are basically its initial segment, which is usually the module name, useful for narrowing down where the symbol came from, and the ending segment, usually the last symbol(s) of a UFCS chain, or some argument types, useful for determining the function name, or which overload is being called. Given a long enough symbol, the middle portion is pretty much never looked at; it might as well be random characters.

Which suggests the following scheme: if a symbol S exceeds N characters, for a suitably-chosen N (I'd say somewhere around 500 or 1000, as a rough initial stab), then replace it with:

    S[0 .. 80] ~ hashOf(S) ~ S[$-80 .. $]

This gives you 160 human-readable characters of the most useful parts of the symbol, with the largely-useless middle part replaced with a fixed-length hash, so in the worst case, the symbol will be around 2-3 lines long and no more.

I chose 80 arbitrarily; it can be longer or shorter, but it's approximately the length of 1 line of code, which presumably should be enough to uniquely identify the source module of the symbol as well as the last function name / parameter types. Perhaps it can be increased to about 200 or so, give or take, so that compressed symbols are approximately N characters long. Or N can be reduced to match the 160 + the ASCII-encoded size of the hash.

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
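
The prefix + hash + suffix scheme can be sketched in D as below. Two things are assumptions rather than part of the proposal: the function name `compressSymbol`, and the use of SHA-256 (hex-encoded, 64 characters) in place of the unspecified hash, since the built-in `hashOf` returns a `size_t` and is not meant to be collision resistant.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;
import std.stdio : writeln;

/// Keeps the first and last `keep` characters of an over-long symbol
/// and replaces everything with a hash of the whole symbol in between.
string compressSymbol(string s, size_t n = 500, size_t keep = 80)
{
    assert(n >= 2 * keep, "threshold must cover both kept segments");
    if (s.length <= n)
        return s; // at or below the threshold: leave unchanged

    auto hex = toHexString(sha256Of(s)); // char[64], hash of the full symbol
    return s[0 .. keep] ~ hex[].idup ~ s[$ - keep .. $];
}

void main()
{
    import std.array : replicate;

    string sym = "_D4main" ~ "x".replicate(2000) ~ "FZv";
    string shortSym = compressSymbol(sym);

    writeln(shortSym.length);                // 80 + 64 + 80 = 224
    writeln(shortSym[0 .. 7]);               // _D4main
    writeln(shortSym[$ - 3 .. $]);           // FZv
    writeln(compressSymbol("_Dmain"));       // _Dmain
}
```

Hashing the whole symbol rather than only the dropped middle keeps two symbols distinct even when they share both the kept prefix and the kept suffix.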
Jan 26