digitalmars.D - reduce mangled name sizes via link-time symbol renaming

Timothee Cour (29/29) Jan 25 2018 could a solution like proposed below be adapted to automatically

Johannes Pfau (13/54) Jan 25 2018 What is the benefit of using link-time renaming (a linker specific

Seb (3/19) Jan 26 2018 I thought LDC is already doing this with -hashthres?

timotheecour (13/22) Jan 26 2018 * What i suggested doesn't require any hashing, so it can produce

H. S. Teoh (39/49) Jan 26 2018 I think this is something worthwhile to implement, or at least try out.

Timothee Cour <thelastmammoth gmail.com> writes:

Could a solution like the one proposed below be adapted to automatically
reduce the size of long symbol names?

It would allow final object files to be smaller; eg see the problems that
huge symbol names cause:

* String Switch Lowering:
http://forum.dlang.org/thread/p4d777$1vij$1 digitalmars.com
caution: NSFW! contains huge mangled symbol name!
* http://lists.llvm.org/pipermail/lldb-dev/2018-January/013180.html
("[lldb-dev] Huge mangled names are causing long delays when loading
symbol table symbols")


```
main.d:
void foo_test1(){ }
void main(){ foo_test1(); }

# build a static library from main.d
dmd -lib -of=libmain.a main.d

# alias the long mangled name to a short one and stop exporting the long one
ld -r libmain.a -o libmain2.a -alias _D4main9foo_test1FZv _foobar \
   -unexported_symbol _D4main9foo_test1FZv

# NOTE: dummy.d is only needed because dmd somehow needs at least one
# object file or source file; a static library alone is not enough
# (dmd bug?)
dmd -of=main2 libmain2.a dummy.d
```

NOTE: to automate this process, a tool could find all symbol names longer
than a threshold and apply a mapping from long mangled names to short
aliases (eg: object_file_name + incremented_counter). The file with all
the mappings can then be supplied to a demangler (eg for lldb/gdb
debugging etc)
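
A rough sketch of how that automation could look, in D. It assumes the list of mangled names has already been extracted from the object file by some other tool; `buildAliasMap`, `objTag` and the `_<tag>_<counter>` alias format are made up for illustration and are not part of dmd or ld.

```d
import std.conv : to;

/// Maps every symbol longer than `threshold` to a short alias built from a
/// (hypothetical) object-file tag plus an incrementing counter.
string[string] buildAliasMap(string[] symbols, string objTag, size_t threshold)
{
    string[string] aliases;
    size_t counter = 0;
    foreach (sym; symbols)
    {
        if (sym.length <= threshold || sym in aliases)
            continue; // short enough, or already renamed
        aliases[sym] = "_" ~ objTag ~ "_" ~ counter.to!string;
        ++counter;
    }
    return aliases;
}

unittest
{
    auto m = buildAliasMap(["_D4main9foo_test1FZv", "_main"], "main", 10);
    assert(m.length == 1); // "_main" is below the threshold and is left alone
    assert(m["_D4main9foo_test1FZv"] == "_main_0");
}
```

Each (long, short) pair in the result could then be passed to the linker as an `-alias long short` argument, as in the manual ld invocation in the example.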

Jan 25 2018

Johannes Pfau <nospam example.com> writes:

Am Thu, 25 Jan 2018 14:24:12 -0800
schrieb Timothee Cour <thelastmammoth gmail.com>:

 could a solution like proposed below be adapted to automatically
 reduce size of long symbol names?
 
 It allows final object files to be smaller; eg see the problem this
 causes:
 
 * String Switch Lowering:
 http://forum.dlang.org/thread/p4d777$1vij$1 digitalmars.com
 caution: NSFW! contains huge mangled symbol name!
 * http://lists.llvm.org/pipermail/lldb-dev/2018-January/013180.html
 "[lldb-dev] Huge mangled names are causing long delays when loading
 symbol table symbols")
 
 
 ```
 main.d:
 void foo_test1(){ }
 void main(){ foo_test1(); }
 
 dmd -lib -of=libmain.a main.d
 
 ld -r libmain.a -o libmain2.a -alias _D4main9foo_test1FZv _foobar
 -unexported_symbol _D4main9foo_test1FZv

 
 #NOTE: dummy.d only needed because somehow dmd needs at least one
 object file or source file, a static library is somehow not enough
 (dmd bug?)
 
 dmd -of=main2 libmain2.a dummy.d
 

 

 ```
 
 NOTE: to automate this process it could find all symbol names >
 threshold and apply a mapping from long mangled names to short aliases
 (eg: object_file_name + incremented_counter), that file with all the
 mappings can be supplied for a demangler (eg for lldb/gdb debugging
 etc)

What is the benefit of using link-time renaming (a linker specific
feature) instead of directly renaming the symbol in the compiler? We
could be quite radical and hash all symbols > a certain threshold. As
long as we have a hash function with strong enough collision resistance
there shouldn't be any problem.

AFAICS we only need the mapping hashed_name ==> full name for
debugging. So maybe we can simply stuff the full, mangled name somehow
into DWARF debug information? We can even keep DWARF debug information
in external files, and support for this is just being added to GCC's
libbacktrace, so even stack traces could work fine.
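
A minimal sketch of the hashing idea in D, under stated assumptions: SHA-256 is picked here only as an example of a collision-resistant hash, and `hashIfLong`, the `hashed_` prefix and the in-memory `table` are hypothetical names. How the hashed_name ==> full name table would actually end up in DWARF is not shown.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;

/// Returns `mangled` unchanged if it is short enough; otherwise returns a
/// fixed-length hashed name and records hashed ==> full name in `table`.
string hashIfLong(string mangled, size_t threshold, ref string[string] table)
{
    if (mangled.length <= threshold)
        return mangled;
    ubyte[32] digest = sha256Of(mangled);
    // Slicing the digest selects the toHexString overload that returns a string.
    string hashed = "hashed_" ~ toHexString(digest[]);
    table[hashed] = mangled; // kept so a debugger could recover the full name
    return hashed;
}

unittest
{
    string[string] table;
    assert(hashIfLong("_D4main9foo_test1FZv", 100, table) == "_D4main9foo_test1FZv");
    auto h = hashIfLong("_D4main9foo_test1FZv", 10, table);
    assert(h.length == "hashed_".length + 64); // 32 bytes as hex digits
    assert(table[h] == "_D4main9foo_test1FZv");
}
```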

-- Johannes

Jan 25 2018

Seb <seb wilzba.ch> writes:

On Friday, 26 January 2018 at 07:34:50 UTC, Johannes Pfau wrote:
 Am Thu, 25 Jan 2018 14:24:12 -0800
 schrieb Timothee Cour <thelastmammoth gmail.com>:

 [...]

 What is the benefit of using link-time renaming (a linker 
 specific feature) instead of directly renaming the symbol in 
 the compiler? We could be quite radical and hash all symbols > 
 a certain threshold. As long as we have a hash function with 
 strong enough collision resistance there shouldn't be any 
 problem.

 AFAICS we only need the mapping hashed_name ==> full name for 
 debugging. So maybe we can simply stuff the full, mangled name 
 somehow into dwarf debug information? We can even keep dwarf 
 debug information in external files and support for this is 
 just being added to GCCs libbacktrace, so even stack traces 
 could work fine.

 -- Johannes

I thought LDC is already doing this with -hashthres?

https://github.com/ldc-developers/ldc/pull/1445

Jan 26 2018

timotheecour <timothee.cour2 gmail.com> writes:

On Friday, 26 January 2018 at 08:44:26 UTC, Seb wrote:
 What is the benefit of using link-time renaming (a linker 
 specific feature) instead of directly renaming the symbol in 
 the compiler? We could be quite radical and hash all symbols > 
 a certain threshold. As long as we have a hash function with 
 strong enough collision resistance there shouldn't be any 
 problem.
 -- Johannes

 I thought LDC is already doing this with -hashthres?

 https://github.com/ldc-developers/ldc/pull/1445

* What I suggested doesn't require any hashing, so it can produce 
minimal symbol sizes with 0 risk of collision; in fact, optimally 
small symbol sizes if we wanted to (using an incremented counter 
i to remap the i'th symbol).

* -hashthres is still experimental, doesn't work with phobos, 
and has a lower bound on symbol size since it's using a hash; it 
has other limitations, as you can see in 
https://github.com/ldc-developers/ldc/pull/1445#issue-149189001

* A potential extension of this proposal is to do it not at link 
time but at compile time, where we'd maintain (in memory) the 
mapping long_mangle=>short_mangle and serialize it to a file in 
case we'd like to support separate compilation.
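
To make the last point concrete, here is a small D sketch of serializing the mapping and reading it back in the reverse direction for a demangler. The one-pair-per-line text format and the names `saveMapping` / `loadReverseMapping` are assumptions for illustration; nothing like this exists in dmd or LDC as far as this thread says.

```d
import std.array : split;
import std.file : readText, write;
import std.string : splitLines;

/// Writes one "short long" pair per line (made-up format; it relies on
/// mangled names not containing spaces).
void saveMapping(string[string] longToShort, string path)
{
    string text;
    foreach (longName, shortName; longToShort)
        text ~= shortName ~ " " ~ longName ~ "\n";
    write(path, text);
}

/// Reads the file back as short ==> long, the direction a demangler needs.
string[string] loadReverseMapping(string path)
{
    string[string] shortToLong;
    foreach (line; readText(path).splitLines)
    {
        auto parts = line.split(" ");
        if (parts.length == 2)
            shortToLong[parts[0]] = parts[1];
    }
    return shortToLong;
}

unittest
{
    import std.file : remove, tempDir;
    import std.path : buildPath;

    auto path = buildPath(tempDir, "symbol_map_example.txt");
    saveMapping(["_D4main9foo_test1FZv": "_main_0"], path);
    scope (exit) remove(path);
    assert(loadReverseMapping(path)["_main_0"] == "_D4main9foo_test1FZv");
}
```

With separate compilation, each compiler invocation would need to load the existing file first so that counters and aliases stay consistent across object files; that part is left out here.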

Jan 26 2018

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 26, 2018 at 08:34:50AM +0100, Johannes Pfau via Digitalmars-d wrote:
[...]
 What is the benefit of using link-time renaming (a linker specific
 feature) instead of directly renaming the symbol in the compiler? We
 could be quite radical and hash all symbols > a certain threshold. As
 long as we have a hash function with strong enough collision
 resistance there shouldn't be any problem.

I think this is something worthwhile to implement, or at least try out.
Huge symbols have been an ongoing source of trouble in D code, esp. when
there's heavy template usage.  Even after Rainer's symbol backref PR was
merged, which largely alleviated the recursive symbol bloat problem, we
still have cases like object.__switch that need to be addressed.


 AFAICS we only need the mapping hashed_name ==> full name for
 debugging. So maybe we can simply stuff the full, mangled name somehow
 into dwarf debug information? We can even keep dwarf debug information
 in external files and support for this is just being added to GCCs
 libbacktrace, so even stack traces could work fine.

[...]

I dunno, I'm skeptical that a 10,000-character symbol is of any use to
anyone, even for debugging. I mean, what are you going to do with it?
Visually scan 10,000 characters to see if it's the same symbol as
another 10,000-character symbol in the program? If the only way to make
practical use of it is to use a program to compare it, then substituting
it with a hash is not any different.

It seems to me that the most useful parts of a long symbol are basically
its initial segment, which is usually the module name, useful for
narrowing down where the symbol came from, and the ending segment,
usually the last symbol(s) of a UFCS chain, or some argument types,
useful for determining the function name, or which overload is being
called. Given a long enough symbol, the middle portion is pretty much
never looked at; it might as well be random characters.  Which suggests
the following scheme: if a symbol S exceeds N characters, for a
suitably-chosen N (I'd say somewhere around 500 or 1000, as a rough
initial stab), then replace it with:

	S[0 .. 80] ~ hashOf(S) ~ S[$-80 .. $]

This gives you 160 human-readable characters of the most useful parts of
the symbol, with the largely-useless middle part replaced with a
fixed-length hash, so in the worst case, the symbol will be around 2-3
lines long and no more.

I chose 80 arbitrarily, it can be longer or shorter, but it's
approximately the length of 1 line of code, which presumably should be
enough to uniquely identify the source module of the symbol as well as
the last function name / parameter types.  Perhaps it can be increased
to about 200 or so, give or take, so that compressed symbols are
approximately N characters long. Or N can be reduced to match the 160 +
the ASCII-encoded size of the hash.
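
For illustration, the scheme could be sketched in D roughly as follows. This swaps the generic `hashOf` above for a hex-encoded SHA-256 digest (64 ASCII characters), which is an assumption made here, not something settled in this thread; `compressSymbol` and `keep` are hypothetical names.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;

enum size_t keep = 80;    // readable characters kept at each end
enum size_t hashLen = 64; // SHA-256 digest as hex digits

/// Replaces the middle of symbols longer than `n` with a fixed-length hash.
string compressSymbol(string s, size_t n)
{
    // Leave the symbol alone unless it exceeds N and compressing it
    // would actually make it shorter.
    if (s.length <= n || s.length <= 2 * keep + hashLen)
        return s;
    ubyte[32] digest = sha256Of(s);
    // Slicing the digest selects the toHexString overload that returns a string.
    return s[0 .. keep] ~ toHexString(digest[]) ~ s[$ - keep .. $];
}

unittest
{
    import std.array : replicate;

    string longSym = "_D4main" ~ replicate("x", 2000) ~ "FZv";
    auto c = compressSymbol(longSym, 1000);
    assert(c.length == 2 * keep + hashLen);
    assert(c[0 .. 7] == "_D4main" && c[$ - 3 .. $] == "FZv");
    assert(compressSymbol("_D4main9foo_test1FZv", 1000) == "_D4main9foo_test1FZv");
}
```

With these values every compressed symbol is 224 characters long, so N could be lowered to about that size, in line with the last remark above.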


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi

Jan 26 2018
