digitalmars.D - Policy for exposing range structs

Seb (32/32) Mar 25 2016 If I understand it correctly, the current policy in Phobos is

Steven Schveighoffer (4/27) Mar 25 2016 We should actually be moving *away from* voldemort types:

Adam D. Ruppe (7/8) Mar 25 2016 Indeed, they basically suck.

Andrei Alexandrescu (2/5) Mar 25 2016 I love those things! -- Andrei

Jonathan M Davis via Digitalmars-d (7/12) Mar 25 2016 They're very cool in principle but have some issues in practice. As long...
Steven Schveighoffer (6/12) Mar 26 2016 I prefer voldemort types too! To repeat all the template machinery in an...

Andrei Alexandrescu (2/5) Mar 25 2016 Has this bug been submitted? -- Andrei

Steven Schveighoffer (8/15) Mar 25 2016 I'm not sure it's a bug that can be fixed. It's caused by the design of

H. S. Teoh via Digitalmars-d (15/33) Mar 25 2016 [...]

Steven Schveighoffer (15/45) Mar 25 2016 I don't think it's that simple. For example:

Anon (31/92) Mar 25 2016 These repetitions could be eliminated relatively easily (from a

Steven Schveighoffer (6/10) Mar 25 2016 [snip]
Johan Engelen (28/36) Apr 01 2016 I have implemented the second part (simplest I thought) to see

Johan Engelen (16/22) Apr 01 2016 The times I measured are not including linking (!). But I think

Johan Engelen (33/39) Apr 19 2016 See my PR for LDC:

Jack Stouffer (4/9) Apr 19 2016 I'm surprised that's how it works. I just assumed there was some
Steven Schveighoffer (10/14) Apr 19 2016 I'd be surprised link times did not improve. With my trivial test cases,...

Marco Leise (5/5) Apr 19 2016 What about using old string compression algorithms on the

Steven Schveighoffer (7/10) Apr 19 2016 It's somewhat recursive, so I don't know how well string compression

Johan Engelen (13/29) Apr 20 2016 `nm` on my Mac cannot handle the long symbol names, and I have to

Andrei Alexandrescu (2/18) Mar 25 2016 Compression of template names. -- Andrei

Anon (12/37) Mar 26 2016 Would literally not help. The problem in the bug report is

Johan Engelen (7/9) Mar 31 2016 Hi Anon,

Anon (28/35) Mar 31 2016 No, I haven't started implemented things for that idea. The

Adam D. Ruppe (3/5) Mar 31 2016 How does it compare to simply gzipping the string and writing it

Anon (42/47) Mar 31 2016 My encoding is shorter in the typical use case, at least when

Adam D. Ruppe (6/12) Mar 31 2016 Yeah, but my thought is the typical use case isn't actually the

Anon (76/79) Mar 31 2016 Doubling the size of UTF-8 (the effect of the current base16

Adam D. Ruppe (23/34) Mar 31 2016 meh, if it is allowed, it is going to happen. Why make it worse

Anon (38/50) Mar 31 2016 Define "little cost". Whatever compression algorithm chosen will

David Nadlinger (9/10) Mar 27 2016 Compression in the usual sense won't help. Sure, it might reduce

Liran Zvibel (40/46) Mar 30 2016 We love Voldemort types at Weka, and use them a lot in our

Steven Schveighoffer (35/63) Mar 31 2016 Voldemort types are what cause the bloat, templates inside templates

Adam D. Ruppe (45/47) Mar 31 2016 So here's an idea of other things don't work out: voldemort types

Steven Schveighoffer (9/18) Mar 31 2016 Ugh, let's try the huffman coding thing first :) Stack traces would be

Adam D. Ruppe (28/33) Mar 31 2016 Do that after and along with!

Steven Schveighoffer (6/14) Mar 31 2016 Moving types outside the function actually results in a pretty readable

Steven Schveighoffer (12/16) Mar 31 2016 If you are hashing anyway, just hash the hash :) In other words, this

jmh530 (31/38) Mar 31 2016 If you move anything with a Voldemort type to their own modules,
Jonathan M Davis via Digitalmars-d (13/35) Mar 31 2016 This makes me wonder if there would be a sane way to basically treat the

H. S. Teoh via Digitalmars-d (51/70) Mar 25 2016 Actually, certain types of range structs *have* to be in module space

Seb <seb wilzba.ch> writes:

If I understand it correctly, the current policy in Phobos is 
that range methods should use static nested structs to avoid the 
name clutter and document the capabilities of the returned ranges 
in the documentation.
However there are a lot of old functions that still use public 
structs, which leads to the rather confusing documentation output 
in std.range:

Chunks · chunks · Cycle · cycle ... evenChunks · EvenChunks · 
FrontTransversal · frontTransversal · ... · indexed · Indexed · 
... · lockstep · Lockstep · .. · Recurrence · recurrence · 
RefRange · refRange · repeat · Repeat · ... · sequence · Sequence 
· SortedRange · ... · Take · take · ... Transversal · transversal 
· TransverseOptions · zip · Zip

(off-topic: it's quite interesting to see that sometimes the 
structs are before the method and sometimes after it)

Two arguments to keep exposing the structs are that (1) an API 
user can explicitly specify the return type (in contrast to auto) 
and (2) one can see the capabilities of the struct in the 
documentation.

There are many cases where methods in these structs are optional 
and depend on the capabilities of the input range (e.g. backward, 
length, random access, ...). I could imagine that

1) We rework ddoc, s.t. it doesn't list these structs in the 
overview and adds the struct methods to the function (could get 
ugly)
2) We deprecate those exposed structs and make them 
private/nested (does anyone know whether it's common to depend on 
those structs?)

What are your thoughts on this matter?

(This is a follow-up from a short discussion with Jakob Ovrum on 
github [1].)

[1] https://github.com/D-Programming-Language/phobos/pull/4027

Mar 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/25/16 7:37 AM, Seb wrote:
 If I understand it correctly, the current policy in Phobos is that range
 methods should use static nested structs to avoid the name clutter and
 document the capabilities of the returned ranges in the documentation.
 However there are a lot of old functions that still use public structs,
 which leads to the rather confusing documentation output in std.range:

 Chunks · chunks · Cycle · cycle ... evenChunks · EvenChunks ·
 FrontTransversal · frontTransversal · ... · indexed · Indexed · .... ·
 lockstep · Lockstep · .. · Recurrence · recurrence · RefRange · refRange
 · repeat · Repeat · ... · sequence · Sequence · SortedRange · ... ·
Take
 · take · ... Transversal · transversal · TransverseOptions · zip · Zip

 (off-topic: it's quite interesting to see that sometimes the structs are
 before the method and sometimes after it)

 Two arguments to keep exposing the structs are that (1) an API user can
 explicitly specify the return type (in contrast to auto) and (2) one can
 see the capabilities of the struct in the documentation.

 There are many cases where methods in these structs are optional and
 depend on the capabilities of the input range (e.g. backward, length,
 random access, ...). I could imagine that

 1) We rework ddoc, s.t. it doesn't list these structs in the overview
 and adds the struct methods to the function (could get ugly)
 2) We deprecate those exposed structs and make them private/nested (does
 anyone know whether it's common to depend on those structs?)

 What are your thoughts on this matter?

We should actually be moving *away from* voldemort types:

https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

-Steve

Mar 25 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Friday, 25 March 2016 at 14:07:41 UTC, Steven Schveighoffer 
wrote:
 We should actually be moving *away from* voldemort types:

Indeed, they basically suck.

Really, I think the structs ought to be where the bulk of the 
documentation is, and the function is just listed as a 
convenience method for creating them. That's really how it works 
anyway.

Mar 25 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/25/2016 10:30 AM, Adam D. Ruppe wrote:
 On Friday, 25 March 2016 at 14:07:41 UTC, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 Indeed, they basically suck.

I love those things! -- Andrei

Mar 25 2016

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Saturday, March 26, 2016 01:24:11 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 03/25/2016 10:30 AM, Adam D. Ruppe wrote:
 On Friday, 25 March 2016 at 14:07:41 UTC, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 Indeed, they basically suck.

 I love those things! -- Andrei

They're very cool in principle but have some issues in practice. As long as
we can reasonably fix those issues, I think that we should continue to use
them, but if we can't, then we'll have to reconsider using them. But I hope
that we can solve the issues. It would certainly be a shame if we couldn't.

- Jonathan M Davis

Mar 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/26/16 1:24 AM, Andrei Alexandrescu wrote:
 On 03/25/2016 10:30 AM, Adam D. Ruppe wrote:
 On Friday, 25 March 2016 at 14:07:41 UTC, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 Indeed, they basically suck.

 I love those things! -- Andrei

I prefer voldemort types too! To repeat all the template machinery in an 
external struct is annoying. So nice to just put "struct X" inside the 
function.

But clearly, not worth the pain if this can't be fixed.

-Steve

Mar 26 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 -Steve

Has this bug been submitted? -- Andrei

Mar 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
 On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 Has this bug been submitted? -- Andrei

I'm not sure it's a bug that can be fixed. It's caused by the design of 
the way template name mangling is included.

I can submit a general "enhancement", but I don't know what it would 
say? Make template mangling more efficient? :)

I suppose having a bug report with a demonstration of why we should 
change it is a good thing. I'll add that.

-Steve

Mar 25 2016

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Mar 25, 2016 at 11:40:11AM -0400, Steven Schveighoffer via
Digitalmars-d wrote:
 On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
We should actually be moving *away from* voldemort types:

https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

Has this bug been submitted? -- Andrei

 
 I'm not sure it's a bug that can be fixed. It's caused by the design
 of the way template name mangling is included.
 
 I can submit a general "enhancement", but I don't know what it would
 say?  Make template mangling more efficient? :)
 
 I suppose having a bug report with a demonstration of why we should
 change it is a good thing. I'll add that.

[...]

We've been talking about compressing template symbols a lot recently,
but there's a very simple symbol size reduction that we can do right
now: most of the templates in Phobos are eponymous templates, and under
the current mangling scheme that means repetition of the template name
and the eponymous member in the symbol.  My guess is that most of the 4k
symbol bloats come from eponymous templates. In theory, a single
character (or something in that vicinity) ought to be enough to indicate
an eponymous template. That should cut down symbol size significantly
(I'm guessing about 30-40% reduction at a minimum, probably more in
practice) without requiring a major overhaul of the mangling scheme.


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder

Mar 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/25/16 12:18 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Mar 25, 2016 at 11:40:11AM -0400, Steven Schveighoffer via
Digitalmars-d wrote:
 On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
 On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 Has this bug been submitted? -- Andrei

 I'm not sure it's a bug that can be fixed. It's caused by the design
 of the way template name mangling is included.

 I can submit a general "enhancement", but I don't know what it would
 say?  Make template mangling more efficient? :)

 I suppose having a bug report with a demonstration of why we should
 change it is a good thing. I'll add that.

 [...]

 We've been talking about compressing template symbols a lot recently,
 but there's a very simple symbol size reduction that we can do right
 now: most of the templates in Phobos are eponymous templates, and under
 the current mangling scheme that means repetition of the template name
 and the eponymous member in the symbol.  My guess is that most of the 4k
 symbol bloats come from eponymous templates. In theory, a single
 character (or something in that vicinity) ought to be enough to indicate
 an eponymous template. That should cut down symbol size significantly
 (I'm guessing about 30-40% reduction at a minimum, probably more in
 practice) without requiring a major overhaul of the mangling scheme.

I don't think it's that simple. For example:

auto foo(T)(T t)

Needs to repeat T (whatever it happens to be) twice -- once for the 
template foo, and once for the function parameter. If foo returns an 
internally defined type that can be passed to foo:

x.foo.foo.foo.foo

Each nesting multiplies the size of the symbol by 2 (at least, maybe 
even 3). So it's exponential growth. Even if you compress it to one 
character, having a chain of, say, 16 calls brings you to 65k characters 
for the symbol. We need to remove the number of times the symbol is 
repeated, via some sort of substitution.

Added the bug report. Take a look and see what you think.

https://issues.dlang.org/show_bug.cgi?id=15831

-Steve

Mar 25 2016

Anon <anon anon.anon> writes:

On Friday, 25 March 2016 at 18:20:12 UTC, Steven Schveighoffer 
wrote:
 On 3/25/16 12:18 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Mar 25, 2016 at 11:40:11AM -0400, Steven Schveighoffer 
 via Digitalmars-d wrote:
 On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
 On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 Has this bug been submitted? -- Andrei

 I'm not sure it's a bug that can be fixed. It's caused by the 
 design
 of the way template name mangling is included.

 I can submit a general "enhancement", but I don't know what 
 it would
 say?  Make template mangling more efficient? :)

 I suppose having a bug report with a demonstration of why we 
 should
 change it is a good thing. I'll add that.

 [...]

 We've been talking about compressing template symbols a lot 
 recently,
 but there's a very simple symbol size reduction that we can do 
 right
 now: most of the templates in Phobos are eponymous templates, 
 and under
 the current mangling scheme that means repetition of the 
 template name
 and the eponymous member in the symbol.  My guess is that most 
 of the 4k
 symbol bloats come from eponymous templates. In theory, a 
 single
 character (or something in that vicinity) ought to be enough 
 to indicate
 an eponymous template. That should cut down symbol size 
 significantly
 (I'm guessing about 30-40% reduction at a minimum, probably 
 more in
 practice) without requiring a major overhaul of the mangling 
 scheme.

 I don't think it's that simple. For example:

 auto foo(T)(T t)

 Needs to repeat T (whatever it happens to be) twice -- once for 
 the template foo, and once for the function parameter. If foo 
 returns an internally defined type that can be passed to foo:

 x.foo.foo.foo.foo

 Each nesting multiplies the size of the symbol by 2 (at least, 
 maybe even 3). So it's exponential growth. Even if you compress 
 it to one character, having a chain of, say, 16 calls brings 
 you to 65k characters for the symbol. We need to remove the 
 number of times the symbol is repeated, via some sort of 
 substitution.

 Added the bug report. Take a look and see what you think.

 https://issues.dlang.org/show_bug.cgi?id=15831

 -Steve

These repetitions could be eliminated relatively easily (from a 
user's perspective, anyway; things might be more difficult in the 
actual implementation).

Two changes to the mangling:

1) `LName`s of length 0 (which currently cannot exist) mean to 
repeat the previous `LName` of the current symbol.

2) N `Number` is added as a valid `Type`, meaning "Type Back 
Reference". Basically, all instances of a 
struct/class/interface/enum type in a symbol's mangling get 
counted (starting from zero), and subsequent instances of that 
type can be referred to by N0, N1, N2, etc.

So given:

```
module mod;
struct Foo;
Foo* func(Foo* a, Foo* b);
```

`func` currently mangles as:
_D3mod4funcFPS3mod3FooPS3mod3FooZPS3mod3Foo

It would instead be mangled as:
_D3mod4funcFPS3mod3FooPN0ZPN0

Nested templates declarations would get numbered depth first as 
follows:

S7!(S2!(S0, S1), S6!(S3, S5!(S4)))

I have another idea for reducing the byte impact of template 
string value parameters, but it is a bit more complicated and I 
need to finish de-bugging and optimizing some code to make sure 
it will work as well as I think. I'll post more on that soon, I 
suspect.

Mar 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/25/16 3:04 PM, Anon wrote:
 These repetitions could be eliminated relatively easily (from a user's
 perspective, anyway; things might be more difficult in the actual
 implementation).

 Two changes to the mangling:

[snip]

Please add these ideas to the bug report! I'm not sure if it completely 
fixes this or not, but it's worth exploring options.

This is similar to how I was thinking we should approach this.

-Steve

Mar 25 2016

Johan Engelen <j j.nl> writes:

On Friday, 25 March 2016 at 19:04:46 UTC, Anon wrote:
 Two changes to the mangling:

 1) `LName`s of length 0 (which currently cannot exist) mean to 
 repeat the previous `LName` of the current symbol.

 2) N `Number` is added as a valid `Type`, meaning "Type Back 
 Reference". Basically, all instances of a 
 struct/class/interface/enum type in a symbol's mangling get 
 counted (starting from zero), and subsequent instances of that 
 type can be referred to by N0, N1, N2, etc.

I have implemented the second part (simplest I thought) to see 
what it would bring (in LDC). 'Q' seems unused, so I used that 
instead of 'N'  ;)
I'm not 100% certain of the implementation, as it relies on DMD 
using the same pointer values for recurring types but it seemed 
to work with complicated code.
The core.demangle demangler was relatively easy to upgrade for 
repeated types.

Unfortunately, and perhaps expectedly, it did not give a large 
size reduction / speed boost. Not very statistically sound 
results though ;)
The change is relatively easy, I think, so perhaps it is still 
worthwhile pursuing.

An implementation issue that changes in mangling have to deal 
with:
One surprising thing was that std.traits uses the mangled name in 
`ParameterStorageClassTuple`. So std.traits contains its own mini 
mangling parser, and also calls .mangleof on Types, to skip to 
the next function argument in the mangled function string. The 
"type back reference" scheme means that Type.mangleof may not 
correspond to how the Type embedded in the mangling of a 
function, and so I had to upgrade std.traits with its own parser 
(I copied core.demangle.Demangle, and removed all output 
complications).

For changes in mangling, it'd be good if 
std.traits.ParameterStorageClassTuple could be rewritten to not 
use the mangled function name + string processing.

Apr 01 2016

Johan Engelen <j j.nl> writes:

On Friday, 1 April 2016 at 11:00:29 UTC, Johan Engelen wrote:
 Unfortunately, and perhaps expectedly, it did not give a large 
 size reduction / speed boost. Not very statistically sound 
 results though ;)

The times I measured are not including linking (!). But I think 
the small difference in file size does not predict much improved 
link times.

Meanwhile, I've implemented hashing of function names and other 
symbols *for the backend*, giving an object file size reduction 
of ~25% (hashing everything larger than 100 chars) for my current 
testcase (251MB -> 189MB).
Hashing symbols in the FE is not possible with my testcase 
because of std.traits.ParameterStorageClassTuple... :/

 For changes in mangling, it'd be good if 
 std.traits.ParameterStorageClassTuple could be rewritten to not 
 use the mangled function name + string processing.

std.traits.ParameterStorageClassTuple is a major obstacle for 
aggressive mangling changes. Perhaps someone has the time and 
skills to rewrite it if possible? (My D-fu is not really up to 
that task)

I'll try and keep you posted with some more measurements, to get 
a feel for what matters / what doesn't matter too much.

Apr 01 2016

Johan Engelen <j j.nl> writes:

On Friday, 1 April 2016 at 14:46:42 UTC, Johan Engelen wrote:
 Meanwhile, I've implemented hashing of function names and other 
 symbols *for the backend*, giving an object file size reduction 
 of ~25% (hashing everything larger than 100 chars) for my 
 current testcase (251MB -> 189MB).
 Hashing symbols in the FE is not possible with my testcase 
 because of std.traits.ParameterStorageClassTuple... :/

See my PR for LDC:
https://github.com/ldc-developers/ldc/pull/1445

"This adds MD5 hashing of symbol names that are larger than 
threshold set by -hashthres.

What is very unfortunate is that std.traits depends on the 
mangled name, doing string parsing of the mangled name of symbols 
to obtain symbol traits. This means that mangling cannot be 
changed (dramatically, like hashing) at a high level, and the 
hashing has to be done on a lower level.

Hashed symbols look like this:
_D3one3two5three3L3433_46a82aac733d8a4b3588d7fa8937aad66Result3fooZ
ddemangle gives:
one.two.three.L34._46a82aac733d8a4b3588d7fa8937aad6.Result.foo
Meaning: this symbol is defined in module one.two.three on line 
34. The identifier is foo and is contained in the struct or class 
Result.

Symbols that may be hashed:
- functions
- struct/class initializer
- vtable
- typeinfo (needed surgery inside FE code)

The feature is experimental, and has been tested on Weka.io's 
codebase. Compilation with -hashthres=1000 results in a binary 
that is half the size of the original (201MB vs. 461MB). I did 
not observe a significant difference in total build times. Hash 
threshold of 8000 gives 229MB, 800 gives 195MB binary size: there 
is not much gain after a certain hash threshold.
Linking Weka's code fails with a threshold of 500: phobos 
contains a few large symbols (one larger than 8kb!) and this PR 
currently does not disable hashing of symbols that are inside 
phobos, hence "experimental". Future work could try to figure out 
whether a symbol is inside phobos or not."

Apr 19 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Tuesday, 19 April 2016 at 14:44:12 UTC, Johan Engelen wrote:
 What is very unfortunate is that std.traits depends on the 
 mangled name, doing string parsing of the mangled name of 
 symbols to obtain symbol traits.

I'm surprised that's how it works. I just assumed there was some 
__traits call for that.

 Compilation with -hashthres=1000 results in a binary that is 
 half the size of the original (201MB vs. 461MB).

Seems promising!

Apr 19 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 4/19/16 10:44 AM, Johan Engelen wrote:

 The feature is experimental, and has been tested on Weka.io's codebase.
 Compilation with -hashthres=1000 results in a binary that is half the
 size of the original (201MB vs. 461MB). I did not observe a significant
 difference in total build times.

I'd be surprised link times did not improve. With my trivial test cases, 
the linker started getting hung up with very long symbols. The 
compilation itself was quick though.

Certainly, cutting exe sizes (this is significantly more than half, BTW) 
is a good thing in itself.

Another thing, however, is memory usage of the compiler. My 
understanding is that having large symbol names requires more memory. 
How does that fare?

-Steve

Apr 19 2016

Marco Leise <Marco.Leise gmx.de> writes:

What about using old string compression algorithms on the
mangled name? That should effectively get rid of all the
repeated words.

-- 
Marco

Apr 19 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 4/19/16 4:50 PM, Marco Leise wrote:
 What about using old string compression algorithms on the
 mangled name? That should effectively get rid of all the
 repeated words.

It's somewhat recursive, so I don't know how well string compression 
will work. It's possible, definitely. But I think a substitution scheme 
that is tailored to the issue may prove more effective. Just the fact 
that removing the struct out of the function itself cleans it up shows 
that there is a path to make this happen.

-Steve

Apr 19 2016

Johan Engelen <j j.nl> writes:

On Tuesday, 19 April 2016 at 17:59:48 UTC, Steven Schveighoffer 
wrote:
 On 4/19/16 10:44 AM, Johan Engelen wrote:

 The feature is experimental, and has been tested on Weka.io's 
 codebase.
 Compilation with -hashthres=1000 results in a binary that is 
 half the
 size of the original (201MB vs. 461MB). I did not observe a 
 significant
 difference in total build times.

 I'd be surprised link times did not improve. With my trivial 
 test cases, the linker started getting hung up with very long 
 symbols. The compilation itself was quick though.

`nm` on my Mac cannot handle the long symbol names, and I have to 
use llvm-nm :)

 Certainly, cutting exe sizes (this is significantly more than 
 half, BTW) is a good thing in itself.

 Another thing, however, is memory usage of the compiler. My 
 understanding is that having large symbol names requires more 
 memory. How does that fare?

I don't know. Generating the (unhashed) mangled symbol names 
seems fast. LDC is using so much memory (>6 GB) that I think a 
few extra megabytes for symbol names is not noticeable (the 
symbol name is cached for functions; adding caching for other 
symbol mangled names did not change compile times but I didn't 
look at memory). What worries me is that std.traits is doing all 
that string processing for simple things like getting the storage 
class of a function parameter. When it starts to do CTFE on a 
string that is 1 MB, that's not good I think.

Apr 20 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 03/25/2016 11:40 AM, Steven Schveighoffer wrote:
 On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
 On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 Has this bug been submitted? -- Andrei

 I'm not sure it's a bug that can be fixed. It's caused by the design of
 the way template name mangling is included.

 I can submit a general "enhancement", but I don't know what it would
 say? Make template mangling more efficient? :)

 I suppose having a bug report with a demonstration of why we should
 change it is a good thing. I'll add that.

 -Steve

Compression of template names. -- Andrei

Mar 25 2016

Anon <anon anon.anon> writes:

On Saturday, 26 March 2016 at 05:22:56 UTC, Andrei Alexandrescu 
wrote:
 On 03/25/2016 11:40 AM, Steven Schveighoffer wrote:
 On 3/25/16 11:07 AM, Andrei Alexandrescu wrote:
 On 3/25/16 10:07 AM, Steven Schveighoffer wrote:
 We should actually be moving *away from* voldemort types:

 https://forum.dlang.org/post/n96k3g$ka5$1 digitalmars.com

 Has this bug been submitted? -- Andrei

 I'm not sure it's a bug that can be fixed. It's caused by the 
 design of
 the way template name mangling is included.

 I can submit a general "enhancement", but I don't know what it 
 would
 say? Make template mangling more efficient? :)

 I suppose having a bug report with a demonstration of why we 
 should
 change it is a good thing. I'll add that.

 -Steve

 Compression of template names. -- Andrei

Would literally not help. The problem in the bug report is 
recursive expansion of templates creating mangled name length 
O(2^n) where n is the number of recursive calls. If you compress 
that to an eighth of its size, you get O(2^(n-3)), which isn't 
actually fixing things, as that is still O(2^n). The 
(conceptually) simple change I suggested brings the mangled name 
length down to O(n).

You could compress *that*, but then you are compressing such a 
small amount of data most compression algorithms will cause the 
size to grow, not shrink.

Mar 26 2016

Johan Engelen <j j.nl> writes:

On Saturday, 26 March 2016 at 17:42:06 UTC, Anon wrote:
 The (conceptually) simple change I suggested brings the mangled 
 name length down to O(n).

Hi Anon,
   I've started implementing your idea. But perhaps you already 
have a beginning of an implementation? If so, please contact me :)
https://github.com/JohanEngelen

Thanks,
   Johan

Mar 31 2016

Anon <anon anon.anon> writes:

On Thursday, 31 March 2016 at 11:15:18 UTC, Johan Engelen wrote:
 Hi Anon,
   I've started implementing your idea. But perhaps you already 
 have a beginning of an implementation? If so, please contact me 
 :)
 https://github.com/JohanEngelen

 Thanks,
   Johan

No, I haven't started implemented things for that idea. The 
experiments I did with it were by manually altering mangled names 
in Vim.

I've been spending my D time thinking about potential changes to 
how template string value parameters are encoded. My code is a 
bit messy (and not integrated with the compiler at all), but I 
use a bootstring technique (similar to Punycode[1]) to encode 
Unicode text using only [a-zA-Z0-9_].

The results are always smaller than base16 and base64 encodings. 
For plain ASCII text, the encoding tends to grow by a small 
amount. For text containing larger UTF-8 code points, the 
encoding usually ends up smaller than the raw UTF-8 string.

A couple examples of my encoder at work:

---
some_identifier
some_identifier_

/usr/include/d/std/stdio.d
usrincludedstdstdiod_jqacdhbd

Hello, World!
HelloWorld_0far4i


こんにちは世界 (UTF-8: 21 bytes)
XtdCDr5mL02g3rv (15 bytes)
---

I still need to clean up the encoder/decoder and iron out some 
specifics on how this could fit into the mangling, but I should 
have time to work on this some more later today/tomorrow.

[1]: https://en.wikipedia.org/wiki/Punycode

Mar 31 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Thursday, 31 March 2016 at 16:38:59 UTC, Anon wrote:
 I've been spending my D time thinking about potential changes 
 to how template string value parameters are encoded.


How does it compare to simply gzipping the string and writing it 
out with base62?

Mar 31 2016

Anon <anon anon.anon> writes:

On Thursday, 31 March 2016 at 16:46:42 UTC, Adam D. Ruppe wrote:
 On Thursday, 31 March 2016 at 16:38:59 UTC, Anon wrote:
 I've been spending my D time thinking about potential changes 
 to how template string value parameters are encoded.


 How does it compare to simply gzipping the string and writing 
 it out with base62?

My encoding is shorter in the typical use case, at least when 
using xz instead gzip. (xz was quicker/easier to get raw 
compressed data without a header.)

1= Raw UTF-8, 2= my encoder, 3= `echo -n "$1" | xz -Fraw | base64`

---
1. some_identifier
2. some_identifier_
3. AQA0c29tZV9pZGVudGlmaWVyAA==

1. /usr/include/d/std/stdio.d
2. usrincludedstdstdiod_jqacdhbd
3. AQAZL3Vzci9pbmNsdWRlL2Qvc3RkL3N0ZGlvLmQa

1. Hello, World!
2. HelloWorld_0far4i
3. AQAMSGVsbG8sIFdvcmxkIQA=

1. こんにちは世界
2. XtdCDr5mL02g3rv
3. AQAU44GT44KT44Gr44Gh44Gv5LiW55WMAA==
---

The problem is that compression isn't magical, and a string needs 
to be long enough and have enough repetition to compress well. If 
it isn't, compression causes the data to grow, and base64 
compounds that. For the sake of fairness, let's also do a larger 
(compressible) string.

Input: 1000 lines, each with the text "Hello World"

1. 12000 bytes
2. 12008 bytes
3. 94 bytes

However, my encoding is still fairly compressible, so we *could* 
route it through the same compression if/when a symbol is 
determined to be compressible. That yields 114 bytes.

The other thing I really like about my encoder is that plain C 
identifiers are left verbatim visible in the result. That would 
be especially nice with, e.g., opDispatch.

Would a hybrid approach (my encoding, optionally using 
compression when it would be advantageous) make sense? My encoder 
already has to process the whole string, so it could do some sort 
of analysis to estimate how compressible the result would be. I 
don't know what that would look like, but it could work.

Alternately, we could do the compression on whole mangled names, 
not just the string values, but I don't know how desirable that 
is.

Mar 31 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Thursday, 31 March 2016 at 17:30:44 UTC, Anon wrote:
 My encoding is shorter in the typical use case


Yeah, but my thought is the typical use case isn't actually the 
problem - it is OK as it is. Longer strings are where it gets 
concerning to me.

 Would a hybrid approach (my encoding, optionally using 
 compression when it would be advantageous) make sense?

Yeah, that might be cool too.

 Alternately, we could do the compression on whole mangled 
 names, not just the string values, but I don't know how 
 desirable that is.

There are often a lot of repeats in there... so maybe.

Mar 31 2016

Anon <anon anon.anon> writes:

On Thursday, 31 March 2016 at 17:52:43 UTC, Adam D. Ruppe wrote:
 Yeah, but my thought is the typical use case isn't actually the 
 problem - it is OK as it is. Longer strings are where it gets 
 concerning to me.

Doubling the size of UTF-8 (the effect of the current base16 
encoding) bothers me regardless of string length. Especially when 
use of __MODULE__ and/or __FILE__ as template arguments seems to 
be fairly common.

Having thought about it a bit more, I am now of the opinion that 
super-long strings have no business being in template args, so we 
shouldn't cater to them.

The main case with longer strings going into template arguments 
I'm aware of is for when strings will be processed, then fed to 
`mixin()`. However, that compile time string processing is much 
better served with a CTFE-able function using a Rope for 
manipulating the string until it is finalized into a normal 
string. If you are doing compile-time string manipulation with 
templates, the big symbol is the least of your worries. The 
repeated allocation and reallocation will quickly make your code 
uncompilable due to soaring RAM usage. The same is (was?) true of 
non-rope CTFE string manipulation. Adding a relatively 
memory-intensive operation like compression isn't going to help 
in that case.

Granted, the language shouldn't have a cut-off for string length 
in template arguments, but if you load a huge string as a 
template argument, I think something has gone wrong in your code. 
Catering to that seems to me to be encouraging it, despite the 
existence of much better approaches.

The only other case I can think of where you might want a large 
string as a template argument is something like:

```
struct Foo(string s)
{
     mixin(s);
}

Foo!q{...} foo;
```

But that is much better served as something like:

```
mixin template Q()
{
     mixin(q{...}); // String doesn't end up in mangled name
}

struct Foo(alias A)
{
     mixin A;
}

Foo!Q foo;
```

Or better yet (when possible):

```
mixin template Q()
{
     ... // No string mixin needed
}

struct Foo(alias A)
{
     mixin A;
}

Foo!Q foo;
```

The original mangling discussion started from the need to either 
fix a mangling problem or officially discourage Voldemort types. 
Most of the ideas we've been discussing and/or working on have 
been toward keeping Voldemort types, since many here want them. 
I'm not sure what use case would actually motivate compressing 
strings/symbols.

My motivations for bootstring encoding:

* Mostly care about opDispatch, and use of __FILE__/__MODULE__ as 
compile-time parameters. Symbol bloat from their use isn't 
severe, but it could be better.
* ~50% of the current mangling size for template string parameters
* Plain C identifier strings (so, most identifiers) will end up 
directly readable in the mangled name even without a demangler
* Retains current ability to access D symbols from C (in contrast 
to ideas that would use characters like '$' or '?')
* I already needed bootstring encoding for an unrelated project, 
and figured I could offer to share it with D, since it seems like 
it would fit here, too

Mar 31 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Thursday, 31 March 2016 at 20:12:34 UTC, Anon wrote:
 Having thought about it a bit more, I am now of the opinion 
 that super-long strings have no business being in template 
 args, so we shouldn't cater to them.

meh, if it is allowed, it is going to happen. Why make it worse 
when there's so little cost in making it better?

 The main case with longer strings going into template arguments 
 I'm aware of is for when strings will be processed, then fed to 
 `mixin()`.

I often don't actually modify the string at all and by putting 
the string as a template argument, it enables a kind of 
compile-time memoization like I talked about here a short while 
ago: http://stackoverflow.com/a/36251271/1457000

The string may be exceedingly if imported from a file or 
generated externally and cached:

MyType!(import("definition.txt")) foo;

enum foo = ctfeFunction();

MyType!foo test;


I've never had a huge problem with this in practice, but I've 
never had a huge problem with big names in practice at all 
either. I can imagine both though.

(what bothers me more than the mangle actually is the compiler 
error messages showing all those strings. Gross. I'd like to see 
XML error messages to make displaying them easier. But that's a 
separate topic.)

 The original mangling discussion started from the need to 
 either fix a mangling problem or officially discourage 
 Voldemort types.


Yes, indeed, that's probably the bigger problem.

 * Retains current ability to access D symbols from C (in 
 contrast to ideas that would use characters like '$' or '?')

$ is actually a valid identifier character in C (and quite a few 
other languages). You can use it in the linker as well as in C 
source code.

Mar 31 2016

Anon <anon anon.anon> writes:

On Thursday, 31 March 2016 at 20:40:03 UTC, Adam D. Ruppe wrote:
 meh, if it is allowed, it is going to happen. Why make it worse 
 when there's so little cost in making it better?

Define "little cost". Whatever compression algorithm chosen will 
need support added to any/all tools that want to demangle D. GDB 
and LLDB currently link to liblzma (on my system, at least. 
Configurations may vary). nm and objdump don't link to any 
compression lib. Good luck convincing binutils to add compression 
dependencies like that for D when they don't need them any other 
mangling schemes.

And no, ddemangle isn't a solution here, as then all those 
projects would need to be able to refer to it, and the best way 
for them to do that is to bundle it. Since ddemangle is written 
in D, that would mean binutils would suddenly depend on having a 
working D compiler. That won't happen in the next decade.

Also, any D code that uses core.demangle for any reason would 
suddenly depend on that compression library.

I'm not even fully convinced that my bootstring idea is low 
enough cost, and it's fairly simple, fast, and memory efficient 
compared to any compression algorithm.

 I often don't actually modify the string at all and by putting 
 the string as a template argument, it enables a kind of 
 compile-time memoization like I talked about here a short while 
 ago: http://stackoverflow.com/a/36251271/1457000

 The string may be exceedingly if imported from a file or 
 generated externally and cached:

 MyType!(import("definition.txt")) foo;

 enum foo = ctfeFunction();

 MyType!foo test;

Just assigning to an enum gets you memoization (tested on LDC w/ 
DFE 2.68.2).
I don't see how the template factors into this.

Now, yes, if you call the function directly multiple times 
assigning to different enums, it won't memoize those. And it 
doesn't work lazily how the SO question asked for, but this does:

enum lazily(string name: "foo") = ctfeFunction();

If you don't refer to lazily!"foo", ctfeFunction() never gets 
called. If you do, it gets called once, regardless of how many 
times you use lazily!"foo".

That gives you lazy memoization of any CTFEable function without 
ever needing the function parameters to become template 
parameters.

I'm not sure what MyType is, but that looks like a prime 
candidate for my previous post's mixin examples. If not, you 
could use "definition.txt" as its parameter, and have it import() 
as an implementation detail.

 $ is actually a valid identifier character in C

Nope. $ as an identifier character is a commonly supported 
extension, but code that uses it doesn't compile with `clang 
-std=c11 -Werror -pedantic`.

Mar 31 2016

David Nadlinger <code klickverbot.at> writes:

On Saturday, 26 March 2016 at 05:22:56 UTC, Andrei Alexandrescu 
wrote:
 Compression of template names. -- Andrei

Compression in the usual sense won't help. Sure, it might reduce 
the object file size, but the full string will again have to be 
generated first, still requiring absurd amounts time and space. 
The latter is definitely not negligible for symbol names of 
several hundreds of kilobytes; it shows up prominently in 
compiler profiles of affected Weka builds.

  — David

Mar 27 2016

Liran Zvibel <liran weka.io> writes:

On Sunday, 27 March 2016 at 17:01:39 UTC, David Nadlinger wrote:
 Compression in the usual sense won't help. Sure, it might 
 reduce the object file size, but the full string will again 
 have to be generated first, still requiring absurd amounts time 
 and space. The latter is definitely not negligible for symbol 
 names of several hundreds of kilobytes; it shows up prominently 
 in compiler profiles of affected Weka builds.

We love Voldemort types at Weka, and use them a lot in our 
non-gc-allocating ranges and algorithm libraries. Also, we 
liberally nest templates inside of other templates.
I don't think we can do many of the things we do if we had to 
define everything at module level. This flexibility is amazing 
for us and part of the reason we love D.

But, as David said -- it comes with a great price for us.

I just processed our biggest executable, and came up with the 
following numbers:
total symbols: 99649
Symbols longer than 1k: 9639
Symbols longer than 500k: 102
Symbols longer than 1M: 62. The longest symbols are about 5M 
bytes!

This affects our exe sizes in a terrible way, and also increases 
our compile and link times considerably. I will only be able to 
come up with statistics of how much time was wasted due to 
too-long-symbols after we fix it, but obviously this is a major 
problem for us.

I think we should try the solution proposed by Anon, as it has a 
good possibility of saving quite a bit.
It's important to make sure that when a template is given as a 
template parameter, the complete template is treated as the LName.

Thinking about the compression idea by Andrei, I think we get 
such long names since we have huge symbols that are being passed 
as Voldemort names to template parameters. Then we repeat the 
huge symbols several times in the new template.
Think of a .5M symbol passed few times to a template, this is 
probably how we get to 5M size symbols.
This could end up being too complex, but if we assign "huffman 
coding" like names to the complete template names in a module 
scope (lets say, only if the template name is longer than 30 
bytes), we then will be able to replace a very long string by the 
huffman coded version coupled with the LName+Number idea above, 
we will be able to shorten symbol names considerably.


solution, and then we can see if we also have to recursively 
couple it with huffman-coding of the results template names.

Liran

Mar 30 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/30/16 3:19 PM, Liran Zvibel wrote:
 On Sunday, 27 March 2016 at 17:01:39 UTC, David Nadlinger wrote:
 Compression in the usual sense won't help. Sure, it might reduce the
 object file size, but the full string will again have to be generated
 first, still requiring absurd amounts time and space. The latter is
 definitely not negligible for symbol names of several hundreds of
 kilobytes; it shows up prominently in compiler profiles of affected
 Weka builds.

 We love Voldemort types at Weka, and use them a lot in our
 non-gc-allocating ranges and algorithm libraries. Also, we liberally
 nest templates inside of other templates.
 I don't think we can do many of the things we do if we had to define
 everything at module level. This flexibility is amazing for us and part
 of the reason we love D.

Voldemort types are what cause the bloat, templates inside templates 
aren't as much of a problem. It's because the Voldemort type has to 
include in its symbol name at least twice, and I think 3 times actually 
(given the return type), the template parameter/function parameter types 
of the function it resides in. If the template is just a template, it's 
just included once. This is why moving the type outside the function is 
effective at mitigation. It's linear growth vs. exponential.

I too like Voldemort types, but I actually found moving the types 
outside the functions quite straightforward. It's just annoying to have 
to repeat the template parameters. If you make them private, then you 
can simply avoid all the constraints. It's a bad leak of implementation, 
since now anything in the file has access to that type directly, but 
it's better than the issues with voldemort types.

See the update to my iopipe library here: 
https://github.com/schveiguy/iopipe/commit/1b0696dc82fce500c6b314ec3d8e5e11e0c1bcd7

This one commit made my example program 'convert' 
(https://github.com/schveiguy/iopipe/blob/master/examples/convert/convert.d) 
save over 90% binary size (went from 10MB to <1MB).

This also calmed down some REALLY horrible stack traces when I was 
debugging. As in, I could actually understand what function it was 
talking about, and it didn't take 10 seconds to print stack trace.

 But, as David said -- it comes with a great price for us.

 I just processed our biggest executable, and came up with the following
 numbers:
 total symbols: 99649
 Symbols longer than 1k: 9639
 Symbols longer than 500k: 102
 Symbols longer than 1M: 62. The longest symbols are about 5M bytes!

 This affects our exe sizes in a terrible way, and also increases our
 compile and link times considerably. I will only be able to come up with
 statistics of how much time was wasted due to too-long-symbols after we
 fix it, but obviously this is a major problem for us.

 From my testing, it doesn't take much to get to the point where the 
linker is unusable. A simple struct when nested in 15 calls to a 
function makes the linker take an unreasonable amount of time (over 1.5 
minutes, I didn't wait to see how long). See my bug report for details.

Another factor in the name length is the module name which is included 
in every type and function. So you have a factor like 3^15 for the name, 
but then you multiply this by the module names as well.

 I think we should try the solution proposed by Anon, as it has a good
 possibility of saving quite a bit.
 It's important to make sure that when a template is given as a template
 parameter, the complete template is treated as the LName.

I hope this is given serious thought, looks like someone has already 
started implementation.

Anon, it appears that your mechanism has been well received by a few 
knowledgeable people here. I encourage you to solidify your proposal in 
a DIP (D improvement proposal) here: http://wiki.dlang.org/DIPs.

-Steve

Mar 31 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Thursday, 31 March 2016 at 13:10:49 UTC, Steven Schveighoffer 
wrote:
 Voldemort types are what cause the bloat, templates inside 
 templates aren't as much of a problem.


So here's an idea of other things don't work out: voldemort types 
don't have a name that can be said... that could be true in the 
mangle too.

We could potentially just hash it to some fixed length. Take the 
existing name, SHA1 it, and call the mangle 
function_containing_type$that_hash. Demangling the inside is 
useless anyway, so we lose nothing from that.

For example, a function foo.bar() returning a voldemort type 
might have the mangle:

_D3foo3barv$55ca6286e3e4f4fba5d0448333fa99fc5a404a73

where that's the hash of whatever filth the compiler makes up for 
it.

and methods on the voldemort type would start with that too:

_D503foo3barv$55ca6286e3e4f4fba5d0448333fa99fc5a404a735empty

That's getting messy to see as an example, but it took the name 
of the function + the voldemort as a whole to be the public name 
of the type.

A demangler could scan that for the $ and recognize it as 
ugliness and slice it off, printing the name as a somewhat 
human-readable

foo.bar().empty

perhaps, so we can see it is a method on the return value of that 
function. It isn't completely impossible like demanging a whole 
hash; should be short in the binary and readable enough to make 
sense of in a stack trace.



I actually recall something like this being done at some point in 
the past, but I don't remember the details. I think it just 
truncated the name though, it didn't attempt to remain 
semi-reversible.



Of course, a chain of voldemort types will still include that 
type as a template argument in the next call, so names as a whole 
can still be very long, but how long are your chains? I can see 
this becoming a 10 KB long name in extreme circumstances (still 
yikes) but not megabytes.


And, still, having to run over the names to build the hash is 
going to be a memory/speed thing in the compiler, but SHA1ing a 
few megabytes isn't as slow as writing it out to disk over and 
over again.



Another thing to consider is what my D to Javascript thing did: 
abbreviate EVERYTHING in the object file. You can't demangle that 
at all, but it leads to much smaller binaries. This probably 
isn't that useful for D in general, but might solve Weka's 
problem since they can use internal hacks.

Mar 31 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/31/16 9:38 AM, Adam D. Ruppe wrote:
 On Thursday, 31 March 2016 at 13:10:49 UTC, Steven Schveighoffer wrote:
 Voldemort types are what cause the bloat, templates inside templates
 aren't as much of a problem.


 So here's an idea of other things don't work out: voldemort types don't
 have a name that can be said... that could be true in the mangle too.

 We could potentially just hash it to some fixed length. Take the
 existing name, SHA1 it, and call the mangle
 function_containing_type$that_hash. Demangling the inside is useless
 anyway, so we lose nothing from that.

Ugh, let's try the huffman coding thing first :) Stack traces would be 
unusable.

However, this does seem promising if that doesn't work out. The hash 
would definitely solve the linking and binary size issue.

A possible thing the compiler *could* do is place inside the binary a 
hash-to-actual-symbol table that the exception printer can utilize to 
print a nicer stack trace...

-Steve

Mar 31 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Thursday, 31 March 2016 at 14:00:38 UTC, Steven Schveighoffer 
wrote:
 Ugh, let's try the huffman coding thing first :)

Do that after and along with!

 Stack traces would be unusable.

That's why I'd keep the function name outside the hash. The inner 
spam wouldn't be readable (not like a megabyte long name is 
readable anyway...), but the function name (which includes 
template arguments*) still is and that's probably the most useful 
part anyway.

* I just thought of another thing though.... string arguments to 
templates are included in the mangle, and with CTFE mixin stuff, 
they can become VERY long.

void foo(string s)() {}

pragma(msg, foo!"hi there, friend".mangleof);

_D1p48__T3fooVAyaa16_68692074686572652c20667269656e64Z3fooFNaNbNiNfZv

That "68692074686572652c20667269656e64" portion of it is the 
string represented as hexadecimal.

If you do a CTFE thing, the code string you pass in may be kept 
in ALL the names generated from it.

We should probably do something about these too. Simply gzipping 
before converting to hex is a possible option if we want it 
reversible. Or, of course, hashing too if we don't care about 
that.

And IMO a huge string in a stack trace is unreadable anyway... 
I'd be tempted to say if the string is longer than like 64 chars, 
just hash it. But this is debatable.

 A possible thing the compiler *could* do is place inside the 
 binary a hash-to-actual-symbol table that the exception printer 
 can utilize to print a nicer stack trace...

Indeed. I actually just emailed Liran with this suggestion as 
well. I don't think it is ideal for D in general, but for an 
internal project, it might solve some problems.

Mar 31 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/31/16 10:30 AM, Adam D. Ruppe wrote:
 On Thursday, 31 March 2016 at 14:00:38 UTC, Steven Schveighoffer wrote:
 Ugh, let's try the huffman coding thing first :)

 Do that after and along with!

 Stack traces would be unusable.

 That's why I'd keep the function name outside the hash. The inner spam
 wouldn't be readable (not like a megabyte long name is readable
 anyway...), but the function name (which includes template arguments*)
 still is and that's probably the most useful part anyway.

Moving types outside the function actually results in a pretty readable 
stack trace.

I noticed the stack trace printer just gives up after so many characters 
anyway.

-Steve

Mar 31 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/31/16 9:38 AM, Adam D. Ruppe wrote:

 Of course, a chain of voldemort types will still include that type as a
 template argument in the next call, so names as a whole can still be
 very long, but how long are your chains? I can see this becoming a 10 KB
 long name in extreme circumstances (still yikes) but not megabytes.

If you are hashing anyway, just hash the hash :) In other words, this 
function:

auto foo(T)(T t)
{
    static struct R {}
    return R;
}

would result in a return type of a hash, with no care about whether T 
was a hashed symbol or not. This puts a cap on the size of the name, it 
would never get that big.

-Steve

Mar 31 2016

jmh530 <john.michael.hall gmail.com> writes:

On Thursday, 31 March 2016 at 13:10:49 UTC, Steven Schveighoffer 
wrote:
 I too like Voldemort types, but I actually found moving the 
 types outside the functions quite straightforward. It's just 
 annoying to have to repeat the template parameters. If you make 
 them private, then you can simply avoid all the constraints. 
 It's a bad leak of implementation, since now anything in the 
 file has access to that type directly, but it's better than the 
 issues with voldemort types.

If you move anything with a Voldemort type to their own modules, 
then do what you say, then there is no longer an access issue. 
Leads to a proliferation of modules.

I can think of another alternative, but it is probably a needless 
complexity. Suppose there is a protection attribute with the 
property that things in the module can only access it if given 
permission explicitly. For instance, taking the D wiki Voldemort 
type example and modifying it to your approach would give

struct TheUnnameable
{
	int value;
	this(int x) {value = x;}
	int getValue() { return value; }
}

auto createVoldemortType(int value)
{
     return TheUnnameable(value);
}

The Unnameable would then be changed to

explicit struct TheUnnameable
{
	explicit(createVoldemortType);
	int value;
	this(int x) {value = x;}
	int getValue() { return value; }
}

where explicit used as a protection attribute would restrict 
TheUnnameable to only things where the explicit function gives 
explicit permission.

Mar 31 2016

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Thursday, March 31, 2016 09:10:49 Steven Schveighoffer via Digitalmars-d 
wrote:
 On 3/30/16 3:19 PM, Liran Zvibel wrote:
 On Sunday, 27 March 2016 at 17:01:39 UTC, David Nadlinger wrote:
 Compression in the usual sense won't help. Sure, it might reduce the
 object file size, but the full string will again have to be generated
 first, still requiring absurd amounts time and space. The latter is
 definitely not negligible for symbol names of several hundreds of
 kilobytes; it shows up prominently in compiler profiles of affected
 Weka builds.

 We love Voldemort types at Weka, and use them a lot in our
 non-gc-allocating ranges and algorithm libraries. Also, we liberally
 nest templates inside of other templates.
 I don't think we can do many of the things we do if we had to define
 everything at module level. This flexibility is amazing for us and part
 of the reason we love D.

 Voldemort types are what cause the bloat, templates inside templates
 aren't as much of a problem. It's because the Voldemort type has to
 include in its symbol name at least twice, and I think 3 times actually
 (given the return type), the template parameter/function parameter types
 of the function it resides in. If the template is just a template, it's
 just included once. This is why moving the type outside the function is
 effective at mitigation. It's linear growth vs. exponential.

This makes me wonder if there would be a sane way to basically treat the
Voldemort type as if it were outside the function it was declared from a
mangling standpoint while still treating its accessibility the same way.
Maybe that wouldn't work due to potential name clashes, but if the problem
is purely because the type is declared inside the function, I would think
that it would make sense to try and find a way to treat it more like it was
declared outside the function when dealing with the name mangling. Then - in
theory - you'd get the benefits of using a Voldemort type while getting the
name mangling cost that you get when you declare a private type outside of
the function.

- Jonathan M Davis

Mar 31 2016

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Mar 25, 2016 at 11:37:24AM +0000, Seb via Digitalmars-d wrote:
 If I understand it correctly, the current policy in Phobos is that
 range methods should use static nested structs to avoid the name
 clutter and document the capabilities of the returned ranges in the
 documentation.  However there are a lot of old functions that still
 use public structs, which leads to the rather confusing documentation
 output in std.range:

Actually, certain types of range structs *have* to be in module space
rather than inside the function, because of compiler limitations /
language restrictions. An example that comes to mind is a range with an
alias parameter -- I forget the details, but basically there are some
cases where this will not work correctly with certain uses if it's
declared inside the function, so it has to be in module space.

This should probably be investigated more, though. It seems to be a grey
area of semantics that could use a DIP.


[...]
 (off-topic: it's quite interesting to see that sometimes the structs
 are before the method and sometimes after it)

I think it's a recent change in convention that recommends declaring
such structs after the function. In any case, declaration order in D is
(generally -- mostly in module space) not important, so it's not a big
deal.


 Two arguments to keep exposing the structs are that (1) an API user
 can explicitly specify the return type (in contrast to auto) and (2)
 one can see the capabilities of the struct in the documentation.

I think (1) is moot, because there's always typeof and ReturnType.  In
fact, I'd argue that it's better to use typeof / ReturnType, because
sometimes the user *shouldn't* need to know exactly what the template
arguments are. For example, if some of the template arguments come from
IFTI, where the exact types may not be immediately obvious from the
user's POV, or if there are default template parameters that are a pain
to spell out every single time.

I argue that (2) is a bad idea, because a range ought to be opaque. The
whole point of the range API is that it should be possible for the
implementation to change drastically or be replaced by a different
implementation, yet user code should still continue to Just Work(tm)
without any modifications.  As such, user code should not depend on
implementation details of the range, and really shouldn't know anything
else about the range other than what is specified in the range API.  All
the user ought to know is whether it's an input range, forward range,
bidirectional range, etc.. Anything more than that leads to breakage
when the range implementation is updated/replaced, which breaks the
whole premise of using ranges in the first place.


 There are many cases where methods in these structs are optional and
 depend on the capabilities of the input range (e.g. backward, length,
 random access, ...). I could imagine that
 
 1) We rework ddoc, s.t. it doesn't list these structs in the overview
 and adds the struct methods to the function (could get ugly)

No. The docs of the function should simply state what kind of range it
returns -- input, forward, bidirectional, or random access. If it
depends on what the function is given, the docs should explain under
what conditions the function will return which kind of range.  Listing
individual range methods for every range function in Phobos is a lot of
needless repetition, and is error-prone.  If anything, we should write a
page that explains exactly what methods are available to each kind of
range, and just link to that from the function docs. That's what
hyperlinks are for.


 2) We deprecate those exposed structs and make them private/nested
 (does anyone know whether it's common to depend on those structs?)

[...]

I don't know if it's common, but I *have* seen people spell out the type
explicitly.  So changing this now will probably break existing code, and
people will likely be resistant to that.  (Arguably, though, it would be
for the better -- users really shouldn't need to know the exact range
type.)


T

-- 
Winners never quit, quitters never win.
But those who never quit AND never win are idiots.

Mar 25 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Policy for exposing range structs