
digitalmars.D - C++ concepts, templates, reducing code bloat

reply bearophile <bearophileHUGS lycos.com> writes:
An interesting thread on Reddit:
http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/


In the meantime, while I was doing some testing, Walter had already answered a
question raised by "deong":
Possibly. I don't really know D, and I'm not completely in the know on C++
concepts either. One objection to the approach you've demonstrated versus C++
concepts is that your code is littered with the checks. The concepts way would
define the constraints externally, and client code would simply use the defined
name of the concept. You could mix concepts freely without needing to figure
out how to assert the union of all the constraints. But since D already has at
least basic functionality, and C++ apparently isn't going to get it anytime
soon, I say you can feel as smug as you like. :)<

The code shown by Downs there doesn't work:

bool compare(T)(T left, T right) if (is(typeof(left < right))) {
    return left < right;
}

because of this bug:
http://d.puremagic.com/issues/show_bug.cgi?id=2782

This is an answer to deong:

import std.stdio: writeln;

template Comparable(T) {
    enum bool Comparable = is(typeof(T.init < T.init));
}

bool compare(T)(T left, T right) if (Comparable!T) {
    return left < right;
}

void main() {
    writeln(compare(5, 6), " ", compare(5, 6));
    writeln(compare("hello", 6), " ", compare(5, "hello")); // rejected at compile time
}

(__traits(compiles, ...) can also be used.)

----------------------

Related to the compilation and use of generic code, various languages work in
different ways.

D/C++ (and ShedSkin, using full type inference) compile a different function
for each different set of template types.

Java uses a single function: even though it now has generics, at run time
there's a single function that uses Objects.

Haskell used to (and maybe still does) create a dictionary for each generic
function, and then generate code (even at runtime, lazily, if necessary) for
each set of input types; that dictionary is then used at runtime to find the
right compiled block of code for the function (this is not dynamic typing, but
it doesn't look that far from it to me).

Dynamic languages like Python, Ruby and JavaScript (and Common Lisp, sometimes)
keep a single version of the function, and each datum carries its type at
runtime. This is a low-performance way to implement generic functions, but it's
also simple to implement, use and understand.

Just-in-time compilers for Python (Psyco), for JavaScript (TraceMonkey, V8,
etc.; they use very different strategies: TraceMonkey is closer to what I am
describing here, while V8 uses something that's worse and better at the same
time) and for Lua create new instances of the functions at runtime, and then
use them.
C# acts in yet another interesting way, not too far from the Psyco strategy:
http://www.osl.iu.edu/publications/prints/2006/Gregor06:Concepts.pdf
The adoption of the template inclusion model of compilation precludes separate
compilation for C++ templates. In this C++ constrained generics (concepts)
differ from C# and Java generics. In C# and Java the default is that all
instances of a generic method share the same native code. However, C# has some
flavor of the instantiation model: different code is generated for each
different instantiation of a generic method whose type arguments are value
types. To attain apparent separate compilation, this instantiation-specific
code is generated at run time.<

In theory the D strategy is the one that leads to the fastest code at runtime
(and doesn't require a VM as in the C# case), but the binary has to contain all
the compiled versions of a templated function, and this can grow the binary
size a lot. A big binary can be a problem also because more code has to come
through the code half of the L1 cache (just 32 KB), and this can and does
reduce performance.

I can see two possible solutions to this problem:

- Many functions in D programs aren't performance-critical. If you "compile"
one of those templated functions into a single dynamically typed function, the
performance of the code doesn't change, while the binary size is kept low. I
think this is what the Java HotSpot VM does. LLVM, which is behind LDC, may be
able to do this, and it can probably also allow templates to be instantiated at
run time, but this looks more fit for a VM-based language, something different
from the current D2 design (because it's generally not easy to know at compile
time which templates must be compiled and which ones can be implemented as a
dynamically typed piece of code; you need profile-guided compilation, or a VM
that monitors the code at runtime as in the Java/C# case).

- Another possible solution is to find identical asm code shared by the
different functions coming from a single template, and put each such piece of
code only once inside the compiled binary. This requires splitting the
functions into sections joined by unconditional jumps. You may think such jumps
slow down the code (and sometimes they surely do, so you have to avoid jumping
in the middle of inner loops). Some time ago I read an article about this; it
shows that this strategy also reduces binary size, which reduces L1 code-cache
misses enough to usually balance the slowdown caused by the jumps (in many
examples there was even a performance gain).
To show an extreme and unrealistic example I have created this Python program,
which defines a D function template and then calls it with every combination of
double and integer arguments (2^n possibilities, here n = 12):

from itertools import product

template = """\
version(Tango) import tango.stdc.stdio: printf;
else import std.stdio: printf;
float sum(%s)(%s) {
    return %s;
}
void main() {
%s
}"""

n = 12
s1 = ", ".join("T%d" % i for i in xrange(n))
s2 = ", ".join("T%d x%d" % (i, i) for i in xrange(n))
s3 = " + ".join("x%d" % i for i in xrange(n))
template2 = """    printf("%%f\\n", sum%s);"""
s4 = "\n".join(template2 % str(args) for args in product([1, 1.5], repeat=n))
result = template % (s1, s2, s3, s4)
file("test2.d", "w").write(result)

Compilation timings for n = 12:

DMD 2.031 (Windows): compilation time, binary size, RAM used:
  Normal:   6.82    928_796   136
  -O:      17.03    947_228   138

LDC (Pubuntu): compilation time, binary size, RAM used:
  Normal:  28.7   2_601_228   198
  -O3:     26.7   1_115_888   159

On Pubuntu: 1_115_888 ==> strip ==> 721_908 ==> zip ==> 85_207 bytes

Later I have tried a shorter D program, with similar results (the binary was
even larger and the compilation time longer):

version(Tango) import tango.stdc.stdio: printf;
else import std.stdio: printf;

float sum(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11)
         (T0 x0, T1 x1, T2 x2, T3 x3, T4 x4, T5 x5,
          T6 x6, T7 x7, T8 x8, T9 x9, T10 x10, T11 x11) {
    return x0 + x1 + x2 + x3 + x4 + x5 +
           x6 + x7 + x8 + x9 + x10 + x11;
}

template Tuple(T...) { alias T Tuple; }

void main() {
    auto v = Tuple!(1, 1.5);
    foreach (i0; v)
    foreach (i1; v)
    foreach (i2; v)
    foreach (i3; v)
    foreach (i4; v)
    foreach (i5; v)
    foreach (i6; v)
    foreach (i7; v)
    foreach (i8; v)
    foreach (i9; v)
    foreach (i10; v)
    foreach (i11; v)
        printf("%f\n", sum(i0, i1, i2, i3, i4, i5,
                           i6, i7, i8, i9, i10, i11));
}

This second D program, once compiled, has asm made of (cleaned up a bit):

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTiZ3sumFdidiidddiddiZf:
        enter   4, 0
        mov     -4[EBP], EAX
        fild    dword ptr 044h[EBP]
        fadd    qword ptr 048h[EBP]
        fadd    qword ptr 03Ch[EBP]
        fadd    dword ptr 038h[EBP]
        fadd    dword ptr 034h[EBP]
        fadd    qword ptr 02Ch[EBP]
        fadd    qword ptr 024h[EBP]
        fadd    qword ptr 01Ch[EBP]
        fadd    dword ptr 018h[EBP]
        fadd    qword ptr 010h[EBP]
        fadd    qword ptr 8[EBP]
        fadd    dword ptr -4[EBP]
        leave
        ret     048h

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTdZ3sumFdidiidddidddZf:
        fild    dword ptr 048h[ESP]
        fadd    qword ptr 04Ch[ESP]
        fadd    qword ptr 040h[ESP]
        fadd    dword ptr 03Ch[ESP]
        fadd    dword ptr 038h[ESP]
        fadd    qword ptr 030h[ESP]
        fadd    qword ptr 028h[ESP]
        fadd    qword ptr 020h[ESP]
        fadd    dword ptr 01Ch[ESP]
        fadd    qword ptr 014h[ESP]
        fadd    qword ptr 0Ch[ESP]
        fadd    qword ptr 4[ESP]
        ret     050h

I don't think such functions share blocks of code that can be moved away. Real
code that contains real templates contains more block-redundancy.

Bye,
bearophile
Jul 20 2009
parent Don <nospam nospam.com> writes:
bearophile wrote:
 An interesting thread on Reddit:
 http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/
 
 Related to the compilation and use of generic code, various languages work in
different ways.
 D/C++ (and ShedSkin, using full type inference) compile a different function
for each different set of template types.
 
 Java uses a single function: even though it now has generics, at run time
there's a single function that uses Objects.

[...]
 In theory the D strategy is the one that leads to the fastest code at runtime
(and doesn't require a VM as in the C# case), but the binary has to contain all
the compiled versions of a templated function, and this can grow the binary
size a lot. A big binary can be a problem also because more code has to come
through the code half of the L1 cache (just 32 KB), and this can and does
reduce performance.
 
 I can see two possible solutions to this problem:
 - Many functions in D programs aren't performance-critical. If you "compile"
one of those templated functions into a single dynamically typed function, the
performance of the code doesn't change, while the binary size is kept low. I
think this is what the Java HotSpot VM does. LLVM, which is behind LDC, may be
able to do this, and it can probably also allow templates to be instantiated at
run time, but this looks more fit for a VM-based language, something different
from the current D2 design (because it's generally not easy to know at compile
time which templates must be compiled and which ones can be implemented as a
dynamically typed piece of code; you need profile-guided compilation, or a VM
that monitors the code at runtime as in the Java/C# case).
 - Another possible solution is to find identical asm code shared by the
different functions coming from a single template, and put each such piece of
code only once inside the compiled binary. This requires splitting the
functions into sections joined by unconditional jumps. You may think such jumps
slow down the code (and sometimes they surely do, so you have to avoid jumping
in the middle of inner loops). Some time ago I read an article about this; it
shows that this strategy also reduces binary size, which reduces L1 code-cache
misses enough to usually balance the slowdown caused by the jumps (in many
examples there was even a performance gain).

I commonly find that I need to use templates for integral types, and that's
only because of the silly implicit casting rules from C. For example, initially
I write:

foo(long x) { ... }

and this covers long, int, uint, short, ushort, byte and ubyte. But ulong needs
to be treated specially, so I add:

foo(ulong x) { ... }

But now, thanks to the silly implicit casting rules, foo(1) is ambiguous! So I
need to add overloads for all of the other integral types. That's ridiculous,
so I make foo() a template. And I get code bloat.

If there was a way to say:

foo(explicit ulong x) { ... } // MUST really be ulong, don't allow implicit integral conversions

then a huge fraction of the code-bloat templates would disappear. (For
'explicit', substitute any keyword of your choice.)
Jul 21 2009