
digitalmars.D - C++ concepts, templates, reducing code bloat

reply bearophile <bearophileHUGS lycos.com> writes:
An interesting thread on Reddit:
http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/


In the meantime, while I was doing some testing, Walter had already answered a
question raised by "deong":
Possibly. I don't really know D, and I'm not completely in the know on C++
concepts either. One objection to the approach you've demonstrated versus C++
concepts is that your code is littered with the checks. The concepts way would
define the constraints externally, and client code would simply use the defined
name of the concept. You could mix concepts freely without needing to figure
out how to assert the union of all the constraints. But since D already has at
least basic functionality, and C++ apparently isn't going to get it anytime
soon, I say you can feel as smug as you like. :)<

The code shown by Downs there doesn't work:

bool compare(T)(T left, T right) if (is(typeof(left < right))) {
    return left < right;
}

because of this bug:
http://d.puremagic.com/issues/show_bug.cgi?id=2782

This is an answer to deong:

import std.stdio: writeln;

template Comparable(T) {
    enum bool Comparable = is(typeof(T.init < T.init));
}

bool compare(T)(T left, T right) if (Comparable!T) {
    return left < right;
}

void main() {
    writeln(compare(5, 6), " ", compare(5, 6));
    writeln(compare("hello", 6), " ", compare(5, "hello")); // rejected at compile time
}

(__traits(compiles, ...) can also be used.)

----------------------

Related to the compilation and use of generic code, various languages work in
different ways.

D/C++ (and ShedSkin, using full type inference) compile a different function
for each different set of template types.

Java uses a single function: even though it now has generics, at run time
there's a single function that uses Objects.

Haskell used to (and maybe still does) create a dictionary for each generic
function, and then generate code (even at runtime, lazily, if necessary) for
each set of input types; that dictionary is then used at runtime to find the
right compiled block of code for the function (this is not dynamic typing, but
it doesn't look that far from it to me).

Dynamic languages like Python, Ruby and JavaScript (and Common Lisp, sometimes)
keep a single version of the function, and each datum carries its type at
runtime. This is a low-performance way to implement generic functions, but it's
also simple to implement, use and understand.

Just-in-time compilers for Python (Psyco), for JavaScript (TraceMonkey, V8,
etc.; they use very different strategies: TraceMonkey is closer to what I am
describing here, while V8 uses something that's worse and better at the same
time) and for Lua create new instances of the functions at runtime, and then
use them.
C# acts in yet another interesting way, not too far from the Psyco strategy:
http://www.osl.iu.edu/publications/prints/2006/Gregor06:Concepts.pdf
The adoption of the template inclusion model of compilation precludes separate
compilation for C++ templates. In this C++ constrained generics (concepts)
differ from C# and Java generics. In C# and Java the default is that all
instances of a generic method share the same native code. However, C# has some
flavor of the instantiation model: different code is generated for each
different instantiation of a generic method whose type arguments are value
types. To attain apparent separate compilation, this instantiation-specific
code is generated at run time.<

In theory the D strategy is the one that leads to the fastest code at runtime
(and doesn't require a VM as in the C# case), but the binary has to contain all
the compiled versions of a templated function, and this can grow the binary
size a lot. A big binary can be a problem also because more code has to come
through the code half of the L1 cache (just 32 KB), and this can and does
reduce performance.

I can see two possible solutions to this problem:

- Many functions in D programs aren't performance-critical. If you "compile"
one of those templated functions into a single dynamically typed function, the
performance of the code doesn't change, while the binary size is kept low. I
think this is what the Java HotSpot VM does. LLVM, which is behind LDC, may be
able to do this, and it can probably also allow templates to be instantiated at
run time, but this looks more fit for a VM-based language, something different
from the current D2 design (because it's generally not easy to know at compile
time which templates must be compiled and which ones can be implemented as a
dynamically typed piece of code; you need profile-guided compilation, or a VM
that monitors the code at runtime as in the Java/C# case).

- Another possible solution is to find identical asm code shared by the
different functions coming from a single template, and put each such piece of
code only once inside the compiled binary. This requires splitting the
functions into sections joined by unconditional jumps. You may think such jumps
slow down the code (and sometimes they surely do, so you have to avoid jumping
in the middle of inner loops). Some time ago I read an article about this; it
shows that this strategy also reduces binary size, which reduces L1 code-cache
misses enough to usually balance the slowdown caused by the jumps (in many
examples there was even a performance gain).
To show an extreme and unrealistic example I have created this Python program,
which defines a D function template and then calls it with every combination of
double and integer arguments (2^n possibilities, here n = 12):

from itertools import product

template = """\
version(Tango) import tango.stdc.stdio: printf;
else import std.stdio: printf;
float sum(%s)(%s) {
    return %s;
}
void main() {
%s
}"""

n = 12
s1 = ", ".join("T%d" % i for i in xrange(n))
s2 = ", ".join("T%d x%d" % (i, i) for i in xrange(n))
s3 = " + ".join("x%d" % i for i in xrange(n))
template2 = """    printf("%%f\\n", sum%s);"""
s4 = "\n".join(template2 % str(args) for args in product([1, 1.5], repeat=n))
result = template % (s1, s2, s3, s4)
file("test2.d", "w").write(result)

Compilation timings for n = 12:

DMD 2.031 (Windows): compilation time, binary size, RAM used:
  Normal:   6.82    928_796   136
  -O:      17.03    947_228   138

LDC (Pubuntu): compilation time, binary size, RAM used:
  Normal:  28.7   2_601_228   198
  -O3:     26.7   1_115_888   159

On Pubuntu: 1_115_888 ==> strip ==> 721_908 ==> zip ==> 85_207 bytes

Later I have tried a shorter D program, with similar results (the binary was
even larger and the compilation time longer):

version(Tango) import tango.stdc.stdio: printf;
else import std.stdio: printf;

float sum(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11)
         (T0 x0, T1 x1, T2 x2, T3 x3, T4 x4, T5 x5,
          T6 x6, T7 x7, T8 x8, T9 x9, T10 x10, T11 x11) {
    return x0 + x1 + x2 + x3 + x4 + x5 +
           x6 + x7 + x8 + x9 + x10 + x11;
}

template Tuple(T...) { alias T Tuple; }

void main() {
    auto v = Tuple!(1, 1.5);
    foreach (i0; v)
    foreach (i1; v)
    foreach (i2; v)
    foreach (i3; v)
    foreach (i4; v)
    foreach (i5; v)
    foreach (i6; v)
    foreach (i7; v)
    foreach (i8; v)
    foreach (i9; v)
    foreach (i10; v)
    foreach (i11; v)
        printf("%f\n", sum(i0, i1, i2, i3, i4, i5,
                           i6, i7, i8, i9, i10, i11));
}

This second D program, once compiled, has asm made of (cleaned up a bit):

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTiZ3sumFdidiidddiddiZf:
        enter   4, 0
        mov     -4[EBP], EAX
        fild    dword ptr 044h[EBP]
        fadd    qword ptr 048h[EBP]
        fadd    qword ptr 03Ch[EBP]
        fadd    dword ptr 038h[EBP]
        fadd    dword ptr 034h[EBP]
        fadd    qword ptr 02Ch[EBP]
        fadd    qword ptr 024h[EBP]
        fadd    qword ptr 01Ch[EBP]
        fadd    dword ptr 018h[EBP]
        fadd    qword ptr 010h[EBP]
        fadd    qword ptr 8[EBP]
        fadd    dword ptr -4[EBP]
        leave
        ret     048h

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTdZ3sumFdidiidddidddZf:
        fild    dword ptr 048h[ESP]
        fadd    qword ptr 04Ch[ESP]
        fadd    qword ptr 040h[ESP]
        fadd    dword ptr 03Ch[ESP]
        fadd    dword ptr 038h[ESP]
        fadd    qword ptr 030h[ESP]
        fadd    qword ptr 028h[ESP]
        fadd    qword ptr 020h[ESP]
        fadd    dword ptr 01Ch[ESP]
        fadd    qword ptr 014h[ESP]
        fadd    qword ptr 0Ch[ESP]
        fadd    qword ptr 4[ESP]
        ret     050h

I don't think such functions share blocks of code that can be moved away. Real
code that contains real templates contains more block-redundancy.

Bye,
bearophile
Jul 20 2009
parent Don <nospam nospam.com> writes:
bearophile wrote:
 An interesting thread on Reddit:
 http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/
 
 Related to the compilation and use of generic code, various languages work in
different ways.
 D/C++ (and ShedSkin, using full type inference) compile a different function
for each different set of template types.
 
 Java uses a single function: even though it now has generics, at run time
there's a single function that uses Objects.

[...]
 In theory the D strategy is the one that leads to the fastest code at runtime
(and doesn't require a VM as in the C# case), but the binary has to contain all
the compiled versions of a templated function, and this can grow the binary
size a lot. A big binary can be a problem also because more code has to come
through the code half of the L1 cache (just 32 KB), and this can and does
reduce performance.
 
 I can see two possible solutions to this problem:
 - Many functions in D programs aren't performance-critical. If you "compile"
one of those templated functions into a single dynamically typed function, the
performance of the code doesn't change, while the binary size is kept low. I
think this is what the Java HotSpot VM does. LLVM, which is behind LDC, may be
able to do this, and it can probably also allow templates to be instantiated at
run time, but this looks more fit for a VM-based language, something different
from the current D2 design (because it's generally not easy to know at compile
time which templates must be compiled and which ones can be implemented as a
dynamically typed piece of code; you need profile-guided compilation, or a VM
that monitors the code at runtime as in the Java/C# case).
 - Another possible solution is to find identical asm code shared by the
different functions coming from a single template, and put each such piece of
code only once inside the compiled binary. This requires splitting the
functions into sections joined by unconditional jumps. You may think such jumps
slow down the code (and sometimes they surely do, so you have to avoid jumping
in the middle of inner loops). Some time ago I read an article about this; it
shows that this strategy also reduces binary size, which reduces L1 code-cache
misses enough to usually balance the slowdown caused by the jumps (in many
examples there was even a performance gain).

I commonly find that I need to use templates for integral types, and that's
only because of the silly implicit casting rules from C. For example, initially
I write:

foo(long x) { ... }

and this covers long, int, uint, short, ushort, byte and ubyte. But ulong needs
to be treated specially, so I add:

foo(ulong x) { ... }

But now, thanks to the silly implicit casting rules, foo(1) is ambiguous! So I
need to add overloads for all of the other integral types. That's ridiculous,
so I make foo() a template. And I get code bloat.

If there was a way to say:

foo(explicit ulong x) { ... } // MUST really be ulong, don't allow implicit integral conversions

then a huge fraction of the code-bloat templates would disappear. (For
'explicit', substitute any keyword of your choice.)
Jul 21 2009