www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Few ideas to reduce template bloat

reply bearophile <bearophileHUGS lycos.com> writes:
There are some ways to reduce the amount of binary code produced by templates
in a language like D.


1) There are few situations where the types are different but the asm
instructions inside the function are the same, like uint and int, but this
situation probably is not common enough.


2) Even in function templates that get instantiated to a different asm there
can be parts that are shared. If the shared part is at the end of the function
the compiler can use a jump to use a single function exit point and remove some
code duplication. I don't think this is a common situation.


3) There are more situations where a template type is one of several classes
(OOP), this can produce the same duplicated asm. In this case the compiler can
keep only one version in the final binary. This seems common enough.



run-time. D can of course do the same if it uses the LLVM backend, but let's
ignore this possibility for now.


5) Java generics don't cause code bloat because it removes the types at runtime
and uses boxed values. This can be type-safe, but it's less flexible than C++/D
templates, and the boxing-unboxing decreases performance. On the other hand
programming practice shows that in most programs not all functions have to be
top performance. So a compromise can be invented (I have never found this idea
elsewhere). An attribute like  boxed can be used to mark what function template
arguments are boxed (and don't multiply the template instances).

So for example this template function can be called with double or int values,
so it gets intantiated in four versions:

T2 foo(T1, T2)(T1 x, T2 y) {...}
Instantiation 1: foo(1, 2)
Instantiation 2: foo(1, 2.0)
Instantiation 3: foo(1.0, 2)
Instantiation 4: foo(1.0, 2.0)

If foo() is not performance-critical, then the same function template with one
 boxed type, and with the same inputs, gets instantiated just two times,
halving its combinatorial explosion:

 boxed(T2) foo(T1,  boxed T2)(T1 x, T2 y) {...}
Instantiation 1: foo(1, 2)  and  foo(1, 2.0)
Instantiation 2: foo(1.0, 2)  and  foo(1.0, 2.0)

The first argument is like a normal template argument and created different
instantiations, the second argument gets automatically boxed-unboxed, so it's
like a single type.


Inside this version of the function foo() instructions like:
static if typeof(T2 == int) {...

are not allowed by the compiler (despite they are formally correct) because
instances of T2 are objects and the compiler reminds you that you have to use a
run time cast:
if (cast(Integer)T2 !is null) {...


There are ways to reduce code bloat with idioms used by the programmer, but
that's good for C++ programmers. Programmers in a more modern language prefer
something safer done by the compiler :-)

Bye,
bearophile
Mar 25 2010
next sibling parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
bearophile wrote:

run-time. D can of course do the same if it uses the LLVM backend, but let's
ignore this possibility for now.
 
 5) Java generics don't cause code bloat because it removes the types at
runtime and uses boxed values. This can be type-safe, but it's less flexible
than C++/D templates, and the boxing-unboxing decreases performance. On the
other hand programming practice shows that in most programs not all functions
have to be top performance. So a compromise can be invented (I have never found
this idea elsewhere). An attribute like  boxed can be used to mark what
function template arguments are boxed (and don't multiply the template
instances).
Indeed, this is the very fundamental distinction between templates (C++) Templates: * have to be instantiated at compile time * usable for meta-programming * cannot be compiled or type-checked on its own --> not type-safe (i.e. if you don't set the constraints correctly, the instantiation may fail with an error message deep inside the template code) Generics: * Every template parameter must follow a clear interface. * generic code can be compiled and fully type-checked on its own * instantiation not needed at all but possible as optimization -> use boxed values without instantiation -> optionally instantiate at run-time/link-time/compile-time (analogous to function inlining) Though both concepts have a large overlap, it is difficult to bridge the gap without adding significant complexity. Some experimental languages try to get the best of both ways by smearing or even giving up the distinction between compile-time and run-time. Though these attempts are very promising for the future, they are clearly beyond the pragmatic goals of D. D is clearly designed as language that can be compiled to a static binary.
Mar 25 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Norbert Nemec:
D is clearly designed as language that can be compiled to a static binary.<
Just to be sure you have understood: the solution number 5 is for a statically compiled language. It doesn't need a JIT. You just need normal objects, a virtual table, etc. Bye, bearophile
Mar 25 2010
next sibling parent reply sebastien.f <sebastien noreply.com> writes:
This problem reminds me of a recent paper written by Bjarne Stroustrup (and
others) describing and evaluating a technique named “generalized hoisting”.
It’s applied  to the C++ STL iterators and removes lot of code bloat and
increase performances surprisingly.
I found it a great read (as often with Bjarne.S) I thought you may find it
interesting too.
It says the ideas apply to the D language too.

Minimizing Dependencies within Generic Classes for Faster and Smaller Programs.
http://www.research.att.com/~bs/SCARY.pdf
Mar 25 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
sebastien.f:
 I found it a great read (as often with Bjarne.S) I thought you may find it
interesting too.
 It says the ideas apply to the D language too.
It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-) Bye, bearophile
Mar 25 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/25/2010 03:17 PM, bearophile wrote:
 sebastien.f:
 I found it a great read (as often with Bjarne.S) I thought you may
 find it interesting too. It says the ideas apply to the D language
 too.
It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
That's a very, very nice hat tip from the authors. Andrei
Mar 25 2010
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 On 03/25/2010 03:17 PM, bearophile wrote:
 sebastien.f:
 I found it a great read (as often with Bjarne.S) I thought you may
 find it interesting too. It says the ideas apply to the D language
 too.
It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
That's a very, very nice hat tip from the authors.
Yes!
Mar 25 2010
prev sibling parent Steve Teale <steve.teale britseyeview.com> writes:
On Thu, 25 Mar 2010 16:48:14 -0500, Andrei Alexandrescu wrote:

 On 03/25/2010 03:17 PM, bearophile wrote:
 sebastien.f:
 I found it a great read (as often with Bjarne.S) I thought you may
 find it interesting too. It says the ideas apply to the D language
 too.
It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
That's a very, very nice hat tip from the authors. Andrei
Do we have a balanced tree implementation in Phobos these days? Steve
Mar 25 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
After reading that paper I can say that D compilers can start doing the simpler
thing: recognizing instantiations of a function template that are compiled to
the same asm code, and keep only one of them. The Microsoft compiler seems to
do that already.

Bye,
bearophile
Mar 25 2010
next sibling parent Bernard Helyer <b.helyer gmail.com> writes:
On 26/03/10 10:18, bearophile wrote:
 The Microsoft compiler seems to do that already.
Microsoft Visual D? =P
Mar 25 2010
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 After reading that paper I can say that D compilers can start doing the
 simpler thing: recognizing instantiations of a function template that are
 compiled to the same asm code, and keep only one of them. The Microsoft
 compiler seems to do that already.
Because of separate compilation, that can't be done with the compiler in the general case. The linker can, though.
Mar 26 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Because of separate compilation, that can't be done with the compiler in the 
 general case. The linker can, though.
LDC is able to do it already! I don't need 100% general solutions. Bye, bearophile
Mar 26 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 Because of separate compilation, that can't be done with the compiler in the 
 general case. The linker can, though.
LDC is able to do it already! I don't need 100% general solutions.
Are you sure LDC isn't doing it at the link step? Regardless, if you implement a half solution in the compiler, and then a full solution in the linker, you've doubled your effort for the same result. This may be a practical solution, however, if you desire to work with any linker, including ones you have no control over.
Mar 26 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Are you sure LDC isn't doing it at the link step?
This is the post where I have explained it: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=108275 As you can see from the link too llvm source code at the top of my post: https://llvm.org/svn/llvm-project/llvm/tags/RELEASE_25/lib/Transforms/IPO/MergeFunctions.cpp that's a LLVM optimization pass, and ldc is spitting out already optimized asm, so it's not something done by the linker.
 Regardless, if you implement a half solution in the compiler, and then a full 
 solution in the linker, you've doubled your effort for the same result.
Right. Is dmd linker able to do that? The Gold linker (http://en.wikipedia.org/wiki/Gold_%28linker%29 ) that I think in future ldc will use may be able to do that. But you are right, if D modules are compiled separately the compiler can't recognize many cases of duplications. Bye, bearophile
Mar 26 2010
prev sibling next sibling parent Norbert Nemec <Norbert Nemec-online.de> writes:
bearophile wrote:
 Norbert Nemec:
 D is clearly designed as language that can be compiled to a static binary.<
Just to be sure you have understood: the solution number 5 is for a statically compiled language. It doesn't need a JIT. You just need normal objects, a virtual table, etc.
My comment was geared towards languages that experiment with multi-stage compilation (or "specialization"). What I wanted to say: statically compiled languages typically distinguish between generics (which "exist" at run time as objects) and templates (which exist at compile time only). Most languages offer only one of both. Once you start merging both concepts you quickly end up with smearing the boundary between run-time and compile-time and depend on JIT technology or multi-stage compilation. D may of course use these techniques for optimization, but the language should not depend on them.
Mar 26 2010
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Just to be sure you have understood: the solution number 5 is for a
 statically compiled language. It doesn't need a JIT. You just need normal
 objects, a virtual table, etc.
With enough indirection, template instances can all be done with one function regardless of the actual argument types. What is lost with that approach, however, are all the benefits of inlining, constant folding, specialization, etc. In addition, CPUs tend to do a much poorer job with indirection than they do with inline code. D's templates are also more powerful than generics in that they can be used to templatize data and symbols, not just functions and classes. Generic types can also be more bloated because they have to carry around the indirect references, while templated types can specialize them away. For example, a templated hash-map can often eliminate storing the hash, while a generic one will have to keep it regardless.
Mar 28 2010
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Oh, one more thing. A template system can be used to create a user-defined 
generic system, but a generic system cannot be used to create a template system.
Mar 28 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 With enough indirection, template instances can all be done with one function 
 regardless of the actual argument types. What is lost with that approach, 
 however, are all the benefits of inlining, constant folding, specialization, 
 etc
My idea 5 was not to remove templates from D, and not even to add normal Java-like generics to D. And I know templates are usually faster and more flexible. My idea was to create something mixed, where you can choose what input arguments are boxed as in Java, and what ones are not boxed as in D/C++. So the purpose was not to replace normal D templates, but to add flexibility. Because I think there are templates that don't need max performance, in such case using a mixed template-generic (or full generic) can give a "good enough" performance while keeping the final binary smaller. Reducing binary size is good because it decreases the pressure on the code L1 cache, and this speeds up the program a little. So it's like a cross between a template and a generic. This is a normal D template: T1 foo(T1, T2, T3)(T1 x, T2, y, T3 z) {...} This is a full java-style generic, that will have only one implementation in the binary, x, y and z will be boxed inside foo: generic T1 foo( generic T1, generic T2, generic T3)(T1 x, T2, y, T3 z) {...} And this is something mixed, here T1 and T3 will be boxed but T2 will not, T2 is like a D template argument, so the compiler will instantiate as many different T2 it will find: generic T1 foo( generic T1, T2, generic T3)(T1 x, T2, y, T3 z) {...}
Generic types can also be more bloated because they have to carry around the
indirect references, while templated types can specialize them away.<
You are right, as always in life it's a matter of compromises and balances :-) If you instantiate a hybrid template like foo( generic T1, T2, generic T3) like with ten different T2 types, you can usually produce more code than a single (probably bigger) 100% generic like generic T1 foo( generic T1, generic T2, generic T3) :-) Bye, bearophile
Mar 28 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
I have asked to llvm devs, it seems llvm already has an optimization pass that
removes duplicated functions. But it's disabled on default, you can use it with
LDC with "-mergefunc".

Its implementation, probably written by nlewycky:
https://llvm.org/svn/llvm-project/llvm/tags/RELEASE_25/lib/Transforms/IPO/MergeFunctions.cpp

--------------------

I have done few tests in D1 with LDC, the first one, here -mergefunc (and no
other optimizations) leaves a single function:

import tango.stdc.stdio: printf;
import tango.stdc.stdlib: atoi;

int factorial1(int X) {
  if (X == 0) return 1;
  return X*factorial1(X-1);
}
int factorial2(int X) {
  if (X == 0) return 1;
  return X*factorial2(X-1);
}

void main(char[][] args) {
  printf("%d\n", factorial1(atoi(args[1].ptr)));
  printf("%d\n", factorial2(atoi(args[1].ptr)));
}

--------------------

Second test I've done:

import tango.stdc.stdio: printf;
import tango.stdc.stdlib: atoi;

T factorial(T)(T x) {
  if (x == 0)
    return 1;
  else
    return x * factorial(x - 1);
}

void main(char[][] args) {
  printf("%d\n", factorial(atoi(args[1].ptr)));
  printf("%d\n", factorial(cast(uint)atoi(args[1].ptr)));
}


The resulting asm with -mergefunc shows some duplication left (a person says
this is a bug that can be fixed, I have to talk with another developer):

__unnamed_1:
	subl	$4, %esp
	movl	%eax, (%esp)
	testl	%eax, %eax
	jne	.LBB2_2
	movl	$1, %eax
	addl	$4, %esp
	ret
.LBB2_2:
	movl	(%esp), %eax
	decl	%eax
	call	factorial!uint
	imull	(%esp), %eax
	addl	$4, %esp
	ret

factorial!int:
	subl	$4, %esp
	call	__unnamed_1
	addl	$4, %esp
	ret

factorial!uint:
	subl	$4, %esp
	call	__unnamed_1
	addl	$4, %esp
	ret

_Dmain:
	subl	$28, %esp
	movl	36(%esp), %eax
	movl	%eax, 20(%esp)
	movl	32(%esp), %eax
	movl	%eax, 16(%esp)
	cmpl	$1, 16(%esp)
	jbe	.LBB1_3
	movl	20(%esp), %eax
	movl	12(%eax), %eax
	movl	%eax, (%esp)
	call	atoi
	call	factorial!int
	movl	%eax, 4(%esp)
	movl	$.str, (%esp)
	call	printf
	cmpl	$1, 16(%esp)
	jbe	.LBB1_4
	movl	20(%esp), %eax
	movl	12(%eax), %eax
	movl	%eax, (%esp)
	call	atoi
	call	factorial!uint
	movl	%eax, 4(%esp)
	movl	$.str2, (%esp)
	call	printf
	xorl	%eax, %eax
	addl	$28, %esp
	ret	$8
.LBB1_3:
	movl	.modulefilename+4, %eax
	movl	%eax, 4(%esp)
	movl	.modulefilename, %eax
	movl	%eax, (%esp)
	movl	$12, 8(%esp)
	call	_d_array_bounds
.LBB1_4:
	movl	.modulefilename+4, %eax
	movl	%eax, 4(%esp)
	movl	.modulefilename, %eax
	movl	%eax, (%esp)
	movl	$13, 8(%esp)
	call	_d_array_bounds

----------------------------

My third experiment:

import tango.stdc.stdio: printf;
import tango.stdc.stdlib: atoi;

int factorial1(int X) {
  if (X == 0) return 1;
  return X*factorial1(X-1);
}
int factorial2(int X) {
  if (X == 0) return 1;
  return X*factorial2(X-1);
}

int caller(int function(int) func, int x) {
    return func(x);
}

void main() {
  printf("%d\n", caller(&factorial1, atoi("10")));
  printf("%d\n", caller(&factorial2, atoi("10")));
}


Its asm without mergefunc:

_D5temp310factorial1FiZi:
	subl	$4, %esp
	movl	%eax, (%esp)
	movl	(%esp), %eax
	cmpl	$0, %eax
	jne	.LBB1_2
	movl	$1, %eax
	addl	$4, %esp
	ret
.LBB1_2:
	movl	(%esp), %eax
	subl	$1, %eax
	call	_D5temp310factorial1FiZi
	movl	%eax, %ecx
	movl	(%esp), %eax
	imull	%ecx, %eax
	addl	$4, %esp
	ret

_D5temp310factorial2FiZi:
	subl	$4, %esp
	movl	%eax, (%esp)
	movl	(%esp), %eax
	cmpl	$0, %eax
	jne	.LBB2_2
	movl	$1, %eax
	addl	$4, %esp
	ret
.LBB2_2:
	movl	(%esp), %eax
	subl	$1, %eax
	call	_D5temp310factorial2FiZi
	movl	%eax, %ecx
	movl	(%esp), %eax
	imull	%ecx, %eax
	addl	$4, %esp
	ret

_D5temp36callerFPFiZiiZi:
	subl	$12, %esp
	movl	16(%esp), %ecx
	movl	%ecx, 8(%esp)
	movl	%eax, 4(%esp)
	movl	8(%esp), %ecx
	movl	4(%esp), %eax
	call	*%ecx
	addl	$12, %esp
	ret	$4

_Dmain:
	subl	$12, %esp
	leal	.str1, %eax
	movl	%eax, (%esp)
	call	atoi
	movl	$_D5temp310factorial1FiZi, (%esp)
	call	_D5temp36callerFPFiZiiZi
	subl	$4, %esp
	movl	$.str, (%esp)
	movl	%eax, 4(%esp)
	call	printf
	leal	.str3, %eax
	movl	%eax, (%esp)
	call	atoi
	movl	$_D5temp310factorial2FiZi, (%esp)
	call	_D5temp36callerFPFiZiiZi
	subl	$4, %esp
	movl	$.str2, (%esp)
	movl	%eax, 4(%esp)
	call	printf
	xorl	%eax, %eax
	addl	$12, %esp
	ret	$8


Its asm with mergefunc:

_D5temp310factorial1FiZi:
	subl	$4, %esp
	movl	%eax, (%esp)
	testl	%eax, %eax
	jne	.LBB1_2
	movl	$1, %eax
	addl	$4, %esp
	ret
.LBB1_2:
	movl	(%esp), %eax
	decl	%eax
	call	_D5temp310factorial1FiZi
	imull	(%esp), %eax
	addl	$4, %esp
	ret

_D5temp310factorial2FiZi:
	subl	$4, %esp
	call	_D5temp310factorial1FiZi
	addl	$4, %esp
	ret

_D5temp36callerFPFiZiiZi:
	subl	$12, %esp
	movl	16(%esp), %ecx
	movl	%ecx, 8(%esp)
	movl	%eax, 4(%esp)
	call	*8(%esp)
	addl	$12, %esp
	ret	$4
	.size	_D5temp36callerFPFiZiiZi, .-_D5temp36callerFPFiZiiZi

_Dmain:
	subl	$12, %esp
	movl	$.str1, (%esp)
	call	atoi
	movl	$_D5temp310factorial1FiZi, (%esp)
	call	_D5temp36callerFPFiZiiZi
	subl	$4, %esp
	movl	%eax, 4(%esp)
	movl	$.str, (%esp)
	call	printf
	movl	$.str3, (%esp)
	call	atoi
	movl	$_D5temp310factorial2FiZi, (%esp)
	call	_D5temp36callerFPFiZiiZi
	subl	$4, %esp
	movl	%eax, 4(%esp)
	movl	$.str2, (%esp)
	call	printf
	xorl	%eax, %eax
	addl	$12, %esp
	ret	$8


llvm is not able to replace all function pointers factorial2 with the ones to
factorial1, so turns factorial2 into a stump that just calls the factorial1. In
C/D you can compare function pointers so in this case LLVM has done what it
can. But can't here llvm replace factorial2 with just a jump instruction? I am
not expert enough yet to answer this. If someone has an answer for me I'd like
to hear it :-)

Bye,
bearophile
Mar 26 2010