|
Archives
D Programming
digitalmars.Ddigitalmars.D.bugs digitalmars.D.dtl digitalmars.D.ide digitalmars.D.dwt digitalmars.D.announce digitalmars.D.learn digitalmars.D.debugger D.gnu D C/C++ Programming
c++c++.announce c++.atl c++.beta c++.chat c++.command-line c++.dos c++.dos.16-bits c++.dos.32-bits c++.idde c++.mfc c++.rtl c++.stl c++.stl.hp c++.stl.port c++.stl.sgi c++.stlsoft c++.windows c++.windows.16-bits c++.windows.32-bits c++.wxwindows digitalmars.empire digitalmars.DMDScript electronics |
digitalmars.D - Virtual methods
This post is long, I hope the web interface doesn't cut it down.
This post is about few simple experiments I've done in D about virtual calls
and how to improve their performance. So probably this can't be useful before
D2 is out of alpha state. For people working at Sun this stuff is probably very
old and very naive, but I think currently no D compilers are optimizing this,
and if you are not interested please ignore this post.
The performance problem caused by virtual calls is not the small amount of time
they require, but mostly the missing inlining they cause.
I have created a small synthetic benchmark, not realistic but useful to measure
this topic. The program creates a binary tree with 932_465 nodes, and then
scans it 600 times computing the sum of the values stored in the leaves,
incrementing such leave values each time.
The tree contains two different kinds of nodes, Internal and Leave, so the
walking on the tree is done with a virtual function. HotSpot shows a
significant performance improvement over the C++ code, probably caused by the
de-virtualizing the virtual calls. GCC Profile Guided Optimization doesn't
produce speedups on this program.
// sumtree_d1.d
// D code, works in D1 on Tango and Phobos
// not hard to adapt to D2
const bool USE_TRICKS = false;
version (Tango) {
import tango.stdc.stdio: printf;
static if (USE_TRICKS) {
import tango.stdc.stdlib: exit;
import tango.core.Memory: GC;
alias GC.disable disable;
}
} else {
import std.c.stdio: printf;
static if (USE_TRICKS) {
import std.c.stdlib: exit;
import std.gc: disable;
}
}
struct Random { // portable generator
private const int m = int.max;
private const int a = 48271;
private const int q = m / a;
private const int r = m % a;
public int seed;
public double nextDouble() {
int k = seed / q;
seed = a * (seed - k * q) - r * k;
if (seed < 1)
seed += m;
return cast(double)seed / m;
}
// returns in [0, max]
public int nextInt(int max) {
int n = max + 1;
int k = cast(int)(n * this.nextDouble());
return (k == n) ? k - 1 : k;
}
}
abstract class Node {
abstract int walkSum();
}
final class Leaf : Node {
int value;
this(int x) { this.value = x; }
override int walkSum() { return this.value++; }
}
final class Internal : Node {
Node left, right;
this(Node l, Node r) {
this.left = l;
this.right = r;
}
override int walkSum() {
// createTree() never generates null pointers, but we test
// anyway to be more general with other tree generators
return ((this.left is null) ? 0 : this.left.walkSum()) +
((this.right is null) ? 0 : this.right.walkSum());
}
}
struct SumTree {
static int nnodes;
static Random rnd;
static Node createTree(int deptMax, int dept) {
nnodes++;
if (rnd.nextDouble() > 0.30 && dept <= deptMax)
return new Internal(createTree(deptMax, dept+1),
createTree(deptMax, dept+1));
else
return new Leaf(rnd.nextInt(100));
}
}
void main() {
const int NLOOPS = 600;
static if (USE_TRICKS) { disable(); }
SumTree.rnd = Random(142465964);
SumTree.nnodes = 0;
Node root = SumTree.createTree(35, 0); // 35 0
printf("nnodes: %d\n", SumTree.nnodes);
if (root !is null) {
printf("sum: %d\n", root.walkSum());
int tot = 0;
for (int i = 0; i < NLOOPS-1; i++)
tot += root.walkSum();
printf("sum (%d scans): %d\n", NLOOPS-1, tot);
}
static if (USE_TRICKS) { exit(0); }
}
I have translated that code to Java and C++, ask if you want to see those
versions.
I have created more D versions:
- D #2: uses tagged structs, the Node.walkSum() method dispatches according to
the struct tag.
- D #3: the same as the #2 version, but now walkSum() is a free function, this
probably improves code inlining.
- D #4: this uses structs that are not tagged. It tells apart leaves and
internal nodes using pointers tagged with one bit (so the struct memory is
allocated from the C heap). The total memory allocated is less (the tags
require no extra memory) and this speeds up the allocation a little. The walk
is about as fast.
In the D code I have also added a global constant boolean USE_TRICKS that
enables/disables two dirty tricks, that essentially save some of the time used
by the D GC to allocate and deallocate the tree. Such tricks don't speed up the
tree walks.
The profiling of the C++ program shows that most of the time is spent inside
Internal::walkSum().
Code compiled with:
LLVM 2.6 gcc version 4.2.1 (Based on Apple Inc. build 5649) (LLVM build)
llvm-g++ -Wall -O3 -s -Wl,--enable-stdcall-fixup -fomit-frame-pointer
-march=native -ffast-math
Java build 1.7.0-ea-b76
java -server SumTree
CPU: Celeron 2.33 GHz, on Ubuntu running on VirtualBox running on Windows Vista
32 bit.
The current benchmark with seed=142465964 allocates 466_232 Internal nodes, and
466_234 Leaves.
Timings (just tree building), best of 3, seconds:
C++: 0.21
Java: 0.83
D 1 notricks: 0.49
D 2 notricks: 0.46
D 3 notricks: 0.45
D 4 notricks: 0.23
D 1 tricks: 0.24
D 2 tricks: 0.23
D 3 tricks: 0.23
D 4 tricks: 0.22
Timings (full program), best of 3, seconds:
C++: 15.64
Java: 8.74
D 1 notricks: 10.30
D 2 notricks: 14.26
D 3 notricks: 14.06
D 4 notricks: 13.57
D 1 tricks: 9.77
D 2 tricks: 13.58
D 3 tricks: 13.32
D 4 tricks: 13.06
As you can see from the timings it's not a matter of GC and memory allocations,
the building phase in the C++ is actually faster compared to the allocation in
the Java version. So probably here HotSpot is optimizing the virtual calls.
D #1:
Internal.walkSum
pushl %edi
pushl %esi
subl $4, %esp
movl 8(%eax), %ecx
testl %ecx, %ecx
movl %eax, %esi
je .LBB5_5
movl (%ecx), %edx
movl %ecx, %eax
call *24(%edx)
.LBB5_2:
movl %eax, %edi
movl 12(%esi), %eax
testl %eax, %eax
je .LBB5_6
movl (%eax), %ecx
call *24(%ecx)
.LBB5_4:
addl %edi, %eax
addl $4, %esp
popl %esi
popl %edi
ret
.LBB5_5:
xorl %eax, %eax
jmp .LBB5_2
.LBB5_6:
xorl %eax, %eax
jmp .LBB5_4
------------------
D #2:
Internal.walkSum:
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $4, %esp
movl 4(%eax), %esi
testl %esi, %esi
movl %eax, %edi
je .LBB7_12
cmpl $0, (%esi)
jne .LBB7_13
movl 4(%esi), %eax
testl %eax, %eax
je .LBB7_14
call _D2s24Node7walkSumMFZi
.LBB7_4:
movl %eax, %ebp
movl 8(%esi), %eax
testl %eax, %eax
je .LBB7_15
cmpl $0, (%eax)
jne .LBB7_16
call _D2s28Internal7walkSumMFZi
movl %eax, %ebx
.LBB7_7:
addl %ebp, %ebx
.LBB7_8:
movl 8(%edi), %eax
testl %eax, %eax
je .LBB7_17
cmpl $0, (%eax)
jne .LBB7_18
call _D2s28Internal7walkSumMFZi
movl %eax, %ecx
.LBB7_11:
movl %ecx, %eax
addl %ebx, %eax
addl $4, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
ret
.LBB7_12:
xorl %ebx, %ebx
jmp .LBB7_8
.LBB7_13:
movl 4(%esi), %ebx
leal 1(%ebx), %eax
movl %eax, 4(%esi)
jmp .LBB7_8
.LBB7_14:
xorl %eax, %eax
jmp .LBB7_4
.LBB7_15:
xorl %ebx, %ebx
jmp .LBB7_7
.LBB7_16:
movl 4(%eax), %ebx
leal 1(%ebx), %ecx
movl %ecx, 4(%eax)
jmp .LBB7_7
.LBB7_17:
xorl %ecx, %ecx
jmp .LBB7_11
.LBB7_18:
movl 4(%eax), %ecx
leal 1(%ecx), %edx
movl %edx, 4(%eax)
jmp .LBB7_11
------------------
D #3:
walkSum:
pushl %ebx
pushl %edi
pushl %esi
testl %eax, %eax
jne .LBB6_3
xorl %eax, %eax
.LBB6_2:
popl %esi
popl %edi
popl %ebx
ret
.LBB6_3:
movl %eax, %esi
cmpl $0, (%esi)
jne .LBB6_8
movl 4(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
movl 8(%esi), %esi
testl %esi, %esi
movl %eax, %edi
je .LBB6_9
cmpl $0, (%esi)
jne .LBB6_10
movl 4(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
movl %eax, %ebx
movl 8(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
addl %ebx, %eax
.LBB6_7:
addl %edi, %eax
jmp .LBB6_2
.LBB6_8:
movl 4(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 4(%esi)
jmp .LBB6_2
.LBB6_9:
xorl %eax, %eax
jmp .LBB6_7
.LBB6_10:
movl 4(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 4(%esi)
jmp .LBB6_7
------------------
D #4:
walkSum:
pushl %ebx
pushl %edi
pushl %esi
movl %eax, %esi
andl $4294967294, %esi
testl %esi, %esi
jne .LBB13_3
xorl %eax, %eax
.LBB13_2:
popl %esi
popl %edi
popl %ebx
ret
.LBB13_3:
testb $1, %al
movl (%esi), %eax
je .LBB13_8
movl %eax, %edi
andl $4294967294, %edi
testl %edi, %edi
je .LBB13_9
testb $1, %al
movl (%edi), %eax
je .LBB13_10
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
movl %eax, %ebx
movl 4(%edi), %eax
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
addl %ebx, %eax
.LBB13_7:
movl %eax, %edi
movl 4(%esi), %eax
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
addl %edi, %eax
jmp .LBB13_2
.LBB13_8:
leal 1(%eax), %ecx
movl %ecx, (%esi)
jmp .LBB13_2
.LBB13_9:
xorl %eax, %eax
jmp .LBB13_7
.LBB13_10:
leal 1(%eax), %ecx
movl %ecx, (%edi)
jmp .LBB13_7
------------------
st, C++, gcc:
Internal.walkSum:
subl $12, %esp
movl %ebx, 4(%esp)
movl %esi, 8(%esp)
xorl %ebx, %ebx
movl 16(%esp), %esi
movl 4(%esi), %edx
testl %edx, %edx
je .L5
movl (%edx), %eax
movl %edx, (%esp)
call *(%eax)
movl %eax, %ebx
.L5:
movl 8(%esi), %edx
xorl %eax, %eax
testl %edx, %edx
je .L7
movl (%edx), %eax
movl %edx, (%esp)
call *(%eax)
.L7:
addl %ebx, %eax
movl 8(%esp), %esi
movl 4(%esp), %ebx
addl $12, %esp
ret
------------------
st, C++, llvm-gcc:
Internal.walkSum:
pushl %edi
pushl %esi
subl $4, %esp
movl 16(%esp), %esi
movl 4(%esi), %eax
testl %eax, %eax
je LBB2_5
movl (%eax), %ecx
movl %eax, (%esp)
call *(%ecx)
LBB2_2:
movl %eax, %edi
movl 8(%esi), %eax
testl %eax, %eax
je LBB2_6
movl (%eax), %ecx
movl %eax, (%esp)
call *(%ecx)
LBB2_4:
addl %edi, %eax
addl $4, %esp
popl %esi
popl %edi
ret
LBB2_5:
xorl %eax, %eax
jmp LBB2_2
LBB2_6:
xorl %eax, %eax
jmp LBB2_4
I have collected the asm generated by HotSpot from the Internal.walkSum, but I
am not able to read it, it's too much messy for me.
I have created an alternative version of the first D program, sumtree_d1b, that
uses the __vptr to avoid the virtual table and to allow LDC to inline the code
better. The performance is better than sumtree_d1, but not good enough to beat
the Java version yet.
// sumtree_d1b.d
const bool USE_TRICKS = false;
version (Tango) {
import tango.stdc.stdio: printf;
static if (USE_TRICKS) {
import tango.stdc.stdlib: exit;
import tango.core.Memory: GC;
alias GC.disable disable;
}
} else {
import std.c.stdio: printf;
static if (USE_TRICKS) {
import std.c.stdlib: exit;
import std.gc: disable;
}
}
struct Random {
private const int m = int.max;
private const int a = 48271;
private const int q = m / a;
private const int r = m % a;
public int seed;
public double nextDouble() {
int k = seed / q;
seed = a * (seed - k * q) - r * k;
if (seed < 1)
seed += m;
return cast(double)seed / m;
}
// returns in [0, max]
public int nextInt(int max) {
int n = max + 1;
int k = cast(int)(n * this.nextDouble());
return (k == n) ? k - 1 : k;
}
}
//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2
abstract class Node {}
final class Leaf : Node {
int value;
this(int x) { this.value = x; }
}
final class Internal : Node {
Node left, right;
this(Node l, Node r) {
this.left = l;
this.right = r;
}
}
struct SumTree {
static int nnodes;
static Random rnd;
static Node createTree(int deptMax, int dept) {
nnodes++;
if (rnd.nextDouble() > 0.30 && dept <= deptMax)
return new Internal(createTree(deptMax, dept+1),
createTree(deptMax, dept+1));
else
return new Leaf(rnd.nextInt(100));
}
}
int walkTree(Node root) {
if (root.__vptr == leaf__vptr) {
return (cast(Leaf)cast(void*)root).value++;
} else {
Internal iroot = cast(Internal)cast(void*)root;
return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
((iroot.right is null) ? 0 : walkTree(iroot.right));
}
}
void main() {
const int NLOOPS = 600;
static if (USE_TRICKS) { disable(); }
{
scope Leaf l = new Leaf(0);
leaf__vptr = l.__vptr;
}
SumTree.rnd = Random(142465964);
SumTree.nnodes = 0;
Node root = SumTree.createTree(35, 0); // 35 0
printf("nnodes: %d\n", SumTree.nnodes);
if (root !is null) {
printf("sum: %d\n", walkTree(root));
int tot = 0;
for (int i = 0; i < NLOOPS-1; i++)
tot += walkTree(root);
printf("sum (%d scans): %d\n", NLOOPS-1, tot);
}
static if (USE_TRICKS) { exit(0); }
}
Timings (full program), best of 3, seconds:
Java: 8.83 (-server)
D 1 notricks: 10.44
D 1b notricks: 9.73
D 1 tricks: 10.18
D 1b tricks: 9.45
In this program there are only two possible classes of nodes, so to manually
remove virtual calls I have stored the pointer to the virtual table of one of
the two classes, the Leaf (Here unfortunately I can't use conditional
compilation to let the compiler compile the right code according to the
language version because the D2 version is not syntactially valid D1).
//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2
D docs state that user code shall not use the pointer to the virtual table
(probably because it's an implementation detail, that can change according to
the way virtual calls are implemented. For example dispatch trees don't have
such pointer. Here I have essentially created a small dispatch tree manually).
At the beginning of the main() the leaf__vptr gets initialized (I don't know a
way to initialize this value that doesn't require me the creation of a class
instance):
{
scope Leaf l = new Leaf(0);
leaf__vptr = l.__vptr;
}
Then this is a free function that scans the tree (it can be written iterative
too, but I think the performance is not that diffeernt on modern CPUs. This can
be tested):
int walkTree(Node root) {
if (root.__vptr == leaf__vptr) {
return (cast(Leaf)cast(void*)root).value++;
} else {
Internal iroot = cast(Internal)cast(void*)root;
return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
((iroot.right is null) ? 0 : walkTree(iroot.right));
}
}
The double cast like:
cast(Leaf)cast(void*)root
is necessary because a single cast in D is done dynamically, so the first cast
to void* punches a hole in the type system, so the pair becomes a static cast.
I think it's similar to the C++ code:
Leaf::root
The asm of walkTree() shows no virtual calls and it also shows some inlining:
walkTree of sumtree_d1b.d:
_D8sumtree27walkTreeFC8sumtree24NodeZi:
pushl %ebx
pushl %edi
pushl %esi
movl _D8sumtree210leaf__vptrPPv, %ecx
cmpl %ecx, (%eax)
movl %eax, %esi
je .LBB8_12
movl 8(%esi), %eax
testl %eax, %eax
je .LBB8_13
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_3:
movl %eax, %edi
movl 12(%esi), %esi
testl %esi, %esi
je .LBB8_14
movl _D8sumtree210leaf__vptrPPv, %eax
cmpl %eax, (%esi)
je .LBB8_15
movl 8(%esi), %eax
testl %eax, %eax
je .LBB8_16
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_7:
movl %eax, %ebx
movl 12(%esi), %eax
testl %eax, %eax
je .LBB8_17
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_9:
addl %ebx, %eax
.LBB8_10:
addl %edi, %eax
.LBB8_11:
popl %esi
popl %edi
popl %ebx
ret
.LBB8_12:
movl 8(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 8(%esi)
jmp .LBB8_11
.LBB8_13:
xorl %eax, %eax
jmp .LBB8_3
.LBB8_14:
xorl %eax, %eax
jmp .LBB8_10
.LBB8_15:
movl 8(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 8(%esi)
jmp .LBB8_10
.LBB8_16:
xorl %eax, %eax
jmp .LBB8_7
.LBB8_17:
xorl %eax, %eax
jmp .LBB8_9
A possible cause of the slow performance of sumtree_d1b compared to the Java
version is in the cache effects. The garbage collector of Java HotSpot can lay
objects in memory in a better way: In the "Eden" part of the memory allocated
by the hotspot GC the cache coherence is very high. While the current D GC has
to avoid long-term memory fragmentation, the Eden doesn't share this problem,
it can be used as a stack.
So I have created a different version of the walkTree() function that prints
the memory positions of the nodes. Then I have used Python to find the
requences of the distances between node pointers during the walkTree() in
sumtree_d1b:
Feb 03 2010
bearophile wrote:This post is about few simple experiments I've done in D about virtual calls and how to improve their performance. Feb 03 2010
bearophile wrote:This post is about few simple experiments I've done in D about virtual calls and how to improve their performance. So probably this can't be useful before D2 is out of alpha state. For people working at Sun this stuff is probably very old and very naive, but I think currently no D compilers are optimizing this, and if you are not interested please ignore this post. Feb 03 2010
Walter Bright:The virtual function calls are the calls to Internal.walkSum() via an expression typed as Node.walkSum(). Feb 03 2010
Sorry, I had missed the Leaf node. bearophile wrote:With LDC you can use a level of optimization (LTO) that spans all the modules you are compiling at once. Feb 03 2010
Walter Bright:That's not safe because there are always opaque modules that are binary only. They may derive from the class in the module - and the optimizer cannot prove they don't. Feb 03 2010
bearophile wrote:Walter Bright:That's not safe because there are always opaque modules that are binary only. They may derive from the class in the module - and the optimizer cannot prove they don't. Feb 03 2010
Walter Bright:Not impossible, and a "pseudo-link" could be done after the semantic phase of compilation, but these things would be a lot of work to implement. Feb 04 2010
bearophile wrote:Partially unrelated note: LLVM plans to add a profile guided optimization level too, to be used with -O6, as GCC too does. LLVM also plans to do something that GCC is not currently able to do, the level -O7 lifelong optimization. This is like in Java, the profile guided optimization happens at normal runtime too, so you need to keep the llvm compiler along with your shipped binary. D2 language in future can use that technology too. With care and work, this will allow de-virtualize functions at runtime in D2/D3. http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html Feb 04 2010
== Quote from Walter Bright (newshound1 digitalmars.com)'s articlebearophile wrote:Partially unrelated note: LLVM plans to add a profile guided optimization level too, to be used with -O6, as GCC too does. LLVM also plans to do something that GCC is not currently able to do, the level -O7 lifelong optimization. This is like in Java, the profile guided optimization happens at normal runtime too, so you need to keep the llvm compiler along with your shipped binary. D2 language in future can use that technology too. With care and work, this will allow de-virtualize functions at runtime in D2/D3. http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html Feb 04 2010
dsimcha:Is that still a relevant optimization? It probably made sense circa 1995 when computers had much less memory, but now I would think that binaries are so much smaller than a typical computer's memory that they almost never get paged out. Feb 04 2010
Hello dsimcha,== Quote from Walter Bright (newshound1 digitalmars.com)'s articleBelieve it or not, you can do profile guided optimizations with the dmd profiler and optlink. The profiler will emit a file to be passed to the linker which will set the layout of functions to minimize paging. Feb 04 2010
dsimcha wrote:== Quote from Walter Bright (newshound1 digitalmars.com)'s articleBelieve it or not, you can do profile guided optimizations with the dmd profiler and optlink. The profiler will emit a file to be passed to the linker which will set the layout of functions to minimize paging. Feb 04 2010
Trass3r:What about C++ code? Feb 04 2010
Let's try again. I think there can be two types of opaque binary only modules: - Ones compiled from C code, or with a C interface. Such modules can't contain classes, so there is no problem with virtual functions. - Ones compiled from D code, or that use the D code interface, that can contain and use classes and virtual calls. The language can require such modules to contain a data structure that tells the compiler what's safe to do. Feb 04 2010
Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:Later more things can be improved, but I think this is enough to fix most of the situation. Feb 03 2010
retard wrote:Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:Later more things can be improved, but I think this is enough to fix most of the situation. Feb 03 2010
|