www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Port a benchmark to D?

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
This paper:

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/

compares the efficiency of an algorithm implemented in C++, Scala, Java, 
and Go. I wonder how D would fare. Does anyone have the time and 
inclination to port the benchmark so we can take a look?


Thanks,

Andrei
Jun 03 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Andrei Alexandrescu wrote:
 This paper:

 compares the efficiency of an algorithm implemented in C++, Scala, Java,
 and Go. I wonder how D would fare. Does anyone have the time and
 inclination to port the benchmark so we can take a look?


 Thanks,

 Andrei

OK. I'll start right away. Timon
Jun 03 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile
Jun 03 2011
next sibling parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:isbd1j$1u9s$1 digitalmars.com...
 But this article written by Google and its benchmarking methodology are 
 not good.

I realize you'e probably talking about content, but...Pet Peeve: Trying to read a paged multi-column layout outside of dead-tree form.
Jun 03 2011
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already noticed how incredibly stupid some parts of the implementation of cpp are? I think I will also do a d_pro version after porting cpp. Timon
Jun 03 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already noticed how incredibly stupid some parts of the implementation of cpp are? I think I will also do a d_pro version after porting cpp. Timon

Terrific, thanks! I think this will be very instructive. Andrei
Jun 03 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already noticed how incredibly stupid some parts of the implementation of cpp are? I think I will also do a d_pro version after porting cpp. Timon

One note - some of their std::list uses can be replaced by SList, and others by built-in arrays. Andrei
Jun 03 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Andrei Alexandresco wrote:
 On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already noticed how incredibly stupid some parts of the implementation of cpp are? I think I will also do a d_pro version after porting cpp. Timon

One note - some of their std::list uses can be replaced by SList, and others by built-in arrays. Andrei

Yes, but built-in arrays are more similar to std::vector, therefore it might already be an optimization over the original benchmark. I'm not sure if that is okay? Timon
Jun 03 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 3:01 PM, Timon Gehr wrote:
 Andrei Alexandresco wrote:
 On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-) But this article written by Google and its benchmarking methodology are not good. Bye, bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already noticed how incredibly stupid some parts of the implementation of cpp are? I think I will also do a d_pro version after porting cpp. Timon

One note - some of their std::list uses can be replaced by SList, and others by built-in arrays. Andrei

Yes, but built-in arrays are more similar to std::vector, therefore it might already be an optimization over the original benchmark. I'm not sure if that is okay? Timon

I noticed that the C++ code uses std::list without there being any need for a linked list structure. See for example the data structure used in FindSet. It's a list, but it's just appended too and then used for one iteration. Andrei
Jun 03 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.

 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start changing the original implementation, the first thing I will do is to remove that. First however, I will port the code as closely as possible. Is there any associative version of RedBlackTree (I realize it could be made associative quite easily), or should I just use built-in hash maps? Timon
Jun 03 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.

 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start changing the original implementation, the first thing I will do is to remove that. First however, I will port the code as closely as possible. Is there any associative version of RedBlackTree (I realize it could be made associative quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. It defaults to "a < b" with allowDuplicates as false, which makes it a multiset. But If you adjust its template parameters, you can make it a set, map, multimap, or any other type of collection which uses a red-black tree > as its data structure. - Jonathan M Davis

Yes, thats what I had in mind, but I thought it is strange that there is no boilerplate map in std.container. Timon
Jun 03 2011
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-06 06:37, Steven Schveighoffer wrote:
 On Fri, 03 Jun 2011 17:30:54 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any


need
 for a linked list structure. See for example the data structure


used in
 FindSet. It's a list, but it's just appended too and then used for


one
 iteration.
 
 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start

changing
 the original implementation, the first thing I will do is to remove

that.
 First however, I will port the code as closely as possible. Is there

any
 associative version of RedBlackTree (I realize it could be made

associative
 quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. It defaults to "a < b" with allowDuplicates as false, which makes it a multiset.


This makes it a set, not a multiset. allowDuplicates set to true would make it a multiset.

Bleh. You're right. I guess that I wasn't thinking straight (or was thinking too quickly) on that one. :( - Jonathan M Davis
Jun 06 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.
 
 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start changing the original implementation, the first thing I will do is to remove that. First however, I will port the code as closely as possible. Is there any associative version of RedBlackTree (I realize it could be made associative quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. It defaults to "a < b" with allowDuplicates as false, which makes it a multiset. But If you adjust its template parameters, you can make it a set, map, multimap, or any other type of collection which uses a red-black tree as its data structure. - Jonathan M Davis
Jun 03 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Jun 2011 17:30:54 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any  


 for a linked list structure. See for example the data structure  


 FindSet. It's a list, but it's just appended too and then used for  


 iteration.

 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start

 the original implementation, the first thing I will do is to remove  

 First however, I will port the code as closely as possible. Is there  

 associative version of RedBlackTree (I realize it could be made  

 quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. It defaults to "a < b" with allowDuplicates as false, which makes it a multiset.


This makes it a set, not a multiset. allowDuplicates set to true would make it a multiset.
 Yes, thats what I had in mind, but I thought it is strange that there is  
 no
 boilerplate map in std.container.

I think the goal is to eventually have these higher-level types based on the implementation types, but I'm uncertain how they will look or where they will go. Almost certainly, there will be a template based on a set-like template for defining a map (e.g. map!(RedBlackTree, int, int) ). If you want a cookie-cutter map type that uses the exact same implementation as RedBlackTree, with a custom allocator that helps with performance for long-lasting containers, dcollections provides such a type (TreeMap). If you do end up using it, I recommend the latest trunk, I've made some critical bug fixes (I need to release soon...). -Steve
Jun 06 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Timon Gehr:

 have you already noticed how incredibly stupid some parts of the
implementation of cpp are?

Some of those "stupid" things come from the very strict Google C++ coding standards. Even if my C++ experience is not extensive, I find this C++ easy to understand. One part of the code: } // node_pool.size } // Step c } // FindLoops private: MaoCFG *CFG_; // current control flow graph. LoopStructureGraph *lsg_; // loop forest. }; // HavlakLoopFinder The Ada language has a syntax to write those names at the closing ends, and the Ada compiler enforces such names to be always coherent and correct. In C/C++/D unfortunately such names (written as comments) may go out of sync. Despite this risk I add similar comments at the end of functions/classe/structs/loops in my D code too, because the risk of them getting out of sync is lower than the advantages they give me. I have not written an enhancement request yet about this in the D Bugzilla yet, also because I don't know what a good C-style syntax for it could be. Bye, bearophile
Jun 03 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
bearophile wrote:
 Timon Gehr:

 have you already noticed how incredibly stupid some parts of the implementation


 Some of those "stupid" things come from the very strict Google C++ coding

sure? Eg. UnionFindNode *FindSet() { typedef std::list<UnionFindNode *> NodeListType; NodeListType nodeList; UnionFindNode *node = this; while (node != node->parent()) { if (node->parent() != node->parent()->parent()) nodeList.push_back(node); node = node->parent(); } // Path Compression, all nodes' parents point to the 1st level parent. NodeListType::iterator iter = nodeList.begin(); NodeListType::iterator end = nodeList.end(); for (; iter != end; ++iter) (*iter)->set_parent(node->parent()); return node; } They create a new linked list of nodes on the way to the root. It is not that that is just basically the same they already had when following the parent pointers. They obviously need a copy. Also, they made sure that node==node->parent(). (node is the a root) But then they call node->parent(). etc... Timon
Jun 03 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2011 1:24 PM, bearophile wrote:
 The Ada language has a syntax to write those names at the closing ends, and
 the Ada compiler enforces such names to be always coherent and correct. In
 C/C++/D unfortunately such names (written as comments) may go out of sync.

I never understood the point of such. My editor has a single key "find matching { [ < ( ) > ] }" command, and so I never have a need for such ugly comments. In fact I usually delete such comments when I encounter them.
Jun 04 2011
parent reply Jeff Nowakowski <jeff dilacero.org> writes:
On 06/04/2011 04:25 AM, Walter Bright wrote:
 On 6/3/2011 1:24 PM, bearophile wrote:
 The Ada language has a syntax to write those names at the closing
 ends, and
 the Ada compiler enforces such names to be always coherent and
 correct. In
 C/C++/D unfortunately such names (written as comments) may go out of
 sync.

I never understood the point of such. My editor has a single key "find matching { [ < ( ) > ] }" command, and so I never have a need for such ugly comments. In fact I usually delete such comments when I encounter them.

I thought you were big on printed out code reviews and not requiring any editing features from the language?
Jun 04 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/4/2011 7:14 AM, Jeff Nowakowski wrote:
 I thought you were big on printed out code reviews and not requiring any
editing
 features from the language?

I don't find those comments useful in printed code either.
Jun 04 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Speaking of SList, for some reason std_slist.html is still distributed with DMD.
Jun 03 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 6/3/11, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 Speaking of SList, for some reason std_slist.html is still distributed with
 DMD.

http://d.puremagic.com/issues/show_bug.cgi?id=6101
Jun 03 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On 2011-06-03 14:30, Timon Gehr wrote:
 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any
 need for a linked list structure. See for example the data structure
 used in FindSet. It's a list, but it's just appended too and then
 used for one iteration.
 
 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start changing the original implementation, the first thing I will do is to remove that. First however, I will port the code as closely as possible. Is there any associative version of RedBlackTree (I realize it could be made associative quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. It defaults to "a < b" with allowDuplicates as false, which makes it a multiset. But If you adjust its template parameters, you can make it a set, map, multimap, or any other type of collection which uses a red-black tree > as its data structure. - Jonathan M Davis

Yes, thats what I had in mind, but I thought it is strange that there is no boilerplate map in std.container.

std.container provides data structures. It's up to you to decide how to use the data structures. It does not attempt to create a wrapper or anything of the sort to give you particular uses for a data structure (such as using RedBlackTree as a map). Whether it should do that sort of thing is up for debate, but its focus is on data structures not on "collections" which are organized by how they're used. - Jonathan M Davis
Jun 03 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

 Does anyone have the time and inclination to port the benchmark so we can take
a look?

My modified C++0x code: http://codepad.org/iBwINTkJ First D2 translation that seems to work: http://codepad.org/sEEFNlAd Notes on the translation: - The program crashes with zero error messages if you don't increase the stack. Time ago I have written an enhancement request on this: http://d.puremagic.com/issues/show_bug.cgi?id=6088 - The final print of the "." done using write() doesn't perform the fflush. So after few tries I have used printf. I have not found a good way to flush. Suggestions welcome. - The C++ code uses classes, but they are often more like D structs. In the first D version to keep things simple I have used final classes everywhere. I will profile the code and I will create a more efficient version that uses structs where needed. - I have had to track down a null reference error (caused by the nodes array, that contains just references in the first D version). Nonnullable type modifiers avoids similar bugs at compile-time. - The C++ code is 25 KB, the D code (first version is 21 KB), the C++ code uses about 100 MB at runtime, while the D code uses about 160 MB. - The C++ loops to iterate on collections is many times worse than the D foreach. - In the main() of C++ code, there is a LoopStructureGraph lsg that shadows another one, I've replaced it with a lsg2 in both D and C++ code. - In the C++ code I have used a std::unordered_map to implement BasicBlockMap, because it's significantly faster. So now the code is C++0x. - The method names start with uppercase, I'd like to change this in a second D version. - No linked lists or sorted dictionaries seem needed in this program. But I have felt a bit of need of a set data structure in D (I have used an AA with bool values). Despite there are some associative arrays of SimpleLoop and BasicBlock objects, I think their classes don't need a hash function and opEquals. Bye, bearophile
Jun 03 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 My modified C++0x code:
 First D2 translation that seems to work:

What timings did you get? On my computer, the D version ran slightly faster (56 seconds vs 63s for C++) without optimizations turned on. With optimizations turned on, C++ took a nice lead (28 seconds vs 53 seconds for D). This is similar to other benchmarks I've run in the past. Standard D builds beat standard C++ builds, but gcc's optimizer takes the lead vs dmd's optimizer. However, in the past, I've found writing D style instead of C++ style lets standard D beat even optimized g++. I wonder if that's the case here too..
 - The final print of the "." done using write() doesn't perform
 the fflush. So after few tries I have used printf. I have not
 found a good way to flush. Suggestions welcome.

To flush with D's stdout, simply call stdout.flush: write("."); stdout.flush();
Jun 03 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 6:09 PM, Adam D. Ruppe wrote:
 bearophile wrote:
 My modified C++0x code:
 First D2 translation that seems to work:

What timings did you get? On my computer, the D version ran slightly faster (56 seconds vs 63s for C++) without optimizations turned on. With optimizations turned on, C++ took a nice lead (28 seconds vs 53 seconds for D).

What are your timings for the Java, Scala, and Go versions? Thanks, Andrei
Jun 03 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
Andrei Alexandrescu wrote:
 What are your timings for the Java, Scala, and Go versions?

I don't have compilers nor the code for those languages, so I don't know.
Jun 03 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-03 17:16, Andrej Mitrovic wrote:
 I must be doing something wrong. Do I have to pipe some files to the
 module or something?
 
 I've compiled it and ran it, it's done in 1.5 seconds. :s

Or maybe you just have a way better machine? - Jonathan M Davis
Jun 03 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 7:42 PM, Andrej Mitrovic wrote:
 DMD debug: 50s
 DMD optimized: 49s (-release -noboundscheck -O -inline -L/STACK:1280000000)

 That's around 1GB of stack memory.
 Compiling with GDC will make the app throw an exception at runtime due
 to the stack being blown, the error message isn't special but it's
 better than no error message.

 GDC debug: 1m:24s
 GDC optimized: 1m:12s (gdc -v2 -O3 -Wl,--stack,1280000000)

 I haven't tested the other ones. Does Go have a Windows compiler yet?

I think they do. But another good data point would be the C++ timings on the same machine. Andrei
Jun 03 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 8:05 PM, Andrej Mitrovic wrote:
 I've compiled the C++ example from google, but I had to remove the
 "-lc" option due to errors (what is this option anyway?).

 gcc 4.4.0 with -O3: 1:50

Hmm, it would be odd if the D version is more than twice faster than the C++ one. Wonder what's going on there. Andrei
Jun 03 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/03/2011 08:47 PM, Andrej Mitrovic wrote:
 Running Ubuntu v10.10 x32 on Virtualbox with hardware virtualization,
 GCC v4.4.4 and DMD 2.053 installed.

 DMD debug: 54s
 DMD optimized: 47s

 Google's C++ optimized: 27s (gcc -O3 -lc -lstdc++)
 Bear's C++ optimized: 24s (g++ -O3 -std=gnu++0x)

 So it's MinGW's fault apparently.

Interesting. So now we have a run time comparable with C++'s run time in the paper. The implementation by bearophile has 721 lines. Once we have compilation times, binary size, and memory footprint, we have a good comparison point with the study in the paper. Far as I can tell D comes in the second place after C++ at run time. With optimizations and all it could get significantly closer. Thanks, Andrei
Jun 03 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

 Far as I can tell D comes in the second place after C++ at run time. 
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up: http://codepad.org/DggCx26d Second version, with all structs: http://codepad.org/etsLsZV5 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And then I'll create one or two more optimized versions (one using a memory pool for the nodes, and one trying to apply some of the C++ improvement ideas from the original paper). The number of instances allocated: Class instances: SimpleLoop_counter 3_936_102 LoopStructureGraph_counter 15_051 UnionFindNode_counter 13_017_663 HavlakLoopFinder_counter 15_051 BasicBlockEdge_counter 378_036 BasicBlock_counter 252_013 MaoCFG_counter 1 UnionFindNode probably will give some gain if allocated from a pool. Later, bearophile
Jun 03 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB,  RES: 174MB]
 real    0m28.407s
 user    0m28.330s
 sys     0m0.070s

 DMD -O -release
 [VIRT: 94MB,  RES: 92MB]
 real    0m43.232s
 user    0m42.980s
 sys     0m0.070s

 GDC -O3
 [VIRT: 306MB,  RES: 295MB]
 real    1m10.788s
 user    1m10.570s
 sys     0m0.190s

 LDC2
 segmentation fault

Why not -inline on dmd?
Jun 03 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB,  RES: 92MB]
 real    0m24.567s
 user    0m24.500s
 sys     0m0.060s

Thanks to all who participated to this. I've shared these results on reddit: http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/ Andrei
Jun 04 2011
next sibling parent reply Mafi <mafi example.org> writes:
Am 04.06.2011 16:37, schrieb Andrei Alexandrescu:
 On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB, RES: 92MB]
 real 0m24.567s
 user 0m24.500s
 sys 0m0.060s

Thanks to all who participated to this. I've shared these results on reddit: http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/ Andrei

Could you please give a link to the post. I can't find it.
Jun 04 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/04/2011 09:43 AM, Mafi wrote:
 Am 04.06.2011 16:37, schrieb Andrei Alexandrescu:
 On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB, RES: 92MB]
 real 0m24.567s
 user 0m24.500s
 sys 0m0.060s

Thanks to all who participated to this. I've shared these results on reddit: http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/ Andrei

Could you please give a link to the post. I can't find it.

My id is andralex. http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/c1xqj0w Andrei
Jun 04 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/4/2011 7:37 AM, Andrei Alexandrescu wrote:
 Thanks to all who participated to this. I've shared these results on reddit:

 http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/

Direct link: http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/c1xqj0w
Jun 04 2011
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB,  RES: 174MB]
 real    0m28.407s
 user    0m28.330s
 sys     0m0.070s

 DMD -O -release
 [VIRT: 94MB,  RES: 92MB]
 real    0m43.232s
 user    0m42.980s
 sys     0m0.070s

 GDC -O3
 [VIRT: 306MB,  RES: 295MB]
 real    1m10.788s
 user    1m10.570s
 sys     0m0.190s

 LDC2
 segmentation fault

Oh, also, you way want to re-try the benchmark w/ 2.053. It looks rather allocation heavy and I substantially improved the GC performance for 2.053.
Jun 03 2011
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up: http://codepad.org/DggCx26d Second version, with all structs: http://codepad.org/etsLsZV5 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

the nodes, and one trying to apply some of the C++ improvement ideas > from the original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

Your port segfaults DMD 2.053 with the -g flag (at least on linux). Andrei: You may want to point out on reddit that the code is approx. a 1 to 1 port of the C++ code and not specially tuned. Timon
Jun 04 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up: http://codepad.org/DggCx26d Second version, with all structs: http://codepad.org/etsLsZV5 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

the nodes, and one trying to apply some of the C++ improvement ideas > from the original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

One simple but very powerful optimization is to minimize the runs of the GC. I have added a call to GC.disable(); in the beginning of main and then added a GC.collect(); after each 10 test runs. Results on my machine (32bit executables): C++ (-O2): 30.7s, ~170MB. D (-release -O -inline): 29.5s, ~520MB Ds GC needs to get faster. A concurrent GC would have hidden away most of the overhead on a multi-core processor ;). Timon
Jun 04 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 04.06.2011 18:25, schrieb Timon Gehr:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up: http://codepad.org/DggCx26d Second version, with all structs: http://codepad.org/etsLsZV5 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

the nodes, and one trying to apply some of the C++ improvement ideas > from the original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

One simple but very powerful optimization is to minimize the runs of the GC. I have added a call to GC.disable(); in the beginning of main and then added a GC.collect(); after each 10 test runs. Results on my machine (32bit executables): C++ (-O2): 30.7s, ~170MB. D (-release -O -inline): 29.5s, ~520MB Ds GC needs to get faster. A concurrent GC would have hidden away most of the overhead on a multi-core processor ;). Timon

What was your time for D without disabling the GC? Probably 40-50s? This certainly is a big improvement, I didn't think the GC slows it down that much. What'd be really interesting is the benchmark with a D-style implementation of the code (if I understood correctly the current versions are more or less direct translations of the C++ code to D). Cheers, - Daniel
Jun 04 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
Daniel Gibson wrote:
 What was your time for D without disabling the GC? Probably 40-50s?
 This certainly is a big improvement, I didn't think the GC slows it down
 that much.

 What'd be really interesting is the benchmark with a D-style
 implementation of the code (if I understood correctly the current
 versions are more or less direct translations of the C++ code to D).

 Cheers,
 - Daniel

Without disabling the GC, it runs through in 38.2s. The D-style implementation would probably be about the same, bearophile has already replaced std::map/std::set with associative arrays and std::list with dynamic arrays. The algorithm cannot take advantage of Eg. array slicing. It solves a graph problem. I will try to tune the code some more. Timon
Jun 04 2011
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/04/2011 11:08 AM, Timon Gehr wrote:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up: http://codepad.org/DggCx26d Second version, with all structs: http://codepad.org/etsLsZV5 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And>

the nodes, and one trying to apply some of the C++ improvement ideas> from the original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

Your port segfaults DMD 2.053 with the -g flag (at least on linux). Andrei: You may want to point out on reddit that the code is approx. a 1 to 1 port of the C++ code and not specially tuned. Timon

Better yet, feel free to contribute. The more of us participate, the better. Andrei
Jun 04 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2011 4:09 PM, Adam D. Ruppe wrote:
 This is similar to other benchmarks I've run in the past. Standard
 D builds beat standard C++ builds, but gcc's optimizer takes the
 lead vs dmd's optimizer.

It would be nice to figure out what is different. Try using the coverage analyzer and profiler for starters!
Jun 04 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 It would be nice to figure out what is different. Try using the coverage 
 analyzer and profiler for starters!

There are little differences and inefficiencies here and there, but in the second D version I think most of the performance difference over the C++ code is caused by the GC. I will do some tests. Bye, bearophile
Jun 04 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 6/4/2011 9:15 AM, bearophile wrote:
 Walter:

 It would be nice to figure out what is different. Try using the coverage
 analyzer and profiler for starters!

There are little differences and inefficiencies here and there, but in the second D version I think most of the performance difference over the C++ code is caused by the GC. I will do some tests. Bye, bearophile

That's probably right, for two reasons: 1. Other than the GC, D doesn't have any "hidden cost" features that would explain it being slower than C++ for similarly written code. 2. A few posts back, it was noted that DMD2.053, which includes my GC optimizations, was substantially faster than 2.052, which doesn't.
Jun 04 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 04.06.2011 16:54, schrieb dsimcha:
 On 6/4/2011 9:15 AM, bearophile wrote:
 Walter:

 It would be nice to figure out what is different. Try using the coverage
 analyzer and profiler for starters!

There are little differences and inefficiencies here and there, but in the second D version I think most of the performance difference over the C++ code is caused by the GC. I will do some tests. Bye, bearophile

That's probably right, for two reasons: 1. Other than the GC, D doesn't have any "hidden cost" features that would explain it being slower than C++ for similarly written code.

Besides better optimization by the compiler, see Adam Ruppe's post, 3 posts up:
 On my computer, the D version ran slightly faster (56 seconds vs 63s >

 With optimizations turned on, C++ took a nice lead (28 seconds vs 53
 seconds for D).

So it seems like it's not all the GCs fault.
 2.  A few posts back, it was noted that DMD2.053, which includes my GC
 optimizations, was substantially faster than 2.052, which doesn't.

Jun 04 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/4/2011 6:15 AM, bearophile wrote:
 There are little differences and inefficiencies here and there, but in the
 second D version I think most of the performance difference over the C++ code
 is caused by the GC. I will do some tests.

Easy to test, simply disable the gc.
Jun 04 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/11 5:48 PM, bearophile wrote:
 Andrei:

 Does anyone have the time and inclination to port the benchmark so we can take
a look?

My modified C++0x code: http://codepad.org/iBwINTkJ First D2 translation that seems to work: http://codepad.org/sEEFNlAd Notes on the translation: - The program crashes with zero error messages if you don't increase the stack. Time ago I have written an enhancement request on this: http://d.puremagic.com/issues/show_bug.cgi?id=6088

I think a better tail call optimization would be in order.
 - The final print of the "." done using write() doesn't perform the fflush. So
after few tries I have used printf. I have not found a good way to flush.
Suggestions welcome.

Just call stdout.flush().
 - The C++ code uses classes, but they are often more like D structs.
 In the first D version to keep things simple I have used final
 classes everywhere. I will profile the code and I will create a more
 efficient version that uses structs where needed.

Makes sense.
 - I have had to track down a null reference error (caused by the
 nodes array, that contains just references in the first D version).
 Nonnullable type modifiers avoids similar bugs at compile-time.

Stay tuned. Good news coming soon. [snip] This is solid work. Looking forward to seeing measurements! Thanks, Andrei
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I must be doing something wrong. Do I have to pipe some files to the
module or something?

I've compiled it and ran it, it's done in 1.5 seconds. :s
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Oh I think it's that stack overflow thing bear was talking about?
Because it quits after this with no message:

Welcome to LoopTesterApp, D edition
Constructing App...
Constructing Simple CFG...
15000 dummy loops
Constructing CFG...
Performing Loop Recognition
1 Iteration
Jun 03 2011
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Okay, here's how to increase stack size via Optlink:
dmd benchmark.d -L/STACK:128000000
Jun 03 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 Okay, here's how to increase stack size via Optlink:
 dmd benchmark.d -L/STACK:128000000

Keep in mind that sets the max stack size, not the max heap size. -L/STACK:5000000 is plenty :-) Bye, bearophile
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
DMD debug: 50s
DMD optimized: 49s (-release -noboundscheck -O -inline -L/STACK:1280000000)

That's around 1GB of stack memory.
Compiling with GDC will make the app throw an exception at runtime due
to the stack being blown, the error message isn't special but it's
better than no error message.

GDC debug: 1m:24s
GDC optimized: 1m:12s (gdc -v2 -O3 -Wl,--stack,1280000000)

I haven't tested the other ones. Does Go have a Windows compiler yet?
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 6/4/11, bearophile <bearophileHUGS lycos.com> wrote:
 Andrej Mitrovic:

 Okay, here's how to increase stack size via Optlink:
 dmd benchmark.d -L/STACK:128000000

Keep in mind that sets the max stack size, not the max heap size. -L/STACK:5000000 is plenty :-) Bye, bearophile

Is that in bytes or kilobytes? Well, I guess Optlink/LD will automatically limit the stack size to something sane if I've blown it out of proportion. The app ends up using about 130megs.
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Btw for those wondering, the paper's code is at
http://code.google.com/p/multi-language-bench/
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I've compiled the C++ example from google, but I had to remove the
"-lc" option due to errors (what is this option anyway?).

gcc 4.4.0 with -O3: 1:50
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
SORRY, I meant gcc 4.5.2 not 4.4.0.
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I can't compile bear's C++0x example, I get a ton of undefined
reference errors. I've tried with:
gcc -O3 -std=gnu++0x raw.cpp

Maybe he's using gcc 4.6 or I'm missing some flags.
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
It seems I should have used g++:

g++ -O3 -std=gnu++0x -Wl,--stack,1280000000 raw.cpp

time: 2m:12s
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I'll give a shot at compiling on Linux, maybe this is all MinGW's fault.
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
What do I use to accurately time an executable under Linux?
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Running Ubuntu v10.10 x32 on Virtualbox with hardware virtualization,
GCC v4.4.4 and DMD 2.053 installed.

DMD debug: 54s
DMD optimized: 47s

Google's C++ optimized: 27s (gcc -O3 -lc -lstdc++)
Bear's C++ optimized: 24s (g++ -O3 -std=gnu++0x)

So it's MinGW's fault apparently.
Jun 03 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 6/4/11, bearophile <bearophileHUGS lycos.com> wrote:
 Second version, with all structs:
 http://codepad.org/etsLsZV5

38secs. Cut down by 10 secs from last time.
Jun 03 2011
prev sibling next sibling parent Caligo <iteronvexor gmail.com> writes:
Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
and latest LDC2]

g++ -O3
[VIRT: 185MB,  RES: 174MB]
real    0m28.407s
user    0m28.330s
sys     0m0.070s

DMD -O -release
[VIRT: 94MB,  RES: 92MB]
real    0m43.232s
user    0m42.980s
sys     0m0.070s

GDC -O3
[VIRT: 306MB,  RES: 295MB]
real    1m10.788s
user    1m10.570s
sys     0m0.190s

LDC2
segmentation fault
Jun 03 2011
prev sibling next sibling parent Caligo <iteronvexor gmail.com> writes:
On Fri, Jun 3, 2011 at 11:16 PM, dsimcha <dsimcha yahoo.com> wrote:
 On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB, =A0RES: 174MB]
 real =A0 =A00m28.407s
 user =A0 =A00m28.330s
 sys =A0 =A0 0m0.070s

 DMD -O -release
 [VIRT: 94MB, =A0RES: 92MB]
 real =A0 =A00m43.232s
 user =A0 =A00m42.980s
 sys =A0 =A0 0m0.070s

 GDC -O3
 [VIRT: 306MB, =A0RES: 295MB]
 real =A0 =A01m10.788s
 user =A0 =A01m10.570s
 sys =A0 =A0 0m0.190s

 LDC2
 segmentation fault

Why not -inline on dmd?

I don't like the '-inline' option, but here it is. Besides, I usually use GDC or LDC2 and I was expecting them to outperform DMD because they usually do, but not in this case. DMD-32bit v2.52 -O -release -inline [VIRT: 94MB, RES: 92MB] real 0m42.490s user 0m42.480s sys 0m0.000s DMD-32bit v2.53 -O -release -inline [VIRT: 107MB, RES: 104MB] real 0m34.011s user 0m33.930s sys 0m0.070s DMD-64bit v2.53 -O -release -inline segmentation fault DMD-64bit v2.53 -O -release [VIRT: 232MB, RES: 219MB] real 0m44.715s user 0m44.580s sys 0m0.080s P.S. It's a 64-bit system.
Jun 04 2011
prev sibling next sibling parent Caligo <iteronvexor gmail.com> writes:
And just to be fair to C++:

g++ -O2 -m32
[VIRT: 94MB,  RES: 92MB]
real    0m24.567s
user    0m24.500s
sys     0m0.060s
Jun 04 2011
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 4 June 2011 08:03, Caligo <iteronvexor gmail.com> wrote:
 On Fri, Jun 3, 2011 at 11:16 PM, dsimcha <dsimcha yahoo.com> wrote:
 On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB, =A0RES: 174MB]
 real =A0 =A00m28.407s
 user =A0 =A00m28.330s
 sys =A0 =A0 0m0.070s

 DMD -O -release
 [VIRT: 94MB, =A0RES: 92MB]
 real =A0 =A00m43.232s
 user =A0 =A00m42.980s
 sys =A0 =A0 0m0.070s

 GDC -O3
 [VIRT: 306MB, =A0RES: 295MB]
 real =A0 =A01m10.788s
 user =A0 =A01m10.570s
 sys =A0 =A0 0m0.190s

 LDC2
 segmentation fault

Why not -inline on dmd?

I don't like the '-inline' option, but here it is. =A0Besides, I usually use GDC or LDC2 and I was expecting them to outperform DMD because they usually do, but not in this case.

Well, I'm not the least bit surprised by DMD outperforming GDC here. Mostly because the program produced by GDC will have 109+ extra array bounds checks, and 300+ extra assertions (hint: the program produced by DMD will have 0). --=20 Iain Buclaw *(p < e ? p++ : p) =3D (c & 0x0f) + '0';
Jun 04 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Here's another run on win32 for bear's struct version:

DMD optimized, GC enabled:  38s.015
DMD optimized, GC disabled: 30s.890
Jun 04 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
AFAIK google also messed with the GC in their paper, so I'd say this
is fair-game.
Jun 04 2011