digitalmars.D - Port a benchmark to D?

Andrei Alexandrescu (7/7) Jun 03 2011 This paper:

Timon Gehr (4/10) Jun 03 2011 OK. I'll start right away.

bearophile (5/6) Jun 03 2011 I have independently started it too :-)

Nick Sabalausky (4/6) Jun 03 2011 I realize you'e probably talking about content, but...Pet Peeve: Trying ...
Timon Gehr (5/11) Jun 03 2011 Ok, we can compare the two implementations afterwards. BTW, have you alr...

Andrei Alexandrescu (3/17) Jun 03 2011 Terrific, thanks! I think this will be very instructive.
Andrei Alexandrescu (4/18) Jun 03 2011 One note - some of their std::list uses can be replaced by SList, and

Andrej Mitrovic (1/1) Jun 03 2011 Speaking of SList, for some reason std_slist.html is still distributed w...
Timon Gehr (4/25) Jun 03 2011 Yes, but built-in arrays are more similar to std::vector, therefore it m...

Andrei Alexandrescu (6/33) Jun 03 2011 I noticed that the C++ code uses std::list without there being any need

Timon Gehr (7/12) Jun 03 2011 Yes, but the list in FindSet is unnecessary anyways. If I start changing...

Jonathan M Davis (7/21) Jun 03 2011 You give RedBlackTree a different predicate if you want to treat it as a...

Timon Gehr (4/25) Jun 03 2011 Yes, thats what I had in mind, but I thought it is strange that there is...

Jonathan M Davis (8/36) Jun 03 2011 std.container provides data structures. It's up to you to decide how to ...
Steven Schveighoffer (13/44) Jun 06 2011 This makes it a set, not a multiset. allowDuplicates set to true would ...

Jonathan M Davis (4/49) Jun 06 2011 Bleh. You're right. I guess that I wasn't thinking straight (or was thin...

Andrej Mitrovic (3/5) Jun 03 2011 I've filed it (and many more):

bearophile (14/15) Jun 03 2011 Some of those "stupid" things come from the very strict Google C++ codin...

Timon Gehr (26/29) Jun 03 2011 standards.
Walter Bright (5/8) Jun 04 2011 I never understood the point of such.

Jeff Nowakowski (3/14) Jun 04 2011 I thought you were big on printed out code reviews and not requiring any...

Walter Bright (2/4) Jun 04 2011 I don't find those comments useful in printed code either.

bearophile (20/21) Jun 03 2011 My modified C++0x code:

Adam D. Ruppe (15/20) Jun 03 2011 What timings did you get? On my computer, the D version ran

Andrei Alexandrescu (4/12) Jun 03 2011 What are your timings for the Java, Scala, and Go versions?

Adam D. Ruppe (3/4) Jun 03 2011 I don't have compilers nor the code for those languages, so I don't

Andrej Mitrovic (3/3) Jun 03 2011 I must be doing something wrong. Do I have to pipe some files to the
Andrej Mitrovic (9/9) Jun 03 2011 Oh I think it's that stack overflow thing bear was talking about?
Andrej Mitrovic (2/2) Jun 03 2011 Okay, here's how to increase stack size via Optlink:

bearophile (4/6) Jun 03 2011 Keep in mind that sets the max stack size, not the max heap size. -L/STA...

Andrej Mitrovic (4/11) Jun 03 2011 Is that in bytes or kilobytes? Well, I guess Optlink/LD will
Andrej Mitrovic (2/2) Jun 03 2011 Btw for those wondering, the paper's code is at

Jonathan M Davis (3/7) Jun 03 2011 Or maybe you just have a way better machine?
Andrej Mitrovic (9/9) Jun 03 2011 DMD debug: 50s

Andrei Alexandrescu (4/13) Jun 03 2011 I think they do. But another good data point would be the C++ timings on...

Andrej Mitrovic (3/3) Jun 03 2011 I've compiled the C++ example from google, but I had to remove the

Andrei Alexandrescu (4/7) Jun 03 2011 Hmm, it would be odd if the D version is more than twice faster than the...

Andrej Mitrovic (1/1) Jun 03 2011 SORRY, I meant gcc 4.5.2 not 4.4.0.
Andrej Mitrovic (4/4) Jun 03 2011 I can't compile bear's C++0x example, I get a ton of undefined
Andrej Mitrovic (3/3) Jun 03 2011 It seems I should have used g++:
Andrej Mitrovic (1/1) Jun 03 2011 I'll give a shot at compiling on Linux, maybe this is all MinGW's fault.
Andrej Mitrovic (1/1) Jun 03 2011 What do I use to accurately time an executable under Linux?
Andrej Mitrovic (7/7) Jun 03 2011 Running Ubuntu v10.10 x32 on Virtualbox with hardware virtualization,

Andrei Alexandrescu (9/16) Jun 03 2011 Interesting. So now we have a run time comparable with C++'s run time in...

bearophile (18/20) Jun 03 2011 First version, with just classes, a bit better cleaned up:

Andrej Mitrovic (2/4) Jun 03 2011 38secs. Cut down by 10 secs from last time.
Caligo (19/19) Jun 03 2011 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,

dsimcha (2/21) Jun 03 2011 Why not -inline on dmd?

Caligo (23/49) Jun 04 2011 I don't like the '-inline' option, but here it is. Besides, I usually
Caligo (6/6) Jun 04 2011 And just to be fair to C++:

Andrei Alexandrescu (4/10) Jun 04 2011 Thanks to all who participated to this. I've shared these results on red...

Mafi (2/14) Jun 04 2011 Could you please give a link to the post. I can't find it.

Andrei Alexandrescu (4/24) Jun 04 2011 My id is andralex.

Andrej Mitrovic (3/3) Jun 04 2011 Here's another run on win32 for bear's struct version:
Andrej Mitrovic (2/2) Jun 04 2011 AFAIK google also messed with the GC in their paper, so I'd say this

Walter Bright (3/5) Jun 04 2011 Direct link:

Iain Buclaw (8/40) Jun 04 2011 Well, I'm not the least bit surprised by DMD outperforming GDC here.

dsimcha (4/23) Jun 03 2011 Oh, also, you way want to re-try the benchmark w/ 2.053. It looks

Timon Gehr (7/27) Jun 04 2011 then I'll create one or two more optimized versions (one using a memory ...

Timon Gehr (12/32) Jun 04 2011 then I'll create one or two more optimized versions (one using a memory ...

Daniel Gibson (9/53) Jun 04 2011 What was your time for D without disabling the GC? Probably 40-50s?

Timon Gehr (7/15) Jun 04 2011 Without disabling the GC, it runs through in 38.2s.

Andrei Alexandrescu (3/37) Jun 04 2011 Better yet, feel free to contribute. The more of us participate, the bet...

Walter Bright (3/6) Jun 04 2011 It would be nice to figure out what is different. Try using the coverage...

bearophile (4/6) Jun 04 2011 There are little differences and inefficiencies here and there, but in t...

dsimcha (6/12) Jun 04 2011 That's probably right, for two reasons:

Daniel Gibson (5/28) Jun 04 2011 Besides better optimization by the compiler,

Walter Bright (2/5) Jun 04 2011 Easy to test, simply disable the gc.

Andrei Alexandrescu (9/26) Jun 03 2011 Just call stdout.flush().

bearophile (54/54) Jun 04 2011 UnionFindNode struct instances probably doesn't need a memory pool, it's...

bearophile (4/8) Jun 04 2011 Sorry, this code is wrong, it has to iterate from 0 to length.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

This paper:

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/

compares the efficiency of an algorithm implemented in C++, Scala, Java, 
and Go. I wonder how D would fare. Does anyone have the time and 
inclination to port the benchmark so we can take a look?


Thanks,

Andrei

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

Andrei Alexandrescu wrote:
 This paper:

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/
 compares the efficiency of an algorithm implemented in C++, Scala, Java,
 and Go. I wonder how D would fare. Does anyone have the time and
 inclination to port the benchmark so we can take a look?


 Thanks,

 Andrei

OK. I'll start right away.

Timon

Jun 03 2011

bearophile <bearophileHUGS lycos.com> writes:

Timon Gehr:

 OK. I'll start right away.

I have independently started it too :-)
But this article written by Google and its benchmarking methodology are not
good.

Bye,
bearophile

Jun 03 2011

"Nick Sabalausky" <a a.a> writes:

"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:isbd1j$1u9s$1 digitalmars.com...
 But this article written by Google and its benchmarking methodology are 
 not good.

I realize you'e probably talking about content, but...Pet Peeve: Trying to 
read a paged multi-column layout outside of dead-tree form.

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

 I have independently started it too :-)
 But this article written by Google and its benchmarking methodology are not
good.

 Bye,
 bearophile

Ok, we can compare the two implementations afterwards. BTW, have you already
noticed how incredibly stupid some parts of the implementation of cpp are?

I think I will also do a d_pro version after porting cpp.

Timon

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

 I have independently started it too :-)
 But this article written by Google and its benchmarking methodology are not
good.

 Bye,
 bearophile

 Ok, we can compare the two implementations afterwards. BTW, have you already
 noticed how incredibly stupid some parts of the implementation of cpp are?

 I think I will also do a d_pro version after porting cpp.

 Timon

Terrific, thanks! I think this will be very instructive.

Andrei

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

 I have independently started it too :-)
 But this article written by Google and its benchmarking methodology are not
good.

 Bye,
 bearophile

 Ok, we can compare the two implementations afterwards. BTW, have you already
 noticed how incredibly stupid some parts of the implementation of cpp are?

 I think I will also do a d_pro version after porting cpp.

 Timon

One note - some of their std::list uses can be replaced by SList, and 
others by built-in arrays.

Andrei

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Speaking of SList, for some reason std_slist.html is still distributed with DMD.

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

Andrei Alexandresco wrote:
 On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

 I have independently started it too :-)
 But this article written by Google and its benchmarking methodology are not
good.

 Bye,
 bearophile

 Ok, we can compare the two implementations afterwards. BTW, have you already
 noticed how incredibly stupid some parts of the implementation of cpp are?

 I think I will also do a d_pro version after porting cpp.

 Timon

 One note - some of their std::list uses can be replaced by SList, and
 others by built-in arrays.

 Andrei

Yes, but built-in arrays are more similar to std::vector, therefore it might
already be an optimization over the original benchmark. I'm not sure if that is
okay?

Timon

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 3:01 PM, Timon Gehr wrote:
 Andrei Alexandresco wrote:
 On 6/3/11 2:48 PM, Timon Gehr wrote:
 bearophile wrote:
 Timon Gehr:

 OK. I'll start right away.

 I have independently started it too :-)
 But this article written by Google and its benchmarking methodology are not
good.

 Bye,
 bearophile

 Ok, we can compare the two implementations afterwards. BTW, have you already
 noticed how incredibly stupid some parts of the implementation of cpp are?

 I think I will also do a d_pro version after porting cpp.

 Timon

 One note - some of their std::list uses can be replaced by SList, and
 others by built-in arrays.

 Andrei

 Yes, but built-in arrays are more similar to std::vector, therefore it might
 already be an optimization over the original benchmark. I'm not sure if that
is okay?

 Timon

I noticed that the C++ code uses std::list without there being any need 
for a linked list structure. See for example the data structure used in 
FindSet. It's a list, but it's just appended too and then used for one 
iteration.

Andrei

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.

 Andrei

Yes, but the list in FindSet is unnecessary anyways. If I start changing the
original implementation, the first thing I will do is to remove that.

First however, I will port the code as closely as possible. Is there any
associative version of RedBlackTree (I realize it could be made associative
quite
easily), or should I just use built-in hash maps?

Timon

Jun 03 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.
 
 Andrei

 
 Yes, but the list in FindSet is unnecessary anyways. If I start changing
 the original implementation, the first thing I will do is to remove that.
 
 First however, I will port the code as closely as possible. Is there any
 associative version of RedBlackTree (I realize it could be made associative
 quite easily), or should I just use built-in hash maps?

You give RedBlackTree a different predicate if you want to treat it as a map. 
It defaults to "a < b" with allowDuplicates as false, which makes it a 
multiset. But If you adjust its template parameters, you can make it a set, 
map, multimap, or any other type of collection which uses a red-black tree as 
its data structure.

- Jonathan M Davis

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any need
 for a linked list structure. See for example the data structure used in
 FindSet. It's a list, but it's just appended too and then used for one
 iteration.

 Andrei

 Yes, but the list in FindSet is unnecessary anyways. If I start changing
 the original implementation, the first thing I will do is to remove that.

 First however, I will port the code as closely as possible. Is there any
 associative version of RedBlackTree (I realize it could be made associative
 quite easily), or should I just use built-in hash maps?

 You give RedBlackTree a different predicate if you want to treat it as a map.
 It defaults to "a < b" with allowDuplicates as false, which makes it a
 multiset. But If you adjust its template parameters, you can make it a set,
 map, multimap, or any other type of collection which uses a red-black tree > as
 its data structure.

 - Jonathan M Davis

Yes, thats what I had in mind, but I thought it is strange that there is no
boilerplate map in std.container.

Timon

Jun 03 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On 2011-06-03 14:30, Timon Gehr wrote:
 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any
 need for a linked list structure. See for example the data structure
 used in FindSet. It's a list, but it's just appended too and then
 used for one iteration.
 
 Andrei

 
 Yes, but the list in FindSet is unnecessary anyways. If I start
 changing the original implementation, the first thing I will do is to
 remove that.
 
 First however, I will port the code as closely as possible. Is there
 any associative version of RedBlackTree (I realize it could be made
 associative quite easily), or should I just use built-in hash maps?

 
 You give RedBlackTree a different predicate if you want to treat it as a
 map. It defaults to "a < b" with allowDuplicates as false, which makes
 it a multiset. But If you adjust its template parameters, you can make
 it a set, map, multimap, or any other type of collection which uses a
 red-black tree > as its data structure.
 
 - Jonathan M Davis

 
 Yes, thats what I had in mind, but I thought it is strange that there is no
 boilerplate map in std.container.

std.container provides data structures. It's up to you to decide how to use 
the data structures. It does not attempt to create a wrapper or anything of 
the sort to give you particular uses for a data structure (such as using 
RedBlackTree as a map). Whether it should do that sort of thing is up for 
debate, but its focus is on data structures not on "collections" which are 
organized by how they're used.

- Jonathan M Davis

Jun 03 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 03 Jun 2011 17:30:54 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any  


 need
 for a linked list structure. See for example the data structure  


 used in
 FindSet. It's a list, but it's just appended too and then used for  


 one
 iteration.

 Andrei

 Yes, but the list in FindSet is unnecessary anyways. If I start  

 changing
 the original implementation, the first thing I will do is to remove  

 that.
 First however, I will port the code as closely as possible. Is there  

 any
 associative version of RedBlackTree (I realize it could be made  

 associative
 quite easily), or should I just use built-in hash maps?

 You give RedBlackTree a different predicate if you want to treat it as  
 a map.
 It defaults to "a < b" with allowDuplicates as false, which makes it a
 multiset.


This makes it a set, not a multiset.  allowDuplicates set to true would  
make it a multiset.


 Yes, thats what I had in mind, but I thought it is strange that there is  
 no
 boilerplate map in std.container.

I think the goal is to eventually have these higher-level types based on  
the implementation types, but I'm uncertain how they will look or where  
they will go.  Almost certainly, there will be a template based on a  
set-like template for defining a map (e.g. map!(RedBlackTree, int, int) ).

If you want a cookie-cutter map type that uses the exact same  
implementation as RedBlackTree, with a custom allocator that helps with  
performance for long-lasting containers, dcollections provides such a type  
(TreeMap).  If you do end up using it, I recommend the latest trunk, I've  
made some critical bug fixes (I need to release soon...).

-Steve

Jun 06 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-06-06 06:37, Steven Schveighoffer wrote:
 On Fri, 03 Jun 2011 17:30:54 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
 Jonathan M Davis wrote:
 On 2011-06-03 14:08, Timon Gehr wrote:
 Andrei Alexandrescu wrote:
 I noticed that the C++ code uses std::list without there being any


 
 need
 
 for a linked list structure. See for example the data structure


 
 used in
 
 FindSet. It's a list, but it's just appended too and then used for


 
 one
 
 iteration.
 
 Andrei

 
 Yes, but the list in FindSet is unnecessary anyways. If I start

 
 changing
 
 the original implementation, the first thing I will do is to remove

 
 that.
 
 First however, I will port the code as closely as possible. Is there

 
 any
 
 associative version of RedBlackTree (I realize it could be made

 
 associative
 
 quite easily), or should I just use built-in hash maps?

 
 You give RedBlackTree a different predicate if you want to treat it as
 a map.
 It defaults to "a < b" with allowDuplicates as false, which makes it a
 multiset.


 
 This makes it a set, not a multiset.  allowDuplicates set to true would
 make it a multiset.

Bleh. You're right. I guess that I wasn't thinking straight (or was thinking 
too quickly) on that one. :(

- Jonathan M Davis

Jun 06 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 6/3/11, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 Speaking of SList, for some reason std_slist.html is still distributed with
 DMD.

I've filed it (and many more):
http://d.puremagic.com/issues/show_bug.cgi?id=6101

Jun 03 2011

bearophile <bearophileHUGS lycos.com> writes:

Timon Gehr:

 have you already noticed how incredibly stupid some parts of the
implementation of cpp are?

Some of those "stupid" things come from the very strict Google C++ coding
standards.

Even if my C++ experience is not extensive, I find this C++ easy to understand.

One part of the code:

      }  // node_pool.size
    }  // Step c
  }  // FindLoops

 private:
  MaoCFG             *CFG_;      // current control flow graph.
  LoopStructureGraph *lsg_;      // loop forest.
};  // HavlakLoopFinder


The Ada language has a syntax to write those names at the closing ends, and the
Ada compiler enforces such names to be always coherent and correct. In C/C++/D
unfortunately such names (written as comments) may go out of sync. Despite this
risk I add similar comments at the end of functions/classe/structs/loops in my
D code too, because the risk of them getting out of sync is lower than the
advantages they give me. I have not written an enhancement request yet about
this in the D Bugzilla yet, also because I don't know what a good C-style
syntax for it could be.

Bye,
bearophile

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

bearophile wrote:
 Timon Gehr:

 have you already noticed how incredibly stupid some parts of the implementation


of cpp are?
 Some of those "stupid" things come from the very strict Google C++ coding

standards.

sure? Eg.

  UnionFindNode *FindSet() {
    typedef std::list<UnionFindNode *> NodeListType;
    NodeListType nodeList;

    UnionFindNode *node = this;
    while (node != node->parent()) {
      if (node->parent() != node->parent()->parent())
        nodeList.push_back(node);
      node = node->parent();
    }

    // Path Compression, all nodes' parents point to the 1st level parent.
    NodeListType::iterator iter = nodeList.begin();
    NodeListType::iterator end  = nodeList.end();
    for (; iter != end; ++iter)
      (*iter)->set_parent(node->parent());

    return node;
  }

They create a new linked list of nodes on the way to the root. It is not that
that
is just basically the same they already had when following the parent pointers.
They obviously need a copy.
Also, they made sure that node==node->parent(). (node is the a root) But then
they
call node->parent(). etc...

Timon

Jun 03 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 6/3/2011 1:24 PM, bearophile wrote:
 The Ada language has a syntax to write those names at the closing ends, and
 the Ada compiler enforces such names to be always coherent and correct. In
 C/C++/D unfortunately such names (written as comments) may go out of sync.

I never understood the point of such.

My editor has a single key "find matching { [ < ( ) > ] }" command, and so I 
never have a need for such ugly comments. In fact I usually delete such
comments 
when I encounter them.

Jun 04 2011

Jeff Nowakowski <jeff dilacero.org> writes:

On 06/04/2011 04:25 AM, Walter Bright wrote:
 On 6/3/2011 1:24 PM, bearophile wrote:
 The Ada language has a syntax to write those names at the closing
 ends, and
 the Ada compiler enforces such names to be always coherent and
 correct. In
 C/C++/D unfortunately such names (written as comments) may go out of
 sync.

 I never understood the point of such.

 My editor has a single key "find matching { [ < ( ) > ] }" command, and
 so I never have a need for such ugly comments. In fact I usually delete
 such comments when I encounter them.

I thought you were big on printed out code reviews and not requiring any 
editing features from the language?

Jun 04 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 6/4/2011 7:14 AM, Jeff Nowakowski wrote:
 I thought you were big on printed out code reviews and not requiring any
editing
 features from the language?

I don't find those comments useful in printed code either.

Jun 04 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrei:

 Does anyone have the time and inclination to port the benchmark so we can take
a look?

My modified C++0x code:
http://codepad.org/iBwINTkJ

First D2 translation that seems to work:
http://codepad.org/sEEFNlAd

Notes on the translation:
- The program crashes with zero error messages if you don't increase the stack.
Time ago I have written an enhancement request on this:
http://d.puremagic.com/issues/show_bug.cgi?id=6088
- The final print of the "." done using write() doesn't perform the fflush. So
after few tries I have used printf. I have not found a good way to flush.
Suggestions welcome.
- The C++ code uses classes, but they are often more like D structs. In the
first D version to keep things simple I have used final classes everywhere. I
will profile the code and I will create a more efficient version that uses
structs where needed.
- I have had to track down a null reference error (caused by the nodes array,
that contains just references in the first D version). Nonnullable type
modifiers avoids similar bugs at compile-time.
- The C++ code is 25 KB, the D code (first version is 21 KB), the C++ code uses
about 100 MB at runtime, while the D code uses about 160 MB.
- The C++ loops to iterate on collections is many times worse than the D
foreach.
- In the main() of C++ code, there is a LoopStructureGraph lsg that shadows
another one, I've replaced it with a lsg2 in both D and C++ code.
- In the C++ code I have used a std::unordered_map to implement BasicBlockMap,
because it's significantly faster. So now the code is C++0x.
- The method names start with uppercase, I'd like to change this in a second D
version.
- No linked lists or sorted dictionaries seem needed in this program. But I
have felt a bit of need of a set data structure in D (I have used an AA with
bool values).

Despite there are some associative arrays of SimpleLoop and BasicBlock objects,
I think their classes don't need a hash function and opEquals.

Bye,
bearophile

Jun 03 2011

Adam D. Ruppe <destructionator gmail.com> writes:

bearophile wrote:
 My modified C++0x code:
 First D2 translation that seems to work:

What timings did you get? On my computer, the D version ran
slightly faster (56 seconds vs 63s for C++) without optimizations
turned on.

With optimizations turned on, C++ took a nice lead (28 seconds vs 53
seconds for D).


This is similar to other benchmarks I've run in the past. Standard
D builds beat standard C++ builds, but gcc's optimizer takes the
lead vs dmd's optimizer.

However, in the past, I've found writing D style instead of C++
style lets standard D beat even optimized g++. I wonder if that's
the case here too..

 - The final print of the "." done using write() doesn't perform
 the fflush. So after few tries I have used printf. I have not
 found a good way to flush. Suggestions welcome.


To flush with D's stdout, simply call stdout.flush:

write(".");
stdout.flush();

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 6:09 PM, Adam D. Ruppe wrote:
 bearophile wrote:
 My modified C++0x code:
 First D2 translation that seems to work:

 What timings did you get? On my computer, the D version ran
 slightly faster (56 seconds vs 63s for C++) without optimizations
 turned on.

 With optimizations turned on, C++ took a nice lead (28 seconds vs 53
 seconds for D).

What are your timings for the Java, Scala, and Go versions?

Thanks,

Andrei

Jun 03 2011

Adam D. Ruppe <destructionator gmail.com> writes:

Andrei Alexandrescu wrote:
 What are your timings for the Java, Scala, and Go versions?

I don't have compilers nor the code for those languages, so I don't
know.

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I must be doing something wrong. Do I have to pipe some files to the
module or something?

I've compiled it and ran it, it's done in 1.5 seconds. :s

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Oh I think it's that stack overflow thing bear was talking about?
Because it quits after this with no message:

Welcome to LoopTesterApp, D edition
Constructing App...
Constructing Simple CFG...
15000 dummy loops
Constructing CFG...
Performing Loop Recognition
1 Iteration

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Okay, here's how to increase stack size via Optlink:
dmd benchmark.d -L/STACK:128000000

Jun 03 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrej Mitrovic:

 Okay, here's how to increase stack size via Optlink:
 dmd benchmark.d -L/STACK:128000000

Keep in mind that sets the max stack size, not the max heap size.
-L/STACK:5000000 is plenty :-)

Bye,
bearophile

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 6/4/11, bearophile <bearophileHUGS lycos.com> wrote:
 Andrej Mitrovic:

 Okay, here's how to increase stack size via Optlink:
 dmd benchmark.d -L/STACK:128000000

 Keep in mind that sets the max stack size, not the max heap size.
 -L/STACK:5000000 is plenty :-)

 Bye,
 bearophile

Is that in bytes or kilobytes? Well, I guess Optlink/LD will
automatically limit the stack size to something sane if I've blown it
out of proportion. The app ends up using about 130megs.

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Btw for those wondering, the paper's code is at
http://code.google.com/p/multi-language-bench/

Jun 03 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-06-03 17:16, Andrej Mitrovic wrote:
 I must be doing something wrong. Do I have to pipe some files to the
 module or something?
 
 I've compiled it and ran it, it's done in 1.5 seconds. :s

Or maybe you just have a way better machine?

- Jonathan M Davis

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

DMD debug: 50s
DMD optimized: 49s (-release -noboundscheck -O -inline -L/STACK:1280000000)

That's around 1GB of stack memory.
Compiling with GDC will make the app throw an exception at runtime due
to the stack being blown, the error message isn't special but it's
better than no error message.

GDC debug: 1m:24s
GDC optimized: 1m:12s (gdc -v2 -O3 -Wl,--stack,1280000000)

I haven't tested the other ones. Does Go have a Windows compiler yet?

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 7:42 PM, Andrej Mitrovic wrote:
 DMD debug: 50s
 DMD optimized: 49s (-release -noboundscheck -O -inline -L/STACK:1280000000)

 That's around 1GB of stack memory.
 Compiling with GDC will make the app throw an exception at runtime due
 to the stack being blown, the error message isn't special but it's
 better than no error message.

 GDC debug: 1m:24s
 GDC optimized: 1m:12s (gdc -v2 -O3 -Wl,--stack,1280000000)

 I haven't tested the other ones. Does Go have a Windows compiler yet?

I think they do. But another good data point would be the C++ timings on 
the same machine.

Andrei

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I've compiled the C++ example from google, but I had to remove the
"-lc" option due to errors (what is this option anyway?).

gcc 4.4.0 with -O3: 1:50

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 8:05 PM, Andrej Mitrovic wrote:
 I've compiled the C++ example from google, but I had to remove the
 "-lc" option due to errors (what is this option anyway?).

 gcc 4.4.0 with -O3: 1:50

Hmm, it would be odd if the D version is more than twice faster than the 
C++ one. Wonder what's going on there.

Andrei

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

SORRY, I meant gcc 4.5.2 not 4.4.0.

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I can't compile bear's C++0x example, I get a ton of undefined
reference errors. I've tried with:
gcc -O3 -std=gnu++0x raw.cpp

Maybe he's using gcc 4.6 or I'm missing some flags.

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

It seems I should have used g++:

g++ -O3 -std=gnu++0x -Wl,--stack,1280000000 raw.cpp

time: 2m:12s

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I'll give a shot at compiling on Linux, maybe this is all MinGW's fault.

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

What do I use to accurately time an executable under Linux?

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Running Ubuntu v10.10 x32 on Virtualbox with hardware virtualization,
GCC v4.4.4 and DMD 2.053 installed.

DMD debug: 54s
DMD optimized: 47s

Google's C++ optimized: 27s (gcc -O3 -lc -lstdc++)
Bear's C++ optimized: 24s (g++ -O3 -std=gnu++0x)

So it's MinGW's fault apparently.

Jun 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 06/03/2011 08:47 PM, Andrej Mitrovic wrote:
 Running Ubuntu v10.10 x32 on Virtualbox with hardware virtualization,
 GCC v4.4.4 and DMD 2.053 installed.

 DMD debug: 54s
 DMD optimized: 47s

 Google's C++ optimized: 27s (gcc -O3 -lc -lstdc++)
 Bear's C++ optimized: 24s (g++ -O3 -std=gnu++0x)

 So it's MinGW's fault apparently.

Interesting. So now we have a run time comparable with C++'s run time in 
the paper. The implementation by bearophile has 721 lines. Once we have 
compilation times, binary size, and memory footprint, we have a good 
comparison point with the study in the paper.

Far as I can tell D comes in the second place after C++ at run time. 
With optimizations and all it could get significantly closer.


Thanks,

Andrei

Jun 03 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrei:

 Far as I can tell D comes in the second place after C++ at run time. 
 With optimizations and all it could get significantly closer.

First version, with just classes, a bit better cleaned up:
http://codepad.org/DggCx26d

Second version, with all structs:
http://codepad.org/etsLsZV5

Tomorrow I'll de-optimize it a bit replacing some structs with classes. And
then I'll create one or two more optimized versions (one using a memory pool
for the nodes, and one trying to apply some of the C++ improvement ideas from
the original paper).

The number of instances allocated:
Class instances:
SimpleLoop_counter            3_936_102
LoopStructureGraph_counter       15_051
UnionFindNode_counter        13_017_663
HavlakLoopFinder_counter         15_051
BasicBlockEdge_counter          378_036
BasicBlock_counter              252_013
MaoCFG_counter                        1

UnionFindNode probably will give some gain if allocated from a pool.

Later,
bearophile

Jun 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 6/4/11, bearophile <bearophileHUGS lycos.com> wrote:
 Second version, with all structs:
 http://codepad.org/etsLsZV5

38secs. Cut down by 10 secs from last time.

Jun 03 2011

Caligo <iteronvexor gmail.com> writes:

Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
and latest LDC2]

g++ -O3
[VIRT: 185MB,  RES: 174MB]
real    0m28.407s
user    0m28.330s
sys     0m0.070s

DMD -O -release
[VIRT: 94MB,  RES: 92MB]
real    0m43.232s
user    0m42.980s
sys     0m0.070s

GDC -O3
[VIRT: 306MB,  RES: 295MB]
real    1m10.788s
user    1m10.570s
sys     0m0.190s

LDC2
segmentation fault

Jun 03 2011

dsimcha <dsimcha yahoo.com> writes:

On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB,  RES: 174MB]
 real    0m28.407s
 user    0m28.330s
 sys     0m0.070s

 DMD -O -release
 [VIRT: 94MB,  RES: 92MB]
 real    0m43.232s
 user    0m42.980s
 sys     0m0.070s

 GDC -O3
 [VIRT: 306MB,  RES: 295MB]
 real    1m10.788s
 user    1m10.570s
 sys     0m0.190s

 LDC2
 segmentation fault

Why not -inline on dmd?

Jun 03 2011

Caligo <iteronvexor gmail.com> writes:

On Fri, Jun 3, 2011 at 11:16 PM, dsimcha <dsimcha yahoo.com> wrote:
 On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB, =A0RES: 174MB]
 real =A0 =A00m28.407s
 user =A0 =A00m28.330s
 sys =A0 =A0 0m0.070s

 DMD -O -release
 [VIRT: 94MB, =A0RES: 92MB]
 real =A0 =A00m43.232s
 user =A0 =A00m42.980s
 sys =A0 =A0 0m0.070s

 GDC -O3
 [VIRT: 306MB, =A0RES: 295MB]
 real =A0 =A01m10.788s
 user =A0 =A01m10.570s
 sys =A0 =A0 0m0.190s

 LDC2
 segmentation fault

 Why not -inline on dmd?

I don't like the '-inline' option, but here it is.  Besides, I usually
use GDC or LDC2 and I was expecting them to outperform DMD because
they usually do, but not in this case.


DMD-32bit  v2.52 -O -release -inline
[VIRT: 94MB,  RES: 92MB]
real    0m42.490s
user    0m42.480s
sys     0m0.000s

DMD-32bit  v2.53  -O -release -inline
[VIRT: 107MB, RES: 104MB]
real    0m34.011s
user    0m33.930s
sys     0m0.070s

DMD-64bit  v2.53 -O -release -inline
segmentation fault

DMD-64bit  v2.53 -O -release
[VIRT: 232MB, RES: 219MB]
real    0m44.715s
user    0m44.580s
sys     0m0.080s

P.S.
It's a 64-bit system.

Jun 04 2011

Caligo <iteronvexor gmail.com> writes:

And just to be fair to C++:

g++ -O2 -m32
[VIRT: 94MB,  RES: 92MB]
real    0m24.567s
user    0m24.500s
sys     0m0.060s

Jun 04 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB,  RES: 92MB]
 real    0m24.567s
 user    0m24.500s
 sys     0m0.060s

Thanks to all who participated to this. I've shared these results on reddit:

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/


Andrei

Jun 04 2011

Mafi <mafi example.org> writes:

Am 04.06.2011 16:37, schrieb Andrei Alexandrescu:
 On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB, RES: 92MB]
 real 0m24.567s
 user 0m24.500s
 sys 0m0.060s

 Thanks to all who participated to this. I've shared these results on
 reddit:

 http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/



 Andrei

Could you please give a link to the post. I can't find it.

Jun 04 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 06/04/2011 09:43 AM, Mafi wrote:
 Am 04.06.2011 16:37, schrieb Andrei Alexandrescu:
 On 06/04/2011 02:10 AM, Caligo wrote:
 And just to be fair to C++:

 g++ -O2 -m32
 [VIRT: 94MB, RES: 92MB]
 real 0m24.567s
 user 0m24.500s
 sys 0m0.060s

 Thanks to all who participated to this. I've shared these results on
 reddit:

 http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/




 Andrei

 Could you please give a link to the post. I can't find it.

My id is andralex.

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/c1xqj0w

Andrei

Jun 04 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Here's another run on win32 for bear's struct version:

DMD optimized, GC enabled:  38s.015
DMD optimized, GC disabled: 30s.890

Jun 04 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

AFAIK google also messed with the GC in their paper, so I'd say this
is fair-game.

Jun 04 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 6/4/2011 7:37 AM, Andrei Alexandrescu wrote:
 Thanks to all who participated to this. I've shared these results on reddit:

 http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/

Direct link:

http://www.reddit.com/r/programming/comments/hqkwk/google_paper_comparing_performance_of_c_java/c1xqj0w

Jun 04 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 4 June 2011 08:03, Caligo <iteronvexor gmail.com> wrote:
 On Fri, Jun 3, 2011 at 11:16 PM, dsimcha <dsimcha yahoo.com> wrote:
 On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB, =A0RES: 174MB]
 real =A0 =A00m28.407s
 user =A0 =A00m28.330s
 sys =A0 =A0 0m0.070s

 DMD -O -release
 [VIRT: 94MB, =A0RES: 92MB]
 real =A0 =A00m43.232s
 user =A0 =A00m42.980s
 sys =A0 =A0 0m0.070s

 GDC -O3
 [VIRT: 306MB, =A0RES: 295MB]
 real =A0 =A01m10.788s
 user =A0 =A01m10.570s
 sys =A0 =A0 0m0.190s

 LDC2
 segmentation fault

 Why not -inline on dmd?

 I don't like the '-inline' option, but here it is. =A0Besides, I usually
 use GDC or LDC2 and I was expecting them to outperform DMD because
 they usually do, but not in this case.

Well, I'm not the least bit surprised by DMD outperforming GDC here.
Mostly because the program produced by GDC will have 109+ extra array
bounds checks, and 300+ extra assertions (hint: the program produced
by DMD will have 0).


--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Jun 04 2011

dsimcha <dsimcha yahoo.com> writes:

On 6/4/2011 12:01 AM, Caligo wrote:
 Gentoo/Linux [gcc version 4.4.5, DMD 2.52, latest GDC with GCC 4.4.5,
 and latest LDC2]

 g++ -O3
 [VIRT: 185MB,  RES: 174MB]
 real    0m28.407s
 user    0m28.330s
 sys     0m0.070s

 DMD -O -release
 [VIRT: 94MB,  RES: 92MB]
 real    0m43.232s
 user    0m42.980s
 sys     0m0.070s

 GDC -O3
 [VIRT: 306MB,  RES: 295MB]
 real    1m10.788s
 user    1m10.570s
 sys     0m0.190s

 LDC2
 segmentation fault

Oh, also, you way want to re-try the benchmark w/ 2.053.  It looks 
rather allocation heavy and I substantially improved the GC performance 
for 2.053.

Jun 03 2011

Timon Gehr <timon.gehr gmx.ch> writes:

 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

 First version, with just classes, a bit better cleaned up:
 http://codepad.org/DggCx26d

 Second version, with all structs:
 http://codepad.org/etsLsZV5

 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

then I'll create one or two more optimized versions (one using a memory pool for
the nodes, and one trying to apply some of the C++ improvement ideas > from the
original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

Your port segfaults DMD 2.053 with the -g flag (at least on linux).
 Andrei: You may want to point out on reddit that the code is approx. a 1 to 1
port of the C++ code and not specially tuned.

Timon

Jun 04 2011

Timon Gehr <timon.gehr gmx.ch> writes:

 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

 First version, with just classes, a bit better cleaned up:
 http://codepad.org/DggCx26d

 Second version, with all structs:
 http://codepad.org/etsLsZV5

 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

then I'll create one or two more optimized versions (one using a memory pool for
the nodes, and one trying to apply some of the C++ improvement ideas > from the
original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

One simple but very powerful optimization is to minimize the runs of the GC. I
have added a call to GC.disable(); in the beginning of main and then added a
GC.collect(); after each 10 test runs.

Results on my machine (32bit executables):

C++ (-O2): 30.7s, ~170MB.
D (-release -O -inline): 29.5s, ~520MB

Ds GC needs to get faster. A concurrent GC would have hidden away most of the
overhead on a multi-core processor ;).

Timon

Jun 04 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 04.06.2011 18:25, schrieb Timon Gehr:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

 First version, with just classes, a bit better cleaned up:
 http://codepad.org/DggCx26d

 Second version, with all structs:
 http://codepad.org/etsLsZV5

 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And >

 then I'll create one or two more optimized versions (one using a memory pool
for
 the nodes, and one trying to apply some of the C++ improvement ideas > from the
 original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

 
 One simple but very powerful optimization is to minimize the runs of the GC. I
 have added a call to GC.disable(); in the beginning of main and then added a
 GC.collect(); after each 10 test runs.
 
 Results on my machine (32bit executables):
 
 C++ (-O2): 30.7s, ~170MB.
 D (-release -O -inline): 29.5s, ~520MB
 
 Ds GC needs to get faster. A concurrent GC would have hidden away most of the
 overhead on a multi-core processor ;).
 
 Timon

What was your time for D without disabling the GC? Probably 40-50s?
This certainly is a big improvement, I didn't think the GC slows it down
that much.

What'd be really interesting is the benchmark with a D-style
implementation of the code (if I understood correctly the current
versions are more or less direct translations of the C++ code to D).

Cheers,
- Daniel

Jun 04 2011

Timon Gehr <timon.gehr gmx.ch> writes:

Daniel Gibson wrote:
 What was your time for D without disabling the GC? Probably 40-50s?
 This certainly is a big improvement, I didn't think the GC slows it down
 that much.

 What'd be really interesting is the benchmark with a D-style
 implementation of the code (if I understood correctly the current
 versions are more or less direct translations of the C++ code to D).

 Cheers,
 - Daniel

Without disabling the GC, it runs through in 38.2s.

The D-style implementation would probably be about the same, bearophile has
already replaced std::map/std::set with associative arrays and std::list with
dynamic arrays. The algorithm cannot take advantage of Eg. array slicing. It
solves a graph problem. I will try to tune the code some more.

Timon

Jun 04 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 06/04/2011 11:08 AM, Timon Gehr wrote:
 Andrei:

 Far as I can tell D comes in the second place after C++ at run time.
 With optimizations and all it could get significantly closer.

 First version, with just classes, a bit better cleaned up:
 http://codepad.org/DggCx26d

 Second version, with all structs:
 http://codepad.org/etsLsZV5

 Tomorrow I'll de-optimize it a bit replacing some structs with classes. And>

 then I'll create one or two more optimized versions (one using a memory pool
for
 the nodes, and one trying to apply some of the C++ improvement ideas>  from the
 original paper).
 The number of instances allocated:
 Class instances:
 SimpleLoop_counter            3_936_102
 LoopStructureGraph_counter       15_051
 UnionFindNode_counter        13_017_663
 HavlakLoopFinder_counter         15_051
 BasicBlockEdge_counter          378_036
 BasicBlock_counter              252_013
 MaoCFG_counter                        1

 UnionFindNode probably will give some gain if allocated from a pool.

 Later,
 bearophile

 Your port segfaults DMD 2.053 with the -g flag (at least on linux).
  Andrei: You may want to point out on reddit that the code is approx. a 1 to 1
 port of the C++ code and not specially tuned.

 Timon

Better yet, feel free to contribute. The more of us participate, the better.

Andrei

Jun 04 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 6/3/2011 4:09 PM, Adam D. Ruppe wrote:
 This is similar to other benchmarks I've run in the past. Standard
 D builds beat standard C++ builds, but gcc's optimizer takes the
 lead vs dmd's optimizer.

It would be nice to figure out what is different. Try using the coverage 
analyzer and profiler for starters!

Jun 04 2011

bearophile <bearophileHUGS lycos.com> writes:

Walter:

 It would be nice to figure out what is different. Try using the coverage 
 analyzer and profiler for starters!

There are little differences and inefficiencies here and there, but in the
second D version I think most of the performance difference over the C++ code
is caused by the GC. I will do some tests.

Bye,
bearophile

Jun 04 2011

dsimcha <dsimcha yahoo.com> writes:

On 6/4/2011 9:15 AM, bearophile wrote:
 Walter:

 It would be nice to figure out what is different. Try using the coverage
 analyzer and profiler for starters!

 There are little differences and inefficiencies here and there, but in the
second D version I think most of the performance difference over the C++ code
is caused by the GC. I will do some tests.

 Bye,
 bearophile

That's probably right, for two reasons:

1.  Other than the GC, D doesn't have any "hidden cost" features that 
would explain it being slower than C++ for similarly written code.

2.  A few posts back, it was noted that DMD2.053, which includes my GC 
optimizations, was substantially faster than 2.052, which doesn't.

Jun 04 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 04.06.2011 16:54, schrieb dsimcha:
 On 6/4/2011 9:15 AM, bearophile wrote:
 Walter:

 It would be nice to figure out what is different. Try using the coverage
 analyzer and profiler for starters!

 There are little differences and inefficiencies here and there, but in
 the second D version I think most of the performance difference over
 the C++ code is caused by the GC. I will do some tests.

 Bye,
 bearophile

 
 That's probably right, for two reasons:
 
 1.  Other than the GC, D doesn't have any "hidden cost" features that
 would explain it being slower than C++ for similarly written code.
 

Besides better optimization by the compiler,
see Adam Ruppe's post, 3 posts up:

 On my computer, the D version ran slightly faster (56 seconds vs 63s >

for C++) without optimizations turned on.
 With optimizations turned on, C++ took a nice lead (28 seconds vs 53
 seconds for D).

So it seems like it's not all the GCs fault.

 2.  A few posts back, it was noted that DMD2.053, which includes my GC
 optimizations, was substantially faster than 2.052, which doesn't.

Jun 04 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 6/4/2011 6:15 AM, bearophile wrote:
 There are little differences and inefficiencies here and there, but in the
 second D version I think most of the performance difference over the C++ code
 is caused by the GC. I will do some tests.

Easy to test, simply disable the gc.

Jun 04 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/11 5:48 PM, bearophile wrote:
 Andrei:

 Does anyone have the time and inclination to port the benchmark so we can take
a look?

 My modified C++0x code:
 http://codepad.org/iBwINTkJ

 First D2 translation that seems to work:
 http://codepad.org/sEEFNlAd

 Notes on the translation:
 - The program crashes with zero error messages if you don't increase the
stack. Time ago I have written an enhancement request on this:
 http://d.puremagic.com/issues/show_bug.cgi?id=6088

I think a better tail call optimization would be in order.

 - The final print of the "." done using write() doesn't perform the fflush. So
after few tries I have used printf. I have not found a good way to flush.
Suggestions welcome.

Just call stdout.flush().

 - The C++ code uses classes, but they are often more like D structs.
 In the first D version to keep things simple I have used final
 classes everywhere. I will profile the code and I will create a more
 efficient version that uses structs where needed.

Makes sense.

 - I have had to track down a null reference error (caused by the
 nodes array, that contains just references in the first D version).
 Nonnullable type modifiers avoids similar bugs at compile-time.

Stay tuned. Good news coming soon.

[snip]

This is solid work. Looking forward to seeing measurements!


Thanks,

Andrei

Jun 03 2011

bearophile <bearophileHUGS lycos.com> writes:

UnionFindNode struct instances probably doesn't need a memory pool, it's
already allocated in (small) arrays.

I have not found a simple enough way to allocate SimpleLoop struct instances
with a memory pool.

I have not found simple ways to use most of the C++ optimization hints of the
original paper. The paper is badly written and its contents are worse. I think
they have written this as an internal benchmark and only later have created a
paper. I think they have used an old CPU because their PC farms often use
similar old CPUs.

I have created a D version a bit de-optimized, that uses classes instead of
structs for the classes that are instantiated only few times. Generally the
original C++ code is not good: it's low level where it doesn't need to be low
level, and it wastes memory and time where it matters (like in tiny data
structures with lot of overhead.) I do not generally write code like that.

I don't have desire to work further on this code.

I'd like a Set with a very handy API (copied from Python) in Phobos.

I'd like the built-in associative arrays to have an optimization for very small
number of items. Python dicts have an optimization for 8 items or less.

I have done no attempts to tune the GC. I confirm that if you disable the GC
and perform a collection only once in a while, the code gets faster. Regarding
the improvements of the D GC, I remember I have found an optimization for LDC1
lot of time ago, but I don't remember it well. I think the GC scans the static
memory too, etc, this slows down the GC.

Google seems to care a lot about compilation times too. The D code compiles in
about 1 second or less with full optimization. The C++0x code compiled with G++
doesn't come even close to that.


D code compiled with DMD 2.053 with:
dmd -O -release -inline -w -L/STACK:5000000
Using -g doesn't crash my code on Windows, so I suggest someone to minimize the
code to find this bug.
I also suggest someone to minimize the code for the 64 version to find the 64
bit DMD bug.

C++0x code compiled with G++ 4.6 with:
g++ -std=c++0x -Ofast -Wall -Wextra -s

With LDC and GDC I suggest to look much better for the correct switches.
Othrerwise this benchmark becomes even more meaningless (it's already not so
meaningful. Google has not released the full C++ code).


I have created a D version that uses a TinySet, following one of the C++
optimization hints, one of the D versions use it, but I don't currently have a
good PC to actually take the timings:


import std.typetuple: TypeTuple;
extern(C) pure nothrow void exit (in int status);

template Range(int stop) {
    static if (stop <= 0)
        alias TypeTuple!() Range;
    else
        alias TypeTuple!(Range!(stop-1), stop-1) Range;
}

struct TinySet(T, int maxLength) {
    T[maxLength] data;
    size_t length = 0; // readonly property

    void add(T x) {
        /*static*/ foreach (i; Range!maxLength)
            if (x == data[i])
                return;
        if (length < maxLength) {
            data[this.length] = x;
            length++;
        } else
            exit(1);
    }

    T[] opSlice() {
        return data[0 .. this.length];
    }
}


C++0x code a bit modified, uses unordered_map instead of silly map:
http://codepad.org/9NEL9yPo

D code with structs and classes (no manual deallocations), less ugly, probably
this is the version to show to people:
http://codepad.org/Xqg0Y26S

Experimental D code with TinySet, with structs and classes (no manual
deallocations):
http://codepad.org/4ATOvFsV

D code with only structs (no manual deallocations):
http://codepad.org/SePFDAaT

D code with only classes:
http://codepad.org/k0DUzP3D

Bye,
bearophile

Jun 04 2011

bearophile <bearophileHUGS lycos.com> writes:

     void add(T x) {
         /*static*/ foreach (i; Range!maxLength)
             if (x == data[i])
                 return;

Sorry, this code is wrong, it has to iterate from 0 to length.

If some code doesn't have unittests then it's broken.

Bye,
bearophile

Jun 04 2011

D Programming

C/C++ Programming

Other

digitalmars.D - Port a benchmark to D?