digitalmars.D - D performance guideline (request)

bobef (4/4) Apr 12 2008 Hi,

Sean Kelly (43/45) Apr 12 2008 scripting language - FlowerScript, and I really need to take every spare...
janderson (55/61) Apr 12 2008 I imagine you understand this however I'll point it out anyway. The

janderson (2/2) Apr 12 2008 Here's another one. Plug in nedmalloc. There's a D port somewhere...

Unknown W. Brackets (19/25) Apr 12 2008 There's plenty to be said, and surely that will be said. When it comes
downs (7/7) Apr 12 2008 I can only think of a few, sadly.
Walter Bright (4/5) Apr 12 2008 The first thing I'd suggest is get very comfortable with D's built-in
Robert Fraser (11/17) Apr 12 2008 I'm no expert but (as others have pointed out):

Robert Fraser (3/5) Apr 12 2008 Or maybe not:

Georg Wrede (13/22) Apr 13 2008 Well, the guy certainly saw a lot of effort trying to prove that Judy

Robert Fraser (6/32) Apr 13 2008 While I often do sequential inserts into an associative array, I rarely

Craig Black (3/3) Apr 12 2008 In case it wasn't already mentioned, use Tango instead of Phobos. It ha...

downs (3/7) Apr 13 2008 If your supposedly speed-efficient program is GC-bound, you have bigger ...

Georg Wrede (19/26) Apr 13 2008 Seriously, there's a sharp and definitive point to that.

Bruno Medeiros (7/11) Apr 25 2008 If you really need the speed boost (I also say: profile first), then

bobef <bobef abv-nospam.bg> writes:

Hi,

I would like to request a guideline how to squeeze every last nanosecond out of
D. I am writing this scripting language - FlowerScript, and I really need to
take every spare bit... But the problem is... I am a noob. D performs really
well, of course, but lame people like me could make even D slow :) I know here
are a lot of people that are very good programmers with broad knowledge how
things work and a lot of experiene, so I would like to ask you to share your
experience - what to do when every tick matters. For example one could use
pointers for indexing arrays, to bypass the bounds checking. Yes, it is not
that pretty but gives a big boost. I am trying to experiment with things, but
sometimes it is really strange. For example I have few ifs. I rearrange them a
bit. Sometimes I move one if inside other and it becomes slower, sometimes I
simply add curly braces and it becomes faster. Or I wrap a block of code in a
try {} and it becomes faster. Why is that? If I wrap every line of code in try,
will I make it fly ? :D

Thanks,
bobef

Apr 12 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from bobef (bobef abv-nospam.bg)'s article
 Hi,
 I would like to request a guideline how to squeeze every last nanosecond out
of D. I am writing this

scripting language - FlowerScript, and I really need to take every spare bit...
But the problem is... I am a
noob. D performs really well, of course, but lame people like me could make
even D slow :) I know here
are a lot of people that are very good programmers with broad knowledge how
things work and a lot of
experiene, so I would like to ask you to share your experience - what to do
when every tick matters. For
example one could use pointers for indexing arrays, to bypass the bounds
checking. Yes, it is not that
pretty but gives a big boost. I am trying to experiment with things, but
sometimes it is really strange.
For example I have few ifs. I rearrange them a bit. Sometimes I move one if
inside other and it becomes
slower, sometimes I simply add curly braces and it becomes faster. Or I wrap a
block of code in a try {}
and it becomes faster. Why is that? If I wrap every line of code in try, will I
make it fly ? :D

The code generator used by DMD is really weird.  Similar things cropped up
during performance
tests of the Tango XML parser--sometimes just rearranging a set of declarations
would have a
considerable effect on performance.  I think a lot of it had to to with
alignment and such.  There's
generally no point in this level of optimization, because each compiler will
react differently to the
changes.  GDC, for example, tended to react in an opposite manner of DMD,
though it was more
consistent overall.

As for performance tuning itself--don't worry about pointers vs. indexing. 
Using the -release
flag will turn off bounds checking anyway.  Typically, the most notable effect
you'll see is from
the overarching algorithms you use.  However, if you're really counting cycles
then you can do
things like minimize branches by placing the condition you expect to be true
most often inside
an "if" statement.  ie.

    void func() {
        if( x ) {
            // A
            return;
        }
        // B
    }

With the above, A will be executed without any jumps, while execution will have
to jump to
B to continue.  If you expect x to almost always be true and the performance of
func() is
crucial, consider writing it like the above instead of:

    void func() {
        if( !x ) {
            // B
            return;
        }
        // A
    }

However, in general I suggest favoring readability over a nanosecond or two of
performance
gain.  There are other suggestions as well, but in general the 80-20 rule
dictates that such
things should really only occur once your app is done and working, and then be
tuned from
profiler results.


Sean

Apr 12 2008

janderson <askme me.com> writes:

bobef wrote:
 Hi,
 
 I would like to request a guideline how to squeeze every last nanosecond out
of D. I am writing this scripting language - FlowerScript, and I really need to
take every spare bit... But the problem is... I am a noob. D performs really
well, of course, but lame people like me could make even D slow :) I know here
are a lot of people that are very good programmers with broad knowledge how
things work and a lot of experiene, so I would like to ask you to share your
experience - what to do when every tick matters. For example one could use
pointers for indexing arrays, to bypass the bounds checking. Yes, it is not
that pretty but gives a big boost. I am trying to experiment with things, but
sometimes it is really strange. For example I have few ifs. I rearrange them a
bit. Sometimes I move one if inside other and it becomes slower, sometimes I
simply add curly braces and it becomes faster. Or I wrap a block of code in a
try {} and it becomes faster. Why is that? If I wrap every

 line of code in try, will I make it fly ? :D
 
 Thanks,
 bobef


I imagine you understand this however I'll point it out anyway.  The 
best optimizations programmers in the world for any given language know 
that only a small amount of the code actually matters to performance. 
They always profile to see where that is.  If they tried to optimize the 
entire program they would be spending less time on the parts that really 
matter.

Bound checking can be turned off at release time so I wouldn't worry 
about that.

So having said that, beware of using functions like getchar() and read 
blocks of data at a time.  Think about high level optimizations first, 
these will normally give you the biggest bang. For instance the fastest 
code you can run is the code that is never run.

Try to minimize memory allocation by allocating in bigger blocks.  For 
instance with arrays you can do this to reserve space.

array.length = 100
array.length = 0
array now has 100 spaces reserved

Don't manually flush the GC until you have idle time.  Its a very slow 
function.


When you do find a function that really is causing trouble here are a 
few tips:

1) Reduce branching.  Branching is bad for these reasons:
   - It can cause a cache miss
   - The CPU prefetcher and program optimizer will stall


2) Avoid using linked lists.  If you do absolutely need one allocate 
each node from an array.  99% of the time you will be spending more time 
traversing the array then insertion and deletion.  Inserting/deleting 
into an array can be faster then a linked list if its at the end (and 
the memory is already reserved).  The exception is generally when the 
link list is a memory allocator and that's its only responsibility.

3) Use foreach instead of for.  The compiler can sometimes optimize 
these better.

4) Try to do more stuff at compile time using templates and compile time 
functions however don't fall victim to making the programs memory 
footprint so big it actually runs slower.

5) In your innermost loops look at the generated asm code to figure out 
what it is doing (last resort).

6) Avoid using virtual functions in your innermost loops if they cause a 
problem.  Note that if the virtual function body contains more then 10 
instructions you *may* be wasting your time removing the virtual.  The 
amount of time processing those 10 instructions is can be much larger 
then the time cost of the virtual function.  Although like branching 
they can cause a stall. The reasons
   - Compiler can't inline virtual's
   - Cachemiss is almost certain because the Vtable is located far away 
from the
   - The CPU prefetcher and program optimizer will stall
   - Extra pointer lookups.

7) Try to push things around as blocks of memory rather then individual 
pieces.

8) Profile, Profile, Profile and let use know your results.

I hope that was helpful.
-Joel

Apr 12 2008

janderson <askme me.com> writes:

Here's another one.  Plug in nedmalloc.  There's a D port somewhere...

http://www.nedprod.com/programs/portable/nedmalloc/index.html

Apr 12 2008

"Unknown W. Brackets" <unknown simplemachines.org> writes:

There's plenty to be said, and surely that will be said.  When it comes 
to performance, I'm more of a server-oriented guy and I worry more about 
concurrency and performance per request... I'm sure you'll get more 
relevant help from others here.

I just want to suggest that, when checking performance, you make sure 
you're dealing with reality.  A few things that can help with that:

1. Do any file io before your benching... for example, load a file 
entirely into memory, then bench/profile parsing it, executing it, etc.

2. When benchmarking, compile with -O -inline -release.  You don't 
really care if bounds checking, lack of inlining, etc. are slowing you 
down.  You care about reality.

3. In general, it's hard to get anywhere without concentrating on a 
specific area of code.  Find something that is slow (through profiling) 
and fix it.  Worry about the rest later.

4. If you destroy maintainability by optimization, you'll just pay 
later.  Consider how heavy the price is before you make such a move.

-[Unknown]


bobef wrote:
 Hi,
 
 I would like to request a guideline how to squeeze every last nanosecond out
of D. I am writing this scripting language - FlowerScript, and I really need to
take every spare bit... But the problem is... I am a noob. D performs really
well, of course, but lame people like me could make even D slow :) I know here
are a lot of people that are very good programmers with broad knowledge how
things work and a lot of experiene, so I would like to ask you to share your
experience - what to do when every tick matters. For example one could use
pointers for indexing arrays, to bypass the bounds checking. Yes, it is not
that pretty but gives a big boost. I am trying to experiment with things, but
sometimes it is really strange. For example I have few ifs. I rearrange them a
bit. Sometimes I move one if inside other and it becomes slower, sometimes I
simply add curly braces and it becomes faster. Or I wrap a block of code in a
try {} and it becomes faster. Why is that? If I wrap every

 line of code in try, will I make it fly ? :D
 
 Thanks,
 bobef

Apr 12 2008

downs <default_357-line yahoo.de> writes:

I can only think of a few, sadly.

Use ~ as little as possible.
Try to determine the whole size beforehand, _then_ allocate an array, _then_
copy into it. Remember: Allocations are slow. Shit slow.

If you're creating multithreaded code that needs synchronized access, try to
see if you can use TLS instead of synchronization.

Use the final keyword :)

Try both GDC and DMD. There might be significant performance differences.

 --downs

Apr 12 2008

Walter Bright <newshound1 digitalmars.com> writes:

bobef wrote:
 I would like to request a guideline how to squeeze every last nanosecond out
of D.

The first thing I'd suggest is get very comfortable with D's built-in 
profiler. It'll be invaluable in helping focus your energies on exactly 
where it matters.

Apr 12 2008

Robert Fraser <fraserofthenight gmail.com> writes:

bobef wrote:
 Hi,
 
 [...]
 
 Thanks,
 bobef

I'm no expert but (as others have pointed out):

- Algorithmic efficiency is more important than saving cycles.
- Memory/cache efficiency is more important than saving cycles
- Try using memory-efficient data structures like Judy (
   http://judy.sourceforge.net/ ).
- Compile with both DMD and GDC
- Reduce heap allocations as much as possible. For example, when
   possible, use scope for your classes (since this means they will be
   allocated on the stack).

So, basically, what everyone else said.

Apr 12 2008

Robert Fraser <fraserofthenight gmail.com> writes:

Robert Fraser wrote:
 - Try using memory-efficient data structures like Judy (
   http://judy.sourceforge.net/ ).

Or maybe not:

http://www.nothings.org/computer/judy/

Apr 12 2008

Georg Wrede <georg nospam.org> writes:

Robert Fraser wrote:
 Robert Fraser wrote:
 
 - Try using memory-efficient data structures like Judy (
   http://judy.sourceforge.net/ ).

 
 
 Or maybe not:
 
 http://www.nothings.org/computer/judy/

Well, the guy certainly saw a lot of effort trying to prove that Judy 
isn't cool. I'm not impressed.

Especially the consistent performance of Judy_seq and Judy_64 was quite 
amazing. If one wants predictable and scalable performance, that 
additionally is excellent, then that is my choice. I assume that, if one 
knows the data is in order, then Judy is good, and for unordered data 
(i.e. that which really needs hashing) a hash is better. (Gee, surprise.)

So, it shows two things: Firstly, if you have a choice, then benchmark, 
benchmark, and benchmark. Don't just read ads or believe the buzz on the 
street. Second, every job has its tool.

IIUC, sparse arrays would be a dream application for Judy, especially if 
they can be filled in order. Or partial order.

Apr 13 2008

Robert Fraser <fraserofthenight gmail.com> writes:

Georg Wrede wrote:
 Robert Fraser wrote:
 Robert Fraser wrote:

 - Try using memory-efficient data structures like Judy (
   http://judy.sourceforge.net/ ).


 Or maybe not:

 http://www.nothings.org/computer/judy/

 
 Well, the guy certainly saw a lot of effort trying to prove that Judy 
 isn't cool. I'm not impressed.
 
 Especially the consistent performance of Judy_seq and Judy_64 was quite 
 amazing. If one wants predictable and scalable performance, that 
 additionally is excellent, then that is my choice. I assume that, if one 
 knows the data is in order, then Judy is good, and for unordered data 
 (i.e. that which really needs hashing) a hash is better. (Gee, surprise.)
 
 So, it shows two things: Firstly, if you have a choice, then benchmark, 
 benchmark, and benchmark. Don't just read ads or believe the buzz on the 
 street. Second, every job has its tool.
 
 IIUC, sparse arrays would be a dream application for Judy, especially if 
 they can be filled in order. Or partial order.

While I often do sequential inserts into an associative array, I rarely 
do sequential reads (Unless I'm enumerating through all the values). So, 
I guess every structure has its place, and Judy is much more 
memory-efficient. But that benchmark tells me "in common usage (where 
data is unordered), Judy is less efficient than a hash table".

Apr 13 2008

"Craig Black" <craigblack2 cox.net> writes:

In case it wasn't already mentioned, use Tango instead of Phobos.  It has a 
faster GC.

-Craig

Apr 12 2008

downs <default_357-line yahoo.de> writes:

Craig Black wrote:
 In case it wasn't already mentioned, use Tango instead of Phobos.  It
 has a faster GC.
 
 -Craig

If your supposedly speed-efficient program is GC-bound, you have bigger
problems than Tango or Phobos. :p

 --just downs' 5�

Apr 13 2008

Georg Wrede <georg nospam.org> writes:

downs wrote:
 Craig Black wrote:
 
 In case it wasn't already mentioned, use Tango instead of Phobos.
 It has a faster GC.

 
 If your supposedly speed-efficient program is GC-bound, you have
 bigger problems than Tango or Phobos. :p

Seriously, there's a sharp and definitive point to that.

 From what I hear (since I still haven't tested Tango, my appologies to 
the Tango Team), Tango is an excellent library, and faster than Phobos 
in most things.

But it really is true: if your program uses new (and 'abandon') enough 
for the GC to become a major decider on library choice -- then you 
really should halt all work and step back. You literally have something 
else to worry about than the choice between libraries.

That choice can bring you (what? My guess is) maybe a 10-20% speedup in 
the *GC* things, which probably is only 2-5% of total execution time 
(since your program probably does a lot more than just wait for the GC 
to finish)!

So, staying with whatever was your library (Tango or Phobos), and 
spending the time on figuring out why your program thinks it's got a 
"frequent flyer plan" with the GC, would be time much better spent.

Speedups of 2x to 10x should not be impossible *in such a* situation. 
(For those who've suffered the contemporary school system, that's like 
100% to 900% better.)

Apr 13 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Craig Black wrote:
 In case it wasn't already mentioned, use Tango instead of Phobos.  It 
 has a faster GC.
 
 -Craig

If you really need the speed boost (I also say: profile first), then
consider using manual memory management. It may be more difficult, but
no pain, no gain.

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Apr 25 2008

D Programming

C/C++ Programming

Other

digitalmars.D - D performance guideline (request)