Overlooked Essentials For Optimizing Code

September 10, 2010

I’ve been programming for 35 years now, and I’ve done a lot of work optimizing programs for speed (an example), and watching others optimize. Two essential techniques are consistently ignored.

Nope, it isn’t avoiding premature optimization. It isn’t replacing bubble sort with quicksort (i.e. algorithmic improvements). It’s not which language is used, nor is it how good the compiler is. It isn’t writing i<<2 instead of i*4.

It is:

Using a profiler
Looking at the assembly code being executed

The people who do these are successful in writing fast code; the ones who do not are not. Let me explain.

Using A Profiler

The old programming saw is that a program spends 90% of its time in 10% of the code. I’ve found that to not be true. Over and over, I’ve found that programs spend 99% of their time in 1% of the code. But which 1%? A profiler will tell you. Spending 100 hours of dev time on that 1% will yield real benefits, while 100 hours on the other 99% will not produce much of anything worthwhile.

What’s the problem? Don’t people use profilers? Nope. One place I worked at had a fancy expensive profiler that was still in its shrink wrap 3 years after purchase. Why don’t people use profilers? I don’t really know. I once got into a heated exchange with a colleague who insisted he knew where the bottlenecks were; after all, he was an experienced professional. I finally ran the profiler myself on his project, and of course the bottleneck was in a completely unexpected place.

Consider auto racing. The team that wins has sensors and logging on just about everything they can stick a sensor on. You can race using seat-of-the-pants tuning and have a jolly good time on the track, but you won’t win and you won’t even be competitive. You won’t know if your poor speeds are caused by the engine, the exhaust, the aerodynamics, the tire pressure, or the driver. Why should programming be any different? You can’t improve what you can’t measure.

There are lots of profilers available. You can get ones that look at the hierarchy of function calls, function times, times broken down for each statement, and even at the instruction level. I’ve seen too many programmers eschew profilers, preferring instead to whittle away their time with useless and misdirected “optimizations” and getting trounced by their competitors.
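As a rough illustration of measuring rather than guessing, here is a minimal sketch that times two functions by hand with the standard C clock() call. The function names and workloads are invented for this example. A real profiler (for instance, building with gcc -pg and reading the report with gprof) gives far more detail, so treat this only as a crude stand-in.

```c
#include <time.h>

/* Hypothetical workload that looks like it might matter but rarely does. */
unsigned long cheap_setup(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Hypothetical inner loop; the modulo makes each iteration costlier. */
unsigned long hot_loop(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 1; i <= n; i++)
        acc += (i * i) % 7;
    return acc;
}

/* Runs fn(n) once, stores its result in *out (so the call is not
   discarded as dead code), and returns the elapsed CPU time in seconds. */
double time_call(unsigned long (*fn)(unsigned long),
                 unsigned long n, unsigned long *out)
{
    clock_t start = clock();
    *out = fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_call(cheap_setup, n, &r) and time_call(hot_loop, n, &r) with realistic sizes and printing the two results shows at a glance which function deserves attention. The numbers depend entirely on the machine, the compiler, and the flags, which is the point: they have to be measured.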

Looking At The Assembly Code

Years ago, I had a colleague, Mary Bailey, who taught remedial algebra at the University of Washington. She told me once that when she wrote on the board:

x + 3 = 5

and asked her students to “solve for x”, they couldn’t answer. But, if she wrote:

__ + 3 = 5

and asked the students to “fill in the blank”, all of them could do it. It seems that the magic word “x” caused them to reflexively think “x means algebra, I don’t understand algebra, I can’t do this.”

Assembler is the algebra of the programming world. If someone asks me “was my function inlined by the compiler” or “if I write i*4, will the compiler optimize it to a left shift” I’ll suggest they look at the asm output of the compiler. The reaction is: how rude and unhelpful could I be? The person will follow up by saying he doesn’t know assembler. Even C++ experts will say this.

Assembler is the simplest language (especially compared with C++!). For example,

ADD ESI,x

is (expressed in C style):

ESI += x;

and:

CALL foo

is:

foo();

Details vary among CPUs, but that’s how it works. It’s not even really necessary to know that. Just looking at the assembler output and comparing it to the source code will tell a lot.
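To make the two questions above concrete, here is a small sketch to feed to a compiler. The function names are made up for illustration. With gcc or clang, compiling with -O2 -S writes the generated assembler to a .s file instead of producing an object file; other compilers have their own switch for this.

```c
/* Two spellings of "multiply by 4". Compare the instructions the
   compiler emits for each; do not assume they differ. */
unsigned times4_mul(unsigned i)
{
    return i * 4;
}

unsigned times4_shift(unsigned i)
{
    return i << 2;
}

/* A tiny helper and one call site. If the assembler for caller()
   contains no CALL to add3, the helper was inlined. */
static unsigned add3(unsigned x)
{
    return x + 3;
}

unsigned caller(unsigned x)
{
    return add3(x) * 2;
}
```

Whether the two times4 functions come out identical, and whether add3 is inlined, depends on the compiler and the optimization level. Reading the output settles it in a minute, with no need to understand every instruction.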

How does this help optimization? For example, I knew a programmer years ago who thought he’d discovered a new, faster algorithm to do X. I’m being deliberately vague to protect him. He had the benchmarks to prove it, and wrote a nice article about it. But then someone looked at the assembler output of the regular way, and his new fast way. It turns out that the way he’d written his improved version had allowed the compiler to replace two DIV instructions with one. This had really nothing to do with his algorithm. But DIV is an expensive instruction, and this was in the inner loop, and so his algorithm appeared to be faster. The regular implementation could also be recoded slightly to use only one DIV, and it would then perform just as fast as the new algorithm. He had discovered nothing.
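The following is not that programmer’s code (which stays deliberately unnamed), only a hypothetical sketch of how the same inner loop can be written so that it asks for two divisions or for one. On x86 a single DIV yields both quotient and remainder, and many optimizing compilers merge the first form into one DIV on their own; whether a particular compiler does so is exactly what the assembler output reveals.

```c
/* Sums the digits of n in the given base. Assumes base >= 2.
   Written naively: one % and one / per iteration, which a compiler
   may or may not combine into a single divide instruction. */
unsigned digit_sum_two_ops(unsigned n, unsigned base)
{
    unsigned sum = 0;
    while (n != 0) {
        sum += n % base;
        n = n / base;
    }
    return sum;
}

/* Same result, recoded so only one division appears in the source.
   The remainder is recovered with a multiply and a subtract. */
unsigned digit_sum_one_div(unsigned n, unsigned base)
{
    unsigned sum = 0;
    while (n != 0) {
        unsigned q = n / base;   /* the only division */
        sum += n - q * base;     /* equals n % base */
        n = q;
    }
    return sum;
}
```

Both functions compute the same thing with the same algorithm. Any speed difference between them comes from instruction selection, not from a better idea.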

For my next example, a D user posted a benchmark showing that dmd (Digital Mars D compiler) was lousy at integer arithmetic, while ldc (LLVM D compiler) was much better. Being very concerned about such a result, I promptly looked at the assembler output. It was pretty much equivalent; nothing stood out as being accountable for a 2:1 difference. But there was a long divide in there, done with a call to a runtime library function. That function call completely dominated the timing results; all the adds and subtracts in the benchmark had no significant impact on the speed. Unexpectedly, the benchmark wasn’t about arithmetic code generation at all; it was about long division only. It turns out that dmd’s runtime library function had a crummy implementation of long division in it. Fixing that brought the speed up to par. The code generation wasn’t at fault at all, but this was not discoverable without looking at the assembler.
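The original benchmark was in D and is not reproduced here. As a hedged analogue in C, the sketch below shows how one 64-bit divide can hide among cheap operations. The function names and the arithmetic are invented for illustration. On a 32-bit x86 target, gcc typically compiles the divide into a call to a runtime helper (such as __udivdi3 in libgcc) rather than an inline instruction, and that call is easy to spot in the assembler output.

```c
#include <stdint.h>

/* Only adds, subtracts, and shifts: what the benchmark appears to test. */
uint64_t mix_arith_only(uint64_t seed, unsigned iterations)
{
    uint64_t acc = seed;
    for (unsigned i = 0; i < iterations; i++) {
        acc += i;
        acc -= acc >> 3;
    }
    return acc;
}

/* The same loop plus one 64-bit divide per iteration. Assumes
   divisor != 0. On targets without a native 64-bit divide, this
   line usually becomes a call into the runtime library. */
uint64_t mix_with_long_divide(uint64_t seed, uint64_t divisor,
                              unsigned iterations)
{
    uint64_t acc = seed;
    for (unsigned i = 0; i < iterations; i++) {
        acc += i;
        acc -= acc >> 3;
        acc = acc / divisor + seed;
    }
    return acc;
}
```

Timing both functions and then looking for a CALL in the second one’s inner loop is a quick way to tell whether a benchmark measures generated code or a library routine.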

Looking at the assembler often gives unexpected insight into why a program performs as it does. Unexpected function calls, unanticipated bloat, things that shouldn’t be there, etc., all are exposed when looking at it. It isn’t necessary to be an assembler crackerjack to be able to pick that up.

Conclusion

If you feel the need for speed, the way to get it is to use a profiler and be willing to examine the assembler for the bottlenecks. Only then is it time to think about better algorithms, faster languages, etc.

Conventional wisdom has it that choosing the best algorithm trumps any micro-optimizations. Though that is undeniably true, there are two caveats that don’t get taught in schools. First and most importantly, choosing the best algorithm for a part of the program that contributes nothing to the performance profile has a negative effect on optimization, because it wastes time that could be better invested in making actual progress and diverts attention from the parts that matter. Second, an algorithm’s performance always varies with the statistics of the data it operates on. Even bubble sort, the butt of all jokes, is still the best on almost-sorted data that has only a few unordered items. So worrying about using good algorithms without measuring where they matter is a waste of time, both yours and the computer’s.

Just as ordering speed parts from an auto racing catalog isn’t going to put you anywhere near the winner’s circle (even if you get them installed right), without a profiler you won’t know where the problems are. Without looking at the assembler, you may know where the problem is, but often won’t know why.

Acknowledgements

Thanks to Bartosz Milewski, David Held, and Andrei Alexandrescu for their helpful comments on a draft of this.
