www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Numerical code

I have recently written some "numerical code", some floating point scientific
code I don't usually need. I have seen that this kind of code often contains
kernels, that is small chunks of code inside the inner loops, that good
compilers translate in even just 8-15 asm instructions.

There are situations where even just two or three asm instructions added or
removed from such kernels can determine the difference from a slow code and an
acceptably fast program. In other programs the kernel is a little longer but
good register spills become essential (FFT). So the programmer, the language,
and the optimization stages of the compiler must do all they can to pull away
as much as possible from such inner loops. Even a moderate form of
supercompilation can be appealing. Then, what's inside the other parts of the
(smallish) program is often not important. Some times it can even be written in
Python with only a moderate difference in the total running time. (This is why
they have created a Python package named CorePy, that is a kind of dynamic
assembler that gives a good help in the creation of computational kernels
written in good modern asm).

[Such kernels are often inside other outer loops. Usually such code is at least
O(n ln n) and often O(n^2). Today I see more and more algorithms almost O(n),
used online and in real time on huge stream of real time data, used for example
often by Google. Such algorithms can have a kernel, but on them plain I/O
effects play an important role.]

Beside the contents of such inner loops, there are two more things the
programmer & compiler have to take into account (and they will need to take
them even more in account in the close future), that is the effects of the
secondary CPU caches and the localization of data in the memory of the single
CPUs (Both such problems are taken in account quite well by the Chapel
language. Their designers have predicted well the future hardware).

1) The cache problems are not hard to explain: the kernels of the inner loops
must use CPU registers very well (seeing them like yet another tiny level of
cache) and the L1 CPU cache. (One way to do this is to use some form of memory
tiling, that is the algorithm designer has to find a way to process a limited
chunk of the input data, and then step to the next chunk. Often this designing
needs to be done even when there are cache-oblivious algorithms). But around
them there are one or two or more outer loops, so if the programmer wants an
efficient program then such outer loops must take into account the L2 and L3
caches too. So the outer loops too need some form of "higher level" tiling. In
practice often numerical programs need to be organized in a pyramidal way
(recently invented image processing algorithms are often designed this way).
Creating such code that has two or three levels of tiling is possible, but it
requires time and it's error-prone (this is why cache-oblivious algorithms are
more appreciated, but they often aren't enough or are not available). So here
you want some help from the programming language, to help in the partially
automatic definition of such hierarchically tiled code. Chapel language helps
in this.

2) I have read that when CPUs have more than about 16 cores (and AMD has
announced a 12 core CPU), the current structure of a single shared RAM doesn't
seem to work well any more. What's needed seems to be Core-local memory, plus
communications across cores (and no shared RAM). In this situation the code
must be quite local. The thread-local default of D is a step in the right
direction, but more abstractions can become necessary. Chapel already gives
means to write this kind of code with its Distributions:

I have found an update of the Chapel tutorial:

(In a future post I'd like to describe better the Chapel Domains, because they
seem an abstraction quite easy to understand and useful for more than just
numerical code.)

Apr 29 2010