digitalmars.D - Freedom, D and Chapel language

bearophile (250/251) Apr 06 2009 Despite D being a system language, I use it mostly for numerical process...
bearophile <bearophileHUGS lycos.com> writes:
Despite D being a system language, I use it mostly for numerical processing,
data processing, and so on, where Python is too much slow. I also appreciate D
because it gives me _freedom_ to use memory almost as I want (expecially from
the C heap, because the GC heap adds some constraints, because it allocates
memory as powers of 2), so I can implement data structures like trees and other
things based on enums, and tagged pointers, that are impossible to do
efficiently in Python. Such data structures are necessary when you want to
write certain hi-performance code that works on gigabytes of data.

By _freedom_ I mean the freedom to use, in a not too much complex way, most of
the hardware of the PC. I have a powerful (but cheap enough) GPU, a 64-bit
2-core CPU with SSE3 vector instructions, and I essentially have still no
language to use all this powerful hardware. I want a language that runs my code
on both cores and the GPU at the same time. And I don't want to use weeks of my
time to write such code. It's not good to have lot of hardware and then have
languages that allow you to use just 15% of it.

D may become a language suitable for numeric processing too, and seeing how
well it manages floating point values (probably D2 is now one of the best
languages around in such regard) it may even become one of its main
purposes/usages. To do this I think D has to develop much better management of
parallelism (data parallelism and otherwise) & array processing.

There's no single best way to do parallel processing, you need several ones
according to the problem and even to solve different part of a problem. Chapel
supports several ways, some of them can be mapped on different cores and other
ones on the vector parallelism of SSE4/AVX, and Larrabee.

The Chapel language looks like being potentially able to give some of such
freedom. To me it looks like potentially one very useful languages.

In the past I have said something about Chapel, but this time I have taken a
much better look. I may even try using this language a bit.

Main site:
http://chapel.cs.washington.edu/

The following notes come mostly from the documents in this gz here:
http://chapel.cs.washington.edu/SC08/slides/SC08-S07-Chapel-Tutorial.tar.gz

Take a look in that document pack if you want more info, especially regarding
the "task parallelism" and the "locale type".

Below I don't talk about things that I think are both badly designed AND not
interesting.

------------------

Amogn its types there are real, imag, complex. It's the same current design of
the D language (this suggests to not change this current design of D).


To define types of a different type you use:
int(64), int(8), real(32)
That are equal to D:
long, byte, float

If you don't add the () it uses the default size.

------------------

It has "configuration constants/variables". In a program you can write:
config const n = 10;
config const epsilon = 0.01;
config const verbose = false;

They are like a standard declaration of constants or variables, but they
supports command-line overrides. They must be declared at global scope:
 ./a.out --n=10000 --epsilon=0.0000001 --verbose=true

------------------


- reals do not implicitly convert to ints as in C
- ints and uints don't interconvert as handily as in C (uints can be implicitly
converted only to larger ints/uints).

------------------

Among the operators:
**   power
<=>  swap (I think this is overkill, a library function template may be enough)

------------------

Among its types there are also:
  Ranges: regular integer sequences
  Domains: index sets (explained below)


Ranges examples, they are closed on the right:
1..6              // 1, 2, 3, 4, 5, 6
1..6 by -1        // 6, 5, 4, 3, 2, 1
6..1              // an empty sequence
1..6 by 2         // 1, 3, 5
1..6 by -4        // 6, 2
1..               // 1, 2, 3, 4, 5, 6, 7, 8, ...

I think Python slice syntax is much better:
start:stops:stride








I think it's better for intervals to be by default open on the right, ad in
Python and D.




------------------

Arrays:

var A: [1..3] int,      // a 3-element array of ints
B: [1..3, 1..5] string, // a 2D 3x5 array of strings
C: [1..3] [1..5] real;  // a 3-element array of 5-element arrays of reals

------------------

For loops can be done in several ways, if you use a range you can't change the
loop variable:

for i in 1..3 do i += 1; // illegal, ranges yield consts

------------------

Loops aren't much different from Python ones, but you can use [] instead of ()
to activate a "zipping". Here I like Python sintax more, that uses an explicit
"zip" that is more readable.

for i in [0..1, 0..1] ...  // i = (0,0); (0,1); (1,0); (1,1)
for i in (0..1, 0..1) ...      // i = (0,0); (1,1)
for (x,y) in (0..1, 0..1) ...  // x=0, y=0;  x=1, y=1

------------------

It has Named Argument Passing. Arguments may be matched by name rather than
position:

def foo(x: int = 2, y:int = x) { ... }
foo();          // equivalent to foo(2, 2);
foo(3);         // equivalent to foo(3, 3);
foo(y=3);       // equivalent to foo(2, 3);
foo(y=3, x=2);  // equivalent to foo(2, 3);
foo(y=3, 2);    // equivalent to foo(2, 3);

I am waiting for this in D.

------------------

It has built-in Iterators:

def RMO() {
  for i in 0..#m do
    for j in 0..#n do
      yield (i, j);

can be used to drive for loops:

for (i,j) in RMO() do
  use (i, j)...

I do this all the time in Python. This is one of those things that requires 5
minutes to be understood and you then use it all the time. So it reduces the
complexity of programming.

------------------

Classes/records support a standard method named these(), similar to the
opApply() of D, but its syntax is much nicer, because you just need to use
"yield" into it to yield items.

If you also add "var" the these() method:

class Pair {
  var x, y: real;

  def these() var {
    yield x;
    yield y;
  }
}

Then you can even modify values as you iterate...

for x in p {
  writeln(x);
  x = -x;
}
writeln(p.x);
writeln(p.y);

It's like in D where you can modify x inside the loop:

foreach (ref x; somearray) x = -x;

I don't remember if this is possible in D too, but this syntax is surely much
better, and I don't need to look in docs to use every time as I do in D.

------------------

Chapel has refined built-in ways to do both use task parallelism and data
parallelism. See document 3.

I have tried to write a summary here of the task parallelism features of
Chapel, but at the end I have scrapped it, because I don't know enough about
this. It uses the statements: begin single sync cobegin coforall serial, to do
structured, unstructured task parallelism, and to perform parallel loops.

------------------

The section about Data Parallelism is even bigger, probably it's the strongest
point of Chapel.

A "domain" is a first-class index set. There three main classes of them:
- arithmetic: indices are Cartesian tuples. (rectilinear, multidimensional,
optionally strided and/or sparse)
- associative: indices serve as hash keys (supports hash tables, dictionaries)
- opaque: indices are anonymous (supports sets, graph-based computations)

An example of arithmetic (2D) domain:
var m = 4, n = 8;
var D: domain(2) = [1..m, 1..n];

Subdomains, a part of the matrix domain:
var m = 4, n = 8;
var D: domain(2) = [1..m, 1..n];
var InnerD: subdomain(D) = [2..m-1, 2..n-1];

It's like a generalizazion of the type ranges of Pascal.
You can use them for example to define the type of a matrix.
And too loop on the indexes:

forall (i, j) in InnerD do
    A(i, j) = i + j / 10.0;

Forall loops also supports other forms:

To allocate memory of a matrix and fill it (in parallel too, see below):
var A: [D] real = forall (i,j) in D do i + j/10.0;

A short syntax to fill a matrix:
[(i,j) in D] A(i,j) =  i + j/10.0;

To allocate it and fill it:
var A: [(i,j) in D] real = i + j/10.0;

------------------

Several types of for loops:

"for" loops: one task executes all iterations.

"forall" loops: some number of tasks executes the iterations.

"coforall" loops: one task per iteration.

-----------------

Domains can be sliced, and array domands copied too:

A[InnerD] = B[InnerD];

-----------------

As I've said Domain indices can be more than just dense integer tuples like
this:
var DnsDom: domain(2) = [1..10, 0..24],

strided:
StrDom: subdomain(DnsDom) = DnsDom by (2,4),

sparse:
SpsDom: subdomain(DnsDom) = genIndices();

associative:
NameDom: domain(string) = readNames();

Or even graphs (graphs are a built-in data type! This is higher level than
Python):
var GrphDom: domain(opaque)


Then you can declare arrays in a very uniform way:

var DnsArr: [DnsDom] complex,
    StrArr: [StrDom] real(32),
    SpsArr: [SpsDom] real;

var GrphArr: [GrphDom] string,
    NameArr: [NameDom] int(64);

All domain types support iteration:
forall ij in StrDom {
  DnsArr(ij) += SpsArr(ij);
}

All domain types support array slicing too:
DnsArr[StrDom] += SpsArr[StrDom];

All domain types support array reallocation too (to clear part of an already
existing array):
StrDom = DnsDom by (2,2);
SpsDom += genEquator();

The syntax here is natural enough, I have no problem undestanding such small
programs despite it's the first time I see them. This is quite positive, it
means that the design is good.

Bounds check are often not necessary because the compiler knows that certain
indexes can't go outside certain ranges. This means that you have safety and
speed.

Opaque Domains and Arrays can be used to create graphs, trees, sets in a nice
way. This is new for me and I don't fully understand it at first sight
(document 4, page 12).

-----------------

Promotion:

Functions/operators expecting scalar values can also take:

Arrays, causing each element to be passed in:
A = sin(B);
B = 2 * A;

Or domains, causing each index to be passed in:
def foo(x: (int, int)) { ... }
foo(SpsDom); // calls foo once per index in SpsDom

When multiple arguments are promoted, calls may use (this uses forall
internally, so it's partially parallel):
zippered promotion:
X = pow(A, B); // X is 2D; X(i,j) = pow(A(i,j), B(i,j))
Or tensor promotion:
Y = pow[A, B]; // Y is 2x2D;
               // Y(i,j)(k,l) = pow(A(i,j), B(k,l))

-----------------

Reduce is a built-in:
tot = + reduce A;  // tot is the sum of all elements in A
big = max reduce [i in InnerD] abs(A(i) + B(i)); // finds the max

In practice reduce isn't that useful, once you have sum(), max(), min(),
product() and few other functions that accept an iterable too. Understanding
reduce isn't always easy, that's why it's not in my dlibs and they have removed
it from Python3.

-----------------

It also has "scan" as built in, this is less common, I have seen something like
this only in Mathematica:

var A, B, C: [1..5] real;
A = 1.1;                // A is: 1.1 1.1 1.1 1.1 1.1
B = + scan A;           // B is: 1.1 2.2 3.3 4.4 5.5
B(3) = -B(3);           // B is: 1.1 2.2 -3.3 4.4 5.5
C = min scan B;         // C is: 1.1 1.1 -3.3 -3.3 -3.3

I don't know how often I can use this scan.

-----------------

A Jacobi Iteration in Chapel, it's very readable:

config const n = 6,
             epsilon = 1.0e-5;

const BigD: domain(2) = [0..n+1, 0..n+1],
      D: subdomain(BigD) = [1..n, 1..n],
      LastRow: subdomain(BigD) = D.exterior(1,0);

var A, Temp : [BigD] real;

A[LastRow] = 1.0;

do {
    [(i,j) in D] Temp(i,j) = (A(i-1,j) + A(i+1,j) +
                              A(i,j-1) + A(i,j+1)) / 4;
    const delta = max reduce abs(A[D] - Temp[D]);
    A[D] = Temp[D];
} while (delta > epsilon);

writeln(A);


It uses a Temp matrix because [(i,j) in D] is partially parallel, so the end
result is a higher performance. This code can be mapped to the GPU very well.

I have seen code in Fortress language, and Chapel looks much more readable,
more powerful and much simpler to use. So simple that I think in two hours I
have already understood a big percentage of the language usage.

exterior() is one of the built-in methods of arithmetic domains, it takes a
row, this time at the bottom.

if you change the third line as:

const BigD: domain(2) = [0..n+1, 0..n+1] dist Block,

Then all the domanins and all the matrixes (because all of them in this program
have a BigD domain) become sliced in 2D tiles, that are better managed by the
CPU cache. This is nice :-) Such tiles can also be managed by different CPU
cores, so they can also be a way to divide the work.

-----------------

Even more complex are Hierarchical Array Declarations:

config const n = 1024,
             numLevels = lg2(n);

const Levels = [0..#numLevels];
const ProblemSpace = [1..n, 1..n, 1..n];

var V: [ProblemSpace] real;
const HierSpace: [lvl in Levels] subdomain(ProblemSpace)
           = ProblemSpace by -2**lvl;
var U, R: [lvl in Levels] [HierSpace(lvl)] real;

This allows to perform hierarchical processing, an operation that today is very
common for example in image processing. It also allows to perform loops that
have complex mixes of such hierarchical processing. This is surely better than
doing such things manually. But I think this may be a bit overkill for D. This
is good for a purely number crunching language as Chapel, Fortress, etc.

-----------------

In document 5 they talk about the locale type, that allows to control the scope
of both data and the processing, to split the Domain processing in chunks given
to diffrent CPUs. It's not easy to understand stuff, I have never seen
something like that.

-----------------

Finally, this is an example of a complex usage of 3D stencils in Chapel:

def rprj3(S, R) {
    const Stencil = [-1..1, -1..1, -1..1],
          w: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
          w3d = [(i,j,k) in Stencil] w((i!=0) + (j!=0) + (k!=0));

    forall ijk in S.domain do
        S(ijk) = + reduce [offset in Stencil]
                            (w3d(offset) * R(ijk + offset*R.stride));
}

Its translation in (parallel code) Fortran requires hundreds of lines of code,
and they say the performance is not too much diffrent.
I don't understand fully that code, but it looks clean and after reading the
docs it seems readable.

-----------------

Overall I see several design mistakes (I have not included most of them among
my comments here, I have just skipped them), and I think in several points D is
better designed. Still, there is a surprisingly large number of things that are
both powerful, easy to understand, and have a nice syntax, and I'd like to see
some of this (the simpler one) stuff in D. But they look hard to implement :-)

Bye,
bearophile
Apr 06 2009
D Programming

C/C++ Programming

Other

digitalmars.D - Freedom, D and Chapel language