www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Optionally strongly typed array indexes

reply "bearophile" <bearophileHUGS lycos.com> writes:
This language feature is absent in D, and it's present in Ada and 
partially present in ObjectPascal. I think it's significant.

All the following ideas are preliminary (uncooked, primitive), 
they are just ideas (and quite probably there are smart people 
that can invent better things).

This is an example of wrong code, it allocates and initializes a 
matrix with three rows and two columns:

void main() {
     import std.stdio: writeln;
     auto mat = new int[][](3, 2);
     writeln(mat);

     foreach (immutable i; 0 .. 2)
         foreach (immutable j; 0 .. 3)
             mat[i][j] = i * 10 + j; // Wrong.
     writeln(mat);
}


The out of bound array access bug is found at run-time with this 
error followed by a stack trace:

core.exception.RangeError test.d(8): Range violation


The Ada language allows you to spot that bug at compile time.

Note: in D you usually avoid that bug writing the code like this. 
But this is not enough in more complex situations:

     foreach (immutable i, row; mat)
         foreach (immutable j, ref m; row)
             m = i * 10 + j;


If you give strong types to the array indexes, the code becomes 
more self-documenting, and the compiler can catch some more of 
your mistakes. In D the associative array type syntax already 
gives a type to the indexes, so I have to tell apart the the case 
of associative array from the case of normal dynamic/fixed-size 
array with a strongly typed index. This will make the syntax 
worse.

This is a first possible syntax (it looks ugly):


void main() {
     auto M = new int[TR =  typed][TC =  typed](3, 2);

     foreach (immutable TR i; 0 .. 2)
         foreach (immutable TC j; 0 .. 3)
             M[i][j] = i * 10 + j; // Wrong.
}



 typed used inside the [] means that array has strongly typed 
(size_t) index. "TR =  typed" means that TR is the aliased name 
of such type (so you can think of it as "[alias TR =  typed]").

Now such program gives two compile time errors (type mismatch on 
the j and i indexes).


Other examples of allocation of 1D arrays with strongly typed 
indexes:

auto v1 = new int[TV =  typed](10);
int[TV =  typed 10] v2; // fixed-size.


An usage example in a simple function that performs a 2D matrix 
transposition:


T[][] transpose(T)(in T[TC =  typed][TR =  typed] m)
pure nothrow  safe {
     auto r = new T[ typed(TR)][ typed(TC)](m[0].length, m.length);
     foreach (immutable nr, const row; m)
         foreach (immutable nc, immutable c; row)
             r[nc][nr] = c;
     return r;
}


" typed(TR)" means that for this array I am using an already 
defined index type named TR.

In theory you can also infer the type of the index with the 
template, but I am not sure how much useful this is:

void foo(TI)(int[ typed(TI)] data) {}


In that transpose function you can also see that you can assign a 
matrix with typed indexes to one without index types:

int[TI =  typed 10] a;
int[10] b = a; // OK.
a = b;         // OK.

This transpose returns a matrix with untyped indexes to simplify 
a little the use of the resulting matrix. But if you iterate with 
a foreach on such result, the index types are inferred, so it's 
not a big problem.


Probably there is also some need for a trait to get the type of 
the index of an array (it could return size_t if it's untyped):

static assert(is(__trait(index_type, a) == TI));


(In Ada you can also get the range of an array, so such interval 
types also keep some other compile-time information. But I think 
this is not essential for D, so I have not included this 
information.)

--------------------------

I have used both dynamic languages (like Python) and strongly 
typed languages (like D, Haskell). I have seen that both have 
some advantages and disadvantages. In the Haskell world lot of 
people use types to guide their coding, but when I am not using 
Haskell I prefer dynamic typing in small programs or when I 
devise a complex algorithm, and strong static typing in larger 
programs or when I already have the basic code written and I want 
to be more sure of its correctness. So I like a language like D 
that gives me the freedom to use more precise types or less 
precise types according to my current needs. Optional strong 
types for arrays are meant for situations where I want to be more 
sure of the code correctness, or in larger programs, or when the 
complexity of a data structure or the intricacy of a piece of 
code require me to put down lot of precise types to avoid losing 
control of what I am doing.

Strong index types allow you to avoid mixing by mistake index 
variables when you have more than one array, or when you have 2D 
or 3D matrices and you need to not mix rows with columns. In my D 
programming I sometimes mix the indexes, and when I am lucky I 
find the bug at runtime (but if your 2D matrix is a square it's 
less immediate to spot the bug at run-time).

Bye,
bearophile
Jun 03 2014
parent reply "Mason McGill" <mmcgill caltech.edu> writes:
Interesting idea. I run into these sorts of bugs often doing 
multimedia processing. There's definitely a tension between 
writing generic code (NumPy-style, operating on abstract 
dimensions) and safe code (operating on dimensions with 
semantics; OpenCV supports this to an extent). I've never used 
Ada so I may be missing the point, but here are my impressions:

It seems to me that the users who may benefit most from this 
feature are likely to be using non-built-in data structures. For 
example, you mention interpreting the first level of array 
nesting as rows and the second as columns, implying you are 
treating the nested arrays as a matrix (rather than a list of 
lists, boxes of kittens, etc.). I'm inclined to believe use-cases 
for strongly-typed indices would often overlap with use-cases for 
strongly-typed containers. If that is true, then the language 
already provides decent support for annotating indices with 
"units":

   /* Deep in the belly of some image-processing library... */
   alias Row = Dimension!"row";
   alias Column = Dimension!"column";
   alias Channel = Dimension!"channel";
   alias Image = DenseGrid!(ubyte, "row column channel");

   /* In user code... */
   Image image = drawPrettyPicture();
   ubyte r = image[Row(0), Column(0), Channel(0)];
   ubyte g = image[Row(0), Column(0), Channel(1)];
   ubyte b = image[Row(0), Column(0), Channel(2)];

   /* Or if you want to get really fancy... */
   enum red = Channel(0), green = Channel(1), blue = Channel(2);
   ubyte r = image[Row(0), Column(0), red];
   ubyte g = image[Row(0), Column(0), green];
   ubyte b = image[Row(0), Column(0), blue];

   /* Depending on how DenseGrid.opIndex is defined,
      these statements could fail to compile... */
   image[Column(0), Row(0), red];
   image[blue, Row(0), Column(0)];

Speculation: A good behavior for such a `DenseGrid.opIndex` might 
allow untyped (size_t) indices for generic algorithms and users 
who can't be bothered, but reject incorrectly-typed indices (e.g. 
a `BoxOfKittens` should not be indexed with a `Row`, just as an 
`Image` should not be indexed with a `CatName`.
Jun 03 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Mason McGill:

 There's definitely a tension between writing generic code 
 (NumPy-style, operating on abstract dimensions) and safe code 
 (operating on dimensions with semantics;
While you can shove lot of different kind of code inside the matrix-based style of coding of APL, NumPy and so on, I find there is also plenty of code that is more naturally expressed with loops or with higher order functions as map/filter. And sometimes in D you want to implement the array processing functions (the ones of NumPy that are written in Fortran, C, C++), using loops. The non-built-in data structures need to be implemented in D, so strong typing for indexes helps you write those routines more correctly.
 It seems to me that the users who may benefit most from this 
 feature are likely to be using non-built-in data structures.
Strongly typed indexes are useful not just for handling 2D or 3D matrices, but also when you need to use two or more 1D arrays, to avoid confusing indexes of an array with the indexes of the other. Such situations are common even in non numerical D code. I write plenty of code like this.
 For example, you mention interpreting the first level of array 
 nesting as rows and the second as columns, implying you are 
 treating the nested arrays as a matrix
Right. You can give strong typing to indexes of a matrix library, this doesn't need language changes. So a dynamic array library in Phobos could fulfil most of the possible desires or needs for strongly typed indexes. But you lose some of the convenience of the built-in arrays, and you have to import a bulky library that you are mostly not using. I see several places in all kinds of D code where strongly typed indexes can help tell things apart more clearly and safely, but I think only in some of such situations programmers are willing to import a numpy-like library just to have strong typing. In D learn I see that people don't even add "immutable" to the foreach variable that spans an interval. So I don't expect people to import numpy just to give types to two indexes of two 1D arrays. And statistically I think such cases of just two or tree 1D arrays are much more common than numerical code that manages several matrices with a good matrix library. Bye, bearophile
Jun 04 2014
parent reply "Mason McGill" <mmcgill caltech.edu> writes:
On Wednesday, 4 June 2014 at 09:43:20 UTC, bearophile wrote
 You can give strong typing to indexes of a matrix library, this 
 doesn't need language changes. So a dynamic array library in 
 Phobos could fulfil most of the possible desires or needs for 
 strongly typed indexes. But you lose some of the convenience of 
 the built-in arrays, and you have to import a bulky library 
 that you are mostly not using.
 I see several places in all kinds of D code where strongly 
 typed indexes can help tell things apart more clearly and 
 safely, but I think only in some of such situations programmers 
 are willing to import a numpy-like library just to have strong 
 typing.
I'll have to check out Ada to get a sense of how this feature works and how people use it.
 In D learn I see that people don't even add "immutable" to the 
 foreach variable that spans an interval.
I noticed you did that, though I haven't seen this much on Dlang.org. Is this considered idiomatic? Coming mostly from JavaScript/Python/CUDA C++ I'm not sure how I feel about the terseness/safety trade-off (I can honestly see it both ways).
 So I don't expect people to import numpy just to give types to 
 two indexes of two 1D arrays. And statistically I think such 
 cases of just two or tree 1D arrays are much more common than 
 numerical code that manages several matrices with a good matrix 
 library.
Good point. Though, I don't think adding compiler support for strongly typed array dimensions is the full solution. I think this is one part of the larger problem of representing units. http://en.wikibooks.org/wiki/F_Sharp_Programming/Units_of_Measure), and it would be very nice to see something this in Phobos: /* User code. */ import std.awesome_unit_library: Unit, of; // Unit: defines new types that can be multiplied & divided. // of: wraps a container in a stricter version, enforcing units. alias Name = Unit!"name"; alias Food = Unit!"food"; auto names = ["Fred", "Alice", "Sue"].of!Name; auto foods = ["Apple", "Orange", "Tofu"].of!Food; foreach (name; ["Bill", "Ted"].of!Name) { names ~= name; // This compiles happily. foods ~= name; // This does not compile, preventing // Bill and Ted from being eaten. }
Jun 04 2014
parent reply "Mason McGill" <mmcgill caltech.edu> writes:
On Wednesday, 4 June 2014 at 18:46:04 UTC, Mason McGill wrote:
   /* User code. */
   import std.awesome_unit_library: Unit, of;
   // Unit: defines new types that can be multiplied & divided.
   // of: wraps a container in a stricter version, enforcing 
 units.

   alias Name = Unit!"name";
   alias Food = Unit!"food";
   auto names = ["Fred", "Alice", "Sue"].of!Name;
   auto foods = ["Apple", "Orange", "Tofu"].of!Food;

   foreach (name; ["Bill", "Ted"].of!Name)
   {
       names ~= name; // This compiles happily.
       foods ~= name; // This does not compile, preventing
                      // Bill and Ted from being eaten.
   }
You could actually just let the `of` template create the `Unit` type for you, if you want to be terse: import std.awesome_unit_library: of; auto names = ["Fred", "Alice", "Sue"].of!"name"; auto foods = ["Apple", "Orange", "Tofu"].of!"food"; foreach (name; ["Bill", "Ted"].of!"name") { names ~= name; // This compiles happily. foods ~= name; // This does not compile, preventing // Bill and Ted from being eaten. } It looks like this library solution puts user work at or below the level of your proposed syntax, though I'll have to think about whether it covers all the cases you're interested in.
Jun 04 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Sorry for my answers coming so slowly.

Mason McGill:

 Is this considered idiomatic?
Time ago I have asked here to make this index 'i' immutable on default, because the literal "0 .. 10" defines an immutable range of values: void main() { import std.stdio; foreach (i; 0 .. 10) { writeln(i); } } Originally code like this was giving weird results, but thanks to Kenji this design bug was fixed, and now this code prints the first ten integers normally, so now the situation is not as bad as it used to be: void main() { import std.stdio; foreach (i; 0 .. 10) { writeln(i); i++; } } Currently this prints just the even numbers, I think this is a little weird, but I think it sufficiently safe thanks to the explicit 'ref' annotation: void main() { import std.stdio; foreach (ref i; 0 .. 10) { writeln(i); i++; } } Generally in my D code I write like this, as explained here: http://blog.knatten.org/2011/11/11/disempower-every-variable/ void main() { import std.stdio; foreach (immutable i; 0 .. 10) { writeln(i); i++; // Not allowed. } } If your 'i' variable or your function arguments are const/immutable, you will have less surprises in your code (and a little less bugs). In D it's better to use const/immutable everywhere you can. Immutable by default is probably the right choice for a modern language. Unfortunately D has missed this opportunity.
 I think this is one part of the larger problem of representing 
 units.
Units of measure introduce a good amount of complexity and logic that is absent in the idea of giving strongly types to array indexes. You don't need to convert inches to centimetres, you don't need to perform a compile-time dimensional analysis for the correctness of expressions, you don't need to keep in mind the 0-based problems of converting Celsius to Fahrenheit, you don't need to solve the syntax problems of attaching "m / s" somehow to number literals, and so on. So they are two rather different problems.
 though I'll have to think about whether it covers all
 the cases you're interested in.
Another problem is visible in this code, I have a function bar() that must receive a different value for each item of the enum Foo. If I use an associative array, the compiler doesn't catch at compile-time the bug of the missing Foo.B case: enum Foo : ubyte { A, B } void bar(int[Foo] aa) {} void main() { bar([Foo.A: 10]); } And often using an associative array for this purpose is a waste of memory and computation. The Ada compiler is able to catch this bug at compile-time, using a simple fixed-size array (exactly as long as the number of items in the enum) with indexes strongly typed as Foo. So the array literal must contain all possible indexes only once: import std.traits: EnumMembers; enum Foo : ubyte { A, B } void bar(int[ typed(Foo) EnumMembers!Foo.length] input) {} void main() { bar([Foo.A: 10, Foo.B: 20]); } If you give a strong type to the array index you avoid that bug at compile time. (To be accepted as index type, an enumeration must behave nicely, all such tests can be done at compile-time). Bye, bearophile
Jun 05 2014
parent reply "Mason McGill" <mmcgill caltech.edu> writes:
On Thursday, 5 June 2014 at 09:23:00 UTC, bearophile wrote:
 Sorry for my answers coming so slowly.
Hah! I'm on these forums (selfishly) trying to understand/improve D for my own use at work, while you seem to be helping/teaching/contributing for the sake of it. I'm continually surprised by how pleasant and professional this community is; nothing at all to be sorry for!
 void main() {
     import std.stdio;
     foreach (i; 0 .. 10) {
         writeln(i);
         i++;
     }
 }

 Currently this prints just the even numbers, I think this is a 
 little weird.
That is weird. It goes counter to the idea of `i` "coming from" the range. Though, I'd say someone writing that code is asking for strange behavior, so it's not much of an issue.
 Generally in my D code I write like this, as explained here:
 http://blog.knatten.org/2011/11/11/disempower-every-variable/
I agree, with the caveat of trying to not add so much "noise" that it becomes unreadable.
 Immutable by default is probably the right choice for a modern 
 language.
Yes, or at least making operations with side effects look different from operations without.
 I think this is one part of the larger problem of representing 
 units.
Units of measure introduce a good amount of complexity and logic that is absent in the idea of giving strongly types to array indexes. You don't need to convert inches to centimetres,
True.
 You don't need to perform a compile-time dimensional analysis 
 for the correctness of expressions.
I'd argue you do, at least given my understanding of the goals of strongly-typed array indices. For example, consider strided indexing: const dogs = ["Rover", "Fido", "Rex"].of!"dog"; const cats = ["Fluffy", "Mr. Whiskers"].of!"cat"; // This should work. const i1 = index!"dog"(1); dogs[i1]; // This should not work. const i2 = index!"cat"(1); dogs[i2]; // This should work. const i3 = 2 * i1; dogs[i3]; // This should not work. const i4 = i2 * i1; dogs[i4]; So, the library needs to do some form of dimensional analysis to allow typed indices to be multiplied by unit-less scalars, but not scalars with units.
 You don't need to keep in mind the 0-based problems of 
 converting Celsius to Fahrenheit.
True. This is why I said the problem of strongly typed array indexes is *part* of the problem of units, not equivalent to it.
 You don't need to solve the syntax problems of attaching "m / 
 s" somehow to number literals.
Also true, though as a side note, I think a library solution for this could be quite nice: enum newton = 1.as!"kg*m/s^2"; // One possibility. enum newton = 1*kg*m/square(s); // Another.
 though I'll have to think about whether it covers all
 the cases you're interested in.
Another problem is visible in this code, I have a function bar() that must receive a different value for each item of the enum Foo. If I use an associative array, the compiler doesn't catch at compile-time the bug of the missing Foo.B case: enum Foo : ubyte { A, B } void bar(int[Foo] aa) {} void main() { bar([Foo.A: 10]); } And often using an associative array for this purpose is a waste of memory and computation. The Ada compiler is able to catch this bug at compile-time, using a simple fixed-size array (exactly as long as the number of items in the enum) with indexes strongly typed as Foo. So the array literal must contain all possible indexes only once: import std.traits: EnumMembers; enum Foo : ubyte { A, B } void bar(int[ typed(Foo) EnumMembers!Foo.length] input) {} void main() { bar([Foo.A: 10, Foo.B: 20]); } If you give a strong type to the array index you avoid that bug at compile time. (To be accepted as index type, an enumeration must behave nicely, all such tests can be done at compile-time).
That's quite clever, though couldn't it also be done with a library? enum Foo : ubyte { A, B } enum nFoo = EnumMembers!Foo.length; alias AllFoo = WithUnits!(int[nFoo], Foo); void bar(AllFoo) {} void main() { bar(AllFoo([/*A*/ 10, /*B*/ 20])); }
Jun 05 2014
parent reply Philippe Sigaud via Digitalmars-d <digitalmars-d puremagic.com> writes:
 Also true, though as a side note, I think a library solution for this could
 be quite nice:

   enum newton = 1.as!"kg*m/s^2";  // One possibility.
   enum newton = 1*kg*m/square(s); // Another.
Sorry to intrude, but you can also get: enum newton = 1.kg/m/s^^2; Which is quite readable. In this case, 'kg', 'm' and 's' are factory functions. The '1.kg' part is just UFCS in action. For the subjacent return type, you can also define the power operator, at least for integers.
Jun 06 2014
parent "Mason McGill" <mmcgill caltech.edu> writes:
On Friday, 6 June 2014 at 20:22:35 UTC, Philippe Sigaud via 
Digitalmars-d wrote:
 Sorry to intrude, but you can also get:

 enum newton = 1.kg/m/s^^2;
Good point. I actually learned D had an exponentiation operator just yesterday. It definitely helps readability in this case.
Jun 06 2014