
digitalmars.D.announce - Announcement and Request: Typesafe Coordinate Systems for

reply James Blachly <james.blachly gmail.com> writes:
In another post, I've just announced our D-based high throughput 
sequencing library, dhtslib.

One feature that is, AFAIK, novel in the field is leveraging the 
compiler's type system to enforce correctness regarding different 
genome/reference sequence coordinate systems. Clearly, the encoding of 
domain specific knowledge in a language's type system is nothing new, 
but it is surprising that this has not been done before in 
bioinformatics, and it is an idea that IMO is long overdue given the 
trainwreck of different coordinate systems in our field.
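
To make the idea concrete, here is a minimal sketch of how a coordinate system can be carried in a type so that mixing systems fails at compile time. All names (`Basis`, `End`, `Interval`, `spanOf`) are hypothetical and chosen for illustration; this is not dhtslib's actual API.

```d
// Illustrative sketch only; names are assumptions, not dhtslib's API.
enum Basis { zero, one }
enum End { halfOpen, closed }

/// An interval whose coordinate system is part of its type.
struct Interval(Basis basis, End end)
{
    long start;
    long stop;

    /// Number of positions covered by the interval.
    long length() const
    {
        static if (end == End.halfOpen)
            return stop - start;
        else
            return stop - start + 1;
    }

    /// Convert to another coordinate system, adjusting both endpoints.
    Interval!(toBasis, toEnd) to(Basis toBasis, End toEnd)() const
    {
        // Shift applied to both endpoints when the basis changes.
        enum long basisShift = cast(long) toBasis - cast(long) basis;
        // A closed end sits one position before the equivalent half-open end.
        enum long endShift = cast(long) end - cast(long) toEnd;
        return Interval!(toBasis, toEnd)(start + basisShift,
                                         stop + basisShift + endShift);
    }
}

alias ZBHO = Interval!(Basis.zero, End.halfOpen); // zero-based, half-open
alias OBC  = Interval!(Basis.one, End.closed);    // one-based, closed

/// Accepts only one-based, closed intervals.
long spanOf(OBC region)
{
    return region.length;
}
```

With these definitions, `spanOf(ZBHO(0, 10))` is rejected by the compiler, while `spanOf(ZBHO(0, 10).to!(Basis.one, End.closed)())` compiles and operates on the converted interval.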

You can find dhtslib's develop branch, with Typesafe Coordinates merged 
and ready to use, here:

https://github.com/blachlylab/dhtslib/


**Now the request:**
We've drafted a manuscript describing Typesafe Coordinates as a sort of 
low-key endorsement of the D language and our library package `dhtslib`. 
You can find the manuscript here:

https://github.com/blachlylab/typesafe-coordinates/

We would be very grateful to those of you who would take the time to 
read the manuscript and post comments (publicly or privately), 
_especially if we have made any incorrect statements_ or our language 
regarding type systems is awkward or nonstandard.

We did praise D, and gently criticized Rust and OCaml* somewhat as it 
appeared to me that they lacked the features required to implement 
Typesafe Coordinate Systems in as ergonomic a way as we could in D. 
However, being a true novice at both of these other languages there is 
the possibility that I've missed something significant, and that the 
Rust and OCaml implementations could be retooled to match the D 
implementation. I'd still be glad to hear it if that's the case.

I plan to make a few minor cleanups and submit this to a preprint server 
as well as a scientific journal in the next week or so.

Kind regards

James S Blachly, MD
The Ohio State University


* as a side note, I actually find the OCaml code quite attractive in its 
terseness: `let j = cl_interval_of_ho (ob_interval_of_zb i)`
Aug 31 2021
parent reply Arne Ludwig <arne.ludwig posteo.de> writes:
Hi James and Charles,

I am happy to hear of your latest idea of creating type-safe coordinate systems. It's a great idea!

After reading the code on GitHub, I have only one major remark: IMHO, it would be great to separate the novel coordinate systems from any `htslib` dependencies ([see lines 47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50)), as there are only auxiliary functions that use both the novel coordinate systems and `htslib`. The greater goal I have in mind is to provide the coordinate systems in a separate DUB sub-package (e.g. `dhtslib:coordinates`) that requires only a D compiler. That makes integration into existing projects that do not need `htslib` much easier.

Also, I have a short list of minor, technical remarks:

1. The returned type in [line 114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114) has a typo: there is an additional 's'.
2. The array of identifiers `CoordSystemLabels` in [line 203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203) is a bit unsafe and not strictly required for two reasons:
   1. It can be generated by the compiler using `enum CoordSystemLabels = __traits(allMembers, CoordSystem);`.
   2. As far as I can tell its only application is in [line 376](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L376). The same result can be achieved safely using `cs.stringof.split('.')[$ - 1]` or, without use of `std.array.split`, `cs.stringof[CoordSystem.stringof.length + 1 .. $]`.
3. The function `unionImpl` in [line 326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326) actually computes the convex hull of the two intervals, which should be noted in the doc comment for completeness' sake.
4. I have noted that you use operator overloading for union and intersection of `Interval`s. You may also add overloads for the `offset` function in both `Interval` and `Coordinate` with `auto opBinary(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)` and `auto opBinaryRight(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)`.

I enjoyed reading the manuscript. It highlights the issue clearly and presents the solution without getting lost in details. Ignoring typos at this stage, I have no remarks on it – keep going!

Cheers!

-- Arne
Sep 01 2021
parent James Blachly <james.blachly gmail.com> writes:
On 9/1/21 5:01 AM, Arne Ludwig wrote:
 I am happy to hear of your latest idea of creating type-safe coordinate 
 systems. It's a great idea!
 
 After reading the code on GitHub, I have only one major remark: IMHO, it 
 would be great to separate the novel coordinates systems from any 
 `htslib` dependencies ([see lines 47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50))
 as there are only auxiliary functions that use both the novel 
 coordinates systems and `htslib`. The greater goal I have in mind is to 
 provide the coordinate systems in a separate DUB sub-package (e.g. 
 `dhtslib:coordinates`) that requires only a D compiler. That makes 
 integration into existing projects that do not need `htslib` much easier.
This is an absolutely **outstanding** idea. Those imports were only to reuse an htslib `chr:X-Y` string parsing function, but we can trivially rewrite this in native D to enable sub-package independence!
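
As an illustration of how small such a native replacement could be, here is a minimal sketch of a `contig:start-end` parser that depends only on Phobos. The names `Region` and `parseRegion` are hypothetical, and the sketch handles only the plain form with both bounds present; it makes no attempt to match everything the htslib parser accepts.

```d
import std.conv : to;
import std.exception : enforce;
import std.string : indexOf, lastIndexOf;

/// Hypothetical result type: contig name plus the two numeric bounds as written.
struct Region
{
    string contig;
    long start;
    long end;
}

/// Parse "contig:start-end". Throws on malformed input.
Region parseRegion(string s)
{
    // Split on the last ':' so that contig names containing ':' still work.
    auto colon = s.lastIndexOf(':');
    enforce(colon > 0, "expected contig:start-end");

    auto bounds = s[colon + 1 .. $];
    auto dash = bounds.indexOf('-');
    enforce(dash > 0, "expected start-end after ':'");

    // to!long throws a ConvException on non-numeric or empty text.
    return Region(s[0 .. colon],
                  bounds[0 .. dash].to!long,
                  bounds[dash + 1 .. $].to!long);
}
```

The parser returns raw numbers on purpose: deciding which coordinate system those numbers belong to is exactly the job of the typed wrappers, so that decision stays in one place.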
 Also, I have a short list of minor, technical remarks:
 
 1. The returned type in [line 114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114)
 has a typo, there is an additional 's'.
Ahh, the curse of templates. Without 100% test coverage, these things, which would cause a failure to compile in non-template code, always seem to sneak in. Thank you so much.
 2. The array of identifiers `CoordSystemLabels` in [line 203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203)
 is a bit unsafe and not strictly required for two reasons:
A very excellent suggestion. I am still a metaprogramming novice.
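
For readers following along, a minimal sketch of letting the compiler produce the labels might look like the following. The enum members shown are assumptions for illustration, and the lookup uses `std.conv.to!string` rather than the `stringof` slicing suggested above; both avoid a hand-maintained array.

```d
import std.conv : to;

// Member names are assumed for illustration.
enum CoordSystem { zbho, zbc, obho, obc }

/// Names generated by the compiler, so the list cannot drift out of sync
/// with the enum definition.
enum string[] coordSystemLabels = [__traits(allMembers, CoordSystem)];

/// Label for a single value; to!string yields the member name for enums.
string labelOf(CoordSystem cs)
{
    return cs.to!string;
}
```

Adding or renaming a member of `CoordSystem` then updates both the array and `labelOf` without any further edits.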
 3. The function `unionImpl` in [line 326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326)
 actually computes the convex hull of the two intervals which should be 
 noted in the doc comment for completeness' sake.
Yes, we had some internal debate about the appropriate result of both union and intersect operations when intervals are non-overlapping and return type is a non-array. Will leave as is and document as convex hull in this case.
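
The distinction matters only for non-overlapping inputs, as this small sketch shows. It uses a plain, untyped `Interval` with zero-based, half-open bounds as an assumption; `hull` and `overlaps` are hypothetical names, not the library's functions.

```d
import std.algorithm.comparison : max, min;

/// Plain interval, assumed zero-based and half-open for this sketch.
struct Interval
{
    long start;
    long end;
}

/// Smallest single interval containing both inputs (their convex hull).
Interval hull(Interval a, Interval b)
{
    return Interval(min(a.start, b.start), max(a.end, b.end));
}

/// True if the two half-open intervals share at least one position.
bool overlaps(Interval a, Interval b)
{
    return a.start < b.end && b.start < a.end;
}
```

For `Interval(0, 5)` and `Interval(10, 15)`, `hull` returns `Interval(0, 15)`, which also covers the gap from 5 to 10 that belongs to neither input. When the inputs overlap, the hull and the set-theoretic union coincide.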
 4. I have noted that you use operator overloading for union and 
 intersection of `Interval`s. You may also add overloads for the `offset` 
 function in both `Interval` and `Coordinate` with `auto opBinary(string 
 op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)` and `auto 
 opBinaryRight(string op, T)(T off) if ((op == "+" || op == "-") && 
 isIntegral!T)`.
Very nice. I do miss operator overloading in some of the other languages I explored recently.
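
A minimal sketch of such overloads on a stand-in `Coordinate` type could look like this. The struct is hypothetical and untyped with respect to coordinate system; note that `op` is a `string`, so the constraint compares against string literals. This sketch also limits `opBinaryRight` to `+`, on the assumption that `n - coordinate` has no obvious meaning.

```d
import std.traits : isIntegral;

/// Stand-in for a library coordinate type; the name and field are assumptions.
struct Coordinate
{
    long pos;

    /// coordinate + n and coordinate - n
    Coordinate opBinary(string op, T)(T off) const
        if ((op == "+" || op == "-") && isIntegral!T)
    {
        // Build "pos + off" or "pos - off" at compile time.
        return Coordinate(mixin("pos " ~ op ~ " off"));
    }

    /// n + coordinate (n - coordinate is deliberately left undefined here)
    Coordinate opBinaryRight(string op, T)(T off) const
        if (op == "+" && isIntegral!T)
    {
        return Coordinate(off + pos);
    }
}
```

With these in place, `Coordinate(10) + 5`, `Coordinate(10) - 3` and `2 + Coordinate(10)` all compile, while adding two coordinates together does not, because the constraint requires an integral operand.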
 I enjoyed reading the manuscript. It highlights the issue clearly and 
 presents the solution without getting lost in details. Ignoring typos at 
 this stage, I have no remarks on it – keep going!
Thanks again for this critical review. As you know we are really pleased with how D has accelerated our science and wish to share it with the world.

James
Sep 01 2021