www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Code improvement for DNA reverse complement?

reply biocyberman <biocyberman gmail.com> writes:
I am solving this problem http://rosalind.info/problems/revc/ as 
an exercise to learn D. This is my solution:

https://dpaste.dzfl.pl/8aa667f962b7

Is there some D tricks I can use to make the `reverseComplement` 
function more concise and speedy? Any other comments for 
improvement of the whole solution are also much appreciated.
May 19
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.
I think doing a switch or even a if-else chain would be faster then using an AA.
May 19
next sibling parent biocyberman <biocyberman gmail.com> writes:
On Friday, 19 May 2017 at 07:46:13 UTC, Stefan Koch wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any 
 other comments for improvement of the whole solution are also 
 much appreciated.
I think doing a switch or even a if-else chain would be faster then using an AA.
Thank you Stefan. I will try that and benchmark the two implementations. I used AA approach because it looks more readable to me.
May 19
prev sibling parent aberba <karabutaworld gmail.com> writes:
On Friday, 19 May 2017 at 07:46:13 UTC, Stefan Koch wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any 
 other comments for improvement of the whole solution are also 
 much appreciated.
I think doing a switch or even a if-else chain would be faster then using an AA.
As part of your work on newCTFE, you are in a position to write a book titled "Fast D", about code that compiles, execute fast and efficiently.
May 25
prev sibling next sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.
Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires? Here's a few variations I've come up with, and their timings on my machine: import std.exception; import std.stdio; import std.conv; import std.range; import std.algorithm; import std.datetime; import std.meta; string randomDna(int length) { import std.random; auto rnd = Random(unpredictableSeed); enum chars = ['A','C','G','T']; return iota(length).map!(a=>chars[uniform(0,4, rnd)]).array; } unittest { auto input = randomDna(2000); string previous = null; foreach (fn; AliasSeq!(revComp0, revComp1, revComp2, revComp3)) { auto timing = benchmark!({fn(input);})(10_000); writeln((&fn).stringof[2..$], ": ", timing[0].to!Duration); auto current = fn(input); if (previous != null) { if (current != previous) { writeln((&fn).stringof[2..$], " did not give correct results."); } else { previous = current; } } } } // 216 ms, 3 us, and 8 hnsecs string revComp0(string bps) { const N = bps.length; char[] result = new char[N]; for (int i = 0; i < N; ++i) { result[i] = {switch(bps[N-i-1]){ case 'A': return 'T'; case 'C': return 'G'; case 'G': return 'C'; case 'T': return 'A'; default: return '\0'; }}(); } return result.assumeUnique; } // 2 secs, 752 ms, and 969 us string revComp1(string bps) { return bps.retro.map!((a){switch(a){ case 'A': return 'T'; case 'C': return 'G'; case 'G': return 'C'; case 'T': return 'A'; default: assert(false); }}).array; } // 10 secs, 419 ms, 335 us, and 6 hnsecs string revComp2(string bps) { enum chars = ['A': 'T', 'T': 'A', 'C': 'G', 'G': 'C']; auto result = appender!string; foreach_reverse (c; bps) { result.put(chars[c]); } return result.data; } // 1 sec, 972 ms, 915 us, and 9 hnsecs string revComp3(string bps) { const N = bps.length; enum chars = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A']; char[] result = new char[N]; for (int i = 0; i < N; ++i) { result[i] = chars[bps[N-i-1]]; } return result.assumeUnique; }
May 19
parent reply biocyberman <biocyberman gmail.com> writes:
On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 [...]
Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires? [...]
Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer you questions: Yes I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-CATG bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N). I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this. Some follow up questions: 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'? 2. What is going on with the trick of making chars enum like that in 'revComp3'?
May 19
next sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 1. Why do we need to use assumeUnique in 'revComp0' and 
 'revComp3'?
D strings are immutable, so if I'd created the result array as a string, I couldn't change the individual characters. Instead, I create a mutable array, change the elements in it, then cast it to immutable when I'm done. assumeUnique does that casting while keeping other type information and arguably providing better documentation through its name. Behind the scenes, it's basically doing cast(string)result;
 2. What is going on with the trick of making chars enum like 
 that in 'revComp3'?
By marking a symbol enum, we tell the compiler that its value should be calculated at compile-time. It's a bit of an optimization (but probably doesn't matter at all, and should be done by the compiler anyway), and a way to say it's really, really const. :p Mostly, it's a habit I try to build, of declaring symbols as const as possible, to make maintenance easier. Bonus! Three more variations, all faster than revComp0: string revComp4(string bps) { const N = bps.length; char[] result = new char[N]; for (int i = 0; i < N; ++i) { switch(bps[N-i-1]) { case 'A': result[i] = 'T'; break; case 'C': result[i] = 'G'; break; case 'G': result[i] = 'C'; break; case 'T': result[i] = 'A'; break; default: assert(false); } } return result.assumeUnique; } string revComp5(string bps) { const N = bps.length; char[] result = new char[N]; foreach (i, ref e; result) { switch(bps[N-i-1]) { case 'A': e = 'T'; break; case 'C': e = 'G'; break; case 'G': e = 'C'; break; case 'T': e = 'A'; break; default: assert(false); } } return result.assumeUnique; } string revComp6(string bps) { char[] result = new char[bps.length]; auto p1 = result.ptr; auto p2 = &bps[$-1]; while (p2 > bps.ptr) { switch(*p2) { case 'A': *p1 = 'T'; break; case 'C': *p1 = 'G'; break; case 'G': *p1 = 'C'; break; case 'T': *p1 = 'A'; break; default: assert(false); } p1++; p2--; } return result.assumeUnique; } revComp6 seems to be the fastest, but it's probably also the least readable (a common trade-off).
May 19
next sibling parent ag0aep6g <anonymous example.com> writes:
On 05/19/2017 02:55 PM, Biotronic wrote:
 On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?
D strings are immutable, so if I'd created the result array as a string, I couldn't change the individual characters. Instead, I create a mutable array, change the elements in it, then cast it to immutable when I'm done. assumeUnique does that casting while keeping other type information and arguably providing better documentation through its name. Behind the scenes, it's basically doing cast(string)result;
You could alternatively use `.reserve` on a `string`: ---- string result; result.reserve(N); for (...) { result ~= ...; } ---- That performs worse, though. Probably still has to check every time if there's reserved space available. On the plus side, it would be ` safe`.
 2. What is going on with the trick of making chars enum like that in 
 'revComp3'?
By marking a symbol enum, we tell the compiler that its value should be calculated at compile-time. It's a bit of an optimization (but probably doesn't matter at all, and should be done by the compiler anyway), and a way to say it's really, really const. :p
You fell into a trap there. The value is calculated at compile time, but it has copy/paste-like behavior. That is, whenever you use `chars`, the code behaves as if you typed out the array literal. That means, the whole array is re-created on every iteration. Use `static immutable` instead. It still forces compile-time calculation, but it doesn't have copy/paste behavior. Speeds up revComp3 a lot.
May 19
prev sibling parent reply crimaniak <crimaniak gmail.com> writes:
On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).
Try revComp7 with -release :) string revComp7(string bps) { char[] result = new char[bps.length]; auto p1 = result.ptr; auto p2 = &bps[$ - 1]; enum AT = 'A'^'T'; enum CG = 'C'^'G'; while (p2 > bps.ptr) { *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG); p1++; p2--; } return result.assumeUnique; } In fact, when the size of the sequence is growing time difference between procedures is shrinking, so it's much more important to use memory-efficient presentation than to optimize logic.
May 19
parent reply Biotronic <simen.kjaras gmail.com> writes:
On Friday, 19 May 2017 at 22:53:39 UTC, crimaniak wrote:
 On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).
Try revComp7 with -release :) string revComp7(string bps) { char[] result = new char[bps.length]; auto p1 = result.ptr; auto p2 = &bps[$ - 1]; enum AT = 'A'^'T'; enum CG = 'C'^'G'; while (p2 > bps.ptr) { *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG); p1++; p2--; } return result.assumeUnique; } In fact, when the size of the sequence is growing time difference between procedures is shrinking, so it's much more important to use memory-efficient presentation than to optimize logic.
revComp7 is pretty fast, but I followed ag0aep6g's advice: On Friday, 19 May 2017 at 13:38:20 UTC, ag0aep6g wrote:
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't have copy/paste behavior. Speeds up 
 revComp3 a lot.
Also, with DMD (2.073.0) -release -O instead of -debug from this point. I'd blame someone else, but this is my fault. :p Anyways, full collection of the various versions I've written, plus crimaniak's revComp7 (rebranded as revComp8, because I already had 7 at the time): https://gist.github.com/Biotronic/20daaf0ed1262d313830bc8cd4199896 Timings: revComp0: 158 ms, 926 us revComp1: 1 sec, 157 ms, 15 us revComp2: 604 ms, 37 us revComp3: 18 ms, 545 us revComp4: 92 ms, 293 us revComp5: 86 ms, 731 us revComp6: 94 ms, 56 us revComp7: 917 ms, 576 us revComp8: 62 ms, 917 us This actually matches my expectations - the table lookup version should be crazy fast, and it is. It beats even your revComp7 (revComp8 in the table). LDC (-release -O3) timings: revComp0: 166 ms, 190 us revComp1: 352 ms, 917 us revComp2: 300 ms, 493 us revComp3: 10 ms, 950 us revComp4: 148 ms, 106 us revComp5: 144 ms, 152 us revComp6: 142 ms, 307 us revComp7: 604 ms, 274 us revComp8: 26 ms, 612 us Interesting how revComp4-6 got slower. What I really wanted to see with this though, was the effect on revComp1, which uses ranges all the way.
May 21
parent reply biocyberman <biocyberman gmail.com> writes:
On Monday, 22 May 2017 at 06:50:45 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 22:53:39 UTC, crimaniak wrote:
 On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).
Try revComp7 with -release :) string revComp7(string bps) { char[] result = new char[bps.length]; auto p1 = result.ptr; auto p2 = &bps[$ - 1]; enum AT = 'A'^'T'; enum CG = 'C'^'G'; while (p2 > bps.ptr) { *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG); p1++; p2--; } return result.assumeUnique; } In fact, when the size of the sequence is growing time difference between procedures is shrinking, so it's much more important to use memory-efficient presentation than to optimize logic.
revComp7 is pretty fast, but I followed ag0aep6g's advice: On Friday, 19 May 2017 at 13:38:20 UTC, ag0aep6g wrote:
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't have copy/paste behavior. Speeds 
 up revComp3 a lot.
Also, with DMD (2.073.0) -release -O instead of -debug from this point. I'd blame someone else, but this is my fault. :p Anyways, full collection of the various versions I've written, plus crimaniak's revComp7 (rebranded as revComp8, because I already had 7 at the time): https://gist.github.com/Biotronic/20daaf0ed1262d313830bc8cd4199896 Timings: revComp0: 158 ms, 926 us revComp1: 1 sec, 157 ms, 15 us revComp2: 604 ms, 37 us revComp3: 18 ms, 545 us revComp4: 92 ms, 293 us revComp5: 86 ms, 731 us revComp6: 94 ms, 56 us revComp7: 917 ms, 576 us revComp8: 62 ms, 917 us This actually matches my expectations - the table lookup version should be crazy fast, and it is. It beats even your revComp7 (revComp8 in the table). LDC (-release -O3) timings: revComp0: 166 ms, 190 us revComp1: 352 ms, 917 us revComp2: 300 ms, 493 us revComp3: 10 ms, 950 us revComp4: 148 ms, 106 us revComp5: 144 ms, 152 us revComp6: 142 ms, 307 us revComp7: 604 ms, 274 us revComp8: 26 ms, 612 us Interesting how revComp4-6 got slower. What I really wanted to see with this though, was the effect on revComp1, which uses ranges all the way.
Wow!!! Someone grab me a chair, I need to sit down. I can't tell enough how grateful I am to all you guys. This is so much fun to learn. Some specific comments and replies: Nicolas Wilson: Your explanation of the enum is clear and very helpful. I can recall to the same technique used in kh_hash in samtools and the associated. With that said, the chars enum is only to 'T' (85) elements. Regarding BioD, I have plan to work on it to add some more functionality. But first I need to sharpen my D skills a bit more. Laeeth Isharc: I do like ldc as well. I've came across several projects that use ldc, and learnt that it is a good choice for speed in general. ag0aep6g
You fell into a trap there. The value is calculated at compile 
time, but it has >copy/paste-like behavior. That is, whenever 
you use `chars`, the code behaves as if you >typed out the array 
literal. That means, the whole array is re-created on every 
iteration.
Use `static immutable` instead. It still forces compile-time 
calculation, but it doesn't > have copy/paste behavior. Speeds 
up revComp3 a lot.
With 'iteration' here you mean running lifetime of the function, or in other words, each one of the 10_000 cycles in the benchmark? Could you provide some more reading for what you are telling here? I can only guess it is intrinsic behavior of an 'enum'. crimaniak, Nicolas Wilson and Biotronic: I've realized before the reversible/negate property of XOR: 'A'^'T'^'T' = 'A' and 'A'^'T'^'A' = 'T'; To help myself and see it in bit patterns, I wrote this snippet: void main(){ enum AT = 'A'^'T'; enum CG = 'C'^'G'; enum chars = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A']; writef("BIN %0 8b DEC %d\n", 'A', 'A'); writef("BIN %0 8b DEC %d\n", 'T', 'T'); writef("XOR %0 8b DEC %d\n", AT, AT); writef("TOR %0 8b DEC %d\n", AT^'T', AT^'T', AT^'T'); writef("AOR %0 8b DEC %d\n", AT^'A', AT^'A', AT^'A'); foreach (i, c; chars){ if (i >= 60) writef("%02d: %0 8b, %d\n",i, c, c); // elements before 60 are all \0 } } // Output BIN 01000001 DEC 65 BIN 01010100 DEC 84 XOR 00010101 DEC 21 TOR 01000001 DEC 65 AOR 01010100 DEC 84 60: 00000000, 0 61: 00000000, 0 62: 00000000, 0 63: 00000000, 0 64: 00000000, 0 65: 01010100, 84 66: 00000000, 0 67: 01000111, 71 68: 00000000, 0 69: 00000000, 0 70: 00000000, 0 71: 01000011, 67 72: 00000000, 0 73: 00000000, 0 74: 00000000, 0 75: 00000000, 0 76: 00000000, 0 77: 00000000, 0 78: 00000000, 0 79: 00000000, 0 80: 00000000, 0 81: 00000000, 0 82: 00000000, 0 83: 00000000, 0 84: 01000001, 65
May 22
next sibling parent Biotronic <simen.kjaras gmail.com> writes:
On Monday, 22 May 2017 at 08:58:24 UTC, biocyberman wrote:
  Nicolas Wilson: Your explanation of the enum is clear and very 
 helpful. I can recall to the same technique used in kh_hash in 
 samtools and the associated. With that said, the chars enum is 
 only to 'T' (85) elements.
The reason for having only 85 elements in the array was pure laziness - the problem description seems to forbid non-ACGT letters, so I saw no reason to write more code to handle them. :p My code will crash if the input string contains lower-case letters, Zs, or any other letter beyond 'T' (or read random bits of memory if you're lucky).
  ag0aep6g
You fell into a trap there. The value is calculated at compile 
time, but it has >copy/paste-like behavior. That is, whenever 
you use `chars`, the code behaves as if you >typed out the 
array literal. That means, the whole array is re-created on 
every iteration.
Use `static immutable` instead. It still forces compile-time 
calculation, but it doesn't > have copy/paste behavior. Speeds 
up revComp3 a lot.
With 'iteration' here you mean running lifetime of the function, or in other words, each one of the 10_000 cycles in the benchmark? Could you provide some more reading for what you are telling here? I can only guess it is intrinsic behavior of an 'enum'.
ag0aep6g is absolutely correct in his observation, and the resulting code is basically this: string revComp3(string bps) { const N = bps.length; static immutable chars_saved = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A']; char[] result = new char[N]; for (int i = 0; i < N; ++i) { auto chars = chars_saved.dup; // Bad stuff happens here result[i] = chars[bps[N-i-1]]; } return result.assumeUnique; } As we can see, it copies the entire array for every character in the input string. That's basically an allocation and a memcpy in the innermost, hottest loop. Roughly as bad as it gets. (yup, that's 20 million allocations) As for why this happens, enum can be thought of as the analog of C's #define - the compiler precalculates the data to fill the array, and then copies that into the source code every time it's used.
May 22
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/22/2017 10:58 AM, biocyberman wrote:
  ag0aep6g
 You fell into a trap there. The value is calculated at compile time, 
 but it has >copy/paste-like behavior. That is, whenever you use 
 `chars`, the code behaves as if you >typed out the array literal. That 
 means, the whole array is re-created on every iteration.
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't > have copy/paste behavior. Speeds up 
 revComp3 a lot.
With 'iteration' here you mean running lifetime of the function, or in other words, each one of the 10_000 cycles in the benchmark?
For reference, here is the version of revComp3 I commented on: ---- string revComp3(string bps) { const N = bps.length; enum chars = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A']; char[] result = new char[N]; for (int i = 0; i < N; ++i) { result[i] = chars[bps[N-i-1]]; } return result.assumeUnique; } ---- By "iteration" I mean every execution of the body of the `for` loop. For every new `i`, a new array is created. The loop above is equivalent to this: ---- for (int i = 0; i < N; ++i) { result[i] = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A'][bps[N-i-1]]; } ---- Used like that, the array literal [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A'] allocates a new array on every execution of `result[i] = ...;`.
 Could you provide some more reading for what you are telling here? I can 
 only guess it is intrinsic behavior of an 'enum'.
Unfortunately, the spec page (<https://dlang.org/spec/enum.html>) doesn't seem to mention this. But Ali Çehreli covers it in his book on the "immutability" page (I would have expected to find it on the "enum" page): http://ddili.org/ders/d.en/const_and_immutable.html#ix_const_and_immutable.enum The details can be confusing here. There is an element of copy/paste-like behavior, but it's not as simple as taking the right-hand side of the enum declaration and substituting it for the left-hand name. The right-hand side is evaluated at compile time. The result of that can be thought of as an array literal. It's that array literal that gets substituted for the name. An example with comments: ---- import std.stdio: writeln; /* f prints a message when called at run time. Then it returns its argument times ten. */ int f(int x) { if (!__ctfe) writeln("f(", x, ")"); return x * 10; } void main() { /* The following line prints f's messages. The f calls are normal run-time calls. Then the writeln prints "false" because each array literal creates a new, distinct array. */ writeln([f(1), f(2)] is [f(1), f(2)]); /* false */ /* The next `enum` line does not print f's messages. The calls go through CTFE. The `writeln` line afterwards prints "false". ea gets pre-computed via CTFE, but the result acts like an array literal. So it's the same as writing `writeln([10, 20] is [10, 20]);`. */ enum int[] ea = [f(1), f(2)]; writeln(ea is ea); /* false */ /* No messages either with `static immutable`. Like ea, the right-hand side goes through CTFE. But unlike ea, ia does not act like an array literal. `writeln` prints "true". */ static immutable int[] ia = [f(1), f(2)]; writeln(ia is ia); /* true */ } ----
May 22
next sibling parent biocyberman <biocyberman gmail.com> writes:
On Monday, 22 May 2017 at 10:35:36 UTC, ag0aep6g wrote:
 On 05/22/2017 10:58 AM, biocyberman wrote:
 [...]
For reference, here is the version of revComp3 I commented on: ---- string revComp3(string bps) { const N = bps.length; enum chars = [Repeat!('A'-'\0', '\0'), 'T', Repeat!('C'-'A'-1, '\0'), 'G', Repeat!('G'-'C'-1, '\0'), 'C', Repeat!('T'-'G'-1, '\0'), 'A']; [...]
Very illustrative. I could easily miss and I did miss this subtle but important aspect. I wonder how D should widen the 'pit of success' that Scott Meyers mentioned about more than once. A take home message for myself, if one ever use an array as a lookup table, make it 'static immutable. And enum array does not make sense'. And in Ali's book:
 Consider the hidden cost of enum arrays and enum associative 
 arrays. Define them as immutable variables if the arrays are 
 large and they are used more than once in the program.
One thing also became clear: 'is' is not '=='. Therefore writeln([10,20] is [10,20]); /* false */ writeln([10,20] == [10,20]); /* true */ I did not notice that because I haven't come across 'is' so often.
May 22
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 05/22/2017 03:35 AM, ag0aep6g wrote:

 But Ali Çehreli covers it in his book on the "immutability" page (I
 would have expected to find it on the "enum" page):

 
http://ddili.org/ders/d.en/const_and_immutable.html#ix_const_and_immutable.enum Thank you for your feedback. I'm inserting a forward reference from the enum chapter. I would like to acknowledge you in the book hopefully with your real name but if you don't want or care, I will use ag0aep6g. :) Ali
May 25
parent ag0aep6g <anonymous example.com> writes:
On 05/25/2017 10:41 PM, Ali Çehreli wrote:
 I would like to acknowledge you in the book hopefully with your real 
 name but if you don't want or care, I will use ag0aep6g. :)
Thanks, but I don't really care for the recognition. If anything, please use ag0aep6g.
May 25
prev sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 [...]
Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires? [...]
Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer you questions: Yes I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-CATG bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N). I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this. Some follow up questions: 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?
Because `char[] result = new char[N];` is not a string (a.k.a. immutable(char)[]). But because it was created from the GC in this function we know that it is safe to assume that is a string.
 2. What is going on with the trick of making chars enum like 
 that in 'revComp3'?
What revComp3 is doing is effectively creating a table for each possible value of char that matches the behaviour of the switch. it could also be rewritten as ``` char[256] chars; // implicitly memset to '\0' chars['A'] = 'T'; chars['C'] = 'G'; chars['G'] = 'C'; chars['T'] = 'A'; ``` Other miscellaneous comments: If you haven't already checkout [BioD](https://github.com/biod/BioD), for most (all?) your bioinformatics needs. If you're trying to be fast you probably don't want to use string for internal calculations as it is very entropy non-optimal (2 bits out of 8 for ACGT, 4 out of 8 for an ambiguous encoding). I would have at least 2 "Dictionaries": one the standard nucleotides (ACGT) and another for your ambiguous representations (UNRYBDHVMKSW-) and the standard nucleotides, to get a better information density. If you're doing anything with protein sequences then you should use a translation table anyway as the DNA -> amino acid mapping changes between species/organelle (mt|cp|n)DNA.
May 19
prev sibling parent Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.
You might try using ldc in release mode. With dmd, cost of ranges and so on might be higher than doing it in a more direct low-level way. With ldc that difference often seems to shrink to insignificance as its optimisation is more powerful. If you care about performance then use ldc or gdc. (I have less experience with gdc, but probably similar). But then it's best to compare results under ldc because the ratios of different options will be different vs dmd.
May 20