digitalmars.D.learn - Code improvement for DNA reverse complement?

biocyberman (6/6) May 19 2017 I am solving this problem http://rosalind.info/problems/revc/ as

Stefan Koch (3/10) May 19 2017 I think doing a switch or even a if-else chain would be faster

biocyberman (4/16) May 19 2017 Thank you Stefan. I will try that and benchmark the two
aberba (4/16) May 25 2017 As part of your work on newCTFE, you are in a position to write a

Biotronic (88/95) May 19 2017 Question about your implementation: you assume the input may

biocyberman (14/23) May 19 2017 Firstly, thank you for showing me various solutions, and even

Biotronic (62/66) May 19 2017 D strings are immutable, so if I'd created the result array as a

ag0aep6g (18/35) May 19 2017 You could alternatively use `.reserve` on a `string`:
crimaniak (20/22) May 19 2017 Try revComp7 with -release :)

Biotronic (35/61) May 21 2017 revComp7 is pretty fast, but I followed ag0aep6g's advice:

biocyberman (70/142) May 22 2017 Wow!!! Someone grab me a chair, I need to sit down. I can't tell

Biotronic (30/48) May 22 2017 The reason for having only 85 elements in the array was pure
ag0aep6g (79/93) May 22 2017 For reference, here is the version of revComp3 I commented on:

biocyberman (11/25) May 22 2017 Very illustrative. I could easily miss and I did miss this subtle
=?UTF-8?Q?Ali_=c3=87ehreli?= (7/10) May 25 2017 http://ddili.org/ders/d.en/const_and_immutable.html#ix_const_and_immutab...

ag0aep6g (3/5) May 25 2017 Thanks, but I don't really care for the recognition. If anything,

Nicholas Wilson (29/55) May 19 2017 Because `char[] result = new char[N];` is not a string (a.k.a.

Laeeth Isharc (9/16) May 20 2017 You might try using ldc in release mode. With dmd, cost of

biocyberman <biocyberman gmail.com> writes:

I am solving this problem http://rosalind.info/problems/revc/ as 
an exercise to learn D. This is my solution:

https://dpaste.dzfl.pl/8aa667f962b7

Is there some D tricks I can use to make the `reverseComplement` 
function more concise and speedy? Any other comments for 
improvement of the whole solution are also much appreciated.

May 19 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.

I think doing a switch or even a if-else chain would be faster 
then using an AA.

May 19 2017

biocyberman <biocyberman gmail.com> writes:

On Friday, 19 May 2017 at 07:46:13 UTC, Stefan Koch wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any 
 other comments for improvement of the whole solution are also 
 much appreciated.

 I think doing a switch or even a if-else chain would be faster 
 then using an AA.

Thank you Stefan. I will try that and benchmark the two 
implementations. I used AA approach because it looks more 
readable to me.

May 19 2017

aberba <karabutaworld gmail.com> writes:

On Friday, 19 May 2017 at 07:46:13 UTC, Stefan Koch wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any 
 other comments for improvement of the whole solution are also 
 much appreciated.

 I think doing a switch or even a if-else chain would be faster 
 then using an AA.

As part of your work on newCTFE, you are in a position to write a 
book titled "Fast D", about code that compiles, execute fast and 
efficiently.

May 25 2017

Biotronic <simen.kjaras gmail.com> writes:

On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.

Question about your implementation: you assume the input may 
contain newlines, but don't handle any other non-ACGT characters. 
The problem definition states 'DNA string' and the sample dataset 
contains no non-ACGT chars. Is this an oversight my part or 
yours, or did you just decide to support more than the problem 
requires?


Here's a few variations I've come up with, and their timings on 
my machine:

import std.exception;
import std.stdio;
import std.conv;
import std.range;
import std.algorithm;
import std.datetime;
import std.meta;

string randomDna(int length) {
     import std.random;
     auto rnd = Random(unpredictableSeed);
     enum chars = ['A','C','G','T'];
     return iota(length).map!(a=>chars[uniform(0,4, rnd)]).array;
}

unittest {
     auto input = randomDna(2000);

     string previous = null;
     foreach (fn; AliasSeq!(revComp0, revComp1, revComp2, 
revComp3)) {
         auto timing = benchmark!({fn(input);})(10_000);
         writeln((&fn).stringof[2..$], ": ", 
timing[0].to!Duration);
         auto current = fn(input);
         if (previous != null) {
             if (current != previous) {
                 writeln((&fn).stringof[2..$], " did not give 
correct results.");
             } else {
                 previous = current;
             }
         }
     }
}

// 216 ms, 3 us, and 8 hnsecs
string revComp0(string bps) {
     const N = bps.length;
     char[] result = new char[N];
     for (int i = 0; i < N; ++i) {
         result[i] = {switch(bps[N-i-1]){
             case 'A': return 'T';
             case 'C': return 'G';
             case 'G': return 'C';
             case 'T': return 'A';
             default: return '\0';
             }}();
     }
     return result.assumeUnique;
}

// 2 secs, 752 ms, and 969 us
string revComp1(string bps) {
     return bps.retro.map!((a){switch(a){
             case 'A': return 'T';
             case 'C': return 'G';
             case 'G': return 'C';
             case 'T': return 'A';
             default: assert(false);
             }}).array;
}

// 10 secs, 419 ms, 335 us, and 6 hnsecs
string revComp2(string bps) {
     enum chars = ['A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'];
     auto result = appender!string;
     foreach_reverse (c; bps) {
         result.put(chars[c]);
     }
     return result.data;
}

// 1 sec, 972 ms, 915 us, and 9 hnsecs
string revComp3(string bps) {
     const N = bps.length;
     enum chars = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];

     char[] result = new char[N];
     for (int i = 0; i < N; ++i) {
         result[i] = chars[bps[N-i-1]];
     }
     return result.assumeUnique;
}

May 19 2017

biocyberman <biocyberman gmail.com> writes:

On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 [...]

 Question about your implementation: you assume the input may 
 contain newlines, but don't handle any other non-ACGT 
 characters. The problem definition states 'DNA string' and the 
 sample dataset contains no non-ACGT chars. Is this an oversight 
 my part or yours, or did you just decide to support more than 
 the problem requires?

 [...]

Firstly, thank you for showing me various solutions, and even 
cool benchmark code. To answer you questions: Yes I assume the 
input file would realistically contain newlines, even though the 
problem does not care about them. I also thought about non-CATG 
bases, but haven't taken care of those cases. In reality we 
should deal with at least ambiguous bases (N).

I ran your code and also see that switch is faster than AA (i.e. 
revComp0 is the fastest). And Stefan is right about this.

Some follow up questions:

1. Why do we need to use assumeUnique in 'revComp0' and 
'revComp3'?

2. What is going on with the trick of making chars enum like that 
in 'revComp3'?

May 19 2017

Biotronic <simen.kjaras gmail.com> writes:

On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 1. Why do we need to use assumeUnique in 'revComp0' and 
 'revComp3'?

D strings are immutable, so if I'd created the result array as a 
string, I couldn't change the individual characters. Instead, I 
create a mutable array, change the elements in it, then cast it 
to immutable when I'm done. assumeUnique does that casting while 
keeping other type information and arguably providing better 
documentation through its name. Behind the scenes, it's basically 
doing cast(string)result;

 2. What is going on with the trick of making chars enum like 
 that in 'revComp3'?

By marking a symbol enum, we tell the compiler that its value 
should be calculated at compile-time. It's a bit of an 
optimization (but probably doesn't matter at all, and should be 
done by the compiler anyway), and a way to say it's really, 
really const. :p

Mostly, it's a habit I try to build, of declaring symbols as 
const as possible, to make maintenance easier.


Bonus! Three more variations, all faster than revComp0:

string revComp4(string bps) {
     const N = bps.length;
     char[] result = new char[N];
     for (int i = 0; i < N; ++i) {
         switch(bps[N-i-1]) {
             case 'A': result[i] = 'T'; break;
             case 'C': result[i] = 'G'; break;
             case 'G': result[i] = 'C'; break;
             case 'T': result[i] = 'A'; break;
             default: assert(false);
         }
     }
     return result.assumeUnique;
}

string revComp5(string bps) {
     const N = bps.length;
     char[] result = new char[N];
     foreach (i, ref e; result) {
         switch(bps[N-i-1]) {
             case 'A': e = 'T'; break;
             case 'C': e = 'G'; break;
             case 'G': e = 'C'; break;
             case 'T': e = 'A'; break;
             default: assert(false);
         }
     }
     return result.assumeUnique;
}

string revComp6(string bps) {
     char[] result = new char[bps.length];
     auto p1 = result.ptr;
     auto p2 = &bps[$-1];

     while (p2 > bps.ptr) {
         switch(*p2) {
             case 'A': *p1 = 'T'; break;
             case 'C': *p1 = 'G'; break;
             case 'G': *p1 = 'C'; break;
             case 'T': *p1 = 'A'; break;
             default: assert(false);
         }
         p1++; p2--;
     }
     return result.assumeUnique;
}

revComp6 seems to be the fastest, but it's probably also the 
least readable (a common trade-off).

May 19 2017

ag0aep6g <anonymous example.com> writes:

On 05/19/2017 02:55 PM, Biotronic wrote:
 On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?

 
 D strings are immutable, so if I'd created the result array as a string, 
 I couldn't change the individual characters. Instead, I create a mutable 
 array, change the elements in it, then cast it to immutable when I'm 
 done. assumeUnique does that casting while keeping other type 
 information and arguably providing better documentation through its 
 name. Behind the scenes, it's basically doing cast(string)result;

You could alternatively use `.reserve` on a `string`:

----
string result;
result.reserve(N);
for (...) {
    result ~= ...;
}
----

That performs worse, though. Probably still has to check every time if 
there's reserved space available. On the plus side, it would be ` safe`.

 2. What is going on with the trick of making chars enum like that in 
 'revComp3'?

 
 By marking a symbol enum, we tell the compiler that its value should be 
 calculated at compile-time. It's a bit of an optimization (but probably 
 doesn't matter at all, and should be done by the compiler anyway), and a 
 way to say it's really, really const. :p

You fell into a trap there. The value is calculated at compile time, but 
it has copy/paste-like behavior. That is, whenever you use `chars`, the 
code behaves as if you typed out the array literal. That means, the 
whole array is re-created on every iteration.

Use `static immutable` instead. It still forces compile-time 
calculation, but it doesn't have copy/paste behavior. Speeds up revComp3 
a lot.

May 19 2017

crimaniak <crimaniak gmail.com> writes:

On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).

Try revComp7 with -release :)

string revComp7(string bps)
{
     char[] result = new char[bps.length];
     auto p1 = result.ptr;
     auto p2 = &bps[$ - 1];
     enum AT = 'A'^'T';
     enum CG = 'C'^'G';

     while (p2 > bps.ptr)
     {
        *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG);
         p1++;
         p2--;
     }
     return result.assumeUnique;
}

In fact, when the size of the sequence is growing time difference 
between procedures is shrinking, so it's much more important to 
use memory-efficient presentation than to optimize logic.

May 19 2017

Biotronic <simen.kjaras gmail.com> writes:

On Friday, 19 May 2017 at 22:53:39 UTC, crimaniak wrote:
 On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).

 Try revComp7 with -release :)

 string revComp7(string bps)
 {
     char[] result = new char[bps.length];
     auto p1 = result.ptr;
     auto p2 = &bps[$ - 1];
     enum AT = 'A'^'T';
     enum CG = 'C'^'G';

     while (p2 > bps.ptr)
     {
        *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG);
         p1++;
         p2--;
     }
     return result.assumeUnique;
 }

 In fact, when the size of the sequence is growing time 
 difference between procedures is shrinking, so it's much more 
 important to use memory-efficient presentation than to optimize 
 logic.

revComp7 is pretty fast, but I followed ag0aep6g's advice:

On Friday, 19 May 2017 at 13:38:20 UTC, ag0aep6g wrote:
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't have copy/paste behavior. Speeds up 
 revComp3 a lot.

Also, with DMD (2.073.0) -release -O instead of -debug from this 
point. I'd blame someone else, but this is my fault. :p

Anyways, full collection of the various versions I've written, 
plus crimaniak's revComp7 (rebranded as revComp8, because I 
already had 7 at the time):

https://gist.github.com/Biotronic/20daaf0ed1262d313830bc8cd4199896

Timings:
revComp0:        158 ms, 926 us
revComp1: 1 sec, 157 ms,  15 us
revComp2:        604 ms,  37 us
revComp3:         18 ms, 545 us
revComp4:         92 ms, 293 us
revComp5:         86 ms, 731 us
revComp6:         94 ms,  56 us
revComp7:        917 ms, 576 us
revComp8:         62 ms, 917 us

This actually matches my expectations - the table lookup version 
should be crazy fast, and it is. It beats even your revComp7 
(revComp8 in the table).

LDC (-release -O3) timings:

revComp0: 166 ms, 190 us
revComp1: 352 ms, 917 us
revComp2: 300 ms, 493 us
revComp3:  10 ms, 950 us
revComp4: 148 ms, 106 us
revComp5: 144 ms, 152 us
revComp6: 142 ms, 307 us
revComp7: 604 ms, 274 us
revComp8:  26 ms, 612 us

Interesting how revComp4-6 got slower. What I really wanted to 
see with this though, was the effect on revComp1, which uses 
ranges all the way.

May 21 2017

biocyberman <biocyberman gmail.com> writes:

On Monday, 22 May 2017 at 06:50:45 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 22:53:39 UTC, crimaniak wrote:
 On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
 revComp6 seems to be the fastest, but it's probably also the 
 least readable (a common trade-off).

 Try revComp7 with -release :)

 string revComp7(string bps)
 {
     char[] result = new char[bps.length];
     auto p1 = result.ptr;
     auto p2 = &bps[$ - 1];
     enum AT = 'A'^'T';
     enum CG = 'C'^'G';

     while (p2 > bps.ptr)
     {
        *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG);
         p1++;
         p2--;
     }
     return result.assumeUnique;
 }

 In fact, when the size of the sequence is growing time 
 difference between procedures is shrinking, so it's much more 
 important to use memory-efficient presentation than to 
 optimize logic.

 revComp7 is pretty fast, but I followed ag0aep6g's advice:

 On Friday, 19 May 2017 at 13:38:20 UTC, ag0aep6g wrote:
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't have copy/paste behavior. Speeds 
 up revComp3 a lot.

 Also, with DMD (2.073.0) -release -O instead of -debug from 
 this point. I'd blame someone else, but this is my fault. :p

 Anyways, full collection of the various versions I've written, 
 plus crimaniak's revComp7 (rebranded as revComp8, because I 
 already had 7 at the time):

 https://gist.github.com/Biotronic/20daaf0ed1262d313830bc8cd4199896

 Timings:
 revComp0:        158 ms, 926 us
 revComp1: 1 sec, 157 ms,  15 us
 revComp2:        604 ms,  37 us
 revComp3:         18 ms, 545 us
 revComp4:         92 ms, 293 us
 revComp5:         86 ms, 731 us
 revComp6:         94 ms,  56 us
 revComp7:        917 ms, 576 us
 revComp8:         62 ms, 917 us

 This actually matches my expectations - the table lookup 
 version should be crazy fast, and it is. It beats even your 
 revComp7 (revComp8 in the table).

 LDC (-release -O3) timings:

 revComp0: 166 ms, 190 us
 revComp1: 352 ms, 917 us
 revComp2: 300 ms, 493 us
 revComp3:  10 ms, 950 us
 revComp4: 148 ms, 106 us
 revComp5: 144 ms, 152 us
 revComp6: 142 ms, 307 us
 revComp7: 604 ms, 274 us
 revComp8:  26 ms, 612 us

 Interesting how revComp4-6 got slower. What I really wanted to 
 see with this though, was the effect on revComp1, which uses 
 ranges all the way.

Wow!!! Someone grab me a chair, I need to sit down. I can't tell 
enough how grateful I am to all you guys. This is so much fun to 
learn. Some specific comments and replies:

 Nicolas Wilson: Your explanation of the enum is clear and very 
helpful. I can recall to the same technique used in kh_hash in 
samtools and the associated. With that said, the chars enum is 
only to 'T' (85) elements.

  Regarding BioD, I have plan to work on it to add some more 
functionality. But first I need to sharpen my D skills a bit more.

 Laeeth Isharc: I do like ldc as well. I've came across several 
projects that use ldc, and learnt that it is a good choice for 
speed in general.


 ag0aep6g
You fell into a trap there. The value is calculated at compile 
time, but it has >copy/paste-like behavior. That is, whenever 
you use `chars`, the code behaves as if you >typed out the array 
literal. That means, the whole array is re-created on every 
iteration.

Use `static immutable` instead. It still forces compile-time 
calculation, but it doesn't > have copy/paste behavior. Speeds 
up revComp3 a lot.

With  'iteration' here you mean running lifetime of the function, 
or in other words, each one of the 10_000 cycles in the benchmark?

Could you provide some more reading for what you are telling 
here? I can only guess it is intrinsic behavior of an 'enum'.

 crimaniak, Nicolas Wilson and Biotronic: I've realized before 
the reversible/negate property of XOR: 'A'^'T'^'T' = 'A' and 
'A'^'T'^'A' = 'T'; To help myself and see it in bit patterns, I 
wrote this snippet:


void main(){

   enum AT = 'A'^'T';
   enum CG = 'C'^'G';
   enum chars = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];


   writef("BIN %0 8b DEC %d\n", 'A', 'A');
   writef("BIN %0 8b DEC %d\n", 'T', 'T');
   writef("XOR %0 8b DEC %d\n", AT, AT);
   writef("TOR %0 8b DEC %d\n", AT^'T', AT^'T', AT^'T');
   writef("AOR %0 8b DEC %d\n", AT^'A', AT^'A', AT^'A');

   foreach (i, c; chars){
     if (i >= 60) writef("%02d: %0 8b, %d\n",i, c, c); // elements 
before 60 are all \0
   }
}


// Output
BIN 01000001 DEC 65
BIN 01010100 DEC 84
XOR 00010101 DEC 21
TOR 01000001 DEC 65
AOR 01010100 DEC 84
60: 00000000, 0
61: 00000000, 0
62: 00000000, 0
63: 00000000, 0
64: 00000000, 0
65: 01010100, 84
66: 00000000, 0
67: 01000111, 71
68: 00000000, 0
69: 00000000, 0
70: 00000000, 0
71: 01000011, 67
72: 00000000, 0
73: 00000000, 0
74: 00000000, 0
75: 00000000, 0
76: 00000000, 0
77: 00000000, 0
78: 00000000, 0
79: 00000000, 0
80: 00000000, 0
81: 00000000, 0
82: 00000000, 0
83: 00000000, 0
84: 01000001, 65

May 22 2017

Biotronic <simen.kjaras gmail.com> writes:

On Monday, 22 May 2017 at 08:58:24 UTC, biocyberman wrote:
  Nicolas Wilson: Your explanation of the enum is clear and very 
 helpful. I can recall to the same technique used in kh_hash in 
 samtools and the associated. With that said, the chars enum is 
 only to 'T' (85) elements.

The reason for having only 85 elements in the array was pure 
laziness - the problem description seems to forbid non-ACGT 
letters, so I saw no reason to write more code to handle them. :p 
My code will crash if the input string contains lower-case 
letters, Zs, or any other letter beyond 'T' (or read random bits 
of memory if you're lucky).

  ag0aep6g
You fell into a trap there. The value is calculated at compile 
time, but it has >copy/paste-like behavior. That is, whenever 
you use `chars`, the code behaves as if you >typed out the 
array literal. That means, the whole array is re-created on 
every iteration.

Use `static immutable` instead. It still forces compile-time 
calculation, but it doesn't > have copy/paste behavior. Speeds 
up revComp3 a lot.

 With  'iteration' here you mean running lifetime of the 
 function, or in other words, each one of the 10_000 cycles in 
 the benchmark?

 Could you provide some more reading for what you are telling 
 here? I can only guess it is intrinsic behavior of an 'enum'.

ag0aep6g is absolutely correct in his observation, and the 
resulting code is basically this:

string revComp3(string bps) {
     const N = bps.length;
     static immutable chars_saved = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];
     char[] result = new char[N];
     for (int i = 0; i < N; ++i) {
         auto chars = chars_saved.dup; // Bad stuff happens here
         result[i] = chars[bps[N-i-1]];
     }
     return result.assumeUnique;
}

As we can see, it copies the entire array for every character in 
the input string. That's basically an allocation and a memcpy in 
the innermost, hottest loop. Roughly as bad as it gets. (yup, 
that's 20 million allocations)

As for why this happens, enum can be thought of as the analog of 
C's #define - the compiler precalculates the data to fill the 
array, and then copies that into the source code every time it's 
used.

May 22 2017

ag0aep6g <anonymous example.com> writes:

On 05/22/2017 10:58 AM, biocyberman wrote:
  ag0aep6g
 You fell into a trap there. The value is calculated at compile time, 
 but it has >copy/paste-like behavior. That is, whenever you use 
 `chars`, the code behaves as if you >typed out the array literal. That 
 means, the whole array is re-created on every iteration.

 
 Use `static immutable` instead. It still forces compile-time 
 calculation, but it doesn't > have copy/paste behavior. Speeds up 
 revComp3 a lot.

 
 With  'iteration' here you mean running lifetime of the function, or in 
 other words, each one of the 10_000 cycles in the benchmark?

For reference, here is the version of revComp3 I commented on:

----
string revComp3(string bps) {
     const N = bps.length;
     enum chars = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];

     char[] result = new char[N];
     for (int i = 0; i < N; ++i) {
         result[i] = chars[bps[N-i-1]];
     }
     return result.assumeUnique;
}
----

By "iteration" I mean every execution of the body of the `for` loop. For 
every new `i`, a new array is created.

The loop above is equivalent to this:

----
     for (int i = 0; i < N; ++i) {
         result[i] = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'][bps[N-i-1]];
     }
----

Used like that, the array literal

     [Repeat!('A'-'\0', '\0'), 'T',
     Repeat!('C'-'A'-1, '\0'), 'G',
     Repeat!('G'-'C'-1, '\0'), 'C',
     Repeat!('T'-'G'-1, '\0'), 'A']

allocates a new array on every execution of `result[i] = ...;`.

 Could you provide some more reading for what you are telling here? I can 
 only guess it is intrinsic behavior of an 'enum'.

Unfortunately, the spec page (<https://dlang.org/spec/enum.html>) 
doesn't seem to mention this.

But Ali Çehreli covers it in his book on the "immutability" page (I 
would have expected to find it on the "enum" page):

http://ddili.org/ders/d.en/const_and_immutable.html#ix_const_and_immutable.enum

The details can be confusing here. There is an element of 
copy/paste-like behavior, but it's not as simple as taking the 
right-hand side of the enum declaration and substituting it for the 
left-hand name.

The right-hand side is evaluated at compile time. The result of that can 
be thought of as an array literal. It's that array literal that gets 
substituted for the name.

An example with comments:

----
import std.stdio: writeln;

/* f prints a message when called at run time. Then it returns its
argument times ten. */
int f(int x)
{
     if (!__ctfe) writeln("f(", x, ")");
     return x * 10;
}

void main()
{
      /* The following line prints f's messages. The f calls are normal
     run-time calls. Then the writeln prints "false" because each array
     literal creates a new, distinct array.
     */
     writeln([f(1), f(2)] is [f(1), f(2)]); /* false */

     /* The next `enum` line does not print f's messages. The calls go
     through CTFE.
     The `writeln` line afterwards prints "false". ea gets pre-computed
     via CTFE, but the result acts like an array literal. So it's the
     same as writing `writeln([10, 20] is [10, 20]);`.
     */
     enum int[] ea = [f(1), f(2)];
     writeln(ea is ea); /* false */

     /* No messages either with `static immutable`. Like ea, the
     right-hand side goes through CTFE.
     But unlike ea, ia does not act like an array literal. `writeln`
     prints "true".
     */
     static immutable int[] ia = [f(1), f(2)];
     writeln(ia is ia); /* true */
}
----

May 22 2017

biocyberman <biocyberman gmail.com> writes:

On Monday, 22 May 2017 at 10:35:36 UTC, ag0aep6g wrote:
 On 05/22/2017 10:58 AM, biocyberman wrote:
 [...]

 For reference, here is the version of revComp3 I commented on:

 ----
 string revComp3(string bps) {
     const N = bps.length;
     enum chars = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];

 [...]

Very illustrative. I could easily miss and I did miss this subtle 
but important aspect. I wonder how D should widen the 'pit of 
success' that Scott Meyers mentioned about more than once. A take 
home message for myself, if one ever use an array as a lookup 
table, make it 'static immutable. And enum array does not make 
sense'. And in Ali's book:

 Consider the hidden cost of enum arrays and enum associative 
 arrays. Define them as immutable variables if the arrays are 
 large and they are used more than once in the program.

One thing also became clear: 'is' is not '=='. Therefore

writeln([10,20] is [10,20]); /* false */
writeln([10,20] == [10,20]); /* true */

I did not notice that because I haven't come across 'is' so often.

May 22 2017

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 05/22/2017 03:35 AM, ag0aep6g wrote:

 But Ali Çehreli covers it in his book on the "immutability" page (I
 would have expected to find it on the "enum" page):

 

http://ddili.org/ders/d.en/const_and_immutable.html#ix_const_and_immutable.enum

Thank you for your feedback. I'm inserting a forward reference from the 
enum chapter.

I would like to acknowledge you in the book hopefully with your real 
name but if you don't want or care, I will use ag0aep6g. :)

Ali

May 25 2017

ag0aep6g <anonymous example.com> writes:

On 05/25/2017 10:41 PM, Ali Çehreli wrote:
 I would like to acknowledge you in the book hopefully with your real 
 name but if you don't want or care, I will use ag0aep6g. :)

Thanks, but I don't really care for the recognition. If anything, 
please use ag0aep6g.

May 25 2017

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
 On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
 On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 [...]

 Question about your implementation: you assume the input may 
 contain newlines, but don't handle any other non-ACGT 
 characters. The problem definition states 'DNA string' and the 
 sample dataset contains no non-ACGT chars. Is this an 
 oversight my part or yours, or did you just decide to support 
 more than the problem requires?

 [...]

 Firstly, thank you for showing me various solutions, and even 
 cool benchmark code. To answer you questions: Yes I assume the 
 input file would realistically contain newlines, even though 
 the problem does not care about them. I also thought about 
 non-CATG bases, but haven't taken care of those cases. In 
 reality we should deal with at least ambiguous bases (N).

 I ran your code and also see that switch is faster than AA 
 (i.e. revComp0 is the fastest). And Stefan is right about this.
 
 Some follow up questions:

 1. Why do we need to use assumeUnique in 'revComp0' and 
 'revComp3'?

Because `char[] result = new char[N];` is not a string (a.k.a. 
immutable(char)[]).
But because it was created from the GC in this function we know 
that it is safe to assume that is a string.

 2. What is going on with the trick of making chars enum like 
 that in 'revComp3'?

What revComp3 is doing is effectively creating a table for each 
possible value of char that matches the behaviour of the switch.
it could also be rewritten as
```
char[256] chars; // implicitly memset to '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```

Other miscellaneous comments:

If you haven't already checkout 
[BioD](https://github.com/biod/BioD), for most (all?) your 
bioinformatics needs.

If you're trying to be fast you probably don't want to use string 
for internal calculations as it is very entropy non-optimal (2 
bits out of 8 for ACGT, 4 out of 8 for an ambiguous encoding).
  I would have at least 2 "Dictionaries": one the standard 
nucleotides (ACGT) and another for your ambiguous representations 
(UNRYBDHVMKSW-) and the standard nucleotides, to get a better 
information density. If you're doing anything with protein 
sequences then you should use a translation table anyway as the 
DNA -> amino acid mapping changes between species/organelle 
(mt|cp|n)DNA.

May 19 2017

Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:

On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
 I am solving this problem http://rosalind.info/problems/revc/ 
 as an exercise to learn D. This is my solution:

 https://dpaste.dzfl.pl/8aa667f962b7

 Is there some D tricks I can use to make the 
 `reverseComplement` function more concise and speedy? Any other 
 comments for improvement of the whole solution are also much 
 appreciated.

You might try using ldc in release mode.  With dmd, cost of 
ranges and so on might be higher than doing it in a more direct 
low-level way.  With ldc that difference often seems to shrink to 
insignificance as its optimisation is more powerful.  If you care 
about performance then use ldc or gdc.  (I have less experience 
with gdc, but probably similar).  But then it's best to compare 
results under ldc because the ratios of different options will be 
different vs dmd.

May 20 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Code improvement for DNA reverse complement?