digitalmars.D.learn - fasta parser with iopipe?

biocyberman (25/36) Aug 23 2017 I lost my momentum to learn D and want to gain it up again.

Nicholas Wilson (41/77) Aug 23 2017 (if you know your sequence has no Ns or other ambiguous bases you
Steven Schveighoffer (46/88) Aug 23 2017 I'll respond to all your questions with what I would do, instead of

biocyberman (19/75) Aug 28 2017 Hi Nic and Steve

Steven Schveighoffer (12/19) Aug 30 2017 OK, nice to hear from you again :)

biocyberman <biocyberman gmail.com> writes:

I lost my momentum to learn D and want to gain it up again. 
Therefore I need some help with this seemingly simple task:




 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA

 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA



- Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry 
start. In this post I have to put an escape character (\) to make 
the '>' visible.
- Headers may contain annotation information separated by some 
delimiter (i.e. | in this case).
- Entry ID and header is a single line, which does not contain 
newline characters.
- Sequence under the header line is [ATCGN\n]* (Perl regex).
- A fasta file can be plain-text or gzip compressed.



Write a parser that uses Dlang range with iopipe library for 
performance and ease of use. A big fasta file can be dozens of 
gigabytes.



1. How do I model a fasta entry with a struct or class?
2. How to I implement a range of fasta entries with iopipe. A 
range in this case can be a forward range, but preferably a 
random access range.
3. I want to do with range to explore the power and elegance of 
ranges. But if performance is a big concern, what can I do 
alternatively?

Aug 23 2017

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Wednesday, 23 August 2017 at 09:53:49 UTC, biocyberman wrote:
 I lost my momentum to learn D and want to gain it up again. 
 Therefore I need some help with this seemingly simple task:




 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA

 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA



 - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry 
 start. In this post I have to put an escape character (\) to 
 make the '>' visible.
 - Headers may contain annotation information separated by some 
 delimiter (i.e. | in this case).
 - Entry ID and header is a single line, which does not contain 
 newline characters.
 - Sequence under the header line is [ATCGN\n]* (Perl regex).

(if you know your sequence has no Ns or other ambiguous bases you 
can can store 4 bases to a byte, or 3 if you want them in 
triplets)

 - A fasta file can be plain-text or gzip compressed.



 Write a parser that uses Dlang range with iopipe library for 
 performance and ease of use. A big fasta file can be dozens of 
 gigabytes.



 1. How do I model a fasta entry with a struct or class?

You could model the headers as a struct if you know the format, 
but most of the time I think they are just ignored, if you really 
need them just put them in a string[string].
The real info is in the sequences which are just a string of 
characters (not utf-8 so i'd go with ubyte or some compressed 
form).

id go with

struct FastaEntry
{
     string id;
     string[string] headers;
     ubyte[] sequence;
}

and the a fasta is an array of entries. Note that with the 
expected usage patterns you probably want this in struct of array 
(SoA) form (see https://maikklein.github.io/post/soa-d/) given 
unless you're bring it out you're unlike likely to need more than 
one field at a time.

 2. How to I implement a range of fasta entries with iopipe. A 
 range in this case can be a forward range, but preferably a 
 random access range.

Given the relative simplicity of the file format I'm don't think 
it should be too hard. iopipe would give you a streaming 
interface for it that would work with gzip compression but 
wouldn't get you a random access range. You can keep the most 
recent entries for look back but for real random access (e.g. if 
you're trying to align them) you need them all in an array see 
above comment about SoA .

 3. I want to do with range to explore the power and elegance of 
 ranges. But if performance is a big concern, what can I do 
 alternatively?

You can probably use it for parsing I'd take a look at the 
https://github.com/schveiguy/iopipe documentation and see what 
you can find.

OT, I'm taking "systems biology and bioinformatics" this semester 
so I'll be very interested to see what you do. Partly for the 
bioinformatics stuff but also I will be giving a (series I hope) 
of talks to the staff and some students at my university about 
computation for biology (rather introductory as is the biology 
department) but still I hope to get them interested in 
computational biology and D so if you have some good examples 
please do let me know.

Nic

Aug 23 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 8/23/17 5:53 AM, biocyberman wrote:
 I lost my momentum to learn D and want to gain it up again. Therefore I 
 need some help with this seemingly simple task:
 

 
 
 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA

 
 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA

 

 
 - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry start. In 
 this post I have to put an escape character (\) to make the '>' visible.
 - Headers may contain annotation information separated by some delimiter 
 (i.e. | in this case).
 - Entry ID and header is a single line, which does not contain newline 
 characters.
 - Sequence under the header line is [ATCGN\n]* (Perl regex).
 - A fasta file can be plain-text or gzip compressed.
 
 

 Write a parser that uses Dlang range with iopipe library for performance 
 and ease of use. A big fasta file can be dozens of gigabytes.
 

 
 1. How do I model a fasta entry with a struct or class?
 2. How to I implement a range of fasta entries with iopipe. A range in 
 this case can be a forward range, but preferably a random access range.
 3. I want to do with range to explore the power and elegance of ranges. 
 But if performance is a big concern, what can I do alternatively?

I'll respond to all your questions with what I would do, instead of 
answering each one.

I would suggest an approach similar to how I approached parsing JSON 
data. In your case, the protocol is even simpler, as there is no nesting.

1. The base layer iopipe should be something that tokenizes the input 
into reference-based structs. If you look at the jsoniopipe library 
(https://github.com/schveiguy/jsoniopipe), you can see that the lowest 
level finds the start of the next JSON token. In your case, it should be 
looking for >[[:alphanumeric:]] (or possibly just >).

This code is pretty straightforward, and roughly corresponds to this:
while(cannot find start sequence in stream)
     stream.extend;
make sure you aren't re-doing work that has already been done (i.e. save 
the last place you looked).

Once you have this, you can deduce each packet by the data between the 
starts.

2. The next layer should validate and parse the data into structs that 
contain referencing data from the buffer. I recommend not using actual 
ranges from the buffer, but information on how to build the ranges. The 
reason for this is that the buffer can move while being streamed by 
iopipe, so your data could become invalid if you take actual references 
to the buffer. If you look in the jsoniopipe library, the struct for 
storing a json item has a start and length, but not a reference to the 
buffer.

Potentially, you could take this mechanism and build an iopipe on top of 
the buffered data. This iopipe's elements would be the items themselves, 
with the underlying buffer hidden in the implementation details. 
Extending would parse out another set of items, releasing would allow 
those items to get reclaimed (and the underlying stream data).

This is something I actually wanted to explore with jsoniopipe but 
didn't have time before the conference. I probably will still build it.

3. build your real code on top of that layer. What do you want to do 
with the data? Easiest thing to do for proof of concept is build a range 
out of the functions. That can allow you to test performance with your 
lower layers. One of the awesome things about iopipe is testing 
correctness is really easy -- every string is also an iopipe :)

I actually worked with a person at dconf on a similar (maybe identical?) 
format and explained how it could be done in a very similar way. He was 
looking to remove data that had a low percentage of correctness (or 
something like that, not in bioinformatics, so I don't understand the 
real semantics).

With this mechanism in hand, the decompression is pretty easy to chain 
together with whatever actual stream you have, just use iopipe.zip.

Good luck, and email me if you need more help (schveiguy yahoo.com).

-Steve

Aug 23 2017

biocyberman <biocyberman gmail.com> writes:

On Wednesday, 23 August 2017 at 13:06:36 UTC, Steven 
Schveighoffer wrote:
 On 8/23/17 5:53 AM, biocyberman wrote:
 [...]

 I'll respond to all your questions with what I would do, 
 instead of answering each one.

 I would suggest an approach similar to how I approached parsing 
 JSON data. In your case, the protocol is even simpler, as there 
 is no nesting.

 1. The base layer iopipe should be something that tokenizes the 
 input into reference-based structs. If you look at the 
 jsoniopipe library (https://github.com/schveiguy/jsoniopipe), 
 you can see that the lowest level finds the start of the next 
 JSON token. In your case, it should be looking for
[...]

 This code is pretty straightforward, and roughly corresponds to 
 this:
 while(cannot find start sequence in stream)
     stream.extend;
 make sure you aren't re-doing work that has already been done 
 (i.e. save the last place you looked).

 Once you have this, you can deduce each packet by the data 
 between the starts.

 2. The next layer should validate and parse the data into 
 structs that contain referencing data from the buffer. I 
 recommend not using actual ranges from the buffer, but 
 information on how to build the ranges. The reason for this is 
 that the buffer can move while being streamed by iopipe, so 
 your data could become invalid if you take actual references to 
 the buffer. If you look in the jsoniopipe library, the struct 
 for storing a json item has a start and length, but not a 
 reference to the buffer.

 Potentially, you could take this mechanism and build an iopipe 
 on top of the buffered data. This iopipe's elements would be 
 the items themselves, with the underlying buffer hidden in the 
 implementation details. Extending would parse out another set 
 of items, releasing would allow those items to get reclaimed 
 (and the underlying stream data).

 This is something I actually wanted to explore with jsoniopipe 
 but didn't have time before the conference. I probably will 
 still build it.

 3. build your real code on top of that layer. What do you want 
 to do with the data? Easiest thing to do for proof of concept 
 is build a range out of the functions. That can allow you to 
 test performance with your lower layers. One of the awesome 
 things about iopipe is testing correctness is really easy -- 
 every string is also an iopipe :)

 I actually worked with a person at dconf on a similar (maybe 
 identical?) format and explained how it could be done in a very 
 similar way. He was looking to remove data that had a low 
 percentage of correctness (or something like that, not in 
 bioinformatics, so I don't understand the real semantics).

 With this mechanism in hand, the decompression is pretty easy 
 to chain together with whatever actual stream you have, just 
 use iopipe.zip.

 Good luck, and email me if you need more help 
 (schveiguy yahoo.com).

 -Steve

Hi Nic and Steve

Thank you both very much for your inputs. I am trying to make use 
of them. I will try to adapt jsoniopipe for fasta. This is on 
going and broken code: https://github.com/biocyberman/fastaq . 
PRs are welcome.

 Nic:
I am too very interested in bringing D to bioinformatics. I will 
be happy to share information I have. Feel free to email me at 
vql(.at.)rn.dk and we talk further about it.

 Steve: Yes we talked at dconf 2017. I had to other things so D 
learning got slow down. I am trying with Fasta format before 
jumping to Fastq again. The jsoniopipe is full feature, and 
relatively small project, which can be used to study case. 
However there are some aspects I still haven't fully understood. 
Would I be lucky enough to have you make the current broken code 
of fastaq to work? :) That will definitely save me time and 
headache dealing with newbie problems.

Aug 28 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 8/28/17 10:08 AM, biocyberman wrote:

  Steve: Yes we talked at dconf 2017. I had to other things so D learning 
 got slow down. I am trying with Fasta format before jumping to Fastq 
 again. The jsoniopipe is full feature, and relatively small project, 
 which can be used to study case. However there are some aspects I still 
 haven't fully understood. Would I be lucky enough to have you make the 
 current broken code of fastaq to work? :) That will definitely save me 
 time and headache dealing with newbie problems.

OK, nice to hear from you again :)

I have been swamped in free-time land, and iopipe has gone rather stale. 
I need to get back and finish the docs and unit tests so I can put it on 
dub.

When this is done, I will take a look at your code. Sorry that when I 
said to replicate the methodology of using javaiopipe I did not mean to 
port the code :) I meant to take a similar approach in terms of 
tokenizing and parsing. I would think that a DOM isn't necessary, since 
it's not a nested format.

I will get back to you at some point.

-Steve

Aug 30 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - fasta parser with iopipe?