www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - fasta parser with iopipe?

reply biocyberman <biocyberman gmail.com> writes:
I lost my momentum to learn D and want to gain it up again. 
Therefore I need some help with this seemingly simple task:

# Fasta sequence


 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
# Some characteristics: - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry start. In this post I have to put an escape character (\) to make the '>' visible. - Headers may contain annotation information separated by some delimiter (i.e. | in this case). - Entry ID and header is a single line, which does not contain newline characters. - Sequence under the header line is [ATCGN\n]* (Perl regex). - A fasta file can be plain-text or gzip compressed. # Goals: Write a parser that uses Dlang range with iopipe library for performance and ease of use. A big fasta file can be dozens of gigabytes. # Questions: 1. How do I model a fasta entry with a struct or class? 2. How to I implement a range of fasta entries with iopipe. A range in this case can be a forward range, but preferably a random access range. 3. I want to do with range to explore the power and elegance of ranges. But if performance is a big concern, what can I do alternatively?
Aug 23
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 23 August 2017 at 09:53:49 UTC, biocyberman wrote:
 I lost my momentum to learn D and want to gain it up again. 
 Therefore I need some help with this seemingly simple task:

 # Fasta sequence


 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
# Some characteristics: - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry start. In this post I have to put an escape character (\) to make the '>' visible. - Headers may contain annotation information separated by some delimiter (i.e. | in this case). - Entry ID and header is a single line, which does not contain newline characters. - Sequence under the header line is [ATCGN\n]* (Perl regex).
(if you know your sequence has no Ns or other ambiguous bases you can can store 4 bases to a byte, or 3 if you want them in triplets)
 - A fasta file can be plain-text or gzip compressed.


 # Goals:
 Write a parser that uses Dlang range with iopipe library for 
 performance and ease of use. A big fasta file can be dozens of 
 gigabytes.

 # Questions:

 1. How do I model a fasta entry with a struct or class?
You could model the headers as a struct if you know the format, but most of the time I think they are just ignored, if you really need them just put them in a string[string]. The real info is in the sequences which are just a string of characters (not utf-8 so i'd go with ubyte or some compressed form). id go with struct FastaEntry { string id; string[string] headers; ubyte[] sequence; } and the a fasta is an array of entries. Note that with the expected usage patterns you probably want this in struct of array (SoA) form (see https://maikklein.github.io/post/soa-d/) given unless you're bring it out you're unlike likely to need more than one field at a time.
 2. How to I implement a range of fasta entries with iopipe. A 
 range in this case can be a forward range, but preferably a 
 random access range.
Given the relative simplicity of the file format I'm don't think it should be too hard. iopipe would give you a streaming interface for it that would work with gzip compression but wouldn't get you a random access range. You can keep the most recent entries for look back but for real random access (e.g. if you're trying to align them) you need them all in an array see above comment about SoA .
 3. I want to do with range to explore the power and elegance of 
 ranges. But if performance is a big concern, what can I do 
 alternatively?
You can probably use it for parsing I'd take a look at the https://github.com/schveiguy/iopipe documentation and see what you can find. OT, I'm taking "systems biology and bioinformatics" this semester so I'll be very interested to see what you do. Partly for the bioinformatics stuff but also I will be giving a (series I hope) of talks to the staff and some students at my university about computation for biology (rather introductory as is the biology department) but still I hope to get them interested in computational biology and D so if you have some good examples please do let me know. Nic
Aug 23
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/23/17 5:53 AM, biocyberman wrote:
 I lost my momentum to learn D and want to gain it up again. Therefore I 
 need some help with this seemingly simple task:
 
 # Fasta sequence
 
 
 \>Entry1_ID header field1|header field2|...
 CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
 TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
 TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
 GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
 \>Entry2_ID header field1|header field4|...
 GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
 TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
 \>Entry3_ID header field1|header field2|...
 GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
 GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
# Some characteristics: - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry start. In this post I have to put an escape character (\) to make the '>' visible. - Headers may contain annotation information separated by some delimiter (i.e. | in this case). - Entry ID and header is a single line, which does not contain newline characters. - Sequence under the header line is [ATCGN\n]* (Perl regex). - A fasta file can be plain-text or gzip compressed. # Goals: Write a parser that uses Dlang range with iopipe library for performance and ease of use. A big fasta file can be dozens of gigabytes. # Questions: 1. How do I model a fasta entry with a struct or class? 2. How to I implement a range of fasta entries with iopipe. A range in this case can be a forward range, but preferably a random access range. 3. I want to do with range to explore the power and elegance of ranges. But if performance is a big concern, what can I do alternatively?
I'll respond to all your questions with what I would do, instead of answering each one. I would suggest an approach similar to how I approached parsing JSON data. In your case, the protocol is even simpler, as there is no nesting. 1. The base layer iopipe should be something that tokenizes the input into reference-based structs. If you look at the jsoniopipe library (https://github.com/schveiguy/jsoniopipe), you can see that the lowest level finds the start of the next JSON token. In your case, it should be looking for >[[:alphanumeric:]] (or possibly just >). This code is pretty straightforward, and roughly corresponds to this: while(cannot find start sequence in stream) stream.extend; make sure you aren't re-doing work that has already been done (i.e. save the last place you looked). Once you have this, you can deduce each packet by the data between the starts. 2. The next layer should validate and parse the data into structs that contain referencing data from the buffer. I recommend not using actual ranges from the buffer, but information on how to build the ranges. The reason for this is that the buffer can move while being streamed by iopipe, so your data could become invalid if you take actual references to the buffer. If you look in the jsoniopipe library, the struct for storing a json item has a start and length, but not a reference to the buffer. Potentially, you could take this mechanism and build an iopipe on top of the buffered data. This iopipe's elements would be the items themselves, with the underlying buffer hidden in the implementation details. Extending would parse out another set of items, releasing would allow those items to get reclaimed (and the underlying stream data). This is something I actually wanted to explore with jsoniopipe but didn't have time before the conference. I probably will still build it. 3. build your real code on top of that layer. What do you want to do with the data? Easiest thing to do for proof of concept is build a range out of the functions. That can allow you to test performance with your lower layers. One of the awesome things about iopipe is testing correctness is really easy -- every string is also an iopipe :) I actually worked with a person at dconf on a similar (maybe identical?) format and explained how it could be done in a very similar way. He was looking to remove data that had a low percentage of correctness (or something like that, not in bioinformatics, so I don't understand the real semantics). With this mechanism in hand, the decompression is pretty easy to chain together with whatever actual stream you have, just use iopipe.zip. Good luck, and email me if you need more help (schveiguy yahoo.com). -Steve
Aug 23
parent reply biocyberman <biocyberman gmail.com> writes:
On Wednesday, 23 August 2017 at 13:06:36 UTC, Steven 
Schveighoffer wrote:
 On 8/23/17 5:53 AM, biocyberman wrote:
 [...]
I'll respond to all your questions with what I would do, instead of answering each one. I would suggest an approach similar to how I approached parsing JSON data. In your case, the protocol is even simpler, as there is no nesting. 1. The base layer iopipe should be something that tokenizes the input into reference-based structs. If you look at the jsoniopipe library (https://github.com/schveiguy/jsoniopipe), you can see that the lowest level finds the start of the next JSON token. In your case, it should be looking for
[...]
This code is pretty straightforward, and roughly corresponds to this: while(cannot find start sequence in stream) stream.extend; make sure you aren't re-doing work that has already been done (i.e. save the last place you looked). Once you have this, you can deduce each packet by the data between the starts. 2. The next layer should validate and parse the data into structs that contain referencing data from the buffer. I recommend not using actual ranges from the buffer, but information on how to build the ranges. The reason for this is that the buffer can move while being streamed by iopipe, so your data could become invalid if you take actual references to the buffer. If you look in the jsoniopipe library, the struct for storing a json item has a start and length, but not a reference to the buffer. Potentially, you could take this mechanism and build an iopipe on top of the buffered data. This iopipe's elements would be the items themselves, with the underlying buffer hidden in the implementation details. Extending would parse out another set of items, releasing would allow those items to get reclaimed (and the underlying stream data). This is something I actually wanted to explore with jsoniopipe but didn't have time before the conference. I probably will still build it. 3. build your real code on top of that layer. What do you want to do with the data? Easiest thing to do for proof of concept is build a range out of the functions. That can allow you to test performance with your lower layers. One of the awesome things about iopipe is testing correctness is really easy -- every string is also an iopipe :) I actually worked with a person at dconf on a similar (maybe identical?) format and explained how it could be done in a very similar way. He was looking to remove data that had a low percentage of correctness (or something like that, not in bioinformatics, so I don't understand the real semantics). With this mechanism in hand, the decompression is pretty easy to chain together with whatever actual stream you have, just use iopipe.zip. Good luck, and email me if you need more help (schveiguy yahoo.com). -Steve
Hi Nic and Steve Thank you both very much for your inputs. I am trying to make use of them. I will try to adapt jsoniopipe for fasta. This is on going and broken code: https://github.com/biocyberman/fastaq . PRs are welcome. Nic: I am too very interested in bringing D to bioinformatics. I will be happy to share information I have. Feel free to email me at vql(.at.)rn.dk and we talk further about it. Steve: Yes we talked at dconf 2017. I had to other things so D learning got slow down. I am trying with Fasta format before jumping to Fastq again. The jsoniopipe is full feature, and relatively small project, which can be used to study case. However there are some aspects I still haven't fully understood. Would I be lucky enough to have you make the current broken code of fastaq to work? :) That will definitely save me time and headache dealing with newbie problems.
Aug 28
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/28/17 10:08 AM, biocyberman wrote:

  Steve: Yes we talked at dconf 2017. I had to other things so D learning 
 got slow down. I am trying with Fasta format before jumping to Fastq 
 again. The jsoniopipe is full feature, and relatively small project, 
 which can be used to study case. However there are some aspects I still 
 haven't fully understood. Would I be lucky enough to have you make the 
 current broken code of fastaq to work? :) That will definitely save me 
 time and headache dealing with newbie problems.
OK, nice to hear from you again :) I have been swamped in free-time land, and iopipe has gone rather stale. I need to get back and finish the docs and unit tests so I can put it on dub. When this is done, I will take a look at your code. Sorry that when I said to replicate the methodology of using javaiopipe I did not mean to port the code :) I meant to take a similar approach in terms of tokenizing and parsing. I would think that a DOM isn't necessary, since it's not a nested format. I will get back to you at some point. -Steve
Aug 30