www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - parsing fastq files with D

reply eastanon <biorelated gmail.com> writes:
Fastq is a format for storing DNA sequences together with the 
associated quality information often encoded in ascii characters. 
It is typically made of 4 lines  for example 2 fastq entries 
would look like this.

 seq1
TTATTTTAAT
+
?+BBB/DHH 
 seq2
GACCCTTTGCA
+
?+BHB/DIH 

I do not have a lot of D expirience and I am writing a simple 
parser to help work with  these files. Ideally it should be fast 
with low memory footprint. I am working with very large files of 
this type and can be  up to 1GB.

module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{

     string sequence;
     string quals;
     string name;
}

auto Records(string filename){

     static auto toRecords(S)(S str){

         auto res = findSplitBefore(str,"+\n");

         auto seq = res[0];
         auto qual = res[1];

         return Record(seq,qual);
     }

     string text = cast(string)std.file.read(filename);

     enforce(text.length > 0 && text[0] == ' ');
     text = text[1 .. $];

     auto entries = splitter(text,' ');

     return map!toRecords(entries);
}

The issue with this is that the "+" character can be part of the 
quality information and I am using it to split the quality 
information from the sequence information. and ends up splitting 
the quality information which is wrong.
Ideally I do not want to use regex and I have heard of ragel for 
parsing but never used it. Such a solution would also be welcome, 
since I read it can be very fast.

Which is the idiomatic way to capture, sequence name (starts with 
  character and the first entry) the sequence, (line2) the 
quality scores( line 4)
Mar 23 2016
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
As a little fun thing to do I implemented it for you.

It won't allocate. Making this perfect for you.
With a bit of work you could make Result have buffers for result instead 
of using the input array allow for the source to be an input range itself.

I made this up on dpaste and single quotes were not playing nicely 
there. So you'll see "\r"[0] as a workaround.

struct FastQRecord {
	const(char)[] sequenceId;
	const(char)[] sequenceLetters;
	const(char)[] quality;
	
	static auto parse(const(char)[] from) {
		struct Result {
			private {
				const(char)[] source;
				FastQRecord value;
				bool isEmpty;
			}
			
			this(const(char)[] source) {
				this.source = source;
				popFront;
			}
			
			 property {
				FastQRecord front() {
					return value;
				}
				
				bool empty() {
					return isEmpty;
				}
			}
			
			void popFront() {
				import std.string : indexOf;
				
				if (source is null) {
					isEmpty = true;
					return;
				}
				
				void tidyInput() {
					foreach(i, c; source) {
						switch(c) {
							case 0: .. case ' ':
								break;
							default:
								source = source[i .. $];
								return;
						}
					}
					
					source = null;
				}
				
				tidyInput();
				
				if (source is null)
					return;
				
				// sequenceId
				
				assert(source[0] == ' ');
				
				ptrdiff_t len = source.indexOf("\n");
				assert(len > 0);
				
				value.sequenceId = source[1 .. len];
				if (value.sequenceId[$-1] == "\r"[0])
					value.sequenceId = value.sequenceId[0 .. $-1];
					
				source = source[len + 1 .. $];
				
				// sequenceLetters
				
				len = source.indexOf("\n");
				assert(len > 0);
				
				value.sequenceLetters = source[0 .. len];
				if (value.sequenceLetters[$-1] == "\r"[0])
					value.sequenceLetters = value.sequenceLetters[0 .. $-1];
					
				source = source[len + 1 .. $];
				
				// +sequenceId
				
				len = source.indexOf("\n");
				assert(len > 0);
				source = source[len + 1 .. $];
				
				// quality
				
				len = source.indexOf("\n");
				assert(len > 0);
				
				value.quality = source[0 .. len];
				if (value.quality[$-1] == "\r"[0])
					value.quality = value.quality[0 .. $-1];
				
				if (source.length > len + 1) {
					source = source[len + 1 .. $];
					tidyInput();
				} else
					source = null;
			}
		}
		
		return Result(from);
	}
}

void main() {}

unittest {
	string input = """
 seq1
TTATTTTAAT
+
?+BBB/DHH 
 seq2
GACCCTTTGCA
+
?+BHB/DIH 
 SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65		
"""[1 .. $];
		
	foreach(record; FastQRecord.parse(input)) {
		import std.stdio;
		writeln(record);
	}
}

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
Mar 23 2016
next sibling parent reply eastanon <biorelated gmail.com> writes:
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
wrote:
 As a little fun thing to do I implemented it for you.

 It won't allocate. Making this perfect for you.
 With a bit of work you could make Result have buffers for 
 result instead of using the input array allow for the source to 
 be an input range itself.

 I made this up on dpaste and single quotes were not playing 
 nicely there. So you'll see "\r"[0] as a workaround.
Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Mar 24 2016
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 24/03/16 9:24 PM, eastanon wrote:
 On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:
 As a little fun thing to do I implemented it for you.

 It won't allocate. Making this perfect for you.
 With a bit of work you could make Result have buffers for result
 instead of using the input array allow for the source to be an input
 range itself.

 I made this up on dpaste and single quotes were not playing nicely
 there. So you'll see "\r"[0] as a workaround.
Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Yup, any string usage gets allocated in memory. Since the way I designed parse there, it won't allocate during returning of a record. Just don't go around modifying that memory or keeping copies (not critical but I wouldn't). There are better ways to read the input file. For example byChunk would read it in smaller groups of say 4096 chars. But you would need to overload that with support for e.g. lines parsing. The way I'd go is memory mapped files as an input range. After all, let the OS help you out with deciding when to put the file and parts of it into memory. After all, if you have 1gb files you want to parse. Could you just as easily have 2tb worth? Probably since it is DNA after all. You kinda don't want that all in memory at once if you know what I mean. --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
Mar 24 2016
prev sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 24 March 2016 at 08:24:15 UTC, eastanon wrote:
 On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
 wrote:
 As a little fun thing to do I implemented it for you.

 It won't allocate. Making this perfect for you.
 With a bit of work you could make Result have buffers for 
 result instead of using the input array allow for the source 
 to be an input range itself.

 I made this up on dpaste and single quotes were not playing 
 nicely there. So you'll see "\r"[0] as a workaround.
Thank you very much. I think you have exposed me to a number of new concepts that I will go through and annotate the code with. I read all input from file as follows. string text = cast(string)std.file.read(inputfile); foreach(record;FastQRecord.parse(text)){ writeln(record); } </naivequestion>Does this mean that text is allocated to memory? and is there a better way to read and process the inputfile? </naivequestion>
Yes, it's read into your processes memory. You can use std.mmfile [1] to make things a bit more efficient. It will, too, read the data into memory, but it will do so in a way (memory mapping) that only loads what is actually accessed (everything in your case), and that allows the operating system to efficiently release and reload parts of it if memory runs low. Unfortunately there is no example in the documentation, but it works like this (untested): import std.mmfile; auto file = new MmFile(inputfile); string text = cast(string) file[]; ... [1] http://dlang.org/phobos/std_mmfile.html
Mar 24 2016
parent eastanon <biorelated gmail.com> writes:
On Thursday, 24 March 2016 at 13:38:32 UTC, Marc Sch├╝tz wrote:
 Yes, it's read into your processes memory. You can use 
 std.mmfile [1] to make things a bit more efficient. It will, 
 too, read the data into memory, but it will do so in a way 
 (memory mapping) that only loads what is actually accessed 
 (everything in your case), and that allows the operating system 
 to efficiently release and reload parts of it if memory runs 
 low.

 Unfortunately there is no example in the documentation, but it 
 works like this (untested):

 import std.mmfile;
 auto file = new MmFile(inputfile);
 string text = cast(string) file[];
 ...

 [1] http://dlang.org/phobos/std_mmfile.html
That is very clever. Thank you for the tip and I have implemented it and it works. I feel like it is a safer way of reading a file.
Mar 24 2016
prev sibling parent reply eastanon <biorelated gmail.com> writes:
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
wrote:
 			void popFront() {
 				import std.string : indexOf;
 				
 				if (source is null) {
 					isEmpty = true;
 					return;
 				}
 				
 				void tidyInput() {
 					foreach(i, c; source) {
 						switch(c) {
 							case 0: .. case ' ':
 								break;
 							default:
 								source = source[i .. $];
 								return;
 						}
 					}
 					
 					source = null;
 				}
 				
 				tidyInput();
Do you mind to explain what is really going on at popFront() and tidyInput() functions?
Mar 29 2016
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 30/03/2016 1:46 AM, eastanon wrote:
 On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:
             void popFront() {
                 import std.string : indexOf;

                 if (source is null) {
                     isEmpty = true;
                     return;
                 }

                 void tidyInput() {
                     foreach(i, c; source) {
                         switch(c) {
                             case 0: .. case ' ':
                                 break;
                             default:
                                 source = source[i .. $];
                                 return;
                         }
                     }

                     source = null;
                 }

                 tidyInput();
Do you mind to explain what is really going on at popFront() and tidyInput() functions?
popFront is a special function used with input ranges. It has support in the language for e.g. foreach loops. You also need empty and front to go along with it however. With front returning a value and empty if there is any more items. The if statement in popFront is just to make empty set correctly (delayed). Now tidyInput basically just trims (on the left) the source and sets it to null if there is no more of it. That way things like " Myid" will still work fine. If you need an explanation about the rest of popFront, let me know!
Mar 29 2016