digitalmars.D.learn - Parsing and splitting textfile

Hugo Florentino <hugo acdam.cu> writes:

Hello,

Can you point me to an efficient way to parse a text file and split it 
by a certain expression (for example, `\n\nFrom\ .+ .+$`), copying what 
has already been read to a separate file, and so on till the end of the 
file?

I am trying to implement a mailbox to maildir format conversion 
application in D, but I would like to avoid loading each mbox completely 
into memory.

Regards, Hugo

Feb 24 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Feb 2014 13:52:45 -0500, Hugo Florentino <hugo acdam.cu> wrote:

 Hello,

 Can you point me to an efficient way to parse a text file and split it  
 by a certain expression (for example, `\n\nFrom\ .+ .+$`), copying what
 has already been read to a separate file, and so on till the end of the  
 file?

 I am trying to implement a mailbox to maildir format conversion  
 application in D, but I would like to avoid loading each mbox completely  
 into memory.

 Regards, Hugo

std.regex

-Steve

Feb 24 2014

Justin Whear <justin economicmodeling.com> writes:

On Mon, 24 Feb 2014 14:00:09 -0500, Steven Schveighoffer wrote:

 On Mon, 24 Feb 2014 13:52:45 -0500, Hugo Florentino <hugo acdam.cu>
 wrote:
 
 Hello,

 Can you point me to an efficient way to parse a text file and split it
 by a certain expression (for example, `\n\nFrom\ .+ .+$`), copying what
 has already been read to a separate file, and so on till the end of the
 file?

 I am trying to implement a mailbox to maildir format conversion
 application in D, but I would like to avoid loading each mbox
 completely into memory.

 Regards, Hugo

 
 std.regex
 
 -Steve

Specifically std.regex.splitter[1] creates a lazy range over the input.  
You can couple this with lazy file reading (e.g.
`File("mailbox").byChunk(1024).joiner`).
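
A minimal sketch of the lazy splitting, assuming the text is already available as an in-memory string (as far as I know, the std.regex functions expect string input rather than an arbitrary range such as the result of joiner). `firstMessages` is a hypothetical helper name, and the sample text is made up. The lookahead `(?=From )` is one way to keep the `From ` line attached to the following piece instead of consuming it as part of the separator.

```d
import std.array : array;
import std.range : take;
import std.regex : regex, splitter;
import std.stdio : writeln;

/// Returns at most `n` leading pieces of `text`, split on a blank line that
/// is followed by "From ". Hypothetical helper, for illustration only.
string[] firstMessages(string text, size_t n)
{
    // The lookahead is zero-width, so "From " is not part of the separator.
    auto re = regex(`\n\n(?=From )`);
    // splitter is lazy: take(n) stops matching once n pieces were produced.
    return text.splitter(re).take(n).array;
}

void main()
{
    auto text = "From a\nx\n\nFrom b\ny\n\nFrom c\nz\n";
    foreach (msg; firstMessages(text, 2))
        writeln("--- message ---\n", msg);
}
```

Each piece is a slice of the original string, so nothing is copied until the caller decides to.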

Justin

Feb 24 2014

Hugo Florentino <hugo acdam.cu> writes:

On Mon, 24 Feb 2014 19:08:16 +0000 (UTC), Justin Whear wrote:
 Specifically std.regex.splitter[1] creates a lazy range over the input.
 You can couple this with lazy file reading (e.g.
 `File("mailbox").byChunk(1024).joiner`).


Interesting, thanks.

Feb 24 2014

Hugo Florentino <hugo acdam.cu> writes:

On Mon, 24 Feb 2014 19:08:16 +0000 (UTC), Justin Whear wrote:
 Specifically std.regex.splitter[1] creates a lazy range over the input.
 You can couple this with lazy file reading (e.g.
 `File("mailbox").byChunk(1024).joiner`).

Would something like this work? (I cannot test it right now)

auto themailbox = args[1];
immutable uint chunksize = 1024 * 64;
static auto re = regex(`\n\nFrom .+ .+$`);
auto mailbox;
auto mail;
while (mailbox = File(themailbox).byChunk(chunksize).joiner) != EOF)
{
   mail = splitter(mailbox, re);
}

If so, I have a couple of further questions:

Using splitter actually removes the matched expression from the string. How 
could I reinsert it at the beginning of each resulting string in an 
efficient way (i.e. avoiding copying something which is already loaded 
in memory)?

I see that the splitter function returns a struct. How could I 
progressively dump each resulting string to disk, removing it from the 
struct, so that it does not end up having the full mailbox 
loaded into memory, in this case as a struct?

Regards, Hugo

Feb 24 2014

Justin Whear <justin economicmodeling.com> writes:

On Mon, 24 Feb 2014 15:19:06 -0500, Hugo Florentino wrote:

 On Mon, 24 Feb 2014 19:08:16 +0000 (UTC), Justin Whear wrote:
 Specifically std.regex.splitter[1] creates a lazy range over the input.
 You can couple this with lazy file reading (e.g.
 `File("mailbox").byChunk(1024).joiner`).

 Would something like this work? (I cannot test it right now)
 
 auto themailbox = args[1];
 immutable uint chunksize = 1024 * 64;
 static auto re = regex(`\n\nFrom .+ .+$`);
 auto mailbox;
 auto mail;
 while (mailbox = File(themailbox).byChunk(chunksize).joiner) != EOF) {
    mail = splitter(mailbox, re);
 }
 
 If so, I have a couple of further questions:
 
 Using splitter actually removes the matched expression from the string. How
 could I reinsert it at the beginning of each resulting string in an
 efficient way (i.e. avoiding copying something which is already loaded
 in memory)?
 
 I see that the splitter function returns a struct. How could I
 progressively dump each resulting string to disk, removing it from the
 struct, so that it does not end up having the full mailbox
 loaded into memory, in this case as a struct?
 
 Regards, Hugo

The code you've posted won't work, primarily because you don't need to 
loop over the file-reading range, nor will it ever "return" EOF.  Also, 
if you don't actually want to remove the regex matches, you can just use 
the matchAll function.  Here's some _untested_ sample code to set you on 
the right track.

--------------------------------------------------------
import std.algorithm, std.range, std.stdio, std.regex;
void main(string[] args)
{
    auto mailboxPath = args[1];
    immutable size_t chunksize = 1024 * 64;
    auto re = regex(`\n\nFrom .+ .+$`);
    // you might want to try using ctRegex

    auto mailStarts = File(mailboxPath).byChunk(chunksize).joiner
                          .matchAll(re);
}
--------------------------------------------------------

This code won't actually do any work--no data will be loaded from the 
file (caveat: the first chunk might be prefetched, not sure), no matches 
will actually be performed.  If you use `take(10)` on the mailStarts 
variable, the code will load only as much of the file (down to the 
granularity of chunksize) as is needed to find the first 10 instances of 
the regular expression.  The regex matches will not copy, but rather 
provide slices over the data that is in memory.

And, thinking about this further, you don't want to use my code either, 
partly because byChunk reuses its buffer, partly because the functions in 
std.regex provide slices over the input data.  I think what you'll want 
to do is: load the data from File lazily, chunk by chunk, and scan each 
chunk with the regex.  If you don't find a match, copy that data into an 
"overlap" buffer and repeat.  If you do find a match, then the contents of 
the overlap buffer plus the slice up to the current match is one mail; 
rinse and repeat.  You should be able to encapsulate all of this in a 
clean, lazy range, but I don't have the time right now to work out whether 
it can be done by simply composing existing functions from Phobos.
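
The chunk-plus-overlap-buffer approach described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested converter: `splitMbox` and its `sink` callback are hypothetical names, and the separator is simplified to the fixed pattern `\n\nFrom ` (the original expression also constrained the rest of the line). Each chunk is copied into a carry buffer because byChunk reuses its own buffer, and the cut is placed so that the `From ` line stays at the start of the next message instead of being discarded.

```d
import std.regex : matchFirst, regex;
import std.stdio : File, writefln;

/// Feeds `chunks` (a range of byte or char arrays) through a carry buffer and
/// calls `sink` once per message. Hypothetical helper, not a Phobos function.
void splitMbox(Chunks)(Chunks chunks, scope void delegate(const(char)[]) sink)
{
    // Assumed, simplified separator: a blank line followed by "From ".
    auto sep = regex(`\n\nFrom `);
    enum sepLen = 7;     // length of the text the separator matches
    char[] pending;      // unconsumed data carried across chunks
    size_t scanFrom = 0; // data before this index is known to be match-free

    foreach (chunk; chunks)
    {
        // Appending copies the bytes, which matters for byChunk's reused buffer.
        pending ~= cast(const(char)[]) chunk;
        for (;;)
        {
            auto m = matchFirst(pending[scanFrom .. $], sep);
            if (m.empty)
                break;
            // Cut after the first '\n' so "From ..." opens the next message.
            immutable cut = scanFrom + m.pre.length + 1;
            sink(pending[0 .. cut]);
            pending = pending[cut + 1 .. $]; // skip the blank-line '\n'
            scanFrom = 0;
        }
        // A separator may straddle two chunks; rescan its possible prefix.
        scanFrom = pending.length >= sepLen ? pending.length - (sepLen - 1) : 0;
    }
    if (pending.length)
        sink(pending); // whatever is left is the last message
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    size_t n;
    splitMbox(File(args[1]).byChunk(64 * 1024), (const(char)[] msg) {
        // A real converter would write msg to its own maildir file here.
        writefln("message %s: %s bytes", ++n, msg.length);
    });
}
```

With this shape, memory use should depend on the chunk size and the largest single message rather than on the size of the whole mailbox.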

Justin

Feb 24 2014

Hugo Florentino <hugo acdam.cu> writes:

On Mon, 24 Feb 2014 14:00:09 -0500, Steven Schveighoffer wrote:
 std.regex

I should have explained myself better.
I have already used regular expressions a couple of times. My question 
here is how to parse the file progressively, without loading it completely 
into memory.
If this can be done solely with std.regex, please clarify further.

I was thinking of using byLine, but for that, I see I must first use 
something like:

auto myfile = File(usermbox);

Doesn't that load the whole file into memory?

Regards, Hugo

Feb 24 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Feb 2014 14:17:14 -0500, Hugo Florentino <hugo acdam.cu> wrote:

 On Mon, 24 Feb 2014 14:00:09 -0500, Steven Schveighoffer wrote:
 std.regex

 I should have explained myself better.
 I have already used regular expressions a couple of times. My question here
 is how to parse the file progressively, without loading it completely into
 memory.

OK, I did not understand that.

 If this can be done solely with std.regex, please clarify further.

I'm not completely sure, Justin may have a solution.

 I was thinking of using byLine, but for that, I see I must first use
 something like:

 auto myfile = File(usermbox);

 Doesn't that load the whole file into memory?

I do know the answer to this, and it's no. File wraps a buffered C FILE *, 
so constructing it only opens the file.
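
To illustrate that opening a File does not read its contents, here is a small sketch that streams a mailbox with byLine and only counts message starts. `countMessages` is a hypothetical helper, and treating a line that begins with `From ` after a blank line (or at the very start of the input) as a message boundary is an assumption about the mbox layout.

```d
import std.algorithm.searching : startsWith;
import std.stdio : File, writeln;

/// Counts message starts in a range of lines; only the current line is
/// looked at, so the input never has to be in memory as a whole.
size_t countMessages(Lines)(Lines lines)
{
    size_t count;
    bool prevBlank = true; // the start of the input counts as a boundary
    foreach (line; lines)
    {
        if (prevBlank && line.startsWith("From "))
            ++count;
        prevBlank = line.length == 0;
    }
    return count;
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    // Constructing the File opens it; byLine then pulls data on demand.
    writeln(countMessages(File(args[1]).byLine));
}
```

The same loop could write each line to the current output file and switch files at a boundary, which keeps the `From ` line without any regex.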

-Steve

Feb 24 2014
