digitalmars.D - Regex DNA benchmark

bearophile (120/120) Dec 01 2007 (The following notes are relative to DMD V.1.024).

Jascha Wetzel (6/7) Dec 01 2007 i have implemented a regex engine that is supposed to be categorically

bearophile (8/11) Dec 01 2007 - Regarding the Shootout I don't know how much useful it can be, because...

Jascha Wetzel (12/17) Dec 01 2007 unfortunately, the data structures for a DFA based regex engine are too

bearophile (5/8) Dec 01 2007 I think D needs into its std lib a RE engine as flexible and as fast as ...

Jascha Wetzel (9/11) Dec 01 2007 i don't know what exactly most people need. i'm just pretty sure they

Don Clugston (3/14) Dec 02 2007 Actually std.regexp does support non-greedy matching, it's just undocume...

Robert Fraser (4/11) Dec 01 2007 TCL's is a combination DFA/NFA (it uses an NFA engine where possible,

bearophile <bearophileHUGS lycos.com> writes:

(The following notes are relative to DMD V.1.024).
The Shooutout benchmarks are very interesting, they offer many ways to learn
about the languages, how to write fast code, ecc. One of the many good sides of
those benchmarks is that they compare different languages, so you can quickly
see when/where your language is "abnormally" slow, helping you to improve that
too.

This time I have tried the Regex DNA benchmark, it reads a simulated FASTA
file, and processes it with REs. Despite being a toy, this program is related
to real code bioinformatics do often. On the Shootout site there are two D
versions, one of them uses std.regexp, the other PCRE. Even if the PCRE-based
test is slower than the Python/Perl versions (but they use the trick of
splitting the RE, so they can't be fully compared), it shows D will need a much
much faster std.regexp (10 seconds with PCRE, versus 117 seconds with
std.regexp).

I have tried to improve the std.regexp-based version, because I like D for
allowing me to do medium-level programming. To create the D version I have
followed the Python version: when the std library of D is well done, then the D
sourcecode for this benchmark can be as short as the Python one. This gives
what the D blurb offers: >D is a systems programming language. Its focus is on
combining the power and high performance of C and C++ with the programmer
productivity of modern languages like Ruby and Python.<

This is the D code (only 2-space indent), I have left some comments that show
some D problems:

import std.stdio, std.string, std.cstream;
import std.regexp: RegExp, search, resplit = split;

void main() {
  string[] sseq;
  int n;
  char[1 << 15] cbuf;

  // auto seq = din.toString(); // SLOW
  while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0)
    // sseq ~= cbuf[0 .. n][]; // slow
    sseq ~= cbuf[0 .. n].dup;
  auto seq = sseq.join("");
  auto ilen = seq.length;

  //seq = sub(seq, ">.*\n|\n", "", "g"); // SLOW!!
  seq = resplit(seq, ">.*\n|\n").join("");
  int clen = seq.length;

  foreach(p; split("agggtaaa|tttaccct
                    [cgt]gggtaaa|tttaccc[acg]
                    a[act]ggtaaa|tttacc[agt]t
                    ag[act]gtaaa|tttac[agt]ct
                    agg[act]taaa|ttta[agt]cct
                    aggg[acg]aaa|ttt[cgt]ccct
                    agggt[cgt]aa|tt[acg]accct
                    agggta[cgt]a|t[acg]taccct
                    agggtaa[cgt]|[acg]ttaccct")) {
      int m = 0;
      foreach(_; RegExp(p).search(seq))
        m++;
      writefln(p, ' ', m);
  }

  foreach(el; split("B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c)
                     N(a|c|g|t) R(a|g) S(c|g) V(a|c|g) W(a|t) Y(c|t)"))
    // seq = RegExp(el[0..1], "g").replace(seq, el[1..$]); // Slow
    seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]);

  writefln("\n", ilen, "\n", clen, "\n", seq.length);
}

Note: if you don't belive my tests and my words you can try replacing the fast
things with the slow commented-out ones. (The input data for this program is a
FASTA file, you can generate it with the fasta benchmark on the Shootout site
itself).


The din.toString() is very slow; in the code I have replaced it with that
readBlock+join. If Walter wants to make din.toString() faster he may use
something like:

string readAllStdin() {
  string[] seq;
  size_t len, n, len_tot;
  char[1 << 15] cbuf;

  std.gc.disable; // can be nested
  seq.length = 1; // Initial guess
  while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0) {
    if (len == seq.length)
      seq.length = seq.length * 2;
    seq[len] = cbuf[0 .. n].dup;
    len++;
    len_tot += n;
  }

  // Join seq
  string result;
  if (len_tot) {
    size_t pos;
    result.length = len_tot;

    foreach (word; seq) {
      result[pos .. pos + word.length] = word;
      pos += word.length;
    }
  }
  std.gc.enable;
  return result;
}


somestring.join("") when the joining sting is empty can become something faster
like:

TyElem[] join(TyElem)(TyElem[][] words) {
  TyElem[] result;

  if (words.length) {
    size_t len, pos;
    foreach(word; words)
      len += word.length;

    result.length = len;

    foreach (word; words) {
      result[pos .. pos + word.length] = word;
      pos += word.length;
    }
  }
  return result;
}


From my tests it seems that:
sseq ~= cbuf[0 .. n][];
is slower than:
sseq ~= cbuf[0 .. n].dup;


This is probably a bug of std.regexp, the sub() seems O(n^2):
seq = sub(seq, ">.*\n|\n", "", "g");
So I have used:
seq = resplit(seq, ">.*\n|\n").join("");


I have also used:
seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]);
because this seems a bit slower:
// seq = RegExp(el[0..1], "g").replace(seq, el[1..$]);


This is a Python version (as the fast Perl version it splits some REs to speed
up, I haven't done it in the D code because it slows down the D code):

import re, sys

seq = sys.stdin.read()
ilen = len(seq)

seq = re.sub('>.*', '', seq)
seq = re.sub('\n', '', seq)
clen = len(seq)

for f in ('agggtaaa|tttaccct',
          '[cgt]gggtaaa|tttaccc[acg]',
          'a[act]ggtaaa|tttacc[agt]t',
          'ag[act]gtaaa|tttac[agt]ct',
          'agg[act]taaa|ttta[agt]cct',
          'aggg[acg]aaa|ttt[cgt]ccct',
          'agggt[cgt]aa|tt[acg]accct',
          'agggta[cgt]a|t[acg]taccct',
          'agggtaa[cgt]|[acg]ttaccct'):
    s, r = f.split('|')
    print f, len(re.findall(s, seq) + re.findall(r, seq))

sbs = "B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c) N(a|c|g|t) R(a|g) S(c|g)
V(a|c|g) W(a|t) Y(c|t)"

for el in sbs.split():
    seq = re.sub(el[0], el[1:], seq)

print "\n%d\n%d\n%d" % (ilen, clen, len(seq))


The findall() is quite useful, and other tidbits show that the API of
std.regexp (despite being good already) may be improved still a bit.

Bye,
bearophile

Dec 01 2007

Jascha Wetzel <firstname mainia.de> writes:

bearophile wrote:
 On the Shootout site there are two D versions, one of them uses std.regexp,
the other PCRE. Even if the PCRE-based test is slower than the Python/Perl
versions (but they use the trick of splitting the RE, so they can't be fully
compared), it shows D will need a much much faster std.regexp (10 seconds with
PCRE, versus 117 seconds with std.regexp).

i have implemented a regex engine that is supposed to be categorically 
faster than the one in Phobos (it's part of the apaged package). i will 
do some polishing soon, in order to make it useful as a library and do 
some benchmarking. i'm pretty confident that it can improve on that 
shootout-benchmark, not only because it can compile regexps to D code.

Dec 01 2007

bearophile <bearophileHUGS lycos.com> writes:

Jascha Wetzel:
 i will do some polishing soon, in order to make it useful as a library and do 
 some benchmarking. i'm pretty confident that it can improve on that 
 shootout-benchmark, not only because it can compile regexps to D code.

- Regarding the Shootout I don't know how much useful it can be, because that
site accepes only a really limited number of external libs, and generally
accepts only a single sourcecode (maybe your code spits out D code that can be
inserted in a single D sourcecode, but I don't know if the Shootout site
accepts that).
- Regarding the compilation to D code, does it work only at compile time? Or
does it call DMD at runtime?
- Regarding the generation of normal code, that later can be compiled, time ago
I have found this, to use REs in the lwc compiler:
http://students.ceid.upatras.gr/~sxanth/lwc/ex15.html
- There are TONS of open source RE engines around (the RE of TCL is very fast).
I presume we can find one of them with a license fit for D, so we can just copy
it into the std lib of D without writing yet another (slow, buggy, limited,
etc) one.

Bye,
bearophile

Dec 01 2007

Jascha Wetzel <firstname mainia.de> writes:

bearophile wrote:
 - Regarding the Shootout I don't know how much useful it can be, because that
site accepes only a really limited number of external libs, and generally
accepts only a single sourcecode (maybe your code spits out D code that can be
inserted in a single D sourcecode, but I don't know if the Shootout site
accepts that).
 - Regarding the compilation to D code, does it work only at compile time? Or
does it call DMD at runtime?

unfortunately, the data structures for a DFA based regex engine are too 
complex to (realistically) implement it with compile-time means.
it compile the regex to a DFA and interprets it at runtime, or generates 
D code from it, that can be compiled along with the rest of the program.

 - Regarding the generation of normal code, that later can be compiled, time
ago I have found this, to use REs in the lwc compiler:
 http://students.ceid.upatras.gr/~sxanth/lwc/ex15.html
 - There are TONS of open source RE engines around (the RE of TCL is very
fast). I presume we can find one of them with a license fit for D, so we can
just copy it into the std lib of D without writing yet another (slow, buggy,
limited, etc) one.

most regex engines are more general parsers that use backtracking and 
are slow by design (that includes Perl's and Java's engine, PCRE, and 
many more). that's why i bothered to implement yet another engine.
my engine uses the same approach as TRE, one of the very few engines out 
there that don't do backtracking (i think TCL's is another one).
here are details on these issues:
http://swtch.com/~rsc/regexp/regexp1.html

Dec 01 2007

bearophile <bearophileHUGS lycos.com> writes:

Jascha Wetzel:

 most regex engines are more general parsers that use backtracking and 
 are slow by design (that includes Perl's and Java's engine, PCRE, and 
 many more). that's why i bothered to implement yet another engine.

I think D needs into its std lib a RE engine as flexible and as fast as
Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, you can do
mostly the same things with them. People use the Perl REs every day).
Then later people can add any kind of faster & more specialized RE engines
(like your one). Don't you agree? :-)

Bye,
bearophile

Dec 01 2007

Jascha Wetzel <firstname mainia.de> writes:

bearophile wrote:
 I think D needs into its std lib a RE engine as flexible and as fast as
Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, you can do
mostly the same things with them. People use the Perl REs every day).
 Then later people can add any kind of faster & more specialized RE engines
(like your one). Don't you agree? :-)

i don't know what exactly most people need. i'm just pretty sure they 
need a faster implementation than the current one. std.regexp also lacks 
a few fundamental features (like non-greedy matching).
syntax-wise i certainly agree, that one should stick to either the unix 
or perl syntax.
i don't propagate my implementation as a replacement for std.regexp, i 
just wanted to mention it for people who need fast regexps with most 
standard features.

Dec 01 2007

Don Clugston <dac nospam.com.au> writes:

Jascha Wetzel wrote:
 bearophile wrote:
 I think D needs into its std lib a RE engine as flexible and as fast 
 as Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, 
 you can do mostly the same things with them. People use the Perl REs 
 every day).
 Then later people can add any kind of faster & more specialized RE 
 engines (like your one). Don't you agree? :-)

 
 i don't know what exactly most people need. i'm just pretty sure they 
 need a faster implementation than the current one. std.regexp also lacks 
 a few fundamental features (like non-greedy matching).

Actually std.regexp does support non-greedy matching, it's just undocumented.
<g>

Dec 02 2007

Robert Fraser <fraserofthenight gmail.com> writes:

Jascha Wetzel wrote:
 most regex engines are more general parsers that use backtracking and 
 are slow by design (that includes Perl's and Java's engine, PCRE, and 
 many more). that's why i bothered to implement yet another engine.
 my engine uses the same approach as TRE, one of the very few engines out 
 there that don't do backtracking (i think TCL's is another one).
 here are details on these issues:
 http://swtch.com/~rsc/regexp/regexp1.html

TCL's is a combination DFA/NFA (it uses an NFA engine where possible, 
and defaults to a DFA for constructs that can't be correctly process in 
an NFA).

Dec 01 2007

D Programming

C/C++ Programming

Other

digitalmars.D - Regex DNA benchmark