www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Fast regular expressions

reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Here is an alpha release of a regular expression engine i'm working on:
http://mainia.de/tdfaregexp-0.1-alpha.zip
(artistic license 2.0)

It implements a slightly modified version of Ville Laurikari's tagged
NFA method that allows for regular expression parsing in linear time as
opposed to the very common backtracking method that requires exponential
time (worst case).

As of now, the engine supports greedy and non-greedy operators,
submatches and character classes. It compiles a regexp to a TDFA from
which it can either be applied directly to an input string or compiled
to D code.
Supported operators: . | ( ) ? * + ?? *? +?

The regexp-to-D compilation cannot be done at compile-time, though,
because the required data structures are too complex to be represented
in tuples or strings.

Compile regexp.d and run it for example like this:
regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
This will output the TNFA, TDFA and D Code and try to match the 3 input
strings.
The file regextest.d is an example of how to use the generated code.

There is still a lot of room for optimization, but i think that
categorically, such a natively compiled, linear time regexp matcher
should be the fastest possible.

Comments and bug reports are encouraged ;)
Apr 14 2007
next sibling parent reply "Aziz K." <aziz.kerim gmail.com> writes:
Hi Jascha,

Since I found an important feature lacking in Phobos' regex  
implementation, I want to ask you: does your library support matching over  
new-lines?

In any case, good job on writing an improved regex library!

Regards.
Apr 15 2007
next sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"Aziz K." <aziz.kerim gmail.com> wrote in message 
news:op.tqtketww545v36 vaio...
 Hi Jascha,

 Since I found an important feature lacking in Phobos' regex 
 implementation, I want to ask you: does your library support matching over 
 new-lines?

I think phobos' version does support this by using the attribute 'g': char[] s = "strap a \nbla to bla\n"; auto ss = sub(s, " \nb", "ZZ", "g"); // result: strap aZZla to bla Or I must have understood what you meant? L.
Apr 15 2007
parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
 Or I must have understood what you meant?

Or I must have MISunderstood.. blergh :S L.
Apr 15 2007
prev sibling parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
thanks!

what exactly do you mean by "over new-lines"?
phobos' regexps will match newlines if stated explicitly. it can also
treat the input as multiple lines.

my engine doesn't care about special chars, yet. the point is merely to
improve on the algorithmic side. all the details with assertions,
special chars and charsets will be filled in when all the operators work.

Aziz K. wrote:
 Hi Jascha,
 
 Since I found an important feature lacking in Phobos' regex
 implementation, I want to ask you: does your library support matching
 over new-lines?
 
 In any case, good job on writing an improved regex library!
 
 Regards.

Apr 15 2007
parent reply "Aziz K." <aziz.kerim gmail.com> writes:
 what exactly do you mean by "over new-lines"?
 phobos' regexps will match newlines if stated explicitly. it can also
 treat the input as multiple lines.

What I mean is that the dot meta character in Phobos' regexps doesn't = match a new-line. For example: char[] s =3D "<p>ab\ncd</p>"; auto m =3D search(s, "<p>(.+)</p>", "gmi"); writefln(m.match(1)); // no match...
 my engine doesn't care about special chars, yet. the point is merely t=

 improve on the algorithmic side. all the details with assertions,
 special chars and charsets will be filled in when all the operators wo=

Yeah, I noticed that when I looked through your code a bit.
Apr 15 2007
parent Jascha Wetzel <"[firstname]" mainia.de> writes:
ah, ic. that will be parameterized later on so you can choose whether
to have the dot match it or not.
you can easily patch it into the alpha by changing charclass.d line 76 to
static const CharClass!(char_t) any_char = {parts: [{l:32, r:127},
{l:10,r:10}, {l:13,r:13}]};


Aziz K. wrote:
 what exactly do you mean by "over new-lines"?
 phobos' regexps will match newlines if stated explicitly. it can also
 treat the input as multiple lines.

What I mean is that the dot meta character in Phobos' regexps doesn't match a new-line. For example: char[] s = "<p>ab\ncd</p>"; auto m = search(s, "<p>(.+)</p>", "gmi"); writefln(m.match(1)); // no match...
 my engine doesn't care about special chars, yet. the point is merely to
 improve on the algorithmic side. all the details with assertions,
 special chars and charsets will be filled in when all the operators work.

Yeah, I noticed that when I looked through your code a bit.

Apr 15 2007
prev sibling next sibling parent reply Jeff <jeffrparsons optusnet.com.au> writes:
I wonder if the Tango team would be interested in incorporating this 
work into their library? I was disappointed to see they've adopted the 
Phobos regular expression module, given I had some pretty serious gripes 
with it (seemingly erroneous behavior - I don't think this was ever 
resolved).
Apr 15 2007
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Jeff wrote:

 I wonder if the Tango team would be interested in incorporating this
 work into their library? I was disappointed to see they've adopted the
 Phobos regular expression module, given I had some pretty serious gripes
 with it (seemingly erroneous behavior - I don't think this was ever
 resolved).

I'd say we're likely to be interested once Jascha's library reaches a more stable state. We are aware that current implementation (which is the same in both Phobos and Tango) has issues. As with everything that goes into Tango, someone needs to write it, and mantain it. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Apr 15 2007
prev sibling parent Jascha Wetzel <"[firstname]" mainia.de> writes:
this is also somewhat meant to be a response to Walter's post a few
weeks ago:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=47838

Jeff wrote:
 I wonder if the Tango team would be interested in incorporating this
 work into their library? I was disappointed to see they've adopted the
 Phobos regular expression module, given I had some pretty serious gripes
 with it (seemingly erroneous behavior - I don't think this was ever
 resolved).

Apr 15 2007
prev sibling next sibling parent reply Davidl <Davidl 126.com> writes:
will u please add some feature of fuzzy matching? like
with user definable error char number.
for example,
in fuzzy match
people can match ab in abc or acb with 1char errorness
and they can match ab in acdb or aefb with 2 char errorness

 Here is an alpha release of a regular expression engine i'm working on:
 http://mainia.de/tdfaregexp-0.1-alpha.zip
 (artistic license 2.0)

 It implements a slightly modified version of Ville Laurikari's tagged
 NFA method that allows for regular expression parsing in linear time as
 opposed to the very common backtracking method that requires exponential
 time (worst case).

 As of now, the engine supports greedy and non-greedy operators,
 submatches and character classes. It compiles a regexp to a TDFA from
 which it can either be applied directly to an input string or compiled
 to D code.
 Supported operators: . | ( ) ? * + ?? *? +?

 The regexp-to-D compilation cannot be done at compile-time, though,
 because the required data structures are too complex to be represented
 in tuples or strings.

 Compile regexp.d and run it for example like this:
 regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
 This will output the TNFA, TDFA and D Code and try to match the 3 input
 strings.
 The file regextest.d is an example of how to use the generated code.

 There is still a lot of room for optimization, but i think that
 categorically, such a natively compiled, linear time regexp matcher
 should be the fastest possible.

 Comments and bug reports are encouraged ;)

Apr 15 2007
parent Jeff <jeffrparsons optusnet.com.au> writes:
It seems to me that supporting fuzzy matching in a general regular 
expression library would add a whole lot of complexity that wouldn't be 
used by many people. I think it would be hard to justify this.

Were you really adamant about having regular expression matching, or 
more focussed on simple fuzzy matching? There are probably some 
libraries already that can search for fuzzy matches pretty well.

If you can't find anything that meets your needs, you might consider 
having a look at http://en.wikipedia.org/wiki/Bitap_algorithm and 
http://en.wikipedia.org/wiki/Levenshtein_distance - they could be useful 
if you decide to write your own.
Apr 16 2007
prev sibling parent yidabu <yidabu.nospamneed yidabu.com> writes:
great work! 


Jascha Wetzel <[firstname] mainia.de> Wrote:

 Here is an alpha release of a regular expression engine i'm working on:
 http://mainia.de/tdfaregexp-0.1-alpha.zip
 (artistic license 2.0)
 
 It implements a slightly modified version of Ville Laurikari's tagged
 NFA method that allows for regular expression parsing in linear time as
 opposed to the very common backtracking method that requires exponential
 time (worst case).
 
 As of now, the engine supports greedy and non-greedy operators,
 submatches and character classes. It compiles a regexp to a TDFA from
 which it can either be applied directly to an input string or compiled
 to D code.
 Supported operators: . | ( ) ? * + ?? *? +?
 
 The regexp-to-D compilation cannot be done at compile-time, though,
 because the required data structures are too complex to be represented
 in tuples or strings.
 
 Compile regexp.d and run it for example like this:
 regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
 This will output the TNFA, TDFA and D Code and try to match the 3 input
 strings.
 The file regextest.d is an example of how to use the generated code.
 
 There is still a lot of room for optimization, but i think that
 categorically, such a natively compiled, linear time regexp matcher
 should be the fastest possible.
 
 Comments and bug reports are encouraged ;)

May 19 2007