
digitalmars.D - "Regular Expression Matching Can Be Simple And Fast (but...) "

reply spir <denis.spir gmail.com> writes:
Hello,

"Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, 
PHP, Python, Ruby, ...)"
A *very* interesting and well written article about slow & fast regex engines, 
why and how:
http://swtch.com/~rsc/regexp/regexp1.html

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
next sibling parent Trass3r <un known.com> writes:
Yep, finite automata FTW!
Same goes for non-regexp string matching, see Aho-Corasick algorithm.
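For the curious, the trie-plus-failure-links construction behind Aho-Corasick can be sketched quite compactly. A rough Python illustration (function names and encoding are mine, not from any particular library):

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build a trie over the patterns, then add BFS failure links,
    turning it into a matching automaton (one pass over the text)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())   # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]    # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    """Report (start_index, pattern) for every occurrence in one scan."""
    goto, fail, out = build_aho_corasick(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # fall back instead of rescanning
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Like the regex DFA approach, the text pointer only ever moves forward: the failure links replace any backing up.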
Feb 24 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/24/11 8:26 AM, spir wrote:
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in
 Java, Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis

An often-quoted article. It turns out Thompson automata are 2x slower than Perl's for many real-world inputs, so it should be taken with a grain of salt.

Andrei
Feb 24 2011
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"spir" <denis.spir gmail.com> wrote in message 
news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, 
 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex 
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

Yea, I've come across that before. There's definitely good stuff in there, and it's much better written than a lot of things I've seen. But one thing I really don't like about it is that there's a fundamental relationship going on that he merely implies throughout the article and never clearly states right up front:

---------------------------
A regex engine first needs to convert the regex to an NFA, but then there are three ways it can proceed:

1. Execute the NFA directly, which involves making guesses at certain points and then backing up if the guesses turn out wrong. This is what Perl does. Unfortunately, all this backing up and retrying makes it scale very poorly on certain complex regexes.

2. Convert the NFA to a DFA and execute the DFA. Optionally, the resulting DFA can then be optimized further. There's the extra upfront overhead of the conversion, but executing the DFA (even if it isn't optimized) never involves any guessing or backing up to retry, so it scales well.

3. Thompson NFA: Execute the NFA as if it were the equivalent DFA. You're essentially executing the DFA *as* you generate it from the NFA (although the resulting DFA doesn't actually get stored).

Note, of course, that all this applies to any use of an NFA, not just regex.
---------------------------

Those few little paragraphs are the whole damn *point* of the article, but the only way to get that take-away is to stitch it together by reading the entire multi-page text. Without that "big picture" I had a much harder time than necessary understanding it the first time I read it, even though I was already aware that NFAs can be converted to DFAs.

Actually, that's one of the things I really hate about academic writings in general (not that this article is particularly academic, as far as academic-themed stuff goes): There's rarely a decent overview. You're expected to implicitly piece together the big picture *as* you learn the details. Which, of course, is stupid because details mean practically nothing without proper context. Context *helps* you understand the details, plus it's less to be trying to learn all at once.
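The "execute the DFA *as* you generate it" idea in point 3 fits in a few lines. A minimal sketch in Python (the NFA encoding here is my own hypothetical one, not taken from the article):

```python
def eps_closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for t in eps.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def run_nfa(text, start, accept, delta, eps):
    """Lockstep (Thompson) simulation: track the *set* of live NFA states.
    Each set is one DFA state, computed on the fly and thrown away.
    No guessing, no backing up, so the run is linear in len(text)."""
    current = eps_closure({start}, eps)
    for ch in text:
        step = set()
        for s in current:
            step.update(delta.get((s, ch), ()))
        current = eps_closure(step, eps)
    return accept in current

# Hand-built NFA for the regex a?a?aa: states 0..4, accept state 4.
eps = {0: [1], 1: [2]}                  # each a? may be skipped
delta = {(0, 'a'): [1], (1, 'a'): [2],  # the two optional a's
         (2, 'a'): [3], (3, 'a'): [4]}  # the two required a's
```

Matching "aa" through "aaaa" succeeds and anything shorter or longer fails, with each input character examined exactly once regardless of how the optionals pan out.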
Feb 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 02/24/2011 11:43 PM, Nick Sabalausky wrote:
 Actually, that's one of the things I really hate about academic writings in
 general (not that this article is particularly academic, as far as
 academic-themed stuff goes): There's rarely a decent overview. You're
 expected to implicitly piece together the big picture *as* you learn the
 details. Which, of course, is stupid because details mean practically
 nothing without proper context. Context *helps* you understand the details,
 plus it's less to be trying to learn all at once.

Actually, couldn't you say that about 90% of all technical writings, including purely practical ones (or ones supposed to be), among which much of the D doc? "You're expected to implicitly piece together the big picture *as* you learn the details." That's the feeling I get, at least. I'm all the time guessing (esp. the /purpose/ of this or that feature or sub-feature).

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-02-24 at 17:43 -0500, Nick Sabalausky wrote:
[ . . . ]

Send feedback to Russ Cox?

[ . . . ]

 Actually, that's one of the things I really hate about academic writings in
 general (not that this article is particularly academic, as far as
 academic-themed stuff goes):
I take exception to this category of "academic writing" and to lumping everything of this sort into that category to be slagged off. Just because some academics can't author good documents doesn't imply all have this failing. I would hazard a guess, though, that a far greater percentage of software developers in industry and commerce are unable to write good articles.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 24 2011
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
The author is one of Go's developers. I wonder how fast the current
regexp implementation in Go is.

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, 
 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex 
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis
 -- 
 _________________
 vita es estrany
 spir.wikidot.com
 

Feb 25 2011
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:

On Fri, 2011-02-25 at 09:37 +0100, Paulo Pinto wrote:
 The author is one of Go's developers. I wonder how fast the current
 regexp implementation in Go is.
Is there a small microbenchmark you want to try out? I guess doing a C or C++ / D / Go comparison is appropriate! (Better than debating rhetorical questions :-) I'd be up for participating.

We should note, though, that the standard Go toolchain is known not to be highly optimized; it is in fact quite slow. This is a known state of affairs, and is reasonable given the ongoing evolution of the semantics of the language and libraries just now. They will soon have to put high optimization on the road map. Of course, there is gccgo, which should benefit from all the GNU optimization goodness.
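One possible shape for such a microbenchmark, shown here in Python purely to illustrate the harness structure (the helper name and test strings are my own invention; each language port would time the same pattern over the same text):

```python
import re
import time

def bench(pattern, text, iterations=1000):
    """Compile once, then time repeated scans of the text.
    Returns (matches_found_per_scan, total_seconds)."""
    rx = re.compile(pattern)
    count = 0
    start = time.perf_counter()
    for _ in range(iterations):
        count = len(rx.findall(text))
    elapsed = time.perf_counter() - start
    return count, elapsed

# An email-shaped pattern similar to the ones discussed in this thread.
count, secs = bench(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}',
                    'contact: first.last@example.com or other@test.org')
```

Keeping compilation outside the timed loop matters; otherwise you end up comparing regex compilers rather than regex engines.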
 "spir" <denis.spir gmail.com> wrote in message=20
 news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Jav=


 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex=


 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis
 --=20
 _________________
 vita es estrany
 spir.wikidot.com
=20

=20

-- 
Russel.
Feb 25 2011
parent "Paulo Pinto" <pjmlp progtools.org> writes:
Hi,

I know my way around C and C++ quite well, but I am afraid I lack the
knowledge to create a high-performance regexp engine that would do justice
to the chosen language.

--
Paulo

"Russel Winder" <russel russel.org.uk> wrote in message 
news:mailman.1943.1298624548.4748.digitalmars-d puremagic.com... 
Feb 25 2011
prev sibling parent PeteC <petevik38 yahoo.com.au> writes:
A few months ago I was using D on a project that involved filtering and
transforming file paths. Using D helped a lot, but I was a bit frustrated
to discover there were a number of bugs in the standard regex library. I
was able to fix a couple, which I reported in bugzilla, but I couldn't
understand the source code well enough to deal with all of them.

After reading Russ Cox's articles, I implemented an experimental regular
expression engine based on the operations and execution methods
mentioned in the http://swtch.com/~rsc/regexp/regexp2.html article.

It's not yet completed, but I've put it up on github if anyone is
interested in looking at it:
https://github.com/PeteChadwick/pcregexeng

Bear in mind that at this stage it's just an experiment and isn't
intended to be used as a regex library. Comments, suggestions and
criticisms are welcome.

The library includes a Thompson style machine, and a very basic
recursive backtracking machine. I tried a few simple benchmarks of these
machines, alongside std.regex.match. The results are below:

Regex test

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 100000
Text length: 20
Text: User@domain.name.com
lockstep  (Email): 2699 ticks  1415.05 ticks/MB (User@domain.name.com)
backtrack (Email): 109 ticks 57.1474 ticks/MB (User@domain.name.com)
std.regex (Email): 250 ticks 131.072 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 210020
Text: not-an-email-address not-an-email-address not-an-e...
lockstep  (Email prefix 1): 2074 ticks  1035.5 ticks/MB (User@domain.name.com)
backtrack (Email prefix 1): 468 ticks 233.66 ticks/MB (User@domain.name.com)
std.regex (Email prefix 1): 1342 ticks 670.026 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 1048596
Text: ||||||||||||||||||||||||||||||||||||||||||||||||||...
lockstep  (Email prefix 2): 5944 ticks  594.389 ticks/MB (User@domain.name.com)
backtrack (Email prefix 2): 717 ticks 71.6986 ticks/MB (User@domain.name.com)
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB

Regex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaa'
Iterations: 1
Text length: 18
Text: aaaaaaaaaaaaaaaaaa
lockstep  (Pathological): 0 ticks  0 ticks/MB (aaaaaaaaaaaaaaaaaa)
backtrack (Pathological): 47 ticks 2.73795e+06 ticks/MB (aaaaaaaaaaaaaaaaaa)
std.regex (Pathological): 14586 ticks 8.49696e+08 ticks/MB

In terms of ticks/MB the Thompson machine implementation is quite a lot
slower in these tests, but it shows more consistency and is able to deal
with the pathological case.
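The pathological row is Russ Cox's a?^n a^n example. A toy backtracking matcher (my own sketch, handling only literal characters and a trailing '?') makes the blowup mechanism concrete:

```python
def backtrack_match(pattern, text):
    """Naive backtracking: at each 'x?' guess "consume" first, then back up
    and retry "skip". Against a?a?...a?aaa...a on a string of a's, the
    failed guesses multiply, giving exponentially many recursive calls."""
    def match(pi, ti):
        if pi == len(pattern):
            return ti == len(text)
        optional = pi + 1 < len(pattern) and pattern[pi + 1] == '?'
        if optional:
            # Guess: take the optional char...
            if ti < len(text) and text[ti] == pattern[pi] and match(pi + 2, ti + 1):
                return True
            # ...and back up to retry without it when that fails.
            return match(pi + 2, ti)
        return (ti < len(text) and text[ti] == pattern[pi]
                and match(pi + 1, ti + 1))
    return match(0, 0)
```

With 'a?' * 18 + 'a' * 18 against eighteen a's, this sketch explores an exponential number of guess combinations before finding the one that works, while a lockstep/Thompson machine scans the same input once per character, which is exactly the contrast the table above shows.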
Feb 27 2011