
digitalmars.D - "Regular Expression Matching Can Be Simple And Fast (but...) "

reply spir <denis.spir gmail.com> writes:
Hello,

"Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, 
PHP, Python, Ruby, ...)"
A *very* interesting and well written article about slow & fast regex engines, 
why and how:
http://swtch.com/~rsc/regexp/regexp1.html

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
next sibling parent Trass3r <un known.com> writes:
Yep, finite automata FTW!
Same goes for non-regexp string matching, see Aho-Corasick algorithm.
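For the curious, the trie-plus-failure-links construction behind Aho-Corasick can be sketched quite compactly. A rough Python illustration (function names and encoding are mine, not from any particular library):

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build a trie over the patterns, then add BFS failure links,
    turning it into a matching automaton (one pass over the text)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())   # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]    # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    """Report (start_index, pattern) for every occurrence in one scan."""
    goto, fail, out = build_aho_corasick(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # fall back instead of rescanning
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Like the regex DFA approach, the text pointer only ever moves forward: the failure links replace any backing up.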
Feb 24 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/24/11 8:26 AM, spir wrote:
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in
 Java, Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis

An often-quoted article. It turns out Thompson automata are 2x slower than Perl's for many real-world inputs, so it should be taken with a grain of salt.

Andrei
Feb 24 2011
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"spir" <denis.spir gmail.com> wrote in message 
news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, 
 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex 
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

Yea, I've come across that before. There's definitely good stuff in there, and it's much better written than a lot of things I've seen. But one thing I really don't like about it is that there's a fundamental relationship going on that he merely implies throughout the article and never clearly states right up front:

---------------------------
A regex engine first needs to convert the regex to an NFA, but then there are three ways it can proceed:

1. Execute the NFA directly, which involves making guesses at certain points and then backing up if the guesses turn out wrong. This is what Perl does. Unfortunately, all this backing up and retrying makes it scale very poorly on certain complex regexes.

2. Convert the NFA to a DFA and execute the DFA. Optionally, the resulting DFA can then be optimized further. There's the extra upfront overhead of the conversion, but executing the DFA (even if it isn't optimized) never involves any guessing or backing up to retry, so it scales well.

3. Thompson NFA: Execute the NFA as if it were the equivalent DFA. You're essentially executing the DFA *as* you generate it from the NFA (although the resulting DFA doesn't actually get stored).

Note, of course, that all this applies to any use of an NFA, not just regex.
---------------------------

Those few little paragraphs are the whole damn *point* of the article, but the only way to get that take-away is to stitch it together by reading the entire multi-page text. Without that "big picture" I had a much harder time than necessary understanding it the first time I read it, even though I was already aware that NFAs can be converted to DFAs.

Actually, that's one of the things I really hate about academic writings in general (not that this article is particularly academic, as far as academic-themed stuff goes): There's rarely a decent overview. You're expected to implicitly piece together the big picture *as* you learn the details. Which, of course, is stupid because details mean practically nothing without proper context. Context *helps* you understand the details, plus it's less to be trying to learn all at once.
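The "execute the DFA *as* you generate it" idea in point 3 fits in a few lines. A minimal sketch in Python (the NFA encoding here is my own hypothetical one, not taken from the article):

```python
def eps_closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for t in eps.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def run_nfa(text, start, accept, delta, eps):
    """Lockstep (Thompson) simulation: track the *set* of live NFA states.
    Each set is one DFA state, computed on the fly and thrown away.
    No guessing, no backing up, so the run is linear in len(text)."""
    current = eps_closure({start}, eps)
    for ch in text:
        step = set()
        for s in current:
            step.update(delta.get((s, ch), ()))
        current = eps_closure(step, eps)
    return accept in current

# Hand-built NFA for the regex a?a?aa: states 0..4, accept state 4.
eps = {0: [1], 1: [2]}                  # each a? may be skipped
delta = {(0, 'a'): [1], (1, 'a'): [2],  # the two optional a's
         (2, 'a'): [3], (3, 'a'): [4]}  # the two required a's
```

Matching "aa" through "aaaa" succeeds and anything shorter or longer fails, with each input character examined exactly once regardless of how the optionals pan out.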
Feb 24 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 02/24/2011 11:43 PM, Nick Sabalausky wrote:
 Actually, that's one of the things I really hate about academic writings in
 general (not that this article is particularly academic, as far as
 academic-themed stuff goes): There's rarely a decent overview. You're
 expected to implicitly piece together the big picture *as* you learn the
 details. Which, of course, is stupid because details mean practically
 nothing without proper context. Context *helps* you understand the details,
 plus it's less to be trying to learn all at once.

Actually, couldn't you say that about 90% of all technical writings, including purely practical ones (or ones supposed to be), among which much of the D doc? "You're expected to implicitly piece together the big picture *as* you learn the details." That's the feeling I get, at least. I'm all the time guessing (esp. the /purpose/ of this or that feature or sub-feature).

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-02-24 at 17:43 -0500, Nick Sabalausky wrote:
[ . . . ]

Send feedback to Russ Cox?

[ . . . ]

 Actually, that's one of the things I really hate about academic writings in
 general (not that this article is particularly academic, as far as
 academic-themed stuff goes):
I take exception to this category of "academic writing" and to lumping everything of this sort into that category to be slagged off. Just because some academics can't author good documents doesn't imply all have this failing. I would hazard a guess, though, that a far greater percentage of software developers in industry and commerce are unable to write good articles.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 24 2011
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
The author is one of Go's developers. I wonder how fast the current
regexp implementation in Go is.

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, 
 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex 
 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis
 -- 
 _________________
 vita es estrany
 spir.wikidot.com
 

Feb 25 2011
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:

On Fri, 2011-02-25 at 09:37 +0100, Paulo Pinto wrote:
 The author is one of Go's developers. I wonder how fast the current
 regexp implementation in Go is.
Is there a small microbenchmark you want to try out? I guess doing a C or C++ / D / Go comparison is appropriate! (Better than debating rhetorical questions :-) I'd be up for participating.

We should note, though, that the standard Go toolchain is known not to be highly optimized; it is in fact quite slow. This is a known state of affairs, and is reasonable given the ongoing evolution of the semantics of the language and libraries just now. They will soon have to put high optimization on the road map. Of course, there is gccgo, which should benefit from all the GNU optimization goodness.
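One possible shape for such a microbenchmark, shown here in Python purely to illustrate the harness structure (the helper name and test strings are my own invention; each language port would time the same pattern over the same text):

```python
import re
import time

def bench(pattern, text, iterations=1000):
    """Compile once, then time repeated scans of the text.
    Returns (matches_found_per_scan, total_seconds)."""
    rx = re.compile(pattern)
    count = 0
    start = time.perf_counter()
    for _ in range(iterations):
        count = len(rx.findall(text))
    elapsed = time.perf_counter() - start
    return count, elapsed

# An email-shaped pattern similar to the ones discussed in this thread.
count, secs = bench(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}',
                    'contact: first.last@example.com or other@test.org')
```

Keeping compilation outside the timed loop matters; otherwise you end up comparing regex compilers rather than regex engines.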
 "spir" <denis.spir gmail.com> wrote in message=20
 news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
 Hello,

 "Regular Expression Matching Can Be Simple And Fast (but is slow in Jav=


 Perl, PHP, Python, Ruby, ...)"
 A *very* interesting and well written article about slow & fast regex=


 engines, why and how:
 http://swtch.com/~rsc/regexp/regexp1.html

 Denis
 --=20
 _________________
 vita es estrany
 spir.wikidot.com
=20

=20

-- 
Russel.
Feb 25 2011
parent "Paulo Pinto" <pjmlp progtools.org> writes:
Hi,

I know my way around C and C++ quite well, but I am afraid I lack the
knowledge to create a high-performance regexp engine that would do justice
to the chosen language.

--
Paulo

"Russel Winder" <russel russel.org.uk> wrote in message 
news:mailman.1943.1298624548.4748.digitalmars-d puremagic.com... 
Feb 25 2011
prev sibling parent PeteC <petevik38 yahoo.com.au> writes:
A few months ago I was using D on a project that involved filtering and
transforming file paths. Using D helped a lot, but I was a bit frustrated
to discover there were a number of bugs in the standard regex library. I
was able to fix a couple, which I reported in bugzilla, but I couldn't
understand the source code well enough to deal with all of them.

After reading Russ Cox's articles, I implemented an experimental regular
expression engine based on the operations and execution methods
mentioned in the http://swtch.com/~rsc/regexp/regexp2.html article.

It's not yet completed, but I've put it up on github if anyone is
interested in looking at it:
https://github.com/PeteChadwick/pcregexeng

Bear in mind that at this stage it's just an experiment and isn't
intended to be used as a regex library. Comments, suggestions and
criticisms are welcome.

The library includes a Thompson style machine, and a very basic
recursive backtracking machine. I tried a few simple benchmarks of these
machines, alongside std.regex.match. The results are below:

Regex test

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 100000
Text length: 20
Text: User@domain.name.com
lockstep  (Email): 2699 ticks  1415.05 ticks/MB (User@domain.name.com)
backtrack (Email): 109 ticks 57.1474 ticks/MB (User@domain.name.com)
std.regex (Email): 250 ticks 131.072 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 210020
Text: not-an-email-address not-an-email-address not-an-e...
lockstep  (Email prefix 1): 2074 ticks  1035.5 ticks/MB (User@domain.name.com)
backtrack (Email prefix 1): 468 ticks 233.66 ticks/MB (User@domain.name.com)
std.regex (Email prefix 1): 1342 ticks 670.026 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 1048596
Text: ||||||||||||||||||||||||||||||||||||||||||||||||||...
lockstep  (Email prefix 2): 5944 ticks  594.389 ticks/MB (User@domain.name.com)
backtrack (Email prefix 2): 717 ticks 71.6986 ticks/MB (User@domain.name.com)
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB

Regex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaa'
Iterations: 1
Text length: 18
Text: aaaaaaaaaaaaaaaaaa
lockstep  (Pathological): 0 ticks  0 ticks/MB (aaaaaaaaaaaaaaaaaa)
backtrack (Pathological): 47 ticks 2.73795e+06 ticks/MB (aaaaaaaaaaaaaaaaaa)
std.regex (Pathological): 14586 ticks 8.49696e+08 ticks/MB

In terms of ticks/MB the Thompson machine implementation is quite a lot
slower in these tests, but it shows more consistency and is able to deal
with the pathological case.
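The pathological row is Russ Cox's a?^n a^n example. A toy backtracking matcher (my own sketch, handling only literal characters and a trailing '?') makes the blowup mechanism concrete:

```python
def backtrack_match(pattern, text):
    """Naive backtracking: at each 'x?' guess "consume" first, then back up
    and retry "skip". Against a?a?...a?aaa...a on a string of a's, the
    failed guesses multiply, giving exponentially many recursive calls."""
    def match(pi, ti):
        if pi == len(pattern):
            return ti == len(text)
        optional = pi + 1 < len(pattern) and pattern[pi + 1] == '?'
        if optional:
            # Guess: take the optional char...
            if ti < len(text) and text[ti] == pattern[pi] and match(pi + 2, ti + 1):
                return True
            # ...and back up to retry without it when that fails.
            return match(pi + 2, ti)
        return (ti < len(text) and text[ti] == pattern[pi]
                and match(pi + 1, ti + 1))
    return match(0, 0)
```

With 'a?' * 18 + 'a' * 18 against eighteen a's, this sketch explores an exponential number of guess combinations before finding the one that works, while a lockstep/Thompson machine scans the same input once per character, which is exactly the contrast the table above shows.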
Feb 27 2011