www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Degenerate Regex Case

reply "Guillaume" <gcl bigvikinggames.com> writes:
Hello, I'm trying to make a regex comparison with D, based off of 
this article: https://swtch.com/~rsc/regexp/regexp1.html

I've written my code like so:

import std.stdio, std.regex;

void main(string argv[]) {

	string m = argv[1];
	auto p = 
ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
	if (match(m, p)) {
		writeln("match");
	} else {
		writeln("no match");
	}

}

And the compiler goes into swap. Doing it at runtime is no 
better. I was under the impression that this particular regex was 
used for showcasing the Thompson NFA which D claims to be using.

The golang code version of this runs fine, which makes me think 
that maybe D isn't using the correct regex engine for this 
particular regex. Or perhaps I'm using this wrong?
Apr 24 2015
next sibling parent "TheFlyingFiddle" <kurtyan student.chalmers.se> writes:
On Friday, 24 April 2015 at 18:28:16 UTC, Guillaume wrote:
 Hello, I'm trying to make a regex comparison with D, based off 
 of this article: https://swtch.com/~rsc/regexp/regexp1.html

 I've written my code like so:

 import std.stdio, std.regex;

 void main(string argv[]) {

 	string m = argv[1];
 	auto p = 
 ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
 	if (match(m, p)) {
 		writeln("match");
 	} else {
 		writeln("no match");
 	}

 }

 And the compiler goes into swap. Doing it at runtime is no 
 better. I was under the impression that this particular regex 
 was used for showcasing the Thompson NFA which D claims to be 
 using.
The regex "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaa aaaaaaaaaaaaaaaaaa" can be simplified to "a{30,60}" (if i counted correctly). The regex "a{30,60}" works fine. [Speculation] I don't have a good understanding of how D's regex engine work but I am guessing that it does not do any simplification of the regex input causing it to generate larger engines for each additional ? symbol. Thus needing more memory. Eventually as in this case the compiler runs out of memory.
Apr 24 2015
prev sibling parent reply "Dmitry Olshansky" <dmitry.olsh gmail.com> writes:
On Friday, 24 April 2015 at 18:28:16 UTC, Guillaume wrote:
 Hello, I'm trying to make a regex comparison with D, based off 
 of this article: https://swtch.com/~rsc/regexp/regexp1.html

 I've written my code like so:

 import std.stdio, std.regex;

 void main(string argv[]) {

 	string m = argv[1];
 	auto p = 
 ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
 	if (match(m, p)) {
 		writeln("match");
 	} else {
 		writeln("no match");
 	}

 }
 And the compiler goes into swap. Doing it at runtime is no 
 better. I was under the impression that this particular regex 
 was used for showcasing the Thompson NFA which D claims to be 
 using.
A quick investigation shows that it gets stuck at the end of pattern compilation stage. The problem is that as a last pass D's regex goes to optimize the pattern to construct simple bit-scanning engine as approximation for prefix of original pattern. And that process is a lot like Thompson NFA ... _BUT_ the trick of merging equivalent threads wasn't applied there. So in short: file a bug, optimizer absolutely should do de-duplication of threads.
 The golang code version of this runs fine, which makes me think 
 that maybe D isn't using the correct regex engine for this 
 particular regex. Or perhaps I'm using this wrong?
It uses 2 kinds of engines, run-time one is Thompson NFA. Compile-time is (for now) still backtracking. --- Dmitry Olshansky
Apr 25 2015
parent "Guillaume" <gcl bigvikinggames.com> writes:
On Saturday, 25 April 2015 at 09:30:55 UTC, Dmitry Olshansky 
wrote:
 A quick investigation shows that it gets stuck at the end of 
 pattern compilation stage.

 The problem is that as a last pass D's regex goes to optimize 
 the pattern to construct simple bit-scanning engine as 
 approximation for prefix of original pattern. And that process 
 is a lot like Thompson NFA ... _BUT_ the trick of merging 
 equivalent threads wasn't applied there.

 So in short: file a bug, optimizer absolutely should do 
 de-duplication of threads.

 ---
 Dmitry Olshansky
Thanks for your help, I'll go file a bug.
Apr 26 2015