digitalmars.D - ctRegex vs. Regex vs. plain string

"Chris" <wendlec tcd.ie> writes:
I have updated my code (finally!) to 2.060. As my project deals a 
lot with text processing including loads of special characters 
(á, ú etc.), I make extensive use of the std.regex module (and I 
really appreciate the use of the Thompson NFA). To optimize my 
program I have experimented with ctRegex / StaticRegex and Regex. 
However, there are still compile time problems with Regex and 
StaticRegex which is why I am using plain strings at the moment, 
which work fine with the same regular expressions. Are there any 
precautions I have to take when using compile time regular 
expressions? Does anyone have any experience as regards 
performance enhancement?
Dec 06 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 12/6/2012 7:21 PM, Chris wrote:
> I have updated my code (finally!) to 2.060.

Congrats!
> As my project deals a lot
> with text processing including loads of special characters (á, ú etc.),
> I make extensive use of the std.regex module (and I really appreciate
> the use of the Thompson NFA). To optimize my program I have experimented
> with ctRegex / StaticRegex and Regex. However, there are still compile
> time problems with Regex and StaticRegex which is why I am using plain
> strings at the moment, which work fine with the same regular
> expressions.

At first I was confused by "make extensive use of the std.regex" and "using plain strings". But then I recalled the problematic "bug" in how the compiler treats globals. So if your code goes like this:

    //globals or statics
    auto re1 = regex(...);
    auto re2 = regex(...);
    //...
    auto reK = regex(...);

    //and e.g. in main:
    void main(){
        ...
        use reX etc.
        ...
    }

then the long compilations are caused by the compiler doing constant folding on the re1-reK variables. This forces it to parse and compile these patterns at compile time. While it's cute and looks like a minor optimization, it can make compile times monstrous, especially as it just produces the same normal pattern that the run-time regex uses. The way out is to keep compiled patterns on the stack or initialize them inside of static this.

As for using strings as patterns: it does compile them internally and caches the last 8 of them. In other words, scripts and programs that use only a few patterns should be fine going with plain strings. It doesn't slow things down considerably even in a tight loop. But once you have about 10+ commonly used patterns, precompiling them is a better option.
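
A minimal sketch of the two workarounds described above (initializing inside static this, and keeping a compiled pattern on the stack). The pattern strings and the names wordRe, numberRe, countWords and hasNumber are hypothetical, chosen only for illustration, and the sketch assumes a Phobos version that provides matchFirst and matchAll (newer than the 2.060 release discussed in this thread):

```d
import std.regex;

// Declared at module scope but NOT initialized here, so the compiler
// has nothing to constant-fold at compile time.
// (Hypothetical names and patterns, for illustration only.)
Regex!char wordRe;
Regex!char numberRe;

static this()
{
    // Module constructor: runs at program start-up (once per thread),
    // so the patterns are parsed and compiled at run time.
    wordRe   = regex(`[a-záéíóú]+`);
    numberRe = regex(`[0-9]+`);
}

// Counts the matches of the precompiled module-level pattern.
size_t countWords(string text)
{
    import std.range : walkLength;
    return matchAll(text, wordRe).walkLength;
}

// Alternative: keep the compiled pattern on the stack.
bool hasNumber(string text)
{
    auto re = regex(`[0-9]+`);
    return !matchFirst(text, re).empty;
}
```

Passing the pattern as a plain string, e.g. matchFirst(text, `[0-9]+`), should also work and relies on the internal cache mentioned above.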
> Are there any precautions I have to take when using compile
> time regular expressions?

One precaution is to use ctRegex only when things are well tested and you are ready to go for that extra speed. It typically takes a lot of time and RAM to compile. Even then, testing that the results match is recommended: because of the pressure it puts on the compiler, ctRegex is not as well tested as the regular engine (it goes through only a couple of tests in the Phobos unittests).
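
One way to act on the "test that the results match" advice is to run the same pattern through both engines and compare the captures. This is only a sketch under assumptions: the pattern datePattern and the helper enginesAgree are hypothetical, and matchFirst requires a newer Phobos than 2.060:

```d
import std.regex;

// Hypothetical pattern, used only for illustration.
enum datePattern = `([0-9]{4})-([0-9]{2})-([0-9]{2})`;

// Returns true when the compile-time and run-time engines
// produce the same first match (and the same groups) for `input`.
bool enginesAgree(string input)
{
    auto ct = ctRegex!datePattern;  // pattern compiled at compile time
    auto rt = regex(datePattern);   // pattern compiled at run time

    auto a = matchFirst(input, ct);
    auto b = matchFirst(input, rt);

    // Both must either match or fail to match.
    if (a.empty != b.empty)
        return false;
    if (a.empty)
        return true;

    // Compare the whole match (index 0) and every captured group.
    if (a.length != b.length)
        return false;
    foreach (i; 0 .. a.length)
        if (a[i] != b[i])
            return false;
    return true;
}
```

A check like this could sit in a unittest block and be run over a sample of real input before switching a pattern to ctRegex.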
> Does anyone have any experience as regards
> performance enhancement?

You tell me ;) As a matter of fact, I collect problematic or frequent patterns; I guess I need to advertise that somewhere. Seriously, it depends on the patterns and the data. I'd expect it to be about 20-50% faster. But there are even cases where it may slow things down (the compile-time backend is not as sophisticated as the primary run-time one... something to improve with time).

--
Dmitry Olshansky
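
Since the gain depends on the patterns and the data, measuring on your own input is the safest guide. A rough sketch, assuming a recent Phobos where benchmark lives in std.datetime.stopwatch; the pattern and the function name timeEngines are hypothetical:

```d
import core.time : Duration;
import std.datetime.stopwatch : benchmark;
import std.regex;

// Hypothetical pattern, for illustration only.
enum pattern = `[a-záú]+`;

// Times `runs` passes over `text` with each engine.
// Result[0] is the run-time engine, result[1] the compile-time one.
Duration[2] timeEngines(string text, uint runs)
{
    auto rt = regex(pattern);
    auto ct = ctRegex!pattern;
    size_t sink; // accumulates match lengths so the work is not skipped

    void runTime()
    {
        foreach (m; matchAll(text, rt))
            sink += m.hit.length;
    }

    void compileTime()
    {
        foreach (m; matchAll(text, ct))
            sink += m.hit.length;
    }

    return benchmark!(runTime, compileTime)(runs);
}
```

The two durations are only meaningful for the pattern and text actually measured, so it is worth repeating this with representative data.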
Dec 06 2012
"Chris" <wendlec tcd.ie> writes:
On Thursday, 6 December 2012 at 16:00:11 UTC, Dmitry Olshansky 
wrote:
> [...]

Thanks a lot. That's very useful information. I will follow the rules Roberto Ierusalimschy mentions:

"In Lua, as in any other programming language, we should always follow the two maxims of program optimization:

Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)"
Dec 06 2012