digitalmars.D - Q about Phobos regex's architecture

"Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
I'll admit this is a bit unorthodox, but...I've been wondering about 
something regarding Phobos regex's implementation and internal 
architecture: Just how compartmentalized is the parsing of standard PCRE 
regex syntax vs actual usage of regexes once parsed?

Or more to the point, (and again, I realize how unorthodox this is), 
what would it take to implement an alternate (ie, non-PCRE) syntax for 
regexes that still *uses* the rest of the Phobos regex implementation 
once the regex string is parsed?

Is it currently coupled enough that the only realistic option is to 
translate the alternate syntax into standard PCRE regex syntax?

Is there a (perhaps "protected", but maybe even "public" if I'm really 
lucky) manual interface to Phobos regex implementation that bypasses the 
PCRE parsing?

Any tips/pointers on where to start with this?
Apr 12 2017
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 4/13/17 7:53 AM, Nick Sabalausky (Abscissa) wrote:
 I'll admit this is a bit unorthodox, but...I've been wondering about
 something regarding Phobos regex's implementation and internal
 architecture: Just how compartmentalized is the parsing of standard PCRE
 regex syntax vs actual usage of regexes once parsed?
Essentially, a regex is parsed to bytecode, which is encapsulated in the Regex!Char struct. The match family of functions then uses that bytecode to construct one of the engines; even the CT engine uses the same bytecode under the hood. Thus matching has nothing to do with the parser.
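
Seen from the public API, that split looks roughly like the sketch below. It uses only the documented `regex`, `matchFirst` and `matchAll` functions; the helper names `compilePattern` and `firstNumber` are made up for illustration.

```d
import std.regex : Regex, regex, matchFirst, matchAll;

// Step 1: the pattern text is parsed exactly once, producing a compiled object.
Regex!char compilePattern(string pattern)
{
    return regex(pattern);
}

// Step 2: matching receives only the compiled object, never the pattern text.
string firstNumber(Regex!char re, string input)
{
    auto m = matchFirst(input, re);
    return m.empty ? null : m.hit;
}

void main()
{
    auto re = compilePattern(`\d+`);

    // The same compiled object feeds every member of the match family.
    assert(firstNumber(re, "abc 42 def 7") == "42");

    string[] hits;
    foreach (m; matchAll("abc 42 def 7", re))
        hits ~= m.hit;
    assert(hits == ["42", "7"]);
}
```

Nothing in step 2 depends on how the `Regex!char` value was produced, which is the property an alternate front end would rely on.
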
 Or more to the point, (and again, I realize how unorthodox this is),
 what would it take to implement an alternate (ie, non-PCRE) syntax for
 regexes that still *uses* the rest of the Phobos regex implementation
 once the regex string is parsed?
It would take a new parser; the rest is transparently reused. It may also take a new entry point for CT-regex, because it blindly uses the `regex` function inside. Even the bytecode generator _might_ be reused. In the worst case, the new parser will have to generate the bytecode itself.
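
As a toy illustration of "new parser, same back end": the sketch below has two front ends that emit the same instruction format and one matcher that consumes it. Every type and function here (`Op`, `Instr`, `parseDotSyntax`, `parseQuestionSyntax`, `matchesWhole`) is hypothetical and far simpler than the real Phobos IR; it only shows the shape of the idea.

```d
// Toy illustration only: these types are NOT the Phobos bytecode.
enum Op { Char, Any }

struct Instr
{
    Op op;
    char ch; // meaningful only for Op.Char
}

// Front end 1: '.' means "any single character".
Instr[] parseDotSyntax(string pattern)
{
    Instr[] code;
    foreach (char c; pattern)
        code ~= (c == '.') ? Instr(Op.Any) : Instr(Op.Char, c);
    return code;
}

// Front end 2: a made-up alternate syntax where '?' means "any single character".
Instr[] parseQuestionSyntax(string pattern)
{
    Instr[] code;
    foreach (char c; pattern)
        code ~= (c == '?') ? Instr(Op.Any) : Instr(Op.Char, c);
    return code;
}

// Shared back end: it sees only instructions, so either front end can feed it.
bool matchesWhole(const Instr[] code, string input)
{
    if (code.length != input.length)
        return false;
    foreach (i, instr; code)
    {
        if (instr.op == Op.Char && instr.ch != input[i])
            return false;
    }
    return true;
}

void main()
{
    assert(matchesWhole(parseDotSyntax("a.c"), "abc"));
    assert(matchesWhole(parseQuestionSyntax("a?c"), "abc"));
    assert(!matchesWhole(parseQuestionSyntax("a?c"), "abd"));
}
```
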
 Is it currently coupled enough that the only realistic option is to
 translate the alternate syntax into standard PCRE regex syntax?
No, the bytecode nicely decouples matching from compiling.
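
For comparison, the translation route from the question needs nothing beyond the public API. A minimal sketch, assuming a made-up glob-like syntax in which `*` matches any run of characters and `?` matches one character; `globToPattern` is a hypothetical helper, not part of Phobos.

```d
import std.array : appender;
import std.regex : regex, matchFirst;

// Hypothetical alternate syntax: '*' = any run of characters, '?' = any one
// character, everything else is literal. Translates to std.regex pattern text.
string globToPattern(string glob)
{
    auto result = appender!string();
    result.put('^');
    foreach (char c; glob)
    {
        switch (c)
        {
        case '*':
            result.put(".*");
            break;
        case '?':
            result.put('.');
            break;
        // Characters that are special in regex syntax must be escaped.
        case '\\', '.', '+', '(', ')', '[', ']', '{', '}', '|', '^', '$':
            result.put('\\');
            result.put(c);
            break;
        default:
            result.put(c);
            break;
        }
    }
    result.put('$');
    return result.data;
}

void main()
{
    assert(globToPattern("*.d") == `^.*\.d$`);

    // The translated text goes through the ordinary parser via regex().
    auto re = regex(globToPattern("*.d"));
    assert(!matchFirst("parser.d", re).empty);
    assert(matchFirst("parser.di", re).empty);
}
```

The cost of this route is that the alternate syntax can only express what the existing pattern syntax can express, and the text is parsed twice.
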
 Is there a (perhaps "protected", but maybe even "public" if I'm really
 lucky) manual interface to Phobos regex implementation that bypasses the
 PCRE parsing?

 Any tips/pointers on where to start with this?
At the moment it's all internal. That doesn't stop you from doing `import std.regex.internal.ir`, for instance. Still, I think you'd need to hack on Phobos to bypass the protection levels. Look at std/regex/internal/parser.d, in particular the makeRegex function, which is the culmination of the parsing step; there you'll see all the bits and pieces needed to populate the Regex!Char struct. A brief description of the bytecode is in std/regex/internal/ir.d; you can also trace its use in the CodeGen struct. Also, if you are interested, I'd fully support building a public API to generate regexes without touching the parser.
--- Dmitry Olshansky
Apr 13 2017