digitalmars.D.learn - Compile time regex matching

"Jason den Dulk" <public2 jasondendulk.com> writes:
Hi

I am trying to write some code that matches against regular 
expressions at compile time, but the compiler won't let me 
because matchFirst and matchAll make use of malloc().

Is there an alternative that I can use that can be run at compile 
time?

Thanks in advance.
Jason
Jul 13 2014
Philippe Sigaud via Digitalmars-d-learn writes:
 I am trying to write some code that matches against regular expressions
 at compile time, but the compiler won't let me because matchFirst and
 matchAll make use of malloc().

 Is there an alternative that I can use that can be run at compile time?
You can try Pegged, a parser generator that works at compile-time
(both the generator and the generated parser).

https://github.com/PhilippeSigaud/Pegged

docs:

https://github.com/PhilippeSigaud/Pegged/wiki/Pegged-Tutorial

It's also on dub:

http://code.dlang.org/packages/pegged

It takes a grammar as input, not a single regular expression, but the
syntax is not too different.

    import pegged.grammar;

    mixin(grammar(`
    MyRegex:
        foo <- "abc"* "def"?
    `));

    void main()
    {
        enum result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing

        // everything can be queried and tested at compile-time, if need be.
        static assert(result.matches == ["abc", "abc", "def"]);
        static assert(result.begin == 0);
        static assert(result.end == 9);

        pragma(msg, result.toString()); // parse tree
    }

It probably does not implement all those nifty regex features, but it
has all the usual powers of Parsing Expression Grammars. It gives you an
entire parse result, though: matches, children, subchildren, etc. As you
can see, matches are accessible at the top level.

One thing to keep in mind, that comes from the language and not this
library: in the previous code, since 'result' is an enum, it'll be
'pasted' in place every time it's used in code: all those static
asserts get an entire copy of the parse tree. It's a bit wasteful, but
using 'immutable' directly does not work here, but this is OK:

    enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
    immutable result = res; // to avoid copying the enum value everywhere

The static asserts then work (not the toString, though). Maybe
someone more knowledgeable than me on DMD internals could certify it
indeed avoids re-allocating those parse results.
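
To illustrate the enum remark with something smaller than a parse tree,
here is a minimal sketch using a plain array. splitWords is a
hypothetical helper, not part of Pegged, and the pointer comparison
reflects how current compilers treat enum array literals, so take it as
an illustration rather than a guarantee:

```d
// Hypothetical compile-time computation standing in for a Pegged parse.
string[] splitWords(string s) pure
{
    string[] words;
    size_t start = 0;
    foreach (i, c; s)
    {
        if (c == ' ')
        {
            if (i > start)
                words ~= s[start .. i];
            start = i + 1;
        }
    }
    if (start < s.length)
        words ~= s[start .. $];
    return words;
}

// Manifest constant: computed by CTFE, then pasted as a literal at each use.
enum pasted = splitWords("abc abc def");

static assert(pasted == ["abc", "abc", "def"]);

void main()
{
    // Each use of the enum builds a fresh array at runtime.
    auto a = pasted;
    auto b = pasted;
    assert(a.ptr !is b.ptr);

    // Copying the enum into a variable once avoids the repeated allocation.
    immutable once = pasted;
    auto c = once;
    auto d = once;
    assert(c.ptr is d.ptr);
}
```

The same reasoning should apply to a parse tree stored in an enum, only
with more data being rebuilt at every use.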
Jul 14 2014
"Jason den Dulk" <public2 jasondendulk.com> writes:
On Monday, 14 July 2014 at 11:43:01 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:

 You can try Pegged, a parser generator that works at 
 compile-time
 (both the generator and the generated parser).
I did, and I got it to work. Unfortunately, the code used in the 
CTFE is left in the final executable even though it is not used 
at runtime. So now the question is, is there a way to get rid of 
the excess baggage?

BTW Here is the code I am playing with.

    import std.stdio;

    string get_match()
    {
        import pegged.grammar;

        mixin(grammar(`
        MyRegex:
            foo <- "abc"* "def"?
        `));

        auto result = MyRegex(import("config-file.txt")); // compile-time parsing
        return "writeln(\""~result.matches[0]~"\");";
    }

    void main()
    {
        mixin(get_match());
    }
Jul 15 2014
Philippe Sigaud via Digitalmars-d-learn writes:
 I did, and I got it to work. Unfortunately, the code used in the CTFE is
 left in the final executable even though it is not used at runtime. So now
 the question is, is there a way to get rid of the excess baggage?
Not that I know of. Once code is injected, it's compiled into the executable.
   auto result = MyRegex(import("config-file.txt")); // compile-time parsing
   return "writeln(\""~result.matches[0]~"\");";
   mixin(get_match());
I never tried that, I'm happy that works.

Another solution would be to do this work at runtime, in a small
script that replaces your compilation command. This script can be in D.

- The script takes a file name as input
- Open the file
- Use regex to parse it
- Extract the values you want and write them to a temporary file.
- Invoke the compiler (with std.process) on your main file, with the
  -Jpath flag set to the directory containing the temporary file.
  Inside your real code, you can thus use mixin(import("temp file"))
  happily.
- Delete the temporary file once the previous step is finished.

Compile the script once and for all, it should execute quite rapidly.
It's an unusual pre-processor, in a way.
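
The steps above could be sketched roughly as follows. This is only an
outline, assuming dmd is on the PATH; the file name
generated_match.txt, the pattern and the generateCode helper are made
up for illustration:

```d
import std.exception : enforce;
import std.file : readText, remove, tempDir, write;
import std.path : buildPath;
import std.process : execute;
import std.regex : matchFirst, regex;
import std.stdio : stderr, stdout;

// Hypothetical helper: find the first token and wrap it in a D statement.
string generateCode(string text)
{
    auto m = matchFirst(text, regex(`^(abc|def)`));
    enforce(!m.empty, "no match found in the config text");
    return "writeln(\"" ~ m.hit ~ "\");";
}

int main(string[] args)
{
    if (args.length != 3)
    {
        stderr.writeln("usage: prebuild <config-file> <main-file.d>");
        return 1;
    }

    // Hypothetical temporary file name, placed in the system temp directory.
    string tempPath = buildPath(tempDir(), "generated_match.txt");

    // Parse the config file at runtime and store the generated code.
    write(tempPath, generateCode(readText(args[1])));
    scope (exit) remove(tempPath); // delete the temporary file when main returns

    // -J tells dmd which directory to search for import("...") files.
    auto result = execute(["dmd", "-J" ~ tempDir(), args[2]]);
    stdout.write(result.output);
    return result.status;
}
```

With this sketch, the real program would contain
mixin(import("generated_match.txt")); and neither std.regex nor a
parser generator ends up in the final executable.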
Jul 15 2014
Artur Skawina via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On 07/14/14 13:42, Philippe Sigaud via Digitalmars-d-learn wrote:
 asserts get an entire copy of the parse tree. It's a bit wasteful, but
 using 'immutable' directly does not work here, but this is OK:
 
     enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
     immutable result = res; // to avoid copying the enum value everywhere   
static immutable result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
 The static asserts then work (not the toString, though). Maybe
diff --git a/pegged/peg.d b/pegged/peg.d
index 98959294c40e..307e8a14b1dd 100644
--- a/pegged/peg.d
+++ b/pegged/peg.d
@@ -55,7 +55,7 @@ struct ParseTree
     /**
         Basic toString for easy pretty-printing.
     */
-    string toString(string tabs = "")
+    string toString(string tabs = "") const
     {
         string result = name;
@@ -262,7 +262,7 @@ Position position(string s)
 /**
 Same as previous overload, but from the begin of P.input to p.end
 */
-Position position(ParseTree p)
+Position position(const ParseTree p)
 {
     return position(p.input[0..p.end]);
 }

[completely untested; just did a git clone and fixed the two
 errors the compiler was whining about. Hmm, did pegged get
 faster? Last time i tried (years ago) it was unusably slow;
 right now, compiling your example, i didn't notice the extra
 multi-second delay that was there then.]

artur
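
A minimal sketch of why the const in the patch matters once the value
is a static immutable. Node and makeNode are hypothetical stand-ins
written for this example, not Pegged's ParseTree:

```d
import std.stdio : writeln;

// Hypothetical stand-in for a parse-tree type.
struct Node
{
    string name;
    string[] matches;

    // `const` allows the call on mutable, const and immutable instances.
    // Without it, calling toString on the immutable `result` below fails to compile.
    string toString() const
    {
        string text = name;
        foreach (m; matches)
            text ~= " " ~ m;
        return text;
    }
}

// Ordinary function; the compiler runs it through CTFE when a
// compile-time value is required.
Node makeNode(string input) pure
{
    return Node("root", [input]);
}

// Evaluated at compile time and stored once, instead of being pasted at each use.
static immutable Node result = makeNode("abc");

static assert(result.matches == ["abc"]);
static assert(result.toString() == "root abc");

void main()
{
    writeln(result.toString()); // prints "root abc"
}
```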
Jul 14 2014
Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Jul 14, 2014 at 3:19 PM, Artur Skawina via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:
 On 07/14/14 13:42, Philippe Sigaud via Digitalmars-d-learn wrote:
 asserts get an entire copy of the parse tree. It's a bit wasteful, but
 using 'immutable' directly does not work here, but this is OK:

     enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
     immutable result = res; // to avoid copying the enum value everywhere
static immutable result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
Ah, static!
 The static asserts then work (not the toString, though). Maybe
(snip diff)

I'll push that to the repo, thanks! I should sprinkle some const and pure everywhere...
 [completely untested; just did a git clone and fixed the two
  errors the compiler was whining about. Hmm, did pegged get
  faster? Last time i tried (years ago) it was unusably slow;
  right now, compiling your example, i didn't notice the extra
  multi-second delay that was there then.]
It's still slower than some handcrafted parsers. At one point, I could get it close to std.regex (between 1.3 and 1.8 times slower), but that meant losing some other properties. I have other parsing engines partially implemented, with either a larger spectrum of grammars or better speed (but not both!). I hope the coming holidays will let me go back to it.
Jul 14 2014