digitalmars.D.learn - Restrictions in std.regexp?

Olaf Pohlmann <op nospam.org> writes:
Hi,

the documentation of std.regexp is somewhat sparse, so I tried to find 
out a few things on my own. There seems to be no way to do lookaheads 
and lookbehinds. This:

	RegExp re = search("ABCDEF", "(?<=AB)CD(?=EF)");

should find "CD" as a match, but it yields a runtime error:

	Error: *+? not allowed in atom

Is there any other way to get this working or am I just out of luck with 
the current implementation?



op
May 02 2006
Lionello Lunesu <lio lunesu.remove.com> writes:
Olaf Pohlmann wrote:
> Hi,
>
> the documentation of std.regexp is somewhat sparse, so I tried to find
> out a few things on my own. There seems to be no way to do lookaheads
> and lookbehinds. This:
>
>     RegExp re = search("ABCDEF", "(?<=AB)CD(?=EF)");
>
> should find "CD" as a match, but it yields a runtime error:

Use "AB(CD)EF" and re.match(1) ?? I'm very inexperienced with regexp, mind you :S

L.
May 02 2006
"Derek Parnell" <derek psych.ward> writes:
On Tue, 02 May 2006 23:39:13 +1000, Olaf Pohlmann <op nospam.org> wrote:

> Hi,
>
> the documentation of std.regexp is somewhat sparse, so I tried to find
> out a few things on my own. There seems to be no way to do lookaheads
> and lookbehinds. This:
>
> 	RegExp re = search("ABCDEF", "(?<=AB)CD(?=EF)");
>
> should find "CD" as a match, but it yields a runtime error:
>
> 	Error: *+? not allowed in atom
>
> Is there any other way to get this working or am I just out of luck with
> the current implementation?

I can't tell what it is you are trying to do but it seems that the RE syntax you are expecting is not what has been implemented. See http://www.digitalmars.com/ctg/regular.html for details.

Are you looking for an optional "AB" followed by "CD" followed by an optional "EF" ? If so try

	RegExp re = search("ABCDEF", "(AB)?(CD)(EF)?");

Here is a sample program ...

	import std.stdio;
	import std.regexp;

	void main()
	{
	    RegExp re = search("AXCDEFGHI", "(AB)?(CD)(EF)?");
	    writefln("PRE: %s", re.pre());
	    writefln("MATCH: %s", re.match(0));
	    writefln("SUB1: %s", re.match(1));
	    writefln("SUB2: %s", re.match(2)); // this should be 'CD'
	    writefln("SUB3: %s", re.match(3));
	    writefln("POST: %s", re.post());
	}

-- 
Derek Parnell
Melbourne, Australia
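
For readers without a D compiler at hand, here is a rough Python sketch of what the optional-group pattern above does on the same input. It uses Python's `re` module rather than std.regexp, so it is only an analogy: the `pre` and `post` slices are assumptions standing in for `re.pre()` and `re.post()`, and the helper name `describe` is made up for this illustration.

```python
import re

def describe(text, pattern):
    """Return (pre, match, groups, post) for the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    pre = text[:m.start()]   # text before the match (stand-in for re.pre())
    post = text[m.end():]    # text after the match (stand-in for re.post())
    return pre, m.group(0), m.groups(), post

# Same subject string and pattern as the D sample program.
result = describe("AXCDEFGHI", "(AB)?(CD)(EF)?")
print(result)  # ('AX', 'CDEF', (None, 'CD', 'EF'), 'GHI')
```

In Python the skipped `(AB)?` group comes back as `None`; what std.regexp returns for an unmatched group is not shown in this thread.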
May 02 2006
Olaf Pohlmann <op nospam.org> writes:
Derek Parnell wrote:
> Are you looking for an optional "AB" followed by "CD" followed by an
> optional "EF" ?

No. I'm looking for a string that is preceded and followed by well-defined other strings. The match should *not* return the whole sequence but only what is in the middle. It's actually about parsing some kind of text markup. If it was HTML like "<body><h1>Welcome</h1></body>" it should allow me to retrieve only the "Welcome". If you just use some grouping the match will be the whole <h1> element, so you have to extract the content in a 2nd step.

The regexp with lookahead and lookbehind works fine in Python:

	import re
	html = "<body>\n<h1>Welcome</h1>\n</body>"
	m = re.search(r"(?<=\<h1\>).*?(?=\</h1\>)", html)
	print(html[m.start():m.end()])

This prints 'Welcome'. The regexp is a bit hard to read, so see http://docs.python.org/lib/re-syntax.html for a description.

Now, I can retrieve the whole h1 element with the D version of regexps and then do another scan for the content, but it would be nice to get it in one step, like in the Python version.

op
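
To make the difference between the two approaches concrete, here is a small Python sketch that runs the lookaround pattern and a plain capture-group pattern side by side on the same string. It only illustrates Python's `re` module and says nothing about std.regexp; the variable names are chosen for this example.

```python
import re

html = "<body>\n<h1>Welcome</h1>\n</body>"

# Lookaround: the tags are asserted but not consumed, so the match
# itself is only the text between them.
look = re.search(r"(?<=<h1>).*?(?=</h1>)", html)

# Capture group: the match covers the whole <h1> element, and group 1
# holds the text between the tags.
group = re.search(r"<h1>(.*?)</h1>", html)

inner_from_look = look.group(0)
inner_from_group = group.group(1)

# Both approaches point at the same slice of the input.
same_span = look.span() == group.span(1)

print(inner_from_look, inner_from_group, same_span)  # Welcome Welcome True
```

So the grouped pattern yields the inner text in the same single search; the only difference is that it sits in group 1 instead of being the whole match.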
May 02 2006
Olaf Pohlmann <op nospam.org> writes:
Derek Parnell wrote:
>     RegExp re = search("ABCDEF", "(AB)?(CD)(EF)?");

Oops, this is actually very close to the solution, just drop both '?'. It's even more readable than what I tried before:

	import std.stdio;
	import std.regexp;

	void main()
	{
	    char[] html = "<body>\n<h1>Welcome</h1>\n</body>";
	    RegExp re = search(html, r"(\<h1\>)(.*?)(\</h1\>)");
	    if (re !is null)
	        writefln("%s", re.match(2));
	}

op
May 02 2006