
digitalmars.D.bugs - regexp (reluctant)

reply Fredrik Olsson <peylow gmail.com> writes:
Since no one has replied to my "export to .h" post, and Google yields no 
matches, I have begun coding a tool that exports D declarations to a C 
header file. And I see it fitting to use D for the task. I would usually 
use Ruby, but hey, some practice at D and another resource for the 
community ;).


I have defined this regexp:
const char[] redoccom = "\\s*(\\/\\*[\\*\\!].*?\\*\\/)?\\s*";

Or in more "readable" form:
   \s*(\/\*[\*\!].*?\*\/)?\s*

The documentation for regexp does not mention reluctant quantifiers. 
Usually (exp)* would mean find the largest match you can of (exp) and 
(exp)*? would mean the smallest match of (exp). Well, this is not the 
problem, even though I think the documentation should state whether the 
*, ? and {} quantifiers are greedy or reluctant.
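
To make the greedy/reluctant distinction concrete, here is a minimal sketch. It assumes present-day Phobos std.regex (regex, matchFirst) rather than the std.regexp module this thread is about, and firstHit is a hypothetical helper name. The sample text sits on one line, so "." never has to cross a newline.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: text of the first match of `pattern` in `text`.
// Assumes the pattern does match; `hit` is not valid on a failed match.
string firstHit(string text, string pattern)
{
    return matchFirst(text, regex(pattern)).hit;
}

void main()
{
    // Two doc comments on a single line.
    string text = "/** a */ int x; /** b */";

    // Greedy ".*" runs on to the last "*/" in the text.
    writeln("greedy:    ", firstHit(text, r"/\*\*.*\*/"));

    // Reluctant ".*?" stops at the first "*/" that completes the match.
    writeln("reluctant: ", firstHit(text, r"/\*\*.*?\*/"));
}
```

The greedy pattern returns the whole line, from the first comment opener to the last closer; the reluctant one returns only "/** a */".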

For those who read regular expressions, this is a simplistic match for an 
optional documentation comment in code, in one of these two forms 
(ignoring the content entirely, as long as there are no nested comments):
   /**
     Foo
   */
or
   /*!
     Bar
   */
With a capture group for the actual comment. Useful for example as:
new RegExp(redoccom ~ "export", "m");
to find exported members in a file along with relevant documentation.
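
As a rough illustration of that use, the sketch below collects the doc comment (if any) in front of each "export". It assumes present-day Phobos std.regex, not std.regexp, and it assumes the "s" flag so that "." can span lines; docsForExports and the sample source are made up for the example. The unneeded escapes on "/" and "!" are dropped.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: for every "export" in `source`, returns capture
// group 1, i.e. the doc comment right before it, or "" when there is none.
string[] docsForExports(string source)
{
    // "s" (an assumption of this sketch) lets "." cross newlines.
    auto re = regex(r"\s*(/\*[*!].*?\*/)?\s*export", "s");
    string[] docs;
    foreach (m; matchAll(source, re))
        docs ~= m[1];
    return docs;
}

void main()
{
    // Made-up input: one /** */ comment, one bare export, one /*! */ comment.
    string source = "/** Adds. */\nexport int add(int a, int b);\n\n"
        ~ "export void reset();\n/*!\n  Stops.\n*/\nexport void stop();\n";

    foreach (doc; docsForExports(source))
        writefln("doc: '%s'", doc);
}
```

Being this simplistic, the pattern can overreach: a doc comment that belongs to a non-exported declaration may be stretched forward until the next "export", since the reluctant quantifier only prefers the shortest match that still lets the whole pattern succeed.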

Anyhow. With or without the reluctant quantifier, D does not give me the 
result I expect. I use SubEthaEdit with its default regexp syntax (Ruby) 
to verify the matches and the capture groups. (The only problem is that 
with greedy quantifiers it matches from the start of the very first 
doc comment to the end of the very last comment, which is what reluctant 
quantifiers are required to compensate for.)

Regards
	Fredrik Olsson
Aug 09 2005
next sibling parent Chris Sauls <ibisbasenji gmail.com> writes:
Fredrik Olsson wrote:
 I have defined this regexp:
 const char[] redoccom = "\\s*(\\/\\*[\\*\\!].*?\\*\\/)?\\s*";
 
 Or in more "readable" form:
   \s*(\/\*[\*\!].*?\*\/)?\s*

This is such a little thing, but you probably should use WYSIWYG string
literals for regexps, to help with readability.  (I know I do.)

# const char[] RE_DOC_COM = r"\s*(\/\*[\*\!].*?\*\/)?\s*"c;

Or:

# const char[] RE_DOC_COM = `\s*(\/\*[\*\!].*?\*\/)?\s*`c;

-- Chris Sauls
Aug 09 2005
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
Can you flesh it out a bit with some minimal sample text, the result you get
with regexp, and what the correct result should be? Also, can you try and
simplify the regular expression as much as possible?
Aug 09 2005
parent reply Fredrik Olsson <peylow gmail.com> writes:
Walter wrote:
 Can you flesh it out a bit with some minimal sample text, the result you get
 with regexp, and what the correct result should be? Also, can you try and
 simplify the regular expression as much as possible?
 
 

This is as short as I can get it with the behavior still intact:

/* BEGIN: regexp.d */
import std.regexp;
import std.stdio;

int main(char[][] args) {
  RegExp re = new RegExp(r"\s*(\*.*?\*)?\s*", null);
  char[][] ms = re.match("*\n foo\n * bar");
  foreach(char[] m; ms) {
    writefln("'" ~ m ~ "'");
  }
  return 0;
}
/* END: regexp.d */

And an actual compile/run session:

peylow imanicken:~$ gdc regexp.d -o regexp; ./regexp
''
''
peylow imanicken:~$

Expected compile/run:

peylow imanicken:~$ gdc regexp.d -o regexp; ./regexp
'*
 foo
 * '
'*
 foo
 *'
peylow imanicken:~$

If I remove the newlines in the string and search in "* foo * bar" then I 
correctly get:

peylow imanicken:~$ gdc regexp.d -o regexp; ./regexp
'* foo * '
'* foo *'
peylow imanicken:~$

So it seems that "." does not match every character, as it misses newline.

Regards
	Fredrik Olsson
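
For comparison, here is a sketch of the same pattern in present-day Phobos std.regex, which is an assumption of this example and not the std.regexp module reported on above. There "." skips newlines unless the "s" flag is given, and an explicit (?:.|\n) alternation is another way to let the comment body span lines. firstHit is a hypothetical helper name.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: whole match of `pattern` at its first hit in `text`.
// Every pattern used below can match the empty string, so a hit always exists.
string firstHit(string text, string pattern, string flags = "")
{
    return matchFirst(text, regex(pattern, flags)).hit;
}

void main()
{
    string text = "*\n foo\n * bar";

    // No flags: "." stops at the newline, the optional group is skipped,
    // and the match is empty.
    writefln("'%s'", firstHit(text, r"\s*(\*.*?\*)?\s*"));

    // "s" flag: "." also matches newlines, so the group spans all three lines.
    writefln("'%s'", firstHit(text, r"\s*(\*.*?\*)?\s*", "s"));

    // Without any flag: spell out "any character or a newline" explicitly.
    writefln("'%s'", firstHit(text, r"\s*(\*(?:.|\n)*?\*)?\s*"));
}
```

The first call yields the empty match seen in the actual session above; the other two yield the three-line match from the expected session.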
Aug 10 2005
parent "Walter" <newshound digitalmars.com> writes:
Thanks, I can work with that.
Aug 10 2005