D - problems with regexps

reply Achim Schmitt <Achim_member pathlink.com> writes:
Hi, this is my third or fourth program in "D".

I want to split phobos.html (docs) into small chunks, to get something like the
python module index.

The output shows that one of the regexps does not match every time the filter
"<h2>([^<]+)</h2>" should match. It finds "compiler" in
"<a name="compiler"><h2>compiler</h2></a>".
But it does not find "conv" in the next module's line...

# ./split_module_docs
Reading complete file...
Slicing...
len = 25
num matches: 2 in: compiler
2: <a name="compiler"><h2>compiler</h2></a>
num matches: 0 in: 2: <a name="conv"><h2>conv</h2></a>
num matches: 2 in: ctype
2: <a name="ctype"><h2>ctype</h2></a>
num matches: 0 in: 2: <a name="date"><h2>date</h2></a>
num matches: 0 in: 2: <a name="file"><h2>file</h2></a>
num matches: 2 in: gc
2: <a name="gc"><h2>gc</h2></a>
..


Here is the source - what is going wrong?
[and why do I have to call the dup method of an array without "()"?]


---------snip---------------
import std.file;
import std.regexp;

int main (char[][] args)
{
    char[] splitter   = "<hr><!-- ===================================== -->";

    RegExp reSplitter   = new RegExp( splitter, "");
    RegExp reNewline    = new RegExp( "\n", "");
    RegExp rePhobosName = new RegExp( "<h2>([^<]+)</h2>", "");  // * FILTER *

    char[] phpbos_filename = "/data/tapo/dmd/html/d/phobos.html";

    printf("Reading complete file...\n");
    char[] buffer = cast(char[])std.file.read( phpbos_filename ); // :-o

    printf("Slicing...\n");

    char[][] slices = reSplitter.split( buffer );
    printf(" len = %d\n", slices.length );

    char[] line;

    // Now, for every slice we'll take line #2, and extract the modules name:
    for ( int i = 1; i < slices.length; i++ ) {
        char[] element = slices[ i ];
        char[][] lines = reNewline.split( element ); // split into lines
        if (lines.length > 1) {
            line = lines[1].dup;
            char[][] matches = rePhobosName.match( line );  // extract name
            printf("num matches: %d in: ", matches.length);
            if (matches.length > 1) {
                printf("%.*s\n", matches[1]);              // print name
            }
            printf( "2: %.*s\n", lines[1] );     // no name???
        }
        else {
            printf("no lines in %d.\n", i);
        }
    }

    return 0;
}
Dec 06 2003
parent reply Adam Harper <a-news-d harper.nu> writes:
On Sat, 06 Dec 2003 19:00:24 +0000, Achim Schmitt wrote:
 <snip>
 Here is the source - what is going wrong? [and why do I have to call the
 dup method of an array without "()"?]
 
 <snip>

There's an issue with the current implementation of the RegExp class, which I'll explain in a bit; first, a fix for your program. To get the expected behaviour you'll need to move the creation of the "rePhobosName" regular expression inside the for loop (preferably within the "if" branch in which it actually gets used).

Now the explanation (which turned into a somewhat long incoherent ramble *sigh*, skip to the end for the [very] short of it).

You need to (re)create the regular expression before each test because of the way the RegExp class works. At present the RegExp class stores (amongst others) two variables, "input" and "pmatch". "input" is the string it performed the last action on (be it match, search, etc.); "pmatch" is an array of "regmatch_t"s (a struct containing two ints, one that indicates the starting position of the match within "input" and one that indicates the end).

Now, when you call "my_regex.match( somestr )" the following happens (if the RegExp isn't configured with the global attribute):

- "input" gets set to "somestr"
- "test( input, pmatch[0].rm_eo )" gets called; "pmatch[0].rm_eo" is the end position of the last match (its counterpart is "pmatch[0].rm_so", which is the start of the match)

This is fine and dandy, /if the RegExp hasn't been used before/, because "pmatch[0].rm_eo" will be 0, thus telling "test" to start looking for a match at the beginning of "input". If the RegExp has been used before, then "pmatch[0].rm_eo" will be the end position of the last match /for the previous input string/, which may be greater than the length of the /new/ input string. As a result, test will return 0, no match (after all, it can't find a match after the end of the string, can it!).

In the context of your program, the following happens:

- rePhobosName is initialised:
    input = ""
    pmatch[0].rm_so, pmatch[0].rm_eo = 0
- We loop through the headers from "phobos.html", trying to match each one against the "rePhobosName" regexp
- We try to match '<a name="compiler"><h2>compiler</h2></a>', which succeeds. Now:
    input = '<a name="compiler"><h2>compiler</h2></a>'
    pmatch[0].rm_so = 19
    pmatch[0].rm_eo = 36
- We try to match '<a name="conv"><h2>conv</h2></a>', which fails because the test tries to start at character 36 but the string is only 32 characters long! Now:
    input = '<a name="conv"><h2>conv</h2></a>'
    pmatch[0].rm_so = 0
    pmatch[0].rm_eo = 0
- We try to match '<a name="ctype"><h2>ctype</h2></a>', which succeeds. Now:
    input = '<a name="ctype"><h2>ctype</h2></a>'
    pmatch[0].rm_so = 16
    pmatch[0].rm_eo = 30
- We try to match '<a name="date"><h2>date</h2></a>', which fails because the test starts at character 30 and as a result only tries to find a match in the substring 'a>'. Now:
    input = '<a name="date"><h2>date</h2></a>'
    pmatch[0].rm_so = -1
    pmatch[0].rm_eo = -1
  (Note: the pmatch values get set to -1 for no match but 0 when a test cannot be performed, as was the case with the "conv" header.)
- We try to match '<a name="file"><h2>file</h2></a>', which fails because the start index (pmatch[0].rm_eo, which currently equals -1) is invalid. Now:
    input = '<a name="file"><h2>file</h2></a>'
    pmatch[0].rm_so = 0
    pmatch[0].rm_eo = 0

The above sequence (one match followed by two failures then one match, etc.) then repeats until we run out of headers.

The fix for Phobos would be to reset pmatch[0].rm_so and pmatch[0].rm_eo to 0 every time the input value is changed.
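
The sequence described in this post can be reproduced with a small model. This is only a sketch under stated assumptions: "StaleMatcher", "findFrom" and "resetOnNewInput" are made-up names, the hand-written search stands in for the regexp engine, and the code is written for a current D compiler rather than the 2003 Phobos. It models nothing of RegExp beyond the carried-over offsets described in the post.

```d
import std.stdio;

// Index of the first occurrence of sub in s at or after start, or -1.
long findFrom(string s, string sub, size_t start)
{
    for (size_t i = start; i + sub.length <= s.length; i++)
        if (s[i .. i + sub.length] == sub)
            return cast(long) i;
    return -1;
}

// Hypothetical stand-in for a matcher that keeps its offsets between calls.
struct StaleMatcher
{
    long so = 0;                  // models pmatch[0].rm_so
    long eo = 0;                  // models pmatch[0].rm_eo
    bool resetOnNewInput = false; // models the proposed fix when true

    // Looks for "<h2>name</h2>", starting at the previous end offset.
    bool match(string input)
    {
        if (resetOnNewInput)
            so = eo = 0;
        if (eo < 0 || eo > cast(long) input.length)
        {
            so = eo = 0;          // test cannot be performed
            return false;
        }
        size_t pos = cast(size_t) eo;
        for (;;)
        {
            long open = findFrom(input, "<h2>", pos);
            if (open < 0)
                break;
            size_t nameStart = cast(size_t) open + 4;
            size_t nameEnd = nameStart;
            while (nameEnd < input.length && input[nameEnd] != '<')
                nameEnd++;
            if (nameEnd > nameStart && nameEnd + 5 <= input.length
                && input[nameEnd .. nameEnd + 5] == "</h2>")
            {
                so = open;
                eo = cast(long)(nameEnd + 5);
                return true;
            }
            pos = cast(size_t) open + 1;
        }
        so = eo = -1;             // test ran but found nothing
        return false;
    }
}

void main()
{
    string[] headers = [
        `<a name="compiler"><h2>compiler</h2></a>`,
        `<a name="conv"><h2>conv</h2></a>`,
        `<a name="ctype"><h2>ctype</h2></a>`,
        `<a name="date"><h2>date</h2></a>`,
        `<a name="file"><h2>file</h2></a>`,
        `<a name="gc"><h2>gc</h2></a>`,
    ];

    StaleMatcher stale;           // offsets carried over between inputs
    StaleMatcher fixed;           // offsets reset for every new input
    fixed.resetOnNewInput = true;

    foreach (h; headers)
    {
        bool a = stale.match(h);
        bool b = fixed.match(h);
        writefln("%s: stale=%s (so=%s eo=%s) fixed=%s",
                 h, a, stale.so, stale.eo, b);
    }
}
```

With the flag off, the model should go through the same match/fail pattern as the program output at the top of the thread (compiler and ctype match, conv, date and file fail, gc matches again); with the flag on, every header should match.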
Dec 06 2003
parent reply Achim Schmitt <Achim_member pathlink.com> writes:
Thanks for your explanation, Adam Harper!

That is not good. A language must have easy regexps to be liked by me :-).

There's 
- char[][] match(char[] string)
and
- char[][] exec(char[] string).

Besides finding a better name for exec [matchagain? :)], there's one method to
start searching and another one to continue searching.

The first should reset the search.

Will someone correct that?

Say "yes", and I'll fall in love with "D"...

AS.
Dec 07 2003
parent Adam Harper <a-news-d harper.nu> writes:
Achim Schmitt wrote:
 Thanks for your explanation, Adam Harper!
 
 That is not good. A language must have easy regexps to be liked by me :-).
 
 There's 
 - char[][] match(char[] string)
 and
 - char[][] exec(char[] string).
 
 beside finding a better name for exec [matchagain? :)] there's one method to
 start searching and another one to continue searching.

 The first should reset the search.

Agreed.
 Will someone correct that?
 Say "yes", and I'll fall in love with "D"...

I've started a new topic on this with a patch that works but has a workaround for a bug I can't identify. We'll have to see what others make of it; either way, I'm sure Walter will have it fixed shortly.
 
 AS.

Dec 07 2003