D - problems with regexps

Achim Schmitt (58/58) Dec 06 2003 hi, this is my third or fourth program in "D".

Adam Harper (69/74) Dec 06 2003 There's an issue with the current implementation of the RegExp

Achim Schmitt (12/12) Dec 07 2003 Thanks for your explanation, Adam Harper!

Adam Harper (5/21) Dec 07 2003 I've started a new topic on this with a patch that works but has a

Achim Schmitt <Achim_member pathlink.com> writes:

hi, this is my third or fourth program in "D".

I want to split phobos.html (docs) into small chunks, to get something like the
python module index.

The output shows that the filter regexp "<h2>([^<]+)</h2>" does not match
every time it should. It finds "compiler" in
"<a name="compiler"><h2>compiler</h2></a>".
But it does not find "conv" in the next module's line...


Reading complete file...
Slicing...
len = 25
num matches: 2 in: compiler
2: <a name="compiler"><h2>compiler</h2></a>
num matches: 0 in: 2: <a name="conv"><h2>conv</h2></a>
num matches: 2 in: ctype
2: <a name="ctype"><h2>ctype</h2></a>
num matches: 0 in: 2: <a name="date"><h2>date</h2></a>
num matches: 0 in: 2: <a name="file"><h2>file</h2></a>
num matches: 2 in: gc
2: <a name="gc"><h2>gc</h2></a>
..


Here is the source - what is going wrong?
[and why do I have to call the dup method of an array without "()"?]


---------snip---------------
import std.file;
import std.regexp;

int main (char[][] args)
{
    char[] splitter   = "<hr><!-- ===================================== -->";

    RegExp reSplitter   = new RegExp( splitter, "");
    RegExp reNewline    = new RegExp( "\n", "");
    RegExp rePhobosName = new RegExp( "<h2>([^<]+)</h2>", "");  // * FILTER *

    char[] phpbos_filename = "/data/tapo/dmd/html/d/phobos.html";

    printf("Reading complete file...\n");
    char[] buffer = cast(char[])std.file.read( phpbos_filename ); // :-o

    printf("Slicing...\n");

    char[][] slices = reSplitter.split( buffer );
    printf(" len = %d\n", slices.length );

    char[] line;


    for ( int i = 1; i < slices.length; i++ ) {
        char[] element = slices[ i ];
        char[][] lines = reNewline.split( element ); // split into lines
        if (lines.length > 1) {
            line = lines[1].dup;
            char[][] matches = rePhobosName.match( line );  // extract name
            printf("num matches: %d in: ", matches.length);
            if (matches.length > 1) {
                printf("%.*s\n", matches[1]);              // print name
            }
            printf( "2: %.*s\n", lines[1] );     // no name???
        }
        else {
            printf("no lines in %d.\n", i);
        }
    }

    return 0;
}

Dec 06 2003

Adam Harper <a-news-d harper.nu> writes:

On Sat, 06 Dec 2003 19:00:24 +0000, Achim Schmitt wrote:
 <snip>
 Here is the source - what is going wrong? [and why do I have to call the
 dup method of an array without "()"?]
 
 <snip>

There's an issue with the current implementation of the RegExp
class which I'll explain in a bit; first, a fix for your program.

To get the expected behaviour you'll need to move the creation of the
"rePhobosName" regular expression inside of the for loop (preferably
within the "if" branch in which it actually gets used).

Now the explanation (which turned into a somewhat long incoherent ramble
*sigh*, skip to the end for the [very] short of it).

You need to (re)create the regular expression before each test because of
the way the RegExp class works.  At present the RegExp class stores
(amongst others) two variables, "input" and "pmatch".  "input" is the string
it performed the last action on (be it match, search, etc.); "pmatch" is
an array of "regmatch_t"s (a struct containing two ints, one that
indicates the starting position of the match within "input" and one that
indicates the end).  Now, when you call "my_regex.match( somestr )" the
following happens (if the RegExp isn't configured with the global
attribute):

    - "input" gets set to "somestr"
    - "test( input, pmatch[0].rm_eo )" gets called; "pmatch[0].rm_eo" is
      the end position of the last match (its counterpart is
      "pmatch[0].rm_so", which is the start of the match)

This is fine and dandy, /if the RegExp hasn't been used before/, because
"pmatch[0].rm_eo" will be 0, thus telling "test" to start looking for a
match at the beginning of "input".  If the RegExp has been used before
then "pmatch[0].rm_eo" will be the end position of the last match /for the
previous input string/, which may be greater than the length of the
/new/ input string.  As a result, test will return 0, no match (after all
it can't find a match after the end of the string can it!).

In the context of your program, the following happens:

    - rePhobosName is initialised
      input = ""
      pmatch[0].rm_so, pmatch[0].rm_eo = 0

    - We loop through the headers from "phobos.html" trying to match each
      one against the "rePhobosName" regexp

    - We try and match '<a name="compiler"><h2>compiler</h2></a>', which
      succeeds. Now:

          input = '<a name="compiler"><h2>compiler</h2></a>'
          pmatch[0].rm_so = 19
          pmatch[0].rm_eo = 36

    - We try and match '<a name="conv"><h2>conv</h2></a>', which fails
      because the test tries to start at character 36 but the string is
      only 32 characters long! Now:

          input = '<a name="conv"><h2>conv</h2></a>'
          pmatch[0].rm_so = 0
          pmatch[0].rm_eo = 0

    - We try and match '<a name="ctype"><h2>ctype</h2></a>', which
      succeeds. Now:

          input = '<a name="ctype"><h2>ctype</h2></a>'
          pmatch[0].rm_so = 16
          pmatch[0].rm_eo = 30

    - We try and match '<a name="date"><h2>date</h2></a>', which fails
      because the test starts at character 30 and as a result only tries
      to find a match in the substring 'a>'.  Now:

          input = '<a name="date"><h2>date</h2></a>'
          pmatch[0].rm_so = -1
          pmatch[0].rm_eo = -1

      (Note: the pmatch values get set to -1 for no match but 0 when a
             test cannot be performed, as was the case with the "conv"
             header.)

    - We try and match '<a name="file"><h2>file</h2></a>', which fails
      because the start index (pmatch[0].rm_eo, which currently equals
      -1) is invalid.  Now:

          input = '<a name="file"><h2>file</h2></a>'
          pmatch[0].rm_so = 0
          pmatch[0].rm_eo = 0

The above sequence (one match followed by two failures then one match,
etc.) then repeats until we run out of headers.
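
The bookkeeping traced above can be reproduced with a small toy model. This
is only a sketch under assumptions, not the Phobos source: the struct
StickyMatcher and its fields "so" and "eo" are made-up stand-ins for the
RegExp class and pmatch[0].rm_so / pmatch[0].rm_eo, and it is written
against std.regex from current Phobos rather than the std.regexp module
discussed in this thread.

```d
import std.regex;
import std.stdio;

// Toy stand-in for the RegExp class described above; NOT the Phobos source.
// "so" and "eo" play the role of pmatch[0].rm_so and pmatch[0].rm_eo.
struct StickyMatcher
{
    Regex!char re;
    ptrdiff_t so = 0;
    ptrdiff_t eo = 0;

    // Returns true on a match. The search starts at the end offset left
    // behind by the previous call, even if that call saw a different string.
    bool match(string input)
    {
        immutable len = cast(ptrdiff_t) input.length;
        if (eo < 0 || eo > len)
        {
            // Stale start index is invalid for this input: no test is run.
            so = 0;
            eo = 0;
            return false;
        }
        auto m = matchFirst(input[cast(size_t) eo .. $], re);
        if (m.empty)
        {
            // A test was run but found nothing.
            so = -1;
            eo = -1;
            return false;
        }
        so = eo + cast(ptrdiff_t) m.pre.length;
        eo = so + cast(ptrdiff_t) m.hit.length;
        return true;
    }
}

void main()
{
    auto matcher = StickyMatcher(regex("<h2>([^<]+)</h2>"));
    foreach (name; ["compiler", "conv", "ctype", "date", "file", "gc"])
    {
        string line = `<a name="` ~ name ~ `"><h2>` ~ name ~ `</h2></a>`;
        bool ok = matcher.match(line);
        writefln("%-8s matched=%s rm_so=%s rm_eo=%s",
                 name, ok, matcher.so, matcher.eo);
    }
}
```

Run on those six headers, the model matches "compiler", "ctype" and "gc"
and misses "conv", "date" and "file", which is the same pattern as in the
output posted at the top of the thread.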

The fix for Phobos would be to reset pmatch[0].rm_so and pmatch[0].rm_eo
to 0 every time the input value is changed.
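
A minimal sketch of that reset, assuming a hypothetical matcher type of my
own (ResettingMatcher, with fields "input", "so" and "eo") rather than the
real RegExp class, and using std.regex from current Phobos in place of
std.regexp. It only illustrates the idea of clearing the offsets whenever
the input changes; it is not the actual patch.

```d
import std.regex;
import std.stdio;

// Hypothetical matcher, not the Phobos class. It keeps the same kind of
// state as described above but clears it whenever a new input arrives.
struct ResettingMatcher
{
    Regex!char re;
    string input;
    size_t so = 0;   // stand-in for pmatch[0].rm_so
    size_t eo = 0;   // stand-in for pmatch[0].rm_eo

    bool match(string s)
    {
        input = s;
        so = 0;      // the proposed fix: forget offsets from the old input
        eo = 0;
        auto m = matchFirst(input, re);
        if (m.empty)
            return false;
        so = m.pre.length;
        eo = so + m.hit.length;
        return true;
    }
}

void main()
{
    auto matcher = ResettingMatcher(regex("<h2>([^<]+)</h2>"));
    foreach (name; ["compiler", "conv", "ctype", "date", "file", "gc"])
    {
        string line = `<a name="` ~ name ~ `"><h2>` ~ name ~ `</h2></a>`;
        writefln("%-8s matched=%s", name, matcher.match(line));
    }
}
```

With the reset in place every header in the list matches, because each
search starts at offset 0 of its own input string.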

Dec 06 2003

Achim Schmitt <Achim_member pathlink.com> writes:

Thanks for your explanation, Adam Harper!

That is not good. A language must have easy regexps to be liked by me :-).

There's 
- char[][] match(char[] string)
and
- char[][] exec(char[] string).

Besides finding a better name for exec [matchagain? :)], there's one method to
start searching and another one to continue searching.

The first should reset the search.
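
A rough sketch of that split, under assumptions: Searcher and its field
"next" are hypothetical names, the methods return string[] instead of the
char[][] shown above, and the code uses std.regex from current Phobos
rather than std.regexp. Here match() always restarts at offset 0 and exec()
continues after the previous match in the same input.

```d
import std.regex;
import std.stdio;

// Hypothetical type illustrating "match resets, exec continues".
struct Searcher
{
    Regex!char re;
    string input;
    size_t next = 0;   // offset at which the next exec() call resumes

    // Start a fresh search in s; always begins at offset 0.
    string[] match(string s)
    {
        input = s;
        next = 0;
        return exec();
    }

    // Continue in the current input after the previous match.
    // Returns the whole match followed by its groups, or null when done.
    string[] exec()
    {
        if (next > input.length)
            return null;
        auto m = matchFirst(input[next .. $], re);
        if (m.empty)
            return null;
        // Advance past this match; step by one on an empty match so that
        // repeated calls cannot get stuck at the same position.
        next += m.pre.length + (m.hit.length ? m.hit.length : 1);
        string[] result;
        foreach (group; m)
            result ~= group;
        return result;
    }
}

void main()
{
    auto s = Searcher(regex("<h2>([^<]+)</h2>"));
    writeln(s.match("<h2>conv</h2><h2>ctype</h2>")); // first header
    writeln(s.exec());                               // second header
    writeln(s.exec());                               // nothing left
    writeln(s.match("<h2>date</h2>"));               // new input, fresh start
}
```

The last call succeeds even though the previous search ran to the end of a
longer string, which is the behaviour asked for here.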

Will someone correct that?

Say "yes", and I'll fall in love with "D"...

AS.

Dec 07 2003

Adam Harper <a-news-d harper.nu> writes:

Achim Schmitt wrote:
 Thanks for your explanation, Adam Harper!
 
 That is not good. A language must have easy regexps to be liked by me :-).
 
 There's 
 - char[][] match(char[] string)
 and
 - char[][] exec(char[] string).
 
 Besides finding a better name for exec [matchagain? :)], there's one method to
 start searching and another one to continue searching.

 The first should reset the search.

Agreed.

 Will someone correct that?
 Say "yes", and I'll fall in love with "D"...

I've started a new topic on this with a patch that works but has a
workaround for a bug I can't identify.  We'll have to see what others
make of it; either way, I'm sure Walter will have it fixed shortly.

 
 AS.

Dec 07 2003
