
digitalmars.D - Regular expression woes

reply just jeff <jeffrparsons optusnet.com.au> writes:
Is this a bug, or am I misunderstanding something? The code...

# import std.stdio;
# import std.regexp;
#
# int main(char[][] args) {
#     char[] string = "xfooxxxxxfoox";
#     writefln("Greedy matching:");
#     foreach (RegExp match; RegExp("x.*x").search(string))
#         writefln("%s[%s]%s", match.pre, match.match(0), match.post);
#     writefln("Conservative matching:");
#     foreach (RegExp match; RegExp("x.*?x").search(string))
#         writefln("%s[%s]%s", match.pre, match.match(0), match.post);
#     return 0;
# }

...compiled under GDC 0.21 (using the Phobos version that ships 
therewith) yields:

Greedy matching:
[xfooxxxxx]foox
Conservative matching:
[xfoox]xxxxfoox
xfoox[xx]xxfoox
xfooxxx[xx]foox

The latter part (conservative matching) makes plenty of sense to me, but 
I thought the former should have matched the whole string (i.e. read 
"[xfooxxxxxfoox]").

Is this behaviour intended?
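
For comparison, here is a minimal sketch of the same test written against std.regex from present-day Phobos, which is an assumption on my part: it is not the std.regexp module used in the code above, and allHits is a hypothetical helper named only for illustration. A greedy x.*x should report one match spanning the whole string, while x.*?x should report xfoox, xx and xx.

```d
import std.regex : matchAll, regex;
import std.stdio : writefln;

/// Returns the text of every non-overlapping match of `pattern` in `input`.
/// (Hypothetical helper for illustration; not part of Phobos.)
string[] allHits(string input, string pattern)
{
    string[] hits;
    foreach (m; matchAll(input, regex(pattern)))
        hits ~= m.hit; // m.hit is the slice of `input` covered by the whole match
    return hits;
}

void main()
{
    string subject = "xfooxxxxxfoox";
    // Greedy: .* takes as much as it can, then backs off to the last 'x'.
    writefln("Greedy:       %s", allHits(subject, "x.*x"));
    // Conservative: .*? stops at the first 'x' that completes a match.
    writefln("Conservative: %s", allHits(subject, "x.*?x"));
}
```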

Thanks. :)
Jan 17 2007
next sibling parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
When searching for x.*x in xfooxxxxxfoox, Visual Studio 2005 matches the 
entire string:

[xfooxxxxxfoox]

Also, the following two appear to be missing from "conservative matching":

xfoo[xx]xxxfoox
xfooxxxx[xfoox]

L.
Jan 17 2007
parent reply just jeff <psychobrat gmail.com> writes:
Lionello Lunesu Wrote:

 (...)
 Also, the following two appear to be missing from "conservative matching":
 
 xfoo[xx]xxxfoox
 xfooxxxx[xfoox]

I wouldn't have expected them to be found; I had thought standard regex behavior was not to find overlapping matches (i.e. to start searching again just past the end of any match it finds). I'm at work at the moment (and unfortunately without my laptop), so the only library I have available to test that on is the VBA one that comes with Access (*shudders* :P), but that doesn't find those two matches either.
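
The difference between the two scanning strategies can be sketched as follows. This is only an illustration under stated assumptions: it uses std.regex from present-day Phobos rather than std.regexp, overlappingHits is a hypothetical helper, and the input is assumed to be ASCII so that stepping one byte forward is safe. A non-overlapping search resumes just past the end of each match; the loop below instead resumes one character past the start of each match, which is how overlapping results such as the two listed above can turn up.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

/// Collects matches of `pattern`, restarting the search one character after
/// the START of each match, so later matches may overlap earlier ones.
/// (Hypothetical helper for illustration; assumes ASCII input.)
string[] overlappingHits(string input, string pattern)
{
    auto re = regex(pattern);
    string[] hits;
    size_t pos = 0;
    while (pos < input.length)
    {
        auto m = matchFirst(input[pos .. $], re);
        if (m.empty)
            break; // no further match in the remaining text
        hits ~= m.hit;
        // m.pre is the part of the slice before the match, so its length
        // is the offset of the match start relative to `pos`.
        pos += m.pre.length + 1;
    }
    return hits;
}

void main()
{
    // Includes the overlapping "xx" at offset 4 and the trailing "xfoox".
    writeln(overlappingHits("xfooxxxxxfoox", "x.*?x"));
}
```

Replacing `pos += m.pre.length + 1` with `pos += m.pre.length + m.hit.length` gives the usual non-overlapping behaviour.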
Jan 17 2007
next sibling parent just jeff <psychobrat gmail.com> writes:
I just tested Java, and it doesn't return the extra matches either.

I'll trawl through std.regexp when I get home to see if I can find what's going
on.

Any inspiration would be appreciated. I presume the default in std.regexp -is-
supposed to be a greedy match, and not some strange sort of half-way match?
Perhaps I presume too much? o_0
Jan 17 2007
prev sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"just jeff" <psychobrat gmail.com> wrote in message 
news:eom9pa$tjm$1 digitaldaemon.com...
 Lionello Lunesu Wrote:

 (...)
 Also, the following two appear to be missing from "conservative 
 matching":

 xfoo[xx]xxxfoox
 xfooxxxx[xfoox]

I wouldn't have expected them to be found; I had thought standard regex behavior was not to find overlapping matches (i.e. to start searching again just past the end of any match it finds).

VS2005 did find them, using x.@x

L.
Jan 17 2007
parent reply just jeff <jeffrparsons optusnet.com.au> writes:
 VS2005 did find them, using x.@x

Ack, I can't find any documentation on the use of "@". Funny, that; I've never had much luck with Microsoft's documentation at all... ;) Care to elaborate?
Jan 18 2007
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"just jeff" <jeffrparsons optusnet.com.au> wrote in message 
news:eopti4$lan$1 digitaldaemon.com...
 VS2005 did find them, using x.@x

Ack, I can't find any documentation on the use of "@". Funny, that; I've never had much luck with Microsoft's documentation at all... ;) Care to elaborate?

http://msdn2.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx

But, interestingly, the .NET framework uses the same .*?

http://msdn2.microsoft.com/en-us/library/3206d374(VS.80).aspx
Jan 19 2007
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Lionello Lunesu wrote:
 "just jeff" <jeffrparsons optusnet.com.au> wrote in message 
 news:eopti4$lan$1 digitaldaemon.com...
 VS2005 did find them, using x.@x

never had much luck with Microsoft's documentation at all... ;) Care to elaborate?

 http://msdn2.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx

 But, interestingly, the .NET framework uses the same .*?

 http://msdn2.microsoft.com/en-us/library/3206d374(VS.80).aspx

Looks like the .NET framework uses the "standard" syntax. The reason VS uses a different syntax is probably because it's meant to search in source code and some characters like *() etc are commonly used in C-like languages. Therefore they might be the characters searched for quite often, and excessive quoting is inconvenient. {} are probably searched for a lot less, so they are arguably better choices for meta-characters in this context.
Jan 20 2007
prev sibling parent just jeff <jeffrparsons optusnet.com.au> writes:
Could somebody confident in the way std.regexp should work please 
confirm whether or not this is a bug?
Jan 23 2007