digitalmars.D.learn - std.regex bug? My regex doesn't match what it's supposed to.

Alex Folland <lexlexlex gmail.com> writes:

I'm using std.regex from Phobos 2, which I heard was relatively new.  My 
regex is supposed to match a time to start playback in a game replay's 
file name (usually user-written).  It's very adaptive and works 
perfectly on http://regextester.com but doesn't match properly with Phobos.

I wrote a test program which displays filenames and the matched timecodes.
It's located here: http://lex.clansfx.co.uk/projects/wagametimecodes.d

The regex (might have to widen your mail client to see it properly):

((start|begin|enter|play(back)?)[\s_-]*)?(( |at|from)[\s_-]*)?(\d+([\.\-']|[\s_-]*|m(in(ute)?)?|[\s_-]*and[\s_-]*)*(s(ec(ond)?)?)?){1,3}
\_____________Ignore this part.__It works perfectly._________/\_______________This part is supposed to match time codes._______________/

The problematic string:

Guaton_at_9min59sec.WAgame

regextester.com matches "at_9min59sec" altogether perfectly, which is 
what I want to happen.
std.regex matches "at_9" and "59s", which I don't want to happen.

std.regex was matching "at_9min59s" before I changed the way it finds 
variations of "minute" and "second" from "[ms][inutecond]*" to its 
current method.  It was better before.  Now the numbers aren't even joined.

All in all, I'm pretty sure this is a std.regex bug, but I don't want to 
waste Andrei's time if it's not, since I'm not that experienced.

Feb 03 2011

Alex Folland <lexlexlex gmail.com> writes:

I figured something out, at least.  I had forgotten to use backslashes 
before the hyphens in the [...]s.  That makes the matches link together 
as expected, but it still doesn't make "s(ec(ond)?)?" match "sec" like 
it should.  It just matches "s".

For example, with std.regex, the following regex doesn't match the full 
string below it.

(\d+([\.\-\s_']|and|m(in(ute)?s?)?|s(ec(ond)?s?)?)*){1,3}
9min59sec24

It does match on http://regextester.com .  This is pretty clearly a bug 
at this point.  I don't see what else I could be doing wrong.

Feb 03 2011

Alex Folland <lexlexlex gmail.com> writes:

I figured out the bug.  Inside a set of square brackets, \s doesn't 
match whitespace.  It matches s instead.  I'm uncertain exactly how the 
ECMA-262 part 15.10 regular expression specification is meant to handle 
that situation.
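
For reference, ECMA-262 section 15.10 allows character class escapes inside brackets (the ClassEscape production includes them), so `[\s_]` should accept whitespace and an underscore but not the letter s. If `\s` were read as a literal s there, the first alternative of the regex above would consume the "s" of "sec" and leave "ec" unmatched, which would split the match in two. The following is only a minimal sketch for checking that in isolation. It assumes a more recent Phobos that provides `matchFirst`, and `classMatches` is a hypothetical helper name, not part of std.regex.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper: true if the bracketed class [\s_] matches
// the whole of `text` (anchored at both ends).
bool classMatches(string text)
{
    return !matchFirst(text, regex(`^[\s_]$`)).empty;
}

void main()
{
    // Under ECMA-262 semantics the first two should be true, the last false.
    writeln("space:      ", classMatches(" "));
    writeln("underscore: ", classMatches("_"));
    writeln("letter s:   ", classMatches("s"));
}
```

If the last line reports true, the engine is treating `\s` inside the brackets as a plain s.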

Feb 03 2011

Stanislav Blinov <blinov loniir.ru> writes:

On 03.02.2011 18:03, Alex Folland wrote:
> I figured out the bug. Inside a set of square brackets, \s doesn't
> match whitespace. It matches s instead. I'm uncertain exactly how the
> ECMA-262 part 15.10 regular expression specification is meant to
> handle that situation.

It does match for me:

foreach(m; match("a b c d e", regex("[a-z][\\s]?")))
{
    writefln("%s[%s]%s", m.pre, m.hit, m.post);
}

Feb 03 2011

Alex Folland <lexlexlex gmail.com> writes:

On 2011-02-03 10:21, Stanislav Blinov wrote:
> On 03.02.2011 18:03, Alex Folland wrote:
>> I figured out the bug. Inside a set of square brackets, \s doesn't
>> match whitespace. It matches s instead. I'm uncertain exactly how the
>> ECMA-262 part 15.10 regular expression specification is meant to
>> handle that situation.
>
> It does match for me:
>
> foreach(m; match("a b c d e", regex("[a-z][\\s]?")))
> {
>     writefln("%s[%s]%s", m.pre, m.hit, m.post);
> }

Okay, now actually try the test I suggested.  I found it was working in 
other sections too, but not in this test which has another "s" section 
it's supposed to look for.

Since it's broken, you'll see 2 matches instead of 1.

module main;

import std.stdio, std.regex;

void main()
{
    // Expected: one match covering the whole string "9min59sec24".
    foreach (m; match("9min59sec24",
            regex(`(\d+([\s_]|and|m(in(ute)?s?)?|s(ec(ond)?s?)?)*){1,3}`, "gi")))
        writefln("%s[%s]%s", m.pre, m.hit, m.post);
}

Feb 03 2011

Stanislav Blinov <blinov loniir.ru> writes:

On 03.02.2011 19:08, Alex Folland wrote:
> On 2011-02-03 10:21, Stanislav Blinov wrote:
>> On 03.02.2011 18:03, Alex Folland wrote:
>>> I figured out the bug. Inside a set of square brackets, \s doesn't
>>> match whitespace. It matches s instead. I'm uncertain exactly how the
>>> ECMA-262 part 15.10 regular expression specification is meant to
>>> handle that situation.
>>
>> It does match for me:
>>
>> foreach(m; match("a b c d e", regex("[a-z][\\s]?")))
>> {
>>     writefln("%s[%s]%s", m.pre, m.hit, m.post);
>> }
>
> Okay, now actually try the test I suggested. I found it was working
> in other sections too, but not in this test which has another "s"
> section it's supposed to look for.
>
> Since it's broken, you'll see 2 matches instead of 1.
>
> module main;
>
> import std.stdio,std.regex;
>
> void main()
> {
>    foreach(m; match("9min59sec24",
>        regex(`(\d+([\s_]|and|m(in(ute)?s?)?|s(ec(ond)?s?)?)*){1,3}`, "gi")))
>      writefln("%s[%s]%s", m.pre, m.hit, m.post);
>    return;
> }

Oh, yes, I see it now.

Feb 03 2011

Jesse Phillips <jessekphillips+D gmail.com> writes:

Alex Folland wrote:
> The problematic string:
>
> Guaton_at_9min59sec.WAgame

Might I suggest using a simpler regex? It gives the ability to do better error
checking/reporting. Instead of adding all the misspellings for minute and
second, just capture those locations as words and analyze them outside the
regex. In fact you could capture the whole time segment and use a second regex
to pull out the data:

regex(`(\d+\w+)`, "g")

Then your regex isn't updated when you need to add hours and such.
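
The two-stage idea could look roughly like the following. This is only a sketch under assumptions: it uses `matchFirst` and `matchAll` from a more recent Phobos, `timecodeSeconds` is a hypothetical helper name, and the unit handling (anything starting with "m" is minutes, everything else is seconds) is a placeholder for whatever checking is really needed.

```d
import std.conv : to;
import std.regex : matchAll, matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper: returns the timecode in seconds, or -1 if none is found.
int timecodeSeconds(string name)
{
    // Stage 1: capture the whole time segment, e.g. "9min59sec".
    auto segment = matchFirst(name, regex(`\d+\w+`));
    if (segment.empty)
        return -1;

    int total = 0;
    // Stage 2: split the segment into (number, unit word) pairs.
    foreach (m; matchAll(segment.hit, regex(`(\d+)([a-z]*)`, "i")))
    {
        immutable value = m[1].to!int;
        immutable unit = m[2];
        // The unit word is interpreted here, outside the regex, so
        // misspellings can be reported or new units added later.
        if (unit.length && (unit[0] == 'm' || unit[0] == 'M'))
            total += value * 60; // assumed: "m...", "min", "minute" mean minutes
        else
            total += value;      // assumed: anything else counts as seconds
    }
    return total;
}

void main()
{
    writeln(timecodeSeconds("Guaton_at_9min59sec.WAgame")); // prints 599
}
```

Note that `\w` also matches digits and underscores, so a file name with an earlier number (a player name ending in a digit, say) would be picked up by the first stage as well. That case would still need extra handling.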

Feb 03 2011

Alex Folland <lexlexlex gmail.com> writes:

On 2011-02-03 17:03, Jesse Phillips wrote:
> Alex Folland wrote:
>> The problematic string:
>>
>> Guaton_at_9min59sec.WAgame
>
> Might I suggest using a simpler regex? It gives the ability to do better error
> checking/reporting. Instead of adding all the misspellings for minute and
> second, just capture those locations as words and analyze them outside the
> regex. In fact you could capture the whole time segment and use a second regex
> to pull out the data:
>
> regex(`(\d+\w+)`, "g")
>
> Then your regex isn't updated when you need to add hours and such.

I finished my function today.  :)  I did end up using shorter regexes
and nested foreach loops in order to break the timecodes down.
However, the whole timecodes are still found by a relatively large and
adaptive regex, which seems to be necessary.  Thanks for the tip though.
It helped further down the line.

Feb 04 2011
