www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 9979] New: Regex bug

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979

           Summary: Regex bug
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: diggsey googlemail.com


--- Comment #0 from Diggory <diggsey googlemail.com> 2013-04-22 16:52:50 PDT ---
The following regular expression gives a zero length match for the last match
in the string:

regex(r"\b([a-zA-Z_][a-zA-Z_0-9]*)\b(\s*\([^)]*\))?", "g");

Example:
auto r = regex(r"\b([a-zA-Z_][a-zA-Z_0-9]*)\b(\s*\([^)]*\))?", "g");
string text = replace("A ? B : C", r, "0");
writeln(text);

Expected output:
"0 ? 0 : 0"

Actual output:
"0 ? 0 : 0C"

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Apr 22 2013
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979



--- Comment #1 from Diggory <diggsey googlemail.com> 2013-04-22 19:15:38 PDT ---
I've narrowed it down to what seems to be an off-by-one error in the regex
engine when a word break is encountered while the current string position is at
the end of the string.

This much simpler regex demonstrates the problem:
    auto r = regex(r"A\b", "g");

    string s = "A";
    auto m = s.match(r);

    writeln(m);

Clearly this should match "A" but it instead matches an empty string before A.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Apr 22 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979



--- Comment #2 from Diggory <diggsey googlemail.com> 2013-04-22 20:11:53 PDT ---
I've tracked it down to a flaw in the algorithm:

If the | represents the read position, of the regex:
A|B

The front character is "B" and the engine uses BackLooper to read the previous
character, and then compares the two characters to find a word boundary.

The problem is that BackLooper uses the index of the input stream as a base,
not the current read position, and the input stream is one character ahead
because "B" has already been read.

Input stream position:
AB|

This is normally not a problem because there is an off-by-one error in
BackLooper so it always reads one character further back than it should which
in most circumstances cancels out the two errors and gives the correct result
(A).

The problem occurs when at the end of the string because the input stream
position stops before going past the end.

Input stream position and read position:
AB|

In this case the two characters should be "B" and <end of string>, but because
BackLooper reads one character further back the two characters are "A" and <end
of string> missing out B entirely.

This error propagates out in the form of matching a string of zero length.

The most sensible way of fixing this would seem to be to fix the off-by-one
error in BackLooper, and make it use the read position rather than the input
stream position as a base.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Apr 22 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com


--- Comment #3 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-04-23
13:25:38 PDT ---
(In reply to comment #2)
 I've tracked it down to a flaw in the algorithm:
[snip] Yes, that's how it's working...
 
 The problem occurs when at the end of the string because the input stream
 position stops before going past the end.
 
 Input stream position and read position:
 AB|
 
 In this case the two characters should be "B" and <end of string>, but because
 BackLooper reads one character further back the two characters are "A" and <end
 of string> missing out B entirely.
 
 This error propagates out in the form of matching a string of zero length.
 
 The most sensible way of fixing this would seem to be to fix the off-by-one
 error in BackLooper, and make it use the read position rather than the input
 stream position as a base.
Nice analysis. Never liked the way this part of it was implemented... Since you went that far you can just go ahead and prepare a pull request with a fix :) Alternatively I'll get myself busy with it in a couple of days. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Apr 23 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979



--- Comment #4 from Diggory <diggsey googlemail.com> 2013-04-23 14:02:50 PDT ---
 Nice analysis. Never liked the way this part of it was implemented...
 Since you went that far you can just go ahead 
 and prepare a pull request with a fix :)
 Alternatively I'll get myself busy with it in a couple of days.
I'll try but although I've managed to track it down, the regex code is still fairly opaque to me. It's quite likely I'll break something else in the process! -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Apr 23 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979



--- Comment #5 from Diggory <diggsey googlemail.com> 2013-04-23 16:27:17 PDT ---
https://github.com/D-Programming-Language/phobos/pull/1273

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Apr 23 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979



--- Comment #6 from github-bugzilla puremagic.com 2013-05-25 23:35:20 PDT ---
Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/ebb14579ce8ba69324372623123601d5b158572e
Merge pull request #1273 from Diggsey/master

fix issue 9979 - regex back-looper bug

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 25 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9979


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
            Summary|Regex bug                   |Regex bug with \b and
                   |                            |look-behind


-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 26 2013