digitalmars.D.bugs - [Issue 3136] New: Incorrect and strange behavior of std.regexp.RegExp if using a pattern with optional prefix and suffix longer than 1 char

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3136

           Summary: Incorrect and strange behavior of std.regexp.RegExp if
                    using a pattern with optional prefix and suffix longer
                    than 1 char
           Product: D
           Version: 2.030
          Platform: x86
        OS/Version: Windows
            Status: NEW
          Severity: major
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: marcellognani gmail.com


It seems that std.regexp.RegExp gets confused when I use a pattern with an
optional prefix and an optional suffix that can each be longer than 1 char.
An expression of the form ([A]{0,2})(C)([D]{0,2}) matches all of "AC", "BC",
"CD", "CE", "ACD", "BCE", "ABCDE", "C" (as expected).
An expression of the form ([AB]{0,2})(C)([DE]{0,2}) or
([AB]?[AB]?)(C)([DE]?[DE]?) fails (incorrectly and unexpectedly) in some of the
cases above (namely "CD", "CE" and "C").

Here is the code:
---
import std.regexp;
import std.stdio;

public
{
    static void main()
    {
        RegExp eTest;
        void SetExp(string pattern)
        {
            eTest=new RegExp(pattern,"g");
            std.stdio.writeln("Testing expression ",pattern);
        }
        void TryString(string s)
        {
            std.stdio.writeln("Trying on string\"",s,"\":");
            auto captures=eTest.exec(s);
            if(captures.length)
            {
                std.stdio.writeln("Success!");
                foreach(uint i,string capture;captures)
                    std.stdio.writeln(i,"): \"",capture,"\"");
            }
            else
            {
                std.stdio.writeln("Failure!");
            }
        }
        SetExp(r"([A]{0,2})(C)([D]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]{0,2})(C)([DE]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]?[AB]?)(C)([DE]?[DE]?)");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
    }
}
---

Here is the output:
---
Testing expression ([A]{0,2})(C)([D]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"CD":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"CE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ABCDE":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"C":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"F":
Failure!
Testing expression ([AB]{0,2})(C)([DE]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
Testing expression ([AB]?[AB]?)(C)([DE]?[DE]?)
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
---

Kind regards,
Marcello Gnani

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 04 2009
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3136





--- Comment #1 from Marcello Gnani <marcellognani gmail.com>  2009-07-08
12:06:26 PDT ---
I had time to investigate further; the problem is caused by an incorrect
optimization that Phobos performs on the optional prefix.
The constructor of the RegExp object calls "public void compile(string
pattern, string attributes)", which builds a correct internal RegExp program;
then an optimization is attempted by calling the "void optimize()" function.
In this function, during the optimization of the REbit opcode (the opcode that
implements the prefix match when the prefix has more than one letter), the
optionality of the prefix is lost, leading to the incorrect behavior reported.
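
As an illustration only (this is not the Phobos code), here is a minimal
sketch of what a first-character filter does and how ignoring optionality
could break it. The Piece struct, the firstChars function and the
honorOptional flag are hypothetical names invented for this sketch; a pattern
is modelled as a plain sequence of character classes with a minimum repeat
count.

```d
import std.stdio;

/// Hypothetical, simplified pattern element (not a Phobos type).
struct Piece
{
    string chars; // characters accepted by the class, e.g. "AB" for [AB]
    uint min;     // minimum repeat count; 0 means the element is optional
}

/// Collects into `first` every character that may begin a match.
/// Returns false if the whole sequence can match the empty string, in which
/// case no first-character filter may be applied.
/// With honorOptional == false every element is treated as mandatory, which
/// models the "lost optionality" described above.
bool firstChars(const Piece[] pieces, ref bool[char] first,
                bool honorOptional = true)
{
    foreach (p; pieces)
    {
        foreach (char c; p.chars)
            first[c] = true;
        // A mandatory element ends the scan: nothing after it can come first.
        if (p.min > 0 || !honorOptional)
            return true;
        // An optional element may be skipped, so keep scanning.
    }
    return false;
}

void main()
{
    // Models ([AB]{0,2})(C)([DE]{0,2})
    auto pattern = [Piece("AB", 0), Piece("C", 1), Piece("DE", 0)];

    bool[char] good, bad;
    firstChars(pattern, good);       // collects A, B and C
    firstChars(pattern, bad, false); // collects only A and B

    writeln("correct filter admits 'C': ", ('C' in good) !is null);
    writeln("broken filter admits 'C':  ", ('C' in bad) !is null);
}
```

In this model the broken filter never lets a string starting with "C" reach
the matcher, which is consistent with "CD", "CE" and "C" failing while "AC"
and "BC" succeed.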

The simplest patch I came up with is to slightly modify the "int
starrchars(Range r, const(ubyte)[] prog)" function (which is called by
"optimize"). The current code is:
. . .
        case REnm:
        case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n   = (cast(uint *)&prog[i + 1])[1];
        m   = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        i += 1 + uint.sizeof * 3 + len;
        break;
. . .
The function should instead return 0 when the n operand of the REnm opcode is
0 (this changes the line before the break statement); this avoids inserting
the first filter that kills the optionality:
. . .
        case REnm:
        case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n   = (cast(uint *)&prog[i + 1])[1];
        m   = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        return 0;
        break;
. . .

I tried it and the test above now works.
Maybe this also solves some other regexp bugs that are still open.

Best regards,
Marcello Gnani

Jul 08 2009
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei metalanguage.com
         AssignedTo|nobody puremagic.com        |andrei metalanguage.com


Oct 11 2009
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|andrei metalanguage.com     |dmitry.olsh gmail.com


--- Comment #2 from Andrei Alexandrescu <andrei metalanguage.com> 2011-06-05
08:11:26 PDT ---
Reassigning to Dmitry.

Jun 05 2011
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #3 from Dmitry Olshansky <dmitry.olsh gmail.com> 2011-06-06
08:03:48 PDT ---
Fixed for std.regex
https://github.com/D-Programming-Language/phobos/commit/9afb00e36b625322d7f1d8ec0fbd876c2b5c03fc
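
For anyone wanting to re-check the originally failing cases against
std.regex, a minimal sketch follows. It assumes a Phobos release whose
std.regex provides matchFirst; the helper name firstCaptures is made up for
this example and is not part of the library.

```d
import std.regex;
import std.stdio;

/// Returns the whole match followed by each capture group,
/// or an empty array when the pattern does not match.
string[] firstCaptures(string pattern, string input)
{
    auto m = matchFirst(input, regex(pattern));
    string[] result;
    if (m.empty)
        return result;
    foreach (i; 0 .. m.length)
        result ~= m[i];
    return result;
}

void main()
{
    // The two patterns that failed in the original report.
    auto patterns = [r"([AB]{0,2})(C)([DE]{0,2})",
                     r"([AB]?[AB]?)(C)([DE]?[DE]?)"];

    foreach (pattern; patterns)
    {
        writeln("Testing expression ", pattern);
        foreach (input; ["CD", "CE", "C", "F"])
            writeln("  \"", input, "\" -> ", firstCaptures(pattern, input));
    }
}
```

With the fix in place one would expect "CD", "CE" and "C" to match under both
patterns and "F" to produce no match.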

Jun 06 2011