
digitalmars.D.bugs - [Issue 11765] New: std.regex: Negation of character class is not applied to base class first

d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765

           Summary: std.regex: Negation of character class is not applied
                    to base class first
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: andrej.mitrovich gmail.com


--- Comment #0 from Andrej Mitrovic <andrej.mitrovich gmail.com> 2013-12-18
11:37:06 PST ---
-----
import std.regex;
import std.stdio;

void main()
{
    // expected: [["3"]] - but got: [["2"]]
    writeln("123456789".match("[^1--[2]]"));

    // the above is *currently* equivalent to:
    writeln("123456789".match("[^[1--[2]]]"));

    // which means: subtract "1 - 2" (equals 1),
    // and then negate it (so "2" will match first in the string)

    // but I expect the first case to be equivalent to:
    writeln("123456789".match("[[^1]--[2]]"));

    // which means: negate 1 (for discussion assume 2-9 range),
    // subtract 2 and you get 3-9, which means "3" will match first.
}
-----

I'm not sure whether this is just how ECMAScript does it (since std.regex
references it), but e.g. .NET does negation on the base class first (The "1"
class above) and *then* it does subtraction with another class.

You can test this behavior here:

http://refiddle.com/

Using .net syntax:
[^01-[2]]
0123456789

It matches "3".

Either way, if this report is invalid (i.e. this is the expected behavior), then
I think we should update the docs so they state the precedence of the negation.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 18 2013
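
The two readings contrasted in comment #0 can be modeled with plain set arithmetic, independent of std.regex. The following is only an illustrative sketch: the helper names (contains, negate, subtract, firstMatch) are hypothetical, and the universe is restricted to the digits of the sample input, as the report itself assumes "for discussion".

```d
import std.stdio : writeln;

// Universe restricted to the digits of the sample input (an assumption
// made for illustration; a real character class ranges over all code points).
enum universe = "123456789";

// True if `c` is a member of `set`.
bool contains(string set, char c)
{
    foreach (char m; set)
        if (m == c) return true;
    return false;
}

// Complement of `set` relative to `universe`.
string negate(string set)
{
    string result;
    foreach (char c; universe)
        if (!contains(set, c)) result ~= c;
    return result;
}

// Set difference: members of `a` that are not in `b`.
string subtract(string a, string b)
{
    string result;
    foreach (char c; a)
        if (!contains(b, c)) result ~= c;
    return result;
}

// First character of `input` that belongs to `set`, or '\0' if none does.
char firstMatch(string input, string set)
{
    foreach (char c; input)
        if (contains(set, c)) return c;
    return '\0';
}

void main()
{
    // Reading reported as current behavior: [^[1--[2]]], subtract, then negate.
    writeln(firstMatch("123456789", negate(subtract("1", "2")))); // 2

    // Reading the reporter expected: [[^1]--[2]], negate, then subtract.
    writeln(firstMatch("123456789", subtract(negate("1"), "2"))); // 3
}
```

The only difference between the two calls is the order in which negation and subtraction are applied, which is exactly the precedence question raised in the report.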
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #1 from Andrej Mitrovic <andrej.mitrovich gmail.com> 2013-12-18
11:38:32 PST ---
(In reply to comment #0)
> Using .net syntax:
> [^01-[2]]
> 0123456789
>
> It matches "3".

Never mind the leading zero, I meant to use this simpler example:

[^1-[2]]
123456789

It matches "3".
Dec 18 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com


--- Comment #2 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-12-18
11:56:14 PST ---
(In reply to comment #0)
> I'm not sure whether this is just how ECMAScript does it (since std.regex
> references it), but e.g. .NET does negation on the base class first (The "1"
> class above) and *then* it does subtraction with another class.

ECMAScript doesn't even have it AFAIK ;)

I think you (and .NET) are right: the priority of the unary '^' operator should
be higher than that of any binary operator.
Dec 18 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #3 from Andrej Mitrovic <andrej.mitrovich gmail.com> 2013-12-19
04:55:48 PST ---
Is the following sample caused by the same issue?

writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));

It writes [["a"]], but I was expecting the first non-vowel, [["b"]]. Ruby
returns "b"; as for .NET, I haven't found the syntax it uses.

Dec 19 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #4 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-12-19
10:27:35 PST ---
(In reply to comment #1)
> (In reply to comment #0)
> > Using .net syntax:
> > [^01-[2]]
> > 0123456789
> >
> > It matches "3".
>
> Nevermind the leading zero, I meant to use this simpler example:
>
> [^1-[2]]
> 123456789
>
> It matches "3".

Actually, because of the single dash it works as if all is fine... This one is a
good test case:

[^1--[2]]
Dec 19 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #5 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-12-19
10:31:23 PST ---
(In reply to comment #3)
> Is the following sample caused by the same issue?
>
> writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));
>
> It writes [["a"]], I was expecting the first non-vowel [["b"]]. It returns "b"
> in Ruby, as for .NET I haven't found the syntax it uses.

From the look of it - an unrelated bug in set intersection. Better split it off
as a new issue.
Dec 19 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #6 from Andrej Mitrovic <andrej.mitrovich gmail.com> 2013-12-20
00:51:17 PST ---
(In reply to comment #5)
> (In reply to comment #3)
> > Is the following sample caused by the same issue?
> >
> > writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));
> >
> > It writes [["a"]], I was expecting the first non-vowel [["b"]]. It returns "b"
> > in Ruby, as for .NET I haven't found the syntax it uses.
>
> From the look of it - an unrelated bug in set intersection. Better split it off
> as a new issue.

Filed as Issue 11784.
Dec 20 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11765



--- Comment #7 from Dmitry Olshansky <dmitry.olsh gmail.com> 2014-01-10
12:24:42 PST ---
Ruby makes me nervous:

print /[^abc[e-f]&&[ybc]]/.match('~haystack')

Prints '~', meaning that the ^ operator has _lower_ priority than '&&'.

I'm surprised but it's the precedent.

And indeed the following reports empty set and warnings about '-' without
escape i.e. '--' is not supported...

print /[^1--[2]]/.match("0123456789")

re.rb:2: warning: character class has '-' without escape: /[^2--[1]]/
re.rb:2: empty range in char class: /[^2--[1]]/

> [^1-[2]]
> 123456789
>
> It matches "3".

And .NET is disappointing: [^[2]-1] doesn't match anything. They somehow
special-cased only the form [..-[set]] and arbitrary nesting of it.

So we have no good precedents. My thoughts are to make it a proper operator
precedence grammar with priorities:

0 - implicit union (pieces that stand together, evaluated first)
1 - ^ (negation)
2 - &&
3 - --
4 - || (explicit union, evaluated last)
Jan 10 2014
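
To make the priorities proposed in comment #7 concrete, here is a minimal precedence-climbing sketch. It is not the std.regex implementation: the names Set, Parser, combine and firstMatch are hypothetical, the universe is limited to ASCII, '^' is assumed to be legal only directly after '[', and there is no error handling for malformed patterns.

```d
import std.stdio : writeln;

// ASCII-only universe: an assumption that keeps the sketch short.
alias Set = bool[128];

// Binary operators from loosest to tightest binding, per the proposal.
immutable string[] ops = ["||", "--", "&&"];

Set combine(string op, Set a, Set b)
{
    Set r;
    foreach (i; 0 .. r.length)
    {
        if (op == "||") r[i] = a[i] || b[i];
        else if (op == "&&") r[i] = a[i] && b[i];
        else r[i] = a[i] && !b[i]; // "--"
    }
    return r;
}

struct Parser
{
    string src;
    size_t pos;

    // True if the unread input starts with `tok`.
    bool at(string tok)
    {
        return src.length - pos >= tok.length && src[pos .. pos + tok.length] == tok;
    }

    // class := '[' ['^'] expr ']'
    Set parseClass()
    {
        pos++; // skip '['
        immutable neg = at("^");
        if (neg) pos++;
        Set result = parseBinary(0, neg);
        pos++; // skip ']'
        return result;
    }

    // Precedence climbing; `neg` is handed only to the leftmost operand,
    // so '^' binds tighter than every binary operator.
    Set parseBinary(size_t level, bool neg)
    {
        if (level == ops.length) return parseUnion(neg);
        Set left = parseBinary(level + 1, neg);
        while (at(ops[level]))
        {
            pos += 2;
            left = combine(ops[level], left, parseBinary(level + 1, false));
        }
        return left;
    }

    // Implicit union of adjacent items, negated afterwards if requested.
    Set parseUnion(bool neg)
    {
        Set set;
        while (pos < src.length && src[pos] != ']'
               && !at("&&") && !at("--") && !at("||"))
        {
            if (src[pos] == '[')
            {
                set = combine("||", set, parseClass());
                continue;
            }
            immutable char lo = src[pos++];
            char hi = lo;
            // A single '-' between two literals forms a range such as a-z.
            if (pos + 1 < src.length && src[pos] == '-'
                && src[pos + 1] != '-' && src[pos + 1] != ']')
            {
                hi = src[pos + 1];
                pos += 2;
            }
            for (size_t ch = lo; ch <= hi; ch++) set[ch] = true;
        }
        if (neg)
            foreach (ref b; set) b = !b;
        return set;
    }
}

// First character of `input` accepted by the class `pattern`, or '\0'.
char firstMatch(string input, string pattern)
{
    auto p = Parser(pattern);
    Set set = p.parseClass();
    foreach (char c; input)
        if (c < 128 && set[c]) return c;
    return '\0';
}

void main()
{
    writeln(firstMatch("123456789", "[^1--[2]]"));                        // 3
    writeln(firstMatch("abcdefghijklmnopqrstuvwxyz", "[a-z&&[^aeiuo]]")); // b
    writeln(firstMatch("~haystack", "[^abc[e-f]&&[ybc]]"));               // y
}
```

Under these assumed rules the first pattern yields "3", as the reporter expected, while the Ruby example from comment #7 yields "y" rather than "~", because negation is applied to the implicit union before '&&' is evaluated.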