
digitalmars.D.bugs - [Issue 7551] New: Regex parsing bug for right bracket in character class

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551

           Summary: Regex parsing bug for right bracket in character class
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: magnus hetland.org



PST ---
A bug seems to have appeared in std.regex's handling of character sets. In
previous versions, a right bracket could be included in a character set by
placing it first, as is the case in many other languages/libraries. In the
current version (I'm using the canned DMD 2.058 for OS X), that doesn't work:

import std.regex;
void main() {
    auto r = regex("[]]");
}

This gives the following exception:

std.regex.RegexException /usr/share/dmd/src/phobos/std/regex.d(1951): wrong
CodepointSet
Pattern with error: `[]` <--HERE-- `]`

This should probably be permitted, as a "least surprise" practice, and to
preserve compatibility with older versions. (It doesn't seem to be explicitly
documented in the standard library docs, though. Then again, as far as I can
see, no other mechanism for including right brackets in charsets is documented
either.)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 20 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com



11:28:25 PST ---
It's perfectly fine to use escapes for special characters:

import std.regex;
void main() {
    auto r = regex("[\]]");
}

The reason for killing the "first bracket doesn't count" rule (if I ever knew
it existed) is that the new regex allows set operations, such as
[[abc0-9]--[bcd||1-9]]
Subtracting the union of [bcd] and [1-9] from [abc0-9], the above should get
you [a0]; it's more useful with \p{xxx} things.
Basically, brackets matter more now.
As for these "many other languages" (or, better, libraries): which ones? Unless
there is strong precedent, I'm not adding another special case.
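
A minimal sketch of the set-operation syntax described above, not taken from the thread. The helper name firstConsonantRun is hypothetical, and the code assumes a recent Phobos in which matchFirst is available. It uses the difference operator "--" with a nested set on the right-hand side:

```d
import std.regex;

// Hypothetical helper: returns the first run of lowercase ASCII consonants
// in `s`, or an empty string when there is none.
string firstConsonantRun(string s)
{
    // Set difference: lowercase ASCII letters minus the vowels.
    auto consonants = regex(r"[a-z--[aeiou]]+");

    // matchFirst yields the first match; .hit is the matched slice.
    auto m = matchFirst(s, consonants);
    return m.empty ? "" : m.hit;
}

void main()
{
    assert(firstConsonantRun("audio stream") == "d");
    assert(firstConsonantRun("aeiou") == "");
}
```

Because "[" opens a nested set inside a character class, a literal bracket in that position has to be escaped.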

Feb 24 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551




PST ---
It did exist in the previous version -- my code broke with the new regexp
engine, but worked before :-)

If this is a conscious choice, then that's totally fine by me. Special cases
aren't the right way to go when the general mechanism works. I had some trouble
getting this to work (did something like what you wrote here, which won't work
-- but double-escaping does, of course), so I ended up with using the
or-operator, which was kind of hackish ;-)
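
A small sketch of the escaping point made above, assuming a recent Phobos with matchFirst; the helper name hasRightBracket is hypothetical. Written as "[\]]" in an ordinary double-quoted literal, the pattern is rejected by the D compiler as an undefined escape sequence before std.regex ever sees it, so either the backslash is doubled or the literal is made a raw (WYSIWYG) string:

```d
import std.regex;

// Hypothetical helper: true when `s` contains a right bracket, checked with
// both spellings of the same pattern.
bool hasRightBracket(string s)
{
    // Raw string: the backslash reaches the regex engine untouched.
    auto raw = regex(r"[\]]");

    // Ordinary string: the backslash is doubled so the lexer emits one.
    auto doubled = regex("[\\]]");

    return !matchFirst(s, raw).empty && !matchFirst(s, doubled).empty;
}

void main()
{
    assert(hasRightBracket("a]b"));
    assert(!hasRightBracket("a[b"));
}
```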

So, yeah, I guess I "retract" my bug report :->

As for other languages: Yeah, I think this is pretty common, e.g., in Python
(http://docs.python.org/library/re.html) and in Perl and Perl-compatible
regexps, as used in all kinds of places, such as PHP, Apache, Safari, …
(http://www.php.net/manual/en/regexp.reference.character-classes.php).

So I think the "place member end brackets as first character" is the "industry
standard" behavior.

But as a compromise: Perhaps a useful error message pointing out the escape
thing could be added? Or it could be explicitly pointed out in a note in the
documentation (to avoid special-casing the error code)?

I think some kind of "least surprise" handling for people coming from basically
anywhere else might be useful ;-)

Feb 27 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551




PST ---
This whole thing goes for start brackets, too, I guess. As far as I can see,
they, too, must be escaped when used inside character classes, now. This
follows from the definition in the docs, for sure, but wasn't entirely obvious
to me -- especially given that it worked before. (I.e., that was another thing
that broke in my code recently, when upgrading.)
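
As an illustration of the point about start brackets, here is a sketch that escapes both bracket characters inside one character class. The helper name stripBrackets is hypothetical, and replaceAll is assumed to be available, as it is in recent Phobos releases:

```d
import std.regex;

// Hypothetical helper: removes every '[' and ']' from `s`.
string stripBrackets(string s)
{
    // Both brackets are escaped inside the class; a raw string keeps the
    // backslashes intact for the regex engine.
    auto anyBracket = regex(r"[\[\]]");
    return replaceAll(s, anyBracket, "");
}

void main()
{
    assert(stripBrackets("a[1][2]") == "a12");
    assert(stripBrackets("plain") == "plain");
}
```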

Feb 27 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement



02:36:06 PST ---
Full backwards compatibility looked like a nice idea at the start.
I increasingly regret that decision, as things still got broken when I had to
add new features that block some undocumented behavior.

Ehm, escape sequences were partly broken in 2.057 ... sorry about that.

BTW this page shows that [ and ] should be escaped, and not a single word on it
used as first character (unlike '-' that is supported).
http://www.php.net/manual/en/regexp.reference.character-classes.php

About Python, heh, I'm eager to see how they would go about adding set
operations without breaking compatibility (they count [ as a plain '[' in the
middle of a charset). I guess it would take a brand new module, if they ever do.

 
> But as a compromise: Perhaps a useful error message pointing out the escape
> thing could be added? Or it could be explicitly pointed out in a note in the
> documentation (to avoid special-casing the error code)?
>
> I think some kind of "least surprise" handling for people coming from basically
> anywhere else might be useful ;-)

Hm.. that's a good idea. Hereby it's an enhancement request ;)
Feb 27 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551




PST ---
Quoting Dmitry:
> BTW this page shows that [ and ] should be escaped, and not a single word on it
> used as first character (unlike '-' that is supported).
> http://www.php.net/manual/en/regexp.reference.character-classes.php

Huh? Did you read the first paragraph…?-) Quoted, for your convenience (my
highlight):

> An opening square bracket introduces a character class, terminated by a
> closing square bracket. A closing square bracket on its own is not special.
> **If a closing square bracket is required as a member of the class, it should
> be the first data character in the class** (after an initial circumflex, if
> present) or escaped with a backslash.

It says so right there, no? This is the way it's been in several languages I've
used throughout the years. I guess they just didn't have escaping inside
character classes in the olden days ;-)
Feb 27 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7551




05:18:12 PST ---

> Quoting Dmitry:
>> BTW this page shows that [ and ] should be escaped, and not a single word on it
>> used as first character (unlike '-' that is supported).
>> http://www.php.net/manual/en/regexp.reference.character-classes.php
> Huh? Did you read the first paragraph…?-)

Searching gets the better of me :( I 'grepped' for "["

> Quoted, for your convenience (my highlight):
>> An opening square bracket introduces a character class, terminated by a
>> closing square bracket. A closing square bracket on its own is not special.
>> **If a closing square bracket is required as a member of the class, it should
>> be the first data character in the class** (after an initial circumflex, if
>> present) or escaped with a backslash.
> It says so right there, no? This is the way it's been in several languages I've
> used throughout the years. I guess they just didn't have escaping inside
> character classes in the olden days ;-)

Apparently it's one of those historical things.
Feb 27 2012