digitalmars.D.bugs - [Issue 7551] New: Regex parsing bug for right bracket in character class
- d-bugmail puremagic.com (33/33) Feb 20 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (23/23) Feb 24 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (24/24) Feb 27 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (10/10) Feb 27 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (20/27) Feb 27 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (11/18) Feb 27 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
- d-bugmail puremagic.com (8/24) Feb 27 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7551
http://d.puremagic.com/issues/show_bug.cgi?id=7551

           Summary: Regex parsing bug for right bracket in character class
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: magnus hetland.org

--- Comment #0 from Magnus Lie Hetland <magnus hetland.org> 2012-02-20 03:06:36 PST ---
It seems that a bug has appeared for charsets in the std.regex. In previous
versions, a right bracket could be included in a character set by placing it
first, as is the case in many other languages/libraries. In the current version
(I'm using the canned DMD 2.058 for OS X), that doesn't work:

    import std.regex;

    void main() {
        auto r = regex("[]]");
    }

This gives the following exception:

    std.regex.RegexException /usr/share/dmd/src/phobos/std/regex.d(1951): wrong CodepointSet
    Pattern with error: `[]` <--HERE-- `]`

This should probably be permitted, as a "least surprise" practice, and to
preserve compatibility with older versions.

(It doesn't seem to be explicitly documented in the standard library docs,
though. Then again, as far as I can see, no other mechanism for including right
brackets in charsets is documented either.)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 20 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551

Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com

--- Comment #1 from Dmitry Olshansky <dmitry.olsh gmail.com> 2012-02-24 11:28:25 PST ---
It is perfectly fine to use escapes for special characters:

    import std.regex;

    void main() {
        auto r = regex("[\]]");
    }

The reason for killing the "first bracket doesn't count" rule (if I ever knew it
existed) is that the new regex allows doing things like
[[abc0-9]--[bcd||1-9]]
i.e. set operations; the above should get you [bc0]. It's more useful with
\p{xxx} things. Basically, brackets matter more now.

But these "many other languages" (or better, libraries): which ones?
Unless there is strong precedent, I'm not doing another special case.
Feb 24 2012
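One caveat about the snippet in comment #1: in an ordinary double-quoted D string literal, `\]` is not a valid escape sequence, so the pattern has to be written either as a raw (WYSIWYG) string or with the backslash doubled. A minimal sketch of both spellings follows, assuming a current Phobos where `matchFirst` is available (the 2.058-era API used `match`). The helper names `hasCloseBracket` and `hasCloseBracketEscaped` are hypothetical and exist only for illustration.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: true if `text` contains a literal ']'.
// The r"..." raw string passes the backslash to the regex engine untouched.
bool hasCloseBracket(string text)
{
    auto re = regex(r"[\]]");
    return !matchFirst(text, re).empty;
}

// Same pattern as an ordinary string literal: the backslash must be doubled
// so that the regex engine still receives `[\]]`.
bool hasCloseBracketEscaped(string text)
{
    auto re = regex("[\\]]");
    return !matchFirst(text, re).empty;
}

void main()
{
    writeln(hasCloseBracket("a]b"));
    writeln(hasCloseBracket("abc"));
    writeln(hasCloseBracketEscaped("x]"));
}
```

Raw strings are probably the less error-prone choice for regex patterns, since the text in the source is exactly what the regex parser sees.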
http://d.puremagic.com/issues/show_bug.cgi?id=7551

--- Comment #2 from Magnus Lie Hetland <magnus hetland.org> 2012-02-27 00:44:59 PST ---
It did exist in the previous version -- my code broke with the new regexp
engine, but worked before :-)

If this is a conscious choice, then that's totally fine by me. Special cases
aren't the right way to go when the general mechanism works. I had some trouble
getting this to work (did something like what you wrote here, which won't work
-- but double-escaping does, of course), so I ended up with using the
or-operator, which was kind of hackish ;-)

So, yeah, I guess I "retract" my bug report :->

As for other languages: Yeah, I think this is pretty common. E.g., Python
(http://docs.python.org/library/re.html) and in Perl and Perl-compatible
regexps, as used in all kinds of places, such as PHP, Apache, Safari, …
(http://www.php.net/manual/en/regexp.reference.character-classes.php).

So I think the "place member end brackets as first character" is the "industry
standard" behavior.

But as a compromise: Perhaps a useful error message pointing out the escape
thing could be added? Or it could be explicitly pointed out in a note in the
documentation (to avoid special-casing the error code)? I think some kind of
"least surprise" handling for people coming from basically anywhere else might
be useful ;-)
Feb 27 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551

--- Comment #3 from Magnus Lie Hetland <magnus hetland.org> 2012-02-27 00:51:18 PST ---
This whole thing goes for start brackets, too, I guess. As far as I can see,
they, too, must be escaped when used inside character classes, now. This
follows from the definition in the docs, for sure, but wasn't entirely obvious
to me -- especially given that it worked before. (I.e., that was another thing
that broke in my code recently, when upgrading.)
Feb 27 2012
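The point in comment #3 about opening brackets can be sketched the same way: inside a character class, both `[` and `]` are escaped. This is an illustrative sketch only, assuming a current Phobos with `matchAll`; the helper name `countBrackets` is hypothetical.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: counts the '[' and ']' characters in `text`.
// Both brackets are escaped inside the class, since an unescaped '['
// would start a nested set under the new engine's set-operation syntax.
size_t countBrackets(string text)
{
    auto re = regex(r"[\[\]]");
    size_t n = 0;
    foreach (m; matchAll(text, re))
        ++n;
    return n;
}

void main()
{
    writeln(countBrackets("a[0] = b[1];"));
    writeln(countBrackets("no brackets here"));
}
```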
http://d.puremagic.com/issues/show_bug.cgi?id=7551

Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #4 from Dmitry Olshansky <dmitry.olsh gmail.com> 2012-02-27 02:36:06 PST ---
Full backwards compatibility looked like a nice idea at the start. I
increasingly regret that decision, as things still got broken when I had to add
new features that block some undocumented behavior. Ehm, escape sequences were
partly broken in 2.057 ... sorry about that.

BTW this page shows that [ and ] should be escaped, and not a single word on it
being used as the first character (unlike '-', which is supported).
http://www.php.net/manual/en/regexp.reference.character-classes.php

About Python, heh, I'm eager to see how they would go about adding set
operations without breaking compatibility (they count [ as a plain '[' in the
middle of a charset). I guess a brand new module, if they ever do.

> But as a compromise: Perhaps a useful error message pointing out the escape
> thing could be added? Or it could be explicitly pointed out in a note in the
> documentation (to avoid special-casing the error code)? I think some kind of
> "least surprise" handling for people coming from basically anywhere else might
> be useful ;-)

Hm.. that's a good idea. Hereby it's an enhancement request ;)
Feb 27 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551

--- Comment #5 from Magnus Lie Hetland <magnus hetland.org> 2012-02-27 03:18:50 PST ---
Quoting Dmitry:

> BTW this page shows that [ and ] should be escaped, and not a single word on it
> used as first character (unlike '-' that is supported).
> http://www.php.net/manual/en/regexp.reference.character-classes.php

Huh? Did you read the first paragraph…?-)

Quoted, for your convenience (my highlight):

> An opening square bracket introduces a character class, terminated by a closing
> square bracket. A closing square bracket on its own is not special. **If a
> closing square bracket is required as a member of the class, it should be the
> first data character in the class** (after an initial circumflex, if present)
> or escaped with a backslash.

It says so right there, no? This is the way it's been in several languages I've
used throughout the years. I guess they just didn't have escaping inside
character classes in the olden days ;-)
Feb 27 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551

--- Comment #6 from Dmitry Olshansky <dmitry.olsh gmail.com> 2012-02-27 05:18:12 PST ---
(In reply to comment #5)
> Quoting Dmitry:
>
> > BTW this page shows that [ and ] should be escaped, and not a single word on it
> > used as first character (unlike '-' that is supported).
> > http://www.php.net/manual/en/regexp.reference.character-classes.php
>
> Huh? Did you read the first paragraph…?-)

Searching gets the better of me :( I 'grepped' for "["

> Quoted, for your convenience (my highlight):
>
> > An opening square bracket introduces a character class, terminated by a closing
> > square bracket. A closing square bracket on its own is not special. **If a
> > closing square bracket is required as a member of the class, it should be the
> > first data character in the class** (after an initial circumflex, if present)
> > or escaped with a backslash.
>
> It says so right there, no? This is the way it's been in several languages I've
> used throughout the years. I guess they just didn't have escaping inside
> character classes in the olden days ;-)

Apparently it's one of these historical kind of things.
Feb 27 2012