digitalmars.D - Poll of the week: How should std.regex handle unknown escape

"Marco Leise" <Marco.Leise gmx.de> writes:
http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2
Dec 02 2011
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, December 02, 2011 23:33:34 Marco Leise wrote:
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

Why wouldn't std.regex accept an escaped sequence such as "\."? I thought that the whole point of something like "\." was to make it so that you could use "." directly in spite of the fact that it means something special in regexes. Or is it something special to do with the fact that it's between brackets? I'd still have thought that it would just escape it, since it _is_ an escape sequence. Or is it that the escape sequence isn't necessary in between the brackets, and so the question is how to handle it, since it isn't necessary?

- Jonathan M Davis
Dec 02 2011
Jesse Phillips <jessekphillips+d gmail.com> writes:
On Fri, 02 Dec 2011 17:59:59 -0500, Jonathan M Davis wrote:

 On Friday, December 02, 2011 23:33:34 Marco Leise wrote:
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Why wouldn't std.regex accept an escaped sequence such as "\."? I thought that the whole point of something like "\." was to make it so that you could use "." directly in spite of the fact that it means something special in regexes. Or is it something special to do with the fact that it's between brackets? I'd still have thought that it would just escape it, since it _is_ an escape sequence. Or is it that the escape sequence isn't necessary in between the brackets, and so the question is how to handle it, since it isn't necessary? - Jonathan M Davis

Brackets being a character class, the dot is used literally there. So in this case, was it meant to be [\\.] or [.]?
Dec 02 2011
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, December 03, 2011 02:35:21 Jesse Phillips wrote:
 On Fri, 02 Dec 2011 17:59:59 -0500, Jonathan M Davis wrote:
 On Friday, December 02, 2011 23:33:34 Marco Leise wrote:
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Why wouldn't std.regex accept an escaped sequence such as "\."? I thought that the whole point of something like "\." was to make it so that you could use "." directly in spite of the fact that it means something special in regexes. Or is it something special to do with the fact that it's between brackets? I'd still have thought that it would just escape it, since it _is_ an escape sequence. Or is it that the escape sequence isn't necessary in between the brackets, and so the question is how to handle it, since it isn't necessary? - Jonathan M Davis

 Brackets being a character class, the dot is used literally there. So in this case, was it meant to be [\\.] or [.]?

Well, then if \. is not legal, I'd expect a static assertion failure or a template constraint failure if the string were given as a compile-time argument and an exception if it were given as a runtime argument. - Jonathan M Davis
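A rough sketch of the runtime half of that expectation, assuming the current Phobos std.regex API, where `regex` reports a malformed pattern by throwing. The unclosed group "(" is only a stand-in for any pattern the engine rejects; whether "[\.]" belongs in that category is exactly what the poll asks. The commented-out `ctRegex` line is an assumption about how the compile-time variant would surface the same problem.

```d
import std.exception : assertThrown;
import std.regex : regex;

void main()
{
    // Pattern supplied at run time: a malformed pattern is reported by throwing.
    // "(" (an unclosed group) is a stand-in for "any pattern the engine rejects".
    assertThrown(regex("("));

    // A well-formed pattern is constructed without throwing.
    auto ok = regex(r"\.");

    // Assumption: with the pattern as a template argument, the same parse error
    // would be expected to stop compilation instead of throwing at run time.
    // enum bad = ctRegex!"(";
}
```

The design point is that the same pattern text is rejected in both cases; only the moment of failure moves from run time to compile time.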
Dec 02 2011
Xinok <xinok live.com> writes:
On 12/2/2011 5:33 PM, Marco Leise wrote:
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

I prefer that regexp engines are as consistent as possible. Everything I tested accepts this as a valid regular expression, so I think std.regex should as well.
Dec 02 2011
Kagamin <spam here.lot> writes:
Marco Leise Wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

Erm... but "\." is a perfectly known escape sequence, so the question should be "How should std.regex handle known escape sequences as in: "[\.]"".
Dec 03 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/3/11 6:54 AM, Kagamin wrote:
 Marco Leise Wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Erm... but "\." is a perfectly known escape sequence, so the question should be "How should std.regex handle known escape sequences as in: "[\.]"".

The dot inside a character set must not be escaped. Andrei
Dec 03 2011
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03.12.2011 19:48, Andrei Alexandrescu wrote:
 On 12/3/11 6:54 AM, Kagamin wrote:
 Marco Leise Wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Erm... but "\." is a perfectly known escape sequence, so the question should be "How should std.regex handle known escape sequences as in: "[\.]"".

 The dot inside a character set must not be escaped. Andrei

And that breaks ehm ... e.g. rdmd ;) Anyhow, I'm trying to pick a reasonable rule. 100% compatibility with the old regex would mean ignoring '\' where it's not applicable. My only concern with that is future extensibility via \<character>.

-- Dmitry Olshansky
Dec 03 2011
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03.12.2011 16:54, Kagamin wrote:
 Marco Leise Wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Erm... but "\." is a perfectly known escape sequence, so the question should be "How should std.regex handle known escape sequences as in: "[\.]"".

Let's clarify this a bit. Well, \. is more or less common outside of []. The question is more like: inside character classes [], should every \<something> be treated as plain <something> (ignoring the \) if it's not a known escape sequence like \w, \d, \uXXXX, \W, \cA-\cZ and so on?
Dec 03 2011
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03.12.2011 21:00, Vladimir Panteleev wrote:
 On Sat, 03 Dec 2011 17:51:13 +0200, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 treat every \<something> as plain <something> (ignoring \) inside
 character classes [] if it's not a known escape sequence like \w, \d,
 \uXXXX, \W, \cA -\cZ and so on.

 I think the common intuitive rules regarding escapes in regexes are as follows:

 1) Unescaped punctuation usually has special meaning (so people often escape all punctuation literals)
 2) Unescaped letters are literal
 3) Escaped punctuation is literal
 4) Escaped letters have special meaning

 Therefore, I think that std.regex should throw on unrecognized *letter*
 escapes. It's very likely that the user might be trying to use a
 character class or feature from another regex engine, but unsupported by
 std.regex.

-- Dmitry Olshansky
Dec 03 2011
David Nadlinger <see klickverbot.at> writes:
On 12/3/11 6:00 PM, Vladimir Panteleev wrote:
 I think the common intuitive rules regarding escapes in regexes are as
 follows:

 1) Unescaped punctuation usually has special meaning (so people often
 escape all punctuation literals)

I am only a causal user of regexen, and I agree that this is what seems intuitive to me – in fact, I just used [^\(] to match non-brackets in my editor today. David
Dec 03 2011
David Nadlinger <see klickverbot.at> writes:
On 12/3/11 8:14 PM, David Nadlinger wrote:
 I am only a causal user […]

Oh well, typing is hard.
Dec 03 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/3/11 9:51 AM, Dmitry Olshansky wrote:
 On 03.12.2011 16:54, Kagamin wrote:
 Marco Leise Wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 Erm... but "\." is a perfectly known escape sequence, so the question should be "How should std.regex handle known escape sequences as in: "[\.]"".

 Let's clarify this a bit. Well, \. is more or less common outside of []. The question is more like: inside character classes [], should every \<something> be treated as plain <something> (ignoring the \) if it's not a known escape sequence like \w, \d, \uXXXX, \W, \cA-\cZ and so on?

Probably this is not a place to get innovative. Let's do what gramps Perl does. Andrei
Dec 03 2011
Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-02 22:33:34 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

I'd say, go with what other engines are doing. PCRE accepts them. I think POSIX does not, but does not allow any escaping inside a character class either. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 03 2011
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sat, 03 Dec 2011 17:51:13 +0200, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 treat every \<something> as plain <something> (ignoring \) inside  
 character classes [] if it's not a known escape sequence like \w, \d,  
 \uXXXX, \W, \cA -\cZ and so on.

I think the common intuitive rules regarding escapes in regexes are as follows:

1) Unescaped punctuation usually has special meaning (so people often escape all punctuation literals)
2) Unescaped letters are literal
3) Escaped punctuation is literal
4) Escaped letters have special meaning

Therefore, I think that std.regex should throw on unrecognized *letter* escapes. It's very likely that the user might be trying to use a character class or feature from another regex engine, but unsupported by std.regex.

-- Best regards, Vladimir mailto:vladimir thecybershadow.net
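The four rules and the proposed error case can be sketched as a small classifier. This is only an illustration of the rule set, not how std.regex parses patterns: `Meaning`, `classify`, and the `known` list of letter escapes are hypothetical names and values chosen for the example, and rule 1 is simplified to treat all unescaped punctuation as special.

```d
import std.ascii : isAlpha, isPunctuation;

/// How one pattern character would be read under the four rules.
enum Meaning { literal, special, error }

/// Hypothetical helper. `known` is a stand-in for the letter escapes an
/// engine supports (\w, \d, \W, \uXXXX, \cA and so on).
Meaning classify(char c, bool escaped, string known = "wWdDuc")
{
    import std.algorithm.searching : canFind;

    if (isPunctuation(c))
        return escaped ? Meaning.literal : Meaning.special; // rules 3 and 1
    if (isAlpha(c))
    {
        if (!escaped)
            return Meaning.literal;                          // rule 2
        // Rule 4, plus the proposal: an unrecognized letter escape is an error.
        return known.canFind(c) ? Meaning.special : Meaning.error;
    }
    return Meaning.literal; // digits etc. are outside the four rules; kept literal here
}

void main()
{
    assert(classify('.', false) == Meaning.special); // rule 1
    assert(classify('a', false) == Meaning.literal); // rule 2
    assert(classify('.', true)  == Meaning.literal); // rule 3
    assert(classify('w', true)  == Meaning.special); // rule 4
    assert(classify('q', true)  == Meaning.error);   // unknown letter escape
}
```

Under this scheme "\." stays valid everywhere, while something like "\q" fails loudly and remains free to be given a meaning later.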
Dec 03 2011
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sat, 03 Dec 2011 19:00:44 +0200, Vladimir Panteleev  
<vladimir thecybershadow.net> wrote:

 escapes

For the context of my post, I meant "escaped" as "prefixed by a backslash". -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Dec 03 2011
"Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 02 Dec 2011 23:33:34 +0100, Marco Leise <Marco.Leise gmx.de> wrote:

 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

auto s = "[\.]"; => Error: undefined escape sequence. Do you actually mean r"[\.]" or `[\.]`?
Dec 04 2011
Jesse Phillips <jessekphillips+d gmail.com> writes:
On Sun, 04 Dec 2011 19:53:59 +0100, Martin Nowak wrote:

 On Fri, 02 Dec 2011 23:33:34 +0100, Marco Leise <Marco.Leise gmx.de>
 wrote:
 
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

 auto s = "[\.]"; => Error: undefined escape sequence. Do you actually mean r"[\.]" or `[\.]`?

He was referring to invalid regular expression escape sequences, so yes, r"[\.]" would build a valid string but an invalid regex (or rather, one whose validity is still to be determined).
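To keep the two layers apart, here is a minimal sketch of the string-literal side only. It shows which D literals carry the four characters `[\.]` through to the regex engine unchanged; it says nothing about how std.regex then treats them.

```d
void main()
{
    // auto s = "[\.]";        // rejected by the compiler: undefined escape sequence \.
    auto raw      = r"[\.]";   // WYSIWYG string: the backslash is kept as-is
    auto backtick = `[\.]`;    // alternate WYSIWYG syntax, same contents
    auto doubled  = "[\\.]";   // ordinary literal with the backslash itself escaped

    assert(raw == backtick && raw == doubled);
    assert(raw.length == 4);   // '[', '\', '.', ']'
}
```

All three spellings hand the same pattern text to the engine, so the poll's question only begins once the string already exists.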
Dec 04 2011
Walter Bright <newshound2 digitalmars.com> writes:
On 12/2/2011 2:33 PM, Marco Leise wrote:
 http://www.easypolls.net/poll.html?p=4ed9478e4fb7b0e4886eeea2

In general, behavior for things we don't know what to do with should be "failure". Then, when we do figure out what to do with it, we don't break existing code.
Dec 04 2011