digitalmars.D.learn - Issues with std.regex

"MrAppleseed" <email email.com> writes:
Hey all,

I'm currently trying to port a small toy language that I wrote in 
Java a while back to D. However, a main part of my lexical 
analyzer was regular expression matching, which I've been having 
issues with in D. The regular expression in question is as follows:

[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]

This works well enough in Java to produce a series of tokens that 
I could then pass to my parser. But when I tried to port this 
into D, I almost always get an error when using brackets, braces, 
or parentheses. I've tried several different combinations, have 
looked through the std.regex library reference, have Googled this 
issue, have tested my regular expression in several online regex 
testers (primarily http://regexpal.com/, and 
http://regexhelper.com/), and have even looked it up in the book, 
"The D Programming Language" (good book, by the way), yet I still 
can't get it working right. Here's the code I've been using:

...
auto tempCont = cast(char[])read(location, fileSize);
string contents = cast(string)tempCont;
auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
auto m = match(contents, reg);
auto token = m.captures;
...

When I try to run the code above, I get:
parser.d(64): Error: undefined escape sequence \[
parser.d(64): Error: undefined escape sequence \]

When I remove the escaped characters (turning my regex into
"[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues 
compiling or linking. However, on first run, I get the following 
error (I cut the error short, full error is pasted 
http://pastebin.com/vjMhkx4N):

std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): 
wrong CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- 
`<>,{}^#/\]`

I'm very confused about what to do, and much of the information in 
the library reference seems to contradict what I'm doing. Any 
help would be greatly appreciated!

Thanks!
~Mr. Appleseed

Additional information:

OS/Compiler information:
Ubuntu 12.10 x64
DMD64 D Compiler v2.061

Compiled with:
dmd main.d parser.d
Feb 16 2013
FG <home fgda.pl> writes:
On 2013-02-16 21:22, MrAppleseed wrote:
 auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");

 When I try to run the code above, I get:
 parser.d(64): Error: undefined escape sequence \[
 parser.d(64): Error: undefined escape sequence \]

 When I remove the escaped characters (turning my regex into
 "[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or linking.
 However, on first run, I get the following error (I cut the error short, full
 error is pasted http://pastebin.com/vjMhkx4N):

 std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): wrong
 CodepointSet
 Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\]`
Perhaps try this: "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"
Feb 16 2013
"MrAppleseed" <email email.com> writes:
On Saturday, 16 February 2013 at 20:33:15 UTC, FG wrote:
 Perhaps try this: "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"

Hey,

Thanks for the reply! You guys are quite the friendly people. :)

I made the changes you suggested above, and although it compiled 
fine, on the first run I got a similar error:

std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): 
unexpected end of CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` <--HERE-- ``

(Full error is here: http://pastebin.com/rTmHuVjG)
Feb 16 2013
"jerro" <a a.com> writes:
 Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` 
 <--HERE-- ``

The problem here is that you have \ right before the ] at the end 
of the string. Because it is preceded by \, the ] is interpreted as 
a character you are matching on, not as the closing bracket for the 
initial [. If you want to match \ you need this:

[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]
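
A minimal sketch of this point, not taken from the thread. The short patterns are made up for illustration, and the sketch assumes a Phobos version that provides matchFirst and exposes RegexException through std.regex. With a single backslash the ] is read as a literal, so the class never closes and regex() is expected to throw; with the backslash doubled, the class closes and matches a backslash.

```d
import std.exception : assertThrown;
import std.regex : regex, matchFirst, RegexException;

void main()
{
    // `\]` is an escaped literal ']', so the class opened by '[' is never closed.
    assertThrown!RegexException(regex(`[a\]`));

    // `\\` is an escaped backslash, so the final ']' closes the class.
    auto ok = regex(`[a\\]`);
    assert(!matchFirst(`\`, ok).empty); // a lone backslash is in the class
    assert(matchFirst("b", ok).empty);  // 'b' is not
}
```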
Feb 16 2013
"MrAppleseed" <email email.com> writes:
On Saturday, 16 February 2013 at 21:58:23 UTC, jerro wrote:
 If you want to match \ you need this:
 [ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]

Sorry for the delay in response.

As you can read in the original post, I have tried the suggestions 
in both of your comments (I couldn't figure out how to reply to 
both, unfortunately), and both caused errors. The code you 
suggested is one of the first things I tried:

auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");

yet I still got that error. I believe that the regex engine changed 
the "\\" into a single backslash "\", which is what is displayed in 
the error you quoted.
Feb 20 2013
"MrAppleseed" <email email.com> writes:
Hello to everyone, and thank you for your help!

Sorry for the delay in response, as I was busy with family 
matters. However, upon returning today, and with everyone's help, 
I have successfully gotten it to work. The code below worked out 
swimmingly:


auto reg = regex(`[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]`);
auto m = match(contents, reg);
auto token = m.captures;

Once again, thank you all for your help! :)
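
A hedged sketch of how the working pattern might be used to walk over every token rather than only the first match. It assumes a Phobos version that provides matchAll (the thread itself uses match), and the sample input string is hypothetical, standing in for the file contents.

```d
import std.regex : regex, matchAll;
import std.stdio : writeln;

void main()
{
    // The pattern from the post above, in a backtick (WYSIWYG) literal.
    auto reg = regex(`[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]`);

    // Hypothetical input; the original code reads this from a file.
    string contents = "a[0] = 1;";

    size_t count = 0;
    foreach (m; matchAll(contents, reg))
    {
        writeln(m.hit); // each hit is one character accepted by the class
        ++count;
    }

    // Every character of the sample input belongs to the class.
    assert(count == contents.length);
}
```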
Feb 20 2013
FG <home fgda.pl> writes:
On 2013-02-16 22:36, MrAppleseed wrote:
 Perhaps try this:  "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"

 I made the changes you suggested above, and although it compiled fine, on the
 first run I got a similar error:

 std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): unexpected
 end of CodepointSet
 Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` <--HERE-- ``

Ah, right. Sorry for that. You'd need as many as 4 backslashes there. :)

"[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\\\]"

Ain't pretty, so it's better to go with raw strings, but apparently there are
some problems with them right now, looking at the other posts here, right?
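
To make the two layers of escaping concrete, here is a small sketch (the shortened pattern is made up for illustration, and matchFirst is assumed to be available in the Phobos in use). The double-quoted literal needs every regex backslash doubled, so an escaped backslash in the regex becomes four backslashes in the source; the backtick literal holds the same text as typed.

```d
import std.regex : regex, matchFirst;

void main()
{
    // Double-quoted: the compiler turns \\[ into \[ and \\\\ into \\.
    string quoted = "[\\[\\]\\\\]";
    // Backtick (WYSIWYG): the regex engine sees exactly what is typed.
    string raw = `[\[\]\\]`;

    assert(quoted == raw);
    assert(raw.length == 8);

    auto re = regex(raw);
    assert(!matchFirst("[", re).empty); // escaped bracket matches '['
    assert(!matchFirst(`\`, re).empty); // escaped backslash matches '\'
}
```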
Feb 16 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Feb-2013 01:36, MrAppleseed wrote:
 On Saturday, 16 February 2013 at 20:33:15 UTC, FG wrote:
 On 2013-02-16 21:22, MrAppleseed wrote:
 auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");

 When I try to run the code above, I get:
 parser.d(64): Error: undefined escape sequence \[
 parser.d(64): Error: undefined escape sequence \]

 When I remove the escaped characters (turning my regex into
 "[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or
 linking.

Like others noted, the problem is two-fold:

- Escaping special characters (both as a D string literal and regex 
  escaping itself). So just use `` or r"" or some other form of WYSIWYG 
  string *AND* escape the things that are part of regex syntax.
- [] are used for nesting, as std.regex supports set-wise operations 
  inside of a [...] character class, e.g. [[A-Z]&&[A-D]] means 
  intersection and would yield the set [A-D]. It gets more useful with 
  Unicode character sets.

--
Dmitry Olshansky
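
A short sketch of the intersection example from this post. The ^ and $ anchors and the test inputs are additions for illustration, and the sketch assumes a Phobos version that provides matchFirst; treat it as an illustration of the syntax described above, not a statement about every std.regex release.

```d
import std.regex : regex, matchFirst;

void main()
{
    // Intersection of two nested sets, written as in the post above.
    auto re = regex(`^[[A-Z]&&[A-D]]$`);

    assert(!matchFirst("C", re).empty); // 'C' is in both A-Z and A-D
    assert(matchFirst("X", re).empty);  // 'X' is in A-Z only
}
```

Because an unescaped [ inside a class starts such a nested set, a literal bracket has to be written as \[ there.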
Feb 17 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Feb 16, 2013 at 09:22:07PM +0100, MrAppleseed wrote:
 auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");

The problem is that you're using D's double-quoted string literal, 
which adds another level of interpretation to the \'s. What you 
should do is to use the backtick string literal, which does *not* 
interpret backslashes:

	auto reg = regex(`[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]`);

If you have trouble typing `, you can also use r"...", which means 
the same thing.

Hope this helps.


--T
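
A minimal sketch of the two WYSIWYG forms, using made-up patterns. One caveat worth noting: an r"..." literal ends at the first double quote, so a pattern that contains " fits in a backtick literal but not in r"..." as is.

```d
void main()
{
    // Both WYSIWYG forms keep every backslash exactly as typed.
    string a = `\d+\.\d*`;
    string b = r"\d+\.\d*";
    assert(a == b);
    assert(a.length == 8);

    // A pattern containing a double quote can be written in backticks as is;
    // an r"..." literal would end at that quote.
    string quotes = `["']`;
    assert(quotes.length == 4);
}
```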
Feb 16 2013
"Namespace" <rswhite4 googlemail.com> writes:
As long as there is \" I get the same error.
Feb 16 2013
"MrAppleseed" <email email.com> writes:
On Saturday, 16 February 2013 at 20:35:48 UTC, H. S. Teoh wrote:
 The problem is that you're using D's double-quoted string literal,
 which adds another level of interpretation to the \'s. What you
 should do is to use the backtick string literal, which does *not*
 interpret backslashes:

 auto reg = regex(`[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]`);

 If you have trouble typing `, you can also use r"...", which means
 the same thing.

Thanks for the quick reply! I replaced the double-quotes with 
backticks, compiled it with no problems, but on the first run I got 
a similar error:

std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): 
invalid escape sequence
Pattern with error: `[ 0-9a-zA-Z.*=+-;()\"` <--HERE-- `\'[]<>,{}^#/\\]`

After removing the invalid escape sequence, I compiled it, once again 
with no problems, and attempted to run it, but I got the same error 
as before:

std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): 
wrong CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\\]`

(Entire error here: http://pastebin.com/Su9XzbXW)
Feb 16 2013
"jerro" <a a.com> writes:
 std.regex.RegexException /usr/include/dmd/phobos/std/regex.d(1942): 
 wrong CodepointSet
 Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- 
 `<>,{}^#/\\]`

 (Entire error here: http://pastebin.com/Su9XzbXW)

You need to put \ in front of [ or ] if you want to match those two 
characters. The relevant part of the std.regex documentation:

\c where c is one of [|*+?()    Matches the character c itself.
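
A small sketch of this advice, with a made-up two-character class and assuming a Phobos version that provides matchFirst. The brackets are escaped inside the class and the pattern is kept in a backtick literal so the backslashes reach the regex engine untouched.

```d
import std.regex : regex, matchFirst;

void main()
{
    // A class that matches a literal '[' or ']'.
    auto brackets = regex(`[\[\]]`);

    assert(!matchFirst("[", brackets).empty);
    assert(!matchFirst("]", brackets).empty);
    assert(matchFirst("x", brackets).empty); // anything else is rejected
}
```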
Feb 16 2013