
digitalmars.D.learn - Why does "*" cause my tiny regextester program to crash?

reply Alex Folland <lexlexlex gmail.com> writes:
I wrote this little program to test for regular expression matches.  I 
compiled it on Windows with DMD 2.051 through Visual Studio 2010 
with Visual D.  It crashes if regexbuf is just the single character, 
"*".  Why?  Shouldn't it match the entire string?

Visual Studio's debug output is this:

First-chance exception at 0x76fde124 in regextester.exe: 0xE0440001: 
0xe0440001.
The program '[5492] regextester.exe: Native' has exited with code 1 (0x1).

Also, why does it match an unlimited number of times on "$" instead of 
just once?  Is this a Phobos-specific issue, or are regular expressions 
supposed to do that?  I mean, it doesn't match an unlimited number of 
times on "h", for example.

Mind you, I'm very new to both regular expressions and D.  I'm also not 
an experienced programmer of anything else.  I've spent years dabbling 
on the surface of various programming languages without learning 
anything meaty.  I've learned syntax mostly.

My debug build is here: http://lex.clansfx.co.uk/projects/regextester.exe

Here's the source code:

import std.stdio, std.regex;

void main()
{
   char[] regexbuf;
   char[] teststring;
   while(1)
   {
     write("test string: ");
     std.stdio.readln(teststring); teststring.length=teststring.length-1;
     while(teststring.length>0)
     {
       uint i=0;
       write("regex input: ");
       std.stdio.readln(regexbuf); regexbuf.length=regexbuf.length-1;
       if(regexbuf.length>0)
       foreach(m; match(teststring, regex(regexbuf)))
       {
         i++;
         writefln("Match number %s: %s[%s]%s",i,m.pre,m.hit,m.post);
         if(i >= 50) { writefln("There have been %s matches.  I'm breaking for safety.",i); break; }
       }
     }
   }
}
Jan 30 2011
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Alex Folland Wrote:

 I wrote this little program to test for regular expression matches.  I 
 compiled it on Windows with DMD 2.051 through Visual Studio 2010 
 with Visual D.  It crashes if regexbuf is just the single character, 
 "*".  Why?  Shouldn't it match the entire string?

It would be best to give your example data. The regular expression for matching all data is ".*": "*" means zero or more repetitions of the preceding item, and "." matches any character. So "*" on its own makes no sense.
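
To illustrate the point about ".*", here is a minimal sketch, assuming a recent Phobos where std.regex provides matchFirst. The helper name firstHit is made up for this example and is not part of the original program.

```d
import std.regex, std.stdio;

// Hypothetical helper (not from the original post): returns the first
// match of `pattern` in `text`, or an empty string if nothing matches.
string firstHit(string text, string pattern)
{
    auto m = matchFirst(text, regex(pattern));
    return m.empty ? "" : m.hit;
}

void main()
{
    // "." matches any character and "*" repeats it zero or more times,
    // so ".*" consumes the whole (single-line) input.
    writeln(firstHit("hello world", ".*")); // prints "hello world"

    // A quantifier needs something in front of it to repeat.
    writeln(firstHit("hello world", "l+")); // prints "ll"
}
```

A bare "*" has nothing in front of it to repeat, which is why it is rejected as a pattern rather than treated as "match everything".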
Jan 30 2011
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Mon, 31 Jan 2011 03:57:44 +0200, Alex Folland <lexlexlex gmail.com>  
wrote:

 I wrote this little program to test for regular expression matches.  I  
 compiled it on Windows with DMD 2.051 through Visual Studio 2010  
 with Visual D.  It crashes if regexbuf is just the single character,  
 "*".  Why?  Shouldn't it match the entire string?

"*" in regular expressions means 0 or more instances of the previous entity: http://www.regular-expressions.info/repeat.html It doesn't make sense at the start of an expression. ".*" is the regexp that matches anything[1]. std.regex probably can't handle invalid regexps very well. Note that std.regex is a new module that intends to replace the older std.regexp, but still has some problems.
 Also, why does it match an unlimited number of times on "$" instead of  
 just once?

Looks like another std.regex bug.
 My debug build is here: http://lex.clansfx.co.uk/projects/regextester.exe

A note for the future: compiled executables aren't very useful when source is available, especially considering many people here don't use Windows.

[1]: A dot in a regular expression may not match newlines, depending on the implementation and search options.

-- 
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
Jan 30 2011
parent reply Alex Folland <lexlexlex gmail.com> writes:
On 2011-01-30 21:47, Vladimir Panteleev wrote:
 On Mon, 31 Jan 2011 03:57:44 +0200, Alex Folland <lexlexlex gmail.com>
 wrote:

 I wrote this little program to test for regular expression matches. I
 compiled it on Windows with DMD 2.051 through Visual Studio 2010
 with Visual D. It crashes if regexbuf is just the single character,
 "*". Why? Shouldn't it match the entire string?

"*" in regular expressions means 0 or more instances of the previous entity: http://www.regular-expressions.info/repeat.html It doesn't make sense at the start of an expression. ".*" is the regexp that matches anything[1]. std.regex probably can't handle invalid regexps very well. Note that std.regex is a new module that intends to replace the older std.regexp, but still has some problems.

Okay, so that particular regex is invalid. Yeah, it still shouldn't crash. You're right. How should I prevent my program from crashing without fixing std.regex (code I definitely don't trust myself to touch)? Would the Scope statement be useful? I still can't figure out exactly what it does. I tried using scope(exit)writeln("Bad regex."); just before my foreach loop, but it still crashes. I then tried changing "exit" to "failure", but that didn't help either; same behavior. Am I using scope wrong?
 Also, why does it match an unlimited number of times on "$" instead of
 just once?

Looks like another std.regex bug.

I thought it through and decided that it might not be std.regex's bug. I mean, there's no way m could have an unlimited number of elements for foreach to loop through, right? Actually, it probably is std.regex's bug. Though, all of this doesn't really matter since nobody uses just "$" as a regex, since it'd match an obvious point in any input. I bet Andrei would still be irked by it if he knew though.
 My debug build is here: http://lex.clansfx.co.uk/projects/regextester.exe

A note for the future: compiled executables aren't very useful when source is available, especially considering many people here don't use Windows.

Right.
 [1]: A dot in a regular expression may not match newlines, depending on
 the implementation and search options.

Thanks for the extra info. :)
Jan 30 2011
parent reply Alex Folland <lexlexlex gmail.com> writes:
On 2011-01-31 0:50, Alex Folland wrote:
 On 2011-01-30 21:47, Vladimir Panteleev wrote:
 On Mon, 31 Jan 2011 03:57:44 +0200, Alex Folland <lexlexlex gmail.com>
 wrote:

 I wrote this little program to test for regular expression matches. I
 compiled it on Windows with DMD 2.051 through Visual Studio 2010
 with Visual D. It crashes if regexbuf is just the single character,
 "*". Why? Shouldn't it match the entire string?

"*" in regular expressions means 0 or more instances of the previous entity: http://www.regular-expressions.info/repeat.html It doesn't make sense at the start of an expression. ".*" is the regexp that matches anything[1]. std.regex probably can't handle invalid regexps very well. Note that std.regex is a new module that intends to replace the older std.regexp, but still has some problems.

Okay, so that particular regex is invalid. Yeah, it still shouldn't crash. You're right. How should I prevent my program from crashing without fixing std.regex (code I definitely don't trust myself to touch)? Would the Scope statement be useful? I still can't figure out exactly what it does. I tried using scope(exit)writeln("Bad regex."); just before my foreach loop, but it still crashes. I then tried changing "exit" to "failure", but that didn't help either; same behavior. Am I using scope wrong?

Yeah, you nitwit. You didn't realize that scope(failure) doesn't stop the exception from propagating and ending the program. It merely runs the code inside itself while the exception unwinds out of the scope. However, one can use scope(failure){writeln("Bad regex");break;} to prevent an actual crash and just write "Bad regex"! I expected that it would break out of the scope automatically. I was wrong. Anyway, this is beautiful. I love D. Haha.
Jan 30 2011
parent Alex Folland <lexlexlex gmail.com> writes:
On 2011-01-31 2:28, Alex Folland wrote:
 scope(failure){writeln("Bad regex");break;}

Oh, that resets the program rather than continuing from where that line was placed. The continue statement is what I wanted.
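
The behaviour of the scope guards discussed above can be seen in a small sketch. The function mayThrow is a hypothetical stand-in for a call such as regex() on a bad pattern; it is not from the original program.

```d
import std.stdio;

// Hypothetical stand-in for a call that fails, e.g. compiling a bad regex.
void mayThrow()
{
    throw new Exception("bad pattern");
}

// Returns true if the exception escaped the guarded scope.
bool demo()
{
    try
    {
        // Guards run when the enclosing scope is left, in reverse order
        // of declaration. None of them catches the exception.
        scope(exit)    writeln("scope(exit): runs however the scope is left");
        scope(failure) writeln("scope(failure): runs only when an exception unwinds");
        scope(success) writeln("scope(success): skipped, because the scope failed");
        mayThrow();
        return false;
    }
    catch (Exception e)
    {
        // The exception still arrives here after the guards have run.
        writeln("caught: ", e.msg);
        return true;
    }
}

void main()
{
    demo();
}
```

Here the scope(failure) line is printed first, then the scope(exit) line, and only then the catch block runs, which shows that the guards are cleanup hooks rather than handlers.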
Jan 31 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Mon, 31 Jan 2011 09:28:25 +0200, Alex Folland <lexlexlex gmail.com>  
wrote:

 scope(failure){writeln("Bad regex");break;}

I think the proper construct here is a try/catch block.

-- 
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
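
A minimal sketch of the try/catch approach, assuming a recent Phobos where std.regex provides matchAll (the original program used match with DMD 2.051). The helper name tryPattern is invented for this example.

```d
import std.regex, std.stdio;

// Hypothetical helper (not from the original program): prints every match
// of `pattern` in `text`. Returns false if the pattern could not be compiled.
bool tryPattern(string text, string pattern)
{
    try
    {
        auto re = regex(pattern); // throws on an invalid pattern such as "*"
        foreach (m; matchAll(text, re))
            writefln("%s[%s]%s", m.pre, m.hit, m.post);
        return true;
    }
    catch (Exception e)
    {
        // Report the problem and let the caller carry on with the next input.
        writeln("Bad regex: ", e.msg);
        return false;
    }
}

void main()
{
    tryPattern("hello", "*");  // invalid: reported, program keeps running
    tryPattern("hello", "l+"); // valid: prints "he[ll]o"
}
```

In the original loop, the same try/catch could wrap the foreach so that a bad pattern simply leads to the next "regex input:" prompt.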
Jan 31 2011
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 31.01.2011 4:57, Alex Folland wrote:
 I wrote this little program to test for regular expression matches.  I 
 compiled it on Windows with DMD 2.051 through Visual Studio 2010 
 with Visual D.  It crashes if regexbuf is just the single character, 
 "*".  Why?  Shouldn't it match the entire string?

 Visual Studio's debug output is this:

 First-chance exception at 0x76fde124 in regextester.exe: 0xE0440001: 
 0xe0440001.
 The program '[5492] regextester.exe: Native' has exited with code 1 
 (0x1).

After using Visual D quite a lot, I can tell you that this output *doesn't* mean the program crashed (in the sense of a segfault, etc.). All it means is that there is an uncaught exception, which is quite reasonable: the regexp parsing function throws when it detects an invalid pattern.

In general, when you are uncertain about what is going on, I suggest wrapping the suspicious code like this:

try{
... //code
}catch(Exception e){
     writeln(e);
}

to see what the errors are. With a lone "*" as the regexp, the output would be:

object.Exception: *+? not allowed in atom

Not very detailed, perhaps, but it gives a hint ;)

-- 
Dmitry Olshansky
Jan 31 2011
parent Alex Folland <lexlexlex gmail.com> writes:
On 2011-01-31 15:43, Dmitry Olshansky wrote:
 On 31.01.2011 4:57, Alex Folland wrote:
 I wrote this little program to test for regular expression matches. I
 compiled it on Windows with DMD 2.051 through Visual Studio 2010
 with Visual D. It crashes if regexbuf is just the single character,
 "*". Why? Shouldn't it match the entire string?

 Visual Studio's debug output is this:

 First-chance exception at 0x76fde124 in regextester.exe: 0xE0440001:
 0xe0440001.
 The program '[5492] regextester.exe: Native' has exited with code 1
 (0x1).

After using Visual D quite a lot, I can tell you that this output *doesn't* mean the program crashed (in the sense of a segfault, etc.). All it means is that there is an uncaught exception, which is quite reasonable: the regexp parsing function throws when it detects an invalid pattern.

In general, when you are uncertain about what is going on, I suggest wrapping the suspicious code like this:

try{
... //code
}catch(Exception e){
     writeln(e);
}

to see what the errors are.

Ty. Did that. My updated code is here: http://lex.clansfx.co.uk/projects/regextester.d
Jan 31 2011