
digitalmars.D.bugs - [Issue 8725] New: segmentation fault with negative-lookahead in module-level regex

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725

           Summary: segmentation fault with negative-lookahead in
                    module-level regex
           Product: D
           Version: D2
          Platform: x86_64
        OS/Version: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: val markovic.io



The following program crashes with a segmentation fault:

-------------


import std.stdio;
import std.regex;

auto italic = regex( r"\*
                    (?!\s+)
                    (.*?)
                    (?!\s+)
                    \*", "gx" );

void main() {
  string input = "this * is* interesting, *very* interesting";
  writeln( replace( input, italic, "<i>$1</i>" ) );
}
--------------

If one removes the first line with (?!\s+), then the program doesn't crash. 

I was under the impression that this snippet of code operates under the SafeD
subset and therefore shouldn't cause a segmentation fault. A thrown exception
on problems or something, that I can understand. But a segfault?

In other sad news, these are the first lines of D I've ever written :( ... so
much for experimentation...

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Sep 25 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725




Oh, and the segfault goes away if I put the regex creation directly in the
call, like so:

  writeln( replace( input, regex( r"\*
                                  (?!\s+)
                                  (.*?)
                                  (?!\s+)
                                  \*", "gx" ), "<i>$1</i>" ) );

Sep 25 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com



06:46:49 PDT ---
I suspect this is a long-standing bug in compile-time evaluation: the compiler
parses the regex pattern incorrectly at compile time (unlike at run time).
See also: http://d.puremagic.com/issues/show_bug.cgi?id=7810

The problem is that once the D compiler sees an initialized global variable, it
has to constant-fold the initializer:

int fact10 = factorial(10);
// will compute and hardcode the value of factorial(10)

Then with regex:
auto italic = regex( ... );
// *parses* the pattern and *generates* the binary object for the compiled
// regex, with all the data structures for matching it

All of this happens *at compile time* via CTFE; see the description near the
bottom of:
http://dlang.org/function.html

Previously this only caused unexpectedly long compilation times (CTFE is
slow), and in a few select cases it failed with an assert *during
compilation*; it never segfaulted.
Probably the internal structure has a subtle corruption that the self-test
failed to catch.
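
A minimal sketch of the constant-folding behaviour described above. The body of
factorial is an assumption (the thread only shows the call), and the static
assert is added purely for illustration:

```d
// Hypothetical helper; only the call factorial(10) appears in the thread.
int factorial(int n)
{
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Module-level initializer: the compiler has to evaluate it via CTFE.
int fact10 = factorial(10);

// A static assert forces compile-time evaluation as well.
static assert(factorial(10) == 3_628_800);

void main()
{
    int n = 10;                      // value only known at run time
    assert(factorial(n) == fact10);  // ordinary run-time call
}
```

The same function is used both ways; only the context (a module-level
initializer versus a call inside main) decides whether it runs in CTFE.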

E.g. this one also works, because the italic regex is created at run time:

import std.stdio;
import std.regex;


void main() {
  auto italic = regex( r"\*
                    (?!\s+)
                    (.*?)
                    (?!\s+)
                    \*", "gx" );
  string input = "this * is* interesting, *very* interesting";
  writeln( replace( input, italic, "<i>$1</i>" ) );
}
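
If the regex should stay at module level, another possible workaround (not
suggested in the thread, so treat it as an assumption) is to declare the
variable without an initializer and fill it in from a module constructor, which
runs at program start-up rather than in CTFE:

```d
import std.stdio;
import std.regex;

// Declared at module level but without an initializer,
// so no pattern has to be compiled via CTFE.
Regex!char italic;

// Module constructor: runs at start-up, before main().
static this()
{
    italic = regex( r"\*
                    (?!\s+)
                    (.*?)
                    (?!\s+)
                    \*", "gx" );
}

void main()
{
    string input = "this * is* interesting, *very* interesting";
    writeln( replace( input, italic, "<i>$1</i>" ) );
}
```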

Also a tip: the second lookahead should be a lookbehind! As it is, it will test
that \* is not a space, which is trivially true. Also, both can be just \s,
because \s+ matches whenever \s matches. And since you don't capture the
contents of a lookahead/lookbehind, it'll be faster and simpler to use a
single \s.
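
A sketch of the pattern with both tips applied: a negative lookbehind before
the closing \* and a single \s in each assertion. The helper name italicize is
hypothetical, and the pattern is written on one line without the "x" flag:

```d
import std.regex;

// Hypothetical helper wrapping the replacement from the original report.
string italicize(string input)
{
    // Built at run time, inside the function, to stay clear of CTFE.
    auto italic = regex( r"\*(?!\s)(.*?)(?<!\s)\*", "g" );
    return replace( input, italic, "<i>$1</i>" );
}

void main()
{
    auto result = italicize( "this * is* interesting, *very* interesting" );
    // The first two asterisks are each followed by a space, so only *very* matches.
    assert( result == "this * is* interesting, <i>very</i> interesting" );
}
```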

About SafeD: it shouldn't segfault, but the program listed is @system (as this
is the default) :). Otherwise, since regex is @trusted, it's my responsibility
to verify that it is memory safe, so blame me (or rather the compiler).

To actually be in SafeD, try putting @safe: at the top of your code, or just tag
main and all functions with @safe.
AFAIK writeln wouldn't work in SafeD, as it's still @system (obviously it
should be @safe/@trusted). To be honest, SafeD hasn't been addressed properly in
the standard library yet.
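
A minimal sketch of what opting in to SafeD looks like. The sum function is a
hypothetical example, and writeln is left out on purpose given the caveat
above:

```d
@safe:  // everything below this label is checked as @safe

// Hypothetical example function.
int sum(const int[] xs)
{
    int total = 0;
    foreach (x; xs)
        total += x;   // plain iteration over a slice is allowed
    return total;
}

void main()
{
    int[] data = [1, 2, 3];
    assert(sum(data) == 6);

    // Pointer arithmetic (for some int* p) is rejected in @safe code:
    // p += 1;
}
```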

Sep 26 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725




Thanks for the explanation!

WRT the regex string being faulty, I was aware of that; I was just
experimenting when I encountered a segfault. 

Thanks for the pointer about adding @safe: at the top; too bad writeln is still
@system. That kinda kills the usefulness of SafeD, doesn't it? I mean, if I
literally can't write a Hello World program in SafeD, then SafeD is quite far
from ready. :)

I've read the TDPL last week and this is my first encounter with writing real D
code; all in all, the language is freaking awesome (goodbye C++) and I'm even
willing to live with esoteric bugs in the compiler/libs if I can work around
them. I understand that D is still a work-in-progress language.

I intend to write a substantial (multi KLOC) D program as a learning
experience; will report any bugs I find as I find them.

Anyway, good luck fixing this. :)

Sep 26 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE



12:49:42 PST ---
Works with current git master.
Must have been fixed along with the compiler bug in 7810.

*** This issue has been marked as a duplicate of issue 7810 ***

Nov 30 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8725




Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/0f2947d4d1360f0a0f797279e6f13f95695e45ec
bugfixes for compile-time regex

fix issue 8725

fix issue 8349

Dec 01 2012