www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - prefix match of a regex and optimized dirEntries for regex search

reply Timothee Cour via Digitalmars-d <digitalmars-d puremagic.com> writes:
--047d7b2e3d92d2a94f04fc1fbb53
Content-Type: text/plain; charset=UTF-8

I made a simple modification to std.regex to allow an option to prefix
match a regex.
Formally, if L(R) is the language recognized by a regex R, the language
recognized by prefix matching of R  is:

L(p(R)) = prefix(L(R)) = {u : uv in L(R) for some v}

Trying to come up (by hand or algorithmically) with a regex R' such that
L(R') L(p(R)) is awkward and inefficient, eg:

R='hello';
R'=`|h|he|hell|hello` = `(h(e(l(l(o)?)?)?)?)?`;

However thinking in terms of state machine this is much easier and
efficient.

It looks like this:
assert("hel".match(`hello\d+`.regex("p")); //p for prefix match

If there's interest in adding this I can prepare a pull request and we can
discuss more.

Example use case:
I wrote a function to search a file given a regex, and it is optimized to
prune directories early on if they fail to prefix match the regex, eg:

dirEntriesOptimized(`abc/folder_\d+/\w+\.cpp`)
when encountering `abc/bad_subfolder/` it will not recurse on this as it
fails the prefix regex match.

--047d7b2e3d92d2a94f04fc1fbb53
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">I made a simple modification to std.regex to allow an opti=
on to prefix match a regex.<div>Formally, if L(R) is the language recognize=
d by a regex R, the language recognized by prefix matching of R =C2=A0is:</=
div>
<div><br></div><div>L(p(R)) =3D prefix(L(R)) =3D {u : uv in L(R) for some v=
}</div><div>=C2=A0</div><div>Trying to come up (by hand or algorithmically)=
 with a regex R&#39; such that L(R&#39;) L(p(R)) is awkward and inefficient=
, eg:</div>
<div><br></div><div>R=3D&#39;hello&#39;;</div><div>R&#39;=3D`|h|he|hell|hel=
lo` =3D `(h(e(l(l(o)?)?)?)?)?`;=C2=A0</div><div><div><div></div></div><div>=
<br></div><div>However thinking in terms of state machine this is much easi=
er and efficient.</div>
<div><br></div><div><div>It looks like this:</div><div>assert(&quot;hel&quo=
t;.match(`hello\d+`.regex(&quot;p&quot;)); //p for prefix match</div></div>=
<div><br></div><div>If there&#39;s interest in adding this I can prepare a =
pull request and we can discuss more.</div>
<div><br></div></div><div>Example use case:</div><div>I wrote a function to=
 search a file given a regex, and it is optimized to prune directories earl=
y on if they fail to prefix match the regex, eg:</div><div><br></div><div>
dirEntriesOptimized(`abc/folder_\d+/\w+\.cpp`)</div><div>when encountering =
`abc/bad_subfolder/` it will not recurse on this as it fails the prefix reg=
ex match.</div><div><br></div><div><br></div></div>

--047d7b2e3d92d2a94f04fc1fbb53--
Jun 18 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
18-Jun-2014 21:38, Timothee Cour via Digitalmars-d пишет:
 I made a simple modification to std.regex to allow an option to prefix
 match a regex.
 Formally, if L(R) is the language recognized by a regex R, the language
 recognized by prefix matching of R  is:

 L(p(R)) = prefix(L(R)) = {u : uv in L(R) for some v}
 Trying to come up (by hand or algorithmically) with a regex R' such that
 L(R') L(p(R)) is awkward and inefficient, eg:

 R='hello';
 R'=`|h|he|hell|hello` = `(h(e(l(l(o)?)?)?)?)?`;

 However thinking in terms of state machine this is much easier and
 efficient.

 It looks like this:
 assert("hel".match(`hello\d+`.regex("p")); //p for prefix match

Well, since "p" would apply to operation rather then regex I think adding a partialMatch function would be simpler. Plus it may need to return extended information like where in the pattern it stopped, so the return type/API may need to be different from normal match.
 If there's interest in adding this I can prepare a pull request and we
 can discuss more.

direction is warmly welcome. Would be cool if you could take a look at https://github.com/D-Programming-Language/phobos/pull/2216 It's nothing special except splitting up std.regex into a package but it's a stepping stone of going forward with it.
 Example use case:
 I wrote a function to search a file given a regex, and it is optimized
 to prune directories early on if they fail to prefix match the regex, eg:

I'm not sure I follow - a normal regex match would do essentially the same by first checking the prefix and failing on wrong ones. Where is optimization? What could be interesting is to limit matching to not go beyond say 80 code units and report if it's (partial) match of given pattern.
 dirEntriesOptimized(`abc/folder_\d+/\w+\.cpp`)
 when encountering `abc/bad_subfolder/` it will not recurse on this as it
 fails the prefix regex match.

-- Dmitry Olshansky
Jun 18 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
19-Jun-2014 11:36, Timothee Cour via Digitalmars-d пишет:
 On Wed, Jun 18, 2014 at 12:37 PM, Dmitry Olshansky via Digitalmars-d
 <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote:

     18-Jun-2014 21:38, Timothee Cour via Digitalmars-d пишет:

         I made a simple modification to std.regex to allow an option to
         prefix
         match a regex.
         Formally, if L(R) is the language recognized by a regex R, the
         language
         recognized by prefix matching of R  is:

         L(p(R)) = prefix(L(R)) = {u : uv in L(R) for some v}
         Trying to come up (by hand or algorithmically) with a regex R'
         such that
         L(R') L(p(R)) is awkward and inefficient, eg:

     So essentially it's just a set of all partial matches?


 Not sure, that depends on your definition of partial matches; how do you
 formally define partial matches?

 There's a subtlety, let's call my proposal prefixMatch:
 "hel".prefixMatch("^hello") //true
 "".prefixMatch("^hello") //true
 "mel".prefixMatch("^hello") //false

 But if the regex doesn't start with `^` then it is not useful:

 "whatever".prefixMatch("hello") //true
 because the scanner will try each position in the input string and fail,
 except for the last one where it'll succeed trivially ("" is a prefix of
 "hello").

Well, I think partialMatch is only defined in the current position. Unlike normal match we can't provide default meaningful action e.g. if a trivial "" result means search next or just report trivial partial match.
 In other words, this is more useful for one shot matching.

Yes, it makes that much more sense for one shot matching IMO.
 But we can also consider partial partial matches if we provide
 additional constraints given via a lambda:

 "blabla hel blabla".partialMatch!(a=>a.hit.length>=3) ("hello")
 => finds a match ("hel") because the lambda succeeds on this match.

 partialMatch could also be used to find the longest partial match
 (either in terms of length of matching string or length of the regex):

 auto longestPartialMatch(R)(string input, R regex){
 string longest;
 while(true){
 auto m=partialMatch!(a=>a.hit.length>longest.length) (input,regex);
 if(m) longest=a.hit; else return longest;
 }

Seems doable as helpers on top, but this kind of thing could be made faster if built-in in std.regex like you suggest. I also can't see anything better then lambda at the moment.
     Would be cool if you could take a look at
     https://github.com/D-__Programming-Language/phobos/__pull/2216
     <https://github.com/D-Programming-Language/phobos/pull/2216>

     It's nothing special except splitting up std.regex into a package
     but it's a stepping stone of going forward with it.


 That's a very welcome change, this module is too big currently. That
 seems like it would be the 1st phobos module to take advantage of
 package feature?

There is competition of std.container! https://github.com/D-Programming-Language/phobos/pull/2174 Truth be told it's the most obvious candidate, as containers have nothing to do with each other.
         Example use case:
         I wrote a function to search a file given a regex, and it is
         optimized
         to prune directories early on if they fail to prefix match the
         regex, eg:


     I'm not sure I follow - a normal regex match would do essentially
     the same by first checking the prefix and failing on wrong ones.
     Where is optimization?


 The optimization is avoiding unnecessary file system access by pruning
 early on directories that we can prove will not contain any file which
 will match the regex. Without optimization, you have to consider (and
 reject) `abc/bad_subfolder/file_1`. With optimization, you never have to
 consider that file (nor ask the file system to list it), as
 `abc/bad_subfolder/` fails the prefix regex test (see my example). A
 normal regex needs to be applied to the whole path, and you need prefix
 matching to reject partial paths (ie, directories).

Now I see - you don't have the full input while traversing directories, cool idea.
     What could be interesting is to limit matching to not go beyond say
     80 code units and report if it's (partial) match of given pattern.


 That'd be nice; it needs to be as generic as possible; I propose using a
 user defined lambda (see above).

Well, why not but I'd suggest to focus on getting the partialMatch at the current position as a good generic building block to extend from. -- Dmitry Olshansky
Jun 19 2014
prev sibling parent Timothee Cour via Digitalmars-d <digitalmars-d puremagic.com> writes:
--001a11c22172546b8b04fc2b70e6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Wed, Jun 18, 2014 at 12:37 PM, Dmitry Olshansky via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 18-Jun-2014 21:38, Timothee Cour via Digitalmars-d =D0=BF=D0=B8=D1=88=D0=

  I made a simple modification to std.regex to allow an option to prefix
 match a regex.
 Formally, if L(R) is the language recognized by a regex R, the language
 recognized by prefix matching of R  is:

 L(p(R)) =3D prefix(L(R)) =3D {u : uv in L(R) for some v}
 Trying to come up (by hand or algorithmically) with a regex R' such that
 L(R') L(p(R)) is awkward and inefficient, eg:

  So essentially it's just a set of all partial matches?


Not sure, that depends on your definition of partial matches; how do you formally define partial matches? There's a subtlety, let's call my proposal prefixMatch: "hel".prefixMatch("^hello") //true "".prefixMatch("^hello") //true "mel".prefixMatch("^hello") //false But if the regex doesn't start with `^` then it is not useful: "whatever".prefixMatch("hello") //true because the scanner will try each position in the input string and fail, except for the last one where it'll succeed trivially ("" is a prefix of "hello"). In other words, this is more useful for one shot matching. But we can also consider partial partial matches if we provide additional constraints given via a lambda: "blabla hel blabla".partialMatch!(a=3D>a.hit.length>=3D3) ("hello") =3D> finds a match ("hel") because the lambda succeeds on this match. partialMatch could also be used to find the longest partial match (either in terms of length of matching string or length of the regex): auto longestPartialMatch(R)(string input, R regex){ string longest; while(true){ auto m=3DpartialMatch!(a=3D>a.hit.length>longest.length) (input,regex); if(m) longest=3Da.hit; else return longest; } This could be useful for debugging complex regex to see why it didn't capture a given string for example.
  R=3D'hello';
 R'=3D`|h|he|hell|hello` =3D `(h(e(l(l(o)?)?)?)?)?`;

 However thinking in terms of state machine this is much easier and
 efficient.

 It looks like this:
 assert("hel".match(`hello\d+`.regex("p")); //p for prefix match

a partialMatch function would be simpler.

 Plus it may need to return extended information like where in the pattern
 it stopped, so the return type/API may need to be different from normal
 match.

Good point, I agree.
  If there's interest in adding this I can prepare a pull request and we
 can discuss more.

  Sure partial matching wold be nice to have and any help in this


ok, I wanted to make sure there was interest.
 Would be cool if you could take a look at
 https://github.com/D-Programming-Language/phobos/pull/2216

 It's nothing special except splitting up std.regex into a package but it'=

 a stepping stone of going forward with it.

That's a very welcome change, this module is too big currently. That seems like it would be the 1st phobos module to take advantage of package feature?
  Example use case:
 I wrote a function to search a file given a regex, and it is optimized
 to prune directories early on if they fail to prefix match the regex, eg=



I'm not sure I follow - a normal regex match would do essentially the sam=

 by first checking the prefix and failing on wrong ones. Where is
 optimization?

early on directories that we can prove will not contain any file which will match the regex. Without optimization, you have to consider (and reject) `abc/bad_subfolder/file_1`. With optimization, you never have to consider that file (nor ask the file system to list it), as `abc/bad_subfolder/` fails the prefix regex test (see my example). A normal regex needs to be applied to the whole path, and you need prefix matching to reject partial paths (ie, directories).
 What could be interesting is to limit matching to not go beyond say 80
 code units and report if it's (partial) match of given pattern.

That'd be nice; it needs to be as generic as possible; I propose using a user defined lambda (see above).
 --
 Dmitry Olshansky

--001a11c22172546b8b04fc2b70e6 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><br><div class=3D"gmail_extra"><br><br><div class=3D"gmail= _quote">On Wed, Jun 18, 2014 at 12:37 PM, Dmitry Olshansky via Digitalmars-= d <span dir=3D"ltr">&lt;<a href=3D"mailto:digitalmars-d puremagic.com" targ= et=3D"_blank">digitalmars-d puremagic.com</a>&gt;</span> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-= left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p= adding-left:1ex">18-Jun-2014 21:38, Timothee Cour via Digitalmars-d =D0=BF= =D0=B8=D1=88=D0=B5=D1=82:<div class=3D""> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-= left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p= adding-left:1ex"> I made a simple modification to std.regex to allow an option to prefix<br> match a regex.<br> Formally, if L(R) is the language recognized by a regex R, the language<br> recognized by prefix matching of R =C2=A0is:<br> <br> L(p(R)) =3D prefix(L(R)) =3D {u : uv in L(R) for some v}<br> Trying to come up (by hand or algorithmically) with a regex R&#39; such tha= t<br> L(R&#39;) L(p(R)) is awkward and inefficient, eg:<br> <br> </blockquote></div> So essentially it&#39;s just a set of all partial matches?</blockquote><div=
<br></div><div>Not sure, that depends on your definition of partial matche=

There&#39;s a subtlety, let&#39;s call my proposal=C2=A0prefixMatch:<br></d= iv><div><div><div>&quot;hel&quot;.prefixMatch(&quot;^hello&quot;) //true</d= iv></div></div><div><div><div>&quot;&quot;.prefixMatch(&quot;^hello&quot;) = //true</div> </div></div><div><div>&quot;mel&quot;.prefixMatch(&quot;^hello&quot;) //fal= se</div></div><div><br></div><div>But if the regex doesn&#39;t start with `= ^` then it is not useful:</div><div><div><br></div><div>&quot;whatever&quot= ;.prefixMatch(&quot;hello&quot;) //true</div> <div>because the scanner will try each position in the input string and fai= l, except for the last one where it&#39;ll succeed trivially (&quot;&quot; = is a prefix of &quot;hello&quot;).=C2=A0</div><div><br></div><div>In other = words, this is more useful for one shot matching.=C2=A0</div> <div><br></div><div>But we can also consider partial partial matches if we = provide additional constraints given via a lambda:</div><div><br></div><div=
&quot;blabla hel blabla&quot;.partialMatch!(a=3D&gt;a.hit.length&gt;=3D3) =

<div>=3D&gt; finds a match (&quot;hel&quot;) because the lambda succeeds on= this match.</div><div><br></div><div>partialMatch could also be used to fi= nd the longest partial match (either in terms of length of matching string = or length of the regex):<br> </div><div><br></div><div>auto longestPartialMatch(R)(string input, R regex= ){</div><div>string longest;</div><div>while(true){</div><div>auto m=3Dpart= ialMatch!(a=3D&gt;a.hit.length&gt;longest.length) (input,regex);<br></div><= div> if(m) longest=3Da.hit; else return longest;</div><div>}</div><div><br></div=
<div>This could be useful for debugging complex regex to see why it didn&#=

iv> <br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8= ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-sty= le:solid;padding-left:1ex"><div class=3D""> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-= left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p= adding-left:1ex"> R=3D&#39;hello&#39;;<br> R&#39;=3D`|h|he|hell|hello` =3D `(h(e(l(l(o)?)?)?)?)?`;<br> <br> However thinking in terms of state machine this is much easier and<br> efficient.<br> <br> It looks like this:<br> assert(&quot;hel&quot;.match(`hello\d+`.<u></u>regex(&quot;p&quot;)); //p f= or prefix match<br> <br> </blockquote> <br></div> Well, since &quot;p&quot; would apply to operation rather then regex I thin= k adding a partialMatch function would be simpler.<br></blockquote><blockqu= ote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left-wid= th:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-l= eft:1ex"> <br> Plus it may need to return extended information like where in the pattern i= t stopped, so the return type/API may need to be different from normal matc= h.</blockquote><div><br></div><div>=C2=A0Good point, I agree.</div><div><br=

order-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:s= olid;padding-left:1ex"><div class=3D""><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-= left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p= adding-left:1ex"> If there&#39;s interest in adding this I can prepare a pull request and we<= br> can discuss more.<br> <br> </blockquote></div> Sure partial matching wold be nice to have and any help in this direction i= s warmly welcome.<br></blockquote><div><br></div><div>ok, I wanted to make = sure there was interest.</div><div>=C2=A0</div><blockquote class=3D"gmail_q= uote" style=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-c= olor:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <br> Would be cool if you could take a look at<br> <a href=3D"https://github.com/D-Programming-Language/phobos/pull/2216" targ= et=3D"_blank">https://github.com/D-<u></u>Programming-Language/phobos/<u></= u>pull/2216</a><br> <br> It&#39;s nothing special except splitting up std.regex into a package but i= t&#39;s a stepping stone of going forward with it.</blockquote><div><br></d= iv><div>That&#39;s a very welcome change, this module is too big currently.= That seems like it would be the 1st phobos module to take advantage of pac= kage feature?=C2=A0</div> <div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px = 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-l= eft-style:solid;padding-left:1ex"><div class=3D""><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-= left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p= adding-left:1ex"> Example use case:<br> I wrote a function to search a file given a regex, and it is optimized<br> to prune directories early on if they fail to prefix match the regex, eg:<b= r> </blockquote> <br></div> I&#39;m not sure I follow - a normal regex match would do essentially the s= ame by first checking the prefix and failing on wrong ones. Where is optimi= zation?<br> <br></blockquote><div><br></div><div>The optimization is avoiding unnecessa= ry file system access by pruning early on directories that we can prove wil= l not contain any file which will match the regex. Without optimization, yo= u have to consider (and reject) `abc/bad_subfolder/file_1`. With optimizati= on, you never have to consider that file (nor ask the file system to list i= t), as `abc/bad_subfolder/` fails the prefix regex test (see my example). A= normal regex needs to be applied to the whole path, and you need prefix ma= tching to reject partial paths (ie, directories).</div> <div><div><br></div></div><div>=C2=A0</div><blockquote class=3D"gmail_quote= " style=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color= :rgb(204,204,204);border-left-style:solid;padding-left:1ex"> What could be interesting is to limit matching to not go beyond say 80 code= units and report if it&#39;s (partial) match of given pattern.</blockquote=
<div><br></div><div>That&#39;d be nice; it needs to be as generic as possi=

<div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px = 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-l= eft-style:solid;padding-left:1ex"><div class=3D""><div class=3D"h5"> <br></div></div><span class=3D""><font color=3D"#888888"> -- <br> Dmitry Olshansky<br> </font></span></blockquote></div><br></div></div> --001a11c22172546b8b04fc2b70e6--
Jun 19 2014