digitalmars.D - Fixing the API of std.regex

Dmitry Olshansky <dmitry.olsh gmail.com> writes:
*Spoiler*: let's slowly deprecate the "g" option in std.regex over a few 
years, or with any luck a bit faster. A better replacement is proposed below.

For better or worse the current API has retained a (high) level of 
compatibility with the old API. That means I've missed the chance to fix 
it when I could, and here is the prime problem (the hardest) I have with it:

foreach(m; match("bleh-blah", "bl[ea]h"))
{
	writeln(m.hit);
}

The "quiz" is - how many lines will this print?

The current answer is 1. And the right solution for getting all matches is:

foreach(m; match("bleh-blah", regex("bl[ea]h","g")))
{
	writeln(m.hit);
}

Which not only looks unsightly but also confuses an operation option 
(find _all_ vs find _first_) with a property of the pattern (as 
case-insensitivity is). And if the regex pattern is defined elsewhere, it 
could easily introduce a bug (albeit one that's "usually" easy to track).

To underline the point: std.regex.splitter doesn't take "g" flag into 
account at all (it makes no sense there).
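
To make the quiz concrete, here is a minimal sketch of the behavior described above. The helper name `hits` is made up for illustration; only `match`, `regex` and `.hit` come from std.regex.

```d
import std.regex : match, regex;
import std.stdio : writeln;

// Hypothetical helper: collects the matched text of every element
// that the match range yields.
string[] hits(R)(R matches)
{
    string[] result;
    foreach (m; matches)
        result ~= m.hit;
    return result;
}

void main()
{
    // No "g" flag: iteration stops after the first match.
    writeln(hits(match("bleh-blah", "bl[ea]h")));

    // "g" flag on the pattern: iteration visits every match.
    writeln(hits(match("bleh-blah", regex("bl[ea]h", "g"))));
}
```

The call site looks identical in both cases; whether one or two hits come back depends on how the pattern was built, which is exactly the confusion being complained about.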

I've pondered a couple of solutions in a bug report by bearophile:
http://d.puremagic.com/issues/show_bug.cgi?id=7260

After all of these ideas were born and discarded, here is what I believe is 
the way forward out of this mess:

Make "g" indicate only the intended _default_ search mode of this 
pattern (global vs. first match).

The user is free to override this default explicitly and is in fact 
encouraged to do so. The idea of a default search mode attached to the 
regex pattern is marked as discouraged.

The overrides have to be convenient and backwards compatible.
Thus I propose the following:

match and replace become structs (types, oh my!) with the following 
"interface":

struct match //ditto  for replace
{
	//current behavior
	static auto opCall(.....);
	//get the first match / replace only the first occurrence
	static auto first();
	//force finding all matches (still a lazy range)
	static auto all();
}

OT: C++ folks call this namespace, but they don't have static opCall - 
suckers ;)  And I actually proposed (twice) to kill static opCall, sweet 
irony.

Then the motivating example would be :

foreach(m; match.all("bleh-blah", "bl[ea]h"))
{
	writeln(m.hit);
}

and :

//prints all submatches of the first match:
foreach(m; match.first("bleh-blah", "bl[ea]h"))
{
	// doesn't compile: m is the first match itself, no .hit there
	// that should make it harder to confuse
	// "first match" with "all matches"
	//writeln(m.hit);
	writeln(m);
}

We can go further and introduce the enhancement I long dreamed of:

//'any' or 'test' are also the names to choose from
if(match.anywhere(string, "[0-9]+"))
{
	//there is at least 1 match (no need for other info)
	...
}

The reason I want this "shorthand" is that the regex engine can cut a bunch 
of corners and serve up this "is there a match somewhere?" request much, 
MUCH faster than "where is the first match and all of its submatches?". 
And many use cases only need this yes/no thing anyway.
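
A minimal sketch of the yes/no interface, assuming a hypothetical helper named `hasMatch` (not part of std.regex). It only shows the shape of the call: it is built on the ordinary `match`, so it still computes the full first match and gains none of the speed-up discussed above.

```d
import std.regex : match;

// Hypothetical helper: answers only "is there a match somewhere?".
// This version still runs a full first-match search internally.
bool hasMatch(string input, string pattern)
{
    return !match(input, pattern).empty;
}

void main()
{
    assert(hasMatch("order 66", "[0-9]+"));
    assert(!hasMatch("no digits here", "[0-9]+"));
}
```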

... that got a bit lengthy - any thoughts, criticism, opinions ?

-- 
Dmitry Olshansky
Mar 12 2013
Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 3/12/13, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 struct match //ditto  for replace
 {
 	//current behavior
 	static auto opCall(.....);
 }

For a second I was worried this would break UFCS, but actually it still works. Pretty kewl.
Mar 12 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Mar-2013 14:36, Andrej Mitrovic wrote:
 On 3/12/13, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 struct match //ditto  for replace
 {
 	//current behavior
 	static auto opCall(.....);
 }

For a second I was worried this would break UFCS, but actually it still works. Pretty kewl.

Actually it does... but only partially:

import std.stdio;

struct match //ditto for replace
{
	//current behavior
	static void opCall(T)(T s)
	{
		writeln(s);
	}
	//get the first match / replace only the first occurrence
	static void first(T)(T s, int k)
	{
		writeln("FIRST: ", s);
	}
	//force finding all matches (still a lazy range)
	static void all(T)(T s, int p)
	{
		writeln("ALL: ", s);
	}
}

void main()
{
	"abc".match(); //works
	"abc".match.first(42); //doesn't
}

There's got to be some way out of it that doesn't involve alias this and proxies...

-- 
Dmitry Olshansky
Mar 12 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Mar-2013 17:12, Dmitry Olshansky wrote:
 12-Mar-2013 14:36, Andrej Mitrovic wrote:
 On 3/12/13, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 struct match //ditto  for replace
 {
     //current behavior
     static auto opCall(.....);
 }

For a second I was worried this would break UFCS, but actually it still works. Pretty kewl.

Actually it does... but only partially:

So my initial idea gets **cked up by how UFCS resolves 'b' in an 'a.b.c' chain.

The problem is not anything new BTW, as there is no way to do a fully qualified call with UFCS: something.std.ascii.isWhite also won't work.

Darn UFCS. Maybe we can discover some sane rule to cover both corner cases?

 There's got to be some way out of it that doesn't involve alias this and
 proxies...

And w/o proxies I can get at least match(...).all and match(...).first. But not replace, as it used to return a naked array and thus would need to be a proxy... and proxies have another problem.

Again, the problem is not anything new, but I believe it is a flaw in any proxy design in D. It's the fact that auto type inference on initialization sees proxies for what they are:

auto x = replace(...); // now typeof(x) is some ugly proxy junk
auto y = replace(...).all; //fine - typeof(y) is array
auto z = replace(...).first; //fine - typeof(z) is array

Would it make sense to somehow tweak the language to allow proxies to decay to some 'default' type (of their choice) on initialization? It seems to me that any container (or whatever) that builds on proxies is going to hit this wall.

-- 
Dmitry Olshansky
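
The inference problem can be reproduced with a toy proxy. This is only a sketch: `ReplaceProxy` and `fakeReplace` are invented names standing in for whatever a proxy-returning replace would look like, and `alias this` is assumed as the conversion mechanism.

```d
// Hypothetical proxy standing in for the result of a proxy-based replace.
struct ReplaceProxy
{
    string result;
    alias result this;   // lets the proxy convert implicitly to string

    @property string all()   { return result; }
    @property string first() { return result; }
}

// Hypothetical stand-in for replace(...); it just wraps its argument.
ReplaceProxy fakeReplace(string s)
{
    return ReplaceProxy(s);
}

void main()
{
    auto x = fakeReplace("abc");       // auto keeps the proxy type
    auto y = fakeReplace("abc").all;   // explicit selector: plain string
    string z = fakeReplace("abc");     // declared type forces the conversion

    static assert(is(typeof(x) == ReplaceProxy));
    static assert(is(typeof(y) == string));
    assert(z == "abc");
}
```

The proxy converts fine whenever a target type is spelled out, but `auto` never triggers the conversion, so the proxy type leaks into user code.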
Mar 12 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Mar-2013 19:08, Nick Sabalausky wrote:
 On Tue, 12 Mar 2013 11:06:56 -0400
 Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> wrote:
 matchFirst
 matchAll
 matchTest

 ?

s/matchTest/isMatch/

or rather 'hasMatch'

But that's too obvious :)

The problem is that I wanted to avoid creating a bunch of new names, especially as they are tweaks/options on the original behavior. I'd go with direct enum flags, but it's a bit too verbose:

match("blah-bleh", "bl[ae]h", Match.all); //Match.first

That's why I've tried to find a way to get either match.all(...) or match(...).all working. And that is possible, but not with replace.

-- 
Dmitry Olshansky
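
For comparison, the enum-flag variant could be sketched as a thin wrapper. The enum `Match` follows the spelling used in the post; the function `findHits` is a hypothetical name, and mapping the flag back onto "g" is just one possible way to implement it.

```d
import std.regex : match, regex;

// Search mode passed at the call site instead of being baked into the pattern.
enum Match { first, all }

// Hypothetical wrapper: translates the mode into the existing "g" flag
// and returns the matched text of each hit.
string[] findHits(string input, string pattern, Match mode)
{
    auto re = regex(pattern, mode == Match.all ? "g" : "");
    string[] result;
    foreach (m; match(input, re))
        result ~= m.hit;
    return result;
}

void main()
{
    assert(findHits("blah-bleh", "bl[ae]h", Match.first) == ["blah"]);
    assert(findHits("blah-bleh", "bl[ae]h", Match.all) == ["blah", "bleh"]);
}
```

The intent is explicit at the call site, at the cost of the extra trailing argument that the post calls too verbose.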
Mar 12 2013
Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Tue, 12 Mar 2013 18:07:42 +0400
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 12-Mar-2013 17:12, Dmitry Olshansky wrote:
 12-Mar-2013 14:36, Andrej Mitrovic wrote:
 On 3/12/13, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 struct match //ditto  for replace
 {
     //current behavior
     static auto opCall(.....);
 }

For a second I was worried this would break UFCS, but actually it still works. Pretty kewl.

Actually it does... but only partially:

 So with my initial idea being **cked up by UFCS on 'b' in case of
 'a.b.c' resolution chain.

 The problem is not anything new BTW as there is no way to do a fully
 qualified call with UFCS. something.std.ascii.isWhite also won't work.

 Darn UFCS. Maybe we can discover some sane rule to cover both corner cases?

matchFirst
matchAll
matchTest

?
Mar 12 2013
Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Tue, 12 Mar 2013 11:06:56 -0400
Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> wrote:
 
 matchFirst
 matchAll
 matchTest
 
 ?
 

s/matchTest/isMatch/
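
For readers trying this on a current compiler: later Phobos releases do ship free functions named `matchFirst` and `matchAll` in std.regex, and as plain functions they work with UFCS. The sketch below assumes such a release; `firstHit` and `allHits` are hypothetical wrappers added only to keep the example small.

```d
import std.algorithm : map;
import std.array : array;
import std.regex : matchAll, matchFirst;

// Hypothetical wrapper: text of the first match, or null if there is none.
string firstHit(string input, string pattern)
{
    auto c = matchFirst(input, pattern);
    return c.empty ? null : c.hit;
}

// Hypothetical wrapper: text of every match, collected eagerly.
string[] allHits(string input, string pattern)
{
    return matchAll(input, pattern).map!(m => m.hit).array;
}

void main()
{
    assert(firstHit("bleh-blah", "bl[ea]h") == "bleh");
    assert(allHits("bleh-blah", "bl[ea]h") == ["bleh", "blah"]);

    // Free functions keep UFCS call syntax working.
    assert(!"bleh-blah".matchAll("bl[ea]h").empty);
}
```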
Mar 12 2013
"Brad Anderson" <eco gnuk.net> writes:
On Tuesday, 12 March 2013 at 09:41:08 UTC, Dmitry Olshansky wrote:
 *Spoiler*: let's slowly deprecate "g" option in std.regex in a 
 few years or with any luck a bit faster. The better replacement 
 is proposed.

 For better or worse the current API has retained a (high) level 
 of compatibility with the old API. That means I've missed the 
 chance to fix it when I could, and here is the prime problem 
 (the hardest) I have with it:

 foreach(m; match("bleh-blah", "bl[ea]h"))
 {
 	writeln(m.hit);
 }

 The "quiz" is - how many lines will this print?

 The current answer is 1. And that the right solution for all 
 matches is:

 foreach(m; match("bleh-blah", regex("bl[ea]h","g"))
 {
 	writeln(m.hit);
 }

 Which is not only looks unsightly but also confuses operation 
 option (find _all_ vs find _first_) with property of a pattern 
 (like case-insensitivity is). And if regex pattern is defined 
 elsewhere it could easily introduce a bug (albeit one that's 
 easy to track, "usually").

 To underline the point: std.regex.splitter doesn't take "g" 
 flag into account at all (it makes no sense there).

 I've pondered a couple of solutions in a bug report by 
 bearophile:
 http://d.puremagic.com/issues/show_bug.cgi?id=7260

 After all of these ideas born and discarded, here is what I 
 believe is the way forward out of this mess:

 Make "g" indicates only the intended _default_ search mode of 
 this pattern (global - first match).

 User is free to override this default explicitly and in fact 
 encouraged to do so. The idea of default search mode attached 
 to the regex pattern is marked as discouraged.

I nearly always forget to include "g", so I welcome any changes that make "g" go away. match.first/match.all/etc. is easy to read and the intent is right up front, which I prefer over tacking a flag argument on the end. matchFirst/matchAll/etc. is fine too but not nearly as cool :).
Mar 12 2013
"Jesse Phillips" <Jessekphillips+D gmail.com> writes:
On Tuesday, 12 March 2013 at 09:41:08 UTC, Dmitry Olshansky wrote:
 ... that got a bit lengthy - any thoughts, criticism, opinions ?

I like it. Maybe Nick is right in just having separate functions so UFCS is still working. Or since opCall works we can just say the new types are only callable without UFCS, and maybe the future will hold an improvement for it.
Mar 12 2013