www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Template regexes, version 2

reply Don Clugston <dac nospam.com.au> writes:
I've had another go at doing compile-time regexps. With all the bugfixes 
in recent releases, it's now possible to do it with mixins, generating 
local functions instead of global ones. This provides _considerably_ 
more flexibility.

The ultimately generated functions can look like:

char [] firstmatch(char [] s)
{
    char [][] results;
    int someOtherResult;

    void func(char [] s) {
       // this parses the string, using the original regexp string.
       // intermediate results (eg expressions in parentheses) are
       // in the local variables: results[][], someOtherResult, etc.
    }
    func(s);
    return results[0];
}

Usage is like:
char [] x = firstmatch!("ab+")(someStr);

It seems to me that there are 3 types of regexes:
* pure static -- where the regex string is a string literal, known at 
compile time
* pure dynamic -- eg, as used in a grep utility.
* pseudo-static. By this I mean regexps where the structure is constant, 
but some strings are replaced with variables.

Perl-ish languages deal with the last case by allowing embedded 
variables in strings. The template regexes Ive been developing can cope 
with them by using additional parameters and extended syntax, eg  1 for 
the string value of the first parameter.
eg

char [] var1 = "here";
char [] input = " a56aherea";
char [] x = firstmatch!("a 1a")(input, var1);
var1 = "56";
char [] y = firstmatch!("a 1a")(input, var1);

would give x == "aherea", y=="aherea".

As far as runtime efficiency goes, it's almost ideal, except that 
consecutive terms result in heavily nested trivial functions:

eg for the trivial "x\dy":
bool func(char [] s)
{
     bool a(char [] s) {
     	bool a(char [] s) {
             return s[0]=='y';
     	}
         return isdigit(s[0]) && a(s[1..$]);
     }
     return s[0]=='x' && a(s[1..$]);
}

and we need to rely on the optimiser to turn this into:
s[0]=='x' && isdigit(s[1]) && s[2]=='y';
How good is DMD at performing this optimisation? A nested function that 
is only called once _should_ always be inlined, no matter how deeply 
nested it is, but reality could be quite different...

Many optimisations are possible with this scheme (for example, the inner 
functions could just use int parameters, so that they don't need to pass 
s[] repeatedly; almost all of the optimisations from the runtime 
std.regexp could be applied). It should also be possible to implement 
Perl 6-style regexps.

...And then I return to the D newsgroups after a week and find the 
goalposts have moved: regexps are now built into the language.

The mixin regexps are only at an early stage of development, but given 
the current discussions I think it's important to know what can be done
with templates (probably more than most people expect). In the case of 
what I've called "pseudo-static" regexps, they are arguably more 
powerful than the built-in regexps of DMD 0.147.

I don't know where to go from here. There are a few possibilities:
1. use template regexps as a demonstration of the power of D templates.
--> Implement reduced syntax, keep the code simple and not well 
optimised; more of a proof-of-concept; go for "maximum coolness".
2. Like 1, but with optimisations; try to be the fastest regexp of any 
language. (cleaner-faster-better ... but we use the built-in regexps 
anyway <g>).
3. Create a standard library. This is what I was aiming for, but I don't
think it makes sense any more.
4. potential integration into the language. Is this even possible?

Probably the most sensible is to go with 1, to wake up the C++/boost 
crowd. Hopefully this should also provide some direction for the 
built-in syntax.
Thoughts?
Feb 21 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
       Awesome!


Don Clugston wrote:
 I've had another go at doing compile-time regexps. With all the bugfixes 
 in recent releases, it's now possible to do it with mixins, generating 
 local functions instead of global ones. This provides _considerably_ 
 more flexibility.
 
 The ultimately generated functions can look like:
 
 char [] firstmatch(char [] s)
 {
    char [][] results;
    int someOtherResult;
 
    void func(char [] s) {
       // this parses the string, using the original regexp string.
       // intermediate results (eg expressions in parentheses) are
       // in the local variables: results[][], someOtherResult, etc.
    }
    func(s);
    return results[0];
 }
 
 Usage is like:
 char [] x = firstmatch!("ab+")(someStr);

Nice!!
 It seems to me that there are 3 types of regexes:
 * pure static -- where the regex string is a string literal, known at 
 compile time
 * pure dynamic -- eg, as used in a grep utility.
 * pseudo-static. By this I mean regexps where the structure is constant, 
 but some strings are replaced with variables.

Pure static is what I always wanted with regexps in D! Pure dynamic has less of utility to me, but others may disagree. And that is taken care of by Walter now. Pseudo-static is cool!! And I believe *immensely* useful. If I do log analyzers, net statistics programs, serious data mining frameworks, then pseudo-static regexps is what I do all day long! And, at the other end, for even most 'dscript' tasks this is the core!
 As far as runtime efficiency goes, it's almost ideal, 

Amazing!
 ...And then I return to the D newsgroups after a week and find the 
 goalposts have moved: regexps are now built into the language.

Awwww, they're just runtime. Mere syntax sherades to smoke-and-mirror away the fact.
 The mixin regexps are only at an early stage of development, but given 
 the current discussions I think it's important to know what can be done
 with templates (probably more than most people expect). 

Most people couldn't. :-) But then again, "a language can't be Serious if all its features are graspable to VB programmers".
 In the case of 
 what I've called "pseudo-static" regexps, they are arguably more 
 powerful than the built-in regexps of DMD 0.147.

I DEMAND that pseudo-static regexps be in the very next release of DMD.
 I don't know where to go from here. There are a few possibilities:

 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well 
 optimised; more of a proof-of-concept; go for "maximum coolness".

From a "marketing point of view", that may be wise.
 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).

I probably would not use the "built-ins" (if that refers to the current runtime library-come-syntax stuff) hardly at all.
 3. Create a standard library. This is what I was aiming for,
 but I don't think it makes sense any more.

Of course it does!
 4. potential integration into the language. Is this even possible?

Wanna guess my vote on this?!
 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. 

Why wake 'em up???? Unless... you expect them all to abandon C++ on sight and stampede over here? Heh, I wonder what 20 Dons would achieve together!
Feb 21 2006
prev sibling next sibling parent reply "Craig Black" <cblack ara.com> writes:
 The mixin regexps are only at an early stage of development, but given the 
 current discussions I think it's important to know what can be done
 with templates (probably more than most people expect). In the case of 
 what I've called "pseudo-static" regexps, they are arguably more powerful 
 than the built-in regexps of DMD 0.147.

 I don't know where to go from here. There are a few possibilities:
 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well optimised; 
 more of a proof-of-concept; go for "maximum coolness".
 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).
 3. Create a standard library. This is what I was aiming for, but I don't
 think it makes sense any more.
 4. potential integration into the language. Is this even possible?

 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. Hopefully this should also provide some direction for the built-in 
 syntax.
 Thoughts?

You are the man! Keep up the good work! As far as what direction to go, a stardard library makes perfect sense. If Walter's kludge for regex's doesn't fit with this new approach, perhaps he can adapt it to make it work. Even without this, I would much rather have faster regex's than language integration. BTW, I developed a new syntax that is a spinoff of regular expressions (using C++). It includes syntax to manipulate a stack and store values in a parse tree. Using this language, I was able to define a parser for a modern object-oriented language in like 230 lines of code. I could show you this stuff if you are interested. -Craig
Feb 21 2006
next sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Craig Black wrote:
 Don Clugston wrote:

 BTW, I developed a new syntax that is a spinoff of regular
 expressions (using C++).  It includes syntax to manipulate a stack
 and store values in a parse tree.  Using this language, I was able to
 define a parser for a modern object-oriented language in like 230
 lines of code.  I could show you this stuff if you are interested.
 
 -Craig

I'm not even gonna try to imagine what happens when Don gets his hands on that stuff. And I used to be pretty good at imagining stuff.
Feb 21 2006
parent pragma <pragma_member pathlink.com> writes:
In article <43FB4A18.4080904 nospam.org>, Georg Wrede says...
Craig Black wrote:
 Don Clugston wrote:

 BTW, I developed a new syntax that is a spinoff of regular
 expressions (using C++).  It includes syntax to manipulate a stack
 and store values in a parse tree.  Using this language, I was able to
 define a parser for a modern object-oriented language in like 230
 lines of code.  I could show you this stuff if you are interested.
 
 -Craig

I'm not even gonna try to imagine what happens when Don gets his hands on that stuff. And I used to be pretty good at imagining stuff.

You and me both. I'm just as excited as everyone else. - Eric Anderton at yahoo
Feb 21 2006
prev sibling parent reply "Craig Black" <cblack ara.com> writes:
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Since it seems that there may be some interest, and to prove that I'm =
not bull-shitting, I thought I would post the code that I mentioned =
earlier.  It describes a parser for an unnamed object-oriented language =
of my design (that rips off a lot of stuff from D).  It's actually less =
than 230 lines if you don't count the comments and whitespace.  It's all =
one big string that gets parsed at run-time, but perhaps D could turn it =
into optimized source code.

const char *languageParser =3D
"Real::(!Delims&(-?(/d+(/.(/d)+)?)|(/d*/.(/d)+)([eE][/-+]?/d{1,3})?) ~;)"=
 // A real number
"Char::(!Delims&(\') ~&((\\)?.) ~\';;)" // char literal
"Text::(!Delims&(\") ~&([^\".]*) ~\";;)" // text literal

// Delimeters
"Comment1::(//(^(/n).)*)"
"Comment2::(/*(^(/*//).)*/*//)"
"Delims::(/s(//(!Comment1|!Comment2)/s)*)"

//
// Expressions
//
"BiOp1:([/./^])" // . ^ (exponentiation)
"BiOp2:([///*/%/#])" // / div * mul % mod # concat
"BiOp3:([/-+])" // - +
"BiOp4:(=3D|/!=3D|<|>|<=3D|>=3D|in|/~|/!/~)" // =3D !=3D < > <=3D >=3D =
in ~ !~
"BiOp5:(/&|/||/./.)" // & | .. (range operator)
"Parens:(/(!Expr/))" // Expression bound by parenthesis
"MParam:(((`=3D) ~&(/i) ~;;=3D)?!Expr)"
"MParens:(&(/() ~(!MParam(,!MParam)*)?/);)" // 0 or more comma delimited =
expressions bound by parens
"Index:(&(/[) ~!Expr(,!Expr)*/];)" // Array index
"Tuple1:(&(/[) ~!MParam(,!MParam)*/];)" // Array index tuple
"Tuple2:(&(/() ~!MParam(,!MParam)*/);)" // Parens tuple
"Tuple:(!Tuple1|!Tuple2)"
"TBang:(&(/!) ~;)" // template instantiation bang
"DotExpr:(&(/.) ~;&(/i) ~;)" // dots in an id expression
"IdExpr:((`/:idexpr) ~&(/i) ~;(!DotExpr)*(((!TBang)?!MParens)|!Index)*;)"=
 // Id expression
"LExpr:(!Real|!Text|!Char|!TypeofExpr|!IdExpr|!Tuple|!Parens|!MethodExpr|=
!DelegateExpr|!ClosureExpr|!PrefixExpr)"
// Prefix operators
"PrefixOp:([/*/-/!/&]|sizeof|typeid|typeof)"=20
"PrefixExpr:(&(!PrefixOp) ~!LExpr;)"
// Exponentiation
"BiExpr1:((`t1) ~!LExpr&(!BiOp1)$~!BiSub2;)"
// Multiplication, division, and modulus
"BiSub2:(!BiExpr1|!LExpr)"
"BiExpr2:((`t2) ~!BiSub2&(!BiOp2)$~!BiSub3;)"
// Addition and Subtraction
"BiSub3:(!BiExpr2|!BiSub2)"
"BiExpr3:((`t3) ~!BiSub3&(!BiOp3)$~!BiSub4;)"
// Comparisons =3D=3D !=3D < > <=3D >=3D
"BiSub4:(!BiExpr3|!BiSub3)"
"BiExpr4:((`t4) ~!BiSub4&(!BiOp4)$~!BiSub5;)"

// & | ..
"BiSub5:(!BiExpr4|!BiSub4)"
"BiSub6:(!BiExpr5|!BiSub5)"
"BiExpr5:((`t5) ~!BiSub5&(!BiOp5)$~!BiSub6;)"
"Keywords:(module|interface|class|enum|static|const|delegate|closure|meth=
od)"
// Type expression
"TypeofExpr:(&(typeof) ~!Expr(/.&(/i) ~;)*;)"
"TypeDots:(^(!Keywords)&(/i) ~(/.&(/i) ~;)*;)"
"Array:(&(/[) ~(!Expr(,!Expr)*)?/];)" // Array type
"Const:((&(/!) ~;)|((const(`/!)) ~;))"
"MTypeExpr:(((!Const)?(&(/*) ~;|!Array))*(!Const)?(!DelegateType|!Closure=
Type|!TypeofExpr|!TypeDots))"
"TypeExpr:((`/:type) ~!MTypeExpr;)"
// delegate, closure types
//  void(int) myDelegate =3D  myMethod;
//  void(int) myDelegate =3D  myInstance.myMethod;
// #void(int) myClosure =3D #myMethod(5);
// #void(int) myDelegate =3D #myInstance.myMethod(5);
"SID:(&(/i) ~;^(/i))"
"DParam:((`/:var) ~!TypeExpr(!SID(,!SID)*)?(,;!DParam)?)"
"DParens:(/((!DParam;)?/))"
"Delegate:(/ |delegate)"
"Closure:(/#|closure)"
"DelegateType:((`/ ) ~!Delegate!TypeExpr!DParens;)"
"ClosureType:((`/#) ~!Closure!TypeExpr!DParens;)"
// Anonymous method expression
// %void(int a) myDelegate =3D %void(int a) Write(a);;
"AParam:((`/:var) ~!TypeExpr!SID(,!SID)*(,;!AParam)?)"
"AParens:(/((!AParam;)?/))"
"MethodExpr:((`/%) ~!Method!TypeExpr!AParens!Block;)"
"DelegateExpr:((`/ ) ~!Delegate(!IdExpr/.)?&(/i) ~;;)"
// Closure expression
"ClosureExpr:((`/#) ~!Closure!IdExpr;)"
// New expression
"NewExpr:(&(new) ~!TypeExpr(!MParens)?;)"
// CastExpr
"CastExpr:((`cast) ~!Expr&((!Static)?as)$~!TypeExpr;)"
// An expression
"BiOps:(!BiOp1|!BiOp2|!BiOp3|!BiOp4|!BiOp5)"
"PreLExpr:(!LExpr^(!BiOps))"
"Expr:(^[/]/;/)/}](!PreLExpr|!BiExpr5|!BiExpr4|!BiExpr3|"
"!BiExpr2|!BiExpr1|!NewExpr|!CastExpr))"
//
// Statements
//
// Assignment statement
"AssOp:(=3D|/+=3D|/-=3D|/*=3D|//=3D|/^=3D)"
"AssSt:((`ass) ~!IdExpr(&(!AssOp)$~)!Expr;)"
// Delete statement
"DeleteSt:(&(delete) ~!IdExpr;/;)"
// Return, break, continue statements
"ReturnSt:(&(return) ~(!Expr)?;/;)"=20
"BreakSt:(&(break) ~;/;)"
"ContinueSt:(&(continue) ~;/;)"
// Postfix statement
"PostfixOp:(/+/+|/-/-)"
"PostfixSt:((`pf) ~!IdExpr&(!PostfixOp)$~;)"
// Variable declaration statement
"Var:(&(/i) ~(=3D(!NewExpr|!Expr))?;)" // var =3D expr
"Static:((&(/$) ~;)|(static(`/$) ~;))"
"VarSt:((`/:var) ~(!Static)?!TypeExpr!Var(,!Var)*;/;)"
"SVarSt:((`/:var) ~!TypeExpr!Var(,!Var)*;/;)"
// If statement
"IfSt:(&(if) ~!Expr(/:)?!Block(else!Block)?;)"
// Static if statement
"SIfSt:(&(!Static(if)) ~!Expr(/:)?!Block(else!Block)?;)"
// Switch statement
"CaseSt:((`case) ~(case)?!Expr*(/:)?!Block;)"
"DefaultSt:(&(default) ~(/:)?!Block;)"
"SwitchSt:(&(switch) ~/{(!CaseSt)*(!DefaultSt)?/};)"
// While statement
"WhileSt:(&(while) ~!Expr!Block;)"=20

// Do statement
"DoSt:(&(do) ~!Block(while)!Expr;/;)"

// Method call statement
"MethodSt:((`/(/)) ~!IdExpr;/;)"
// Foreach statement=20
"FEVarDecl:((`/:var) ~!TypeExpr&(/i) ~;;)"
"ForeachHeader:(!FEVarDecl(,!FEVarDecl;)?/;!Expr(&(step) ~!Expr;)?)"
"ForeachSt:(&(foreach) ~/(!ForeachHeader/)!Block(else!Block);)"
// For statement
"FVarDecl:((`/:var) ~!TypeExpr&(/i) ~;(=3D!Expr);)"
"ForInit:(!FVarDecl|!AssSt)"
"FStatement:(!AssSt|!PostfixSt)"
"ForSt:(&(for) ~/((!ForInit(,!ForInit)*)?/;!Expr/;(!FStatement(,!FStateme=
nt)*)?/)!Block;)"
// In and out blocks can include preconditions and postconditions=20
"InBlock:(&(in) ~!Block;)"
"OutBlock:(&(out) ~!Block;)"
// Trace block
"TraceBlock:(&(trace) ~!Block;)"
// Echo, assert
"EchoSt:(&(echo) ~!Expr;)"
"AssertSt:(&(assert) ~!Expr;)"
// Exceptions
"CatchSt:((`catch) ~(catch)?&(/i) ~;*(/:)?!Block;)"
"DefaultSt:(&(default) ~(/:)?!Block;)"
"TryBlock:(&(try) ~/{(!CatchSt)*(!DefaultSt)?/};)"
"ThrowSt:(&(throw) ~;)"
// Enumeration definition
"EnumNode:(&(/i) ~(=3D!Expr);)"
"EnumDef:(&(enum) ~/{!EnumNode(,!EnumNode)*/};)"
// Method definition
"TID:(&(/i) ~^(/i)(=3D!Expr)?;)"
"TParam:((`/:var) ~(&(in|out|inout) ~;)?!TypeExpr!TID(,!TID)*(,;!TParam)?=
)"
"TParens:(/((!TParam;)?/))"
"Method:(/%|method)"
"SMethodDef:((`/%) ~!Method!TypeExpr&(/i)  ~;!TParens!Block;)"
"MMethodDef:((`/%) ~(!Static|!Const)?!Method!TypeExpr&(/i)  ~;!TParens!Bl=
ock;)"
"IMethodDef:((`/%) ~(!Const)?!Method!TypeExpr&(/i)  ~;!TParens/;)"
//"Construct:(&(construct) ~!Block;)"
//"Destruct:(&(destruct) ~!Block;)"
//"Startup:..."
//"Terminate:..."
//"Set:(&(set) ~!TParens!Block;)"
//"Get:(!TypeExprget!Block;)"
// Operator overloading
"OpOverload:(&(operator) ~(!PrefixO|!PostfixO|!AssignO|!BinaryO)!Block;)"=

"BinaryO:(&(^(/.)!BiOps) ~;!TypeExpr/(!TypeExpr&(/i) ~;,!TypeExpr&(/i) ~;=
/))"
"AssignO:(&(^(=3D)!AssOp) ~;void/(!TypeExpr&(/i) ~;/))"
"PrefixO:(&(/-/!) ~;!TypeExpr/(/))"
"PostfixO:(&(/+/+|/-/-) ~;void/(/))"
"BinaryOverload:(&(operator) ~!BinaryO!Block;)"
// Class definition
"ProtSt:(&(public|private|protected) ~;/:)"
"ClassSt:(!ProtSt|!Alias|!TypeDef|!VarSt|!MMethodDef|!ClassDef|!EnumDef|!=
InterfaceDef|!OpOverload)"
"ClassBlock:(&(/{) ~(!ClassSt)*/};)"
"TemplateParam:((&(int) ~;)?(&(/i) ~;))"
"TemplateParams:(!TemplateParam(,!TemplateParam)*)"
"ClassDef:(&(class) ~&(/i) ~(/(!TemplateParams/))?;(/:!IdExpr(,!IdExpr)*)=
?!ClassBlock;)"
// Interface definition
"InterfaceSt:(!ProtSt|!Alias|!TypeDef|!IMethodDef|!ClassDef|!EnumDef|!Int=
erfaceDef)"
"InterfaceBlock:(&(/{) ~(!InterfaceSt)*/};)"
"InterfaceDef:(&(interface) ~&(/i) ~;(/:!IdExpr(,!IdExpr)*)?!InterfaceBlo=
ck;)"
"Alias:(&(alias) ~&(/i) ~;=3D!TypeExpr;)"
"TypeDef:(&(typedef) ~&(/i) ~;=3D!TypeExpr;)"
// import
"ImportNode:(&(/i) ~(/.&(/i)  ~;)*;)"
"ImportSt:(&(import) ~!ImportNode(,!ImportNode)*;/;)"
// error checking, exceptions
// in, out, inout params
// in, out code blocks
"ModuleInit:(&(module) ~&(/i)  ~;/:)"
"ModuleSt:(!ProtSt|!Alias|!TypeDef|!ImportSt|!SVarSt|!SMethodDef|!ClassDe=
f|"
"!EnumDef|!InterfaceDef|!BinaryOverload)"
"ModuleDef:(!ModuleInit(!ModuleSt)*)"
// A statement
"Statement:(!IfSt|!ForSt|!ForeachSt|(!PostfixSt/;)|(!AssSt/;)|!MethodSt|"=

"!VarSt|!InBlock|!OutBlock|!TraceBlock|!TryBlock|!ThrowSt|!EchoSt|"
"!AssertSt|!SwitchSt|!WhileSt|!DoSt|!ReturnSt|!BreakSt|!ContinueSt|"
"!DeleteSt|!SIfSt|!EnumDef|!SMethodDef|!ClassDef)"
// A statement block
"Block:((`/{) ~(/{(!Statement)*/}|!Statement);)";
Feb 21 2006
parent reply Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:
Craig Black wrote:
 Since it seems that there may be some interest, and to prove that I'm 
 not bull-shitting, I thought I would post the code that I mentioned 
 earlier.  It describes a parser for an unnamed object-oriented language 
 of my design (that rips off a lot of stuff from D).  It's actually less 
 than 230 lines if you don't count the comments and whitespace.  It's all 
 one big string that gets parsed at run-time, but perhaps D could turn it 
 into optimized source code.
  

<snip> That looks scary :)
Feb 21 2006
next sibling parent "Craig Black" <cblack ara.com> writes:
 That looks scary :)

Be afraid, be very afraid.
Feb 21 2006
prev sibling parent reply "Craig Black" <cblack ara.com> writes:
 That looks scary :)

Probably you are joking, but I feel the need to defend my code anyway. To someone who doesn't know the syntax it is not going to be very readable. <g> However, my code is way more readable than a lot of the regular expressions that I've seen. If you don't believe me, google for regular expressions that can be used to parse C++. One benefit of my language over regular expressions is that you can define subroutines. Thus, it's not all one big mess. It's a bunch of smaller, little messes. This makes a long regular expression much more managable. -Craig
Feb 21 2006
parent reply Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:
Craig Black wrote:
That looks scary :)

Probably you are joking

I was joking (but it does look scary)
 , but I feel the need to defend my code anyway.  To 
 someone who doesn't know the syntax it is not going to be very readable. <g> 
 However, my code is way more readable than a lot of the regular expressions 

I guess i'll believe you.
 that I've seen.  If you don't believe me, google for regular expressions 
 that can be used to parse C++.  

Well I would be surprised to find that because C++ grammar isn't regular, it isn't even LR, it is something awfull.
 One benefit of my language over regular 
 expressions is that you can define subroutines.  

Interesting? What class of languages can it parse?
 Thus, it's not all one big 
 mess.  It's a bunch of smaller, little messes.  This makes a long regular 
 expression much more managable.
 

Feb 21 2006
parent "Craig Black" <cblack ara.com> writes:
"Ivan Senji" <ivan.senji_REMOVE_ _THIS__gmail.com> wrote in message 
news:dtg4am$2si9$1 digitaldaemon.com...
 Craig Black wrote:
That looks scary :)

Probably you are joking

I was joking (but it does look scary)
 , but I feel the need to defend my code anyway.  To someone who doesn't 
 know the syntax it is not going to be very readable. <g> However, my code 
 is way more readable than a lot of the regular expressions

I guess i'll believe you.
 that I've seen.  If you don't believe me, google for regular expressions 
 that can be used to parse C++.

Well I would be surprised to find that because C++ grammar isn't regular, it isn't even LR, it is something awfull.

No, the regular expression doesn't actually parse C++, but it helps to parse it. A friend of mine was writing a C++ parser and he showed me a regular expression he found on the internet (I would show you, but I just looked but couldn't find it.) It was hideously long because it has a lot of repetitous stuff. When you add subroutines to the syntax, the repetition disappears.
 One benefit of my language over regular expressions is that you can 
 define subroutines.

Interesting? What class of languages can it parse?

Theoretically it can parse anything. It includes syntax for pushing and poping text on a stack and populating text values in a simple parse tree format.
 Thus, it's not all one big mess.  It's a bunch of smaller, little messes. 
 This makes a long regular expression much more managable.
 


Feb 22 2006
prev sibling next sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Don Clugston" <dac nospam.com.au> wrote in message 
news:dtev42$1g98$1 digitaldaemon.com...
 ...And then I return to the D newsgroups after a week and find the 
 goalposts have moved: regexps are now built into the language.

Don't worry about that, they'll be gone again in the next version <g>. Based on the considerable feedback about this issue, I think I've got a better design.
Feb 21 2006
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Don Clugston" <dac nospam.com.au> wrote in message 
news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded variables 
 in strings. The template regexes Ive been developing can cope with them by 
 using additional parameters and extended syntax, eg  1 for the string 
 value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.
 How good is DMD at performing this optimisation?

Not good enough. But I wouldn't worry about that for the moment.
 I don't know where to go from here. There are a few possibilities:
 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well optimised; 
 more of a proof-of-concept; go for "maximum coolness".
 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).
 3. Create a standard library. This is what I was aiming for, but I don't
 think it makes sense any more.
 4. potential integration into the language. Is this even possible?

 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. Hopefully this should also provide some direction for the built-in 
 syntax.
 Thoughts?

#1. No question about it!
Feb 21 2006
parent reply Don Clugston <dac nospam.com.au> writes:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Walter Bright wrote:
 "Don Clugston" <dac nospam.com.au> wrote in message 
 news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded variables 
 in strings. The template regexes Ive been developing can cope with them by 
 using additional parameters and extended syntax, eg  1 for the string 
 value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

OK. Except ... I think replace() doesn't have to worry about $ colliding with the end of line marker? Anyway, I've used $n for now. (Doesn't SQL use for parameters in stored procedures? Can't quite remember.)
 How good is DMD at performing this optimisation?

Not good enough. But I wouldn't worry about that for the moment.

I'm not really surprised -- I imagine the back-end doesn't have much that's specific to D. I've attached what I've done. It's not complete -- doesn't do ranges, or any of the escape characters, or beginning and end of line. But it does do the full set of repetition operators, parentheses, and unions (arbitrarily nestable). It also does the $n parameters. Most of the remaining stuff is straightforward, and not very exciting. I feel as though I've finally worked out what mixins are for. They are a perfect match for this application. In any case, this should give you a good idea for what you'll able to use in your presentation.
Feb 22 2006
next sibling parent reply Don Clugston <dac nospam.com.au> writes:
I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.
However, the mixin technique means that the regexp is actually turned 
into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
}

so that in every case (including unions), the next function that needs 
to be called is always in scope. Intriguingly, it also has access to the 
local variables all the way down the tree. (Since of these levels can 
contain its own stack as an array, I'm pretty sure that a Turing machine 
can be created, it's likely to be more powerful than the state machines 
normally used for regexps).

Obviously this can be used to do regexps properly. I was hoping to use a 
clever naming scheme to avoid the alias issues from pragma's version, 
but it seems to be inevitable.

It seems hard to do this properly without creating a highly optimised 
regexp parser!

Don Clugston wrote:
 Walter Bright wrote:
 "Don Clugston" <dac nospam.com.au> wrote in message 
 news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded 
 variables in strings. The template regexes Ive been developing can 
 cope with them by using additional parameters and extended syntax, eg 
  1 for the string value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

OK. Except ... I think replace() doesn't have to worry about $ colliding with the end of line marker? Anyway, I've used $n for now. (Doesn't SQL use for parameters in stored procedures? Can't quite remember.)
 How good is DMD at performing this optimisation?

Not good enough. But I wouldn't worry about that for the moment.

I'm not really surprised -- I imagine the back-end doesn't have much that's specific to D. I've attached what I've done. It's not complete -- doesn't do ranges, or any of the escape characters, or beginning and end of line. But it does do the full set of repetition operators, parentheses, and unions (arbitrarily nestable). It also does the $n parameters. Most of the remaining stuff is straightforward, and not very exciting. I feel as though I've finally worked out what mixins are for. They are a perfect match for this application. In any case, this should give you a good idea for what you'll able to use in your presentation.

Feb 23 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Don Clugston wrote:

 I should mention that the code I posted doesn't do real regexps, because 
 it has no backtracking. It doesn't do *, +, ? properly, they are always 
 greedy. It can never match ".*a", for example.

 It seems hard to do this properly without creating a highly optimised 
 regexp parser!

LOL! (Self-reference.)
Feb 23 2006
prev sibling parent reply pragma <pragma_member pathlink.com> writes:
Don, I love this rendition.

In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem. (my apologies if you've already mulled this over) One way to go is provide a switch for greedy and non-greedy behaviors at the interface level, such that a programmer can select the behavior. Another is to adopt a different token sequence in the expression itself (I think some engines use "*?", "??", "{m,n}?" and "+?" for non-greedy matching). A third option, is to redesign the expression engine itself to work with multiple matches, such that each predicate is understood to match the left-hand side against n right-hand sides (as returned from other predicates). This way, the expression ".*a" will match "a" in each sucessive substring following each successive attempt at matching "."; the result would be a "char[][]" containing zero or more matches of the "*" predicate expression.
However, the mixin technique means that the regexp is actually turned 
into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
}

so that in every case (including unions), the next function that needs 
to be called is always in scope. Intriguingly, it also has access to the 
local variables all the way down the tree. (Since of these levels can 
contain its own stack as an array, I'm pretty sure that a Turing machine 
can be created, it's likely to be more powerful than the state machines 
normally used for regexps).

Also, I'm curious to see what DMD would do to optimize that code. Would some of those internals dissapear, or would they have to be anonymous functions in order for that to happen? It might be worth the trouble to go the latter, given there are no function arguments used anywhere, as the resulting code might be pretty darn close to using raw goto statements (is DMD smart enough to *not* create a stack frame without use of the 'naked' keyword?).
Obviously this can be used to do regexps properly. I was hoping to use a 
clever naming scheme to avoid the alias issues from pragma's version, 
but it seems to be inevitable.

Heh... finding that problem was the worst part. Don't forget how my solution has a tendency to hammer and pollute the symbol namespace for large expressions. It looks like you managed to work around some of that as well. Again, I'm floored by how much more can be gained via nested functions rather than a more "traditional" approach.
It seems hard to do this properly without creating a highly optimised 
regexp parser!

Hehe.. so the optimal solution is also the most complete? ;) - Eric Anderton at yahoo
Feb 23 2006
next sibling parent David Medlock <noone nowhere.com> writes:
pragma wrote:
 Don, I love this rendition.
 
 In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
 
I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem. (my apologies if you've already mulled this over) <snip> Hehe.. so the optimal solution is also the most complete? ;) - Eric Anderton at yahoo

One option is something like the following: (Assuming Regex is a linked list of atomic regular expressions, and it knows if it is nullable) bool match( string txt, Regex reg, bool delegate() backtrack ) { string tail ; // attempt to match the regex against the text bool success = reg.match( txt, tail ); // this function skips the current regex and tries the txt on the successor bool inner_skip() { bool result = match( txt, reg.next, backtrack ); if ( result==false ) return backtrack(); else return result; } if ( success ) { if ( reg.nullable ) return match( tail, reg.next, &inner_skip ); else return match( tail, reg.next, backtrack ); } else return ( backtrack is null ? false : backtrack() ); } //call with : match( text, regex, null ) A nice addition would be a callback delegate for matches. -DavidM
Feb 23 2006
prev sibling parent Don Clugston <dac nospam.com.au> writes:
pragma wrote:
 Don, I love this rendition.

Thanks! I'm just amazed at how this stuff has progressed, it's so easy now. I'm a little uncomfortable that people seem to be credited me for the regexp parser which you wrote. I want to make sure that you receive proper acknowledgment.
 In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
 I should mention that the code I posted doesn't do real regexps, because 
 it has no backtracking. It doesn't do *, +, ? properly, they are always 
 greedy. It can never match ".*a", for example.
 I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem. (my apologies if you've already mulled this over)

 
 One way to go is provide a switch for greedy and non-greedy behaviors at the
 interface level, such that a programmer can select the behavior.  Another is to
 adopt a different token sequence in the expression itself (I think some engines
 use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  
 
 A third option, is to redesign the expression engine itself to work with
 multiple matches, such that each predicate is understood to match the left-hand
 side against n right-hand sides (as returned from other predicates).  This way,
 the expression ".*a" will match "a" in each sucessive substring following each
 successive attempt at matching "."; the result would be a "char[][]" containing
 zero or more matches of the "*" predicate expression.
 
 
 However, the mixin technique means that the regexp is actually turned 
 into a binary tree of nested functions.

 eg (ab.*|c)a
 becomes something like:

 bool test(char [] searchstr)
 {
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
 }

 so that in every case (including unions), the next function that needs 
 to be called is always in scope. Intriguingly, it also has access to the 
 local variables all the way down the tree. (Since of these levels can 
 contain its own stack as an array, I'm pretty sure that a Turing machine 
 can be created, it's likely to be more powerful than the state machines 
 normally used for regexps).

Also, I'm curious to see what DMD would do to optimize that code. Would some of those internals dissapear, or would they have to be anonymous functions in order for that to happen? It might be worth the trouble to go the latter, given there are no function arguments used anywhere, as the resulting code might be pretty darn close to using raw goto statements (is DMD smart enough to *not* create a stack frame without use of the 'naked' keyword?).

I suspect it's not very optimised, since nested functions are not relevant for C++. But there's a lot of potential. If it can convert CALL; RET; sequences into JMP, the whole thing will collapse into an optimal mass of goto spaghetti, something like xxx call part2 xxx part2: xxx call part3 xxx part3: xxx ret
 
 Obviously this can be used to do regexps properly. I was hoping to use a 
 clever naming scheme to avoid the alias issues from pragma's version, 
 but it seems to be inevitable.

Heh... finding that problem was the worst part. Don't forget how my solution has a tendency to hammer and pollute the symbol namespace for large expressions. It looks like you managed to work around some of that as well. Again, I'm floored by how much more can be gained via nested functions rather than a more "traditional" approach.

Me too. It's just awesome how many language features are coming together in this: real compile time constants, template string parameters, nested functions, having ~ concatenation built into the language, mixins, alias template parameters, static if, constant folding, the !() notation for templates, array slices... And C++ can't do any of those things. It's incredible. Update: I got the aliases to work in about 20 minutes, without difficulties. Now it does proper non-greedy matches. Amazingly, the code is barely any more complicated. It still doesn't need explicit backtracking! Greedy matching can be implemented by repeatedly attempting non-greedy matches, and finally redoing the last one which was successful. Because each step in the process has its own local variables, the only backtracking is restoring a copy of the current position. I think that greedy matching is an inherently inefficient operation, it would be much nicer to have non-greedy as the default. Consider the case (expr1)*(expr2) I think that what most engines do is keep matching (expr1) until they get a failure. Then try (expr2). If it fails, wind back (expr1) and try (expr2) again. Keep going until you've wound all the way back to the beginning. This gives only one successful match to expr2, but many unnecessary matches to expr1. So it's good for cases like "b*(really_complex_regexp)" But it's bad for cases like "(really_complex_regexp)*b" Also, backtracking expr1 is a real pain. My method is the reverse, it has no unnecessary matches to (expr1), but may have many for (expr2). I'm not sure that it's really inferior. Ideally, at compile time you'd determine which of the expressions was more complex (perhaps just by counting the number of *,+, ? in each string) and choose between the two methods that way. if something that matches (expr1) can ever be a partial match for (expr2); and if it can't, then you know you can just keep matching (expr1) until it fails, and only then evaluate (expr2). But I suspect that comparing two regexps is one of those NP-Hard problems from CompSci that's just not feasible to solve.
 It seems hard to do this properly without creating a highly optimised 
 regexp parser!

Hehe.. so the optimal solution is also the most complete? ;)

It certainly looks that way. And it seems to be the simplest, too. For quantifiers "{m,n}, *,+, ?", I used the (I thought) quick-and-dirty method of dealing with all of them at once. But a side-effect is that it's more optimal. Also, I used functions with no parameters because it was quick and dirty. But it's optimal, too, and it's really easy to add more features. It's a bizarre experience.
Feb 24 2006
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
Thanks! 
Feb 23 2006
parent Don Clugston <dac nospam.com.au> writes:
Walter Bright wrote:
 Thanks! 
 
 

only). I've also added a couple of the trivial bits like character classes. http://trac.dsource.org/projects/ddl/browser/trunk/meta/regexp2.d I think that since it now does non-greedy matching correctly AFAIK even in the tricky cases, it should now be clear to everyone that the concept works. But it could be greatly improved, and some of the basic stuff is still missing (like escape characters). Let me know what you think is most important for your presentation.
Feb 25 2006