
digitalmars.D - What is a regular expression?

Georg Wrede <georg.wrede nospam.org> writes:
So the guru unrolls a long scroll and
holds it in front of my eyes.

"What do you see here?"
I saw a long D program.

"What do you see in the code?"
I see an Unquoted Regular Expression.

"What do you reckon that is?"
A kind of string literal, I guess.

"Why?"
Er, the compiler knows what it is, so it's unquoted.

"What does the compiler think it is?"
Uhhh, a regular expression?

"What does the compiler think that is?"
Ehhh, well, a string literal?

"I see you have a problem. You have a cluttered mind."
????

"Your mind has been tortured for decades by this
man in Washington state. Your brain cells have been
fried by a man originally from Denmark. And your
perception is impaired, and this has been brought
upon you by those who type with two fingers, whose
books are stained with mayo, who use the floor as
their ashtray, and who do not entertain respect."

Head down and tail between my legs, I went home.
I decided to find out what a Regular Expression is.

---------

program lines

/unquoted-regular-expression/

program lines


Okay, let's start: this regular expression, what
does it want? It wants a string.

What does it give? Sometimes it gives a few other
strings.

Anything else? It could indicate whether it is happy.

Anything more? Well, I hope it tells me where it
stopped. Like when it didn't consume the entire
string.

Now, suppose we remove the word regular expression
from this? Then it should become obvious what we
are talking about.

We are talking about a function.

This function takes a string, and returns a group
of strings, and a pointer to where it stopped.

We also want to know whether it succeeded, which
might either be explicitly returned as a boolean
value, or we could infer it from the group of (or
lack of) strings.
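
For comparison, the function described above can be sketched in present-day D. This is only an illustration, assuming Phobos' std.regex (which postdates this thread); matchAt and its parameter names are hypothetical, and the pattern is passed in as a compiled Regex rather than written as an unquoted literal.

```d
import std.regex : Regex, matchFirst, regex;

/// Hypothetical helper: tries to match `re` in `s`, starting at `pos`.
/// On success it fills `subStrings` (whole match first, then the groups),
/// advances `pos` to just past the match, and returns true.
bool matchAt(Regex!char re, string s, ref size_t pos, out string[] subStrings)
{
    auto m = matchFirst(s[pos .. $], re);
    if (m.empty)
        return false;            // no match: pos is left untouched

    // m.pre is the part of the searched slice before the match,
    // m.hit is the matched text itself.
    immutable consumed = m.pre.length + m.hit.length;
    foreach (capture; m)
        subStrings ~= capture;
    pos += consumed;
    return true;
}
```

Because the position lives in the caller, the same function can be called again on the same string to pick up where it stopped.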

---------

To be precise, it is an anonymous function.

Using this notion, we could use it in D already,
with no changes to the language -- other than
having the compiler understand them, of course.
Feb 16 2005
Norbert Nemec <Norbert Nemec-online.de> writes:
Georg Wrede wrote:
[...]
 
 To be precise, it is an anonymous function.
 
 Using this notion, we could use it in D already,
 with no changes to the language -- other than
 having the compiler understand them, of course.

So then, what would your proposed syntax look like?
Feb 16 2005
Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
I am intrigued by your suggestion, but, like Norbert, wonder what syntax 
you would propose.



P.S. You can "find out where a regular expression stopped" by adding 
(.*) to the tail of your expression, and then viewing the string 
returned by that.
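
A minimal sketch of that trick in present-day D, assuming Phobos' std.regex; the function names and the key=value pattern are made up for illustration. The second function shows the same information taken from the match object's post property instead of an extra group.

```d
import std.regex : matchFirst, regex;

/// Hypothetical example: the trailing (.*) group captures whatever
/// the rest of the pattern did not consume.
string unmatchedTail(string line)
{
    auto m = matchFirst(line, regex(`^(\w+)=(\w+)(.*)`));
    return m.empty ? null : m[3];
}

/// Same idea without the extra group: `post` is the input after the match.
string tailViaPost(string line)
{
    auto m = matchFirst(line, regex(`^(\w+)=(\w+)`));
    return m.empty ? null : m.post;
}
```

Note that (.*) stops at a line break by default, so the two variants only agree on single-line input.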
Feb 16 2005
Georg Wrede <georg.wrede nospam.org> writes:
I am reposting my answer to this. It seems to have ended
up in the message tree in a place where nobody noticed.  :-(

The answer is at the end.

Russ Lewis wrote:
 I am intrigued by your suggestion, but, like Norbert, wonder what syntax 
 you would propose.

##########################################################

Norbert Nemec wrote:
 Ben Hinkle wrote:

 "Norbert Nemec" <Norbert Nemec-online.de> wrote:

 Matthew wrote:

 All we need now is to use that reserved $ for built-in regex, and [...]

 I've been thinking along that line as well. There is a clear [...]

 [...] Python, you have to call 'compile' on regexps before you can use
 it. This already makes clear that the translation from a regexp in
 string form to a regexp in executable representation basically is a
 compilation step which costs run-time performance.

 I could see another string prefix for patterns that looks like the [...]

 [...] at compile-time instead of run-time. I bet, though, that the time
 spent in compiling a pattern at run-time is small compared to the time
 spent "running" the pattern. The downside as other people mentioned is
 that the format of the compiled pattern would have to match whatever
 the library expects.

 Depends on what you call 'compiling'. Your words remind me more of [...]

 [...] can then be used by the matching library.

 What I was thinking of would be to produce actual executable code [...]

 [...] kinds of optimizations on a given pattern.

-----------------------------------------

georg:

What if we just consider an Unquoted Regular Expression
as just another piece of program code? It is, after all,
only program code, albeit written with a syntax of its own.

Then we might decide on a fixed signature for such. This would
make it easy for all kinds of libraries to interact with
regexps without hassles. What if the unquoted regex were
treated exactly the same as

bit justAnotherRegex(in char[] s,
                     inout int pos,
                     out char[][] subStrings)
{
   //implementation in D.
}

Then we could write

ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

with no problem.

And, if we code responsibly and so Mother would be proud,
we might make it a habit to write in the following style:

enum {firstNam, lastNam}
if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
{
     victim = S[lastNam] ~ " ," ~ S[firstNam];
}

Also, if a bare regular expression is considered an
anonymous function, then it could be passed around
with Delegates, Function Pointers, and Dynamic Closures.
Feb 16 2005
"Charlie Patterson" <charliep1 excite.com> writes:
"Georg Wrede" <georg.wrede nospam.org> wrote in message 
news:42139061.5060708 nospam.org...
 Then we might decide on a fixed signature for such. This would
 make it easy for all kinds of libraries to interact with
 regexps without hassles. What if the unquoted regex were
 treated exactly the same as

 bit justAnotherRegex(in char[] s,
                      inout int pos,
                      out char[][] subStrings)
 {
   //implementation in D.
 }

 Then we could write

 ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

 with no problem.

Sounds good!  But to simplify, why not just assume that a failure to match
returns an empty set of "subStrings"?  Also I don't know what pos is
supposed to do since the code can always do the entire input s at once.
This leaves the equivalent of

   char[][] justAnotherRegex(in char[] s )

and is instantiated thusly

   theDollars = /lkajsdlkjaksdf/ (aDataLine);

This is similar to Perl except I like your function-looking call.  The Perl
way would have been

   # note approximate Perl. it's been a while
   theDollars = aDataline ~= /lkajsdkljaksdf/

I find yours clever and easier to read.
Feb 16 2005
Kris <Kris_member pathlink.com> writes:
In article <cv0480$pid$1 digitaldaemon.com>, Charlie Patterson says...
"Georg Wrede" <georg.wrede nospam.org> wrote in message 
news:42139061.5060708 nospam.org...
 Then we might decide on a fixed signature for such. This would
 make it easy for all kinds of libraries to interact with
 regexps without hassles. What if the unquoted regex were
 treated exactly the same as

 bit justAnotherRegex(in char[] s,
                      inout int pos,
                      out char[][] subStrings)
 {
   //implementation in D.
 }

 Then we could write

 ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

 with no problem.

 Sounds good!  But to simplify, why not just assume that a failure to match
 returns an empty set of "subStrings"?  Also I don't know what pos is
 supposed to do since the code can always do the entire input s at once.
 This leaves the equivalent of

    char[][] justAnotherRegex(in char[] s )

 and is instantiated thusly

    theDollars = /lkajsdlkjaksdf/ (aDataLine);

 This is similar to Perl except I like your function-looking call.  The Perl
 way would have been

    # note approximate Perl. it's been a while
    theDollars = aDataline ~= /lkajsdkljaksdf/

 I find yours clever and easier to read.

.. and encapsulated too (assuming there's a RegExp class instance to
contain the char[][], or struct of char[]'s or whatever).

- Kris
Feb 16 2005
Georg Wrede <georg.wrede nospam.org> writes:
Kris wrote:
 In article <cv0480$pid$1 digitaldaemon.com>, Charlie Patterson says...
 Sounds good!  But to simplify, why not just assume that a failure to match
 returns an empty set of "subStrings".

Sometimes you might consider it a success even if no substrings are found.

 Also I don't know what pos is
 supposed to do since the code can always do the entire input s at once.

I can imagine several scenarios where you scan, say, through an entire
file, and depending on what you find, you might want to treat the
immediately following part differently. Like reading source code in a
folding text editor, or such. Or creating an interpreter for a new
language.

 This leaves the equivalent of

    char[][] justAnotherRegex(in char[] s )

 and is instantiated thusly

    theDollars = /lkajsdlkjaksdf/ (aDataLine);

 This is similar to Perl except I like your function-looking call.  The Perl
 way would have been

    # note approximate Perl. it's been a while
    theDollars = aDataline ~= /lkajsdkljaksdf/

 I find yours clever and easier to read.

 .. and encapsulated too (assuming there's a RegExp class instance to
 contain the char[][], or struct of char[]'s or whatever).

There are actually two different situations for using a
regexp. One is scanning, the other is search-and-replace.
So, actually we need two signatures.

function bit (in char[] s,
              inout int pos,
              out char[][] subStrings) {}

function bit (in char[] stringIn,
              inout int posStringIn,
              out char[] stringOut,
              inout int posStringOut,
              out char[][] subStrings) {}

The first one takes the string to be examined, a position
in it, and a pointer to an array of strings. It returns
false if it did not find anything, or if the string
to search was null.

The array of strings is the set of "$1 ..." substrings,
that are so often used in Perl, and the like.

As I said, one probably uses this function several times
over on a given textstring, so saving the position is
essential. And we want this function to be re-entrant,
so the position can't be saved "within the function."

The second one is for when we are doing a search-and-replace.
The new parameters are the output string,
and a position in it.

Since speed is a main point in using regular expressions,
returning the output via a parameter instead of as the
return value works well. Especially since we have two
things, the output "string" and the position in it.

Normally, instead of concatenating (which is slow), one
would reserve an empty string, maybe 1.5 * the initial
guess of size, for output.

Search-and-replace are hardly ever done in-place. (even
if we all are used to thinking this -- but that is because
everyday we see text editors do it "in place" in the text!)

So this gives a natural reason to have the two signatures.
When the compiler sees the regexp, it immediately sees
whether it is an outputting (replacing) regexp or just
a searching one. At that point it decides which signature
it gives to the regexp, and goes on compiling the binary
for it.

If we (or actually Walter!) feel industrious, we might
even have two more signatures. One for the case where no
substrings are generated, and another for when we don't
even care about the end position.

function bit (in char[] s, inout int pos) {}

function bit (in char[] s) {}

But I guess these come for "free" once the first two
are written.   :-)
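
As a rough illustration of the search-and-replace case, here is how it might look in present-day D, assuming Phobos' std.regex. The function name and the pattern are hypothetical, and the library call returns the output as a new string rather than through an out parameter as proposed above.

```d
import std.regex : regex, replaceAll;

/// Hypothetical example: rewrites every "First Last" pair as "Last, First".
/// $1 and $2 in the format string refer to the captured substrings.
string lastNameFirst(string line)
{
    auto re = regex(`(\w+) (\w+)`);
    // replaceAll builds and returns a new string; `line` is not modified.
    return replaceAll(line, re, "$2, $1");
}
```

This matches the observation that search-and-replace is hardly ever done in place: the input is left as it was, and the result is built separately.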
Feb 16 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Wed, 16 Feb 2005 21:33:55 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Kris wrote:
 In article <cv0480$pid$1 digitaldaemon.com>, Charlie Patterson says...
 Sounds good!  But to simplify, why not just assume that a failure to  
 match returns an empty set of "subStrings".


 Sometimes you might consider it a success even if no substrings are found.

I prefer to ask "can this function ever return false?". I can only think
of one case: a malformed regular expression. If the regexp is 'compiled',
then the compiler can check that at compile time, and we don't need a
'false' return value. If the regexp is formed at runtime and is malformed,
then we should be throwing an exception. So either way, no false return
is required.

   function char[][] (in char[] s, out int rest);
 This leaves the equivalent of

   char[][] justAnotherRegex(in char[] s )

 and is instantiated thusly

   theDollars = /lkajsdlkjaksdf/ (aDataLine);

 This is similar to Perl except I like your function-looking call.  The  
 Perl way would have been

   # note approximate Perl. it's been a while
   theDollars = aDataline ~= /lkajsdkljaksdf/

 I find yours clever and easier to read.

 contain the char[][], or struct of char[]'s or whatever).

 There are actually two different situations for using a
 regexp. One is scanning, the other is search-and-replace.
 So, actually we need two signatures.

 function bit (in char[] s,
               inout int pos,
               out char[][] subStrings) {}

 function bit (in char[] stringIn,
               inout int posStringIn,
               out char[] stringOut,
               inout int posStringOut,
               out char[][] subStrings) {}

Can't we do search and replace with the original function? E.g.

char[] searchReplace(char[] input, char[] search, char[] replace)
{
    char[][] s;

    //this regexp is really a run-time one
    s = /regexp/(input);

    //find and replace strings
    foreach(int i, inout char[] r; s) {
        if (r == search) r = replace;
    }

    //generic function which appends all strings in an array into 1 string
    return combine(s);
}

The above function assumes the results in s are non-overlapping,
non-repeated slices of the input.

Basically I don't see why we need more than the one regexp function;
writing a generic search and replace function is trivial.
 The first one takes the string to be examined, a position
 in it, and a pointer to an array of strings. It returns
 false if it did not find anything, or if the string
 to search was null.

 The array of strings is the set of "$1 ..." substrings,
 that are so often used in Perl, and the like.

 As I said, one probably uses this function several times
 over on a given textstring, so saving the position is
 essential. And we want this function to be re-entrant,
 so the position can't be saved "within the function."

 The second one is for when we are doing a search-and-
 replace. The new parameters are the output string,
 and a position in it.

 Since speed is a main point in using regular expressions,
 returning the output via a parameter instead of as the
 return value works well. Especially since we have two
 things, the output "string" and the position in it.

If we need to store state, then IMO this is an indication we should be
using a class (or struct).

function RegExp (char[] input);

class RegExp {
    char[] input;
    int rest;
    char[][] results;

    void continue() {}
}

RegExp r;

r = /regexp/(input);  //initial use/construction
..etc..
r.continue();         //reuse
 Normally, instead of concatenating (which is slow), one
 would reserve an empty string, maybe 1.5 * the initial
 guess of size, for output.

Modifying my above search replace, to optimise the memory allocation, I
get...

char[] searchReplace(char[] input, char[] search, char[] replace)
{
    char[] newstring;
    int length = 0;
    RegExp r;

    //this regexp is really a run-time one
    r = /regexp/(input);

    //find and replace strings (does inout work here?)
    foreach(inout char[] s; r.results) {
        if (s == search) s = replace;
        length += s.length;
    }

    //generic function which copies all strings into array provided
    newstring.length = length;
    combine(r.results,newstring);

    return newstring;
}
 Search-and-replace are hardly ever done in-place. (even
 if we all are used to thinking this --

In D, given that a char[][] is an array of arrays, the regexp results can
be slices into an existing array, so it actually looks like we _are_ doing
it in place :) Even though in reality, of course, we're not.

All up I think this is a great idea, but I don't see the need for any more
than the one function/form.

Regan
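
The state-holding variant can be sketched in present-day D as a small struct, assuming Phobos' std.regex. Scanner and its members are hypothetical names modelled on the fields listed above, and the method is called next() here because continue is a reserved word in D.

```d
import std.regex : Regex, matchFirst, regex;

/// Hypothetical state-carrying scanner: the position is kept in the
/// object instead of being passed in and out on every call.
struct Scanner
{
    Regex!char re;     // the compiled pattern
    string input;      // text being scanned
    size_t rest;       // index where the next search starts
    string[] results;  // whole match plus groups of the latest match

    /// Finds the next match after `rest`; returns false when none is left.
    /// Caveat: a pattern that can match the empty string would not advance.
    bool next()
    {
        auto m = matchFirst(input[rest .. $], re);
        if (m.empty)
            return false;
        rest += m.pre.length + m.hit.length;
        results = null;
        foreach (capture; m)
            results ~= capture;   // each capture is a slice of `input`
        return true;
    }
}
```

The captured strings are slices of the original input, which is the effect described above: nothing is copied until a replacement is actually built.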
Feb 16 2005
pragma <pragma_member pathlink.com> writes:
In article <42139061.5060708 nospam.org>, Georg Wrede says...
Then we might decide on a fixed signature for such. This would
make it easy for all kinds of libraries to interact with
regexps without hassles. What if the unquoted regex were
treated exactly the same as

bit justAnotherRegex(in char[] s,
                      inout int pos,
                      out char[][] subStrings)
{
   //implementation in D.
}

Then we could write

ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

with no problem.

And, if we code responsibly and so Mother would be proud,
we might make it a habit to write in the following style:

enum {firstNam, lastNam}
if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
{
     victim = S[lastNam] ~ " ," ~ S[firstNam];
}

Also, if a bare regular expression is considered an
anonymous function, then it could be passed around
with Delegates, Function Pointers, and Dynamic Closures.

I, for one, like this idea. I can also envision the following:
 alias bit function(in char[] s,
                      inout int pos,
                      out char[][] subStrings) Regexp;

 Regexp myExpression = /lkjslkjlskjdlksjdf/;
 if (myExpression(currentLine, where, S))
 {
     victim = S[lastNam] ~ " ," ~ S[firstNam];
 }

..which looks a touch cleaner.

Another possibility is to tie the language a little closer to Phobos by way
of the RegExp class, although that throws all the talk of compiler-based
optimization out the window.

- EricAnderton at yahoo
Feb 16 2005
Georg Wrede <georg.wrede nospam.org> writes:
pragma wrote:
 In article <42139061.5060708 nospam.org>, Georg Wrede says...
 I, for one, like this idea.  I can also envision the following:
 
 
alias bit function(in char[] s,
                     inout int pos,
                     out char[][] subStrings) Regexp;

Regexp myExpression = /lkjslkjlskjdlksjdf/;
if (myExpression(currentLine, where, S))
{
    victim = S[lastNam] ~ " ," ~ S[firstNam];
}

..which looks a touch cleaner. Another possibility is to tie the language a little closer to Phobos by way of the RegExp class; although that throws all the talk of compiler-based optimization out the window.

That looks really good! This actually means that we could use regexps
now in several cool ways!

Ah, and about the RegExp class you mention, we still need a runtime
regexp compiling facility. That would naturally be in the library. It
would be used when the regexp to be parsed is only known at runtime.

Also, anybody willing could write an OO interface to the
"hard-coded-regexps" thing discussed in this thread. That too should go
to a library.
Feb 16 2005
John Reimer <brk_6502 yahoo.com> writes:
Georg Wrede wrote:
 pragma wrote:
 
 In article <42139061.5060708 nospam.org>, Georg Wrede says...
 I, for one, like this idea.  I can also envision the following:


 alias bit function(in char[] s,
                     inout int pos,
                     out char[][] subStrings) Regexp;

 Regexp myExpression = /lkjslkjlskjdlksjdf/;
 if (myExpression(currentLine, where, S))
 {
    victim = S[lastNam] ~ " ," ~ S[firstNam];
 }

..which looks a touch cleaner. Another possibility is to tie the language a little closer to Phobos by way of the RegExp class; although that throws all the talk of compiler-based optimization out the window.

That looks really good! This actually means, that we could use regexps now in several cool ways! Ah, and about the RegExp class you mention, we still need a runtime regexp compiling facility. That would naturally be in the library. That would be used when the regexp to be parsed is only known at runtime. Also, anybody willing could write an OO interface to the "hard-coded-regexps" thing discussed in this thread. That too should go to a library.

Must admit... looks good.
Feb 16 2005
Kris <Kris_member pathlink.com> writes:
In article <cv06f4$s5u$1 digitaldaemon.com>, pragma says...
In article <42139061.5060708 nospam.org>, Georg Wrede says...
Then we might decide on a fixed signature for such. This would
make it easy for all kinds of libraries to interact with
regexps without hassles. What if the unquoted regex were
treated exactly the same as

bit justAnotherRegex(in char[] s,
                      inout int pos,
                      out char[][] subStrings)
{
   //implementation in D.
}

Then we could write

ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

with no problem.

And, if we code responsibly and so Mother would be proud,
we might make it a habit to write in the following style:

enum {firstNam, lastNam}
if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
{
     victim = S[lastNam] ~ " ," ~ S[firstNam];
}

Also, if a bare regular expression is considered an
anonymous function, then it could be passed around
with Delegates, Function Pointers, and Dynamic Closures.

I, for one, like this idea. I can also envision the following:
 alias bit function(in char[] s,
                      inout int pos,
                      out char[][] subStrings) Regexp;

 Regexp myExpression = /lkjslkjlskjdlksjdf/;
 if (myExpression(currentLine, where, S))
 {
     victim = S[lastNam] ~ " ," ~ S[firstNam];
 }

..which looks a touch cleaner. Another possibility is to tie the language a little closer to Phobos by way of the RegExp class; although that throws all the talk of compiler-based optimization out the window. - EricAnderton at yahoo

So yeah, one of the points was to have the compiler convert the search
pattern at compile-time. This could be done within static constructors, but
let's suppose the compiler did it instead -- this would then require an ABI
for the compiled-regex representation, and the runtime executor would have
to abide by that. Right?

Alternatively, the compiler could compile the regexp into an object file. I
mean, that's what compilers do, right? Presumably one might leverage such
things via an Interface or a set of delegates (same thing), for match,
replace, etc. For such patterns, there would be no need for a runtime
execution model. This is, indeed, similar to what some regex implementations
do (compile to x86 code).

That's all well and good. But then what about patterns not known at compile
time? Perhaps the part of the compiler responsible for compiling the regex
might be exposed at runtime also?

Yet another alternative would be to support "compile time extension
functions" .. these would have to accept const data and emit const data. The
compiler might invoke them via some new 'static' or 'const' syntax? At that
stage, one could extend the compiler in a number of interesting ways. The
point is that the current *runtime* regex compiler could be invoked at
*compile time* instead. Not a huge gain, given static constructors, but
interesting perhaps.

Regardless; given the functionality that RegExp exposes, there really needs
to be a context maintained for each instantiation (via a class, struct, or
some other means). Again, there's that encapsulation I keep harping on about.

- Kris
Feb 16 2005
Kris <Kris_member pathlink.com> writes:
In article <cv0iap$1bss$1 digitaldaemon.com>, Kris says...
Alternatively, the compiler could compile the regexp into an object file. I
mean, that's what compilers do, right? Presumably one might leverage such things
via an Interface or a set of delegates (same thing), for match, replace, etc.
For such patterns, there would be no need for a runtime execution model. This
is, indeed, similar to what some regex implementations do (compile to x86 code).

That's all well and good. But then what about patterns not known at compile
time? Perhaps the part of the compiler responsible for compiling the regex might
be exposed at runtime also?

(poor form, replying to one's own post ...)

Thought I'd note that one *could* invoke the compiler, as a subprocess, to
support regex compilation at runtime. The output (for the compiled regex)
would need to be a shared-lib/DLL ... but if each regex were compiled to a
class anyway, that would work just fine. This would:

 - alleviate issues relating to a regex ABI
 - remove the need for a RegExp library
 - presumably take care of code bloat via char/wchar/dchar RegExp templates
   (they'd be in the compiler rather than in the runtime code)
 - arm D with the fastest damn regex functionality around (worth a lot in
   some environments)

Just a thought;
Feb 16 2005
Georg Wrede <georg.wrede nospam.org> writes:
Kris wrote:
 In article <cv0iap$1bss$1 digitaldaemon.com>, Kris says...
 
 Alternatively, the compiler could compile the regexp into an object file. I
 mean, that's what compilers do, right? Presumably one might leverage such things
 via an Interface or a set of delegates (same thing), for match, replace, etc.
 For such patterns, there would be no need for a runtime execution model. This
 is, indeed, similar to what some regex implementations do (compile to x86 code).

 That's all well and good. But then what about patterns not known at compile
 time? Perhaps the part of the compiler responsible for compiling the regex might
 be exposed at runtime also?

 (poor form, replying to one's own post ...)

 Thought I'd note that one *could* invoke the compiler, as a subprocess, to
 support regex compilation at runtime. The output (for the compiled regex)
 would need to be a shared-lib/DLL ... but if each regex were compiled to a
 class anyway, that would work just fine. This would:

 - alleviate issues relating to a regex ABI
 - remove the need for a RegExp library
 - presumably take care of code bloat via char/wchar/dchar RegExp templates
   (they'd be in the compiler rather than in the runtime code)
 - arm D with the fastest damn regex functionality around (worth a lot in some
   environments)

Yessss! :-)

 Just a thought;

((( FTR: I'd better change tone. I just read all my recent regex posts,
and some of them came across as a bit arrogant. Hope I was the only one
who thought that! :-(   But back to the issue.)))

Assuming compile time regexps were implemented as recently
discussed, then we need to think about how we could get
runtime regexps to behave smoothly, too.

Maybe, if this is ok,

if ( /aCompileTimeRegexHere/ (instr, i, S) )
{
   // do something
}

then, the question is: How does a _runtime_ regex "enter"
into our program?

Presumably it would start its life as the contents of a
string?

userregex = askUser("give me the regex");

Or any of a thousand other ways, like as found in a file.

Now, what would be really nice is to end up using this regex
the same way as the compile time regexes. Like

if ( aRuntimeCompiledRegex (instr, i, S) )
{
   // do something
}

So, we need to transform this "userregex" to this
"aRuntimeCompiledRegex" thing.

Skipping a few steps here, my suggestion is: can we
afford a new type in the language? Then we could simply
write

regex myRegex;
myRegex = RunTimeCompileRegex(regexAsString);
if ( myRegex (instr, i, S) )
{
   // do something
}

And the myRegex could then be used in Delegates, Function
Pointers, and Dynamic Closures just as well as the
compile-time ones.

Maybe we could make the compilation implicit? Like:

regex myRegex = new regex(regexAsString);

That of course assuming code on the heap can be executed
in both Windows and Linux. Gurus: help!

As to the UTF issue, can we just decide that the resulting
myRegex will work on the same kind of strings from which
it was made? Or is there some reason to make this more
complicated?

---

Somebody pointed out earlier that the regex type should
only have one signature. I'm starting to believe this.
It would surely help when passing them around and
interacting with libraries.

At the same time it would be nice to be able to use
regexps as if they had many signatures:

if ( aRegex (thestring) ) {xxxxxx}
if ( aRegex (str, pos) ) {xxxxxx}
if ( aRegex (str, outSubstArray) ) {xxxxxx}
if ( aRegex (str, pos, outSubstrArray) ) {xxxxxx}
if ( aRegex (instr, outstr) ) {xxxxxx}
if ( aRegex (instr, outstr, outSubArr) ) {xxxxxx}

... and a couple of others?

Is this an easy or hard problem to fix?

---

Somebody pointed out that we do not necessarily need
the regex to return a boolean value. I assume that is
because the outcome can be inferred from the parameter
returns.

I think, however, that returning a boolean makes the code
overall look nicer. False would mean not found, and the
like, and exceptions would be raised only for more serious
conditions. That is, if a regexp can even theoretically
need to do that.

Malformed regexps of course would be caught at instantiation.
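
One way to get both kinds of regex behind a single signature can be sketched in present-day D, assuming Phobos' std.regex, where ctRegex builds the matcher at compile time and regex builds it at run time. Matcher and makeMatcher are hypothetical names; the delegate signature follows the (string, position, substrings) shape proposed in this thread.

```d
import std.regex : ctRegex, matchFirst, regex;

/// Hypothetical single signature shared by all regexes.
alias Matcher = bool delegate(string s, ref size_t pos, out string[] subStrings);

/// Wraps any std.regex pattern object (compile-time or run-time built)
/// in a closure with the common signature.
Matcher makeMatcher(R)(R re)
{
    return (string s, ref size_t pos, out string[] subStrings)
    {
        auto m = matchFirst(s[pos .. $], re);
        if (m.empty)
            return false;
        immutable consumed = m.pre.length + m.hit.length;
        foreach (capture; m)
            subStrings ~= capture;
        pos += consumed;
        return true;
    };
}
```

A caller could then write makeMatcher(ctRegex!(`(\w+)@(\w+)`)) for a pattern known when the program is built and makeMatcher(regex(userSuppliedString)) for one that is only known at run time, and pass either result around as an ordinary delegate.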
Feb 17 2005
KennyB <funisher gmail.com> writes:
 Maybe, if this is ok,
 
 if ( /aCompileTimeRegexHere/ (instr, i, S) )
 {
    // do something
 }
 
 then, the question is: How does a _runtime_ regex "enter"
 into our program?

I like your idea of the type. A runtime regex could also be something like
this:

char[] my_regex = "/aRuntimeRegexHere/";

if ( my_regex(instr, i, S) )
{
   // do something
}

This will require a check to see if there are parentheses after an array of
char. That may be more trouble than it's worth, so perhaps adding .regex()
as a compiler interface would be a good way of doing it (similar to .length).

--------------------------

In the case of optimizing, there aren't a lot of good ways to do it at
runtime. There are two cases I can foresee (the performance enhancement
might only be negligible, though).

The main advantage of non-runtime regexes is that they are compiled ahead of
time. This could be provided to the programmer for runtime regexes too. It
might be highly impractical to compile into machine code using the static
compiler, as that would require executable memory... The other option would
be bytecode. Again, I'm not sure how much of a boost that would give, though,
if the programmer is allowed to compile into bytecode ahead of time.

myregex.compileregex;

// do stuff

if( myregex.regex(instr, i, S) )

The other possible performance increase could be caching the regex's
bytecode or asm with a hash or something (not md5 -- it's too big; something
smaller and fast, like two 32 bit hashes). When the same regex is used twice,
it doesn't need compilation again. I suppose this could be added to phobos
or something. That can also be done with your 'type' method.

It just seems like so much work for such a small performance increase.
Someone should write a test program and benchmark it. The real performance
will definitely be in statically compiled regexes.

tough subject :) I like the ideas a lot though...
Feb 19 2005
Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
How about the following syntax:

char[][] = /<putYourRegexHere>/(<stringToMatch>);

Then, we devise a preprocessor which would turn it into something like this:

char[][] = function char[][](char[] arg) { ... } (<stringToMatch>);

The advantage here being that the preprocessor can compile the regex 
into a function, and detect any regex errors at build time.

Of course, if we got this functionality working well, then we could 
suggest that Walter integrate it into the compiler, and thus get rid of 
the preprocessor utility.
Feb 19 2005
pragma <pragma_member pathlink.com> writes:
In article <cv827q$c8s$1 digitaldaemon.com>, Russ Lewis says...
How about the following syntax:

char[][] = /<putYourRegexHere>/(<stringToMatch>);

Then, we devise a preprocessor which would turn it into something like this:

char[][] = function char[][](char[] arg) { ... } (<stringToMatch>);

The advantage here being that the preprocessor can compile the regex 
into a function, and detect any regex errors at build time.

Of course, if we got this functionality working well, then we could 
suggest that Walter integrate it into the compiler, and thus get rid of 
the preprocessor utility.

There's just one hole in the syntax that I can think of: comments will not play nice with the slashes:
char[][] = /*is this a comment?/(<stringToMatch>);
char[][] = /+is this a comment?/(<stringToMatch>);

I, for one, cannot foresee a way to make this syntax compatible with the rest of the D language. Perhaps a string decorator should be used instead:
char[][] = r"<putYourRegexHere>"(<stringToMatch>);
char[][] = $"<putYourRegexHere>"(<stringToMatch>);

It's not as pretty, but using / is going to cause trouble.

- EricAnderton at yahoo
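
For what it is worth, keeping the pattern inside a string literal is how present-day Phobos' std.regex works, and it sidesteps the comment clash entirely. A minimal sketch, with a hypothetical function name, using D's backtick (WYSIWYG) string so that neither the slashes nor the backslash need escaping:

```d
import std.regex : matchFirst, regex;

/// Hypothetical example: true if the line begins (after optional
/// whitespace) with a D block-comment opener, /* or /+.
bool startsBlockComment(string line)
{
    // Inside the string literal, "/*" and "/+" are just characters
    // of the pattern; the lexer never sees them as comment openers.
    return !matchFirst(line, regex(`^\s*/[*+]`)).empty;
}
```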
Feb 19 2005
h3r3tic <foo bar.baz> writes:
pragma wrote:
 There's just one hole in the syntax that I can think of: comments will not play
 nice with the slashes:
 
 
char[][] = /*is this a comment?/(<stringToMatch>);
char[][] = /+is this a comment?/(<stringToMatch>);


Well, I'd say it's a comment since it doesn't make any sense as a regexp :)
* or + will always follow some other part of a regular expression. Here they
don't. It's a comment.

Tom
Feb 19 2005
"Craig Black" <cblack ara.com> writes:
If D had better support for RegEx's this would expedite a self-hosting 
compiler, no?  And that's the real test of a programming language.  You know 
you're getting somewhere when you are self-hosting.

-Craig 
Feb 16 2005