
digitalmars.D - Questions about builtin RegExp

reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
1) Will the builtin RegExp increase the minimal size of a D executable?
I mean an executable that does not use regexps at all.

2) Is it possible to override operator ~~ ?

3) What is the main purpose of incorporating
interpreted regexps into a natively compiled language?

4) When is a regexp checked for syntax correctness -
at compile time or at runtime?  "..." ~~ "..."
If ~~ is part of the language syntax then one can assume that the expression
is getting compiled somehow.

Andrew. 
Feb 16 2006
next sibling parent reply Oskar Linde <olREM OVEnada.kth.se> writes:
Andrew Fedoniouk wrote:

 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No. This was as far as I understood one of the considerations.
 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.
 4) When happens check of regexp for syntax correctness -
 at compile time or at runtime?  "..." ~~ "..."
 If ~~ is a part of language syntax then one can assume that expression
 is getting compiled somehow.

At runtime, for now at least. In the future a regexp could possibly be compiled at compile time, but there will always be a need to support run-time regexps anyway.

/Oskar
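The compile-time versus run-time distinction can be illustrated with present-day D. This is a minimal sketch, not from the thread, assuming the Phobos `std.regex` module rather than the `~~` syntax discussed here; `isDigits` and `matchesPattern` are hypothetical helper names.

```d
import std.regex : ctRegex, regex, matchFirst;

// Pattern fixed in the source: ctRegex builds the matcher during
// compilation, so a malformed pattern fails the build.
bool isDigits(string s)
{
    auto digits = ctRegex!(`^\d+$`);
    return !matchFirst(s, digits).empty;
}

// Pattern supplied at run time: regex() parses it during execution
// and throws an exception if the pattern is malformed.
bool matchesPattern(string s, string pattern)
{
    return !matchFirst(s, regex(pattern)).empty;
}

void main()
{
    assert(isDigits("12345"));
    assert(!isDigits("12a45"));
    assert(matchesPattern("hello", `^[a-z]+$`));
}
```

Both forms coexist, which matches the point above that run-time patterns remain necessary even when literals can be handled at compile time.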
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Oskar Linde" <olREM OVEnada.kth.se> wrote in message 
news:dt40sg$29nc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:

 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No. This was as far as I understood one of the considerations.
 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().

And what exactly does this opNext() do? Does it return the next sub-expression, or the next match starting from the last matched position (as with /g)?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

To make regexps more accessible I guess. Makes D seem like a alternative to scripting languages.

??? An alternative to a scripting language can be another scripting language. An alternative to a natively compiled language can be another natively compiled language.
 4) When happens check of regexp for syntax correctness -
 at compile time or at runtime?  "..." ~~ "..."
 If ~~ is a part of language syntax then one can assume that expression
 is getting compiled somehow.

At runtime. For now atleast. In the future it could possibly be compiled at compile time, but there will still always be a need to support run-time regexps anyway.

Having "builtin" regexps without strings in the language seems unnatural.

Andrew.
Feb 17 2006
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk  
<news terrainformatica.com> wrote:
 "Oskar Linde" <olREM OVEnada.kth.se> wrote in message
 news:dt40sg$29nc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:

 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No. This was as far as I understood one of the considerations.
 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().

And what is this opNext() doing exactly? next sub-expression, next match from last position matched (/g) ?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

To make regexps more accessible I guess. Makes D seem like a alternative to scripting languages.

??? alternative to some scripting language can be another scripting language. alternative to some natively compileable language can be another natively compileable language.

I think you're thinking inside the box. :) With the recent additions, is it not possible to write scripts in D?

Regan
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message 
news:ops45qq5rn23k2f5 nrage.netwin.co.nz...
 On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk 
 <news terrainformatica.com> wrote:
 "Oskar Linde" <olREM OVEnada.kth.se> wrote in message
 news:dt40sg$29nc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:

 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No. This was as far as I understood one of the considerations.
 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().

And what is this opNext() doing exactly? next sub-expression, next match from last position matched (/g) ?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

To make regexps more accessible I guess. Makes D seem like a alternative to scripting languages.

??? alternative to some scripting language can be another scripting language. alternative to some natively compileable language can be another natively compileable language.

I think you're thinking inside the box. :) With the recent additions is it not possible to write scripts in D?

I believe there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecode languages, and compiled native languages. These three groups have their own niches. D as a compiled language will never reach the flexibility of e.g. prototype-based JavaScript or Ruby. There are just different definitions of flexibility for these groups - different and sometimes even orthogonal tasks.

Andrew.
Feb 18 2006
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sat, 18 Feb 2006 00:36:23 -0800, Andrew Fedoniouk  
<news terrainformatica.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:ops45qq5rn23k2f5 nrage.netwin.co.nz...
 On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk
 <news terrainformatica.com> wrote:
 "Oskar Linde" <olREM OVEnada.kth.se> wrote in message
 news:dt40sg$29nc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:

 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No. This was as far as I understood one of the considerations.
 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().

And what is this opNext() doing exactly? next sub-expression, next match from last position matched (/g) ?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

To make regexps more accessible I guess. Makes D seem like a alternative to scripting languages.

??? alternative to some scripting language can be another scripting language. alternative to some natively compileable language can be another natively compileable language.

I think you're thinking inside the box. :) With the recent additions is it not possible to write scripts in D?

I beleive there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecoded and compiled native. These three groups has their own niches. D as a compiled language will never reach flexibility of e.g. prototype based JavaScript or Ruby. There are just different definitions of flexibility for these groups - different and sometimes even orthogonal tasks .

I think there is some overlap, i.e. some scripting tasks do not require the flexibility you mention; instead the important factor may be one or more of:
- how fast can I code the solution
- how easily can I code the solution
- how easily can I maintain the solution
- how likely is my solution to contain bugs
- how easy will it be to find those bugs

Assuming you're a D programmer and assuming the D std lib contains the tools to achieve your task, why not use D?

Regan
Feb 18 2006
parent "Andrew Fedoniouk" <news terrainformatica.com> writes:
 I beleive there is a sort of misunderstanding about what scripting is and
 why there are scripting (typeless) languages, compiled bytecoded and
 compiled native.
 These three groups has their own niches. D as a compiled language will 
 never
 reach
 flexibility of e.g. prototype based JavaScript or Ruby. There are just
 different definitions of flexibility
 for these groups - different and sometimes even orthogonal tasks .

I think there is some overlap, i.e. some scripting tasks do not require the flexibilty you mention, instead the important factor may be one or more of: - how fast can I code the solution - how easily can I code the solution - how easily can I maintain the solution - how likely is my solution to contain bugs - how easy will it be to find those bugs Assuming you're a D programmer and assuming the D std lib contains the tools to achieve your task, why not use D?

1) Scripting languages are usually used embedded in some other environment. This use case is quite different from D's execution model: a different life cycle.

2) Scripting languages are safe. It takes tremendous effort to cause a GPF in a scripting environment. In D, causing a GPF is a piece of cake: not because of bugs in the language or libs, but because you can, for example, dereference a null object pointer.

3) Scripting languages provide a very high-level and convenient set of ready-to-use, task-oriented classes/objects.

Example: for building D projects you would rather use make or build scripts than D itself, right? Even if you had something like std.build, I bet you would use some scripting tool for your builds.

What I want to say: writing a fast scripting engine in D is possible, and this is what D is best for (among other things). But to write something D-ish in scripting.... In short, these are completely different areas of use.
Feb 18 2006
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt6ma6$1jt0$1 digitaldaemon.com...
 I beleive there is a sort of misunderstanding about what scripting is and
 why there are scripting (typeless) languages, compiled bytecoded and 
 compiled native.
 These three groups has their own niches. D as a compiled language will 
 never reach
 flexibility of e.g. prototype based JavaScript or Ruby. There are just 
 different definitions of flexibility
 for these groups - different and sometimes even orthogonal tasks .

I agree. But I don't believe that there's anything special about scripting that makes it especially suited for regex, but regex is a large reason people use scripting languages.
Feb 18 2006
next sibling parent reply kris <fu bar.org> writes:
Walter Bright wrote:
[snip]
 regex is a large reason 
 people use scripting languages. 

Really? Do you have some kind of data to back that assertion?
Feb 18 2006
parent "Walter Bright" <newshound digitalmars.com> writes:
"kris" <fu bar.org> wrote in message news:dt7m3l$2hc5$1 digitaldaemon.com...
 Walter Bright wrote:
 [snip]
 regex is a large reason people use scripting languages.


Peer reviewed statistical research studies? Nope. But it's a pretty good impression one gets by reading the examples in manuals for scripting languages, listening to what people say about those languages, and looking at a sampling of actual scripts. Here's a quote from "Programming Perl"'s preface by Larry Wall: "Perl is no longer just for text processing." That means, to me, that Perl was DESIGNED to be a text processing language. Why would the backbone of that, regex, not be why a large number of people use Perl? Perl stands for "Practical Extraction and Report Language", i.e. text manipulation. Larry goes out of his way to say that Perl is a superset of sed and awk, which are regex string manipulation scripting languages.
Feb 18 2006
prev sibling parent Lucas Goss <lgoss007 gmail.com> writes:
Walter Bright wrote:
 ... but regex is a large reason people use scripting languages.

I've never used scripting languages for that purpose. The only reason I've used scripting languages is because they are often times easier, quicker, and have a huge library to write portable code. D almost matches them in being as easy and as quick, but lacks the huge standard library.
Feb 18 2006
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt3v1o$27nk$1 digitaldaemon.com...
 1) Will builtin RegExp increase minimal size of D executable?
 I mean if this executable is not using regexp at all.

No.
 2) Is it possible to override operator ~~ ?

Overload, yes. With opMatch().
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

Make them easier to use.
 4) When happens check of regexp for syntax correctness -
 at compile time or at runtime?  "..." ~~ "..."

Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.
 If ~~ is a part of language syntax then one can assume that expression
 is getting compiled somehow.

Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Thanks, Walter,

 2) Is it possible to override operator ~~ ?

Overload, yes. With opMatch().

Next questions then:
[char string literal] ~~ [char string literal]

1) For what object do I need to override opMatch to get it invoked in the line above?

2) For some types of RE-like expressions there is no need to create an instance of RegExp, e.g. the test
"*.ext" ~~ file_name
can be implemented many times faster than standard RE creation/invocation.

3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement:

"Both operands must be implicitly convertible to char[]."

What to do in this case?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

Make them easier to use.

Easier? What is wrong with the standard way:

regexp re = new regexp(".....");
re.test(...);

And easier does not mean more effective.

while( true )
{
    if( "mask" ~~ file_name )
       ....
}

As far as I understand you will generate:

while( true )
{
     regexp re = new regexp("mask");
     re.test(file_name);
       ....
}
 4) When happens check of regexp for syntax correctness -
 at compile time or at runtime?  "..." ~~ "..."

Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.

If it does not compile this regexp at compile time then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.

So the real meaning of the
 arg1 ~~ arg2
notation is just a shortcut for
 arg1.test(arg2)

In general shortcuts are good, but in this particular case it has a hidden side effect: the creation of a new RegExp object on each test invocation.

Andrew.
Feb 17 2006
next sibling parent reply Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:
Andrew Fedoniouk wrote:
 Thanks, Walter,
 
 
2) Is it possible to override operator ~~ ?

Overload, yes. With opMatch().

Next questions then: [char string literal] ~~ [char string literal] 1) For what object I need to override opMatch to be able to get it invoked in the line above? 2) For some types of RE (alike) expressions there is no need to create instance of RegExp, e.g. test "*.ext" ~~ file_name can be implemented times faster than standard RE creation/invocation. 3) Some objects has no string representation of match operation. For example CSS selector as an object has match operation with DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?

Instead of an answer, a quick example of what I tried and what works:

<CODE>
import std.stdio;

class ArrayBeginsWith
{
    static ArrayBeginsWith opCall(int a)
    {
        check = a;
        return instance;
    }

    static ArrayBeginsWith instance;
    static int check;

    static this()
    {
        instance = new ArrayBeginsWith;
    }

    static bool opMatch(int[] nums)
    {
        if(nums.length < 1)return false;
        if(nums[0] == check)
            return true;
        else
            return false;
    }
}

static bool opMatch(int[] nums)
{
    if(nums.length < 2)return false;
    if(nums[0] == 0 && nums[1] == 1)
        return true;
    else
        return false;
}

void main()
{
    static int[] somearray1 = [0,1,2];
    static int[] somearray2 = [2,1,2];

    writefln(ArrayBeginsWith(0) ~~ somearray1);
    writefln(ArrayBeginsWith(0) ~~ somearray2);
    writefln(ArrayBeginsWith(2) ~~ somearray1);
    writefln(ArrayBeginsWith(2) ~~ somearray2);
}
</CODE>
 
 
3) What is the main purpose of incorporating
interprettable regexps in natively compileable language?

Make them easier to use.

Easier? What is wrong with standard way: regexp re = new regexp("....."); re.test(...);

Nothing is wrong with this, but ~~ is easier :)
 And easier is not mean more effective.
 
 while( true )
 {
     if( "mask" ~~ file_name )
        ....
 }
 
 As far as I understand you will generate:
 
 while( true )
 {
      regexp re = new regexp("mask");
      re.test(file_name);
        ....
 }
 

I don't think this is too hard to optimize away. The compiler can even generate a global RegExp instance for each regular expression literal and use it many times.
 
 
4) When happens check of regexp for syntax correctness -
at compile time or at runtime?  "..." ~~ "..."

Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.

If it does not compile this regexp at compile time than this is just a fake and not a a solution at all for the language of D level. Even Perl compiles its regular expresions in compile time. So the real meaning of arg1 ~~ arg2 notation is just a shortcut of arg1.test(arg2) In general shortcuts are good but in this particular case it has hidden side effects in creation of new RegExp object on each test invocation.

This generation of new RegExp doesn't have to be true. But ~~ provides us with a feature of testing arbitrary types for arbitrary things.
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Thanks, Ivan, see below:


"Ivan Senji" <ivan.senji_REMOVE_ _THIS__gmail.com> wrote in message 
news:dt5b54$h1q$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Thanks, Walter,


2) Is it possible to override operator ~~ ?

Overload, yes. With opMatch().

Next questions then: [char string literal] ~~ [char string literal] 1) For what object I need to override opMatch to be able to get it invoked in the line above? 2) For some types of RE (alike) expressions there is no need to create instance of RegExp, e.g. test "*.ext" ~~ file_name can be implemented times faster than standard RE creation/invocation. 3) Some objects has no string representation of match operation. For example CSS selector as an object has match operation with DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?

Instead of an answer a quick example of what I tried and what works: <CODE> import std.stdio; class ArrayBeginsWith { static ArrayBeginsWith opCall(int a) { check = a; return instance; } static ArrayBeginsWith instance; static int check; static this() { instance = new ArrayBeginsWith; } static bool opMatch(int[] nums) { if(nums.length < 1)return false; if(nums[0] == check) return true; else return false; } } static bool opMatch(int[] nums) { if(nums.length < 2)return false; if(nums[0] == 0 && nums[1] == 1) return true; else return false; } void main() { static int[] somearray1 = [0,1,2]; static int[] somearray2 = [2,1,2]; writefln(ArrayBeginsWith(0) ~~ somearray1); writefln(ArrayBeginsWith(0) ~~ somearray2); writefln(ArrayBeginsWith(2) ~~ somearray1); writefln(ArrayBeginsWith(2) ~~ somearray2); } </CODE>

bool startsWith( int[] arr, int v )
{
    if(arr.length < 1) return false;
    return arr[0] == v;
}

and its usage:

static int[] somearray2 = [2,1,2];

if( somearray2.startsWith( 0 ) ) ...

will be more a) compact b) human readable c) maintainable d) natural

the same applies to

bool match( const char[] str, RegExp re ) { ... }

if( mystr.match(someRe) ) ....

------------------------------------
I would go for a normal implementation of outer methods instead of this :p~~.
3) What is the main purpose of incorporating
interprettable regexps in natively compileable language?

Make them easier to use.

Easier? What is wrong with standard way: regexp re = new regexp("....."); re.test(...);

Nothing is wrong with this, but ~~ is easier :)
 And easier is not mean more effective.

 while( true )
 {
     if( "mask" ~~ file_name )
        ....
 }

 As far as I understand you will generate:

 while( true )
 {
      regexp re = new regexp("mask");
      re.test(file_name);
        ....
 }

I don't think this is to hard to optimize away. Compiler can even generate global RegExp instance for each regular expression literal and use it many times.
4) When happens check of regexp for syntax correctness -
at compile time or at runtime?  "..." ~~ "..."

Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.

If it does not compile this regexp at compile time than this is just a fake and not a a solution at all for the language of D level. Even Perl compiles its regular expresions in compile time. So the real meaning of arg1 ~~ arg2 notation is just a shortcut of arg1.test(arg2) In general shortcuts are good but in this particular case it has hidden side effects in creation of new RegExp object on each test invocation.

This generation of new RegExp doesn't have to be true. But ~~ provides us with a feature of testing arbitrary types for arbitrary things.

As I said, having a defined function named 'match' with clearly defined parameters is way better than making the syntax of the language look like an Xmas tree, with all possible smiley notations (http://www.helpbytes.co.uk/smileys.php)

Andrew.
Feb 17 2006
next sibling parent reply Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:
Andrew Fedoniouk wrote:
 Thanks, Ivan, see below:

...
 
 bool startsWith( int[] arr, int v )
 {
     if(arr.length < 1) return false;
     return arr[0] == v;
 }
 
 and its usage:
 
 static int[] somearray2 = [2,1,2];
 
 if( somearray2.startsWith( 0 ) ) ...
 
 will be more a) compact b) human readable c) maintainable d) natural
 

Naturally, but this was just a see-if-it-can-be-done example. :)
 As I said having defined function with name 'match' and clearly defined 
 parameters
 is way better than to make syntax of the language look like an Xmas Tree -

Well, I don't see it like that. I see it as an abstracted concept of "matching", which can be interpreted as an elementary operation. Plus, we can overload ~~ to mean matching of any kind we want that makes sense.
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 static int[] somearray2 = [2,1,2];

 if( somearray2.startsWith( 0 ) ) ...

 will be more a) compact b) human readable c) maintainable d) natural

Naturally, but this was just a see-if-it-can-be-done example. :)

:D or better :~~D
 As I said having defined function with name 'match' and clearly defined 
 parameters
 is way better than to make syntax of the language look like an Xmas 
 Tree -

Well i don't see it like that, I see it as a abstracted concept of "matching", and that can be interpreted as an elementary operation. Plus we can overload ~~ to mean matching of any kind we want that makes sense.

1) According to http://www.digitalmars.com/d/expression.html#MatchExpression
"Both operands must be implicitly convertible to char[]."
so your "matching of any kind we want" is not strictly true.

2) ~~ has side effects. Moreover, it is implemented as a stateful comparison, so consecutive ~~'s on the same arguments will yield different results.

3)
while(true)
{
    bool r = "a" ~~ r"\w";
}

must allocate a new RegExp.
Feb 17 2006
parent reply Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:
Andrew Fedoniouk wrote:
static int[] somearray2 = [2,1,2];

if( somearray2.startsWith( 0 ) ) ...

will be more a) compact b) human readable c) maintainable d) natural

Naturally, but this was just a see-if-it-can-be-done example. :)

:D or better :~~D

That's a good smiley.
 
As I said having defined function with name 'match' and clearly defined 
parameters
is way better than to make syntax of the language look like an Xmas 
Tree -

Well i don't see it like that, I see it as a abstracted concept of "matching", and that can be interpreted as an elementary operation. Plus we can overload ~~ to mean matching of any kind we want that makes sense.

:) 1) According to http://www.digitalmars.com/d/expression.html#MatchExpression "Both operands must be implicitly convertible to char[]. " so yours "matching of any kind we want " is not strictly true.

Well it wouldn't be the first time that the documentation is wrong/incomplete. Both types *do* have to be implicitly convertible to char[] unless you use a match expression with your own type with defined opMatch operator.
 
 2) ~~ has sidefects. Moreover it is implemented as statefull comparison so
 consequent ~~'s on the same arguments will yeld to different results.
 

char[] ~~ char[] is implemented that way, but a user's Foo ~~ Bar[] doesn't have to behave that way (though it can, if it makes sense that there are more matches).
 3)
 while(true)
 {
     bool r = "a" ~~ r"\w";
 }
 
 must allocate new RegExp.

Why? Why couldn't a compiler optimize this away into something like:

RegExp __regexp0001;

static this()
{
    __regexp0001 = new RegExp("a");
}

and then later, whenever the literal "a" is used as a regex:

while(true)
{
    bool r = __regexp0001 ~~ r"\w";
}

So it is true that a new RegExp is allocated, but it needs to be done only once.
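The same hoisting can be done by hand in present-day D. This is a minimal sketch, not from the thread, assuming Phobos `std.regex` in place of the `RegExp` class and `~~` operator discussed here; `wordChar` and `countHits` are hypothetical names.

```d
import std.regex : Regex, regex, matchFirst;

// One instance per pattern, mirroring the generated global suggested above.
Regex!char wordChar;

static this()
{
    // Compiled once at start-up (module constructors run per thread in D2).
    wordChar = regex(`\w`);
}

// Counts the inputs that contain at least one word character.
size_t countHits(string[] inputs)
{
    size_t hits = 0;
    foreach (s; inputs)
    {
        // The loop reuses the precompiled pattern; nothing is recompiled here.
        if (!matchFirst(s, wordChar).empty)
            ++hits;
    }
    return hits;
}

void main()
{
    assert(countHits(["a", "!", "b2"]) == 2);
}
```

Whether a compiler performs this transformation automatically is a separate question; the sketch only shows that the per-iteration construction is avoidable.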
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 3)
 while(true)
 {
     bool r = "a" ~~ r"\w";
 }

 must allocate new RegExp.

Why? Why couldn't a compiler optimize this away into something like: RegExp __regexp0001; static this() { __regexp0001 = new RegExp("a"); } and then later whenever literal "a" is used as regex: while(true) { bool r = __regexp0001 ~~ r"\w"; } So it is true that a new RegExp is allocated but it needs only to be done once.

And what is this opNext for then?

And more: traditionally there are two "test" operations in RegExps, 'match' and 'test', as far as I remember. match returns the matched substring and test returns a boolean.

There is also the /g flag, which allows scanning the whole string (Perl):

$i = 0;
while ($string =~ m/regex/g) {
  print "Gotcha #" . $i. "!\n";
}

So what exactly does this ~~ do?

Andrew.
Feb 17 2006
parent "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt5ton$10qu$1 digitaldaemon.com...
 There is also /g flag which allow to scan the whole string  (Perl)
 $i = 0;
 while ($string =~ m/regex/g) {
  print "Gotcha #" . $i. "!\n";
 }
 So what exactly this ~~ does?
 Andrew.

m/regex/g => RegExp("regex", "g")
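For comparison, iterating over every match (the job of Perl's /g loop) can be sketched in present-day D. This assumes Phobos `std.regex` and its `matchAll` function, not the `RegExp("regex", "g")` form from this thread; `allWords` is a hypothetical helper name.

```d
import std.regex : regex, matchAll;

// Collects every run of word characters in the text, scanning left to right.
string[] allWords(string text)
{
    auto word = regex(`\w+`);
    string[] hits;
    // matchAll yields one Captures object per successive match.
    foreach (m; matchAll(text, word))
        hits ~= m.hit;
    return hits;
}

void main()
{
    assert(allWords("one two three") == ["one", "two", "three"]);
}
```

Here the iteration state lives in the range returned by `matchAll` rather than in the pattern object, so repeating the call on the same arguments gives the same result.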
Feb 17 2006
prev sibling parent "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt5eo9$kgu$1 digitaldaemon.com...
 will be more a) compact b) human readable c) maintainable d) natural

For startsWith(), sure. But if that was all regex was used for, nobody would have ever invented them. Regexes can search for arbitrarily complex patterns, and are used that way. Writing a library of custom functions for each is out of the question. What you're also missing in the examples is using the match result, not just testing for the match.
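The difference between testing for a match and using its result can be sketched in present-day D. This assumes Phobos `std.regex` rather than the implicit match variable of the `~~` proposal; `extensionOf` is a hypothetical helper name.

```d
import std.regex : regex, matchFirst;

// Returns the extension of "name.ext", or null when the input does not fit.
string extensionOf(string fileName)
{
    // Capture groups: 1 = base name, 2 = extension.
    auto pattern = regex(`^(\w+)\.(\w+)$`);
    auto m = matchFirst(fileName, pattern);
    if (m.empty)
        return null;      // the plain yes/no test stops here
    return m[2];          // the match result carries the captured pieces
}

void main()
{
    assert(extensionOf("report.txt") == "txt");
    assert(extensionOf("no extension") is null);
}
```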
Feb 17 2006
prev sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt591g$erk$1 digitaldaemon.com...
 Next questions then:
 [char string literal] ~~ [char string literal]

 1) For what object I need to override opMatch to be able
 to get it invoked in the line above?

None. Operator overloading requires one object be a class or a struct. But you could do:

    RegExp("string") ~~ "string"

and overload opMatch for RegExp.
 2) For some types of RE (alike) expressions there is no need
 to create instance of RegExp, e.g. test
 "*.ext" ~~ file_name
 can be implemented times faster than standard RE creation/invocation.

Sure. Create your own MyReg object, and use it like:

    MyReg("*.ext") ~~ filename
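The rule that one operand must be a user-defined type still holds in present-day D. The sketch below is not from the thread: as far as I know the `~~` operator and `opMatch` are not available in current compilers, so it overloads the `in` operator purely for illustration, and `Suffix` is a hypothetical stand-in for MyReg.

```d
// Hypothetical wrapper type standing in for MyReg("*.ext").
struct Suffix
{
    string ext;

    // Enables `name in Suffix(".ext")`; the right operand is the user type,
    // which is what makes the overload legal.
    bool opBinaryRight(string op)(string name) const
        if (op == "in")
    {
        return name.length >= ext.length
            && name[$ - ext.length .. $] == ext;
    }
}

void main()
{
    assert("photo.ext" in Suffix(".ext"));
    assert(!("photo.txt" in Suffix(".ext")));
}
```

No regular expression is constructed here, which is the kind of cheap special case the question was asking about.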
 3) Some objects has no string representation of match operation.
 For example CSS selector as an object has match operation with
 DOM element as an argument. But you have a requirement:

 "Both operands must be implicitly convertible to char[]."

 What to do in this case?

Operator overloading happens before implicit conversions.
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

Make them easier to use.

Easier? What is wrong with standard way: regexp re = new regexp("....."); re.test(...);

For whatever reason, people find that confusing and impractical.
 And easier is not mean more effective.

True. I didn't say it was more effective.
 If it does not compile this regexp at compile time than this is just a 
 fake and not a
 a solution at all for the language of D level.
 Even Perl compiles its regular expresions in compile time.

It isn't worth trying to do them at compile time if the feature itself doesn't catch on.
 So the real meaning of
  arg1 ~~ arg2
 notation is just a shortcut of
  arg1.test(arg2)

It's more than that, because of the implicit declaration of the match results.
 In general shortcuts are good but in this particular case
 it has hidden side effects in creation of new RegExp object on each test 
 invocation.

Yes, but why is that a bad thing?
Feb 17 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:dt6da8$1ci6$1 digitaldaemon.com...
 "Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
 news:dt591g$erk$1 digitaldaemon.com...
 Next questions then:
 [char string literal] ~~ [char string literal]

 1) For what object I need to override opMatch to be able
 to get it invoked in the line above?

None. Operator overloading requires one object be a class or a struct. But you could do: RegExp("string") ~~ "string" and overload opMatch for RegExp.

And this

RegExp("string") ~~ "string"

is more honest, isn't it?

Or as in Harmonia:

string s = ....
bool r = s.like("str*");
 2) For some types of RE (alike) expressions there is no need
 to create instance of RegExp, e.g. test
 "*.ext" ~~ file_name
 can be implemented times faster than standard RE creation/invocation.

Sure. Create your own MyReg object, and use it like: MyReg("*.ext") ~~ filename

But I want my own function for char[] ~~ char[] ! Simple pattern match does not require compilation phase or even memory allocation...
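A simple pattern match of this kind can be written as a free function and called with method syntax in present-day D. This is a hypothetical sketch supporting only the `*` wildcard; it is not Harmonia's actual `like` implementation, and it performs no allocation and no compilation step.

```d
// Hypothetical `like`: `*` matches any run of characters, everything else
// must match literally.
bool like(const(char)[] text, const(char)[] pattern) pure nothrow @nogc @safe
{
    size_t t = 0, p = 0;
    size_t starP = size_t.max, starT = 0;
    while (t < text.length)
    {
        if (p < pattern.length && pattern[p] == '*')
        {
            starP = p++;   // remember the star; first try matching nothing
            starT = t;
        }
        else if (p < pattern.length && pattern[p] == text[t])
        {
            ++p;
            ++t;
        }
        else if (starP != size_t.max)
        {
            p = starP + 1; // backtrack: the last star absorbs one more char
            t = ++starT;
        }
        else
            return false;
    }
    // Only trailing stars may remain unconsumed in the pattern.
    while (p < pattern.length && pattern[p] == '*')
        ++p;
    return p == pattern.length;
}

void main()
{
    assert("report.ext".like("*.ext"));
    assert("string".like("str*"));
    assert(!"report.txt".like("*.ext"));
}
```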
 3) Some objects has no string representation of match operation.
 For example CSS selector as an object has match operation with
 DOM element as an argument. But you have a requirement:

 "Both operands must be implicitly convertible to char[]."

 What to do in this case?

Operator overloading happens before implicit conversions.

I don't understand why not allow this: bool opMatch(char[] a, char[] b) ?
 3) What is the main purpose of incorporating
 interprettable regexps in natively compileable language?

Make them easier to use.

Easier? What is wrong with standard way: regexp re = new regexp("....."); re.test(...);

For whatever reason, people find that confusing and impractical.

uh, people.... I see.
 And easier is not mean more effective.

True. I didn't say it was more effective.
 If it does not compile this regexp at compile time then this is just a
 fake and not a solution at all for a language of D's level.
 Even Perl compiles its regular expressions at compile time.

It isn't worth trying to do them at compile time if the feature itself doesn't catch on.
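For readers on a later D release, the two approaches being debated can be put side by side. This sketch assumes the std.regex module of later Phobos versions, not the builtin ~~ discussed in this thread; the function names are made up for illustration.

```d
import std.regex : ctRegex, regex, matchFirst;
import std.stdio : writeln;

/// Matcher generated while the program is compiled;
/// a malformed pattern here is a build error.
bool hasExtCompiled(string name)
{
    auto re = ctRegex!(`^[a-z]+\.ext$`);
    return !matchFirst(name, re).empty;
}

/// Same pattern, parsed when this function runs;
/// a malformed pattern here only fails at run time.
bool hasExtRuntime(string name)
{
    auto re = regex(`^[a-z]+\.ext$`);
    return !matchFirst(name, re).empty;
}

void main()
{
    foreach (name; ["report.ext", "report.txt"])
        writeln(name, ": ", hasExtCompiled(name), " ", hasExtRuntime(name));
}
```

Both functions should agree on every input; the difference is when the pattern is checked and when the matcher is built.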
 So the real meaning of
  arg1 ~~ arg2
 notation is just a shortcut of
  arg1.test(arg2)

It's more than that, because of the implicit declaration of the match results.
 In general shortcuts are good but in this particular case
 it has hidden side effects in creation of new RegExp object on each test 
 invocation.

Yes, but why is that a bad thing?

You need to explain very well what is going on under the hood of
this ~~ - it is a stateful operator (if it is /g).

<ot>
I am using a stream tokenizer in Harmonia instead of this /g.
(class TokenizerT(CHAR) // harmonia/string.d)
A simple like(pattern) method is enough in 90% of cases.

Perl is a completely different story - it is built around RegExp.
And it is typeless.
</ot>

BTW: Have you seen Nemerle and its way of meta-programming?
http://nemerle.org/

Andrew.
Feb 17 2006
next sibling parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt6gbc$1eig$1 digitaldaemon.com...
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt6da8$1ci6$1 digitaldaemon.com...
 None. Operator overloading requires one object be a class or a struct. 
 But you could do:

    RegExp("string") ~~ "string"

 and overload opMatch for RegExp.

And this
    RegExp("string") ~~ "string"
is more honest, isn't it?

Or as in Harmonia:

    string s = ....
    bool r = s.like("str*");

That doesn't give the match results, though.
 Sure. Create your own MyReg object, and use it like:
    MyReg("*.ext") ~~ filename


Consider overloading the '+' in '1+2'? To overload operators, one of the operands must be a user defined type.
 I don't understand why not allow this:
 bool opMatch(char[] a, char[] b) ?

For the same reason opAdd(int a, int b) is not allowed. Such a function would apply globally, all the library code will break, etc.
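The rule can be sketched in later D syntax. Assumptions: the ~~ operator is not used here, so the `in` operator stands in for it; the wrapper type Glob is hypothetical; std.path.globMatch is the Phobos function doing the actual matching.

```d
import std.path : globMatch;
import std.stdio : writeln;

/// Hypothetical wrapper type; the name Glob is made up for this sketch.
struct Glob
{
    string pattern;

    /// Called for `name in Glob(...)`. This overload is allowed because
    /// the right-hand operand is a user-defined type.
    bool opBinaryRight(string op : "in")(string name) const
    {
        return globMatch(name, pattern);
    }
}

void main()
{
    writeln("photo.ext" in Glob("*.ext")); // true
    writeln("photo.txt" in Glob("*.ext")); // false
    // By contrast there is no way to write an overload that changes what
    // an operator means between two plain strings, or what 1 + 2 means:
    // neither operand is a user-defined type.
}
```

Because the overload lives inside Glob, it only affects expressions that mention Glob, and library code using plain strings keeps its meaning.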
 BTW: Have you seen Nemerle and its way of meta-programming?
 http://nemerle.org/

I don't know anything about it. I'll take a look at the link.
Feb 18 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:dt6nug$1lhe$1 digitaldaemon.com...
 Or as in Harmonia:

 string s = ....
 bool r = s.like("str*");

That doesn't give the match results, though.

Who cares in most cases? User input validation tasks or simple filename matching ... When you need match results you will use regexp or something more effective like tokenizers.
 Sure. Create your own MyReg object, and use it like:
    MyReg("*.ext") ~~ filename


Consider overloading the '+' in '1+2'? To overload operators, one of the operands must be a user defined type.
 I don't understand why not allow this:
 bool opMatch(char[] a, char[] b) ?

For the same reason opAdd(int a, int b) is not allowed. Such a function would apply globally, all the library code will break, etc.
 BTW: Have you seen Nemerle and its way of meta-programming?
 http://nemerle.org/

I don't know anything about it. I'll take a look at the link.

Take a look. A bit ugly for my taste, but some ideas of Nemerle macros can be reused. They allow you to add your own problem-specific notation and syntax to the language.
Feb 18 2006
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt7qm4$2kn0$1 digitaldaemon.com...
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt6nug$1lhe$1 digitaldaemon.com...
 string s = ....
 bool r = s.like("str*");


Who cares in most of cases?

In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.
 When you need match results you will use regexp
 or something more effective like tokenizers.

Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.
Feb 18 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:dt80n7$2qiu$3 digitaldaemon.com...
 "Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
 news:dt7qm4$2kn0$1 digitaldaemon.com...
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt6nug$1lhe$1 digitaldaemon.com...
 string s = ....
 bool r = s.like("str*");


Who cares in most of cases?

In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.

Probably in some Perl-ish use cases this really is needed. In my http://blocknote.net, hyperlink auto-recognition starts working on each complete non-whitespace sequence, so I already know the position. But this is a particular use case.
 When you need match results you will use regexp
 or something more effective like tokenizers.

Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.

Why?
Here is a simple Tokenizer for C/C++/D/etc.-like texts:

module harmonia.string;

class TokenizerT(CHAR)
{
  enum token { EOT, SPACE, WORD, QUOTE, DELIMETER, COMMENT }
  ...
}

And

module harmonia.html.scanner;

is a simple HTML/XML push parser (scanner).

----------------------
I mean that std.lib should have multiple text handling tools.
RegExp is not the only one possible.

I would like to see something like the customizable TokenizerT above
in the std lib. Frequently such a tokenizer is what is really needed,
rather than regexp and scripting-style poor man's tokenizing
using array.split and the like.

Andrew.
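To give an idea of the general shape of such a tool, here is a much reduced sketch in later D. It is not Harmonia's TokenizerT: the type, its fields and the helper words() are all hypothetical, and quotes and comments are left out.

```d
import std.ascii : isAlphaNum, isWhite;
import std.stdio : writeln;

enum Token { eot, space, word, delimiter }

/// Hypothetical, reduced stand-in for a stream tokenizer.
struct Tokenizer
{
    string input;   // text still to be scanned
    string value;   // slice of the most recent token (no copying)

    /// Consumes one token from the front of `input` and returns its kind.
    Token next()
    {
        if (input.length == 0)
        {
            value = null;
            return Token.eot;
        }
        size_t n = 1;
        Token kind;
        if (isWhite(input[0]))
        {
            kind = Token.space;
            while (n < input.length && isWhite(input[n]))
                ++n;
        }
        else if (isAlphaNum(input[0]) || input[0] == '_')
        {
            kind = Token.word;
            while (n < input.length
                   && (isAlphaNum(input[n]) || input[n] == '_'))
                ++n;
        }
        else
        {
            kind = Token.delimiter;   // any other single character
        }
        value = input[0 .. n];
        input = input[n .. $];
        return kind;
    }
}

/// Collects only the word tokens of `text`.
string[] words(string text)
{
    string[] result;
    auto tz = Tokenizer(text);
    for (Token t = tz.next(); t != Token.eot; t = tz.next())
        if (t == Token.word)
            result ~= tz.value;
    return result;
}

void main()
{
    writeln(words("int x = foo(42);")); // ["int", "x", "foo", "42"]
}
```

The caller pulls tokens one at a time and decides what to do with each kind, which is the difference from splitting the whole input up front.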
Feb 18 2006
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message 
news:dt87fd$314d$1 digitaldaemon.com...
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt80n7$2qiu$3 digitaldaemon.com...
 Writing a real lexer takes a lot of effort. That's why people invented 
 regex, it'll handle most jobs without having to write a lexer. C's 
 strtok() is embarrassingly inadequate.

Why?

I'd like to see strtok() parse an email address out of a body of text.
Feb 19 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:dt9ho8$20e4$3 digitaldaemon.com...

 Writing a real lexer takes a lot of effort. That's why people invented 
 regex, it'll handle most jobs without having to write a lexer. C's 
 strtok() is embarrassingly inadequate.

Why?

I'd like to see strtok() parse an email address out of a body of text.

I don't really understand "parse an email address out of a body of text."

Do you mean something like this:

char* pw = text;
url u;
forever
{
   pw = strtok( pw, " \t\n\r" );
   if( !pw ) return;
   if( !u.parse(pw) ) continue;
   if( u.protocol() == url::MAILTO )
       //found - do something here
       ;
};

?

Andrew.
Feb 19 2006
next sibling parent Chris Sauls <ibisbasenji gmail.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt9ho8$20e4$3 digitaldaemon.com...
 
 
Writing a real lexer takes a lot of effort. That's why people invented 
regex, it'll handle most jobs without having to write a lexer. C's 
strtok() is embarrassingly inadequate.

Why?

I'd like to see strtok() parse an email address out of a body of text.

I don't really understand "parse an email address out of a body of text."

Do you mean something like this:

char* pw = text;
url u;
forever
{
   pw = strtok( pw, " \t\n\r" );
   if( !pw ) return;
   if( !u.parse(pw) ) continue;
   if( u.protocol() == url::MAILTO )
       //found - do something here
       ;
};

?

Andrew.

I think he meant something more like (using MatchExpr, sorry):

# char[] text = ...;
# char[] addr, user, host, tld;
# if (`([_a-z0-9]*) ([_a-z0-9]*).([_a-z0-9]*)` ~~ text) {
#   addr = _match[0];
#   user = _match[1];
#   host = _match[2];
#   tld  = _match[3];
#
#   // do something
# }

Granted, I just tossed that together in five seconds flat, so it's probably not quite right. I'm just recently starting to lean into the RegExp camp myself. It's made parsing of Lyra scripts a dream.

One thing I miss from a scripting language in doing the above is PHP's lovely list() construct. Pretending we had this in D:

# char[] text = ...;
# char[] addr, user, host, tld;
# if (`([_a-z0-9]*) ([_a-z0-9]*).([_a-z0-9]*)` ~~ text) {
#   list(addr,user,host,tld) = _match;
#   // do something
# }

-- Chris Nicholson-Sauls
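The same idea, sketched for a later D release with std.regex instead of ~~ and _match. Assumptions: the blank in the posted pattern stands for '@' (the archive shows addresses with '@' replaced by a space), the Address struct and firstAddress() are hypothetical, and the sample address is made up.

```d
import std.regex : regex, matchFirst;
import std.stdio : writeln;

/// Hypothetical result type for this sketch.
struct Address
{
    string full, user, host, tld;
}

/// Finds the first user@host.tld in `text`; returns Address.init if none.
Address firstAddress(string text)
{
    // Compared with the posted pattern: '@' is written out, '.' is
    // escaped, and '+' replaces '*' so that empty parts do not match.
    auto m = matchFirst(text,
        regex(`([_a-z0-9]+)@([_a-z0-9]+)\.([_a-z0-9]+)`));
    if (m.empty)
        return Address.init;
    // m[0] is the whole match, m[1..3] are the parenthesised groups.
    return Address(m[0], m[1], m[2], m[3]);
}

void main()
{
    auto a = firstAddress("please write to someone@example.org soon");
    writeln(a.full);                               // someone@example.org
    writeln(a.user, " / ", a.host, " / ", a.tld);  // someone / example / org
}
```

The captures play the role of _match here: one call both tests for a match and hands back the pieces.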
Feb 19 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Andrew Fedoniouk,

What he's saying is... essentially... please take this string:

char[] some_text = "The email address Walter is posting from is 
newshound digitalmars.com.  The headers for your message have 
<news terrainformatica.com>, so I would assume that is your address.  My 
address can be found in this HTML: <a 
href=\"mailto:unknown simplemachines.org\">my email</a>";

Now use strtok to output just the email addresses.  I would expect the 
output to be like this:

1: newshound digitalmars.com
2: news terrainformatica.com
3: unknown simplemachines.org

How many lines will it take to grab those addresses, without using a 
regular expression?  You can use "like()" all you like, and strtok(), or 
even strpos()...

He does not mean a whitespace separated list of addresses, why would you 
need to work to parse that?  Most people would not use a regular 
expression for that, it'd be silly.

I think you're looking at this from a different angle than Walter is.

Just illustrating,
-[Unknown]


 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:dt9ho8$20e4$3 digitaldaemon.com...
 
 Writing a real lexer takes a lot of effort. That's why people invented 
 regex, it'll handle most jobs without having to write a lexer. C's 
 strtok() is embarrassingly inadequate.



I don't really understand "parse an email address out of a body of text."

Do you mean something like this:

char* pw = text;
url u;
forever
{
   pw = strtok( pw, " \t\n\r" );
   if( !pw ) return;
   if( !u.parse(pw) ) continue;
   if( u.protocol() == url::MAILTO )
       //found - do something here
       ;
};

?

Andrew.

Feb 19 2006
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 19 Feb 2006 14:47:43 -0800, Unknown W. Brackets  
<unknown simplemachines.org> wrote:
 Andrew Fedoniouk,

 What he's saying is... essentially... please take this string:

 char[] some_text = "The email address Walter is posting from is  
 newshound digitalmars.com.  The headers for your message have  
 <news terrainformatica.com>, so I would assume that is your address.  My  
 address can be found in this HTML: <a  
 href=\"mailto:unknown simplemachines.org\">my email</a>";

 Now use strtok to output just the email addresses.  I would expect the  
 output to be like this:

 1: newshound digitalmars.com
 2: news terrainformatica.com
 3: unknown simplemachines.org

 How many lines will it take to grab those addresses, without using a  
 regular expression?  You can use "like()" all you like, and strtok(), or  
 even strpos()...

Here's how I'd do it:

import std.stdio;
import std.string;

char[] some_text = "The email address Walter is posting from is
newshound digitalmars.com.  The headers for your message have
<news terrainformatica.com>, so I would assume that is your address.  My
address can be found in this HTML: <a
href=\"mailto:unknown simplemachines.org\">my email</a>";

void main()
{
    char[][] res;
    res = parse_string(some_text);
    foreach(int i, char[] r; res)
        writefln("%d. %s",i+1,r);
}

bool valid_email_char(char c)
{
    char* special = "<>()[]\\.,;: \"";
    if (c == '.') return true;
    if (c <= 0x1F) return false;
    if (c == 0x7F) return false;
    if (c == ' ') return false;
    if (strchr(special,c)) return false;
    return true;
}

char[][] parse_string(char[] text)
{
    char[][] res;
    char* raw = toStringz(text);
    char* p;
    char* e;

    for(p = strchr(raw,' '); p; p = strchr(e,' ')) {
        for(e = p+1; valid_email_char(*e); e++) {}
        if (e > raw && *(e-1) == '.') e--;
        for(; p > raw && valid_email_char(*(p-1)); p--) {}
        res ~= p[0..(e-p)]; //add .dup if required
    }

    return res;
}

Regan
Feb 19 2006
parent reply "Walter Bright" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message 
news:ops48ur1em23k2f5 nrage.netwin.co.nz...
 Here's how I'd do it:

Yours is a lot of code to do what a regex does. Now recognize a url <g>.
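For comparison, a sketch of the regex route in a later D release with std.regex. Assumptions: the pattern is deliberately loose and only illustrative, findEmails() is a made-up name, and the sample text writes the addresses with '@' restored (the archive displays them with a space instead).

```d
import std.regex : regex, matchAll;
import std.stdio : writefln;

/// Returns every substring of `text` that looks like user@host.tld.
/// The pattern is an assumption of this sketch, not a full address grammar.
string[] findEmails(string text)
{
    auto re = regex(`[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+`);
    string[] found;
    foreach (m; matchAll(text, re))
        found ~= m.hit;          // m.hit is the matched slice of text
    return found;
}

void main()
{
    auto text = "Walter posts from newshound@digitalmars.com. Headers show "
              ~ "<news@terrainformatica.com>, and the HTML has "
              ~ `<a href="mailto:unknown@simplemachines.org">my email</a>`;
    foreach (i, addr; findEmails(text))
        writefln("%d: %s", i + 1, addr);
    // 1: newshound@digitalmars.com
    // 2: news@terrainformatica.com
    // 3: unknown@simplemachines.org
}
```

The sentence-ending dot after the first address is not swallowed, because each dot in the host part must be followed by at least one more host character.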
Feb 19 2006
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 19 Feb 2006 18:52:19 -0800, Walter Bright  
<newshound digitalmars.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:ops48ur1em23k2f5 nrage.netwin.co.nz...
 Here's how I'd do it:

Yours is a lot of code to do what a regex does.

This is true, though my code is likely faster.
 Now recognize a url <g>.

Nah. You've made your point.. in fact I was secretly trying to help. <g>

Regex is a good general purpose string parsing facility. I personally find composing a regex can be complicated, likely it's easier with practice. A custom piece of code is probably faster and I find it easier to tweak. In the end, unless it was performance critical or has resisted my initial efforts at composing a regex, I'd probably use a regex.

Regan
Feb 19 2006
parent Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 Walter Bright <newshound digitalmars.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote 

 Here's how I'd do it:

Yours is a lot of code to do what a regex does.

This is true, though my code is likely faster.
 Now recognize a url <g>.

Nah. You've made your point.. in fact I was secretly trying to help. <g>

DISCLAIMER INSERTED WHEN PROOFREADING: I'm not attacking you, or anybody's opinion here, I'm just thinking aloud -- mostly to sort out my own opinion on this issue! :-)
 Regex is a good general purpose string parsing facility. I personally 
 find  composing a regex can be complicated, likely it's easier with 
 practice. A  custom piece of code is probably faster and I find it 
 easier to tweak. In  the end, unless it was performance critical or has 
 resisted my initial  efforts at composing a regex, I'd probably use a 
 regex.

Heh, interestingly, I have the same feeling about all three!! (I.e. composing nontrivial regexes is hard, custom code is faster and easier to tweak.)

But I can't but wonder whether I'm wrong on all three!

In other words, writing custom code to do the same as a nontrivial regexp might feel the easier choice at the outset, but the sheer number of lines required (for example for the url recognition task) makes the code error prone and unobvious.

And I too _feel_ that the custom code would be faster, but, on second thought, I'd probably have to do some intensive optimizing cycles if I were against an average regexp implementation. ;-( This regexp stuff is "well understood" and polished during decades, after all.

As to "easier to tweak", suppose that Boss comes to you 2 months later and wants this Url Recognizer (which you had to write in a hurry to compete with the regexp guy in the next cubicle) to only accept top-level domains in country specific urls, you'd be hard put to know where to start tweaking, while the other guy gets it right in 30 seconds flat tweaking his regexp code. (The boss' tweak accepts foo.fi but not foo.bar.fi nor foo.com)
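A sketch of what that 30-second tweak might look like, for a later D release with std.regex. Assumptions: "country specific" is read as "two-letter suffix", hosts are lower case, and isCountryTopLevel() is a made-up name.

```d
import std.regex : regex, matchFirst;
import std.stdio : writeln;

/// True for a host made of exactly one label plus a two-letter suffix.
/// Treating "two letters" as "country specific" is an assumption here.
bool isCountryTopLevel(string host)
{
    auto re = regex(`^[a-z0-9-]+\.[a-z]{2}$`);
    return !matchFirst(host, re).empty;
}

void main()
{
    foreach (host; ["foo.fi", "foo.bar.fi", "foo.com"])
        writeln(host, " -> ", isCountryTopLevel(host));
    // foo.fi -> true
    // foo.bar.fi -> false
    // foo.com -> false
}
```

The whole requirement sits in one pattern: the label class excludes dots, so a second label cannot sneak in, and {2} fixes the suffix length.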
 Here's how I'd do it:
 
 import std.stdio;
 import std.string;
 
 char[] some_text = "The email address Walter is posting from is 
newshound digitalmars.com.  The headers for your message have 
<news terrainformatica.com>, so I would assume that is your address.  My 
address can be found in this HTML: <a 
href=\"mailto:unknown simplemachines.org\">my email</a>";
 
 void main()
 {
     char[][] res;   
     res = parse_string(some_text);
     foreach(int i, char[] r; res)
         writefln("%d. %s",i+1,r);
 }
 
 bool valid_email_char(char c)
 {
     char* special = "<>()[]\\.,;: \"";
     if (c == '.') return true;
     if (c <= 0x1F) return false;
     if (c == 0x7F) return false;
     if (c == ' ') return false;
     if (strchr(special,c)) return false;
     return true;
 }
 
 char[][] parse_string(char[] text)
 {
     char[][] res;
     char* raw = toStringz(text);
     char* p;
     char* e;
     
     for(p = strchr(raw,' '); p; p = strchr(e,' ')) {
         for(e = p+1; valid_email_char(*e); e++) {}
         if (e > raw && *(e-1) == '.') e--;
         for(; p > raw && valid_email_char(*(p-1)); p--) {}
         res ~= p[0..(e-p)]; //add .dup if required
     }
     
     return res;
 }

Feb 20 2006
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com>
 "Andrew Fedoniouk" <news terrainformatica.com>


 In general shortcuts are good but in this particular case it has
 hidden side effects in creation of new RegExp object on each test
 invocation.

Yes, but why is that a bad thing?

You need to explain very well what is going on under the hood of
this ~~ - it is a stateful operator (if it is /g).

<ot>
I am using a stream tokenizer in Harmonia instead of this /g.
(class TokenizerT(CHAR) // harmonia/string.d)
A simple like(pattern) method is enough in 90% of cases.

Perl is a completely different story - it is built around RegExp.
And it is typeless.
</ot>

BTW: Have you seen Nemerle and its way of meta-programming?
http://nemerle.org/

Had I to do stuff on the M$ "platform", I'd definitely look long and hard on Nemerle, before even touching C#.

The macro thing looks quite a bit like what I had in mind last winter when we were discussing whether the high level (that is, metaprogramming) features of D should be implemented in a syntax distinct from the "normal" language syntax or not. Seems I lost. :-) (No hard feelings, Walter and Don are really amazing me, over and over again!)

Still, there's a lot of obvious stuff that seems trivial with a separate syntax, while either impossible or cumbersome with the current one. (But hey, with the rate W&D are going, all that will also be fixed by D 1.5.)
Feb 20 2006