D - regexp suggestion

reply "Pavel Minayev" <evilone omen.ru> writes:
It would be really nice to have a method of RegExp similar to test(),
but only matching regexp at the position given, not advancing
further on error, and returning number of bytes read (or 0 on failure).
It could be used for easy token parsing:

    RegExp identifier = new RegExp('\w', "");
    char[] code, token;
    int pos;
    ...
    int count = identifier.get(code, pos);
    if (count)
    {
        token = code[pos .. pos + count];
        pos += count;    // next token
    }
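
On a current D toolchain, the behaviour requested here can be approximated with Phobos `std.regex` rather than the 2002 `RegExp` class: anchor the pattern with `^` and match against the slice that starts at `pos`, so the engine never searches ahead. This is only a sketch under those assumptions; `matchAt` is a hypothetical helper name and the identifier pattern is chosen for illustration.

```d
import std.regex : matchFirst, regex;

/// Hypothetical helper: number of chars matched starting exactly at `pos`,
/// or 0 when the pattern does not match there. `re` must begin with `^`.
size_t matchAt(Re)(string text, size_t pos, Re re)
{
    auto m = matchFirst(text[pos .. $], re);
    return m.empty ? 0 : m.hit.length;
}

unittest
{
    auto identifier = regex(`^[A-Za-z_]\w*`);
    string code = "foo123 = bar456 + 789;";
    size_t pos = 0;

    auto count = matchAt(code, pos, identifier);
    assert(count == 6);                          // "foo123"
    auto token = code[pos .. pos + count];
    pos += count;                                // next token starts here
    assert(token == "foo123");
    assert(matchAt(code, pos, identifier) == 0); // a space: no searching ahead
}
```

A zero-length match is indistinguishable from a failure with this return convention, which is the same trade-off as in the proposal.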
Feb 08 2002
parent reply "Walter" <walter digitalmars.com> writes:
I believe you can already do that with regexp by looking at the match array
and using it to slice the input array.

"Pavel Minayev" <evilone omen.ru> wrote in message
news:a41ccn$2m50$1 digitaldaemon.com...
 It would be really nice to have a method of RegExp similar to test(),
 but only matching regexp at the position given, not advancing
 further on error, and returning number of bytes read (or 0 on failure).
 It could be used for easy token parsing:

     RegExp identifier = new RegExp('\w', "");
     char[] code, token;
     int pos;
     ...
     int count = identifier.get(code, pos);
     if (count)
     {
         token = code[pos .. pos + count];
         pos += count;    // next token
     }

Feb 08 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a41imc$2pnk$1 digitaldaemon.com...
 I believe you can already do that with regexp by looking at the match array
 and using it to slice the input array.

Yes, but it's sloooooow!
Feb 08 2002
parent reply "Walter" <walter digitalmars.com> writes:
You can also use the "g" attribute.

"Pavel Minayev" <evilone omen.ru> wrote in message
news:a41jep$2q3p$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a41imc$2pnk$1 digitaldaemon.com...
 I believe you can already do that with regexp by looking at the match array
 and using it to slice the input array.

Yes, but it's sloooooow!

Feb 08 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a41oek$2se5$1 digitaldaemon.com...

 You can also use the "g" attribute.

Sorry, I'm not very familiar with regexp... how is it supposed to do what I want?
Feb 08 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a42jse$6h1$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a41oek$2se5$1 digitaldaemon.com...

 You can also use the "g" attribute.

Sorry, I'm not very familiar with regexp... how is it supposed to do what I want?

If you pass the "g" attribute to the RegExp constructor, repeated calls to exec() will each pick up where the previous one left off.
Feb 09 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a42tc9$hrc$1 digitaldaemon.com...

 If you pass the "g" attribute to the RegExp constructor, repeated calls
 to exec() will each pick up where the previous one left off.

But doesn't it try to search for the regexp further if it doesn't match at the current position?
Feb 09 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a433vk$l3i$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a42tc9$hrc$1 digitaldaemon.com...

 If you pass the "g" attribute to the RegExp constructor, repeated calls
 to exec() will each pick up where the previous one left off.

 But doesn't it try to search for the regexp further if it doesn't match at the current position?

Yes.
Feb 09 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a43tq3$11uk$2 digitaldaemon.com...

 But doesn't it try to search for the regexp further if it doesn't
 match at the current position?

Yes.

Then I don't understand how it can be used to tokenize the string.
Suppose I have:

    foo123 = bar456 + 789;

Now I first search for identifiers, and get "foo123" and "bar456".
Then I search for numbers and get "123", "456" and "789" - and only
the last one is correct...

With my suggestion implemented, however, it'd look somewhat
different. First I check for an identifier, and get "foo123".
Now I advance past the end of that token, and perform another
check... when I get to "789", I check if it matches an
identifier /\w.../ - it doesn't, so I check if it is a number
/0-9+/ and succeed... that's how it is supposed to work.
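
The loop described above (try each token pattern at the current position, take the first that matches, then advance past it) might look like the following with modern Phobos `std.regex`. This is a sketch, not the 2002 `RegExp` API: the `Token` struct, the token kinds and the three patterns are all assumptions made for illustration.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

struct Token
{
    string kind;
    string text;
}

/// Splits `text` into tokens by trying each anchored pattern at the current
/// position. Kinds and patterns are illustrative assumptions.
Token[] tokenize(string text)
{
    auto kinds = ["identifier", "number", "symbol"];
    auto patterns = [
        regex(`^[A-Za-z_]\w*`), // tried first
        regex(`^[0-9]+`),
        regex(`^[=+;]`),
    ];
    auto blank = regex(`^\s+`);

    Token[] result;
    size_t pos = 0;
    scan: while (pos < text.length)
    {
        auto rest = text[pos .. $];

        auto ws = matchFirst(rest, blank);
        if (!ws.empty)
        {
            pos += ws.hit.length; // skip whitespace between tokens
            continue;
        }

        foreach (i, re; patterns)
        {
            auto m = matchFirst(rest, re);
            if (!m.empty)
            {
                result ~= Token(kinds[i], m.hit);
                pos += m.hit.length;
                continue scan; // start again with the first pattern
            }
        }
        throw new Exception("no token matches at offset " ~ pos.to!string);
    }
    return result;
}

unittest
{
    auto toks = tokenize("foo123 = bar456 + 789;");
    assert(toks.length == 6);
    assert(toks[0] == Token("identifier", "foo123"));
    assert(toks[4] == Token("number", "789"));
}
```

Because every pattern is anchored with `^` and applied to the remaining slice, "123" inside "foo123" is never reported as a separate number.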
Feb 09 2002
next sibling parent reply "Sean L. Palmer" <spalmer iname.com> writes:
I think sscanf could do this if it could return a pointer to how far it got
in the input string during processing in addition to how many fields were
converted.  sscanf as it exists in C is not so useful.

Sean

"Pavel Minayev" <evilone omen.ru> wrote in message
news:a443lq$147s$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a43tq3$11uk$2 digitaldaemon.com...

 But doesn't it try to search for the regexp further if it doesn't
 match at the current position?

Yes.

 Then I don't understand how it can be used to tokenize the string.
 Suppose I have:

     foo123 = bar456 + 789;

 Now I first search for identifiers, and get "foo123" and "bar456".
 Then I search for numbers and get "123", "456" and "789" - and only
 the last one is correct...

 With my suggestion implemented, however, it'd look somewhat
 different. First I check for an identifier, and get "foo123".
 Now I advance past the end of that token, and perform another
 check... when I get to "789", I check if it matches an
 identifier /\w.../ - it doesn't, so I check if it is a number
 /0-9+/ and succeed... that's how it is supposed to work.

Feb 09 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Sean L. Palmer" <spalmer iname.com> wrote in message
news:a444t2$14qa$1 digitaldaemon.com...

 I think sscanf could do this if it could return a pointer to how far it got
 in the input string during processing in addition to how many fields were
 converted.  sscanf as it exists in C is not so useful.

Also, if only sscanf understood regexps... =) That's why I suggest RegExp.scan();
Feb 09 2002
parent "Sean L. Palmer" <spalmer iname.com> writes:
sscanf has a lot more power than most people realize.  I myself didn't
discover a lot of it until recently.  But it won't tell you where it got to
in the string.

Sean

"Pavel Minayev" <evilone omen.ru> wrote in message
news:a447tq$161o$1 digitaldaemon.com...
 "Sean L. Palmer" <spalmer iname.com> wrote in message
 news:a444t2$14qa$1 digitaldaemon.com...

 I think sscanf could do this if it could return a pointer to how far it got
 in the input string during processing in addition to how many fields were
 converted.  sscanf as it exists in C is not so useful.

 Also, if only sscanf understood regexps... =) That's why I suggest RegExp.scan();

Feb 09 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a443lq$147s$1 digitaldaemon.com...
 With my suggestion implemented, however, it'd look somewhat
 different. First I check for identifier, and get "foo123".
 Now I advance after the end of that token, and perform another
 check... when I get to "789", I check if it matches an
 identifier /\w.../ - it doesn't, so I check if it is a number
 /0-9+/ and succeed... that's how it is supposed to work.

If you're changing the regular expression you're searching for, which is what you're doing by switching from looking for an identifier to looking for a number, you'll need to create a new RegExp for each different regular expression. Then apply them as required to the remainder of the input string.
Feb 09 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a446n4$15hm$1 digitaldaemon.com...

 If you're changing the regular expression you're searching for, which is
 what you're doing by switching from looking for an identifier to looking for
 a number, you'll need to create a new RegExp for each different regular
 expression. Then apply them as required to the remainder of the input
 string.

I pre-create them all in the form of an array:

    RegExp[] tokens;

    static this()
    {
        tokens =
            new RegExp('\w+', ""),    // word
            new RegExp('\d+', ""),    // number
            ...
    }

Now how do I apply them to the remainder of the input string (whatever
this means)? I can of course first retrieve identifiers, and remove them
from the array, then get rid of numbers, symbols... etc. But it would be
damn slow.

This could also be done by a "regexp comparison" function, if there
were one:

    // read a token
    for (int i = 0; i < tokens.length; i++)
    {
        // RegExp.cmp() returns the number of chars at the beginning
        // of given string that match the regexp, or 0 if no match
        int len = tokens[i].cmp(text[pos .. text.length]);
        if (len)
        {
            // match!
            token = text[pos .. pos + len];
            pos += len;
            break;
        }
    }

Regexp comparison is a good idea anyhow, IMO. It can be used for lots of
different things.
Feb 09 2002
next sibling parent "Pavel Minayev" <evilone omen.ru> writes:
         tokens =
             new RegExp('\w+', ""),    // word
             new RegExp('\d+', ""),    // number
             ...

Sorry =) This should of course look like:

        tokens =
            new RegExp('\w+', "") ~    // word
            new RegExp('\d+', "") ~    // number
            ...
Feb 09 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
All you have to do is:

    r1 = new RegExp(...);

    m1 = r1.match(input);
    if (m1.length)
        m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);

and so on...
Feb 09 2002
next sibling parent Karl Bochert <kbochert ix.netcom.com> writes:
On Sat, 9 Feb 2002 15:56:56 -0800, "Walter" <walter digitalmars.com> wrote:
 All you have to do is:
 
     r1 = new RegExp(...);
 
     m1 = r1.match(input);
     if (m1.length)
         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
 
 and so on...
 
 

to hide the gore?

    r1 = new RegExp (...);
    r1.exec(input);
    x = r1.matches ();    //returns number of parenthesized matches
    tail = r1.tail ();    //returns portion of input after match
    m1 = getMatch (n)     //returns the nth matching substring

Regular expressions are very powerful but can also be very complicated.
Shouldn't the class help by providing well-named queries?
In addition it would be more like PCRE, which is already well understood.

Karl Bochert
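
For comparison, today's Phobos `std.regex` exposes named queries of roughly this kind on the result of a match, which avoids the pointer arithmetic quoted above. The snippet is a sketch against `std.regex`, not the 2002 `RegExp` class; the input string and pattern are made up for illustration.

```d
import std.regex : matchFirst, regex;

unittest
{
    string input = "  = foo123 + 1";
    auto m = matchFirst(input, regex(`([A-Za-z_]+)([0-9]*)`));

    assert(m.hit == "foo123");  // the overall match
    assert(m[1] == "foo");      // first parenthesized group
    assert(m[2] == "123");      // second parenthesized group
    assert(m.pre == "  = ");    // input before the match
    assert(m.post == " + 1");   // input after the match (the "tail")
}
```

Here `post` plays the role of the suggested `tail()`, and indexing plays the role of `getMatch(n)`.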
Feb 09 2002
prev sibling parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a44fdn$18t6$1 digitaldaemon.com...

 All you have to do is:

     r1 = new RegExp(...);

     m1 = r1.match(input);
     if (m1.length)
         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);

 and so on...

If the first token is an r2 rather than an r1, but there are some r1s further along in the string, the first match() will skip the r2 and return the r1.
Feb 10 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a45a2l$1kk4$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a44fdn$18t6$1 digitaldaemon.com...

 All you have to do is:

     r1 = new RegExp(...);

     m1 = r1.match(input);
     if (m1.length)
         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);

 and so on...

 If the first token is an r2 rather than an r1, but there are some r1s further along in the string, the first match() will skip the r2 and return the r1.

Yes, but if you are using multiple RegExps on the same string, you need to decide which slices get searched for which patterns. If you are using one RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.
Feb 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a45e05$1m8o$1 digitaldaemon.com...

 RegExp, just set the "g" attribute. If you use one RegExp to search for two
 different patterns, use parenthesized subexpressions, and the match[][]
 return will tell you which one was matched.

This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"...

Of course a check can be done for starting position == 0 - which involves too many checks, IMO - or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in RegExp.match(), then by my type detection routine. Wouldn't it be slow?

I'm not asking for much... just a version of test() with the for-loop removed.
Feb 10 2002
next sibling parent reply Karl Bochert <kbochert ix.netcom.com> writes:
On Sun, 10 Feb 2002 17:54:52 +0300, "Pavel Minayev" <evilone omen.ru> wrote:
 "Walter" <walter digitalmars.com> wrote in message
 news:a45e05$1m8o$1 digitaldaemon.com...
 
 RegExp, just set the "g" attribute. If you use one RegExp to search for two
 different patterns, use parenthesized subexpressions, and the match[][]
 return will tell you which one was matched.

 This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"...

 Of course a check can be done for starting position == 0 - which involves too many checks, IMO - or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in RegExp.match(), then by my type detection routine. Wouldn't it be slow?

 I'm not asking for much... just a version of test() with the for-loop removed.

I may be missing the point here but:
The power of regular expressions is their ability to search for multiple
patterns at once. If the next thing in the input is either a number or
a word which could have embedded digits then

    "\w[\w\d]*"                  matches a word
    "\d+"                        matches a number
    "(\w[\w\d]*)|(\d+)"          matches a word or a number
and
    "[\t ]*(\w[\w\d]*)|(\d+)"    matches any spaces followed by a word or a number.

In the last 2 cases, the result of the search is up to 3 substrings : the overall
match, and the substrings within the parentheses. Perform the search and
then the lengths of the substrings will tell you what you found.

Documentation on standard regex's can be found at:
http://compy.ww.tu-berlin.de/doc/packages/pcre/pcre.html
among many other places.
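
A minimal sketch of this "one pattern, check which group matched" idea, written against modern Phobos `std.regex` rather than the `RegExp` class under discussion. Two details are assumptions of the sketch and differ from the patterns above: `\w` also matches digits, so the word alternative is spelled `[A-Za-z_]\w*`, and a non-capturing group `(?: ... )` makes the blank-skipping prefix apply to both alternatives. `classify` is a hypothetical name.

```d
import std.regex : matchFirst, regex;

/// Reports what kind of token starts `text`, after optional blanks.
string classify(string text)
{
    // Group 1 is the word alternative, group 2 the number alternative.
    auto re = regex(`^[\t ]*(?:([A-Za-z_]\w*)|([0-9]+))`);
    auto m = matchFirst(text, re);
    if (m.empty)
        return "none";
    // Exactly one of the two groups took part in a successful match.
    return m[1].length ? "word" : "number";
}

unittest
{
    assert(classify("foo123 = 1") == "word");
    assert(classify("  789;") == "number");
    assert(classify("= 1") == "none");
}
```

A single match therefore yields both the token text and its type, without checking the token a second time.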
Feb 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Karl Bochert" <kbochert ix.netcom.com> wrote in message
news:1103_1013361883 bose...

 In the last 2 cases, the result of the search is up to 3 substrings : the overall
 match, and the substrings within the parentheses. Perform the search and
 then the  lengths of the substrings will tell you what you found.

How can these lengths tell? Token type is determined by the forming
characters (described by regexp in my case), not by the length - or
am I missing something? Suppose the input was:

    foo bar123 456 baz

Now I get the following tokens:

    "foo", "bar123", "baz", "123", "456"

How do I know that "123" is not supposed to be here?
Feb 10 2002
parent reply Karl Bochert <kbochert ix.netcom.com> writes:
On Sun, 10 Feb 2002 20:32:11 +0300, "Pavel Minayev" <evilone omen.ru> wrote:
 "Karl Bochert" <kbochert ix.netcom.com> wrote in message
 news:1103_1013361883 bose...
 
 In the last 2 cases, the result of the search is up to 3 substrings : the overall
 match, and the substrings within the parentheses. Perform the search and
 then the  lengths of the substrings will tell you what you found.

 How can these lengths tell? Token type is determined by the forming
 characters (described by regexp in my case), not by the length - or
 am I missing something? Suppose the input was:

     foo bar123 456 baz

 Now I get the following tokens:

     "foo", "bar123", "baz", "123", "456"

 How do I know that "123" is not supposed to be here?

Declare a regular expression:

    p = Regexp( "(\w[\w\d]*)|(\d+)" )

then:

    p.match ("123test")

produces 3 substrings:

    "123"    -- the overall match
    ""       -- the match for the first set of parens
    "123"    -- the match for the second set of parens

In PCRE (the common C implementation) the substrings are returned as an
array of pointers into the string (6 in this case). I suspect D returns an
equivalent array of offsets (slices?) into the string?
The non-zero length of the third substring shows that a number ("\d+") was
found.

In your example:

    p.exec ("foo bar123 baz 123");

produces:

    "foo"
    "foo"
    ""

and:

    p.exec ("bar123 baz 123")

produces:

    "bar123"
    "bar123"
    ""

and:

    p.exec ("123 456");

produces:

    "123"
    ""
    "123"

I have used exec() here because it is probably the same as PCRE's exec
function. I have read the RegExp documentation but do not understand the
difference between the exec() and match() methods. Maybe match() is
just exec() anchored to the start of the text?

Karl
Feb 10 2002
next sibling parent "Pavel Minayev" <evilone omen.ru> writes:
"Karl Bochert" <kbochert ix.netcom.com> wrote in message
news:1103_1013375566 bose...

 In your example:
      p.exec ("foo bar123 baz 123");
 produces:
     "foo"
     "foo"
     ""

 and:
     p.exec ("bar123 baz 123")
 produces:
     "bar123"
     "bar123"
     ""

 and:
     p.exec ("123 456");
 produces:
     "123"
     ""
     "123"

Yep, right. Now I have all the tokens, how do I determine the _type_ of each (identifier, number, string...), with regexp describing those types?
Feb 10 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Karl Bochert" <kbochert ix.netcom.com> wrote in message
news:1103_1013375566 bose...
 I have used exec() here because it is probably the same as PCRE's exec
 function. I have read the RegExp documentation but do not understand the
 difference between the exec() and match() methods. Maybe match() is
 just exec() anchored to the start of the text?

There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.
Feb 10 2002
parent Karl Bochert <kbochert ix.netcom.com> writes:
On Sun, 10 Feb 2002 15:47:34 -0800, "Walter" <walter digitalmars.com> wrote:
 
 "Karl Bochert" <kbochert ix.netcom.com> wrote in message
 news:1103_1013375566 bose...
 I have used exec() here because it is probably the same as PCRE's exec
 function. I have read the RegExp documentation but do not understand the
 difference between the exec() and match() methods. Maybe match() is
 just exec() anchored to the start of the text?

There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.

in the subject string, but loses the 'which substring' information.
That might explain Pavel's problem -- to parse the next token and get
its type info he should use exec() or global match().

Karl Bochert
Feb 10 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a461ka$1tv0$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a45e05$1m8o$1 digitaldaemon.com...

 RegExp, just set the "g" attribute. If you use one RegExp to search for two
 different patterns, use parenthesized subexpressions, and the match[][]
 return will tell you which one was matched.

 This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp.

That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression.
Feb 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a470kt$2art$1 digitaldaemon.com...

 That's not a problem with parenthesized subexpressions. You can tell which
 one got the match by the index in match[][]. The second index 0 is the
 overall match, subsequent indices are the matches for each subexpression.

Walter, where is that match[][] thing? match() returns char[][], which ain't what I need...
Feb 10 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a47ir2$2i4j$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a470kt$2art$1 digitaldaemon.com...

 That's not a problem with parenthesized subexpressions. You can tell which
 one got the match by the index in match[][]. The second index 0 is the
 overall match, subsequent indices are the matches for each subexpression.

 Walter, where is that match[][] thing? match() returns char[][], which
 ain't what I need...

It sounds like just what you need. I guess I just don't understand what's wrong.
Feb 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a47r1i$2lhb$1 digitaldaemon.com...

 It sounds like just what you need. I guess I just don't understand what's
 wrong.

char[][] is the list of tokens, or, to be more exact, the list of their
_values_. But how do I know their _types_ (string or number or ..)?
Suppose the regexp was:

    ([A-Za-z_]+|0-9+)

And I get 10 tokens. How do I tell if the first matched the [A-Za-z_]+ part
or the 0-9+ part, without checking it separately (which results in two
checks per token)?
Feb 11 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a485eh$2rbg$1 digitaldaemon.com...
 char[][] is the list of tokens, or, to be more exact, the list of their
 _values_. But how do I know their _types_ (string or number or ..)?
 Suppose the regexp was:

     ([A-Za-z_]+|0-9+)

 And I get 10 tokens. How do I tell if the first matched [A-Za-z_]+ part
 or the 0-9+ part, without checking it separately (which results in two
 checks per token)?

You can tell which parenthesized subexpression matched by checking to see
which index it was in:

    char[][] m;
    r = new RegExp("(a)|(b)", "g");    // search for "a" or "b"
    while ((m = r.exec("a b and a b")) != null)
    {
        if (m[1])
            ;    // matched an "a"
        else if (m[2])
            ;    // matched a "b"
    }
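
The same loop can be sketched with modern Phobos `std.regex`, where `matchAll` stands in for repeated `exec()` calls with the "g" attribute. This is an illustration only, not the 2002 API: the `Token` struct, the function name `scan` and the two alternatives are assumptions. Unlike a position-anchored match, this form silently skips any input that matches neither alternative.

```d
import std.regex : matchAll, regex;

struct Token
{
    string kind;
    string text;
}

/// Collects every identifier or number in `text`, typed by the group
/// that took part in each match.
Token[] scan(string text)
{
    auto re = regex(`([A-Za-z_]\w*)|([0-9]+)`);
    Token[] result;
    foreach (m; matchAll(text, re))
    {
        if (m[1].length)
            result ~= Token("identifier", m[1]); // first alternative matched
        else
            result ~= Token("number", m[2]);     // second alternative matched
    }
    return result;
}

unittest
{
    auto toks = scan("foo bar123 456 baz");
    assert(toks.length == 4);
    assert(toks[1] == Token("identifier", "bar123"));
    assert(toks[2] == Token("number", "456"));
}
```

Each search resumes after the previous match, so the "123" inside "bar123" is consumed as part of the identifier and never reported as a number.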
Feb 11 2002
parent Karl Bochert <kbochert ix.netcom.com> writes:
On Mon, 11 Feb 2002 14:57:58 -0800, "Walter" <walter digitalmars.com> wrote:
 
 "Pavel Minayev" <evilone omen.ru> wrote in message
 news:a485eh$2rbg$1 digitaldaemon.com...
 char[][] is the list of tokens, or, to be more exact, the list of their
 _values_. But how do I know their _types_ (string or number or ..)?
 Suppose the regexp was:

     ([A-Za-z_]+|0-9+)

 And I get 10 tokens. How do I tell if the first matched [A-Za-z_]+ part
 or the 0-9+ part, without checking it separately (which results in two
 checks per token)?

 You can tell which parenthesized subexpression matched by checking to see
 which index it was in:

     char[][] m;
     r = new RegExp("(a)|(b)", "g");    // search for "a" or "b"
     while ((m = r.exec("a b and a b")) != null)
     {
         if (m[1])
             ;    // matched an "a"
         else if (m[2])
             ;    // matched a "b"
     }

Or:

    m = r.exec (...);
    switch (m.length) {
        case 0:    // no match
        case 2:    // matched 'a'
        case 3:    //matched 'b'
    ...

???
Feb 11 2002