digitalmars.D - Is str ~ regex the root of all evil, or the leaf of all good?

Andrei Alexandrescu (45/45) Feb 18 2009 I'm almost done rewriting the regular expression engine, and some pretty...

Bill Baxter (13/56) Feb 18 2009 No. ~ means matching in Perl. In D it means concatenation. This

BCS (2/6) Feb 19 2009 vote += lots; // I had the same thought as well

Daniel Keep (14/22) Feb 19 2009 If a regex represents a set of strings, then wouldn't

Michel Fortin (17/33) Feb 19 2009 That seems reasonable, although if we support it it shouldn't be

bearophile (5/13) Feb 19 2009 I agree, I have said the same thing regarding splitter()/xsplitter().

BCS (3/19) Feb 19 2009 If the overhead of regex(string) is small enough (vs having it in the fi...

Andrei Alexandrescu (3/24) Feb 19 2009 The overhead is low because the last few used regexes are cached.

Andrei Alexandrescu (11/50) Feb 19 2009 Well I'm a bit unhappy about that one. At least in current D and to

Lionello Lunesu (12/26) Feb 19 2009 At least, "in" refers to a look-up, whereas "~" refers to concatenation,...

Leandro Lucarella (18/42) Feb 19 2009 [snip]

Andrei Alexandrescu (5/38) Feb 19 2009 Yah, but since even bearophile admitted python kinda botched regexes, I

Denis Koroskin (2/48) Feb 19 2009 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b...

Christopher Wright (6/9) Feb 19 2009 This isn't so good for two reasons.

Denis Koroskin (6/15) Feb 19 2009 auto re = regex("a[b-e]", "g");

Simen Kjaeraas (7/24) Feb 19 2009 This:

bearophile (6/9) Feb 19 2009 D has operator overload, that Java lacks, but this fact doesn't force yo...

Daniel Keep (7/20) Feb 19 2009 But it doesn't, and I can't see how it could given how confusing it

Christopher Wright (7/26) Feb 19 2009 Your first example was:

Andrei Alexandrescu (3/32) Feb 19 2009 Why is it problematic? Is the name "match" too common?

Christopher Wright (6/19) Feb 20 2009 No. What is the difference between those two? One is building a regex

Brian (4/7) Feb 19 2009 i dont see a problem either with just using .match. if you use spaces

Max Samukha (10/54) Feb 19 2009 Please anything but ~. It would be fine, if it didn't follow an array.
bearophile (44/45) Feb 19 2009 I like the following syntaxes (the one with .match() too):

Max Samukha (4/7) Feb 19 2009 I don't like 'sub' because it can denote anything. The most confusing

Andrei Alexandrescu (3/13) Feb 19 2009 Ok.

Andrei Alexandrescu (11/72) Feb 19 2009 These all put the regex before the string, something many people would

Derek Parnell (5/7) Feb 19 2009 I don't. To me the regex is what you are looking for so it's like saying

Andrei Alexandrescu (32/39) Feb 19 2009 Yah, but to most others it's "match this string against that pattern".

Andrei Alexandrescu (3/15) Feb 19 2009 ... "I'm not thrilled about".
Derek Parnell (9/26) Feb 19 2009 I use the Euphoria language a lot, and its routine API is find(needle,
Sergey Gromov (6/18) Feb 20 2009 I think calling a regex a 'haystack' is a far-fetched metaphor. A

Bill Baxter (13/31) Feb 20 2009 I thought so too. It's a stretch. And I also agree with the various

Denis Koroskin (4/67) Feb 19 2009 Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might ...

Andrei Alexandrescu (6/77) Feb 19 2009 I got disabused a very long time ago of the notion that everything about...
Lionello Lunesu (5/76) Feb 19 2009 I think it's worth an overload! (I also keep forgetting those flags.)
Benji Smith (3/11) Feb 19 2009 I prefer the enum options too. But not vociferously. I could live with

Daniel Keep (25/40) Feb 19 2009 I dislike enum options because it dramatically bloats the code, in terms

Jarrett Billingsley (9/18) Feb 19 2009 While we're on the subject I'd like to mention that an unbelievably

Andrei Alexandrescu (6/27) Feb 19 2009 Well I agree for searches but not for substitutions.

Don (13/73) Feb 19 2009 I agree with the comments against ~.

Andrei Alexandrescu (7/85) Feb 19 2009 At the moment these are not supported. It's a good question.

Michel Fortin (29/34) Feb 19 2009 I don't like `sub`, I mean the name. Makes me think of substring more

Andrei Alexandrescu (28/68) Feb 19 2009 Ok. Probably subex is a bit of a killer, but I see your point (subex is
Daniel Keep (4/12) Feb 19 2009 Didn't D previously have special regex literals that got dropped for

bearophile (7/10) Feb 19 2009 I think that using "in" into foreach() leads to less bugs, because it's ...

Jarrett Billingsley (12/17) Feb 19 2009 The semicolon does not introduce bugs. If you don't have a semicolon,

bearophile (20/27) Feb 19 2009 Wikipedia agrees with me:

Ary Borenszweig (8/57) Feb 19 2009 Why would you do that?

Jarrett Billingsley (6/9) Feb 19 2009 Andrei did a pretty good job of explaining indirectly why it's wrong ;)

Ary Borenszweig (2/14) Feb 19 2009 I think you're right.

Bill Baxter (4/21) Feb 19 2009 And yet sometimes that is exactly the case.
Jarrett Billingsley (2/3) Feb 19 2009 Perhaps I'm just too absolute ;)

Andrei Alexandrescu (5/40) Feb 19 2009 Not confusing me. I'll note that if "in" were used, you could write:

Jarrett Billingsley (32/51) Feb 19 2009 The representation of a program is separate from its semantics, and

Andrei Alexandrescu (36/42) Feb 19 2009 I agree. One thing that ranges still don't address is binding multiple

Derek Parnell (10/12) Feb 19 2009 Of course its proper English.

Jarrett Billingsley (5/13) Feb 19 2009 See, it's funny, because I would think "I don't think I like X" means

Derek Parnell (6/24) Feb 19 2009 Maybe is a difference between American English and Australian English?

Jarrett Billingsley (2/7) Feb 19 2009 You might have something there.

Nick Sabalausky (4/22) Feb 19 2009 I think I don't like "I think I don't like X".

Christopher Wright (4/6) Feb 19 2009 Not so. That is the only *reasonable* interpretation, but the person

bearophile (26/33) Feb 19 2009 programming.<

Andrei Alexandrescu (5/11) Feb 19 2009 How can anyone think they don't like something? You like it or not, but

Derek Parnell (11/22) Feb 19 2009 It is not a question of whether one likes or doesn't like; this expressi...

Andrei Alexandrescu (9/26) Feb 19 2009 I see. Me, I always use "think" to evoke an actual thinking process.

Andrei Alexandrescu (15/64) Feb 19 2009 Excellent idea. Let's see:

bearophile (26/37) Feb 19 2009 Thank you for all your work and the will to answer the posts here.

Andrei Alexandrescu (10/24) Feb 19 2009 It's the regex engine that has generated the match. I coded that wrong

bearophile (4/12) Feb 19 2009 Well, then match() may return just a dynamic array of such groups/captur...

Andrei Alexandrescu (11/28) Feb 19 2009 Looks simple but it isn't. How do you advance to the next match?

jovo (5/15) Feb 19 2009 foreach(capture; match(s, r))

Andrei Alexandrescu (14/39) Feb 19 2009 The consecrated terminology is:

jovo (4/17) Feb 19 2009 I think you must answer this question more generally, same for all libra...

Andrei Alexandrescu (8/34) Feb 19 2009 I'd hate to fall again into the fallacy of trying to appease everyone's

Bill Baxter (8/8) Feb 19 2009 I don't like the syntax I saw somewhere earlier in the thread of

Denis Koroskin (2/10) Feb 19 2009 Agree. I thought that iter.captures is a set (range) of captures.

Andrei Alexandrescu (3/19) Feb 19 2009 I'm done implementing that.

KennyTM~ (2/13) Feb 19 2009 iter.count

Bill Baxter (5/19) Feb 19 2009 Maybe I haven't paid close enough attention here, but I think the

Denis Koroskin (12/35) Feb 19 2009 On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu

Andrei Alexandrescu (4/49) Feb 19 2009 They're good. The code I posted was dumb. The "engine" thing does not

Leandro Lucarella (16/38) Feb 19 2009 BTW, why are the flags passed as string and not as an integer mask? For
Benji Smith (53/53) Feb 19 2009 Some of the things I'd like to see in the regex implementation:

bearophile (7/11) Feb 20 2009 See:

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I'm almost done rewriting the regular expression engine, and some pretty 
interesting things have transpired.

First, I separated the engine into two parts, one that is the actual 
regular expression engine, and the other that is the state of the match 
with some particular input. The previous code combined the two into a 
huge class. The engine (written by Walter) translates the regex string 
into a bytecode-compiled form. Given that there is a deterministic 
correspondence between the regex string and the bytecode, the Regex 
engine object is in fact invariant and cached by the implementation. 
Caching makes for significant time savings even if e.g. the user 
repeatedly creates a regular expression engine in a loop.
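To make the caching idea concrete, here is a minimal sketch of a cache keyed on the pattern and flags. It is not the implementation described in this post: `cachedRegex` and `compileCount` are hypothetical names, the cache never evicts anything, and it relies only on `regex()` and `Regex!char` from the `std.regex` module in current Phobos.

```d
import std.regex : Regex, regex;

/// Counts how often a pattern really had to be compiled (illustration only).
size_t compileCount;

/// Returns a compiled regex, compiling each (pattern, flags) pair only once
/// per thread. Hypothetical helper, not part of std.regex.
Regex!char cachedRegex(string pattern, string flags = "")
{
    static Regex!char[string] cache; // thread-local, starts out empty

    // Join pattern and flags with a separator so distinct pairs stay distinct.
    immutable key = pattern ~ '\0' ~ flags;
    if (auto hit = key in cache)
        return *hit;

    ++compileCount;
    auto compiled = regex(pattern, flags);
    cache[key] = compiled;
    return compiled;
}
```

Calling `cachedRegex("a[b-e]", "g")` inside a loop then pays the compilation cost only on the first iteration.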

In contrast, the match state depends on the input string. I defined it 
to implement the range interface, so you can either inspect it directly 
or iterate it for all matches (if the "g" option was passed to the engine).

The new codebase works with char, wchar, and dchar and any random-access 
range as input (forward ranges to come, and at some point in the future 
input ranges as well). In spite of the added flexibility, the code size 
has shrunk from 3396 lines to 2912 lines. I plan to add support for 
binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
from regexes) and also, probably unprecedented, support for arbitrary 
types such as integers, floating point numbers, structs, what have you. 
Any type that supports comparison and ranges is a good candidate for 
regular expression matching. I'm not sure how regular expression 
matching can be harnessed e.g. over arrays of int, but I suspect some 
pretty cool applications are just around the corner. We can introduce 
that generalization without adding complexity and there is nothing in 
principle opposed to it.

The interface is very simple, mainly consisting of the functions 
regex(), match(), and sub(), e.g.

foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
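For readers who want to try something close to the loop and substitution above, the following sketch uses the `std.regex` module that ships in current Phobos rather than the API proposed in this post. There, `matchAll` and `replaceAll` play the role of `match` and `sub` with the "g" flag. The helper names `describeMatches` and `capitalizeHits` are made up for illustration.

```d
import std.regex : matchAll, regex, replaceAll;

/// Returns one line per match, with the hit bracketed between the text
/// before it and the text after it. Hypothetical helper.
string[] describeMatches(string input, string pattern)
{
    string[] lines;
    foreach (m; matchAll(input, regex(pattern)))
        lines ~= m.pre ~ "[" ~ m.hit ~ "]" ~ m.post;
    return lines;
}

/// Rewrites every match of a([b-e]) as "A" followed by the captured letter.
string capitalizeHits(string input)
{
    return replaceAll(input, regex("a([b-e])"), "A$1");
}
```

With the input "abracazoo", the pattern "a[b-e]" hits "ab" and then "ac", so `describeMatches` yields two lines and `capitalizeHits` returns "AbrAcazoo".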

Two other syntactic options are available:

"abracazoo".match(regex("a[b-e]", "g"))
"abracazoo".match("a[b-e]", "g")

I could have made match a member of regex:

regex("a[b-e]", "g").match("abracazoo")

but most regex code I've seen mentions the string first and the regex 
second. So I dropped that idea.

Now, match() is likely to be called very often so I'm considering:

foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
     writeln(e);

In general I'm wary of unwitting operator overloading, but I think this 
case is more justified than others. Thoughts?


Andrei

Feb 18 2009

Bill Baxter <wbaxter gmail.com> writes:

On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I'm almost done rewriting the regular expression engine, and some pretty
 interesting things have transpired.

 First, I separated the engine into two parts, one that is the actual regular
 expression engine, and the other that is the state of the match with some
 particular input. The previous code combined the two into a huge class. The
 engine (written by Walter) translates the regex string into a
 bytecode-compiled form. Given that there is a deterministic correspondence
 between the regex string and the bytecode, the Regex engine object is in
 fact invariant and cached by the implementation. Caching makes for
 significant time savings even if e.g. the user repeatedly creates a regular
 expression engine in a loop.

 In contrast, the match state depends on the input string. I defined it to
 implement the range interface, so you can either inspect it directly or
 iterate it for all matches (if the "g" option was passed to the engine).

 The new codebase works with char, wchar, and dchar and any random-access
 range as input (forward ranges to come, and at some point in the future
 input ranges as well). In spite of the added flexibility, the code size has
 shrunk from 3396 lines to 2912 lines. I plan to add support for binary data
 (e.g. ubyte - handling binary file formats can benefit a LOT from regexes)
 and also, probably unprecedented, support for arbitrary types such as
 integers, floating point numbers, structs, what have you. any type that
 supports comparison and ranges is a good candidate for regular expression
 matching. I'm not sure how regular expression matching can be harnessed e.g.
 over arrays of int, but I suspect some pretty cool applications are just
 around the corner. We can introduce that generalization without adding
 complexity and there is nothing in principle opposed to it.

 The interface is very simple, mainly consisting of the functions regex(),
 match(), and sub(), e.g.

 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
    writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

 Two other syntactic options are available:

 "abracazoo".match(regex("a[b-e]", "g")))
 "abracazoo".match("a[b-e]", "g")

 I could have made match a member of regex:

 regex("a[b-e]", "g")).match("abracazoo")

 but most regex code I've seen mentions the string first and the regex
 second. So I dropped that idea.

 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
    writeln(e);

 In general I'm weary of unwitting operator overloading, but I think this
 case is more justified than others. Thoughts?

No.  ~ means matching in Perl.  In D it means concatenation.  This
special case is not special enough to warrant breaking D's convention,
in my opinion.  It also breaks D's convention that operators have an
inherent meaning which shouldn't be subverted to do unrelated things.
What about turning it around and using 'in' though?

   foreach(e; regex("a[b-e]", "g") in "abracazoo")
      writeln(e);

The charter for "in" isn't quite as focused as that for ~, and anyway
you could view this as finding instances of the regular expression
"in" the string.

--bb

Feb 18 2009

BCS <none anon.com> writes:

Hello Bill,

 What about turning it around and using 'in' though?
 
 foreach(e; regex("a[b-e]", "g") in "abracazoo")
 writeln(e);


vote += lots; // I had the same thought as well

Feb 19 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

BCS wrote:
 Hello Bill,
 
 What about turning it around and using 'in' though?

 foreach(e; regex("a[b-e]", "g") in "abracazoo")
 writeln(e);

 
 vote += lots; // I had the same thought as well

If a regex represents a set of strings, then wouldn't

  "abracazoo" in regex("a[b-e]", "g")

make more sense?  Of course, that's "match" semantics; if you turn it
around and say that you're looking for elements from the set in the
string, then it's

  regex("a[b-e]", "g") in "abracazoo"

Hmm...

Nonetheless, I do prefer the 'in' syntax over the '~' syntax.  Please
let's not go down the road of co-opting operators to do things other than
what they're designed for.

If you REALLY want a custom operator, you could always convince Walter
to let us define infix functions using Unicode characters.  :D

  -- Daniel

Feb 19 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter gmail.com> said:

 On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
 In general I'm weary of unwitting operator overloading, but I think this
 case is more justified than others. Thoughts?

 
 No.  ~ means matching in Perl.  In D it means concatenation.  This
 special case is not special enough to warrant breaking D's convention,
 in my opinion.  It also breaks D's convention that operators have an
 inherent meaning which shouldn't be subverted to do unrelated things.

Indeed. That's why I don't like seeing `~` here.


 What about turning it around and using 'in' though?
 
    foreach(e; regex("a[b-e]", "g") in "abracazoo")
       writeln(e);
 
 The charter for "in" isn't quite as focused as that for ~, and anyway
 you could view this as finding instances of the regular expression
 "in" the string.

That seems reasonable, although if we support it it shouldn't be 
limited to regular expressions for coherency reasons. For instance:

	foreach(e; "co" in "conoco")
		writeln(e);

should work too. If we can't make that work in the most simple case, 
then I'd say it shouldn't with the more complicated ones either.

By the way, regular expressions should work everywhere where we can 
search for a string. For instance (from std.string):

	auto firstMatchIndex = find("conoco", "co");

should work with a regex too:

	auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Michel Fortin:
 	foreach(e; "co" in "conoco")
 		writeln(e);
 should work too.

Of course :-) I think eventually it will work, it's handy and natural.


 By the way, regular expressions should work everywhere where we can 
 search for a string. For instance (from std.string):
 	auto firstMatchIndex = find("conoco", "co");
 should work with a regex too:
 	auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

I agree, I have said the same thing regarding splitter()/xsplitter().

Bye,
bearophile

Feb 19 2009

BCS <ao pathlink.com> writes:

Reply to bearophile,

 Michel Fortin:
 
 foreach(e; "co" in "conoco")
 writeln(e);
 should work too.

 Of course :-) I think eventually it will work, it's handy and natural.
 
 By the way, regular expressions should work everywhere where we can
 search for a string. For instance (from std.string):
 auto firstMatchIndex = find("conoco", "co");
 should work with a regex too:
 auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

 I agree, I have said the same thing regarding splitter()/xsplitter().
 
 Bye,
 bearophile

If the overhead of regex(string) is small enough (vs having it in the
find/split function) I'd go with overloads rather than different names.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to bearophile,
 
 Michel Fortin:

 foreach(e; "co" in "conoco")
 writeln(e);
 should work too.

 Of course :-) I think eventually it will work, it's handy and natural.

 By the way, regular expressions should work everywhere where we can
 search for a string. For instance (from std.string):
 auto firstMatchIndex = find("conoco", "co");
 should work with a regex too:
 auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

 I agree, I have said the same thing regarding splitter()/xsplitter().

 Bye,
 bearophile

 
 If the overhead of regex(string) is small enough (vs having it in the 
 find/split function) I'd go with overloads rather than different names.

The overhead is low because the last few used regexes are cached.

Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michel Fortin wrote:
 On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter gmail.com> said:
 
 On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
 In general I'm weary of unwitting operator overloading, but I think this
 case is more justified than others. Thoughts?

 No.  ~ means matching in Perl.  In D it means concatenation.  This
 special case is not special enough to warrant breaking D's convention,
 in my opinion.  It also breaks D's convention that operators have an
 inherent meaning which shouldn't be subverted to do unrelated things.

 
 Indeed. That's why I don't like seeing `~` here.
 
 
 What about turning it around and using 'in' though?

    foreach(e; regex("a[b-e]", "g") in "abracazoo")
       writeln(e);

 The charter for "in" isn't quite as focused as that for ~, and anyway
 you could view this as finding instances of the regular expression
 "in" the string.

 
 That seems reasonable, although if we support it it shouldn't be limited 
 to regular expressions for coherency reasons. For instance:
 
     foreach(e; "co" in "conoco")
         writeln(e);
 
 should work too. If we can't make that work in the most simple case, 
 then I'd say it shouldn't with the more complicated ones either.

Well I'm a bit unhappy about that one. At least in current D and to 
yours truly, "in" means "fast membership lookup". The use above is 
linear lookup. I'm not saying that's bad, but I prefer the non-diluted 
semantics. For linear search, there's always find().
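The distinction drawn here can be sketched in a few lines: `in` on an associative array is a hashed lookup, while `find` from `std.algorithm` scans the haystack linearly. The function names `isMember` and `firstOccurrence` are hypothetical and exist only for this illustration.

```d
import std.algorithm.searching : find;

/// Hashed membership test: this is what `in` does on an associative array.
bool isMember(const bool[string] set, string key)
{
    return (key in set) !is null;
}

/// Linear scan: returns the rest of the haystack starting at the first
/// occurrence of needle, or an empty string when there is none.
string firstOccurrence(string haystack, string needle)
{
    return find(haystack, needle);
}
```

For example, `firstOccurrence("conoco", "no")` walks the string from the front and returns "noco", whereas `isMember` never inspects more than the bucket the key hashes to.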

 By the way, regular expressions should work everywhere where we can 
 search for a string. For instance (from std.string):
 
     auto firstMatchIndex = find("conoco", "co");
 
 should work with a regex too:
 
     auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

If you mean typeof(firstMatchIndex) to be size_t, that's unlikely to be 
enough. When looking for a regular expression, you need more than just 
an index - you need captures, "pre" and "post" substrings, the works. 
That's why matching a string against a regex must return a richer 
structure that can't be easily integrated with std.algorithm.
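As an illustration of why a bare index is not enough, the sketch below packs what a single regex match typically carries into a struct. It uses `matchFirst` from the `std.regex` module in current Phobos, not the API under discussion in this thread. `MatchInfo` and `firstMatch` are hypothetical names, and the pattern is assumed to contain at least one parenthesized group.

```d
import std.regex : matchFirst, regex;

/// Hypothetical summary of a single match.
struct MatchInfo
{
    bool found;
    size_t index;   // offset of the hit within the input
    string hit;     // the whole matched text
    string capture; // first parenthesized group
    string post;    // everything after the hit
}

/// Finds the first match of pattern in input. The pattern must contain
/// at least one capture group, since group 1 is read unconditionally.
MatchInfo firstMatch(string input, string pattern)
{
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return MatchInfo(false);
    // The index is recovered from the length of the text before the hit.
    return MatchInfo(true, m.pre.length, m.hit, m[1], m.post);
}
```

The index is just one field among several here; a `find` overload returning only `size_t` would discard the rest.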


Andrei

Feb 19 2009

Lionello Lunesu <lio lunesu.remove.com> writes:

Andrei Alexandrescu wrote:
 Michel Fortin wrote:
 That seems reasonable, although if we support it it shouldn't be 
 limited to regular expressions for coherency reasons. For instance:

     foreach(e; "co" in "conoco")
         writeln(e);

 should work too. If we can't make that work in the most simple case, 
 then I'd say it shouldn't with the more complicated ones either.

 
 Well I'm a bit unhappy about that one. At least in current D and to 
 yours truly, "in" means "fast membership lookup". The use above is 
 linear lookup. I'm not saying that's bad, but I prefer the non-diluted 
 semantics. For linear search, there's always find().

At least, "in" refers to a look-up, whereas "~" refers to concatenation, 
which has nothing in common with the regex matching.

Furthermore, we can't make any complexity guarantees for operators; this 
always depends on the data structure you use the operator on. And, if 
I'm not mistaken, "in" is only used by the associative array at the 
moment. It's a "fast look-up" because of the associative array, but it 
doesn't have to be.

(Similarly, to me, ~ and ~= feel slow, O(n), but that shouldn't keep us 
from using it with other data structures that can do a similar 
concat/append operation with lower complexity.)

L.

Feb 19 2009

Leandro Lucarella <llucax gmail.com> writes:

Bill Baxter, on February 19 at 14:50 you wrote to me:
[snip]
 regex("a[b-e]", "g")).match("abracazoo")

 but most regex code I've seen mentions the string first and the regex
 second. So I dropped that idea.


[snip]
 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
    writeln(e);

 In general I'm weary of unwitting operator overloading, but I think this
 case is more justified than others. Thoughts?

 
 No.  ~ means matching in Perl.  In D it means concatenation.  This
 special case is not special enough to warrant breaking D's convention,
 in my opinion.  It also breaks D's convention that operators have an
 inherent meaning which shouldn't be subverted to do unrelated things.
 What about turning it around and using 'in' though?
 
    foreach(e; regex("a[b-e]", "g") in "abracazoo")
       writeln(e);
 
 The charter for "in" isn't quite as focused as that for ~, and anyway
 you could view this as finding instances of the regular expression
 "in" the string.

I think match is pretty short; I don't see any need for a shortcut which
makes the code more obscure.

BTW, in case Andrei was looking for a precedent, Python uses a syntax
like:
regex("a[b-e]", "g").match("abracazoo")

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Leandro Lucarella wrote:
 Bill Baxter, on February 19 at 14:50 you wrote to me:
 [snip]
 regex("a[b-e]", "g")).match("abracazoo")

 but most regex code I've seen mentions the string first and the regex
 second. So I dropped that idea.


 [snip]
 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
    writeln(e);

 In general I'm weary of unwitting operator overloading, but I think this
 case is more justified than others. Thoughts?

 No.  ~ means matching in Perl.  In D it means concatenation.  This
 special case is not special enough to warrant breaking D's convention,
 in my opinion.  It also breaks D's convention that operators have an
 inherent meaning which shouldn't be subverted to do unrelated things.
 What about turning it around and using 'in' though?

    foreach(e; regex("a[b-e]", "g") in "abracazoo")
       writeln(e);

 The charter for "in" isn't quite as focused as that for ~, and anyway
 you could view this as finding instances of the regular expression
 "in" the string.

 
 I think match is pretty short, I don't see any need for any shortcut wich
 makes the code more obscure.
 
 BTW, in case Andrei was looking for a precedent, Python uses the syntax
 like:
 regex("a[b-e]", "g")).match("abracazoo")

Yah, but since even bearophile admitted python kinda botched regexes, I 
better not consider this argument :o). The Unix toolchain invariably 
puts the string before the regex.

Andrei

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I'm almost done rewriting the regular expression engine, and some pretty  
 interesting things have transpired.

 First, I separated the engine into two parts, one that is the actual  
 regular expression engine, and the other that is the state of the match  
 with some particular input. The previous code combined the two into a  
 huge class. The engine (written by Walter) translates the regex string  
 into a bytecode-compiled form. Given that there is a deterministic  
 correspondence between the regex string and the bytecode, the Regex  
 engine object is in fact invariant and cached by the implementation.  
 Caching makes for significant time savings even if e.g. the user  
 repeatedly creates a regular expression engine in a loop.

 In contrast, the match state depends on the input string. I defined it  
 to implement the range interface, so you can either inspect it directly  
 or iterate it for all matches (if the "g" option was passed to the  
 engine).

 The new codebase works with char, wchar, and dchar and any random-access  
 range as input (forward ranges to come, and at some point in the future  
 input ranges as well). In spite of the added flexibility, the code size  
 has shrunk from 3396 lines to 2912 lines. I plan to add support for  
 binary data (e.g. ubyte - handling binary file formats can benefit a LOT  
 from regexes) and also, probably unprecedented, support for arbitrary  
 types such as integers, floating point numbers, structs, what have you.  
 any type that supports comparison and ranges is a good candidate for  
 regular expression matching. I'm not sure how regular expression  
 matching can be harnessed e.g. over arrays of int, but I suspect some  
 pretty cool applications are just around the corner. We can introduce  
 that generalization without adding complexity and there is nothing in  
 principle opposed to it.

 The interface is very simple, mainly consisting of the functions  
 regex(), match(), and sub(), e.g.

 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
      writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

 Two other syntactic options are available:

 "abracazoo".match(regex("a[b-e]", "g")))
 "abracazoo".match("a[b-e]", "g")

 I could have made match a member of regex:

 regex("a[b-e]", "g")).match("abracazoo")

 but most regex code I've seen mentions the string first and the regex  
 second. So I dropped that idea.

 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
      writeln(e);

 In general I'm weary of unwitting operator overloading, but I think this  
 case is more justified than others. Thoughts?


 Andrei

"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]",
"g") but doesn't break existing conventions. I prefer it over the '~' version.
'in' is also fine (both ways).

Feb 19 2009

Christopher Wright <dhasenan gmail.com> writes:

Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ 
 regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over 
 '~' version. In is also fine (both ways).

This isn't so good for two reasons.
First, I can't reuse regexes in your way, so if there is any expensive 
initialization, that is duplicated.

Second, I can't reuse regexes in your way, so I have to use a pair of 
string constants.

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright  
<dhasenan gmail.com> wrote:

 Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~  
 regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over  
 '~' version. In is also fine (both ways).

 This isn't so good for two reasons.
 First, I can't reuse regexes in your way, so if there is any expensive  
 initialization, that is duplicated.

 Second, I can't reuse regexes in your way, so I have to use a pair of  
 string constants.

auto re = regex("a[b-e]", "g");
foreach (e; "abracazoo".match(re)) {
     // what's wrong with that?
}

Feb 19 2009

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On Thu, 19 Feb 2009 14:34:30 +0100, Denis Koroskin <2korden gmail.com> wrote:

 On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright  
 <dhasenan gmail.com> wrote:

 Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~  
 regex("a[b-e]", "g") but doesn't existing conventions. I prefer it  
 over '~' version. In is also fine (both ways).

 This isn't so good for two reasons.
 First, I can't reuse regexes in your way, so if there is any expensive  
 initialization, that is duplicated.

 Second, I can't reuse regexes in your way, so I have to use a pair of  
 string constants.

 auto re = regex("a[b-e]", "g");
 foreach (e; "abracazoo".match(re)) {
      // what's wrong with that?
 }

This:

auto re = regex("a[b-e]", "g");
foreach (e; "abracazoo" / re) {
}

--
Simen

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Simen Kjaeraas:
 auto re = regex("a[b-e]", "g");
 foreach (e; "abracazoo" / re) {
 }

D has operator overloading, which Java lacks, but that doesn't force you to
use it even when the result is unreadable.

For people who like the "in" there, I'd like to point out how it could look
once (if it ever happens) foreach uses "in" too:

foreach (e in (re in "abracazoo")) {...}

Bye,
bearophile

Feb 19 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

bearophile wrote:
 Simen Kjaeraas:
 auto re = regex("a[b-e]", "g");
 foreach (e; "abracazoo" / re) {
 }

 
 D has operator overload, that Java lacks, but this fact doesn't force you to
use them even when they are unreadable.
 
 For people that like the "in" there I'd like to remind how it can look once
(if it will ever happen) the foreach uses "in" too:
 
 foreach (e in (re in "abracazoo")) {...}
 
 Bye,
 bearophile

But it doesn't, and I can't see how it could given how confusing it
would make things.  Besides which, we shouldn't be making judgements
based on possible, not planned for syntax changes at some unspecified
point in the future.

We have enough trouble with deciding on things as it is. :P

  -- Daniel

Feb 19 2009

Christopher Wright <dhasenan gmail.com> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright 
 <dhasenan gmail.com> wrote:
 
 Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ 
 regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it 
 over '~' version. 'in' is also fine (both ways).

 This isn't so good for two reasons.
 First, I can't reuse regexes in your way, so if there is any expensive 
 initialization, that is duplicated.

 Second, I can't reuse regexes in your way, so I have to use a pair of 
 string constants.

 
 auto re = regex("a[b-e]", "g");
 foreach (e; "abracazoo".match(re)) {
     // what's wrong with that?
 }

Your first example was:
auto match (char[] source, char[] pattern, char[] options);

Your second example was:
auto match (char[] source, regex expression);

The second is good, but more typing than you said originally. The first 
is problematic.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Christopher Wright wrote:
 Denis Koroskin wrote:
 On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright 
 <dhasenan gmail.com> wrote:

 Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ 
 regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it 
 over '~' version. 'in' is also fine (both ways).

 This isn't so good for two reasons.
 First, I can't reuse regexes in your way, so if there is any 
 expensive initialization, that is duplicated.

 Second, I can't reuse regexes in your way, so I have to use a pair of 
 string constants.

 auto re = regex("a[b-e]", "g");
 foreach (e; "abracazoo".match(re)) {
     // what's wrong with that?
 }

 
 Your first example was:
 auto match (char[] source, char[] pattern, char[] options);
 
 Your second example was:
 auto match (char[] source, regex expression);
 
 The second is good, but more typing than you said originally. The first 
 is problematic.

Why is it problematic? Is the name "match" too common?

Andrei

Feb 19 2009

Christopher Wright <dhasenan gmail.com> writes:

Andrei Alexandrescu wrote:
 Christopher Wright wrote:
 Your first example was:
 auto match (char[] source, char[] pattern, char[] options);

 Your second example was:
 auto match (char[] source, regex expression);

 The second is good, but more typing than you said originally. The 
 first is problematic.

 
 Why is it problematic? Is the name "match" too common?
 
 Andrei

No. What is the difference between those two? One is building a regex 
internally and not letting me store it; and it forces me to pass two 
parameters rather than one. The other takes a regex as a parameter, 
which I can store, and I can set both options (the pattern and the match 
options) once for all.
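
To make the distinction concrete, here is a minimal sketch written against present-day std.regex rather than the prototype discussed in this thread. The use of matchAll and the helper name allHits are assumptions and are not part of the original proposal. The regex is built once, with its pattern and options fixed at that point, and the stored object is then reused for several inputs.

```d
import std.regex;
import std.stdio : writeln;

/// Hypothetical helper: collects every hit of an already-compiled regex.
string[] allHits(string input, Regex!char re)
{
    string[] hits;
    foreach (m; matchAll(input, re))
        hits ~= m.hit;
    return hits;
}

void main()
{
    // Pattern and options are set once, here.
    auto re = regex("a[b-e]", "g");

    // The stored regex is then reused for every input.
    foreach (input; ["abracazoo", "cadabra"])
        writeln(input, ": ", allHits(input, re));
}
```

With the (source, pattern, options) overload, both strings would have to be repeated at every call site instead.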

Feb 20 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I'm almost done rewriting the regular expression engine, and some pretty  
 interesting things have transpired.

 First, I separated the engine into two parts, one that is the actual  
 regular expression engine, and the other that is the state of the match  
 with some particular input. The previous code combined the two into a  
 huge class. The engine (written by Walter) translates the regex string  
 into a bytecode-compiled form. Given that there is a deterministic  
 correspondence between the regex string and the bytecode, the Regex  
 engine object is in fact invariant and cached by the implementation.  
 Caching makes for significant time savings even if e.g. the user  
 repeatedly creates a regular expression engine in a loop.

 In contrast, the match state depends on the input string. I defined it  
 to implement the range interface, so you can either inspect it directly  
 or iterate it for all matches (if the "g" option was passed to the  
 engine).

 The new codebase works with char, wchar, and dchar and any random-access  
 range as input (forward ranges to come, and at some point in the future  
 input ranges as well). In spite of the added flexibility, the code size  
 has shrunk from 3396 lines to 2912 lines. I plan to add support for  
 binary data (e.g. ubyte - handling binary file formats can benefit a LOT  
 from regexes) and also, probably unprecedented, support for arbitrary  
 types such as integers, floating point numbers, structs, what have you.  
 any type that supports comparison and ranges is a good candidate for  
 regular expression matching. I'm not sure how regular expression  
 matching can be harnessed e.g. over arrays of int, but I suspect some  
 pretty cool applications are just around the corner. We can introduce  
 that generalization without adding complexity and there is nothing in  
 principle opposed to it.

 The interface is very simple, mainly consisting of the functions  
 regex(), match(), and sub(), e.g.

 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
      writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

 Two other syntactic options are available:

 "abracazoo".match(regex("a[b-e]", "g"))
 "abracazoo".match("a[b-e]", "g")

 I could have made match a member of regex:

 regex("a[b-e]", "g").match("abracazoo")

 but most regex code I've seen mentions the string first and the regex  
 second. So I dropped that idea.

 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
      writeln(e);

 In general I'm wary of unwitting operator overloading, but I think this  
 case is more justified than others. Thoughts?


 Andrei

"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]",
"g") but doesn't break existing conventions. I prefer it over '~' version. 'in'
is also fine (both ways).

Feb 19 2009

Brian <digitalmars brianguertin.com> writes:

On Thu, 19 Feb 2009 11:31:42 +0300, Denis Koroskin wrote:
 "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~
 regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it
 over '~' version. 'in' is also fine (both ways).

i don't see a problem either with just using .match. if you use spaces 
around the ~ then it's actually two more characters plus needing to hit 
shift. certainly not as simple as .match

Feb 19 2009

Max Samukha <samukha voliacable.com.removethis> writes:

On Wed, 18 Feb 2009 21:35:20 -0800, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

I'm almost done rewriting the regular expression engine, and some pretty 
interesting things have transpired.

First, I separated the engine into two parts, one that is the actual 
regular expression engine, and the other that is the state of the match 
with some particular input. The previous code combined the two into a 
huge class. The engine (written by Walter) translates the regex string 
into a bytecode-compiled form. Given that there is a deterministic 
correspondence between the regex string and the bytecode, the Regex 
engine object is in fact invariant and cached by the implementation. 
Caching makes for significant time savings even if e.g. the user 
repeatedly creates a regular expression engine in a loop.

In contrast, the match state depends on the input string. I defined it 
to implement the range interface, so you can either inspect it directly 
or iterate it for all matches (if the "g" option was passed to the engine).

The new codebase works with char, wchar, and dchar and any random-access 
range as input (forward ranges to come, and at some point in the future 
input ranges as well). In spite of the added flexibility, the code size 
has shrunk from 3396 lines to 2912 lines. I plan to add support for 
binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
from regexes) and also, probably unprecedented, support for arbitrary 
types such as integers, floating point numbers, structs, what have you. 
any type that supports comparison and ranges is a good candidate for 
regular expression matching. I'm not sure how regular expression 
matching can be harnessed e.g. over arrays of int, but I suspect some 
pretty cool applications are just around the corner. We can introduce 
that generalization without adding complexity and there is nothing in 
principle opposed to it.

The interface is very simple, mainly consisting of the functions 
regex(), match(), and sub(), e.g.

foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

Two other syntactic options are available:

"abracazoo".match(regex("a[b-e]", "g"))
"abracazoo".match("a[b-e]", "g")

I could have made match a member of regex:

regex("a[b-e]", "g").match("abracazoo")

but most regex code I've seen mentions the string first and the regex 
second. So I dropped that idea.

Now, match() is likely to be called very often so I'm considering:

foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
     writeln(e);

In general I'm wary of unwitting operator overloading, but I think this 
case is more justified than others. Thoughts?

Please anything but ~. It would be fine, if it didn't follow an array.
It could be 'in' as Bill suggested. Or maybe /, which reminds of sed
and downs:

foreach (e; "abracazoo"/regex("a[b-e]", "g"))
     writeln(e);

If you made 'match' a member of Regex, another option would be the
opCall:

regex("a[b-e]", "g")( "abracazoo")

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

but most regex code I've seen mentions the string first and the regex second.
So I dropped that idea.<

I like the following syntaxes (the one with .match() too):

import std.re: regex;

foreach (e; regex("a[b-e]", "g") in "abracazoo")
     writeln(e);

foreach (e; regex("a[b-e]", "g").match("abracazoo"))
     writeln(e);

auto re1 = regex("a[b-e]", "g");
foreach (e; re1.match("abracazoo"))
     writeln(e);

auto re1 = regex("a[b-e]", "g");
foreach (e; re1 in "abracazoo")
     writeln(e);

----------------

I like the support of verbose regular expressions too, that ignore whitespace
and comments (for example with //...) inserted into the regex itself. This
simple thing is able to turn the messy world of regexes into programming again.

This is an example of usual RE in Python:

finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")



single-line comment syntax):

finder = re.compile(r"""









    """, flags=re.VERBOSE)

As you can see it's often very positive to indent logically those lines just
like code.
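
Until a verbose flag exists, similar readability can be approximated in D by splitting the pattern into commented string fragments. The sketch below does this for the one-line pattern shown earlier in this post. It assumes present-day std.regex (regex, matchFirst); the fragment boundaries, the comments, and the name finderPattern are illustrative guesses, not the original verbose example.

```d
import std.regex;

/// The one-line pattern, split into commented fragments.
/// Fragment boundaries and comments are illustrative only.
enum finderPattern =
      `^\s*`             // leading whitespace
    ~ `([\[\]])`         // opening bracket: [ or ]
    ~ `\s*([-+]?\d+)`    // first signed integer
    ~ `\s*,`             // comma separator
    ~ `\s*([-+]?\d+)`    // second signed integer
    ~ `\s*([\[\]])`      // closing bracket: [ or ]
    ~ `\s*$`;            // trailing whitespace

void main()
{
    auto m = matchFirst(" [ -3 , +14 ] ", regex(finderPattern));
    assert(!m.empty);
    assert(m[1] == "[" && m[2] == "-3" && m[3] == "+14" && m[4] == "]");
}
```

Because the fragments are joined at compile time, the regex engine sees exactly the same pattern string as the one-line version.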

----------------

As the other people here, I don't like the following much, it's a misleading
overload of the ~ operator:

"abracazoo" ~ regex("a[b-e]", "g")

----------------

I don't like that "g" argument much, my suggestions:

RE attributes:
"repeat", "r": Repeat over the whole input string
"ignorecase", "i": case insensitive
"multiline", "m": treat as multiple lines separated by newlines
"verbose", "v": ignores space outside [] and allows comments

----------------

If not already so, I'd like sub() to take as replacement a string or a callable.

Bye,
bearophile

Feb 19 2009

Max Samukha <samukha voliacable.com.removethis> writes:

On Thu, 19 Feb 2009 06:47:57 -0500, bearophile
<bearophileHUGS lycos.com> wrote:

If not already so, I'd like sub() to take as replacement a string or a callable.

Bye,
bearophile

I don't like 'sub' because it can denote anything. The most confusing
is substring. 'replace' seems to be better.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Max Samukha wrote:
 On Thu, 19 Feb 2009 06:47:57 -0500, bearophile
 <bearophileHUGS lycos.com> wrote:
 
 If not already so, I'd like sub() to take as replacement a string or a
callable.

 Bye,
 bearophile

 
 I don't like 'sub' because it can denote anything. The most confusing
 is substring. 'replace' seems to be better.

Ok.

Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 but most regex code I've seen mentions the string first and the regex second.
So I dropped that idea.<

 
 I like the following syntaxes (the one with .match() too):
 
 import std.re: regex;
 
 foreach (e; regex("a[b-e]", "g") in "abracazoo")
      writeln(e);
 
 foreach (e; regex("a[b-e]", "g").match("abracazoo"))
      writeln(e);
 
 auto re1 = regex("a[b-e]", "g");
 foreach (e; re1.match("abracazoo"))
      writeln(e);
 
 auto re1 = regex("a[b-e]", "g");
 foreach (e; re1 in "abracazoo")
      writeln(e);

These all put the regex before the string, something many people would 
find unsavory.

 ----------------
 
 I like the support of verbose regular expressions too, that ignore whitespace
and comments (for example with //...) inserted into the regex itself. This
simple thing is able to turn the messy world of regexes into programming again.
 
 This is an example of usual RE in Python:
 
 finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
 
 

single-line comment syntax):
 
 finder = re.compile(r"""









     """, flags=re.VERBOSE)
 
 As you can see it's often very positive to indent logically those lines just
like code.

Yah, I saw that ECMA introduced comments in regexes too. At some point 
we'll implement that.

 ----------------
 
 As the other people here, I don't like the following much, it's a misleading
overload of the ~ operator:
 
 "abracazoo" ~ regex("a[b-e]", "g")
 
 ----------------
 
 I don't like that "g" argument much, my suggestions:
 
 RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

And how do you combine them? "repeat, ignorecase"? Writing and parsing 
such options becomes a little adventure in itself. I think the "g", "i", 
and "m" flags are popular enough if you've done any amount of regex 
programming. If not, you'll look up the manual regardless.

 If not already so, I'd like sub() to take as replacement a string or a
callable.

It does, I haven't mentioned it yet. Pass-by-alias of course :o).


Andrei

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:

 These all put the regex before the string, something many people would 
 find unsavory.

I don't. To me the regex is what you are looking for so it's like saying
"find this pattern in that string". 

--
Derek Parnell

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
 
 These all put the regex before the string, something many people would 
 find unsavory.

 
 I don't. To me the regex is what you are looking for so it's like saying
 "find this pattern in that string". 

Yah, but to most others it's "match this string against that pattern". 
Again, regexes have a long history behind them. So probably we need to 
have both "find" and "match" with different order of arguments, something .

Anyway, std.algorithm defines find() like this:

find(haystack, needle)

In the least structured case, the haystack is a range and needle is 
either an element or another range. But then we can think, hey, we can 
think of efficient finds by using a more structured haystack and/or a 
more structured needle. So then:

string a = "conoco", b = "co";
// linear find
auto r1 = find(a, b[0]);
// quadratic find
auto r2 = find(a, b);
// organize a in a Boyer-Moore structure; sublinear find
auto r3 = find(boyerMoore(a), b);

I'll actually implement the above, it's pretty nice. Now the question 
is, what's the haystack and what's the needle in a regex find?

auto r3 = find("conoco", regex("c[a-z]"));

or

auto r3 = find(regex("c[a-z]"), "conoco");

?

The argument could go both ways:

"Organize the set of 2-char strings starting with 'c' and ending with 
'a' to 'z' into a structured haystack, then look for substrings of 
"conoco" in that haystack."

versus

"Given the unstructured haystack conoco, look for a structured needle in 
it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."

What is the most natural way?
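
For reference, the haystack-first order can be sketched with present-day Phobos. This is an assumption about the library as it exists now (std.algorithm.searching.find and std.regex.matchFirst), not the API under design in this thread, and the boyerMoore(a) wrapper from the example above is deliberately left out.

```d
import std.algorithm.searching : find;
import std.regex : regex, matchFirst;

void main()
{
    string haystack = "conoco";

    // Element needle: the result is the haystack from the first 'n' on.
    assert(find(haystack, 'n') == "noco");

    // Range needle: the result is the haystack from the first "oc" on.
    assert(find(haystack, "oc") == "oco");

    // Structured needle: the input comes first, the pattern second.
    auto m = matchFirst(haystack, regex("n[a-z]"));
    assert(m.hit == "no");
    assert(m.pre == "co" && m.post == "co");
}
```

All three calls read as "in this haystack, look for that needle", which is the second of the two interpretations above.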


Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Andrei Alexandrescu wrote:
 Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:

 These all put the regex before the string, something many people 
 would find unsavory.

 I don't. To me the regex is what you are looking for so it's like saying
 "find this pattern in that string". 

 
 Yah, but to most others it's "match this string against that pattern". 
 Again, regexes have a long history behind them. So probably we need to 
 have both "find" and "match" with different order of arguments, something .

... "I'm not thrilled about".

Andrei

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:

 Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
 
 These all put the regex before the string, something many people would 
 find unsavory.

 
 I don't. To me the regex is what you are looking for so it's like saying
 "find this pattern in that string". 

 
 Yah, but to most others it's "match this string against that pattern". 

I might not be normal ;-)

 Again, regexes have a long history behind them. So probably we need to 
 have both "find" and "match" with different order of arguments, something .
 
 Anyway, std.algorithm defines find() like this:
 
 find(haystack, needle)

I use the Euphoria language a lot, and its routine API is find(needle,
haystack), so I'm sure this is where my normality springs from.


 What is the most natural way?

Get your "personal assistant" to do it.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

Feb 19 2009

Sergey Gromov <snake.scaly gmail.com> writes:

Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:

 The argument could go both ways:
 
 "Organize the set of 2-char strings starting with 'c' and ending with 
 'a' to 'z' into a structured haystack, then look for substrings of 
 "conoco" in that haystack."
 
 versus
 
 "Given the unstructured haystack conoco, look for a structured needle in 
 it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."
 
 What is the most natural way?

I think calling a regex a 'haystack' is a far-fetched metaphor.  A
haystack is a pile of stuff, and a needle is a precise thing you're
looking for.  I think they're unambiguous.

Also, the in operator doesn't leave you guessing whether you should put
a haystack or a needle first.

Feb 20 2009

Bill Baxter <wbaxter gmail.com> writes:

On Fri, Feb 20, 2009 at 9:59 PM, Sergey Gromov <snake.scaly gmail.com> wrote:
 Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:

 The argument could go both ways:

 "Organize the set of 2-char strings starting with 'c' and ending with
 'a' to 'z' into a structured haystack, then look for substrings of
 "conoco" in that haystack."

 versus

 "Given the unstructured haystack conoco, look for a structured needle in
 it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."

 What is the most natural way?

 I think calling a regex a 'haystack' is a far-fetched metaphor.  A
 haystack is a pile of stuff, and a needle is a precise thing you're
 looking for.  I think they're unambiguous.

I thought so too.  It's a stretch.  And I also agree with the various
posts people have made about allowing plain char[] search strings to
be interchangeable with regex'es as much as possible.  And in that
light saying you're looking for [big-string] inside of [sub-string]
just sounds ridiculous.  But a string is certainly a kind of
special-case regex that just describes a set consisting of 1 element.  So
yeh, you /could/ say you're looking for matches inside that set, but
it's quite a stretch.

 Also, the in operator doesn't leave you guessing whether you should put
 a haystack or a needle first.

That's a good point in its favor.  As long as you aren't one of these
folks who thinks "looking for matches inside a regular grammar" is
just as reasonable as "looking for a pattern inside a string".

--bb

Feb 20 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 bearophile wrote:
 Andrei Alexandrescu:

 but most regex code I've seen mentions the string first and the regex  
 second. So I dropped that idea.<

  I like the following syntaxes (the one with .match() too):
  import std.re: regex;
  foreach (e; regex("a[b-e]", "g") in "abracazoo")
      writeln(e);
  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1.match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1 in "abracazoo")
      writeln(e);

 These all put the regex before the string, something many people would  
 find unsavory.

 ----------------
  I like the support of verbose regular expressions too, that ignore  
 whitespace and comments (for example with //...) inserted into the  
 regex itself. This simple thing is able to turn the messy world of  
 regexes into programming again.
  This is an example of usual RE in Python:
  finder =  
 re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")

 single-line comment syntax):
  finder = re.compile(r"""









     """, flags=re.VERBOSE)
  As you can see it's often very positive to indent logically those  
 lines just like code.

 Yah, I saw that ECMA introduced comments in regexes too. At some point  
 we'll implement that.

 ----------------
  As the other people here, I don't like the following much, it's a  
 misleading overload of the ~ operator:
  "abracazoo" ~ regex("a[b-e]", "g")
  ----------------
  I don't like that "g" argument much, my suggestions:
  RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

 And how do you combine them? "repeat, ignorecase"? Writing and parsing  
 such options becomes a little adventure in itself. I think the "g", "i",  
 and "m" flags are popular enough if you've done any amount of regex  
 programming. If not, you'll look up the manual regardless.

Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be  
better? I don't find "gmi" immediately clear nor self-documenting.

 If not already so, I'd like sub() to take as replacement a string or a  
 callable.

 It does, I haven't mentioned it yet. Pass-by-alias of course :o).


 Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 bearophile wrote:
 Andrei Alexandrescu:

 but most regex code I've seen mentions the string first and the 
 regex second. So I dropped that idea.<

  I like the following syntaxes (the one with .match() too):
  import std.re: regex;
  foreach (e; regex("a[b-e]", "g") in "abracazoo")
      writeln(e);
  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1.match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1 in "abracazoo")
      writeln(e);

 These all put the regex before the string, something many people would 
 find unsavory.

 ----------------
  I like the support of verbose regular expressions too, that ignore 
 whitespace and comments (for example with //...) inserted into the 
 regex itself. This simple thing is able to turn the messy world of 
 regexes into programming again.
  This is an example of usual RE in Python:
  finder = 
 re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")

 Python single-line comment syntax):
  finder = re.compile(r"""









     """, flags=re.VERBOSE)
  As you can see it's often very positive to indent logically those 
 lines just like code.

 Yah, I saw that ECMA introduced comments in regexes too. At some point 
 we'll implement that.

 ----------------
  As the other people here, I don't like the following much, it's a 
 misleading overload of the ~ operator:
  "abracazoo" ~ regex("a[b-e]", "g")
  ----------------
  I don't like that "g" argument much, my suggestions:
  RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

 And how do you combine them? "repeat, ignorecase"? Writing and parsing 
 such options becomes a little adventure in itself. I think the "g", 
 "i", and "m" flags are popular enough if you've done any amount of 
 regex programming. If not, you'll look up the manual regardless.

 
 Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might 
 be better? I don't find "gmi" immediately clear nor self-documenting.

I got disabused a very long time ago of the notion that everything about 
regexes is clear or self-documenting. Really. You just get to a level of 
understanding that's appropriate for your needs. On that scale, getting 
used to "gmi" is so low, it's not even worth discussing.


Andrei

Feb 19 2009

Lionello Lunesu <lio lunesu.remove.com> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 bearophile wrote:
 Andrei Alexandrescu:

 but most regex code I've seen mentions the string first and the 
 regex second. So I dropped that idea.<

  I like the following syntaxes (the one with .match() too):
  import std.re: regex;
  foreach (e; regex("a[b-e]", "g") in "abracazoo")
      writeln(e);
  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1.match("abracazoo"))
      writeln(e);
  auto re1 = regex("a[b-e]", "g");
 foreach (e; re1 in "abracazoo")
      writeln(e);

 These all put the regex before the string, something many people would 
 find unsavory.

 ----------------
  I like the support of verbose regular expressions too, that ignore 
 whitespace and comments (for example with //...) inserted into the 
 regex itself. This simple thing is able to turn the messy world of 
 regexes into programming again.
  This is an example of usual RE in Python:
  finder = 
 re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")

 Python single-line comment syntax):
  finder = re.compile(r"""









     """, flags=re.VERBOSE)
  As you can see it's often very positive to indent logically those 
 lines just like code.

 Yah, I saw that ECMA introduced comments in regexes too. At some point 
 we'll implement that.

 ----------------
  As the other people here, I don't like the following much, it's a 
 misleading overload of the ~ operator:
  "abracazoo" ~ regex("a[b-e]", "g")
  ----------------
  I don't like that "g" argument much, my suggestions:
  RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

 And how do you combine them? "repeat, ignorecase"? Writing and parsing 
 such options becomes a little adventure in itself. I think the "g", 
 "i", and "m" flags are popular enough if you've done any amount of 
 regex programming. If not, you'll look up the manual regardless.

 
 Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might 
 be better? I don't find "gmi" immediately clear nor self-documenting.

I think it's worth an overload! (I also keep forgetting those flags.)

In fact, *the first thing* the current RegExp.compile does is convert 
the string attributes to enum flags!
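
A rough sketch of what such an overload pair could share internally is given below. The names MatchOption and parseOptions are hypothetical and are not taken from RegExp.compile; the point is only that a flag string like "gi" and an OR-ed enum value can end up as the same bit mask.

```d
/// Hypothetical option bits, following the names suggested in the thread.
enum MatchOption : uint
{
    none       = 0,
    repeat     = 1 << 0,  // "g"
    ignoreCase = 1 << 1,  // "i"
    multiline  = 1 << 2,  // "m"
}

/// Converts a flag string such as "gi" into a bit mask of MatchOption values.
uint parseOptions(string flags)
{
    uint result = MatchOption.none;
    foreach (char c; flags)
    {
        switch (c)
        {
            case 'g': result |= MatchOption.repeat;     break;
            case 'i': result |= MatchOption.ignoreCase; break;
            case 'm': result |= MatchOption.multiline;  break;
            default:  throw new Exception("unknown regex flag: " ~ c);
        }
    }
    return result;
}

void main()
{
    // Both spellings describe the same set of options.
    assert(parseOptions("gi") == (MatchOption.repeat | MatchOption.ignoreCase));
    assert(parseOptions("") == MatchOption.none);
}
```

An enum-taking overload would skip the parsing step and reject unknown options at compile time rather than at run time.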

L.

Feb 19 2009

Benji Smith <dlanguage benjismith.net> writes:

 And how do you combine them? "repeat, ignorecase"? Writing and parsing 
 such options becomes a little adventure in itself. I think the "g", 
 "i", and "m" flags are popular enough if you've done any amount of 
 regex programming. If not, you'll look up the manual regardless.

 
 Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might 
 be better? I don't find "gmi" immediately clear nor self-documenting.

I prefer the enum options too. But not vociferously. I could live with 
the single-char flags.

--benji

Feb 19 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Benji Smith wrote:
 And how do you combine them? "repeat, ignorecase"? Writing and
 parsing such options becomes a little adventure in itself. I think
 the "g", "i", and "m" flags are popular enough if you've done any
 amount of regex programming. If not, you'll look up the manual
 regardless.

 Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase);
 might be better? I don't find "gmi" immediately clear nor
 self-documenting.

 
 I prefer the enum options too. But not vociferously. I could live with
 the single-char flags.
 
 --benji

I dislike enum options because it dramatically bloats the code, in terms
of how much typing it takes.

This is another thing that Visual Basic actually got right [1]; instead of:

Match(string, "a[b-e]", Regex.Repeat | Regex.IgnoreCase)

you could use this:

Match(string, "a[b-e]", Repeat | IgnoreCase)

Since the compiler knew the type of that third argument, it allowed you
to omit the prefix.

If D did that, I would completely reverse my dislike of enums and
DEFINITELY prefer them over strings; you could always have this:

enum Regex
{
  Repeat     = 1 << 0,
  IgnoreCase = 1 << 1,
  R = Repeat,
  I = IgnoreCase,
}

And then instead of "ri" you have R|I which is actually shorter AND safer!

But yeah; I think Walter said he wanted to do this ages and ages ago,
but it never happened.

  -- Daniel


[1] It's funny how many things that poor language *did* get right.  I
mean, yeah, it's a terrible language from a design standpoint, but boy
did it ever let you just get shit done.

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

 And how do you combine them? "repeat, ignorecase"? Writing and parsing such
 options becomes a little adventure in itself. I think the "g", "i", and "m"
 flags are popular enough if you've done any amount of regex programming. If
 not, you'll look up the manual regardless.

While we're on the subject I'd like to mention that an unbelievably
overwhelming proportion of the time, when I use regexen, I want them
to be global.  As in, I don't think I've ever used a non-global regex.

To that effect I'd like to propose that either "g" be the default
attribute, or that it should be on _unless_ some other attribute ("o"
for once?) is present.  I think this is one thing that Perl got
terribly wrong.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jarrett Billingsley wrote:
 On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 RE attributes:
 "repeat", "r": Repeat over the whole input string
 "ignorecase", "i": case insensitive
 "multiline", "m": treat as multiple lines separated by newlines
 "verbose", "v": ignores space outside [] and allows comments

 And how do you combine them? "repeat, ignorecase"? Writing and parsing such
 options becomes a little adventure in itself. I think the "g", "i", and "m"
 flags are popular enough if you've done any amount of regex programming. If
 not, you'll look up the manual regardless.

 
 While we're on the subject I'd like to mention that an unbelievably
 overwhelming proportion of the time, when I use regexen, I want them
 to be global.  As in, I don't think I've ever used a non-global regex.
 
 To that effect I'd like to propose that either "g" be the default
 attribute, or that it should be on _unless_ some other attribute ("o"
 for once?) is present.  I think this is one thing that Perl got
 terribly wrong.

Well I agree for searches but not for substitutions.

In D searches, the lazy way of matching means you can always go with "g" 
and change your mind whenever you please. I think I'll simply eliminate 
"g" from the offered options for search.


Andrei

Feb 19 2009

Don <nospam nospam.com> writes:

Andrei Alexandrescu wrote:
 I'm almost done rewriting the regular expression engine, and some pretty 
 interesting things have transpired.
 
 First, I separated the engine into two parts, one that is the actual 
 regular expression engine, and the other that is the state of the match 
 with some particular input. The previous code combined the two into a 
 huge class. The engine (written by Walter) translates the regex string 
 into a bytecode-compiled form. Given that there is a deterministic 
 correspondence between the regex string and the bytecode, the Regex 
 engine object is in fact invariant and cached by the implementation. 
 Caching makes for significant time savings even if e.g. the user 
 repeatedly creates a regular expression engine in a loop.
 
 In contrast, the match state depends on the input string. I defined it 
 to implement the range interface, so you can either inspect it directly 
 or iterate it for all matches (if the "g" option was passed to the engine).
 
 The new codebase works with char, wchar, and dchar and any random-access 
 range as input (forward ranges to come, and at some point in the future 
 input ranges as well). In spite of the added flexibility, the code size 
 has shrunk from 3396 lines to 2912 lines. I plan to add support for 
 binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
 from regexes) and also, probably unprecedented, support for arbitrary 
 types such as integers, floating point numbers, structs, what have you. 
 Any type that supports comparison and ranges is a good candidate for 
 regular expression matching. I'm not sure how regular expression 
 matching can be harnessed e.g. over arrays of int, but I suspect some 
 pretty cool applications are just around the corner. We can introduce 
 that generalization without adding complexity and there is nothing in 
 principle opposed to it.
 
 The interface is very simple, mainly consisting of the functions 
 regex(), match(), and sub(), e.g.
 
 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
 
 Two other syntactic options are available:
 
 "abracazoo".match(regex("a[b-e]", "g"))
 "abracazoo".match("a[b-e]", "g")
 
 I could have made match a member of regex:
 
 regex("a[b-e]", "g").match("abracazoo")
 
 but most regex code I've seen mentions the string first and the regex 
 second. So I dropped that idea.
 
 Now, match() is likely to be called very often so I'm considering:
 
 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
     writeln(e);
 
 In general I'm wary of unwitting operator overloading, but I think this 
 case is more justified than others. Thoughts?
 
 
 Andrei

I agree with the comments against ~.
I believe this Perl6 document is a must-read:

http://dev.perl.org/perl6/doc/design/apo/A05.html

There are some excellent observations there, especially near the 
beginning. By separating the engine from the state of the match, you 
open the possibility of subsequently providing cleaner regex syntax.

I do wonder though, how you'd deal with a regex which includes a match 
to a literal string provided as a variable. Would this be passed to the 
engine, or to the match state?
If the engine is using backtracking, there's no difference in the 
generated bytecode; but if it's creating an automaton, the compiled 
engine depends on the contents of the string variable.
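
One common answer in other libraries is to escape the variable and splice it into the pattern before compilation, in which case the compiled engine does depend on the variable's contents. A minimal Python sketch of that approach, for illustration only: the helper name `digits_then_literal` is hypothetical, and this says nothing about how the D engine would handle the case.

```python
import re


def digits_then_literal(text, literal):
    """Report whether `text` contains digits followed by `literal` verbatim."""
    # re.escape quotes regex metacharacters, so `literal` is matched
    # character for character rather than interpreted as a pattern.
    pattern = re.compile(r"\d+" + re.escape(literal))
    return pattern.search(text) is not None
```

With `literal = ".50"`, the text `"42.50"` matches but `"42x50"` does not, because the dot has been escaped before the pattern was compiled.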

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Don wrote:
 Andrei Alexandrescu wrote:
 I'm almost done rewriting the regular expression engine, and some 
 pretty interesting things have transpired.

 First, I separated the engine into two parts, one that is the actual 
 regular expression engine, and the other that is the state of the 
 match with some particular input. The previous code combined the two 
 into a huge class. The engine (written by Walter) translates the regex 
 string into a bytecode-compiled form. Given that there is a 
 deterministic correspondence between the regex string and the 
 bytecode, the Regex engine object is in fact invariant and cached by 
 the implementation. Caching makes for significant time savings even if 
 e.g. the user repeatedly creates a regular expression engine in a loop.

 In contrast, the match state depends on the input string. I defined it 
 to implement the range interface, so you can either inspect it 
 directly or iterate it for all matches (if the "g" option was passed 
 to the engine).

 The new codebase works with char, wchar, and dchar and any 
 random-access range as input (forward ranges to come, and at some 
 point in the future input ranges as well). In spite of the added 
 flexibility, the code size has shrunk from 3396 lines to 2912 lines. I 
 plan to add support for binary data (e.g. ubyte - handling binary file 
 formats can benefit a LOT from regexes) and also, probably 
 unprecedented, support for arbitrary types such as integers, floating 
 point numbers, structs, what have you. Any type that supports 
 comparison and ranges is a good candidate for regular expression 
 matching. I'm not sure how regular expression matching can be 
 harnessed e.g. over arrays of int, but I suspect some pretty cool 
 applications are just around the corner. We can introduce that 
 generalization without adding complexity and there is nothing in 
 principle opposed to it.

 The interface is very simple, mainly consisting of the functions 
 regex(), match(), and sub(), e.g.

 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

 Two other syntactic options are available:

 "abracazoo".match(regex("a[b-e]", "g"))
 "abracazoo".match("a[b-e]", "g")

 I could have made match a member of regex:

 regex("a[b-e]", "g").match("abracazoo")

 but most regex code I've seen mentions the string first and the regex 
 second. So I dropped that idea.

 Now, match() is likely to be called very often so I'm considering:

 foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
     writeln(e);

 In general I'm wary of unwitting operator overloading, but I think 
 this case is more justified than others. Thoughts?


 Andrei

 
 I agree with the comments against ~.
 I believe this Perl6 document is a must-read:
 
 http://dev.perl.org/perl6/doc/design/apo/A05.html
 
 There are some excellent observations there, especially near the 
 beginning. By separating the engine from the state of the match, you 
 open the possibility of subsequently providing cleaner regex syntax.

I'd read it a while ago, but a refresher is in order. Thanks!

 I do wonder though, how you'd deal with a regex which includes a match 
 to a literal string provided as a variable. Would this be passed to the 
 engine, or to the match state?

At the moment these are not supported. It's a good question.

 If the engine is using backtracking, there's no difference in the 
 generated bytecode; but if it's creating an automaton, the compiled 
 engine depends on the contents of the string variable.

The current engine is, to the best of my understanding, using 
backtracking. At least when there's an "or", it tries both matches as 
recursive calls and picks the longest.
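
To make the "try both and pick the longest" behaviour concrete, here is a toy Python sketch. It is not the D engine's code: `longest_alternative` is a hypothetical helper that only handles literal alternatives. It is contrasted with Python's own `re` module, which stops at the first alternative that succeeds.

```python
import re


def longest_alternative(alternatives, text, pos=0):
    """Try every literal alternative at `pos`; return the longest that matches."""
    best = None
    for alt in alternatives:
        if text.startswith(alt, pos) and (best is None or len(alt) > len(best)):
            best = alt
    return best


# Python's re takes the first alternative that lets the match succeed.
leftmost_first = re.match(r"a|ab", "abc").group(0)

# The toy matcher explores both branches and keeps the longer result.
longest = longest_alternative(["a", "ab"], "abc")
```

Here `leftmost_first` is `"a"` while `longest` is `"ab"`, so the two policies can give different hits for the same pattern.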


Andrei

Feb 19 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

I don't like `sub`, I mean the name. Makes me think of substring more 
than substitute. My choice would be to reuse what we have in std.string 
and augment it to work with regular expressions:

	auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1"));

This way it works consistently whether you're using a string or a 
regular expression: just replace any pattern string with regex(...) and 
any replacement string with subex(...) -- "substitution-expression" -- 
when you want them to be parsed as such. Omitting subex in the above 
would make it a plain string replacement for instance (this way it's 
easy to use a variable there).
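
As a point of comparison, Python keeps the two cases apart by function rather than by argument type. A small sketch, for illustration only (it uses Python's `str.replace` and `re.sub`, not the proposed D functions, and Python writes the group reference as `\1` rather than `$1`):

```python
import re

text = "abracazoo"

# Plain string replacement: pattern and replacement are both taken literally,
# so "$1" ends up in the output unchanged.
plain = text.replace("ab", "A$1")

# Regex replacement: the pattern is parsed, and \1 in the replacement
# expands to whatever the first group matched.
expanded = re.sub(r"a([b-e])", r"A\1", text)
```

The first call yields `"A$1racazoo"` and the second `"AbrAcazoo"`, which is the distinction subex(...) is meant to make explicit.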

These functions should allow easy substitution of any string or regex 
pattern with another algorithm for matching the pattern.

And there's no way to get a range of matches using std.string, but 
there should be, and it should follow the same rule as above: 
supporting strings and regex consistently. (Using the `in` operator as 
suggested by Bill Baxter seems a good fit for this function.)

And if any of you complains about the extra verbosity, here's what I suggest:

	auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");

Yes, syntactic sugar for declaring regular expressions.


 Two other syntactic options are available:
 
 "abracazoo".match(regex("a[b-e]", "g"))
 "abracazoo".match("a[b-e]", "g")

I despise the second one, because if you omit regex(...) it makes me 
think you're checking for string matches, not expression matches. 
There's nothing in the name of the function telling you you're dealing 
with a regular expression, so it could easily get confusing.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michel Fortin wrote:
 On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

 
 I don't like `sub`, I mean the name. Makes me think of substring more 
 than substitute. My choice would be to reuse what we have in std.string 
 and augment it to work with regular expressions:
 
     auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1"));

Ok. Probably subex is a bit of a killer, but I see your point (subex is 
not an arbitrary string).

 This way it works consistently whether you're using a string or a 
 regular expression: just replace any pattern string with regex(...) and 
 any replacement string with subex(...) -- "substitution-expression" -- 
 when you want them to be parsed as such. Omitting subex in the above 
 would make it a plain string replacement for instance (this way it's 
 easy to use a variable there).

Indeed, that was part of the impetus for making regex a distinct type 
that participates in larger functions. The only problem is that regex 
does not work with std.algorithm in an obvious way, e.g. find() works 
very differently for strings and regexes. I considered at a point trying 
to integrate them, but decided to not spend that effort right now.

 These functions should allow easy substitution of any string or regex 
 pattern with another algorithm for matching the pattern.
 
 And there's no way to get a range of matches using std.string, but 
 there should be, and it should follow the same rule as above: supporting 
 strings and regex consistently. (Using the `in` operator as suggested by 
 Bill Baxter seems a good fit for this function.)

I defined the following in std.algorithm (signatures simplified):

// Split a range by a 1-element separator
Splitter!(...) splitter(Range, Element)(Range input, Element separator);
// Split a range by a subrange separator
Splitter!(...) splitter(Range)(Range input, Range separator);

I then defined this in std.regex:

// Split a range by a regex separator
Splitter!(...) splitter(Range)(Range input, Regex separator);

Now this is very nice because you get to switch from one to another very 
easily.

foreach (e; splitter(input, ',')) { ... }
foreach (e; splitter(input, ", ")) { ... }
foreach (e; splitter(input, regex(", *"))) { ... }

The speed/flexibility tradeoff is self-evident and under the control of 
the programmer without much fuss as it's very easy to switch from one 
form to another.
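
The same three-way switch can be sketched in Python for comparison. This is only an illustration: it uses Python's eager `str.split` and `re.split` rather than the lazy D splitter described above, and the sample `line` is made up.

```python
import re

line = "a, b,c,  d"

# Single-character separator: fastest, but spaces stay in the pieces.
by_char = line.split(",")

# Substring separator: only an exact ", " counts as a boundary.
by_string = line.split(", ")

# Regex separator: a comma followed by any number of spaces.
by_regex = re.split(r", *", line)
```

Only the regex form copes with the irregular spacing, at the cost of running the regex engine.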

 And if any of you complains about the extra verbosity, here's what I 
 suggest:
 
     auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");
 
 Yes, syntactic sugar for declaring regular expressions.
 
 
 Two other syntactic options are available:

 "abracazoo".match(regex("a[b-e]", "g"))
 "abracazoo".match("a[b-e]", "g")

 
 I despise the second one, because if you omit regex(...) it makes me 
 think you're checking for string matches, not expression matches. 
 There's nothing in the name of the function telling you you're dealing 
 with a regular expression, so it could easily get confusing.

This is yet another proof that discussion of syntax, notation, and 
naming will never go out of fashion. I was half convinced by the others 
that we're in good shape with input.match(regex).


Andrei

Feb 19 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Michel Fortin wrote:
 [snip]
 
 And if any of you complains about the extra verbosity, here's what I
 suggest:
 
     auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");
 
 Yes, syntactic sugar for declaring regular expressions.

Didn't D previously have special regex literals that got dropped for
being unpopular and/or hated?

  -- Daniel

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Daniel Keep:

 But it doesn't, and I can't see how it could given how confusing it
 would make things.

I think that using "in" into foreach() leads to less bugs, because it's easy to
not tell apart "," and a ";".

So far it was not accepted in D mostly because the compiler stages of D are
meant to be very separated.


 Besides which, we shouldn't be making judgements based on possible, not
planned for syntax changes at some unspecified point in the future. We have
enough trouble with deciding on things as it is. :P<

I agree that the situation isn't easy, and designing a language is hard. But
it's very useful to keep a long-range sight and avoid to step on our future
toes, when possible and when it's a cheap thing to do.

Bye,
bearophile

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 10:38 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Daniel Keep:

 But it doesn't, and I can't see how it could given how confusing it
 would make things.

 I think that using "in" into foreach() leads to less bugs, because it's easy
to not tell apart "," and a ";".

The semicolon does not introduce bugs.  If you don't have a semicolon,
you get a simple parser error.  That is not a bug.  If you can't tell
; and , apart, get a better font.

 So far it was not accepted in D mostly because the compiler stages of D are
meant to be very separated.

That has little to nothing to do with it.  'in' in a foreach loop
header is unambiguous to parse.  I think it has much more to do with
the fact that semicolon works fine, is already present in mounds of D
code, and changing it to 'in' does not really benefit anyone except
you, since you're so goddamned attached to Python's syntax.  Use
Delight, ffs.

Also, "I think I don't like X" is not proper English.  Say "I don't
think I like X" or just "I don't like X" instead.

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Jarrett Billingsley:

This is an old discussion, and maybe it will not lead to much.


If you don't have a semicolon, you get a simple parser error. That is not a
bug.<

Wikipedia agrees with me:

http://en.wikipedia.org/wiki/Software_bug
A software bug is an error, flaw, mistake, failure, or fault in a computer
program that prevents it from behaving as intended (e.g., producing an
incorrect or unexpected result).<

So a parser error is a bug too, despite the compiler will help you find it in a
moment.

I have written and debugged many times "mistakes" like:

foreach (a, b, iterable)
foreach (a; b; iterable)
foreach (a; b, iterable)

And probably I am not the only one :-)


If you can't tell ; and , apart, get a better font.<

I have already modified a good font to tell apart . and ; better when I program
D:

http://www.fantascienza.net/leonardo/ar/inconsolatag/inconsolata-g_font.zip

But having a language that is more bug-prone isn't good.


That has little to nothing to do with it.  'in' in a foreach loop header is
unambiguous to parse.<

You may have missed the discussion last time, when I think Walter has explained
what I have told you the problem about the compilation stages.


and changing it to 'in' does not really benefit anyone except you, since you're
so goddamned attached to Python's syntax.<

Thank you, I attach myself to things I think are good and well designed.
And Python isn't the only language that uses "in" with a "for-each" :-)


Use Delight, ffs.<

I don't know what "ffs" means, and I'm on Windows again now :-)


Also, "I think I don't like X" is not proper English.  Say "I don't think I
like X" or just "I don't like X" instead.<

Thank you very much, I'll try to remember that.

Bye,
bearophile

Feb 19 2009

Ary Borenszweig <ary esperanto.org.ar> writes:

bearophile wrote:
 Jarrett Billingsley:
 
 This is an old discussion, and maybe it will not lead to much.
 
 
 If you don't have a semicolon, you get a simple parser error. That is not a
bug.<

 
 Wikipedia agrees with me:
 
 http://en.wikipedia.org/wiki/Software_bug
 A software bug is an error, flaw, mistake, failure, or fault in a computer
program that prevents it from behaving as intended (e.g., producing an
incorrect or unexpected result).<

 
 So a parser error is a bug too, despite the compiler will help you find it in
a moment.
 
 I have written and debugged many times "mistakes" like:
 
 foreach (a, b, iterable)
 foreach (a; b; iterable)
 foreach (a; b, iterable)
 
 And probably I am not the only one :-)

Why would you do that?



because it's shorter and you write a lot of foreach loops in a program.

Maybe we should vote and see how many people make the mistake of 
confusing comma and semicolon in this case.

 
 
 If you can't tell ; and , apart, get a better font.<

 
 I have already modified a good font to tell apart . and ; better when I
program D:
 
 http://www.fantascienza.net/leonardo/ar/inconsolatag/inconsolata-g_font.zip
 
 But having a language that is more bug-prone isn't good.
 
 
 That has little to nothing to do with it.  'in' in a foreach loop header is
unambiguous to parse.<

 
 You may have missed the discussion last time, when I think Walter has
explained what I have told you the problem about the compilation stages.
 
 
 and changing it to 'in' does not really benefit anyone except you, since
you're so goddamned attached to Python's syntax.<

 
 Thank you, I attach myself to things I think are good and well designed.
 And Python isn't the only language that uses "in" with a "for-each" :-)
 
 
 Use Delight, ffs.<

 
 I don't know what "ffs" means, and I'm on Windows again now :-)
 
 
 Also, "I think I don't like X" is not proper English.  Say "I don't think I
like X" or just "I don't like X" instead.<


To Jarrett: why isn't it proper English? It makes sense to me.

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 11:45 AM, Ary Borenszweig <ary esperanto.org.ar> wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't think
 I like X" or just "I don't like X" instead.<


 To Jarrett: why isn't it proper English? It makes sense to me.

Andrei did a pretty good job of explaining indirectly why it's wrong ;)

If you say "I think <something about myself>", it sounds very strange,
because it sounds like you don't know what's going on in your own
brain.  "I think X" often means "I'm not sure of X", so saying that
you're unsure of what you do or don't like sounds odd indeed.

Feb 19 2009

Ary Borenszweig <ary esperanto.org.ar> writes:

Jarrett Billingsley wrote:
 On Thu, Feb 19, 2009 at 11:45 AM, Ary Borenszweig <ary esperanto.org.ar> wrote:
 
 Also, "I think I don't like X" is not proper English.  Say "I don't think
 I like X" or just "I don't like X" instead.<


 To Jarrett: why isn't it proper English? It makes sense to me.

 
 Andrei did a pretty good job of explaining indirectly why it's wrong ;)
 
 If you say "I think <something about myself>", it sounds very strange,
 because it sounds like you don't know what's going on in your own
 brain.  "I think X" often means "I'm not sure of X", so saying that
 you're unsure of what you do or don't like sounds odd indeed.

I think you're right.

Feb 19 2009

Bill Baxter <wbaxter gmail.com> writes:

On Fri, Feb 20, 2009 at 10:48 AM, Ary Borenszweig <ary esperanto.org.ar> wrote:
 Jarrett Billingsley wrote:
 On Thu, Feb 19, 2009 at 11:45 AM, Ary Borenszweig <ary esperanto.org.ar>
 wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't
 think
 I like X" or just "I don't like X" instead.<


 To Jarrett: why isn't it proper English? It makes sense to me.

 Andrei did a pretty good job of explaining indirectly why it's wrong ;)

 If you say "I think <something about myself>", it sounds very strange,
 because it sounds like you don't know what's going on in your own
 brain.  "I think X" often means "I'm not sure of X", so saying that
 you're unsure of what you do or don't like sounds odd indeed.


And yet sometimes that is exactly the case.

---bb

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 9:01 PM, Bill Baxter <wbaxter gmail.com> wrote:
 And yet sometimes that is exactly the case.

Perhaps I'm just too absolute ;)

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ary Borenszweig wrote:
 bearophile wrote:
 Jarrett Billingsley:

 This is an old discussion, and maybe it will not lead to much.


 If you don't have a semicolon, you get a simple parser error. That is 
 not a bug.<

 Wikipedia agrees with me:

 http://en.wikipedia.org/wiki/Software_bug
 A software bug is an error, flaw, mistake, failure, or fault in a 
 computer program that prevents it from behaving as intended (e.g., 
 producing an incorrect or unexpected result).<

 So a parser error is a bug too, despite the compiler will help you 
 find it in a moment.

 I have written and debugged many times "mistakes" like:

 foreach (a, b, iterable)
 foreach (a; b; iterable)
 foreach (a; b, iterable)

 And probably I am not the only one :-)

 
 Why would you do that?
 


 because it's shorter and you write a lot of foreach loops in a program.
 
 Maybe we should vote and see how many people make the mistake of 
 confusing comma and semicolon in this case.

Not confusing me. I'll note that if "in" were used, you could write:

foreach (a in b in c) {}

Now try explaining that one :o).


Andrei

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 11:07 AM, bearophile <bearophileHUGS lycos.com> wrote:

If you don't have a semicolon, you get a simple parser error. That is not a
bug.<

 Wikipedia agrees with me:

 http://en.wikipedia.org/wiki/Software_bug
A software bug is an error, flaw, mistake, failure, or fault in a computer
program that prevents it from behaving as intended (e.g., producing an
incorrect or unexpected result).<

 So a parser error is a bug too, despite the compiler will help you find it in
a moment.

The representation of a program is separate from its semantics, and
the semantics are only knowable if the representation is correct.
Bugs are semantic errors.  If your program cannot be compiled due to
an incorrect representation, you can't tell whether the program's
semantics are correct.  Therefore, an incorrect representation is not
a bug.

 I have written and debugged many times "mistakes" like:

 foreach (a, b, iterable)
 foreach (a; b; iterable)
 foreach (a; b, iterable)

 And probably I am not the only one :-)
 ...
 But having a language that is more bug-prone isn't good.


"foreach(string s; something)".  And the same with Java, which uses

are wrong and they should be changed to the semicolon because I made
mistakes?

The point I'm making here is it doesn't matter whether it uses 'in' or

expression, because there will always be people who feel that another
syntax is better or more natural.  Instead of arguing over minute
details like this, let's worry about the important things, like the
semantics of foreach loops.

That has little to nothing to do with it.  'in' in a foreach loop header is
unambiguous to parse.<

 You may have missed the discussion last time, when I think Walter has
explained what I have told you the problem about the compilation stages.

I know very well what Walter is talking about when he mentions the
independence of the stages of compilation.  I've written a compiler
too, based on D's and with a similar staged compilation strategy.  The
fact is that there is no grammar production in which 'in' can be
ambiguous within the context of a foreach loop header.

foreach(something; x)

You can replace ';' with 'in' or 'out' or 'forble' or pretty much any
other token, as long as it doesn't cause ambiguity with the
'something' part.  'something' is not an expression, so there is no
way that the compiler could mistake 'in' for an expression there.  It
doesn't require any semantic analysis to determine what 'in' means in
that context.

and changing it to 'in' does not really benefit anyone except you, since you're
so goddamned attached to Python's syntax.<

 Thank you, I attach myself to things I think are good and well designed.
 And Python isn't the only language that uses "in" with a "for-each" :-)

And D isn't the only language that _doesn't_ use 'in'.  And?  What's your point?

Use Delight, ffs.<

 I don't know what "ffs" means, and I'm on Windows again now :-)

"For f**k's sake."  It's an expression of exasperation.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jarrett Billingsley wrote:
 The point I'm making here is it doesn't matter whether it uses 'in' or

 expression, because there will always be people who feel that another
 syntax is better or more natural.  Instead of arguing over minute
 details like this, let's worry about the important things, like the
 semantics of foreach loops.

I agree. One thing that ranges still don't address is binding multiple 
values to them:

foreach (a, b, c; range) statement

Steve promoted the idea that the code above is translated to:

{
     T1 a;
     T2 b;
     T3 c;
     auto __r = range;
     for (; !__r.empty; __r.next)
     {
         __r.head(a, b, c);
         statement
     }
}

It's a good idea, and I'd favor e.g. a discussion around it as opposed 
to one on whether ";" is the proper separator.

Oh, there was another wrinkle: if you have a container, how do you 
obtain a range from it? I suggested container.all, but then people said 
that's a step backwards from opApply. I think [] should be used for 
accessing all of a range. Something that is already a range simply 
returns "this" from opSlice(). So the code above with this other 
proposal tucked in becomes:

{
     T1 a;
     T2 b;
     T3 c;
     auto __r = range[];
     for (; !__r.empty; __r.next)
     {
         __r.head(a, b, c);
         statement
     }
}
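
The two lowerings above can be modelled roughly in Python. This is a sketch under stated assumptions, not D semantics: `TripleRange`, `Container`, and `foreach` are hypothetical names, `head()` returns a tuple instead of filling out-parameters, and `obj[:]` stands in for D's `range[]`.

```python
class TripleRange:
    """Toy range over a list of 3-tuples with empty / head / next primitives."""

    def __init__(self, items):
        self._items = items
        self._pos = 0

    def __getitem__(self, index):
        # Only the "slice everything" form r[:] is supported;
        # something that is already a range simply returns itself.
        if index != slice(None):
            raise TypeError("only [:] is supported")
        return self

    @property
    def empty(self):
        return self._pos >= len(self._items)

    def head(self):
        return self._items[self._pos]

    def next(self):
        self._pos += 1


class Container:
    """Toy container: slicing it with [:] yields a fresh range over its items."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        if index != slice(None):
            raise TypeError("only [:] is supported")
        return TripleRange(self._items)


def foreach(iterable, body):
    """Hand-written version of the lowering: foreach (a, b, c; iterable) body."""
    r = iterable[:]              # auto __r = range[];
    while not r.empty:           # for (; !__r.empty; __r.next)
        a, b, c = r.head()       # __r.head(a, b, c);
        body(a, b, c)            # statement
        r.next()
```

Because both the container and the range answer to `[:]`, the same `foreach` works whether it is handed a container or a range that is already partly consumed.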


Andrei

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 10:45:44 -0500, Jarrett Billingsley wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't
 think I like X" or just "I don't like X" instead.

Of course it's proper English.

"I think I don't like X" means that I'm undecided about whether or not I
like X but I probably do not like it.

"I don't think I like X" means that I *know* that I don't like X, there is
no uncertainty.


-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 4:06 PM, Derek Parnell <derek psych.ward> wrote:
 On Thu, 19 Feb 2009 10:45:44 -0500, Jarrett Billingsley wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't
 think I like X" or just "I don't like X" instead.

 Of course it's proper English.

 "I think I don't like X" means that I'm undecided about whether or not I
 like X but I probably do not like it.

 "I don't think I like X" means that I *know* that I don't like X, there is
 no uncertainty.

See, it's funny, because I would think "I don't think I like X" means
that I'm undecided about whether or not I like X but I probably don't;
and that "I don't like X" means that I know that I don't like it.  "I
think I don't X" just sounds very unnatural to me.

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 17:59:13 -0500, Jarrett Billingsley wrote:

 On Thu, Feb 19, 2009 at 4:06 PM, Derek Parnell <derek psych.ward> wrote:
 On Thu, 19 Feb 2009 10:45:44 -0500, Jarrett Billingsley wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't
 think I like X" or just "I don't like X" instead.

 Of course it's proper English.

 "I think I don't like X" means that I'm undecided about whether or not I
 like X but I probably do not like it.

 "I don't think I like X" means that I *know* that I don't like X, there is
 no uncertainty.

 
 See, it's funny, because I would think "I don't think I like X" means
 that I'm undecided about whether or not I like X but I probably don't;
 and that "I don't like X" means that I know that I don't like it.  "I
 think I don't X" just sounds very unnatural to me.

Maybe it's a difference between American English and Australian English?

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

Feb 19 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:

On Thu, Feb 19, 2009 at 6:31 PM, Derek Parnell <derek psych.ward> wrote:
 See, it's funny, because I would think "I don't think I like X" means
 that I'm undecided about whether or not I like X but I probably don't;
 and that "I don't like X" means that I know that I don't like it.  "I
 think I don't X" just sounds very unnatural to me.

 Maybe it's a difference between American English and Australian English?

You might have something there.

Feb 19 2009

"Nick Sabalausky" <a a.a> writes:

"Jarrett Billingsley" <jarrett.billingsley gmail.com> wrote in message 
news:mailman.799.1235084361.22690.digitalmars-d puremagic.com...
 On Thu, Feb 19, 2009 at 4:06 PM, Derek Parnell <derek psych.ward> wrote:
 On Thu, 19 Feb 2009 10:45:44 -0500, Jarrett Billingsley wrote:

 Also, "I think I don't like X" is not proper English.  Say "I don't
 think I like X" or just "I don't like X" instead.

 Of course it's proper English.

 "I think I don't like X" means that I'm undecided about whether or not I
 like X but I probably do not like it.

 "I don't think I like X" means that I *know* that I don't like X, there 
 is
 no uncertainty.

 See, it's funny, because I would think "I don't think I like X" means
 that I'm undecided about whether or not I like X but I probably don't;
 and that "I don't like X" means that I know that I don't like it.  "I
 think I don't X" just sounds very unnatural to me.

I think I don't like "I think I don't like X".

Sorry, I had to say it ;)

Feb 19 2009

Christopher Wright <dhasenan gmail.com> writes:

Derek Parnell wrote:
 "I don't think I like X" means that I *know* that I don't like X, there is
 no uncertainty.

Not so. That is the only *reasonable* interpretation, but the person 
might not have any opinion whatsoever on the issue of whether they like 
X, and know that.

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

>I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming.<

I think I don't like the "g".

-----------------------

To test an API it's often good to try to use it, or to compare it against
similar practical and common operations done with another language or library.
So here I show two examples in Python. You can try to translate these two
operations with the std.re of D2 to see how they turn out :-)


The first example shows the usage of a callable for re.sub() (in D it may be
called replace()).

Here replacer() is a user-defined function given to re.sub() (or to the sub()
method of a compiled pattern), which calls it on each match.

Note that in Python functions are objects, so I have dynamically added to the
replacer() function an instance attribute named "counter". In D (and Python)
you can do the same thing by creating a small class with a counter attribute.


import re

def replacer(mobj):
    replacer.counter += 1
    return "REPL%02d" % replacer.counter
replacer.counter = 0

s1 = ".......TAG............TAG................TAG..........TAG....."

result = ".......REPL01............REPL02................REPL03..........REPL04....."

r = re.sub("TAG", replacer, s1)
assert r == result
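
The class-based variant mentioned above could look roughly like this in Python. This is only a sketch: the class name Replacer and the shortened test string are made up for illustration, and it relies on re.sub() accepting any callable as the replacement.

```python
import re

class Replacer:
    """Callable object that numbers each replacement it produces."""

    def __init__(self):
        self.counter = 0

    def __call__(self, mobj):
        # Called once per match; the match object itself is not needed here.
        self.counter += 1
        return "REPL%02d" % self.counter

s1 = "..TAG....TAG.."  # shortened, hypothetical input
r = re.sub("TAG", Replacer(), s1)
```

Compared with the function attribute, the state lives in the instance, so two independent Replacer objects keep separate counters.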

----------

This is a little example of managing groups in Python:

>>> import re
>>> data = ">hello1 how are5 you?<"
>>> patt = re.compile(r".*?(hello\d).*?(are\d).*")
>>> patt.match(data).groups()
('hello1', 'are5')


(Note that here all groups are found eagerly. If you want lazy matching in
Python you have to use re.finditer() or the finditer() method of a compiled
pattern.)

I would like a syntax similar to this, where opIndex() returns the matched
group:

>>> patt.match(data)[0]
'hello1'
>>> patt.match(data)[1]
'are5'

Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 >I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming.<
 
 I think I don't like the "g".

How can anyone think they don't like something? You like it or not, but 
it's not the result of a thought process. I guess.

Anyway: g is from Perl. Let's keep it that way.


Andrei

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:

 bearophile wrote:
 Andrei Alexandrescu:
 
 >I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming.<
 
 I think I don't like the "g".

 
 How can anyone think they don't like something? You like it or not, but 
 it's not the result of a thought process. I guess.

It is not a question of whether one likes or doesn't like; this expression
is attempting to say something about one's level of certainty about liking
something. That is to say, one might not be positive if they *know* if they
like something or not, therefore they *think* (suspect, but not have
definitive evidence) of their stance.

 Anyway: g is from Perl. Let's keep it that way.

Perfect justification ;-)

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:
 
 bearophile wrote:
 Andrei Alexandrescu:

 >I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming.<

 I think I don't like the "g".

 How can anyone think they don't like something? You like it or not, but 
 it's not the result of a thought process. I guess.

 
 It is not a question of whether one likes or doesn't like; this expression
 is attempting to say something about one's level of certainty about liking
 something. That is to say, one might not be positive if they *know* if they
 like something or not, therefore they *think* (suspect, but not have
 definitive evidence) of their stance.

I see. Me, I always use "think" to evoke an actual thinking process. 
Otherwise I use "feel" or "believe". (This turns out to be important in 
various interpersonal interactions, e.g. do you want to drive the 
conversation towards thoughts or feelings? Guess which is gonna get you 
a date :o).) So by definition I can't think I like something. But I 
understand how some may use "I think" as a synonym for "Without being 
sure, to me it seems".


Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 >I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming.<
 
 I think I don't like the "g".
 
 -----------------------
 
 To test an API it's often good to try to use it or compare it against
 similar practical&common operations done with another language or library.
 So here I show two examples in Python. You can try to translate such two
 operations with the std.re of D2 to see how they become :-)

 The first example shows the usage of a callable for re.sub() (in D it may
 be called replace()).

 Here replacer() is a user-defined function given to
 re.sub()/matchobj.sub() that they call on each match.

 Note that in Python functions are objects, so I have dynamically added to
 the replacer() function an instance attribute named "counter". In D (and
 Python) you can do the same thing creating a small class with counter
 attribute.
 
 
 import re
 
 def replacer(mobj):
     replacer.counter += 1
     return "REPL%02d" % replacer.counter
 replacer.counter = 0
 
 s1 = ".......TAG............TAG................TAG..........TAG....."
 
 result = ".......REPL01............REPL02................REPL03..........REPL04..."
 
 r = re.sub("TAG", replacer, s1)
 assert r == result
 
 ----------

Excellent idea. Let's see:

uint counter;
// Pre-increment so that the first replacement is REPL01, as in the Python version.
string replacer(string) { return format("REPL%02d", ++counter); }
auto s1 = ".......TAG............TAG................TAG..........TAG.....";
auto result = 
".......REPL01............REPL02................REPL03..........REPL04.....";
auto r = replace!(replacer)(s1, "TAG");
assert(r == result);

 This is a little example of managing groups in Python:
 
 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()



 ('hello1', 'are5')

auto data = ">hello1 how are5 you?<";
auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
foreach (i; 0 .. iter.engine.captures)
     writeln(iter.capture[i]);

 (notes that here all groups are found eagerly. If you want a lazy matching
 in Python you have to use re.finditer() or matchobj.finditer()).
 
 I may like a syntax similar to this, where opIndex() allows to find the
 matched group:
 
 patt.match(data)[0]



 'hello1'
 patt.match(data)[1]



 'are5'

No go due to confusions with random-access ranges.


Andrei

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

>Excellent idea. Let's see:<

Thank you for all your work and the will to answer the posts here.
Some usable API is slowly shaping up :-)


 uint counter;
 string replacer(string) { return format("REPL%02d", counter++); }
 auto s1 = ".......TAG............TAG................TAG..........TAG.....";
 auto result = ".......REPL01............REPL02................REPL03..........REPL04...";
 r = replace!(replacer)(s1, "TAG");
 assert(r == result);

It looks good enough.

With a static variable it may become:

string replacer(string) {
    static int counter;
    return format("REPL%02d", ++counter); // pre-increment: first result is REPL01
}


With small struct/class it may become:

struct Replacer {
    int counter;
    string opCall(string s) {
        this.counter++;
        return format("REPL%02d", counter);
    }
}

-------------------

 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

I don't understand that.

What's the purpose of ".engine"?

"captures" may be better named "ngroups" or "ncaptures", or you may just use
the .len/.length attribute in some way.

foreach (i, group; iter.groups)
    writeln(i, " ", group);

"group" may be a struct that defines toString and can be cast to string, and
also keeps the starting position of the group into the original string.

Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

 
 I don't understand that.
 
 What's the purpose of ".engine"?

It's the regex engine that has generated the match. I coded that wrong 
in two different ways, it should have been:

foreach (i; 0 .. iter.captures)
       writeln(iter.capture(i));

 "captures" may be better named "ngroups" or "ncaptures", or you may just use
the .len/.length attribute in some way.

"Capture" is the traditional term as far as I understand. I can't use 
.length because it messes up with range semantics. "len" would be too 
confusing. "ncaptures" is too cute. Nobody's perfect :o).

 foreach (i, group; iter.groups)
     writeln(i, " ", group);
 
 "group" may be a struct that defines toString and can be cast to string, and
also keeps the starting position of the group into the original string.

That sounds good.


Andrei

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 foreach (i; 0 .. iter.captures)
       writeln(iter.capture(i));

 "Capture" is the traditional term as far as I understand. I can't use
 .length because it messes up with range semantics. "len" would be too
 confusing. "ncaptures" is too cute. Nobody's perfect :o).

 "group" may be a struct that defines toString and can be cast to string,
 and also keeps the starting position of the group into the original string.


 That sounds good.

Well, then match() may return just a dynamic array of such groups/captures. So
such array has both .length and opIndex. It looks simple :-)

Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 foreach (i; 0 .. iter.captures) writeln(iter.capture(i));

 
 "Capture" is the traditional term as far as I understand. I can't
 use .length because it messes up with range semantics. "len" would
 be too confusing. "ncaptures" is too cute. Nobody's perfect :o).

 
 "group" may be a struct that defines toString and can be cast to
 string, and also keeps the starting position of the group into
 the original string.


 
 That sounds good.

 
 Well, then match() may return just a dynamic array of such
 groups/captures. So such array has both .length and opIndex. It looks
 simple :-)

Looks simple but it isn't. How do you advance to the next match?

foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

This should print:

r
c
d
r

There's a need to make progress in the matching, not in the capture. How 
do you distinguish between them?
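
For comparison only (this is Python's re module, not the D API under discussion), a minimal sketch of how that distinction can look: the outer iteration advances from match to match, while the captures are read from each individual match. The variable name hits is made up for illustration.

```python
import re

# Outer loop: one step per match. Inner access: the captures of that match.
hits = []
for m in re.finditer(r"(.)a", "abracadabra"):
    hits.append((m.group(0), m.groups()))

# Each entry pairs the whole hit with the tuple of its captures.
first_captures = [caps[0] for _, caps in hits]
```

Here first_captures holds one entry per match, which is the "r, c, d, r" sequence from the example above.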


Andrei

Feb 19 2009

"jovo" <jovo at.home> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei

foreach(capture; match(s, r))
  foreach(group; capture)
    writeln(group);

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei

 
 foreach(capture; match(s, r))
   foreach(group; capture)
     writeln(group);
 
 
 

The consecrated terminology is:

foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

"Group" is a group defined without an intent to capture. A "capture" is 
a group that also binds to the state of the match.
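
A small illustration of that terminology, written against Python's re module only because the thread already uses it for examples. The pattern and input are made up; (?:...) is a non-capturing group and (...) is a capture.

```python
import re

# (?:ab)+ only groups for repetition; (c) also records what it matched.
pattern = re.compile(r"(?:ab)+(c)")
m = pattern.match("ababc")

whole_hit = m.group(0)            # everything the pattern consumed
captures = m.groups()             # only the capturing parentheses appear here
capture_count = pattern.groups    # number of captures declared by the pattern
```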

Anyhow... this can be done but things get a tad more confusing for other 
uses. How about this:

foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

?


Andrei

Feb 19 2009

"jovo" <jovo at.home> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei


I think you must answer this question more generally, the same way for the
whole library. Maybe both?

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei

 
 
 I think you must answer this question more generally, same for all library.
 May be both?

I'd hate to fall again into the fallacy of trying to appease everyone's 
taste. Really std.regexp has set a negative record with the incredible 
array of names: find, search, exec, match, test, and probably I forgot a 
couple. Also it has offered a variety of random features in both 
free-function and member-function format, not even always doing the same 
thing. Germans have a saying: "Kurz und gut". Let's make it short and good.

Andrei

Feb 19 2009

Bill Baxter <wbaxter gmail.com> writes:

I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

.captures looks like it should be a set of captures, not a count.

This is a need that comes up again and again -- querying the size, or
count, or length of some sub-element like this -- so I think it would
greatly benefit Phobos to choose some less ambiguous convention and
stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

---bb

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:

 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

Agree. I thought that iter.captures is a set (range) of captures.

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:
 
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

 
 Agree. I thought that iter.captures is a set (range) of captures.
 

I'm done implementing that.

Andrei

Feb 19 2009

KennyTM~ <kennytm gmail.com> writes:

Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures
 
 .captures looks like it should be a set of captures, not a count.
 
 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.
 
 ---bb

iter.count

Feb 19 2009

Bill Baxter <wbaxter gmail.com> writes:

On Fri, Feb 20, 2009 at 9:47 AM, KennyTM~ <kennytm gmail.com> wrote:
 Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

 iter.count

Maybe I haven't paid close enough attention here, but I think the
reason he didn't say .count or .length is that it's ambiguous whether
it means the number of captures or the number of matches.

--bb

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

[snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()



 ('hello1', 'are5')

 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

I would expect that to be

foreach (/*Capture */ i; 0 .. iter.engine.captures)
     writeln(i);

 (notes that here all groups are found eagerly. If you want a lazy  
 matching in Python you have to use re.finditer() or  
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find  
 the matched group:

 patt.match(data)[0]



 'hello1'
 patt.match(data)[1]



 'are5'

 No go due to confusions with random-access ranges.

Why aren't iter.capture[0] and iter.capture[1] good enough?
How are they different from iter.engine.captures[0] and  
iter.engine.captures[1]?

Why is it a no go if you access iter.captures as a random-access range?

I'm sorry if these are dumb questions, but the code you've shown is a bit  
confusing (these iter.engine.captures and iter.captures).

 Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 [snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()



 ('hello1', 'are5')

 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

 
 I would expect that to be
 
 foreach (/*Capture */ i; 0 .. iter.engine.captures)
     writeln(i);
 
 (notes that here all groups are found eagerly. If you want a lazy 
 matching in Python you have to use re.finditer() or 
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find 
 the matched group:

 patt.match(data)[0]



 'hello1'
 patt.match(data)[1]



 'are5'

 No go due to confusions with random-access ranges.

 
 Why iter.capture[0] and iter.capture[1] aren't good enough?
 How are they different from iter.engine.captures[0] and 
 iter.engine.captures[1]?
 
 Why it is a no go if you access iter.captures as a random-access range?
 
 I'm sorry if these are dumb questions, but the code you've shown is a 
 bit confusing (these iter.engine.captures and iter.captures).

They're good. The code I posted was dumb. The "engine" thing does not 
belong there, and "captures" should be indeed a random-access range.


Andrei

Feb 19 2009

Leandro Lucarella <llucax gmail.com> writes:

Andrei Alexandrescu, on February 18 at 21:35, you wrote to me:
 I'm almost done rewriting the regular expression engine, and some pretty
 interesting things have transpired.

 First, I separated the engine into two parts, one that is the actual
 regular expression engine, and the other that is the state of the match
 with some particular input. The previous code combined the two into a huge
 class. The engine (written by Walter) translates the regex string into a
 bytecode-compiled form. Given that there is a deterministic correspondence
 between the regex string and the bytecode, the Regex engine object is in
 fact invariant and cached by the implementation. Caching makes for
 significant time savings even if e.g. the user repeatedly creates a
 regular expression engine in a loop.

 In contrast, the match state depends on the input string. I defined it to
 implement the range interface, so you can either inspect it directly or
 iterate it for all matches (if the "g" option was passed to the engine).

 The new codebase works with char, wchar, and dchar and any random-access
 range as input (forward ranges to come, and at some point in the future
 input ranges as well). In spite of the added flexibility, the code size
 has shrunk from 3396 lines to 2912 lines. I plan to add support for binary
 data (e.g. ubyte - handling binary file formats can benefit a LOT from
 regexes) and also, probably unprecedented, support for arbitrary types
 such as integers, floating point numbers, structs, what have you. any type
 that supports comparison and ranges is a good candidate for regular
 expression matching. I'm not sure how regular expression matching can be
 harnessed e.g. over arrays of int, but I suspect some pretty cool
 applications are just around the corner. We can introduce that
 generalization without adding complexity and there is nothing in principle
 opposed to it.

 The interface is very simple, mainly consisting of the functions regex(),
 match(), and sub(), e.g.
 
 foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
 auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

BTW, why are the flags passed as string and not as an integer mask? For
example:
auto s = regex.sub("abracazoo", regex.regex("a([b-e])", regex.G), "A$1");

This way you can catch a few errors at compile-time.
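
As a point of comparison, Python's re module already passes flags as an integer-like mask. The sketch below is Python, so the check happens at run time rather than at compile time as it could in D, but it shows the shape of such an API. The names combined and s are made up for illustration.

```python
import re

# Flags are named constants combined with |, not characters in a string.
combined = re.IGNORECASE | re.MULTILINE
pattern = re.compile(r"a([b-e])", combined)

# Python's sub() replaces every match by default, so no "g" flag is needed.
s = pattern.sub(r"A\1", "abracazoo")

# A misspelled flag name is simply not there, instead of being silently ignored.
typo_exists = hasattr(re, "IGNORCASE")
```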


-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
When I was a child I had a fever
My hands felt just like two balloons.
Now I've got that feeling once again
I can't explain you would not understand
This is not how I am.
I have become comfortably numb.

Feb 19 2009

Benji Smith <dlanguage benjismith.net> writes:

Some of the things I'd like to see in the regex implementation:

All functions accepting a compiled regex object/struct should also 
accept a string version of the pattern (and vice versa). Some 
implementations (Java) only accept the compiled version in some places 
and the string pattern in other places. That's annoying.

Just like with ordinary string-searching functions, you should be able 
to specify a start position (and maybe an end position) for the search. 
Even if the match exists somewhere in the string, it fails if not found 
within the target slice. Something like this:

    auto text = "ABCDEFG";
    auto pattern = regex("[ABCEFG]");

    // returns false, because the char at position 3 does not match
    auto result = match(text, pattern, 3);

    // this should be exactly equivalent (but the previous version
    // uses less memory, and ought to work with infinite ranges, whereas
    // the slice version wouldn't make any sense)
    auto equivalent = match(text[3..$], pattern);

I've needed to use this technique in a few cases to implement a simple 
lexical scanner, and it's a godsend, if the regex engine supports it 
(though most don't).
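
Python's compiled patterns are one engine that does support this, so here is a minimal sketch of the behaviour I mean, using the same text and character class as above. The variable names are made up; match() and search() with a pos argument are existing methods of compiled patterns.

```python
import re

text = "ABCDEFG"
pattern = re.compile("[ABCEFG]")

# Anchored at index 3: 'D' is not in the class, so there is no match.
anchored = pattern.match(text, 3)

# Same answer through a slice, at the cost of copying the tail of the string.
sliced = pattern.match(text[3:])

# search() with a start position scans forward instead and finds 'E'.
found = pattern.search(text, 3)
```

One caveat: the two forms are not identical for every pattern, since with a slice the anchor ^ and look-behind see a different start of input.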

Finally, it'd be extremely cool if the regex compiler automatically 
eliminated redundant nodes from its NFA, converting as much of it as 
possible to a DFA. I did some work on this a few years ago, and it's 
actually remarkably simple to implement using prefix trees.

    // These two expressions produce an identical set of matches,
    // but the first one is functionally an NFA, while the second
    // one is a DFA.
    auto a = regex("(cat|car|cry|dog|door|dry)");
    auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry))");

In cases where the expression can only be partially simplified, you can 
leave some NFA nodes deep within the tree, while still DFA-ifying the 
rest of it:

    auto a = regex("(attitude|attribute|att.+ion)");
    auto b = regex("(att(?:itude|ribute|.+ion))");

It's a very simple transformation, increases speed (dramatically) for 
complex regular expressions (especially those produced dynamically at 
runtime by combining large sets of unrelated target expressions), and it 
reliably produces equivalent results with the inefficient version.
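
A rough sketch of the prefix-tree idea, in Python for brevity and restricted to plain literal words (no nested subexpressions or captures). The function names build_trie and trie_to_regex are made up for illustration, and this is not the code I wrote back then.

```python
import re

def build_trie(words):
    """Insert each literal word into a nested-dict prefix tree."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word ends here"
    return root

def trie_to_regex(node):
    """Emit a pattern in which every shared prefix is written only once."""
    ends_here = "" in node
    alts = [re.escape(ch) + trie_to_regex(child)
            for ch, child in sorted(node.items()) if ch != ""]
    if not alts:
        return ""
    if len(alts) == 1 and not ends_here:
        return alts[0]
    body = "(?:" + "|".join(alts) + ")"
    # If a word may also end at this node, the remaining suffixes are optional.
    return body + "?" if ends_here else body

words = ["cat", "car", "cry", "dog", "door", "dry"]
naive = re.compile("|".join(words))
factored = re.compile(trie_to_regex(build_trie(words)))
```

After the transformation each input character selects at most one branch, which is where the speedup comes from. The sketch only claims equivalence for whole-string matching; when one word is a prefix of another, an unanchored search can prefer a different alternative than the naive pattern does.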

The only really tricky part is if the subexpressions have their own 
capturing groups, in which case the DFA transformation screws up the 
ordinal-numbering of the resultant captures.

Anyhoo...

I don't have any strong feelings about the function names (though I'd 
rather have functions than operators, like "~", for searching and matching).

And I don't have any strong feelings about whether the compiled regex is 
an object or a struct (though I prefer reference semantics over value 
semantics for regexen, and right now, I think that makes objects the 
(slightly) better choice).

Thanks for your hard work! I've implemented a small regex engine before, 
so I know it's no small chunk of effort. Regular expressions are my 
personal favorite "tiny language", and I'm glad to see them get some 
special attention in phobos2.

--benji

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Benji Smith:
 It's a very simple transformation, increases speed (dramatically) for 
 complex regular expressions (especially those produced dynamically at 
 runtime by combining large sets of unrelated target expressions), and it 
 reliably produces equivalent results with the inefficient version.

See:
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm

Something like that can be implemented as small pre-processing layer over the
re module.

Bye,
bearophile

Feb 20 2009
