digitalmars.D - Is str ~ regex the root of all evil, or the leaf of all good?
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 18 2009
- Bill Baxter <wbaxter gmail.com> Feb 18 2009
- BCS <none anon.com> Feb 19 2009
- Daniel Keep <daniel.keep.lists gmail.com> Feb 19 2009
- Michel Fortin <michel.fortin michelf.com> Feb 19 2009
- bearophile <bearophileHUGS lycos.com> Feb 19 2009
- BCS <ao pathlink.com> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Lionello Lunesu <lio lunesu.remove.com> Feb 19 2009
- Leandro Lucarella <llucax gmail.com> Feb 19 2009
- "Denis Koroskin" <2korden gmail.com> Feb 19 2009
- Christopher Wright <dhasenan gmail.com> Feb 19 2009
- bearophile <bearophileHUGS lycos.com> Feb 19 2009
- Daniel Keep <daniel.keep.lists gmail.com> Feb 19 2009
- Christopher Wright <dhasenan gmail.com> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Christopher Wright <dhasenan gmail.com> Feb 20 2009
- "Denis Koroskin" <2korden gmail.com> Feb 19 2009
- "Simen Kjaeraas" <simen.kjaras gmail.com> Feb 19 2009
- "Denis Koroskin" <2korden gmail.com> Feb 19 2009
- Brian <digitalmars brianguertin.com> Feb 19 2009
- Max Samukha <samukha voliacable.com.removethis> Feb 19 2009
- bearophile <bearophileHUGS lycos.com> Feb 19 2009
- Max Samukha <samukha voliacable.com.removethis> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Derek Parnell <derek psych.ward> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Derek Parnell <derek psych.ward> Feb 19 2009
- Sergey Gromov <snake.scaly gmail.com> Feb 20 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Lionello Lunesu <lio lunesu.remove.com> Feb 19 2009
- Benji Smith <dlanguage benjismith.net> Feb 19 2009
- Daniel Keep <daniel.keep.lists gmail.com> Feb 19 2009
- "Denis Koroskin" <2korden gmail.com> Feb 19 2009
- Jarrett Billingsley <jarrett.billingsley gmail.com> Feb 19 2009
- Bill Baxter <wbaxter gmail.com> Feb 20 2009
- Don <nospam nospam.com> Feb 19 2009
- Michel Fortin <michel.fortin michelf.com> Feb 19 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Feb 19 2009
- Daniel Keep <daniel.keep.lists gmail.com> Feb 19 2009
- Leandro Lucarella <llucax gmail.com> Feb 19 2009
- Benji Smith <dlanguage benjismith.net> Feb 19 2009
- bearophile <bearophileHUGS lycos.com> Feb 20 2009
I'm almost done rewriting the regular expression engine, and some pretty
interesting things have transpired.
First, I separated the engine into two parts, one that is the actual
regular expression engine, and the other that is the state of the match
with some particular input. The previous code combined the two into a
huge class. The engine (written by Walter) translates the regex string
into a bytecode-compiled form. Given that there is a deterministic
correspondence between the regex string and the bytecode, the Regex
engine object is in fact invariant and cached by the implementation.
Caching makes for significant time savings even if e.g. the user
repeatedly creates a regular expression engine in a loop.
In contrast, the match state depends on the input string. I defined it
to implement the range interface, so you can either inspect it directly
or iterate it for all matches (if the "g" option was passed to the engine).
The new codebase works with char, wchar, and dchar and any random-access
range as input (forward ranges to come, and at some point in the future
input ranges as well). In spite of the added flexibility, the code size
has shrunk from 3396 lines to 2912 lines. I plan to add support for
binary data (e.g. ubyte - handling binary file formats can benefit a LOT
from regexes) and also, probably unprecedented, support for arbitrary
types such as integers, floating point numbers, structs, what have you.
any type that supports comparison and ranges is a good candidate for
regular expression matching. I'm not sure how regular expression
matching can be harnessed e.g. over arrays of int, but I suspect some
pretty cool applications are just around the corner. We can introduce
that generalization without adding complexity and there is nothing in
principle opposed to it.
The interface is very simple, mainly consisting of the functions
regex(), match(), and sub(), e.g.
foreach (e; match("abracazoo", regex("a[b-e]", "g")))
writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
Two other syntactic options are available:
"abracazoo".match(regex("a[b-e]", "g")))
"abracazoo".match("a[b-e]", "g")
I could have made match a member of regex:
regex("a[b-e]", "g")).match("abracazoo")
but most regex code I've seen mentions the string first and the regex
second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering:
foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
writeln(e);
In general I'm weary of unwitting operator overloading, but I think this
case is more justified than others. Thoughts?
Andrei
Feb 18 2009
On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
No. ~ means matching in Perl. In D it means concatenation. This special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things. What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string. --bb
Feb 18 2009
Hello Bill,What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e);
vote += lots; // I had the same thought as well
Feb 19 2009
BCS wrote:Hello Bill,What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e);
vote += lots; // I had the same thought as well
If a regex represents a set of strings, then wouldn't "abracazoo" in regex("a[b-e]", "g") make more sense? Of course, that's "match" semantics; if you turn it around and say that you're looking for elements from the set in the string, then it's regex("a[b-e]", "g") in "abracazoo" Hmm... None the less, I do prefer the 'in' syntax over the '~' syntax. Please let's no go down the road of co-opting operators to do things other than what they're designed for. If you REALLY want a custom operator, you could always convince Walter to let us define infix functions using Unicode characters. :D -- Daniel
Feb 19 2009
On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter gmail.com> said:On Thu, Feb 19, 2009 at 2:35 PM, Andrei AlexandrescuIn general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
No. ~ means matching in Perl. In D it means concatenation. This special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things.
Indeed. That's why I don't like seeing `~` here.What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string.
That seems reasonable, although if we support it it shouldn't be limited to regular expressions for coherency reasons. For instance: foreach(e; "co" in "conoco") writeln(e); should work too. If we can't make that work in the most simple case, then I'd say it shouldn't with the more complicated ones either. By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string): auto firstMatchIndex = find("conoco", "co"); should work with a regex too: auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g")); -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 19 2009
Michel Fortin:foreach(e; "co" in "conoco") writeln(e); should work too.
Of course :-) I think eventually it will work, it's handy and natural.By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string): auto firstMatchIndex = find("conoco", "co"); should work with a regex too: auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
I agree, I have said the same thing regarding splitter()/xsplitter(). Bye, bearophile
Feb 19 2009
Reply to bearophile,Michel Fortin:foreach(e; "co" in "conoco") writeln(e); should work too.
By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string): auto firstMatchIndex = find("conoco", "co"); should work with a regex too: auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
Bye, bearophile
If the overhead of regex(string) is small enough (vs having it in the find/split function) I'd go with overloads rather than different names.
Feb 19 2009
BCS wrote:Reply to bearophile,Michel Fortin:foreach(e; "co" in "conoco") writeln(e); should work too.
By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string): auto firstMatchIndex = find("conoco", "co"); should work with a regex too: auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
Bye, bearophile
If the overhead of regex(string) is small enough (vs having it in the find/split function) I'd go with overloads rather than different names.
The overhead is low because the last few used regexes are cached. Andrei
Feb 19 2009
Michel Fortin wrote:On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter gmail.com> said:On Thu, Feb 19, 2009 at 2:35 PM, Andrei AlexandrescuIn general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
No. ~ means matching in Perl. In D it means concatenation. This special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things.
Indeed. That's why I don't like seeing `~` here.What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string.
That seems reasonable, although if we support it it shouldn't be limited to regular expressions for coherency reasons. For instance: foreach(e; "co" in "conoco") writeln(e); should work too. If we can't make that work in the most simple case, then I'd say it shouldn't with the more complicated ones either.
Well I'm a bit unhappy about that one. At least in current D and to yours truly, "in" means "fast membership lookup". The use above is linear lookup. I'm not saying that's bad, but I prefer the non-diluted semantics. For linear search, there's always find().By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string): auto firstMatchIndex = find("conoco", "co"); should work with a regex too: auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
If you mean typeof(firstMatchIndex) to be size_t, that's unlikely to be enough. When looking for a regular expression, you need more than just an index - you need captures, "pre" and "post" substrings, the works. That's why matching a string against a regex must return a richer structure that can't be easily integrated with std.algorithm. Andrei
Feb 19 2009
Andrei Alexandrescu wrote:Michel Fortin wrote:That seems reasonable, although if we support it it shouldn't be limited to regular expressions for coherency reasons. For instance: foreach(e; "co" in "conoco") writeln(e); should work too. If we can't make that work in the most simple case, then I'd say it shouldn't with the more complicated ones either.
Well I'm a bit unhappy about that one. At least in current D and to yours truly, "in" means "fast membership lookup". The use above is linear lookup. I'm not saying that's bad, but I prefer the non-diluted semantics. For linear search, there's always find().
At least, "in" refers to a look-up, whereas "~" refers to concatenation, which has nothing in common with the regex matching. Furthermore, we can't make any complexity guarantees for operators; this always depends on the data structure you use the operator on. And, if I'm not mistaken, "in" is only used by the associated array at the moment. It's a "fast look-up" because of the associated array, but it doesn't have to be. (Similarly, to me, ~ and ~= feel slow, O(n), but that shouldn't keep us from using it with other data structures that can do a similar concat/append operation with lower complexity.) L.
Feb 19 2009
Bill Baxter, el 19 de febrero a las 14:50 me escribiste: [snip]regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
No. ~ means matching in Perl. In D it means concatenation. This special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things. What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string.
I think match is pretty short, I don't see any need for any shortcut wich makes the code more obscure. BTW, in case Andrei was looking for a precedent, Python uses the syntax like: regex("a[b-e]", "g")).match("abracazoo") -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- <Palmer> recien estuvimos con el vita... se le paro yo lo vi <Luca> ??????????????????????????????????????????????????????? <Palmer> sisi, cuando vio a josefina <Luca> y quiƩn es josefina? <Palmer> Mi computadora nuevaaaaa
Feb 19 2009
Leandro Lucarella wrote:Bill Baxter, el 19 de febrero a las 14:50 me escribiste: [snip]regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things. What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string.
I think match is pretty short, I don't see any need for any shortcut wich makes the code more obscure. BTW, in case Andrei was looking for a precedent, Python uses the syntax like: regex("a[b-e]", "g")).match("abracazoo")
Yah, but since even bearophile admitted python kinda botched regexes, I better not consider this argument :o). The Unix toolchain invariably puts the string before the regex. Andrei
Feb 19 2009
On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? Andrei
"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
Feb 19 2009
Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons. First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated. Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
Feb 19 2009
Simen Kjaeraas:auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo" / re) { }
D has operator overload, that Java lacks, but this fact doesn't force you to use them even when they are unreadable. For people that like the "in" there I'd like to remind how it can look once (if it will ever happen) the foreach uses "in" too: foreach (e in (re in "abracazoo")) {...} Bye, bearophile
Feb 19 2009
bearophile wrote:Simen Kjaeraas:auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo" / re) { }
D has operator overload, that Java lacks, but this fact doesn't force you to use them even when they are unreadable. For people that like the "in" there I'd like to remind how it can look once (if it will ever happen) the foreach uses "in" too: foreach (e in (re in "abracazoo")) {...} Bye, bearophile
But it doesn't, and I can't see how it could given how confusing it would make things. Besides which, we shouldn't be making judgements based on possible, not planned for syntax changes at some unspecified point in the future. We have enough trouble with deciding on things as it is. :P -- Daniel
Feb 19 2009
Denis Koroskin wrote:On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan gmail.com> wrote:Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons. First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated. Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo".match(re)) { // what's wrong with that? }
Your first example was: auto match (char[] source, char[] pattern, char[] options); Your second example was: auto match (char[] source, regex expression); The second is good, but more typing than you said originally. The first is problematic.
Feb 19 2009
Christopher Wright wrote:Denis Koroskin wrote:On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan gmail.com> wrote:Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons. First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated. Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo".match(re)) { // what's wrong with that? }
Your first example was: auto match (char[] source, char[] pattern, char[] options); Your second example was: auto match (char[] source, regex expression); The second is good, but more typing than you said originally. The first is problematic.
Why is it problematic? Is the name "match" too common? Andrei
Feb 19 2009
Andrei Alexandrescu wrote:Christopher Wright wrote:Your first example was: auto match (char[] source, char[] pattern, char[] options); Your second example was: auto match (char[] source, regex expression); The second is good, but more typing than you said originally. The first is problematic.
Why is it problematic? Is the name "match" too common? Andrei
No. What is the difference between those two? One is building a regex internally and not letting me store it; and it forces me to pass two parameters rather than one. The other takes a regex as a parameter, which I can store, and I can set both options (the pattern and the match options) once for all.
Feb 20 2009
On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan gmail.com> wrote:Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons. First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated. Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo".match(re)) { // what's wrong with that? }
Feb 19 2009
On Thu, 19 Feb 2009 14:34:30 +0100, Denis Koroskin <2korden gmail.com> wrote:On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan gmail.com> wrote:Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons. First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated. Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo".match(re)) { // what's wrong with that? }
This: auto re = regex("a[b-e]", "g"); foreach (e; "abracazoo" / re) { } -- Simen
Feb 19 2009
On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? Andrei
"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over '~' version. 'in' is also fine (both ways).
Feb 19 2009
On Thu, 19 Feb 2009 11:31:42 +0300, Denis Koroskin wrote:"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over '~' version. 'in' is also fine (both ways).
i dont see a problem either with just using .match. if you use spaces around the ~ then its actually to more characters plus needing to hit shift. certainly not as simple as .match
Feb 19 2009
On Wed, 18 Feb 2009 21:35:20 -0800, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
Please anything but ~. It would be fine, if it didn't follow an array. It could be 'in' as Bill suggested. Or maybe /, which reminds of sed and downs: foreach (e; "abracazoo"/regex("a[b-e]", "g")) writeln(e); If you made 'match' a member of Regex, another option would be the opCall: regex("a[b-e]", "g")( "abracazoo")
Feb 19 2009
Andrei Alexandrescu:but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
I like the following syntaxes (the one with .match() too): import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e); ---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code. ---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments ---------------- If not already so, I'd like sub() to take as replacement a string or a callable. Bye, bearophile
Feb 19 2009
On Thu, 19 Feb 2009 06:47:57 -0500, bearophile <bearophileHUGS lycos.com> wrote:If not already so, I'd like sub() to take as replacement a string or a callable. Bye, bearophile
I don't like 'sub' because it can denote anything. The most confusing is substring. 'replace' seems to be better.
Feb 19 2009
Max Samukha wrote:On Thu, 19 Feb 2009 06:47:57 -0500, bearophile <bearophileHUGS lycos.com> wrote:If not already so, I'd like sub() to take as replacement a string or a callable. Bye, bearophile
I don't like 'sub' because it can denote anything. The most confusing is substring. 'replace' seems to be better.
Ok. Andrei
Feb 19 2009
bearophile wrote:Andrei Alexandrescu:but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
I like the following syntaxes (the one with .match() too): import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e);
These all put the regex before the string, something many people would find unsavory.---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code.
Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.If not already so, I'd like sub() to take as replacement a string or a callable.
It does, I haven't mentioned it yet. Pass-by-alias of course :o). Andrei
Feb 19 2009
On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:These all put the regex before the string, something many people would find unsavory.
I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string". -- Derek Parnell
Feb 19 2009
Derek Parnell wrote:On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:These all put the regex before the string, something many people would find unsavory.
I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string".
Yah, but to most others it's "match this string against that pattern". Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments, something . Anyway, std.algorithm defines find() like this: find(haystack, needle) In the least structured case, the haystack is a range and needle is either an element or another range. But then we can think, hey, we can think of efficient finds by using a more structured haystack and/or a more structured needle. So then: string a = "conoco", b = "co"; // linear find auto r1 = find(a, b[0]); // quadratic find auto r2 = find(a, b); // organize a in a Boyer-Moore structure; sublinear find auto r3 = find(boyerMoore(a), b); I'll actually implement the above, it's pretty nice. Now the question is, what's the haystack and what's the needle in a regex find? auto r3 = find("conoco", regex("c[a-z]")); or auto r3 = find(regex("c[a-z]"), "conoco"); ? The argument could go both ways: "Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack." versus "Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'." What is the most natural way? Andrei
Feb 19 2009
Andrei Alexandrescu wrote:Derek Parnell wrote:On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:These all put the regex before the string, something many people would find unsavory.
I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string".
Yah, but to most others it's "match this string against that pattern". Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments, something .
... "I'm not thrilled about". Andrei
Feb 19 2009
On Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:Derek Parnell wrote:On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:These all put the regex before the string, something many people would find unsavory.
I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string".
Yah, but to most others it's "match this string against that pattern".
I might not be normal ;-)Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments, something . Anyway, std.algorithm defines find() like this: find(haystack, needle)
I use the Euphoria language a lot, and its routine API is find(needle, haystack), so I'm sure this is where my normality springs from.What is the most natural way?
Get your "personal assistant" to do it. -- Derek Parnell Melbourne, Australia skype: derek.j.parnell
Feb 19 2009
Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:The argument could go both ways: "Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack." versus "Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'." What is the most natural way?
I think calling a regex a 'haystack' is a far-fetched metaphor. A haystack is a pile of stuff, and a needle is a precise thing you're looking for. I think they're unambiguous. Also, the in operator doesn't leave you guessing whether you should put a haystack or a needle first.
Feb 20 2009
Denis Koroskin wrote:On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:bearophile wrote:Andrei Alexandrescu:but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e);
These all put the regex before the string, something many people would find unsavory.---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code.
Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.
I got disabused a very long time ago of the notion that everything about regexes is clear or self-documenting. Really. You just get to a level of understanding that's appropriate for your needs. On that scale, getting used to "gmi" is so low, it's not even worth discussing. Andrei
Feb 19 2009
Jarrett Billingsley wrote:On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
While we're on the subject I'd like to mention that an unbelievably overwhelming proportion of the time, when I use regexen, I want them to be global. As in, I don't think I've ever used a non-global regex. To that effect I'd like to propose that either "g" be the default attribute, or that it should be on _unless_ some other attribute ("o" for once?) is present. I think this is one thing that Perl got terribly wrong.
Well I agree for searches but not for substitutions. In D searches, the lazy way of matching means you can always go with "g" and change your mind whenever you please. I think I'll simply eliminate "g" from the offered options for search. Andrei
Feb 19 2009
Denis Koroskin wrote:On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:bearophile wrote:Andrei Alexandrescu:but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e);
These all put the regex before the string, something many people would find unsavory.---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code.
Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.
I think it's worth an overload! (I also keep forgetting those flags.) In fact, *the first thing* the current RegExp.compile does is convert the string attributes to enum flags! L.
Feb 19 2009
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.
I prefer the enum options too. But not vociferously. I could live with the single-char flags. --benji
Feb 19 2009
Benji Smith wrote:And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.
I prefer the enum options too. But not vociferously. I could live with the single-char flags. --benji
I dislike enum options because it dramatically bloats the code, in terms of how much typing it takes. This is another thing that Visual Basic actually got right [1]; instead of: Match(string, "a[b-e]", Regex.Repeat | Regex.IgnoreCase) you could use this: Match(string, "a[b-e]", Repeat | IgnoreCase) Since the compiler knew the type of that third argument, it allowed you to omit the prefix. If D did that, I would completely reverse my dislike of enums and DEFINITELY prefer them over strings; you could always have this: enum Regex { Repeat, IgnoreCase, R = Repeat, I = IgnoreCase, } And then instead of "ri" you have R|I which is actually shorter AND safer! But yeah; I think Walter said he wanted to do this ages and ages ago, but it never happened. -- Daniel [1] It's funny how many things that poor language *did* get right. I mean, yeah, it's a terrible language from a design standpoint, but boy did it ever let you just get shit done.
Feb 19 2009
On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:bearophile wrote:Andrei Alexandrescu:but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e);
These all put the regex before the string, something many people would find unsavory.---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code.
Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.If not already so, I'd like sub() to take as replacement a string or a callable.
It does, I haven't mentioned it yet. Pass-by-alias of course :o). Andrei
Feb 19 2009
On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments
And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
While we're on the subject I'd like to mention that an unbelievably overwhelming proportion of the time, when I use regexen, I want them to be global. As in, I don't think I've ever used a non-global regex. To that effect I'd like to propose that either "g" be the default attribute, or that it should be on _unless_ some other attribute ("o" for once?) is present. I think this is one thing that Perl got terribly wrong.
Feb 19 2009
On Fri, Feb 20, 2009 at 9:59 PM, Sergey Gromov <snake.scaly gmail.com> wrote:Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:The argument could go both ways: "Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack." versus "Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'." What is the most natural way?
I think calling a regex a 'haystack' is a far-fetched metaphor. A haystack is a pile of stuff, and a needle is a precise thing you're looking for. I think they're unambiguous.
I thought so too. It's a stretch. And I also agree with the various posts people have made about allowing plain char[] search strings to be interchangeable with regex'es as much as possible. And in that light saying you're looking for [big-string] inside of [sub-string] just sounds ridiculous. But a string is certainly a kind of special-case regex that just describes a set consisting 1 element. So yeh, you /could/ say you're looking for matches inside that set, but it's quite a stretch.Also, the in operator doesn't leave you guessing whether you should put a haystack or a needle first.
That's a good point in it's favor. As long as you aren't one of these folks who thinks "looking for matches inside a regular grammar" is just as reasonable as "looking for a pattern inside a string". --bb
Feb 20 2009
Andrei Alexandrescu wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? Andrei
I agree with the comments against ~. I believe this Perl6 document is a must-read: http://dev.perl.org/perl6/doc/design/apo/A05.html There are some excellent observations there, especially near the beginning. By separating the engine from the state of the match, you open the possibilty of subsequently providing cleaner regex syntax. I do wonder though, how you'd deal with a regex which includes a match to a literal string provided as a variable. Would this be passed to the engine, or to the match state? If the engine is using backtracking, there's no difference in the generated bytecode; but if it's creating an automata, the compiled engine depends on the contents of the string variable.
Feb 19 2009
Don wrote:Andrei Alexandrescu wrote:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g") I could have made match a member of regex: regex("a[b-e]", "g")).match("abracazoo") but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. Now, match() is likely to be called very often so I'm considering: foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) writeln(e); In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? Andrei
I agree with the comments against ~. I believe this Perl6 document is a must-read: http://dev.perl.org/perl6/doc/design/apo/A05.html There are some excellent observations there, especially near the beginning. By separating the engine from the state of the match, you open the possibilty of subsequently providing cleaner regex syntax.
I'd read it a while ago, but a refresher is in order. Thanks!I do wonder though, how you'd deal with a regex which includes a match to a literal string provided as a variable. Would this be passed to the engine, or to the match state?
At the moment these are not supported. It's a good question.If the engine is using backtracking, there's no difference in the generated bytecode; but if it's creating an automata, the compiled engine depends on the contents of the string variable.
The current engine is, to the best of my understanding, using backtracking. At least when there's an "or", it tries both matches as recursive calls and picks the longest. Andrei
Feb 19 2009
On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
I don't like `sub`, I mean the name. Makes me think of substring more than substitute. My choice would be to reuse what we have in std.string and augment it to work with regular expressions: auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1")); This way it works consistently whether you're using a string or a regular expression: just replace any pattern string with regex(...) and any replacement string with subex(...) -- "substition-expression" -- when you want them to be parsed as such. Omitting subex in the above would make it a plain string replacement for instance (this way it's easy to place use a variable there). These functions should allow easy substitution of any string or regex pattern with another algorithm for matching the pattern. And there's not way to get a range of matches using std.string, but there should be, and it should follow the same rule as above: supporting strings and regex consistently. (Using the `in` operator as suggested by Bill Baxter seems a good fit for this function.) And if any of you complains about the extra verbosity, here's what I suggest: auto s = replace("abracazoo", re"a([b-e])"g, se"A$1"); Yes, syntaxic sugar for declaring regular expressions.Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g")
I despise the second one, because if you omit regex(...) it makes me think you're checking for string matches, not expression matches. There's nothing in the name of the funciton telling you you're dealing with a regular expression, so it could easily get confusing. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 19 2009
Michel Fortin wrote:On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
I don't like `sub`, I mean the name. Makes me think of substring more than substitute. My choice would be to reuse what we have in std.string and augment it to work with regular expressions: auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1"));
Ok. Probably subex is a bit of a killer, but I see your point (subex is not an arbitrary string).This way it works consistently whether you're using a string or a regular expression: just replace any pattern string with regex(...) and any replacement string with subex(...) -- "substition-expression" -- when you want them to be parsed as such. Omitting subex in the above would make it a plain string replacement for instance (this way it's easy to place use a variable there).
Indeed, that was part of the impetus for making regex a distinct type that participates in larger functions. The only problem is that regex does not work with std.algorithm in an obvious way, e.g. find() works very differently for strings and regexes. I considered at a point trying to integrate them, but decided to not spend that effort right now.These functions should allow easy substitution of any string or regex pattern with another algorithm for matching the pattern. And there's not way to get a range of matches using std.string, but there should be, and it should follow the same rule as above: supporting strings and regex consistently. (Using the `in` operator as suggested by Bill Baxter seems a good fit for this function.)
I defined the following in std.algorithm (signatures simplified): // Split a range by a 1-element separator Splitter!(...) splitter(Range, Element)(Range input, Range separator); // Split a range by a subrange separator Splitter!(...) splitter(Range)(Range input, Range separator); I then defined this in std.regex: // Split a range by a subrange separator Splitter!(...) splitter(Range)(Range input, Regex separator); Now this is very nice because you get to switch from one to another very easily. foreach (e; splitter(input, ',')) { ... } foreach (e; splitter(input, ", ")) { ... } foreach (e; splitter(input, regex(", *"))) { ... } The speed/flexibility tradeoff is self-evident and under the control of the programmer without much fuss as it's very easy to switch from one form to another.And if any of you complains about the extra verbosity, here's what I suggest: auto s = replace("abracazoo", re"a([b-e])"g, se"A$1"); Yes, syntaxic sugar for declaring regular expressions.Two other syntactic options are available: "abracazoo".match(regex("a[b-e]", "g"))) "abracazoo".match("a[b-e]", "g")
I despise the second one, because if you omit regex(...) it makes me think you're checking for string matches, not expression matches. There's nothing in the name of the funciton telling you you're dealing with a regular expression, so it could easily get confusing.
This is yet another proof that discussion of syntax, notation, and naming will never go out of fashion. I was half convinced by the others that we're in good shape with input.match(regex). Andrei
Feb 19 2009
Michel Fortin wrote:[snip] And if any of you complains about the extra verbosity, here's what I suggest: auto s = replace("abracazoo", re"a([b-e])"g, se"A$1"); Yes, syntaxic sugar for declaring regular expressions.
Didn't D previously have special regex literals that got dropped for being unpopular and/or hated? -- Daniel
Feb 19 2009
Andrei Alexandrescu, el 18 de febrero a las 21:35 me escribiste:I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. foreach (e; match("abracazoo", regex("a[b-e]", "g"))) writeln(e.pre, e.hit, e.post); auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
BTW, why are the flags passed as string and not as an integer mask? For example: auto s = regex.sub("abracazoo", regex.regex("a([b-e])", regex.G), "A$1"); This way you can catch a few errors at compile-time. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- When I was a child I had a fever My hands felt just like two balloons. Now I've got that feeling once again I can't explain you would not understand This is not how I am. I have become comfortably numb.
Feb 19 2009
Some of the things I'd like to see in the regex implementation:
All functions accepting a compiled regex object/struct should also
accept a string version of the pattern (and vice versa). Some
implementations (Java) only accept the compiled version in some places
and the string pattern in other places. That's annoying.
Just like with ordinary string-searching functions, you should be able
to specify a start position (and maybe an end position) for the search.
Even if the match exists somewhere in the string, it fails if not found
within the target slice. Something like this:
auto text = "ABCDEFG";
auto pattern = regex("[ABCEFG]");
// returns false, because the char at position 3 does not match
auto result = match(text, 3);
// this should be exactly equivalent (but the previous version
// uses less memory, and ought to work with infinite ranges, whereas
// the slice version wouldn't make any sense)
auto equivalent = match(text[3..$]);
I've needed to use this technique in a few cases to implement a simple
lexical scanner, and it's a godsend, if the regex engine supports it
(though most don't).
Finally, it'd be extremely cool if the regex compiler automatically
eliminated redundant nodes from its NFA, converting as much of it as
possible to a DFA. I did some work on this a few years ago, and it's
actually remarkably simple to implement using prefix trees.
// These two expressions produce an identical set of matches,
// but the first one is functionally an NFA, while the second
// one is a DFA.
auto a = regex("(cat|car|cry|dog|door|dry)");
auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry)");
In cases where the expression can only be partially simplified, you can
leave some NFA nodes deep within the tree, while still DFA-ifying the
rest of it:
auto a = regex("(attitude|attribute|att.+ion");
auto b = regex("(att(?:itude|ribute|.+ion)");
It's a very simple transformation, increases speed (dramatically) for
complex regular expressions (especially those produced dynamically at
runtime by combining large sets of unrelated target expressions), and it
reliably produces equivalent results with the inefficient version.
The only really tricky part is if the subexpressions have their own
capturing groups, in which case the DFA transformation screws up the
ordinal-numbering of the resultant captures.
Anyhoo...
I don't have any strong feelings about the function names (though I'd
rather have functions that operators, like "~", for searching and matching).
And I don't have any strong feelings about whether the compiled regex is
an object or a struct (though I prefer reference semantics over value
semantics for regexen, and right now, I think that makes objects the
(slightly) better choice).
Thanks for your hard work! I've implemented a small regex engine before,
so I know it's no small chunk of effort. Regular expressions are my
personal favorite "tiny language", and I'm glad to see them get some
special attention in phobos2.
--benji
Feb 19 2009
Benji Smith:It's a very simple transformation, increases speed (dramatically) for complex regular expressions (especially those produced dynamically at runtime by combining large sets of unrelated target expressions), and it reliably produces equivalent results with the inefficient version.
See: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm Something like that can be implemented as small pre-processing layer over the re module. Bye, bearophile
Feb 20 2009









Daniel Keep <daniel.keep.lists gmail.com> 