digitalmars.D - earthquake changes of std.regexp to come

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I'm quite unhappy with the API of std.regexp. It's a chaotic design that 
provides a hodgepodge of functionality and tries to use as many synonyms 
of "to find" in the dictionary (e.g. search, match). I could swear 
Walter never really cared for using regexps, and that is felt throughout 
the design: it fills the bullet point but it's asinine to use.

Besides, std.regexp only works with (narrow) strings and we want it to 
work on streams of all widths and structures. One pet complaint I have 
is that std.regexp puts a class around it all as if everybody's favorite 
pastime would be to inherit Regexp and override some random function in it.

In the upcoming releases of D 2.0 there will be rather dramatic breaking 
changes of phobos. I just wanted to ask whether y'all could stomach yet 
another rewritten API or you'd rather use std.regexp as it is for the 
time being.


Andrei
Feb 17 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don't be too hard on the good Walter, please :-) One good thing in his
designs (in D1) is that they are often simple to use: they give you back much
more than you give them. D2 seems to ask much more from the programmer.

I agree that the API of regexes in Phobos is not much good, but I think
designing a good API for it is quite hard.


 I just wanted to ask whether y'all could stomach yet 
 another rewritten API or you'd rather use std.regexp as it is for the 
 time being.

I have no problems in accepting changes here too. D2 is already essentially another language compared to D1.

Regarding regexes of D1 Phobos, it has problems bigger than just the API: in the past I have found some common cases where it is O(n^2) or worse. You can see a case of such behaviour here (look at my comments that show which parts are slow; I have also commented out versions that are more logical but much slower):
http://shootout.alioth.debian.org/debian/benchmark.php?test=regexdna&lang=gdc&id=4

If you want to test that code you can generate test data with this other code:
http://shootout.alioth.debian.org/debian/benchmark.php?test=fasta&lang=dlang&id=1

Bye,
bearophile
Feb 17 2009
parent reply "Joel C. Salomon" <joelcsalomon gmail.com> writes:
bearophile wrote:
 I agree that the API of regexes in Phobos is not much good, but I think
designing a good API for it is quite hard.

So steal one, rather than invent something new. My suggestion would be to expose the DFA object, as in Plan 9’s library (documentation at <http://plan9.bell-labs.com/magic/man2html/2/regexp>, implementation at <http://plan9.bell-labs.com/sources/plan9/sys/src/libregexp/>, discussion and links to a Unix implementation at <http://swtch.com/~rsc/regexp/>).

Simple API:
• regcomp: Compile a regexp DFA;
• regexec: Apply it to a string, returning a slice of the string that matches the first hit (or an array of slices if parenthesized expressions are used); and
• regsub: Apply substitutions to subexpressions of the matching slice.

—Joel Salomon
Feb 17 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Joel C. Salomon wrote:
 regexec: Apply it to a string, returning a slice of the string that matches the first hit (or an array of slices if parenthesized expressions are used); and

s/string/input range/

Also, returning a range instead of an array of slices is more flexible.

Andrei
Feb 17 2009
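
A minimal sketch of the compile / apply / substitute shape discussed above. It assumes the std.regex module of present-day Phobos, which postdates this thread and is not the std.regexp module being criticized; the pattern and input strings are made up for illustration. The match result is a range of slices of the input rather than a freshly allocated array, which is roughly what the reply above asks for.

```d
import std.regex : matchFirst, regex, replaceFirst;
import std.stdio : writeln;

void main()
{
    // Compile once (the "regcomp" step).
    auto re = regex(`(\w+)@(\w+)\.org`);
    auto input = "mail bob@example.org now";

    // Apply to the input (the "regexec" step). The result is a range of
    // slices of the input: the whole hit first, then each group.
    auto m = matchFirst(input, re);
    if (!m.empty)
    {
        writeln(m.hit); // bob@example.org
        writeln(m[1]);  // bob
        writeln(m[2]);  // example
    }

    // Substitute using the captured groups (the "regsub" step).
    writeln(replaceFirst(input, re, "$2:$1")); // mail example:bob now
}
```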
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
As I've said before, anyone who can't stomach breaking changes w/o complaining has no business using D2 at this point. I'd rather deal with the aggravation of stuff breaking in the short run to have a nice language and libraries to go with it in the long run.

This whole concept of ranges as you've created them seems to have achieved the holy grail of both making simple things simple and complex things possible, where "complex things" includes needing code to be efficient, so I can see your reason for wanting to redo all kinds of stuff in them. This compares favorably to C++ STL iterators, which are very flexible and efficient but a huge PITA to use for simple things because the syntax is so low-level and ugly, and to the D1/early D2 way, which gives beautiful, simple notation for the more common cases (basic dynamic arrays), at the expense of flexibility when doing more complicated things like streams, chaining, strides, etc.
Feb 17 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
BTW, can you elaborate on how arrays, both builtin and any library versions, will work when everything is finalized?
Feb 17 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
BTW, can you elaborate on how arrays, both builtin and any library versions, will work when everything is finalized?

Well, finalization hinges not only on me but on Walter (bugfixes and a couple of new features) and on all of you with the continuous stream of great suggestions and ideas. Again, without being able to experiment much I don't have a clear idea of what arrays/containers should ideally look like. The interesting challenge is accommodating good, precise semantics with the freedom given by garbage collection. Here are some highlights:

* Today's T[] will be firmly an incarnation of the random-access range concept, to the extent that all code expecting a random-access range can always be passed a T[] without any impedance adaptation.
* $ will be generalized to mean "end of range" even for infinite ranges.
* We don't have a solution to address the perils of extending a slice by using ~=. We're considering adding the type T[new], but I'm not sure we should take the hit of a new built-in type constructor, particularly when it's implementable as a library.
* Fixed-size arrays will in all likelihood be value types. We couldn't find any other semantics that works.
* Containers will have value semantics.
* "Resources come and go; memory is forever" is the likely default in D resource management. This means that destroying e.g. an array of File objects will close the underlying files, but will not deallocate the memory allocated for them. In essence, destroying values means calling the destructor but not delete-ing them (unless of course they're on the stack). This approach has a number of disadvantages, but plenty of advantages that compensate for them in most applications.
* std.matrix will define memory layouts for a variety of popular libraries and also the common means to iterate said layouts.
* For those who want containers with reference semantics, they can use the type Class!(T) for any value type T. That includes built-in value types (int, float...) and whichever value containers we define.

It's unclear to me whether this is enough to satisfy those in need of complex container hierarchies.

Andrei
Feb 17 2009
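
Two of the highlights above (T[] as a random-access range, fixed-size arrays as value types) can be sketched as follows. This assumes D as it later shipped, with std.range.primitives from present-day Phobos; at the time of the post these were still plans. The helper name `middle` is made up for illustration.

```d
import std.range.primitives : hasLength, isRandomAccessRange;
import std.stdio : writeln;

// Generic code written against the random-access range concept...
auto middle(R)(R r)
    if (isRandomAccessRange!R && hasLength!R)
{
    return r[r.length / 2];
}

void main()
{
    // ...accepts a plain T[] with no adapter in between.
    static assert(isRandomAccessRange!(int[]));
    int[] dyn = [10, 20, 30];
    writeln(middle(dyn)); // 20

    // Fixed-size arrays are copied by value.
    int[3] a = [1, 2, 3];
    int[3] b = a;
    b[0] = 99;
    assert(a[0] == 1);

    // A slice, by contrast, refers to the same storage.
    int[] s = a[];
    s[1] = 42;
    assert(a[1] == 42);
}
```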
next sibling parent Yigal Chripun <yigal100 gmail.com> writes:
I've got a few questions about the proposed container value semantics:

a) I'd like to be able to do, for instance:
   List lst = new LinkedList();
i.e. use interfaces everywhere, and especially in functions, so that I can switch implementations easily when the need arises. In the above I can choose to use a singly or doubly linked list without making changes throughout the code by using the List interface. Will this be possible, and how? Is D going to get proper struct interfaces?

b) It is sometimes useful to have a container!(Base) store references to instances of derived classes. A canonical example of this is a container of Widget class in a UI framework, where you can, for instance, iterate over the container and paint all the different kinds of widgets on the screen by calling the virtual paint method of the base class. How can this be implemented with your proposed Class template?

-- Yigal
Feb 18 2009
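
The scenario in question b) can be sketched with what D already offers: class instances have reference semantics, so a plain array of the base class stores references to derived objects. This does not use the proposed Class!(T) template, whose design the thread leaves open; Widget, Button, Label and paintAll are hypothetical names, and paint returns a string only so the result is easy to inspect.

```d
import std.stdio : writeln;

class Widget { string paint() { return "widget"; } }
class Button : Widget { override string paint() { return "button"; } }
class Label : Widget { override string paint() { return "label"; } }

// Calls the virtual paint method through base-class references.
string[] paintAll(Widget[] widgets)
{
    string[] painted;
    foreach (w; widgets)
        painted ~= w.paint();
    return painted;
}

void main()
{
    Widget[] widgets;
    widgets ~= new Button;
    widgets ~= new Label;
    widgets ~= new Widget;
    writeln(paintAll(widgets)); // ["button", "label", "widget"]
}
```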
prev sibling next sibling parent Yigal Chripun <yigal100 gmail.com> writes:
Another question regarding the container design: have you considered mutable containers vs. functional-style immutable containers? Does it make sense to provide both options?
Feb 18 2009
prev sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 * "Resources come and go; memory is forever" is the likely default in D 
 resource management. This means that destroying e.g. an array of File 
 objects will close the underlying files, but will not deallocate the 
 memory allocated for them. In essence, destroying values means calling 
 the destructor but not delete-ing them (unless of course they're on the 
 stack). This approach has a number of disadvantages, but plenty of 
 advantages that compensate them in most applications.

I admit I'm tired right now... You mention disadvantages, the one I can't avoid thinking of is memory leak! Which means you can't write e.g. a simple web server that opens and closes files, instead of creating and managing a file object pool? Eventually it'll run out of memory, unless I'm way too tired now...
 * std.matrix will define memory layouts for a variety of popular 
 libraries and also the common means to iterate said layouts.

I assume this is for handy and practical rectangular (and cubic, etc.) "arrays". Which would be most welcome. This "memory is forever" philosophy, is this discussed in depth somewhere? (With the current amount of traffic here, I simply can't follow every thread anymore. :-( )
Feb 20 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
I admit I'm tired right now... You mention disadvantages, the one I can't avoid thinking of is memory leak! Which means you can't write e.g. a simple web server that opens and closes files, instead of creating and managing a file object pool? Eventually it'll run out of memory, unless I'm way too tired now...

Better said, I was too tired when I posted that. I gave too little detail. Files are resources, so they will "come and go", i.e. will be under deterministic control; there's no need to worry. Only memory will have a "lives forever" regime for safety reasons. It's not really forever as the GC collects it. In short, my proposed system is to admit that GC is good _only_ for memory, and that deterministic management must prevail for other resources. I'll get back later on this.
 * std.matrix will define memory layouts for a variety of popular 
 libraries and also the common means to iterate said layouts.

I assume this is for handy and practical rectangular (and cubic, etc.) "arrays". Which would be most welcome. This "memory is forever" philosophy, is this discussed in depth somewhere? (With the current amount of traffic here, I simply can't follow every thread anymore. :-( )

I decided to curb my posting as well. Beyond a point even passable content becomes just white noise. Also since we don't have an off-topic group, off-topic discussions tend to carry on here as well and are not trivial to ignore. I'm happy they are civilized (congrats to all involved). Andrei
Feb 20 2009
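
The split described in the reply above (deterministic control for resources, garbage collection for memory) can be sketched with a struct destructor. Handle is a hypothetical stand-in for something like an open file, and the counter exists only to make the behaviour observable; this is an illustration under those assumptions, not the design that was eventually adopted.

```d
import std.stdio : writeln;

// Hypothetical stand-in for a resource such as an open file.
struct Handle
{
    static int openCount;   // how many handles are currently open
    private bool live;

    this(string name) { live = true; ++openCount; }
    ~this() { if (live) { live = false; --openCount; } }
    @disable this(this);    // no copies, so each handle closes exactly once
}

int openDuringScope()
{
    auto h = Handle("log.txt");
    auto buffer = new ubyte[](4096); // GC memory: never freed explicitly
    return Handle.openCount;
}   // h's destructor runs here, deterministically

void main()
{
    writeln(openDuringScope()); // 1
    writeln(Handle.openCount);  // 0
}
```

The buffer is simply left to the collector, while the handle is released at a known point.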
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Besides std.regexp only works with (narrow) strings and we want it to work
 on streams of all widths and structures. One pet complaint I have is that
 std.regexp puts a class around it all as if everybody's favorite pastime
 would be to inherit Regexp and override some random function in it.

So what do you think it should be, a struct? That would imply to me that everybody's favorite pastime is making value copies of regex structures, when in fact nobody does that.

Regex is a class in order to give it reference semantics and provide encapsulation of some re-usable state. Maybe it should be a final class, but my impression is "final class" doesn't really work in D.

--bb
Feb 17 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Besides std.regexp only works with (narrow) strings and we want it to work
 on streams of all widths and structures. One pet complaint I have is that
 std.regexp puts a class around it all as if everybody's favorite pastime
 would be to inherit Regexp and override some random function in it.

So what do you think it should be, a struct?

Yes.
 That would imply to me that everybody's favorite pastime is making
 value copies of regex structures, when in fact nobody does that.

Well you'd be surprised. The RegExp class saves the state of the last search, which is a sensible thing to do. But then consider a simple range Splitter that, when iterated, nicely gives you...

 string a = ",a, bcd, def,gh,";
 foreach (e; splitter(a, pattern(", *")))
     writeln("[", e, "]");

writes

[]
[a]
[bcd]
[def]
[gh]

This is similar to the function std.regex.split with the notable difference that no extra memory is allocated.

Now Splitter is an input range. This means you wouldn't expect that you copy a Splitter and then have iterating the original value affect the copy. Well, that's exactly what happens when you use the "good" reference semantics of the RegExp stored inside splitter. Worse, RegExp has no cloning primitive, so I need to resort to storing the pattern and recompiling it from scratch at every copy of Splitter. So essentially the "good" semantics of RegExp are useless when it comes to composing it in larger objects.
 Regex is a class in order to give it reference semantics and provide
 encapsulation of some re-usable state.  Maybe it should be a final
 class, but my impression is "final class" doesn't really work in D.

Re-usable state is provided by structs too. In addition they can choose value vs. reference semantics with ease. Andrei
Feb 17 2009
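
For comparison, a sketch of the same loop written against the std.regex module of present-day Phobos, which postdates this thread. Here `regex` and `splitter` are that module's names (the `pattern` helper in the post above is not part of it), the compiled pattern is a struct value rather than a class instance, and the pieces are produced lazily as slices of the input. The helper `bracketed` is made up for illustration.

```d
import std.regex : regex, splitter;
import std.stdio : writeln;

// Wraps each lazily produced piece in brackets, as in the post's loop.
string[] bracketed(string input, string separator)
{
    string[] result;
    foreach (piece; splitter(input, regex(separator)))
        result ~= "[" ~ piece ~ "]";
    return result;
}

void main()
{
    foreach (line; bracketed(",a, bcd, def,gh,", ", *"))
        writeln(line);
}
```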
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 string a = ",a,  bcd, def,gh,";
 foreach (e; splitter(a, pattern(", *")))
      writeln("[", e, "]");

(I often use xsplit(), which is like split but yields items lazily; for larger strings it's much faster.) A better approach is to fuse xsplit and this splitter function into a single lazy generator that can take as a second argument a string, a char, or an RE pattern. A third optional argument can be the max number of splits (after that maximum it yields all the rest of the string). You can then add an eager splitter function with the same signature that outputs an array.

Bye,
bearophile
Feb 17 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 I think this choice is not so much available with D1, plus the
 constructor situation with D1 is less than ideal.  Given that, I think
 the choice of class for RegEx was appropriate.   But if the struct
 problems are all going away in D2, then that's great.  Sounds like
 you're saying we'll really be able to use D structs just like one uses
 a non-polymorphic C++ class.  If so, then that's super.

I lost that perspective when criticizing RegExp, you're right. But still the API is lousy - every single time I am using a RegExp, I find myself fumbling through the thoroughly overlapping primitives in the documentation, and never seem to find an idiom that's simple, comfortable, and memorable. Andrei
Feb 17 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter:
 Python's syntax I have to look over the documentation every time I use it, too.
 Maybe it's because of the "matching" vs "searching" distinction that I find
 impossible to remember.

I agree, I too need the Python docs every time I want to use something more than the basics. The syntax for catching groups is bad too (groups? group? itersomething? etc.). I have proposed an improvement (using [5] to grab the 5th group), but it was not implemented. Such syntax is possible in D too *hint*.

It's because of situations like this that I say that designing a good API for std.re isn't easy at all. It will require care, brain, and maybe two or more tries :-)

Bye,
bearophile
Feb 17 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jarrett Billingsley wrote:
 On Tue, Feb 17, 2009 at 7:13 PM, Bill Baxter <wbaxter gmail.com> wrote:
 Ok.  I'm certainly not in love with the API either.  Though, the only
 RegEx API I've ever used that felt totally comfortable with was
 Perl's, which in large part is syntax instead of an API.  Python's
 syntax I have to look over the documentation every time I use it, too.
  Maybe it's because of the "matching" vs "searching" distinction that
 I find impossible to remember.
 (http://docs.python.org/library/re.html)

Is there ever a situation where you want to use a single regexp for both matching _and_ searching? And if not, couldn't you just use ^ to anchor it? I never understood why Python's API makes such a distinction.

Ehm, that's odd. You'd think that after Perl has set the precedent, it would be hard to do major goofs in designing a regex API.

By the way, the more I dig into std.regexp, the stiffer the hair on my neck gets. Get this: the API offers both global functions and member functions, with both RegExp and plain string arguments. The latter are carefully designed to maximize the number of clashes, potential confusions, and errors when using both std.string and std.regex.

But wait, there's more. The API defines the following functions that all ostensibly do some sort of mattern patching (sic): find, search, test, match, and exec. I wish I were kidding. There's some opIndex and opEquals thrown in for good measure. Knuth wouldn't know what each of them does after studying them for a week and then watching an episode from "The Bachelor".

And get this: global search() does not do what member search() does. Nope. Global search() does what member test() does. I have only contempt for such designs.

Andrei
Feb 17 2009
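
The point quoted above, that a separate "match" primitive can be replaced by anchoring the pattern with ^, can be sketched as follows. It assumes the std.regex module of present-day Phobos rather than the std.regexp under discussion; foundAnywhere and foundAtStart are hypothetical helper names.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// "Search": the pattern may match anywhere in the input.
bool foundAnywhere(string input, string pattern)
{
    return !matchFirst(input, regex(pattern)).empty;
}

// "Match": the same pattern, anchored to the start of the input with ^.
// The non-capturing group keeps alternations inside the anchor.
bool foundAtStart(string input, string pattern)
{
    return !matchFirst(input, regex("^(?:" ~ pattern ~ ")")).empty;
}

void main()
{
    writeln(foundAnywhere("xx-abc", "abc")); // true
    writeln(foundAtStart("xx-abc", "abc"));  // false
    writeln(foundAtStart("abc-xx", "abc"));  // true
}
```

With one primitive plus an anchor, there is only one name to remember.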
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Bill Baxter wrote:
 Maybe "design" is too strong a word.  Most Phobos modules seem to have
 been put together rather hastily in order to fill a pressing need.
 Often *something* is better than nothing at all, even if the something
 is not so great.

std.regexp evolved out of the ECMAscript regex functions - they have the same names and functionality. Layered on top of that was ruby-like names and functionality. It's a good (bad?) example of an api evolving without sacrificing backwards compatibility.
Feb 20 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Bill Baxter wrote:
 Maybe "design" is too strong a word.  Most Phobos modules seem to have
 been put together rather hastily in order to fill a pressing need.
 Often *something* is better than nothing at all, even if the something
 is not so great.

std.regexp evolved out of the ECMAscript regex functions - they have the same names and functionality. Layered on top of that was ruby-like names and functionality. It's a good (bad?) example of an api evolving without sacrificing backwards compatibility.

s/good \(bad\?\)/REALLY BAD/ Andrei
Feb 20 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Fri, 20 Feb 2009 16:35:54 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Walter Bright wrote:
 Bill Baxter wrote:
 Maybe "design" is too strong a word.  Most Phobos modules seem to have
 been put together rather hastily in order to fill a pressing need.
 Often *something* is better than nothing at all, even if the something
 is not so great.

std.regexp evolved out of the ECMAscript regex functions - they have the same names and functionality. Layered on top of that was ruby-like names and functionality. It's a good (bad?) example of an api evolving without sacrificing backwards compatibility.

s/good \(bad\?\)/REALLY BAD/ Andrei

Backward compatibility is almost always a bad thing. Look what's happened to C++ and OpenGL.

In this case it's even worse, as I don't think anyone expects to paste their Ruby code and compile it with dmd. Andrei
Feb 20 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I'm quite unhappy with the API of std.regexp. It's a chaotic design that
 provides a hodgepodge of functionality and tries to use as many synonyms of
 "to find" in the dictionary (e.g. search, match). I could swear Walter never
 really cared for using regexps, and that is felt throughout the design: it
 fills the bullet point but it's asinine to use.

 Besides std.regexp only works with (narrow) strings and we want it to work
 on streams of all widths and structures. One pet complaint I have is that
 std.regexp puts a class around it all as if everybody's favorite pastime
 would be to inherit Regexp and override some random function in it.

 In the upcoming releases of D 2.0 there will be rather dramatic breaking
 changes of phobos. I just wanted to ask whether y'all could stomach yet
 another rewritten API or you'd rather use std.regexp as it is for the time
 being.

Btw, I've got no problems with you breaking the API of 2.0 either. Though you might consider moving the current implementation to std.deprecated.regex and leaving it there for a year with a pragma(msg, "This module is deprecated"). That way making a quick fix to broken code is just a matter of inserting ".deprecated" into your import statements. --bb
Feb 17 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 I'm quite unhappy with the API of std.regexp. It's a chaotic design that
 provides a hodgepodge of functionality and tries to use as many synonyms of
 "to find" in the dictionary (e.g. search, match). I could swear Walter never
 really cared for using regexps, and that is felt throughout the design: it
 fills the bullet point but it's asinine to use.

 Besides std.regexp only works with (narrow) strings and we want it to work
 on streams of all widths and structures. One pet complaint I have is that
 std.regexp puts a class around it all as if everybody's favorite pastime
 would be to inherit Regexp and override some random function in it.

 In the upcoming releases of D 2.0 there will be rather dramatic breaking
 changes of phobos. I just wanted to ask whether y'all could stomach yet
 another rewritten API or you'd rather use std.regexp as it is for the time
 being.

Btw, I've got no problems with you breaking the API of 2.0 either. Though you might consider moving the current implementation to std.deprecated.regex and leaving it there for a year with a pragma(msg, "This module is deprecated"). That way making a quick fix to broken code is just a matter of inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok? Andrei
Feb 17 2009
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 I was thinking of moving older stuff to etc, is that ok?

Yes. But you should also rename the new one, perhaps to std.regex. That way, legacy code will refuse to compile, rather than compile wrongly.
Feb 17 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 I was thinking of moving older stuff to etc, is that ok?

Yes. But you should also rename the new one, perhaps to std.regex. That way, legacy code will refuse to compile, rather than compile wrongly.

Terrific. I prefer "regex" to "regexp" because it's easier to pronounce, particularly if you're a foreigner. "Regex" sounds like a frog utterance by a forest lake, "regexp" sounds like nothing in particular. Andrei
Feb 17 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 Terrific. I prefer "regex" to "regexp" because it's easier to pronounce, 
 particularly if you're a foreigner. "Regex" sounds like a frog utterance 
 by a forest lake, "regexp" sounds like nothing in particular.

I'd like std.re :-) Bye, bearophile
Feb 17 2009
prev sibling parent Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Andrei Alexandrescu wrote:
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 I was thinking of moving older stuff to etc, is that ok?

Yes. But you should also rename the new one, perhaps to std.regex. That way, legacy code will refuse to compile, rather than compile wrongly.

Terrific. I prefer "regex" to "regexp" because it's easier to pronounce, particularly if you're a foreigner. "Regex" sounds like a frog utterance by a forest lake, "regexp" sounds like nothing in particular. Andrei

It sounds to me like a frog who, immediately post-utterance, just got gigged. I guess that makes "regex" sound even better... as it's still alive (sounding). -- Chris Nicholson-Sauls -- Who so far agrees with pretty much everything you've said, and therefore has no real contribution...
Feb 17 2009
prev sibling parent reply Leandro Lucarella <llucax gmail.com> writes:
Andrei Alexandrescu, on February 17 at 13:56, you wrote to me:
Btw, I've got no problems with you breaking the API of 2.0 either.
Though you might consider moving the current implementation to
std.deprecated.regex and leaving it there for a year with a
pragma(msg, "This module is deprecated").
That way making a quick fix to broken code is just a matter of
inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated", or something shorter like "old", or "d1" (this last one could be good for future deprecated libraries; when D3 is available there will probably be a "d2" too).

--
Leandro Lucarella (luca) | Group blog: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Hey you, don't tell me there's no hope at all
Together we stand, divided we fall.
Feb 19 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Leandro Lucarella wrote:
 Andrei Alexandrescu, on February 17 at 13:56, you wrote to me:
 Btw, I've got no problems with you breaking the API of 2.0 either.
 Though you might consider moving the current implementation to
 std.deprecated.regex and leaving it there for a year with a
 pragma(msg, "This module is deprecated").
 That way making a quick fix to broken code is just a matter of
 inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated", o something shorter like "old", or "d1" (this last one could be good for future deprecated libraries, like when D3 is available there probably be a "d2" too).

In the words of George Costanza: "Because it's there!" Andrei
Feb 19 2009
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
Andrei Alexandrescu wrote:
 Leandro Lucarella wrote:
 Andrei Alexandrescu, on February 17 at 13:56, you wrote to me:
 Btw, I've got no problems with you breaking the API of 2.0 either.
 Though you might consider moving the current implementation to
 std.deprecated.regex and leaving it there for a year with a
 pragma(msg, "This module is deprecated").
 That way making a quick fix to broken code is just a matter of
 inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated", o something shorter like "old", or "d1" (this last one could be good for future deprecated libraries, like when D3 is available there probably be a "d2" too).

In the words of George Costanza: "Because it's there!" Andrei

Shouldn't that be George Mallory?
Feb 19 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ellery Newcomer wrote:
 Andrei Alexandrescu wrote:
 Leandro Lucarella wrote:
 Andrei Alexandrescu, on February 17 at 13:56, you wrote to me:
 Btw, I've got no problems with you breaking the API of 2.0 either.
 Though you might consider moving the current implementation to
 std.deprecated.regex and leaving it there for a year with a
 pragma(msg, "This module is deprecated").
 That way making a quick fix to broken code is just a matter of
 inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated", o something shorter like "old", or "d1" (this last one could be good for future deprecated libraries, like when D3 is available there probably be a "d2" too).

In the words of George Costanza: "Because it's there!" Andrei

Shouldn't that be George Mallory?

No, he said "because it is there". George said "because it's there": http://www.classictvquotes.com/quotes/characters/george-costanza/page_14.html

George: So, she fell, and then she started screaming, "My back! My back!" So, I picked her up and took her to the hospital.
Elaine: How is she?
George: She's in traction.
Elaine: Okay, I'm sorry.
George: It's not funny, Elaine.
Elaine: I know. I'm sorry. I'm serious.
George: Her back went out. She's gotta be there for a couple of days. All she said on the way over in the car was, "Why, George, why?!" I said, "Because it's there!"

Andrei
Feb 19 2009
prev sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 That way making a quick fix to broken code is just a matter of
 inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated"

In the words of George Costanza: "Because it's there!"

With the critique you've given to the existing regexp stuff, deprecated would be the obvious choice. Then we could have etc for Miscellaneous Stuff.
Feb 20 2009
parent Leandro Lucarella <llucax gmail.com> writes:
Georg Wrede, on February 20 at 19:38, you wrote to me:
 Andrei Alexandrescu wrote:
That way making a quick fix to broken code is just a matter of
inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated"


With the critique you've given to the existing regexp stuff, deprecated would be the obvious choice. Then we could have etc for Miscellaneous Stuff.

Why not "misc" for that? =)

--
Leandro Lucarella (luca) | Group blog: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
TIGER ATE CIRCUS EMPLOYEE: OWNER AND TAMER ARRESTED
	-- Crónica TV
Feb 20 2009
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Andrei,

 I'm quite unhappy with the API of std.regexp. It's a chaotic design
 that provides a hodgepodge of functionality and tries to use as many
 synonyms of "to find" in the dictionary (e.g. search, match). I could
 swear Walter never really cared for using regexps, and that is felt
 throughout the design: it fills the bullet point but it's asinine to
 use.
 
 Besides std.regexp only works with (narrow) strings and we want it to
 work on streams of all widths and structures. One pet complaint I have
 is that std.regexp puts a class around it all as if everybody's
 favorite pastime would be to inherit Regexp and override some random
 function in it.
 
 In the upcoming releases of D 2.0 there will be rather dramatic
 breaking changes of phobos. I just wanted to ask whether y'all could
 stomach yet another rewritten API or you'd rather use std.regexp as it
 is for the time being.
 
 Andrei
 

For what it's worth, I have a partial clone of the .NET API built on top of PCRE. I would have to ask my boss but I expect I could donate it if anyone wants to use it as a basis.
Feb 17 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jarrett Billingsley:
I'm talking 60-70 seconds to compile a more complex regex.<

A modern CPU is able to do something like 60*2*2E9 operations in that time, DMD needs 6 seconds or less to compile about 60000-80000 lines of my D code, so I think it's a bit too much time (probably 100 or 1000 times too much). Bye, bearophile
Feb 17 2009
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
BCS wrote:
 Reply to Andrei,
 
 I'm quite unhappy with the API of std.regexp. It's a chaotic design
 that provides a hodgepodge of functionality and tries to use as many
 synonyms of "to find" in the dictionary (e.g. search, match). I could
 swear Walter never really cared for using regexps, and that is felt
 throughout the design: it fills the bullet point but it's asinine to
 use.

 Besides std.regexp only works with (narrow) strings and we want it to
 work on streams of all widths and structures. One pet complaint I have
 is that std.regexp puts a class around it all as if everybody's
 favorite pastime would be to inherit Regexp and override some random
 function in it.

 In the upcoming releases of D 2.0 there will be rather dramatic
 breaking changes of phobos. I just wanted to ask whether y'all could
 stomach yet another rewritten API or you'd rather use std.regexp as it
 is for the time being.

 Andrei

For what it's worth, I have a partial clone of the .NET API built on top of PCRE. I would have to ask my boss but I expect I could donate it if anyone want to use it as a basis.

That would be cool; I find the engine in std.regexp rather hard to understand. Andrei
Feb 17 2009
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Daniel de Kok wrote:
 On Tue, Feb 17, 2009 at 8:39 PM, BCS <ao pathlink.com> wrote:
 For what it's worth, I have a partial clone of the .NET API built on top of
 PCRE. I would have to ask my boss but I expect I could donate it if anyone
 want to use it as a basis.

Actually, I was wondering why nobody is considering real regular languages anymore, that can be compiled to a normal finite state recognizer or transducer. While this may not be as fancy as Perl-like extensions, they are much faster, and it's easier to do fun stuff such as composition.

I am considering that. One nice feature of "classic" regexes is that they never backtrack, so they work with pure input iterators. This has crucial consequences with regard to where and how regexes fit the range concept hierarchy. Andrei
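To make the input-iterator point concrete, here is a minimal Python sketch. It is not std.regexp code or any proposed Phobos API: the automaton is hand-built for the example expression a(b|c)*d, and make_dfa, accepts and one_shot are names invented for this illustration. The point is only that a DFA reads each character exactly once and never rewinds, so a stream that can be traversed a single time is enough.

```python
def make_dfa():
    """Hand-built DFA for the regular expression a(b|c)*d (illustrative only)."""
    transitions = {
        (0, "a"): 1,
        (1, "b"): 1,
        (1, "c"): 1,
        (1, "d"): 2,
    }
    return transitions, 0, {2}


def accepts(chars):
    """Run the DFA over any iterable of characters, reading each one once."""
    transitions, state, accepting = make_dfa()
    for ch in chars:                     # single forward pass, no rewinding
        state = transitions.get((state, ch))
        if state is None:                # dead state: no transition defined
            return False
    return state in accepting


def one_shot(text):
    """Generator standing in for a stream that cannot be re-read."""
    for ch in text:
        yield ch
```

For example, accepts(one_shot("abcbd")) is true and accepts(one_shot("abx")) is false. A backtracking engine could not be driven this way, because it may need to return to an earlier position in the input.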
Feb 17 2009
prev sibling next sibling parent Daniel de Kok <me danieldk.org> writes:
On Tue, Feb 17, 2009 at 8:39 PM, BCS <ao pathlink.com> wrote:
 For what it's worth, I have a partial clone of the .NET API built on top of
 PCRE. I would have to ask my boss but I expect I could donate it if anyone
 want to use it as a basis.

Actually, I was wondering why nobody is considering real regular languages anymore, that can be compiled to a normal finite state recognizer or transducer. While this may not be as fancy as Perl-like extensions, they are much faster, and it's easier to do fun stuff such as composition. Take care, Daniel
Feb 17 2009
prev sibling next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Tue, Feb 17, 2009 at 2:47 PM, Daniel de Kok <me danieldk.org> wrote:
 On Tue, Feb 17, 2009 at 8:39 PM, BCS <ao pathlink.com> wrote:
 For what it's worth, I have a partial clone of the .NET API built on top of
 PCRE. I would have to ask my boss but I expect I could donate it if anyone
 want to use it as a basis.

Actually, I was wondering why nobody is considering real regular languages anymore, that can be compiled to a normal finite state recognizer or transducer. While this may not be as fancy as Perl-like extensions, they are much faster, and it's easier to do fun stuff such as composition.

Tango's regex engine is just that. It uses a tagged NFA method. http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html The problem with this method is that while it's certainly faster to match, it's MUCH slower to compile. There are no pathological matches; only pathological compiles ;) I'm talking 60-70 seconds to compile a more complex regex. This might be an acceptable tradeoff for when you need to compile a regex in a long-running app like a server, but it's completely unacceptable for most small, Perl-like text munging programs. Unless of course this slowdown is unique to Tango's implementation of this method!
Feb 17 2009
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Jarrett,

 On Tue, Feb 17, 2009 at 2:47 PM, Daniel de Kok <me danieldk.org>
 wrote:
 
 Actually, I was wondering why nobody is considering real regular
 languages anymore, that can be compiled to a normal finite state
 recognizer or transducer. While this may not be as fancy as Perl-like
 extensions, they are much faster, and it's easier to do fun stuff
 such as composition.
 

Tango's regex engine is just that. It uses a tagged NFA method. http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html The problem with this method is that while it's certainly faster to match, it's MUCH slower to compile. There are no pathological matches; only pathological compiles ;) I'm talking 60-70 seconds to compile a more complex regex.

Could this be transitioned to CTFE? You could even have a debug mode that delays compilation till runtime:

RegEx matcher = new CTFERegEx!("some regex");

class CTFERegEx(char[] regex) : RegEx
{
     debug(NoCTFE)  static char[] done;
     else     static const char[] done = CTFECompile(regex);

     public this()
     {
        debug(NoCTFE) if(done == null) done = CTFECompile(regex);

        super(done);  // D spells the base-class constructor call super(...)
     }
}
Feb 17 2009
parent Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Jarrett Billingsley wrote:
 On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao pathlink.com> wrote:
 could this be transitioned to CTFE? you could even have a debug mode that
 delays till runtime

 RegEx mather = new CTFERegEx!("some regex");


 class CTFERegEx(char[] regex) : RegEx
 {
      debug(NoCTFE)  static char[] done;
      else     static const char[] done = CTFECompile(regex);

      public this()
      {
         debug(NoCTFE) if(done == null) done = CTFECompile(regex);

         base(done)
      }
 }

For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that.

A feature which I *adore*, by the way. So long as the precompiled regex is "guaranteed" to run at best possible performance (hand-rolled, hand-optimized solutions notwithstanding) I for one prefer them. -- Chris Nicholson-Sauls
Feb 17 2009
prev sibling next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao pathlink.com> wrote:
 could this be transitioned to CTFE? you could even have a debug mode that
 delays till runtime

 RegEx mather = new CTFERegEx!("some regex");


 class CTFERegEx(char[] regex) : RegEx
 {
      debug(NoCTFE)  static char[] done;
      else     static const char[] done = CTFECompile(regex);

      public this()
      {
         debug(NoCTFE) if(done == null) done = CTFECompile(regex);

         base(done)
      }
 }

For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that. But seriously, man - if something takes 60 seconds to complete at _runtime_, making it CTFE would simply make your computer explode.
Feb 17 2009
parent BCS <ao pathlink.com> writes:
Reply to Jarrett,

 On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao pathlink.com> wrote:
 
 could this be transitioned to CTFE? you could even have a debug mode
 that delays till runtime
 
 RegEx mather = new CTFERegEx!("some regex");
 
 class CTFERegEx(char[] regex) : RegEx
 {
 debug(NoCTFE)  static char[] done;
 else     static const char[] done = CTFECompile(regex);
 public this()
 {
 debug(NoCTFE) if(done == null) done = CTFECompile(regex);
 base(done)
 }
 }

For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that. But seriously, man - if something takes 60 seconds to complete at _runtime_, making it CTFE would simply make your computer explode.

For any kind of debug, yeah, that's a problem. OTOH for release, as long as it /does/ compile, who cares? How many real release builds does anyone do a week?
Feb 17 2009
prev sibling parent Daniel de Kok <me danieldk.org> writes:
On Tue, Feb 17, 2009 at 9:26 PM, Jarrett Billingsley
<jarrett.billingsley gmail.com> wrote:
 For what it's worth the Tango regexes actually have a method to output
 a D function that will implement the regex after it's compiled.  So
 you _could_ precompile the regex into D code and use that.

I have only been tinkering with Phobos, but that's good to hear, thanks!
Feb 17 2009
prev sibling next sibling parent Daniel de Kok <me danieldk.org> writes:
On Tue, Feb 17, 2009 at 8:57 PM, Jarrett Billingsley
<jarrett.billingsley gmail.com> wrote:
 The problem with this method is that while it's certainly faster to
 match, it's MUCH slower to compile.  There are no pathological
 matches; only pathological compiles ;)  I'm talking 60-70 seconds to
 compile a more complex regex. This might be an acceptable tradeoff
 for when you need to compile a regex in a long-running app like a
 server, but it's completely unacceptable for most small, Perl-like
 text munging programs.

Hmmm, define "complex", I suppose it's ok for the general line-splitting/matching stuff? I got into trouble (time-wise) when we compiled a part of speech tagger into a transducer. In those cases we generally pre-compile stuff, and output it as a large struct in the target language. Of course, it would be fun if we can do it at compile-time ;). Besides that, if we'd have a good general recognizer/transducer implementation it could also be used for compact dictionary storage, perfect hashing automata, etc. Take care, Daniel
Feb 17 2009
prev sibling next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me danieldk.org> wrote:
 Hmmm, define "complex"

\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*

This is a simple email regexp. This takes about 4 or 5 seconds to compile on my lappy (Pentium M). It only goes up from there.
Feb 17 2009
parent BCS <ao pathlink.com> writes:
Reply to Jarrett,

 On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me danieldk.org>
 wrote:
 
 Hmmm, define "complex"
 

This is a simple email regexp. This takes about 4 or 5 seconds to compile on my lappy (Pentium M). It only goes up from there.

I wonder how well it would work on this: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html :b
Feb 17 2009
prev sibling next sibling parent Daniel de Kok <me danieldk.org> writes:
On Tue, Feb 17, 2009 at 9:50 PM, Jarrett Billingsley
<jarrett.billingsley gmail.com> wrote:
 On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me danieldk.org> wrote:
 Hmmm, define "complex"

\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* This is a simple email regexp. This takes about 4 or 5 seconds to compile on my lappy (Pentium M).

Hmm, odd. I have translated that regexp to the syntax of the tool that we used, which is written in Prolog (it is generally a constant factor slower than C/C++/D equivalents). Generating a minimized DFA takes far less than a second. I used the following expression (abstracted a bit with macros):

---
macro(letter, {a..z, 'A'..'Z'}).
macro(punctlet,[{-,+,.},letter+]).
macro(dompunctlet,[{-,.},letter+]).
macro(email,[letter+,punctlet*,@,letter+,dompunctlet*,.,letter+,dompunctlet*]).
---

The software is available from: http://www.let.rug.nl/~vannoord/Fsa/fsa.html

Take care,
Daniel
Feb 17 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Wed, Feb 18, 2009 at 6:56 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Besides std.regexp only works with (narrow) strings and we want it to
 work
 on streams of all widths and structures. One pet complaint I have is that
 std.regexp puts a class around it all as if everybody's favorite pastime
 would be to inherit Regexp and override some random function in it.

So what do you think it should be, a struct?

Yes.
 That would imply to me that everybody's favorite pastime is making
 value copies of regex structures, when in fact nobody does that.

Well you'd be surprised. The RegEx class saves the state of the last search, which is a sensible thing to do. But then consider a simple range Splitter that, when iterated, nicely gives you...

string a = ",a, bcd, def,gh,";
foreach (e; splitter(a, pattern(", *")))
    writeln("[", e, "]");

writes

[]
[a]
[bcd]
[def]
[gh]

This is similar to the function std.regex.split with the notable difference that no extra memory is allocated. Now Splitter is an input range. This means that if you copy a Splitter, you wouldn't expect iterating the original to affect the copy. Well, that's exactly what happens when you use the "good" reference semantics of the RegEx stored inside splitter. Worse, RegExp has no cloning primitive, so I need to resort to storing the pattern and recompiling it from scratch at every copy of Splitter. So essentially the "good" semantics of RegEx are useless when it comes to composing it in larger objects.

So that sounds to me like RegEx should have a .dup, and then it would be fine, no? I agree it should have a dup for the odd occasion when you do want to make a copy for some reason.
 Regex is a class in order to give it reference semantics and provide
 encapsulation of some re-usable state.  Maybe it should be a final
 class, but my impression is "final class" doesn't really work in D.


 Re-usable state is provided by structs too. In addition they can choose
 value vs. reference semantics with ease.

I think this choice is not so much available with D1, plus the constructor situation with D1 is less than ideal. Given that, I think the choice of class for RegEx was appropriate. But if the struct problems are all going away in D2, then that's great. Sounds like you're saying we'll really be able to use D structs just like one uses a non-polymorphic C++ class. If so, then that's super. --bb
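A small Python sketch of the copy problem discussed in this post, and of the .dup idea suggested above. StatefulMatcher, Splitter and dup are hypothetical names invented for this illustration; they are not part of std.regexp or of Python's re module. The sketch assumes the separator pattern never matches the empty string.

```python
import copy
import re


class StatefulMatcher:
    """Hypothetical matcher that remembers where its last search ended."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)
        self.pos = 0

    def next_match(self, text):
        m = self.regex.search(text, self.pos)
        if m is not None:
            self.pos = m.end()
        return m


class Splitter:
    """Lazy splitter: yields one field at a time and builds no list."""

    def __init__(self, text, pattern):
        self.text = text
        self.matcher = StatefulMatcher(pattern)
        self.done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        start = self.matcher.pos
        m = self.matcher.next_match(self.text)
        if m is None:                    # no separator left: emit the tail
            self.done = True
            return self.text[start:]
        return self.text[start:m.start()]

    def dup(self):
        """Copy with its own matcher, so the two iterate independently."""
        other = Splitter(self.text, self.matcher.regex.pattern)
        other.matcher.pos = self.matcher.pos
        other.done = self.done
        return other


text = ",a, bcd, def,gh,"
fields = list(Splitter(text, ", *"))

original = Splitter(text, ", *")
shared = copy.copy(original)     # shallow copy: same StatefulMatcher object
private = original.dup()         # independent matcher state
next(original)
next(original)
from_shared = next(shared)       # continues where original stopped
from_private = next(private)     # starts from the beginning
```

Here from_shared comes out as "bcd" even though shared was never advanced, while from_private is the first (empty) field. Note that this dup recompiles the pattern, which is the cost the quoted text complains about; a real cloning primitive would copy the compiled program instead. This sketch also yields a trailing empty field, which the output quoted above does not show.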
Feb 17 2009
prev sibling next sibling parent Derek Parnell <derek psych.ward> writes:
On Tue, 17 Feb 2009 10:36:06 -0800, Andrei Alexandrescu wrote:

 I'm quite unhappy with the API of std.regexp.

I was so happy with using it I wrote my own simplified regex ;-)
 In the upcoming releases of D 2.0 there will be rather dramatic breaking 
 changes of phobos. I just wanted to ask whether y'all could stomach yet 
 another rewritten API or you'd rather use std.regexp as it is for the 
 time being.

If your changes are going to make things better for coding and maintenance then go for it. -- Derek Parnell Melbourne, Australia skype: derek.j.parnell
Feb 17 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Wed, Feb 18, 2009 at 7:44 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 I think this choice is not so much available with D1, plus the
 constructor situation with D1 is less than ideal.  Given that, I think
 the choice of class for RegEx was apropriate.   But if the struct
 problems are all going away in D2, then that's great.  Sounds like
 you're saying we'll really be able to use D structs just like one uses
 a non-polymorphic C++ class.  If so, then that's super.

I lost that perspective when criticizing RegExp, you're right. But still the API is lousy - every single time I am using a RegExp, I find myself fumbling through the thoroughly overlapping primitives in the documentation, and never seem to find an idiom that's simple, comfortable, and memorable.

Ok. I'm certainly not in love with the API either. Though, the only RegEx API I've ever used that I felt totally comfortable with was Perl's, which in large part is syntax instead of an API. With Python's, I have to look over the documentation every time I use it, too. Maybe it's because of the "matching" vs "searching" distinction that I find impossible to remember. (http://docs.python.org/library/re.html) --bb
Feb 17 2009
prev sibling next sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Tue, Feb 17, 2009 at 7:13 PM, Bill Baxter <wbaxter gmail.com> wrote:
 Ok.  I'm certainly not in love with the API either.  Though, the only
 RegEx API I've ever used that felt totally comfortable with was
 Perl's, which in large part is syntax instead of an API.  Python's
 syntax I have to look over the documentation every time I use it, too.
  Maybe it's because of the "matching" vs "searching" distinction that
 I find impossible to remember.
 (http://docs.python.org/library/re.html)

Is there ever a situation where you want to use a single regexp for both matching _and_ searching? And if not, couldn't you just use ^ to anchor it? I never understood why Python's API makes such a distinction.
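For reference, a minimal Python sketch of the distinction being asked about. The pattern and text are made up. One caveat on the "just use ^" idea: under re.MULTILINE the ^ anchor also matches after each newline, so an anchored search() is equivalent to match() only for single-line input or without that flag.

```python
import re

number = re.compile(r"\d+")
text = "abc 123"

# match() only succeeds if the pattern matches at position 0.
at_start = number.match(text)

# search() scans forward until the pattern matches somewhere.
anywhere = number.search(text)

# An explicit ^ makes search() behave like match() on single-line input.
anchored = re.search(r"^\d+", text)

# fullmatch() (Python 3.4+) also requires the match to reach the end.
whole = number.fullmatch("123")
```

Here at_start and anchored are both None, anywhere finds "123", and whole matches the entire string.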
Feb 17 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Wed, Feb 18, 2009 at 11:38 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Jarrett Billingsley wrote:
 On Tue, Feb 17, 2009 at 7:13 PM, Bill Baxter <wbaxter gmail.com> wrote:
 Ok.  I'm certainly not in love with the API either.  Though, the only
 RegEx API I've ever used that felt totally comfortable with was
 Perl's, which in large part is syntax instead of an API.  Python's
 syntax I have to look over the documentation every time I use it, too.
  Maybe it's because of the "matching" vs "searching" distinction that
 I find impossible to remember.
 (http://docs.python.org/library/re.html)

Is there ever a situation where you want to use a single regexp for both matching _and_ searching? And if not, couldn't you just use ^ to anchor it? I never understood why Python's API makes such a distinction.

Ehm, that's odd. You'd think that after Perl has set the precedent, it would be hard to do major goofs in designing a regex API. By the way, the more I dig into std.regexp, the stiffer the hair on my neck gets. Get this: the API offers both global functions and member functions, with both RegExp and plain string arguments. The latter are carefully designed to maximize the number of clashes, potential confusions, and errors when using both std.string and std.regex.

All I know is that I found one incantation that works and I've been copy-pasting that ever since. :-)
 But wait, there's more. The API defines the following functions that all
 ostensibly do some sort of mattern patching (sic): find, search, test,
 match, and exec. I wish I were kidding. There's some opIndex and opEquals
 thrown in for good measure. Knuth wouldn't know what each of them does after
 studying them for a week and then watching an episode from "The Bachelor".
 And get this: global search() does not do what member search() does. Nope.
 Global search() does what member test() does. I have only contempt for such
 designs.

Maybe "design" is too strong a word. Most Phobos modules seem to have been put together rather hastily in order to fill a pressing need. Often *something* is better than nothing at all, even if the something is not so great. --bb
Feb 17 2009
prev sibling next sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Tue, Feb 17, 2009 at 9:38 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 By the way, the more I dig into std.regexp, the stiffer the hair on my neck
 gets. Get this: the API offers both global functions and member functions,
 with both RegExp and plain string arguments. The latter are carefully
 designed to maximize the number of clashes, potential confusions, and errors
 when using both std.string and std.regex.

 But wait, there's more. The API defines the following functions that all
 ostensibly do some sort of mattern patching (sic): find, search, test,
 match, and exec. I wish I were kidding. There's some opIndex and opEquals
 thrown in for good measure. Knuth wouldn't know what each of them does after
 studying them for a week and then watching an episode from "The Bachelor".
 And get this: global search() does not do what member search() does. Nope.
 Global search() does what member test() does. I have only contempt for such
 designs.

Well I don't mean to, uh, toot my own horn but.. I recently bound libpcre to MiniD and came up with a relatively simple but powerful and orthogonal API.

http://www.dsource.org/projects/minid/wiki/Addons/PcreLib#LibraryReference

The regex object has a single "subject" string at a time, the string that it's matching against. The subject is set with "search" and "test" does everything. All other functions are basically defined in terms of those two. "test" looks for the next match of the regex in the subject and returns true if it matched. "match" returns match groups (0 for the whole regex and 1..n for subgroups, as well as string indices for named subgroups). opApply is just a quicker way of writing something like:

    re.search(someSubject)
    while(re.test())
        // use re.match to get matches

You'll notice that opApply is also just defined in terms of test. I've found it far more intuitive than other APIs. I've never used Perl and I doubt I ever will, though.
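
To make the shape of that design concrete, here is a rough sketch in Python on top of the standard re module. It is not the MiniD binding: the Regex class, its private fields, and the empty-match handling are assumptions made for illustration. Only the three names search, test, and match, and the idea that iteration is defined in terms of test, come from the description above.

```python
import re


class Regex:
    """Hypothetical wrapper imitating the search/test/match design."""

    def __init__(self, pattern):
        self._re = re.compile(pattern)
        self._subject = ""
        self._pos = 0
        self._last = None

    def search(self, subject):
        # Set the subject and reset the scan position; returns self for chaining.
        self._subject = subject
        self._pos = 0
        self._last = None
        return self

    def test(self):
        # Look for the next match in the subject and remember it.
        if self._pos > len(self._subject):
            self._last = None
            return False
        m = self._re.search(self._subject, self._pos)
        self._last = m
        if m is None:
            self._pos = len(self._subject) + 1
            return False
        # Step one extra character after an empty match to avoid looping forever.
        self._pos = m.end() if m.end() > m.start() else m.end() + 1
        return True

    def match(self, group=0):
        # Group 0 is the whole match; 1..n or a name select subgroups.
        # Only valid after test() has returned True.
        return self._last.group(group)

    def __iter__(self):
        # Iteration is defined purely in terms of test(), like opApply.
        while self.test():
            yield self


# Usage: set the subject once, then pull matches with test() and match().
rx = Regex(r"(?P<key>\w+)=(\d+)").search("a=1, b=22")
pairs = []
while rx.test():
    pairs.append((rx.match("key"), rx.match(2)))

# The same loop written as iteration.
numbers = [r.match() for r in Regex(r"\d+").search("x1y22")]
```

Everything beyond search and test is a thin layer over them, which is what keeps the surface area small.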
Feb 17 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 20 Feb 2009 16:35:54 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Walter Bright wrote:
 Bill Baxter wrote:
 Maybe "design" is too strong a word.  Most Phobos modules seem to have
 been put together rather hastily in order to fill a pressing need.
 Often *something* is better than nothing at all, even if the something
 is not so great.

the same names and functionality. Layered on top of that was ruby-like names and functionality. It's a good (bad?) example of an api evolving without sacrificing backwards compatibility.

s/good \(bad\?\)/REALLY BAD/ Andrei

Backward compatibility is almost always a bad thing. Look what's happened to C++ and OpenGL.
Feb 20 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Sat, Feb 21, 2009 at 2:38 AM, Georg Wrede <georg.wrede iki.fi> wrote:
 Andrei Alexandrescu wrote:
 That way making a quick fix to broken code is just a matter of
 inserting ".deprecated" into your import statements.

I was thinking of moving older stuff to etc, is that ok?

What's the rationale for "etc"? Why not "deprecated"?

In the words of George Costanza: "Because it's there!"

With the critique you've given to the existing regexp stuff, deprecated would be the obvious choice. Then we could have etc for Miscellaneous Stuff.

Agreed. etc implies to me that it's stuff that might be useful sometimes but not very often. It does not suggest to me that you shouldn't use it if you can avoid it. Or how about making it std.etc.deprecated.regexp? That way it's clear that it's *both* something that might be useful occasionally and something that you should avoid if possible. ... Deprecated is a keyword though, isn't it? Dang. :-P --bb
Feb 20 2009