www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Improving std.regex(p)

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
There are currently two regexen in the standard library. The older one, 
std.regexp, is time-tested but only works with UTF8 and has a clunkier 
API. The newer one, std.regex, is newer and isolates the engine from the 
matches (and therefore can reuse and cache engines easier), and supports 
all character widths. But it's less tested and doesn't have that great 
of an interface because it pretty much inherits the existing one.

I wish to improve regex handling in Phobos. The most important 
improvement is not in the interface - it's in the engine. The current 
engine is adequate but nothing to write home about, and for simple 
regexen is markedly slower than equivalent hand-written code (e.g. 
matching whitespace). One great opportunity would be for D to leverage 
its uncanny compile-time evaluation abilities and offer a regex that 
parses the pattern during compilation:

foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

Such a static regex could be simpler than a full-blown regex with 
captures and backreferences etc., but it would have guaranteed 
performance (e.g. it would be an automaton instead of a backtracking 
engine) and would be darn fast because it would generate custom code for 
each regex pattern.

See related work:

http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

If we get as far as implementing what RE2 can do with compile-time 
evaluation, people will definitely notice.

If there's anyone who'd want to tackle such a project (for Phobos or 
not), I highly encourage you to do so.


Andrei
Jun 17 2010
next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 17 Jun 2010 21:44:03 -0700, Andrei Alexandrescu wrote:

 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.
 
 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:
 
 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }
 
 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.
 
 See related work:
 
 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-

 
 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.
 
 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.

There is the 'scregexp' project on dsource: http://www.dsource.org/projects/scregexp/ It's D1/Tango, but maybe it could be adapted to D2/Phobos? It would at least serve as a starting point for anyone wanting to try their hand at doing this. -Lars
Jun 17 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Lars T. Kyllingstad wrote:
 On Thu, 17 Jun 2010 21:44:03 -0700, Andrei Alexandrescu wrote:
 
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-

 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.

There is the 'scregexp' project on dsource: http://www.dsource.org/projects/scregexp/ It's D1/Tango, but maybe it could be adapted to D2/Phobos? It would at least serve as a starting point for anyone wanting to try their hand at doing this.

scregexp includes the following requirement within the license: "Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution." That would need to be changed before inclusion in Phobos. It looks like there are three people in the copyright notice: Walter Bright, Marton Papp, and yidabu. Does anyone know Marton's email address? Andrei
Jun 18 2010
parent David Gileadi <gileadisNO SPAMgmail.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
article&#63275;
 scregexp includes the following requirement within the license:
 "Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in the
 documentation and/or other materials provided with the distribution."
 That would need to be changed before inclusion in Phobos. It looks like
 there are three people in the copyright notice: Walter Bright, Marton
 Papp, and yidabu. Does anyone know Marton's email address?
 Andrei

Since I didn't see someone else reply, back when he announced the code he used anteusz freemail.hu. Google also turned up http://www.equinoxbase.com/. I'm not sure it's the same fellow, but the page has his name, mentions D and has a contact form. You could also try PMing him at dsource: http://www.dsource.org/forums/profile.php? mode=viewprofile&u=1513
Jun 22 2010
prev sibling next sibling parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
Hi Andrei,

I am unable to offer to write an entire compile time regex library in D, but it
is certainly an interesting subject for me and would be happy to contribute to
any discussions. I am sure a D implementation could be much more successful
than a C++ TMP DFA producer (see http://old.nabble.com/-regex%2C-xpressive--
interesting(-)-perf-benchmark-to28800789.html#a28815063 - I'm sure you have
already! ;-))

I am not really into D yet, although I've been following this group for a
while. The release of Visual D is a huge improvement and of course I am very
interested in your new book!

Here are the links to the DFA implementations I have done in C++:

http://www.benhanson.net/lexertl.html
http://www.codeproject.com/KB/recipes/wc.aspx

I'm also mulling a possible C# implementation of the wild card matcher,
hopefully emitting CLR byte-codes.

I would be happy to talk to anyone concerning all things DFA (and it would be
intriguing to code up or try some code snippets!)

Regards,

Ben
Jun 18 2010
next sibling parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
Reading your original post again and thinking about this a bit more...

If someone can help me get up to speed with TMP in D, I could probably put
together a proof of concept pretty quickly. Aside from the D syntax, it is all
in the design (which is why I would like to discuss it in more detail). For
anyone who is interested, here is how I go about DFA production currently:

- Bog standard tokenisation of regex tokens
- Normalise each regex charset and enumerate in a map
- build a regex syntax tree with leaf nodes storing the enumerated charset id
- Intersect all the charsets and build a vector of equivalent charsets
- Perform the followset to DFA transform as described in the Dragon book

For my lexer generator, I build a lookup table for each character to an
equivalence class. For my wild card matcher, I leave the representation as
characters and do a linear lookup of chars to 'next' state. The lexer
equivalence class method (a la flex) is more efficient as you just have two
lookups per character with no linear searches.

I'm thinking I could look at getting as far as the regex syntax tree and then
we could discuss DFA representation (it sounds like you want to generate code
directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
- I imagine there is some overlap there, but I don't know enough to be sure.
Can you output source code at compilation time and have D compile it, or do you
use a load of recursion like C++ TMP?

And I forgot to mention - boost.spirit uses my lexertl library.

Regards,

Ben
Jun 18 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Ben Hanson:
 Can you output source code at compilation time and have D compile it, or do you
 use a load of recursion like C++ TMP?

This can be done with string mixin. You can use compile-time evaluation to build a string of code, and the mix it in somewhere, the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax. Bye, bearophile
Jun 18 2010
parent reply BCS <none anon.com> writes:
Hello bearophile,

 Ben Hanson:
 
 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?
 

evaluation to build a string of code, and the mix it in somewhere, the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax.

OTOH, and IHMO it should be avoided where you can, kinda like goto. (For one things, forget having usable line numbers in your error messages.) You can often get a lot done with static if, tuple foreach, const values feed to boilerplate code and the like with just a scattering of one or two line mixins stuffed in the middle.
 
 Bye,
 bearophile

... <IXOYE><
Jun 19 2010
parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

evaluation to build a string of code, and the mix it in somewhere, the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax.

one things, forget having usable line numbers in your error messages.) You can often get a lot done with static if, tuple foreach, const values feed to boilerplate code and the like with just a scattering of one or two line mixins stuffed in the middle.
 Bye,
 bearophile


It occurred to me that you could have a normal regex library that builds the regex at runtime, debug it, produce a code generator (similar code to re2c), debug that and then just wrap it with the mixin at the end. If you run into bugs, you can just do it at runtime instead and debug. The nice thing is, if I can get this converted, the lexer library is just slightly different. That means you could have the same approach with a lexer generator! This would lead to huge interest from the boost.spirit guys... Regards, Ben
Jun 19 2010
next sibling parent reply BCS <none anon.com> writes:
Hello Ben,

This would lead to huge
 interest from the boost.spirit guys...

I don't remember what versions this is (I've done about 3 that I can remember) or even if it works, but I wonder if they would have any interest in this: http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d Note (if that is the version I think it is) that the only mixins in that are tiny (mixin("label_"~tag!(id)~":;"); & mixin("goto label_"~tag!(id)~";");) or generating a big list invocations of other code. -- ... <IXOYE><
Jun 19 2010
parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
Hi BCS,
== Quote from BCS (none anon.com)'s article
 Hello Ben,
This would lead to huge
 interest from the boost.spirit guys...

or even if it works, but I wonder if they would have any interest in this: http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d Note (if that is the version I think it is) that the only mixins in that are tiny (mixin("label_"~tag!(id)~":;"); & mixin("goto label_"~tag!(id)~";");) or generating a big list invocations of other code.

I can't really speak for the Spirit people, but it's certainly interesting to me! :-) Here's some background to make things clearer: I'm not even currently a Spirit user myself as I've only really needed a tokeniser at work (they think that's mind-bogglingly sophisticated, never mind using a beast like Spirit! ;-)). The only reason they were interested in my lexer generator is that recursive descent lexing is pretty slow. So for more demanding grammars, a DFA based lexer is better. The lexer library allows you to create a lexer at runtime (which doesn't fit in so well with their model of doing everything at compile time), or you can generate code offline and then just add that to your project. This is why a compile time DFA lexer would be really interesting to them. From memory (Joel did a PDF recently, but that is on my works machine) Joel has been developing Spirit for over ten years. The latest version is pretty sophisticated and has all kinds of clever stuff for directly parsing data in structures all inline etc. Needless to say, I find the whole thing pretty mind boggling. The biggest problem (as far as I can see as an observer) is the compile times. This is where D could be really interesting overall (Hartmut has certainly got his eye on it and in fact I'm sure all the Boost people do). For my part, I'd like to see an LR parser generator developed. I'd be happy with one that creates the parser at runtime (assuming decent performance), but if it can generate and compile code efficiently at compile time, so much the better! :-) I really like the idea of being able to switch the runtime/compile time behaviour. When it comes to the code generation part for the DFA regex engine in D, I'd be happy to talk to you further about the techniques you've employed regarding emit. I'm still completely new to D, so it'll take me a while! :-) Cheers, Ben
Jun 19 2010
next sibling parent reply div0 <div0 users.sourceforge.net> writes:
On 19/06/2010 20:39, Ben Hanson wrote:
  From memory (Joel did a PDF recently, but that is on my works machine) Joel
has
 been developing Spirit for over ten years. The latest version is pretty
 sophisticated and has all kinds of clever stuff for directly parsing data in
 structures all inline etc. Needless to say, I find the whole thing pretty mind
 boggling. The biggest problem (as far as I can see as an observer) is the
 compile times. This is where D could be really interesting overall (Hartmut has
 certainly got his eye on it and in fact I'm sure all the Boost people do).

I ported most of the classic version of spirit to d a while back. I've recently written a XML parser using it and it takes nearly 7 whole seconds to compile with DMD, which is a vast amount of time compared to rest of the stuff I compile. :) The most trivial spirit parser in c++ takes over 30 seconds on my machine, even with everything in the pre compiled header. D is just a massive win for productively. -- My enormous talent is exceeded only by my outrageous laziness. http://www.ssTk.co.uk
Jun 19 2010
parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from div0 (div0 users.sourceforge.net)'s article
 On 19/06/2010 20:39, Ben Hanson wrote:
  From memory (Joel did a PDF recently, but that is on my works machine) Joel
has
 been developing Spirit for over ten years. The latest version is pretty
 sophisticated and has all kinds of clever stuff for directly parsing data in
 structures all inline etc. Needless to say, I find the whole thing pretty mind
 boggling. The biggest problem (as far as I can see as an observer) is the
 compile times. This is where D could be really interesting overall (Hartmut has
 certainly got his eye on it and in fact I'm sure all the Boost people do).

I've recently written a XML parser using it and it takes nearly 7 whole seconds to compile with DMD, which is a vast amount of time compared to rest of the stuff I compile. :) The most trivial spirit parser in c++ takes over 30 seconds on my machine, even with everything in the pre compiled header. D is just a massive win for productively.

Faster compile time code generation is a massive advantage for D. Hartmut contacted me about a runtime LALR parser generator called Sweet Parser (http://www.sweetsoftware.co.nz/parser_overview.php). They've contacted the author about integrating a version of that into Spirit. A version in D that could run at compile time would be cool too... Bizzare timing or what! It looks like interesting times lie ahead... Regards, Ben
Jun 22 2010
prev sibling parent reply BCS <none anon.com> writes:
Hello Ben,

 The only reason they were interested in my lexer generator is that
 recursive descent lexing is pretty slow. So for more demanding
 grammars, a DFA based lexer is better. The lexer library allows you to
 create a lexer at runtime (which doesn't fit in so well with their
 model of doing everything at compile time), or you can generate code
 offline and then just add that to your project. This is why a compile
 time DFA lexer would be really interesting to them.
 
 The biggest problem (as far as I can
 see as an observer) is the compile times.

I've never got a usable grammer for D to feed into it but I did get one (~200 productions) that at least compiled: it took 7 (or 17 minutes, I forget) and 700MB or ram To compile (P-III 550MHz, >700MB RAM)
 This is where D could be
 really interesting overall (Hartmut has certainly got his eye on it
 and in fact I'm sure all the Boost people do).
 
 For my part, I'd like to see an LR parser generator developed. I'd be
 happy with one that creates the parser at runtime (assuming decent
 performance), but if it can generate and compile code efficiently at
 compile time, so much the better! :-) I really like the idea of being
 able to switch the runtime/compile time behaviour.
 

Given that most LR parsers are table driven, all that would be needed for a compile time system would be that the table generator be evaluateable at compile time. The result could be stored in a constant. The only tricky bits then would be handling the data values on the stack and the part to tack in the action rules and that could just be a naming convention: class MyActions { resultType Action_statement_if(/+ args +/) { /+ code +/ } resultType Action_statement_ifThen(/+ args +/) { /+ code +/ } resultType Action_addExpression_plus(/+ args +/) { /+ code +/ } resultType Action_addExpression_minus(/+ args +/) { /+ code +/ } resultType Action_addExpression_chain(/+ args +/) { /+ code +/ } // ... mixin Parser!( `statment : "if" "(" expression ")" statment / if | "if" "(" expression ")" statment "then" statment / ifThen | "{" statment* "}" / block | expression ";" /expression; addExpression : ..... `); } Parser!(string) would generate a function that just assumes the existence of functions of the given names.
 When it comes to the code generation part for the DFA regex engine in
 D, I'd be happy to talk to you further about the techniques you've
 employed regarding emit. I'm still completely new to D, so it'll take
 me a while! :-)
 
 Cheers,
 
 Ben
 

... <IXOYE><
Jun 19 2010
parent Ben.Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from BCS (none anon.com)'s article
 Hello Ben,
 The only reason they were interested in my lexer generator is that
 recursive descent lexing is pretty slow. So for more demanding
 grammars, a DFA based lexer is better. The lexer library allows you to
 create a lexer at runtime (which doesn't fit in so well with their
 model of doing everything at compile time), or you can generate code
 offline and then just add that to your project. This is why a compile
 time DFA lexer would be really interesting to them.

 The biggest problem (as far as I can
 see as an observer) is the compile times.

productions) that at least compiled: it took 7 (or 17 minutes, I forget) and 700MB or ram To compile (P-III 550MHz, >700MB RAM)
 This is where D could be
 really interesting overall (Hartmut has certainly got his eye on it
 and in fact I'm sure all the Boost people do).

 For my part, I'd like to see an LR parser generator developed. I'd be
 happy with one that creates the parser at runtime (assuming decent
 performance), but if it can generate and compile code efficiently at
 compile time, so much the better! :-) I really like the idea of being
 able to switch the runtime/compile time behaviour.

a compile time system would be that the table generator be evaluateable at compile time. The result could be stored in a constant. The only tricky bits then would be handling the data values on the stack and the part to tack in the action rules and that could just be a naming convention: class MyActions { resultType Action_statement_if(/+ args +/) { /+ code +/ } resultType Action_statement_ifThen(/+ args +/) { /+ code +/ } resultType Action_addExpression_plus(/+ args +/) { /+ code +/ } resultType Action_addExpression_minus(/+ args +/) { /+ code +/ } resultType Action_addExpression_chain(/+ args +/) { /+ code +/ } // ... mixin Parser!( `statment : "if" "(" expression ")" statment / if | "if" "(" expression ")" statment "then" statment / ifThen | "{" statment* "}" / block | expression ";" /expression; addExpression : ..... `); } Parser!(string) would generate a function that just assumes the existence of functions of the given names.

the Sweet Parser for the Spirit guys. Regards, Ben
Jun 22 2010
prev sibling parent reply Don <nospam nospam.com> writes:
Ben Hanson wrote:
 == Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

evaluation to build a string of code, and the mix it in somewhere, the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax.



 one things, forget having usable line numbers in your error messages.) 


Whenever that happens, it's a compiler bug. If you can produce a test case where that happens, please put it in Bugzilla. You
 can often get a lot done with static if, tuple foreach, const values feed
 to boilerplate code and the like with just a scattering of one or two line
 mixins stuffed in the middle.
 Bye,it.
 bearophile


It occurred to me that you could have a normal regex library that builds the regex at runtime, debug it, produce a code generator (similar code to re2c), debug that and then just wrap it with the mixin at the end. If you run into bugs, you can just do it at runtime instead and debug. The nice thing is, if I can get this converted, the lexer library is just slightly different. That means you could have the same approach with a lexer generator! This would lead to huge interest from the boost.spirit guys...

That is exactly the way I recommend to develop CTFE code. Get the code completely working, run unit tests on every component, and then CTFE it.
Jun 19 2010
next sibling parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from Don (nospam nospam.com)'s article
 Ben Hanson wrote:
 == Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

evaluation to build a string of code, and the mix it in somewhere, the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax.



 one things, forget having usable line numbers in your error messages.)


If you can produce a test case where that happens, please put it in Bugzilla. You
 can often get a lot done with static if, tuple foreach, const values feed
 to boilerplate code and the like with just a scattering of one or two line
 mixins stuffed in the middle.
 Bye,it.
 bearophile


It occurred to me that you could have a normal regex library that builds


 runtime, debug it, produce a code generator (similar code to re2c), debug


 wrap it with the mixin at the end. If you run into bugs, you can just do it


 instead and debug. The nice thing is, if I can get this converted, the


 slightly different. That means you could have the same approach with a


 would lead to huge interest from the boost.spirit guys...

completely working, run unit tests on every component, and then CTFE it.

Jun 19 2010
prev sibling parent reply BCS <none anon.com> writes:
Hello Don,

 == Quote from BCS (none anon.com)'s article
 
 OTOH, and IHMO it should be avoided where you can, kinda like goto.
 (For one things, forget having usable line numbers in your error
 messages.)
 


If you can produce a test case where that happens, please put it in Bugzilla.

The issue isn't that the error has no line number but that the line number is the line of the mixin or the line within the mixin (I forget which) to be useful you need both AND the full text of the mixin. Yes there are ways around this but my original point is that in my experience, I can get just about as much done and for about the same effort with very little code being mixed in. My approach to this kind of problem has typically been to use CTFE to process input, template and or mixins to generate an AST (or whatever the problem demands) represented as a type and normal template, static ifs and tuple foreach to do the code generation with the occasional line of code mixed in here and there. The result of this is that most of the code is transparently visible as code rather than as string literals. -- ... <IXOYE><
Jun 19 2010
parent "Nick Sabalausky" <a a.a> writes:
"BCS" <none anon.com> wrote in message 
news:a6268ff154bd8ccddee0380cd08 news.digitalmars.com...
 Hello Don,

 == Quote from BCS (none anon.com)'s article

 OTOH, and IHMO it should be avoided where you can, kinda like goto.
 (For one things, forget having usable line numbers in your error
 messages.)


If you can produce a test case where that happens, please put it in Bugzilla.

The issue isn't that the error has no line number but that the line number is the line of the mixin or the line within the mixin (I forget which) to be useful you need both AND the full text of the mixin.

For reference, that's bug #2887: http://d.puremagic.com/issues/show_bug.cgi?id=2887
Jun 22 2010
prev sibling next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/18/2010 06:36 AM, Ben Hanson wrote:
 Reading your original post again and thinking about this a bit more...

 If someone can help me get up to speed with TMP in D, I could probably put
 together a proof of concept pretty quickly. Aside from the D syntax, it is all
 in the design (which is why I would like to discuss it in more detail). For
 anyone who is interested, here is how I go about DFA production currently:

 - Bog standard tokenisation of regex tokens
 - Normalise each regex charset and enumerate in a map
 - build a regex syntax tree with leaf nodes storing the enumerated charset id
 - Intersect all the charsets and build a vector of equivalent charsets
 - Perform the followset to DFA transform as described in the Dragon book

 For my lexer generator, I build a lookup table for each character to an
 equivalence class. For my wild card matcher, I leave the representation as
 characters and do a linear lookup of chars to 'next' state. The lexer
 equivalence class method (a la flex) is more efficient as you just have two
 lookups per character with no linear searches.

 I'm thinking I could look at getting as far as the regex syntax tree and then
 we could discuss DFA representation (it sounds like you want to generate code
 directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
 - I imagine there is some overlap there, but I don't know enough to be sure.
 Can you output source code at compilation time and have D compile it, or do you
 use a load of recursion like C++ TMP?

 And I forgot to mention - boost.spirit uses my lexertl library.

 Regards,

 Ben

I've always understood that dfas have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a dfa generator (or written one, sadly enough). What has your experience been on dfa compile time? I suppose RE2's syntax is a good starting point http://code.google.com/p/re2/wiki/Syntax I wonder about those unicode character classes, though. Looks like an imminent head-on collision with dmd's issues regarding memory management at compile time.
Jun 18 2010
next sibling parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
Hi Ellery,

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 I've always understood that dfas have an exponential lower bound
 relative to .. something .. but I don't believe I have ever actually
 used a dfa generator (or written one, sadly enough). What has your
 experience been on dfa compile time?
 I suppose RE2's syntax is a good starting point
 http://code.google.com/p/re2/wiki/Syntax
 I wonder about those unicode character classes, though. Looks like an
 imminent head-on collision with dmd's issues regarding memory management
 at compile time.

I also wondered about this and originally just thought, well, you can solve any performance problems with decent optimal code! :-) So this train of thought was years ago now and in reality the only way I can really see to bring DFA compilation to its knees is to use massive repeats. DFA also implies a more limited syntax (Andrei hinted at this in his original post) and there are certain things you just don't need to worry about any more such as 'possessive' quantifiers. Traditionally DFA regex has been used for lexer generators as they generally don't need some of the more outlandish 'extra' features, so it's possible that a straight up regex engine would have more demands put on it, but there again, a lexer generator is dealing with a list of regexes, not just one! The last time I timed it, lexertl could compile all of the C++ token rules in well under 0.5 seconds. There are such things as tagged dfas, which I would love to know more about. The Tango regex library uses them for captures I believe. If anyone can provide more info on how TDFAs really work that would be fantastic. All the literature I've read didn't exactly enlighten me (including Ville Laurikari's papers). The full re2 syntax is way more than I could provide. See http:// www.benhanson.net/lexertl.html for a table of regex syntax I know how to support. On the subject of Unicode chars, this is an interesting topic. The Tango regex library gets around the storage requirements by having a list of ranges. The approach lexertl takes is to maintain a sorted string of chars. The trick is, if the string has more than 0.5 of the max possible number of chars in it, you negate it. This is actually part of the normalisation process and leads to quicker intersections later on. When it comes to the final machine, lexertl slices 'wide chars' into bytes for lookup now, keeping the first stage lookup to 256 entries (I can elaborate if anyone wants me too). Of course if you are generating code (as seems to be desired) none of this matters very much. Don't forget that if you intersect all the charsets *once* after building your syntax tree, then when you do transition intersections, you are only intersecting sets of indexes to charset equivalence classes which is really quick. For a quick example: - Say you have two char classes, one is [a-z] and the other one is [a]. - You intersect these classes once giving [b-z] and [a] - So if originally [a-z] = 0 and [a] = 1 (as I said, charsets are enumerated as they are scanned), then now the equivalence index sets are [a-z] = {0, 1} and [a] = {1} due to [b-z] = 0 and [a] = 1. When you generate the DFA you are then intersecting {0, 1} and {1} rather than the actual charsets. The sets are obviously really small! :-) Hope that was useful, or at least interesting! Regards, Ben
Jun 18 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ellery Newcomer wrote:
 On 06/18/2010 06:36 AM, Ben Hanson wrote:
 Reading your original post again and thinking about this a bit more...

 If someone can help me get up to speed with TMP in D, I could probably 
 put
 together a proof of concept pretty quickly. Aside from the D syntax, 
 it is all
 in the design (which is why I would like to discuss it in more 
 detail). For
 anyone who is interested, here is how I go about DFA production 
 currently:

 - Bog standard tokenisation of regex tokens
 - Normalise each regex charset and enumerate in a map
 - build a regex syntax tree with leaf nodes storing the enumerated 
 charset id
 - Intersect all the charsets and build a vector of equivalent charsets
 - Perform the followset to DFA transform as described in the Dragon book

 For my lexer generator, I build a lookup table for each character to an
 equivalence class. For my wild card matcher, I leave the 
 representation as
 characters and do a linear lookup of chars to 'next' state. The lexer
 equivalence class method (a la flex) is more efficient as you just 
 have two
 lookups per character with no linear searches.

 I'm thinking I could look at getting as far as the regex syntax tree 
 and then
 we could discuss DFA representation (it sounds like you want to 
 generate code
 directly - how do you do that?! This is why I mentioned the .NET 
 bytecode stuff
 - I imagine there is some overlap there, but I don't know enough to be 
 sure.
 Can you output source code at compilation time and have D compile it, 
 or do you
 use a load of recursion like C++ TMP?

 And I forgot to mention - boost.spirit uses my lexertl library.

 Regards,

 Ben

I've always understood that dfas have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a dfa generator (or written one, sadly enough). What has your experience been on dfa compile time?

I think only backtracking engines (not DFAs) have the exponential run time problem. Regular expression grammar is itself context free (so it's parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, and the generated DFA makes one transition per input character. Andrei
Jun 18 2010
parent reply sybrandy <sybrandy gmail.com> writes:
 I think only backtracking engines (not DFAs) have the exponential run
 time problem. Regular expression grammar is itself context free (so it's
 parsable efficiently), the NFA and NFA->DFA algorithms are polynomial,
 and the generated DFA makes one transition per input character.


 Andrei

I don't know how practical it is, but I had a thought to allow some of these nasty regexes to run better: use threads. In theory, and I have not done any work to test/prove this so it's very theoretical, you can have multiple threads running in parallel each testing a potential path. Granted, there's always the issue of spawning/managing threads and potentially overloading the system by spawning too many, however it may be a viable alternative if you really HAVE to execute a particularly nasty regex. Anyway, it's just a thought. I'd hate to lose the benefits of a NFA if we're afraid of the potential performance hit if a user writes a bad regex. Casey
Jun 18 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/18/2010 05:08 PM, sybrandy wrote:
 I think only backtracking engines (not DFAs) have the exponential run
 time problem. Regular expression grammar is itself context free (so it's
 parsable efficiently), the NFA and NFA->DFA algorithms are polynomial,
 and the generated DFA makes one transition per input character.


 Andrei


The NFA->DFA algorithms are what I was referring to. My copy of the dragon book suggests that DFAs usually have O(n^3) initialization time but O(n^2 * 2^n) worst case relative to the number of states. I'd love to be shown wrong, though.
 I don't know how practical it is, but I had a thought to allow some of
 these nasty regexes to run better: use threads. In theory, and I have
 not done any work to test/prove this so it's very theoretical, you can
 have multiple threads running in parallel each testing a potential path.
 Granted, there's always the issue of spawning/managing threads and
 potentially overloading the system by spawning too many, however it may
 be a viable alternative if you really HAVE to execute a particularly
 nasty regex.

Isn't what you're describing an NFA?
 Anyway, it's just a thought. I'd hate to lose the benefits of a NFA if
 we're afraid of the potential performance hit if a user writes a bad regex.

 Casey

You don't get a performance hit with DFAs on knarly regexen (actually, they would be faster than NFAs), just regexen that take forever to compile. I wonder if it would be possible to mechanically detect problem cases for DFA compilation and on detection switch over to an NFA engine.
Jun 18 2010
parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 The NFA->DFA algorithms are what I was referring to. My copy of the
 dragon book suggests that DFAs usually have O(n^3) initialization time
 but O(n^2 * 2^n) worst case relative to the number of states. I'd love
 to be shown wrong, though.

challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,0000 times or something, then yes, you will get a huge machine and it will take time to construct! :-) There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.
 You don't get a performance hit with DFAs on knarly regexen (actually,
 they would be faster than NFAs), just regexen that take forever to
 compile. I wonder if it would be possible to mechanically detect problem
 cases for DFA compilation and on detection switch over to an NFA engine.

DFA compilation is particularly good for simplifying verbose regexes, even wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past and as far as feaures go, this is probably the ideal way to go. Russ Cox has some clever techniques too. Don't forget that grep is (traditionally) implemented with a DFA engine. Of course, the real irony is that Henry Spencer wrote a much better regex engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than his original NFA backtracker and yet everyone holds up the backtracking algorithm as the way to do things... That's inertia for you. Regards, Ben
Jun 19 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

et unicode?
 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.

it does
 DFA compilation is particularly good for simplifying verbose regexes, even
 wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past
 and as far as feaures go, this is probably the ideal way to go. Russ Cox has
 some clever techniques too.

interesting
 Don't forget that grep is (traditionally) implemented with a DFA engine.

I didn't know that
 Of course, the real irony is that Henry Spencer wrote a much better regex
 engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than
 his original NFA backtracker and yet everyone holds up the backtracking
 algorithm as the way to do things...

 That's inertia for you.

 Regards,

 Ben

Jun 19 2010
next sibling parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible


 As I don't have some D code yet, you could try the current lexertl code


 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is


 repeats. If you repeat an expression 50,0000 times or something, then yes,


 will get a huge machine and it will take time to construct! :-)


- If you're building a lexer using 8 bit chars, then no problem. Two phase lookup, first lookup map all 256 chars to equivalence class index, then that index is used to locate the correct column in the DFA. - If you're using 'wide characters', then still have a 256 entry lookup table and slice each 'wide character' into 8 bit chunks and use those chunks as charsets. - If you want to generate code, you can generate tables using a first phase lookup of MAX_CHARS of the 'wide character'. You can then convert this back to a character based state machine and then you can just deal with characters again instead of equivalence classes. One of the enhancements I want to make, is the ability to go straight to a char based state machine to avoid the huge first phase lookup. Now, as far a huge Unicode ranges goes, this shouldn't be a problem when you're generating code. You can use an if statement with an upper and lower bound in those cases and only use a switch statement when the range is disparate etc. If the technique of storing all those chars in a string ends up being too expensive on RAM (even with the auto negation trick), then there's always the method used by Tango Regex, which is a list of ranges (this probably works quite well with garbage collection - remember I designed for C++). Let me know if any of that isn't clear. Basically it's not a huge problem, you can use various techniques to compress the data sensibly and efficiently.
 There are papers on tackling this with counting repeats, even for DFAs. If


 area interests anyone, please let me know.


 DFA compilation is particularly good for simplifying verbose regexes, even
 wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the


 and as far as feaures go, this is probably the ideal way to go. Russ Cox has
 some clever techniques too.


~rsc/regexp/regexp2.html) http://swtch.com/~rsc/regexp/regexp3.html discusses RE2.
 Don't forget that grep is (traditionally) implemented with a DFA engine.


(amongst others). Lua has an interesting pattern matcher, as it goes more in the formal grammar direction to (for example) match balanced parenthesis etc. Another thing I'd like to do (no doubt time will not allow!) is to extend Notepad RE (http://www.codeproject.com/KB/recipes/notepadre.aspx) to have the LUA pattern matcher as an option.
 Of course, the real irony is that Henry Spencer wrote a much better regex
 engine for TCL (a DFA/NFA hybrid as I understand it) which is much better


 his original NFA backtracker and yet everyone holds up the backtracking
 algorithm as the way to do things...

 That's inertia for you.

 Regards,

 Ben


Ben
Jun 19 2010
next sibling parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
I forgot to mention... Ideally I'd like some kind of abstract interface to the
generated DFA (this is in data mode before code generation etc.) so that
different implementations can be hidden under a common interface.

This is another example of why it would be nice to discuss the implementation
with others! :-) Generally people just don't care about this stuff. I realise
that various folk would like to go one better and not even assume a DFA behind
the interface, but that is currently out of my scope..!

Regards,

Ben
Jun 19 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message 
news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis 
 etc.

That's cool. That's something I've often wanted in regexes (It would be great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.
Jun 19 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:hvjmo1$2mnu$1 digitalmars.com...
 "Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message 
 news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis 
 etc.

That's cool. That's something I've often wanted in regexes (It would be great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.

That's "...defining lexer tokens...", not "...defining lever tokens...", of course.
Jun 19 2010
prev sibling parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from Nick Sabalausky (a a.a)'s article
 "Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message
 news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis
 etc.

great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.

Yeah, it does seem obvious that there'd be some middle ground between regexes - certainly the author of the LUA pattern matching thought so! :-) The topic could definitely do with more attention. There is limited back-tracking in the LUA pattern matching, which I don't like, but it appears more efficient than back-tracking NFA. Apparantly it has some solid theory behind it too, which is good. Sadly just not enough time to persue all these possibilities... Regards, Ben
Jun 22 2010
prev sibling parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.


Jun 21 2010
parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/21/2010 04:07 AM, Ben Hanson wrote:
 == Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.



thanks
Jun 21 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050 Bye, bearophile
Jun 18 2010
parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050 Bye, bearophile

It's a great piece of work, but the rest of the CTFE mechanics are not ready for that patch. The one which is blocking everything in CTFE is bug 1330. At present, CTFE uses copy-on-write, and that doesn't work with reference types. Fixing that is going to require large changes throughout the compiler, so I've been holding off on that one. Meanwhile I've been trying to fix all the non-reference bugs in CTFE and assembling a comprehensive test suite. So to answer the original question -- the main things that don't work in CTFE that are allowable in safe D are classes, exceptions, and built-in features which are implemented in the runtime rather than in the compiler (including array operations).
Jun 18 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Don" <nospam nospam.com> wrote in message 
news:hvgasq$2k3a$1 digitalmars.com...
 bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050 Bye, bearophile

It's a great piece of work, but the rest of the CTFE mechanics are not ready for that patch. The one which is blocking everything in CTFE is bug 1330.

IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, so if it's fixed, that would break any code that does that detection, so another CTFE-detection mechanism may be needed before a 1330 fix.
Jun 18 2010
parent Don <nospam nospam.com> writes:
Nick Sabalausky wrote:
 "Don" <nospam nospam.com> wrote in message 
 news:hvgasq$2k3a$1 digitalmars.com...
 bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

http://d.puremagic.com/issues/show_bug.cgi?id=3050 Bye, bearophile

ready for that patch. The one which is blocking everything in CTFE is bug 1330.

IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, so if it's fixed, that would break any code that does that detection, so another CTFE-detection mechanism may be needed before a 1330 fix.

Jun 18 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:

 I'm thinking I could look at getting as far as the regex syntax tree and
 then we could discuss DFA representation (it sounds like you want to
 generate code directly - how do you do that?! This is why I mentioned
 the .NET bytecode stuff - I imagine there is some overlap there, but I
 don't know enough to be sure. Can you output source code at compilation
 time and have D compile it, or do you use a load of recursion like C++
 TMP?

D's metaprogramming facilities are insanely powerful. You can take any string containing valid source code (that is known at compile time, of course) and "paste" it into your code with the mixin() expression: mixin("int i;"); i = 2; For the regex parser, I suppose you'd do something like: string makeParserCode(string regex) { ... } struct sregex(string regex) { mixin(makeParserCode(regex)); } -Lars
Jun 18 2010
prev sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 18 Jun 2010 12:09:54 +0000, Lars T. Kyllingstad wrote:

 On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:
 
 I'm thinking I could look at getting as far as the regex syntax tree
 and then we could discuss DFA representation (it sounds like you want
 to generate code directly - how do you do that?! This is why I
 mentioned the .NET bytecode stuff - I imagine there is some overlap
 there, but I don't know enough to be sure. Can you output source code
 at compilation time and have D compile it, or do you use a load of
 recursion like C++ TMP?

D's metaprogramming facilities are insanely powerful. You can take any string containing valid source code (that is known at compile time, of course) and "paste" it into your code with the mixin() expression: mixin("int i;"); i = 2; For the regex parser, I suppose you'd do something like: string makeParserCode(string regex) { ... } struct sregex(string regex) { mixin(makeParserCode(regex)); }

I should mention that the makeParserCode() function in the example must of course be evaluatable at compile time, and there are some restrictions on such functions. Lately, many of those restrictions have been overcome, though, and I'm not sure what's left. Does anyone know what the current limitations on CTFE are? -Lars
Jun 18 2010
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2010-06-18 06:44, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html


 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

The is already a compile time regular expression engine available in the DDL project at dsource, it's in the meta package: http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it's is still usable. -- /Jacob Carlborg
Jun 18 2010
parent Don <nospam nospam.com> writes:
Jacob Carlborg wrote:
 On 2010-06-18 06:44, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-appr
ach-to-regular.html 



 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

The is already a compile time regular expression engine available in the DDL project at dsource, it's in the meta package: http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it's is still usable.

I wrote that code before CTFE was available. It helped to design D's metaprogramming system, but the code itself is thoroughly obsolete now. It is SO much easier to write that stuff with CTFE. Incidentally, now that we have the no-brackets form of template instantiation, there's much more freedom for compile-time regex syntax. auto k = regex!"[A-Za-z][0-9A-Za-z]*"(s1);
Jun 18 2010
prev sibling next sibling parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
OK, I've just ordered The D Programming Language from Amazon. Hopefully this
will
enable me to cut out a certain amount of total n00b questions such as "How do
strings work?".

Regards,

Ben
Jun 18 2010
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:hvetik$2k7j$1 digitalmars.com...
 There are currently two regexen in the standard library. The older one, 
 std.regexp, is time-tested but only works with UTF8 and has a clunkier 
 API. The newer one, std.regex, is newer and isolates the engine from the 
 matches (and therefore can reuse and cache engines easier), and supports 
 all character widths. But it's less tested and doesn't have that great of 
 an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important improvement 
 is not in the interface - it's in the engine. The current engine is 
 adequate but nothing to write home about, and for simple regexen is 
 markedly slower than equivalent hand-written code (e.g. matching 
 whitespace). One great opportunity would be for D to leverage its uncanny 
 compile-time evaluation abilities and offer a regex that parses the 
 pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with captures 
 and backreferences etc., but it would have guaranteed performance (e.g. it 
 would be an automaton instead of a backtracking engine) and would be darn 
 fast because it would generate custom code for each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

 If we get as far as implementing what RE2 can do with compile-time 
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or not), 
 I highly encourage you to do so.

This would be a good thing for it to pay attention to, if it doesn't already: http://swtch.com/~rsc/regexp/regexp1.html
Jun 18 2010
prev sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/17/2010 11:44 PM, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html


 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

Generally I think D's CT capabilities have a way to go yet before this would be worth tackling. E.g. how do you build a parse tree if pointers and classes aren't allowed in CT code? But, predicated on the assumption that it will happen relatively soon.. What do we want our regex language to look like? this: http://digitalmars.com/ctg/regular.html is okay, but I like the idea of ascii and unicode character classes as per RE2's syntax, and I don't like the idea of an overloaded \n. (I assume we're talking about strings as bidirectional unicode ranges) and I would like to see the language grammar explicitly written out somewhere. and I like the idea of capture groups. Thinking of something along the lines of import std.regex; int i; double d; ctcapture!r"(?<i>\d+): (?<d>\d+\.\d+)"( "10: 100.5"); //this will probably have to be inside a mixin statement assert(i == 10); assert(d == 100.5); //grumph, floating point.. for that matter, why not some predefined subexpressions corresponding to D's integer literals, et al? And available inside the regex? ctcapture!r"(?<i>{:IntegerLiteral:}): (?<d>{:FloatLiteral:})" // Yeah, I know. You come up with a better syntax ("10: 0xEA.A1p-2"); Okay, I'm done dreaming. I don't know how much of the above can be done within the constraints of an FA. How much sense does regex over unicode make? Do others have ideas for features for regex?
Jun 19 2010
parent reply BCS <none anon.com> writes:
Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before this
 would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

You use the type system. I can say from experience that (memory issues aside) it works. You can even do duck typing "Template Foo works on anything that defines a Bar and a Baz." -- ... <IXOYE><
Jun 19 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"BCS" <none anon.com> wrote in message 
news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before this
 would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

You use the type system. I can say from experience that (memory issues aside) it works.

Trivial example?
Jun 19 2010
parent reply BCS <none anon.com> writes:
Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 
 Hello Ellery,
 
 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?
 

issues aside) it works.


Building an arbitrary tree: int find(char[] data) { foreach(int j, char c; data) switch(c) { default: break; case '[', ']': return j; } return data.length; } int match(char[] data) { int i = 0; foreach(int j, char c; data) switch(c) { default: break; case '[': i++; break; case ']': if(--i == 0) return j; } return data.length; } template Tp(T...){ alias T Tp; } struct Bar(T...) { } template Foo(char[] data) { static if(data.length == 0) alias Tp!() Foo; else static if(data[0] == '[') alias Tp!(Bar!(Foo!(data[1..match(data)])), Foo!(data[match(data)+1..$])) Foo; else alias Tp!(data[0..find(data)], Foo!(data[find(data)..$])) Foo; } import std.stdio; void main() { writef("%s\n", Foo!("So [she was [considering ][in her] own[ mind [(as well] as] she could, ]for the").stringof); } Output: tuple("So ",(Bar!("she was ",Bar!("considering ") ,Bar!("in her") ," own",Bar!(" mind ",Bar!("(as well") ," as") ," she could, ") ),"for the") By placing things under different aliases and whatnot, named members can be done. Also, if the lisp like constructs puts you off, the construction of the tree can be done via more or less the same parsing code and some fairly trivial ctfe+mixin -- ... <IXOYE><
Jun 19 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

issues aside) it works.


Building an arbitrary tree:

Okay, you use structs without indirection. I guess that works. How much copying does that entail when you e.g. pass your tree in a function call?
Jun 19 2010
next sibling parent Don <nospam nospam.com> writes:
Ellery Newcomer wrote:
 On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

issues aside) it works.


Building an arbitrary tree:

Okay, you use structs without indirection. I guess that works. How much copying does that entail when you e.g. pass your tree in a function call?

Currently CTFE uses copy-on-write. If you're just passing the tree, not modifying it, there's no copying. If you modify the tree, there's a horrific amount of copying -- this is the major CTFE bug.
Jun 20 2010
prev sibling next sibling parent BCS <none anon.com> writes:
Hello Ellery,

 On 06/19/2010 10:55 PM, BCS wrote:
 
 Hello Nick,
 
 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 Hello Ellery,
 
 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree
 if pointers and classes aren't allowed in CT code?
 

issues aside) it works.



much copying does that entail when you e.g. pass your tree in a function call?

At compile time as a template parameter? I think it gets passed around as a reference. At runtime? That would be pointless as it has no runtime members. -- ... <IXOYE><
Jun 20 2010
prev sibling parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/19/2010 11:27 PM, Ellery Newcomer wrote:
 On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

issues aside) it works.


Building an arbitrary tree:

Okay, you use structs without indirection. I guess that works. How much copying does that entail when you e.g. pass your tree in a function call?

Nevermind, I see what you're doing now.
Jun 21 2010