digitalmars.D - Improving std.regex(p)

Andrei Alexandrescu (26/26) Jun 17 2010 There are currently two regexen in the standard library. The older one,

Lars T. Kyllingstad (8/40) Jun 17 2010 There is the 'scregexp' project on dsource:

Andrei Alexandrescu (9/52) Jun 18 2010 scregexp includes the following requirement within the license:

David Gileadi (7/15) Jun 22 2010 Since I didn't see someone else reply, back when he announced the code h...

Ben Hanson (19/19) Jun 18 2010 Hi Andrei,

Ben Hanson (24/24) Jun 18 2010 Reading your original post again and thinking about this a bit more...

bearophile (4/6) Jun 18 2010 This can be done with string mixin. You can use compile-time evaluation ...

BCS (8/21) Jun 19 2010 OTOH, and IHMO it should be avoided where you can, kinda like goto. (For...

Ben Hanson (9/28) Jun 19 2010 It occurred to me that you could have a normal regex library that builds...

BCS (9/11) Jun 19 2010 I don't remember what versions this is (I've done about 3 that I can rem...

Ben Hanson (31/40) Jun 19 2010 I can't really speak for the Spirit people, but it's certainly interesti...

div0 (11/18) Jun 19 2010 I ported most of the classic version of spirit to d a while back.

Ben Hanson (10/25) Jun 22 2010 Faster compile time code generation is a massive advantage for D.

BCS (29/58) Jun 19 2010 I've never got a usable grammer for D to feed into it but I did get one ...

Ben.Hanson (5/52) Jun 22 2010 Let's come back to this! For now I will concentrate on the regex stuff a...

Don (8/34) Jun 19 2010 Whenever that happens, it's a compiler bug.

Ben Hanson (7/41) Jun 19 2010 that and then just
BCS (14/23) Jun 19 2010 The issue isn't that the error has no line number but that the line numb...

Nick Sabalausky (4/17) Jun 22 2010 For reference, that's bug #2887:

Lars T. Kyllingstad (13/20) Jun 18 2010 D's metaprogramming facilities are insanely powerful. You can take any

Lars T. Kyllingstad (7/32) Jun 18 2010 I should mention that the makeParserCode() function in the example must

bearophile (5/6) Jun 18 2010 One less if this works and gets accepted:

Don (13/21) Jun 18 2010 It's a great piece of work, but the rest of the CTFE mechanics are not

Nick Sabalausky (5/17) Jun 18 2010 IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, s...

Don (2/21) Jun 18 2010 Not any more. There's __ctfe now.

Ellery Newcomer (10/34) Jun 18 2010 I've always understood that dfas have an exponential lower bound

Ben Hanson (44/53) Jun 18 2010 I also wondered about this and originally just thought, well, you can so...
Andrei Alexandrescu (6/55) Jun 18 2010 I think only backtracking engines (not DFAs) have the exponential run

sybrandy (11/16) Jun 18 2010 I don't know how practical it is, but I had a thought to allow some of

Ellery Newcomer (10/28) Jun 18 2010 The NFA->DFA algorithms are what I was referring to. My copy of the

Ben Hanson (24/32) Jun 19 2010 I can't prove what the comlexity is for DFA compilation, but instead, I

Ellery Newcomer (5/28) Jun 19 2010 it does

Ben Hanson (42/77) Jun 19 2010 (just

Ben Hanson (9/9) Jun 19 2010 I forgot to mention... Ideally I'd like some kind of abstract interface ...
Nick Sabalausky (6/9) Jun 19 2010 That's cool. That's something I've often wanted in regexes (It would be

Nick Sabalausky (4/13) Jun 19 2010 That's "...defining lexer tokens...", not "...defining lever tokens...",...
Ben Hanson (10/19) Jun 22 2010 Yeah, it does seem obvious that there'd be some middle ground between re...

Ben Hanson (2/18) Jun 21 2010 Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.p...

Ellery Newcomer (2/20) Jun 21 2010 thanks

Jacob Carlborg (7/33) Jun 18 2010 The is already a compile time regular expression engine available in the...

Don (7/51) Jun 18 2010 I wrote that code before CTFE was available. It helped to design D's

Ben Hanson (5/5) Jun 18 2010 OK, I've just ordered The D Programming Language from Amazon. Hopefully ...
Nick Sabalausky (5/29) Jun 18 2010 This would be a good thing for it to pay attention to, if it doesn't
Ellery Newcomer (29/55) Jun 19 2010 Generally I think D's CT capabilities have a way to go yet before this

BCS (6/9) Jun 19 2010 You use the type system. I can say from experience that (memory issues a...

Nick Sabalausky (3/9) Jun 19 2010 Trivial example?

BCS (52/66) Jun 19 2010 Building an arbitrary tree:

Ellery Newcomer (3/19) Jun 19 2010 Okay, you use structs without indirection. I guess that works. How much

Don (4/26) Jun 20 2010 Currently CTFE uses copy-on-write. If you're just passing the tree, not
BCS (6/29) Jun 20 2010 At compile time as a template parameter? I think it gets passed around a...
Ellery Newcomer (2/23) Jun 21 2010 Nevermind, I see what you're doing now.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

There are currently two regexen in the standard library. The older one, 
std.regexp, is time-tested but only works with UTF8 and has a clunkier 
API. The newer one, std.regex, is newer and isolates the engine from the 
matches (and therefore can reuse and cache engines easier), and supports 
all character widths. But it's less tested and doesn't have that great 
of an interface because it pretty much inherits the existing one.

I wish to improve regex handling in Phobos. The most important 
improvement is not in the interface - it's in the engine. The current 
engine is adequate but nothing to write home about, and for simple 
regexen is markedly slower than equivalent hand-written code (e.g. 
matching whitespace). One great opportunity would be for D to leverage 
its uncanny compile-time evaluation abilities and offer a regex that 
parses the pattern during compilation:

foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

Such a static regex could be simpler than a full-blown regex with 
captures and backreferences etc., but it would have guaranteed 
performance (e.g. it would be an automaton instead of a backtracking 
engine) and would be darn fast because it would generate custom code for 
each regex pattern.

See related work:

http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

If we get as far as implementing what RE2 can do with compile-time 
evaluation, people will definitely notice.

If there's anyone who'd want to tackle such a project (for Phobos or 
not), I highly encourage you to do so.


Andrei

Jun 17 2010

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Thu, 17 Jun 2010 21:44:03 -0700, Andrei Alexandrescu wrote:

 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.
 
 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:
 
 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }
 
 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.
 
 See related work:
 
 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-

to-regular.html
 
 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.
 
 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


There is the 'scregexp' project on dsource:

    http://www.dsource.org/projects/scregexp/

It's D1/Tango, but maybe it could be adapted to D2/Phobos?  It would at 
least serve as a starting point for anyone wanting to try their hand at 
doing this.

-Lars

Jun 17 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Lars T. Kyllingstad wrote:
 On Thu, 17 Jun 2010 21:44:03 -0700, Andrei Alexandrescu wrote:
 
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-

 to-regular.html
 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.

 
 
 There is the 'scregexp' project on dsource:
 
     http://www.dsource.org/projects/scregexp/
 
 It's D1/Tango, but maybe it could be adapted to D2/Phobos?  It would at 
 least serve as a starting point for anyone wanting to try their hand at 
 doing this.

scregexp includes the following requirement within the license:

"Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution."

That would need to be changed before inclusion in Phobos. It looks like
there are three people in the copyright notice: Walter Bright, Marton
Papp, and yidabu. Does anyone know Marton's email address?


Andrei

Jun 18 2010

David Gileadi <gileadisNO SPAMgmail.com> writes:

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s

 scregexp includes the following requirement within the license:
 "Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in the
 documentation and/or other materials provided with the distribution."
 That would need to be changed before inclusion in Phobos. It looks like
 there are three people in the copyright notice: Walter Bright, Marton
 Papp, and yidabu. Does anyone know Marton's email address?
 Andrei

Since I didn't see someone else reply, back when he announced the code he used
anteusz freemail.hu.

Google also turned up http://www.equinoxbase.com/. I'm not sure it's the same
fellow, but the page has his name, mentions D and has a contact form.  You could
also try PMing him at dsource: http://www.dsource.org/forums/profile.php?
mode=viewprofile&u=1513

Jun 22 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

Hi Andrei,

I am unable to offer to write an entire compile time regex library in D, but it
is certainly an interesting subject for me and would be happy to contribute to
any discussions. I am sure a D implementation could be much more successful
than a C++ TMP DFA producer (see http://old.nabble.com/-regex%2C-xpressive--
interesting(-)-perf-benchmark-to28800789.html#a28815063 - I'm sure you have
already! ;-))

I am not really into D yet, although I've been following this group for a
while. The release of Visual D is a huge improvement and of course I am very
interested in your new book!

Here are the links to the DFA implementations I have done in C++:

http://www.benhanson.net/lexertl.html
http://www.codeproject.com/KB/recipes/wc.aspx


hopefully emitting CLR byte-codes.

I would be happy to talk to anyone concerning all things DFA (and it would be
intriguing to code up or try some code snippets!)

Regards,

Ben

Jun 18 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

Reading your original post again and thinking about this a bit more...

If someone can help me get up to speed with TMP in D, I could probably put
together a proof of concept pretty quickly. Aside from the D syntax, it is all
in the design (which is why I would like to discuss it in more detail). For
anyone who is interested, here is how I go about DFA production currently:

- Bog standard tokenisation of regex tokens
- Normalise each regex charset and enumerate in a map
- build a regex syntax tree with leaf nodes storing the enumerated charset id
- Intersect all the charsets and build a vector of equivalent charsets
- Perform the followset to DFA transform as described in the Dragon book

For my lexer generator, I build a lookup table for each character to an
equivalence class. For my wild card matcher, I leave the representation as
characters and do a linear lookup of chars to 'next' state. The lexer
equivalence class method (a la flex) is more efficient as you just have two
lookups per character with no linear searches.

I'm thinking I could look at getting as far as the regex syntax tree and then
we could discuss DFA representation (it sounds like you want to generate code
directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
- I imagine there is some overlap there, but I don't know enough to be sure.
Can you output source code at compilation time and have D compile it, or do you
use a load of recursion like C++ TMP?

And I forgot to mention - boost.spirit uses my lexertl library.

Regards,

Ben

Jun 18 2010

bearophile <bearophileHUGS lycos.com> writes:

Ben Hanson:
 Can you output source code at compilation time and have D compile it, or do you
 use a load of recursion like C++ TMP?

This can be done with string mixin. You can use compile-time evaluation to
build a string of code, and the mix it in somewhere, the compiler will compile
it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used.
You can use it to build a string of asm code too, because it's part of D syntax.

Bye,
bearophile

Jun 18 2010

BCS <none anon.com> writes:

Hello bearophile,

 Ben Hanson:
 
 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?
 

 This can be done with string mixin. You can use compile-time
 evaluation to build a string of code, and the mix it in somewhere, the
 compiler will compile it. It's a dirty thing, but unless D3 gets
 macros, it's the thing to be used. You can use it to build a string of
 asm code too, because it's part of D syntax.

OTOH, and IHMO it should be avoided where you can, kinda like goto. (For 
one things, forget having usable line numbers in your error messages.) You 
can often get a lot done with static if, tuple foreach, const values feed 
to boilerplate code and the like with just a scattering of one or two line 
mixins stuffed in the middle.

 
 Bye,
 bearophile

-- 
... <IXOYE><

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

 This can be done with string mixin. You can use compile-time
 evaluation to build a string of code, and the mix it in somewhere, the
 compiler will compile it. It's a dirty thing, but unless D3 gets
 macros, it's the thing to be used. You can use it to build a string of
 asm code too, because it's part of D syntax.

 OTOH, and IHMO it should be avoided where you can, kinda like goto. (For
 one things, forget having usable line numbers in your error messages.) You
 can often get a lot done with static if, tuple foreach, const values feed
 to boilerplate code and the like with just a scattering of one or two line
 mixins stuffed in the middle.
 Bye,
 bearophile


It occurred to me that you could have a normal regex library that builds the
regex at
runtime, debug it, produce a code generator (similar code to re2c), debug that
and then just
wrap it with the mixin at the end. If you run into bugs, you can just do it at
runtime
instead and debug. The nice thing is, if I can get this converted, the lexer
library is just
slightly different. That means you could have the same approach with a lexer
generator! This
would lead to huge interest from the boost.spirit guys...

Regards,

Ben

Jun 19 2010

BCS <none anon.com> writes:

Hello Ben,

This would lead to huge
 interest from the boost.spirit guys...

I don't remember what versions this is (I've done about 3 that I can remember) 
or even if it works, but I wonder if they would have any interest in this:

http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

Note (if that is the version I think it is) that the only mixins in that 
are tiny (mixin("label_"~tag!(id)~":;"); & mixin("goto label_"~tag!(id)~";");) 
or generating a big list invocations of other code.

-- 
... <IXOYE><

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

Hi BCS,
== Quote from BCS (none anon.com)'s article
 Hello Ben,
This would lead to huge
 interest from the boost.spirit guys...

 I don't remember what versions this is (I've done about 3 that I can remember)
 or even if it works, but I wonder if they would have any interest in this:
 http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d
 Note (if that is the version I think it is) that the only mixins in that
 are tiny (mixin("label_"~tag!(id)~":;"); & mixin("goto label_"~tag!(id)~";");)
 or generating a big list invocations of other code.

I can't really speak for the Spirit people, but it's certainly interesting to
me! :-)

Here's some background to make things clearer:

I'm not even currently a Spirit user myself as I've only really needed a
tokeniser at work (they think that's mind-bogglingly sophisticated, never mind
using a beast like Spirit! ;-)).

The only reason they were interested in my lexer generator is that recursive
descent lexing is pretty slow. So for more demanding grammars, a DFA based
lexer is better. The lexer library allows you to create a lexer at runtime
(which doesn't fit in so well with their model of doing everything at compile
time), or you can generate code offline and then just add that to your project.
This is why a compile time DFA lexer would be really interesting to them.

From memory (Joel did a PDF recently, but that is on my works machine) Joel has
been developing Spirit for over ten years. The latest version is pretty
sophisticated and has all kinds of clever stuff for directly parsing data in
structures all inline etc. Needless to say, I find the whole thing pretty mind
boggling. The biggest problem (as far as I can see as an observer) is the
compile times. This is where D could be really interesting overall (Hartmut has
certainly got his eye on it and in fact I'm sure all the Boost people do).

For my part, I'd like to see an LR parser generator developed. I'd be happy
with one that creates the parser at runtime (assuming decent performance), but
if it can generate and compile code efficiently at compile time, so much the
better! :-) I really like the idea of being able to switch the runtime/compile
time behaviour.

When it comes to the code generation part for the DFA regex engine in D, I'd be
happy to talk to you further about the techniques you've employed regarding
emit. I'm still completely new to D, so it'll take me a while! :-)

Cheers,

Ben

Jun 19 2010

div0 <div0 users.sourceforge.net> writes:

On 19/06/2010 20:39, Ben Hanson wrote:
  From memory (Joel did a PDF recently, but that is on my works machine) Joel
has
 been developing Spirit for over ten years. The latest version is pretty
 sophisticated and has all kinds of clever stuff for directly parsing data in
 structures all inline etc. Needless to say, I find the whole thing pretty mind
 boggling. The biggest problem (as far as I can see as an observer) is the
 compile times. This is where D could be really interesting overall (Hartmut has
 certainly got his eye on it and in fact I'm sure all the Boost people do).

I ported most of the classic version of spirit to d a while back.
I've recently written a XML parser using it and it takes nearly 7 whole 
seconds to compile with DMD, which is a vast amount of time compared to 
rest of the stuff I compile. :)

The most trivial spirit parser in c++ takes over 30 seconds on my 
machine, even with everything in the pre compiled header. D is just a 
massive win for productively.

-- 
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from div0 (div0 users.sourceforge.net)'s article
 On 19/06/2010 20:39, Ben Hanson wrote:
  From memory (Joel did a PDF recently, but that is on my works machine) Joel
has
 been developing Spirit for over ten years. The latest version is pretty
 sophisticated and has all kinds of clever stuff for directly parsing data in
 structures all inline etc. Needless to say, I find the whole thing pretty mind
 boggling. The biggest problem (as far as I can see as an observer) is the
 compile times. This is where D could be really interesting overall (Hartmut has
 certainly got his eye on it and in fact I'm sure all the Boost people do).

 I ported most of the classic version of spirit to d a while back.
 I've recently written a XML parser using it and it takes nearly 7 whole
 seconds to compile with DMD, which is a vast amount of time compared to
 rest of the stuff I compile. :)
 The most trivial spirit parser in c++ takes over 30 seconds on my
 machine, even with everything in the pre compiled header. D is just a
 massive win for productively.

Faster compile time code generation is a massive advantage for D.

Hartmut contacted me about a runtime LALR parser generator called Sweet Parser
(http://www.sweetsoftware.co.nz/parser_overview.php). They've contacted the
author
about integrating a version of that into Spirit. A version in D that could run
at
compile time would be cool too...

Bizzare timing or what!

It looks like interesting times lie ahead...

Regards,

Ben

Jun 22 2010

BCS <none anon.com> writes:

Hello Ben,

 The only reason they were interested in my lexer generator is that
 recursive descent lexing is pretty slow. So for more demanding
 grammars, a DFA based lexer is better. The lexer library allows you to
 create a lexer at runtime (which doesn't fit in so well with their
 model of doing everything at compile time), or you can generate code
 offline and then just add that to your project. This is why a compile
 time DFA lexer would be really interesting to them.
 
 The biggest problem (as far as I can
 see as an observer) is the compile times.

I've never got a usable grammer for D to feed into it but I did get one (~200 
productions) that at least compiled: it took 7 (or 17 minutes, I forget) 
and 700MB or ram To compile (P-III 550MHz, >700MB RAM)

 This is where D could be
 really interesting overall (Hartmut has certainly got his eye on it
 and in fact I'm sure all the Boost people do).
 
 For my part, I'd like to see an LR parser generator developed. I'd be
 happy with one that creates the parser at runtime (assuming decent
 performance), but if it can generate and compile code efficiently at
 compile time, so much the better! :-) I really like the idea of being
 able to switch the runtime/compile time behaviour.
 

Given that most LR parsers are table driven, all that would be needed for 
a compile time system would be that the table generator be evaluateable at 
compile time. The result could be stored in a constant. The only tricky bits 
then would be handling the data values on the stack and the part to tack 
in the action rules and that could just be a naming convention:

class MyActions
{
    resultType Action_statement_if(/+ args +/) { /+ code +/ }
    resultType Action_statement_ifThen(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_plus(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_minus(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_chain(/+ args +/) { /+ code +/ }
    // ...


    mixin Parser!(
`statment : "if" "(" expression ")" statment / if 
          | "if" "(" expression ")" statment "then" statment / ifThen 
          | "{" statment* "}" / block
          | expression ";" /expression;
addExpression : .....
`);
}

Parser!(string) would generate a function that just assumes the existence 
of functions of the given names.

 When it comes to the code generation part for the DFA regex engine in
 D, I'd be happy to talk to you further about the techniques you've
 employed regarding emit. I'm still completely new to D, so it'll take
 me a while! :-)
 
 Cheers,
 
 Ben
 

-- 
... <IXOYE><

Jun 19 2010

Ben.Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from BCS (none anon.com)'s article
 Hello Ben,
 The only reason they were interested in my lexer generator is that
 recursive descent lexing is pretty slow. So for more demanding
 grammars, a DFA based lexer is better. The lexer library allows you to
 create a lexer at runtime (which doesn't fit in so well with their
 model of doing everything at compile time), or you can generate code
 offline and then just add that to your project. This is why a compile
 time DFA lexer would be really interesting to them.

 The biggest problem (as far as I can
 see as an observer) is the compile times.

 I've never got a usable grammer for D to feed into it but I did get one (~200
 productions) that at least compiled: it took 7 (or 17 minutes, I forget)
 and 700MB or ram To compile (P-III 550MHz, >700MB RAM)
 This is where D could be
 really interesting overall (Hartmut has certainly got his eye on it
 and in fact I'm sure all the Boost people do).

 For my part, I'd like to see an LR parser generator developed. I'd be
 happy with one that creates the parser at runtime (assuming decent
 performance), but if it can generate and compile code efficiently at
 compile time, so much the better! :-) I really like the idea of being
 able to switch the runtime/compile time behaviour.

 Given that most LR parsers are table driven, all that would be needed for
 a compile time system would be that the table generator be evaluateable at
 compile time. The result could be stored in a constant. The only tricky bits
 then would be handling the data values on the stack and the part to tack
 in the action rules and that could just be a naming convention:
 class MyActions
 {
     resultType Action_statement_if(/+ args +/) { /+ code +/ }
     resultType Action_statement_ifThen(/+ args +/) { /+ code +/ }
     resultType Action_addExpression_plus(/+ args +/) { /+ code +/ }
     resultType Action_addExpression_minus(/+ args +/) { /+ code +/ }
     resultType Action_addExpression_chain(/+ args +/) { /+ code +/ }
     // ...
     mixin Parser!(
 `statment : "if" "(" expression ")" statment / if
           | "if" "(" expression ")" statment "then" statment / ifThen
           | "{" statment* "}" / block
           | expression ";" /expression;
 addExpression : .....
 `);
 }
 Parser!(string) would generate a function that just assumes the existence
 of functions of the given names.

Let's come back to this! For now I will concentrate on the regex stuff and
review
the Sweet Parser for the Spirit guys.

Regards,

Ben

Jun 22 2010

Don <nospam nospam.com> writes:

Ben Hanson wrote:
 == Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

 This can be done with string mixin. You can use compile-time
 evaluation to build a string of code, and the mix it in somewhere, the
 compiler will compile it. It's a dirty thing, but unless D3 gets
 macros, it's the thing to be used. You can use it to build a string of
 asm code too, because it's part of D syntax.

 OTOH, and IHMO it should be avoided where you can, kinda like goto. 


(For
 one things, forget having usable line numbers in your error messages.) 


Whenever that happens, it's a compiler bug.
If you can produce a test case where that happens, please put it in 
Bugzilla.

You
 can often get a lot done with static if, tuple foreach, const values feed
 to boilerplate code and the like with just a scattering of one or two line
 mixins stuffed in the middle.
 Bye,it.
 bearophile


 
 It occurred to me that you could have a normal regex library that builds the
regex at
 runtime, debug it, produce a code generator (similar code to re2c), debug that
and then just
 wrap it with the mixin at the end. If you run into bugs, you can just do it at
runtime
 instead and debug. The nice thing is, if I can get this converted, the lexer
library is just
 slightly different. That means you could have the same approach with a lexer
generator! This
 would lead to huge interest from the boost.spirit guys...

That is exactly the way I recommend to develop CTFE code. Get the code 
completely working, run unit tests on every component, and then CTFE it.

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from Don (nospam nospam.com)'s article
 Ben Hanson wrote:
 == Quote from BCS (none anon.com)'s article
 Hello bearophile,
 Ben Hanson:

 Can you output source code at compilation time and have D compile it,
 or do you use a load of recursion like C++ TMP?

 This can be done with string mixin. You can use compile-time
 evaluation to build a string of code, and the mix it in somewhere, the
 compiler will compile it. It's a dirty thing, but unless D3 gets
 macros, it's the thing to be used. You can use it to build a string of
 asm code too, because it's part of D syntax.

 OTOH, and IHMO it should be avoided where you can, kinda like goto.


 (For
 one things, forget having usable line numbers in your error messages.)


 Whenever that happens, it's a compiler bug.
 If you can produce a test case where that happens, please put it in
 Bugzilla.
 You
 can often get a lot done with static if, tuple foreach, const values feed
 to boilerplate code and the like with just a scattering of one or two line
 mixins stuffed in the middle.
 Bye,it.
 bearophile


 It occurred to me that you could have a normal regex library that builds


the regex at
 runtime, debug it, produce a code generator (similar code to re2c), debug


that and then just
 wrap it with the mixin at the end. If you run into bugs, you can just do it


at runtime
 instead and debug. The nice thing is, if I can get this converted, the


lexer library is just
 slightly different. That means you could have the same approach with a


lexer generator! This
 would lead to huge interest from the boost.spirit guys...

 That is exactly the way I recommend to develop CTFE code. Get the code
 completely working, run unit tests on every component, and then CTFE it.

Good to know Don, thanks!

Jun 19 2010

BCS <none anon.com> writes:

Hello Don,

 == Quote from BCS (none anon.com)'s article
 
 OTOH, and IHMO it should be avoided where you can, kinda like goto.
 (For one things, forget having usable line numbers in your error
 messages.)
 


 Whenever that happens, it's a compiler bug.
 If you can produce a test case where that happens, please put it in
 Bugzilla.

The issue isn't that the error has no line number but that the line number 
is the line of the mixin or the line within the mixin (I forget which) to 
be useful you need both AND the full text of the mixin.

Yes there are ways around this but my original point is that in my experience, 
I can get just about as much done and for about the same effort with very 
little code being mixed in. My approach to this kind of problem has typically 
been to use CTFE to process input, template and or mixins to generate an 
AST (or whatever the problem demands) represented as a type and normal
template, 
static ifs and tuple foreach to do the code generation with the occasional 
line of code mixed in here and there. The result of this is that most of 
the code is transparently visible as code rather than as string literals.

-- 
... <IXOYE><

Jun 19 2010

"Nick Sabalausky" <a a.a> writes:

"BCS" <none anon.com> wrote in message 
news:a6268ff154bd8ccddee0380cd08 news.digitalmars.com...
 Hello Don,

 == Quote from BCS (none anon.com)'s article

 OTOH, and IHMO it should be avoided where you can, kinda like goto.
 (For one things, forget having usable line numbers in your error
 messages.)


 Whenever that happens, it's a compiler bug.
 If you can produce a test case where that happens, please put it in
 Bugzilla.

 The issue isn't that the error has no line number but that the line number 
 is the line of the mixin or the line within the mixin (I forget which) to 
 be useful you need both AND the full text of the mixin.


http://d.puremagic.com/issues/show_bug.cgi?id=2887

Jun 22 2010

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:

 I'm thinking I could look at getting as far as the regex syntax tree and
 then we could discuss DFA representation (it sounds like you want to
 generate code directly - how do you do that?! This is why I mentioned
 the .NET bytecode stuff - I imagine there is some overlap there, but I
 don't know enough to be sure. Can you output source code at compilation
 time and have D compile it, or do you use a load of recursion like C++
 TMP?

D's metaprogramming facilities are insanely powerful.  You can take any 
string containing valid source code (that is known at compile time, of 
course) and "paste" it into your code with the mixin() expression:

  mixin("int i;");
  i = 2;

For the regex parser, I suppose you'd do something like:

  string makeParserCode(string regex) { ... }

  struct sregex(string regex)
  {
      mixin(makeParserCode(regex));
  }


-Lars

Jun 18 2010

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Fri, 18 Jun 2010 12:09:54 +0000, Lars T. Kyllingstad wrote:

 On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:
 
 I'm thinking I could look at getting as far as the regex syntax tree
 and then we could discuss DFA representation (it sounds like you want
 to generate code directly - how do you do that?! This is why I
 mentioned the .NET bytecode stuff - I imagine there is some overlap
 there, but I don't know enough to be sure. Can you output source code
 at compilation time and have D compile it, or do you use a load of
 recursion like C++ TMP?

 
 D's metaprogramming facilities are insanely powerful.  You can take any
 string containing valid source code (that is known at compile time, of
 course) and "paste" it into your code with the mixin() expression:
 
   mixin("int i;");
   i = 2;
 
 For the regex parser, I suppose you'd do something like:
 
   string makeParserCode(string regex) { ... }
 
   struct sregex(string regex)
   {
       mixin(makeParserCode(regex));
   }

I should mention that the makeParserCode() function in the example must 
of course be evaluatable at compile time, and there are some restrictions 
on such functions.  Lately, many of those restrictions have been 
overcome, though, and I'm not sure what's left.

Does anyone know what the current limitations on CTFE are?

-Lars

Jun 18 2010

bearophile <bearophileHUGS lycos.com> writes:

Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

One less if this works and gets accepted:
http://d.puremagic.com/issues/show_bug.cgi?id=3050

Bye,
bearophile

Jun 18 2010

Don <nospam nospam.com> writes:

bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

 
 One less if this works and gets accepted:
 http://d.puremagic.com/issues/show_bug.cgi?id=3050
 
 Bye,
 bearophile

It's a great piece of work, but the rest of the CTFE mechanics are not 
ready for that patch.
The one which is blocking everything in CTFE is bug 1330. At present, 
CTFE uses copy-on-write, and that doesn't work with reference types.
Fixing that is going to require large changes throughout the compiler, 
so I've been holding off on that one. Meanwhile I've been trying to fix 
all the non-reference bugs in CTFE and assembling a comprehensive test 
suite.

So to answer the original question -- the main things that don't work in 
CTFE that are allowable in safe D are classes, exceptions, and built-in 
features which are implemented in the runtime rather than in the 
compiler (including array operations).

Jun 18 2010

"Nick Sabalausky" <a a.a> writes:

"Don" <nospam nospam.com> wrote in message 
news:hvgasq$2k3a$1 digitalmars.com...
 bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

 One less if this works and gets accepted:
 http://d.puremagic.com/issues/show_bug.cgi?id=3050

 Bye,
 bearophile

 It's a great piece of work, but the rest of the CTFE mechanics are not 
 ready for that patch.
 The one which is blocking everything in CTFE is bug 1330.

IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, so if 
it's fixed, that would break any code that does that detection, so another 
CTFE-detection mechanism may be needed before a 1330 fix.

Jun 18 2010

Don <nospam nospam.com> writes:

Nick Sabalausky wrote:
 "Don" <nospam nospam.com> wrote in message 
 news:hvgasq$2k3a$1 digitalmars.com...
 bearophile wrote:
 Lars T. Kyllingstad:
 Does anyone know what the current limitations on CTFE are?

 One less if this works and gets accepted:
 http://d.puremagic.com/issues/show_bug.cgi?id=3050

 Bye,
 bearophile

 It's a great piece of work, but the rest of the CTFE mechanics are not 
 ready for that patch.
 The one which is blocking everything in CTFE is bug 1330.

 
 IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, so if 
 it's fixed, that would break any code that does that detection, so another 
 CTFE-detection mechanism may be needed before a 1330 fix.
 
 

Not any more. There's __ctfe now.

Jun 18 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/18/2010 06:36 AM, Ben Hanson wrote:
 Reading your original post again and thinking about this a bit more...

 If someone can help me get up to speed with TMP in D, I could probably put
 together a proof of concept pretty quickly. Aside from the D syntax, it is all
 in the design (which is why I would like to discuss it in more detail). For
 anyone who is interested, here is how I go about DFA production currently:

 - Bog standard tokenisation of regex tokens
 - Normalise each regex charset and enumerate in a map
 - build a regex syntax tree with leaf nodes storing the enumerated charset id
 - Intersect all the charsets and build a vector of equivalent charsets
 - Perform the followset to DFA transform as described in the Dragon book

 For my lexer generator, I build a lookup table for each character to an
 equivalence class. For my wild card matcher, I leave the representation as
 characters and do a linear lookup of chars to 'next' state. The lexer
 equivalence class method (a la flex) is more efficient as you just have two
 lookups per character with no linear searches.

 I'm thinking I could look at getting as far as the regex syntax tree and then
 we could discuss DFA representation (it sounds like you want to generate code
 directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
 - I imagine there is some overlap there, but I don't know enough to be sure.
 Can you output source code at compilation time and have D compile it, or do you
 use a load of recursion like C++ TMP?

 And I forgot to mention - boost.spirit uses my lexertl library.

 Regards,

 Ben

I've always understood that dfas have an exponential lower bound 
relative to .. something .. but I don't believe I have ever actually 
used a dfa generator (or written one, sadly enough). What has your 
experience been on dfa compile time?


I suppose RE2's syntax is a good starting point

http://code.google.com/p/re2/wiki/Syntax

I wonder about those unicode character classes, though. Looks like an 
imminent head-on collision with dmd's issues regarding memory management 
at compile time.

Jun 18 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

Hi Ellery,

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 I've always understood that dfas have an exponential lower bound
 relative to .. something .. but I don't believe I have ever actually
 used a dfa generator (or written one, sadly enough). What has your
 experience been on dfa compile time?
 I suppose RE2's syntax is a good starting point
 http://code.google.com/p/re2/wiki/Syntax
 I wonder about those unicode character classes, though. Looks like an
 imminent head-on collision with dmd's issues regarding memory management
 at compile time.

I also wondered about this and originally just thought, well, you can solve any
performance problems with decent optimal code! :-) So this train of thought was
years ago now and in reality the only way I can really see to bring DFA
compilation to its knees is to use massive repeats.

DFA also implies a more limited syntax (Andrei hinted at this in his original
post) and there are certain things you just don't need to worry about any more
such as 'possessive' quantifiers. Traditionally DFA regex has been used for
lexer generators as they generally don't need some of the more outlandish
'extra' features, so it's possible that a straight up regex engine would have
more demands put on it, but there again, a lexer generator is dealing with a
list of regexes, not just one! The last time I timed it, lexertl could compile
all of the C++ token rules in well under 0.5 seconds.

There are such things as tagged dfas, which I would love to know more about.
The Tango regex library uses them for captures I believe. If anyone can provide
more info on how TDFAs really work that would be fantastic. All the literature
I've read didn't exactly enlighten me (including Ville Laurikari's papers).

The full re2 syntax is way more than I could provide. See http://
www.benhanson.net/lexertl.html for a table of regex syntax I know how to
support.

On the subject of Unicode chars, this is an interesting topic. The Tango regex
library gets around the storage requirements by having a list of ranges. The
approach lexertl takes is to maintain a sorted string of chars. The trick is,
if the string has more than 0.5 of the max possible number of chars in it, you
negate it. This is actually part of the normalisation process and leads to
quicker intersections later on. When it comes to the final machine, lexertl
slices 'wide chars' into bytes for lookup now, keeping the first stage lookup
to 256 entries (I can elaborate if anyone wants me too). Of course if you are
generating code (as seems to be desired) none of this matters very much.

Don't forget that if you intersect all the charsets *once* after building your
syntax tree, then when you do transition intersections, you are only
intersecting sets of indexes to charset equivalence classes which is really
quick. For a quick example:

- Say you have two char classes, one is [a-z] and the other one is [a].
- You intersect these classes once giving [b-z] and [a]
- So if originally [a-z] = 0 and [a] = 1 (as I said, charsets are enumerated as
they are scanned), then now the equivalence index sets are [a-z] = {0, 1} and
[a] = {1} due to [b-z] = 0 and [a] = 1.

When you generate the DFA you are then intersecting {0, 1} and {1} rather than
the actual charsets. The sets are obviously really small! :-)

Hope that was useful, or at least interesting!

Regards,

Ben

Jun 18 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ellery Newcomer wrote:
 On 06/18/2010 06:36 AM, Ben Hanson wrote:
 Reading your original post again and thinking about this a bit more...

 If someone can help me get up to speed with TMP in D, I could probably 
 put
 together a proof of concept pretty quickly. Aside from the D syntax, 
 it is all
 in the design (which is why I would like to discuss it in more 
 detail). For
 anyone who is interested, here is how I go about DFA production 
 currently:

 - Bog standard tokenisation of regex tokens
 - Normalise each regex charset and enumerate in a map
 - build a regex syntax tree with leaf nodes storing the enumerated 
 charset id
 - Intersect all the charsets and build a vector of equivalent charsets
 - Perform the followset to DFA transform as described in the Dragon book

 For my lexer generator, I build a lookup table for each character to an
 equivalence class. For my wild card matcher, I leave the 
 representation as
 characters and do a linear lookup of chars to 'next' state. The lexer
 equivalence class method (a la flex) is more efficient as you just 
 have two
 lookups per character with no linear searches.

 I'm thinking I could look at getting as far as the regex syntax tree 
 and then
 we could discuss DFA representation (it sounds like you want to 
 generate code
 directly - how do you do that?! This is why I mentioned the .NET 
 bytecode stuff
 - I imagine there is some overlap there, but I don't know enough to be 
 sure.
 Can you output source code at compilation time and have D compile it, 
 or do you
 use a load of recursion like C++ TMP?

 And I forgot to mention - boost.spirit uses my lexertl library.

 Regards,

 Ben

 
 I've always understood that dfas have an exponential lower bound 
 relative to .. something .. but I don't believe I have ever actually 
 used a dfa generator (or written one, sadly enough). What has your 
 experience been on dfa compile time?

I think only backtracking engines (not DFAs) have the exponential run 
time problem. Regular expression grammar is itself context free (so it's 
parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, 
and the generated DFA makes one transition per input character.


Andrei

Jun 18 2010

sybrandy <sybrandy gmail.com> writes:

 I think only backtracking engines (not DFAs) have the exponential run
 time problem. Regular expression grammar is itself context free (so it's
 parsable efficiently), the NFA and NFA->DFA algorithms are polynomial,
 and the generated DFA makes one transition per input character.


 Andrei

I don't know how practical it is, but I had a thought to allow some of 
these nasty regexes to run better: use threads.  In theory, and I have 
not done any work to test/prove this so it's very theoretical, you can 
have multiple threads running in parallel each testing a potential path. 
  Granted, there's always the issue of spawning/managing threads and 
potentially overloading the system by spawning too many, however it may 
be a viable alternative if you really HAVE to execute a particularly 
nasty regex.

Anyway, it's just a thought.  I'd hate to lose the benefits of a NFA if 
we're afraid of the potential performance hit if a user writes a bad regex.

Casey

Jun 18 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/18/2010 05:08 PM, sybrandy wrote:
 I think only backtracking engines (not DFAs) have the exponential run
 time problem. Regular expression grammar is itself context free (so it's
 parsable efficiently), the NFA and NFA->DFA algorithms are polynomial,
 and the generated DFA makes one transition per input character.


 Andrei


The NFA->DFA algorithms are what I was referring to. My copy of the 
dragon book suggests that DFAs usually have O(n^3) initialization time 
but O(n^2 * 2^n) worst case relative to the number of states. I'd love 
to be shown wrong, though.

 I don't know how practical it is, but I had a thought to allow some of
 these nasty regexes to run better: use threads. In theory, and I have
 not done any work to test/prove this so it's very theoretical, you can
 have multiple threads running in parallel each testing a potential path.
 Granted, there's always the issue of spawning/managing threads and
 potentially overloading the system by spawning too many, however it may
 be a viable alternative if you really HAVE to execute a particularly
 nasty regex.

Isn't what you're describing an NFA?

 Anyway, it's just a thought. I'd hate to lose the benefits of a NFA if
 we're afraid of the potential performance hit if a user writes a bad regex.

 Casey

You don't get a performance hit with DFAs on knarly regexen (actually, 
they would be faster than NFAs), just regexen that take forever to 
compile. I wonder if it would be possible to mechanically detect problem 
cases for DFA compilation and on detection switch over to an NFA engine.

Jun 18 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 The NFA->DFA algorithms are what I was referring to. My copy of the
 dragon book suggests that DFAs usually have O(n^3) initialization time
 but O(n^2 * 2^n) worst case relative to the number of states. I'd love
 to be shown wrong, though.

I can't prove what the comlexity is for DFA compilation, but instead, I
challenge anyone to show slow compilation times with any DFA compatible regex.
As I don't have some D code yet, you could try the current lexertl code (just
add a single rule). Compile as Release if using Visual C++, as the iterator
debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
exist with a D codebase).

As I say, the only obvious area for me that hammers compilation times is huge
repeats. If you repeat an expression 50,0000 times or something, then yes, you
will get a huge machine and it will take time to construct! :-)

There are papers on tackling this with counting repeats, even for DFAs. If this
area interests anyone, please let me know.

 You don't get a performance hit with DFAs on knarly regexen (actually,
 they would be faster than NFAs), just regexen that take forever to
 compile. I wonder if it would be possible to mechanically detect problem
 cases for DFA compilation and on detection switch over to an NFA engine.

DFA compilation is particularly good for simplifying verbose regexes, even
wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past
and as far as feaures go, this is probably the ideal way to go. Russ Cox has
some clever techniques too.

Don't forget that grep is (traditionally) implemented with a DFA engine.

Of course, the real irony is that Henry Spencer wrote a much better regex
engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than
his original NFA backtracker and yet everyone holds up the backtracking
algorithm as the way to do things...

That's inertia for you.

Regards,

Ben

Jun 19 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

et unicode?

 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.

it does

 DFA compilation is particularly good for simplifying verbose regexes, even
 wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past
 and as far as feaures go, this is probably the ideal way to go. Russ Cox has
 some clever techniques too.

interesting

 Don't forget that grep is (traditionally) implemented with a DFA engine.

I didn't know that

 Of course, the real irony is that Henry Spencer wrote a much better regex
 engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than
 his original NFA backtracker and yet everyone holds up the backtracking
 algorithm as the way to do things...

 That's inertia for you.

 Regards,

 Ben

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible


regex.
 As I don't have some D code yet, you could try the current lexertl code


(just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is


huge
 repeats. If you repeat an expression 50,0000 times or something, then yes,


you
 will get a huge machine and it will take time to construct! :-)

 et unicode?

The latest lexertl takes the following strategy:

- If you're building a lexer using 8 bit chars, then no problem. Two phase
lookup, first lookup map all 256 chars to equivalence class index, then that
index is used to locate the correct column in the DFA.
- If you're using 'wide characters', then still have a 256 entry lookup table
and slice each 'wide character' into 8 bit chunks and use those chunks as
charsets.
- If you want to generate code, you can generate tables using a first phase
lookup of MAX_CHARS of the 'wide character'. You can then convert this back to
a character based state machine and then you can just deal with characters
again instead of equivalence classes.

One of the enhancements I want to make, is the ability to go straight to a char
based state machine to avoid the huge first phase lookup.

Now, as far a huge Unicode ranges goes, this shouldn't be a problem when you're
generating code. You can use an if statement with an upper and lower bound in
those cases and only use a switch statement when the range is disparate etc.

If the technique of storing all those chars in a string ends up being too
expensive on RAM (even with the auto negation trick), then there's always the
method used by Tango Regex, which is a list of ranges (this probably works
quite well with garbage collection - remember I designed for C++).

Let me know if any of that isn't clear. Basically it's not a huge problem, you
can use various techniques to compress the data sensibly and efficiently.
 There are papers on tackling this with counting repeats, even for DFAs. If


this
 area interests anyone, please let me know.

 it does

I'll have to dig out any references I have! I don't have them to hand right now.
 DFA compilation is particularly good for simplifying verbose regexes, even
 wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the


past
 and as far as feaures go, this is probably the ideal way to go. Russ Cox has
 some clever techniques too.

 interesting

Russ (from memory) uses competing 'threads' in his matcher (http://swtch.com/
~rsc/regexp/regexp2.html) http://swtch.com/~rsc/regexp/regexp3.html discusses
RE2.
 Don't forget that grep is (traditionally) implemented with a DFA engine.

 I didn't know that

A lot of automata theory appears to have been forgotten, as noted by Russ Cox
(amongst others). Lua has an interesting pattern matcher, as it goes more in
the formal grammar direction to (for example) match balanced parenthesis etc.
Another thing I'd like to do (no doubt time will not allow!) is to extend
Notepad RE (http://www.codeproject.com/KB/recipes/notepadre.aspx) to have the
LUA pattern matcher as an option.
 Of course, the real irony is that Henry Spencer wrote a much better regex
 engine for TCL (a DFA/NFA hybrid as I understand it) which is much better


than
 his original NFA backtracker and yet everyone holds up the backtracking
 algorithm as the way to do things...

 That's inertia for you.

 Regards,

 Ben


Regards,

Ben

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

I forgot to mention... Ideally I'd like some kind of abstract interface to the
generated DFA (this is in data mode before code generation etc.) so that
different implementations can be hidden under a common interface.

This is another example of why it would be nice to discuss the implementation
with others! :-) Generally people just don't care about this stuff. I realise
that various folk would like to go one better and not even assume a DFA behind
the interface, but that is currently out of my scope..!

Regards,

Ben

Jun 19 2010

"Nick Sabalausky" <a a.a> writes:

"Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message 
news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis 
 etc.

That's cool. That's something I've often wanted in regexes (It would be 
great for defining lever tokens in a grammar-definition language), and 
though I'm no FSA or regex expert, it always seemed like something entirely 
doable without having to resort to a full-blown LL or LR parser.

Jun 19 2010

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:hvjmo1$2mnu$1 digitalmars.com...
 "Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message 
 news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis 
 etc.

 That's cool. That's something I've often wanted in regexes (It would be 
 great for defining lever tokens in a grammar-definition language), and 
 though I'm no FSA or regex expert, it always seemed like something 
 entirely doable without having to resort to a full-blown LL or LR parser.

That's "...defining lexer tokens...", not "...defining lever tokens...", of 
course.

Jun 19 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from Nick Sabalausky (a a.a)'s article
 "Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message
 news:hvj9i1$1q7b$1 digitalmars.com...
 Lua has an interesting pattern matcher, as it goes more in
 the formal grammar direction to (for example) match balanced parenthesis
 etc.

 That's cool. That's something I've often wanted in regexes (It would be
 great for defining lever tokens in a grammar-definition language), and
 though I'm no FSA or regex expert, it always seemed like something entirely
 doable without having to resort to a full-blown LL or LR parser.

Yeah, it does seem obvious that there'd be some middle ground between regexes -
certainly the author of the LUA pattern matching thought so! :-) The topic could
definitely do with more attention.

There is limited back-tracking in the LUA pattern matching, which I don't like,
but it appears more efficient than back-tracking NFA.  Apparantly it has some
solid theory behind it too, which is good. Sadly just not enough time to persue
all these possibilities...

Regards,

Ben

Jun 22 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

 et unicode?
 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.

 it does

Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.pdf

Jun 21 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/21/2010 04:07 AM, Ben Hanson wrote:
 == Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
 On 06/19/2010 03:35 AM, Ben Hanson wrote:
 I can't prove what the comlexity is for DFA compilation, but instead, I
 challenge anyone to show slow compilation times with any DFA compatible regex.
 As I don't have some D code yet, you could try the current lexertl code (just
 add a single rule). Compile as Release if using Visual C++, as the iterator
 debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
 exist with a D codebase).

 As I say, the only obvious area for me that hammers compilation times is huge
 repeats. If you repeat an expression 50,0000 times or something, then yes, you
 will get a huge machine and it will take time to construct! :-)

 et unicode?
 There are papers on tackling this with counting repeats, even for DFAs. If this
 area interests anyone, please let me know.

 it does

 Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.pdf

thanks

Jun 21 2010

Jacob Carlborg <doob me.com> writes:

On 2010-06-18 06:44, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html


 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

The is already a compile time regular expression engine available in the 
DDL project at dsource, it's in the meta package: 
http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it's 
is still usable.

-- 
/Jacob Carlborg

Jun 18 2010

Don <nospam nospam.com> writes:

Jacob Carlborg wrote:
 On 2010-06-18 06:44, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-appr
ach-to-regular.html 



 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

 
 The is already a compile time regular expression engine available in the 
 DDL project at dsource, it's in the meta package: 
 http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it's 
 is still usable.
 

I wrote that code before CTFE was available. It helped to design D's 
metaprogramming system, but the code itself is thoroughly obsolete now. 
It is SO much easier to write that stuff with CTFE.
Incidentally, now that we have the no-brackets form of template 
instantiation, there's much more freedom for compile-time regex syntax.

auto k = regex!"[A-Za-z][0-9A-Za-z]*"(s1);

Jun 18 2010

Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:

OK, I've just ordered The D Programming Language from Amazon. Hopefully this
will
enable me to cut out a certain amount of total n00b questions such as "How do
strings work?".

Regards,

Ben

Jun 18 2010

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:hvetik$2k7j$1 digitalmars.com...
 There are currently two regexen in the standard library. The older one, 
 std.regexp, is time-tested but only works with UTF8 and has a clunkier 
 API. The newer one, std.regex, is newer and isolates the engine from the 
 matches (and therefore can reuse and cache engines easier), and supports 
 all character widths. But it's less tested and doesn't have that great of 
 an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important improvement 
 is not in the interface - it's in the engine. The current engine is 
 adequate but nothing to write home about, and for simple regexen is 
 markedly slower than equivalent hand-written code (e.g. matching 
 whitespace). One great opportunity would be for D to leverage its uncanny 
 compile-time evaluation abilities and offer a regex that parses the 
 pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with captures 
 and backreferences etc., but it would have guaranteed performance (e.g. it 
 would be an automaton instead of a backtracking engine) and would be darn 
 fast because it would generate custom code for each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

 If we get as far as implementing what RE2 can do with compile-time 
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or not), 
 I highly encourage you to do so.

This would be a good thing for it to pay attention to, if it doesn't 
already:
http://swtch.com/~rsc/regexp/regexp1.html

Jun 18 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/17/2010 11:44 PM, Andrei Alexandrescu wrote:
 There are currently two regexen in the standard library. The older one,
 std.regexp, is time-tested but only works with UTF8 and has a clunkier
 API. The newer one, std.regex, is newer and isolates the engine from the
 matches (and therefore can reuse and cache engines easier), and supports
 all character widths. But it's less tested and doesn't have that great
 of an interface because it pretty much inherits the existing one.

 I wish to improve regex handling in Phobos. The most important
 improvement is not in the interface - it's in the engine. The current
 engine is adequate but nothing to write home about, and for simple
 regexen is markedly slower than equivalent hand-written code (e.g.
 matching whitespace). One great opportunity would be for D to leverage
 its uncanny compile-time evaluation abilities and offer a regex that
 parses the pattern during compilation:

 foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

 Such a static regex could be simpler than a full-blown regex with
 captures and backreferences etc., but it would have guaranteed
 performance (e.g. it would be an automaton instead of a backtracking
 engine) and would be darn fast because it would generate custom code for
 each regex pattern.

 See related work:

 http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html


 If we get as far as implementing what RE2 can do with compile-time
 evaluation, people will definitely notice.

 If there's anyone who'd want to tackle such a project (for Phobos or
 not), I highly encourage you to do so.


 Andrei

Generally I think D's CT capabilities have a way to go yet before this 
would be worth tackling. E.g. how do you build a parse tree if pointers 
and classes aren't allowed in CT code?

But, predicated on the assumption that it will happen relatively soon..

What do we want our regex language to look like?

this: http://digitalmars.com/ctg/regular.html

is okay, but I like the idea of ascii and unicode character classes as 
per RE2's syntax, and I don't like the idea of an overloaded \n.

(I assume we're talking about strings as bidirectional unicode ranges)

and I would like to see the language grammar explicitly written out 
somewhere.

and I like the idea of capture groups. Thinking of something along the 
lines of

import std.regex;

int i;
double d;

ctcapture!r"(?<i>\d+): (?<d>\d+\.\d+)"( "10: 100.5"); //this will 
probably have to be inside a mixin statement
assert(i == 10);
assert(d == 100.5); //grumph, floating point..


for that matter, why not some predefined subexpressions corresponding to 
D's integer literals, et al? And available inside the regex?

ctcapture!r"(?<i>{:IntegerLiteral:}): (?<d>{:FloatLiteral:})" // Yeah, I 
know. You come up with a better syntax
   ("10: 0xEA.A1p-2");


Okay, I'm done dreaming. I don't know how much of the above can be done 
within the constraints of an FA. How much sense does regex over unicode 
make? Do others have ideas for features for regex?

Jun 19 2010

BCS <none anon.com> writes:

Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before this
 would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

You use the type system. I can say from experience that (memory issues aside) 
it works. You can even do duck typing "Template Foo works on anything that 
defines a Bar and a Baz." 

-- 
... <IXOYE><

Jun 19 2010

"Nick Sabalausky" <a a.a> writes:

"BCS" <none anon.com> wrote in message 
news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before this
 would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

 You use the type system. I can say from experience that (memory issues 
 aside) it works.

Trivial example?

Jun 19 2010

BCS <none anon.com> writes:

Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 
 Hello Ellery,
 
 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?
 

 You use the type system. I can say from experience that (memory
 issues aside) it works.
 

 Trivial example?
 

Building an arbitrary tree:

int find(char[] data)
{
  foreach(int j, char c; data)
    switch(c)
    {
    default: break;
    case '[', ']': return j;
    }
    return data.length;
}
int match(char[] data)
{
  int i = 0;
  foreach(int j, char c; data)
    switch(c)
    {
    default: break;
    case '[': i++; break;
    case ']': if(--i == 0) return j;
    }
    return data.length;
}
template Tp(T...){ alias T Tp; }
struct Bar(T...) { }
template Foo(char[] data)
{
  static if(data.length == 0)
    alias Tp!() Foo;
  else static if(data[0] == '[')
    alias Tp!(Bar!(Foo!(data[1..match(data)])), Foo!(data[match(data)+1..$])) 
Foo;
  else
    alias Tp!(data[0..find(data)], Foo!(data[find(data)..$])) Foo;
}

import std.stdio;

void main()
{
   writef("%s\n", Foo!("So [she was [considering ][in her] own[ mind [(as 
well] as] she could, ]for the").stringof);
}


Output:
tuple("So ",(Bar!("she was ",Bar!("considering ") ,Bar!("in her") ,"
own",Bar!(" 
mind ",Bar!("(as well") ," as") ," she could, ") ),"for the")

By placing things under different aliases and whatnot, named members can 
be done.

Also, if the lisp like constructs puts you off, the construction of the tree 
can be done via more or less the same parsing code and some fairly trivial 
ctfe+mixin

-- 
... <IXOYE><

Jun 19 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

 You use the type system. I can say from experience that (memory
 issues aside) it works.

 Trivial example?

 Building an arbitrary tree:

Okay, you use structs without indirection. I guess that works. How much 
copying does that entail when you e.g. pass your tree in a function call?

Jun 19 2010

Don <nospam nospam.com> writes:

Ellery Newcomer wrote:
 On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

 You use the type system. I can say from experience that (memory
 issues aside) it works.

 Trivial example?

 Building an arbitrary tree:

 
 Okay, you use structs without indirection. I guess that works. How much 
 copying does that entail when you e.g. pass your tree in a function call?

Currently CTFE uses copy-on-write. If you're just passing the tree, not 
modifying it, there's no copying. If you modify the tree, there's a 
horrific amount of copying -- this is the major CTFE bug.

Jun 20 2010

BCS <none anon.com> writes:

Hello Ellery,

 On 06/19/2010 10:55 PM, BCS wrote:
 
 Hello Nick,
 
 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...
 Hello Ellery,
 
 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree
 if pointers and classes aren't allowed in CT code?
 

 You use the type system. I can say from experience that (memory
 issues aside) it works.
 

 Trivial example?
 

 Building an arbitrary tree:
 

 Okay, you use structs without indirection. I guess that works. How
 much copying does that entail when you e.g. pass your tree in a
 function call?
 

At compile time as a template parameter? I think it gets passed around as 
a reference.

At runtime? That would be pointless as it has no runtime members.

-- 
... <IXOYE><

Jun 20 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 06/19/2010 11:27 PM, Ellery Newcomer wrote:
 On 06/19/2010 10:55 PM, BCS wrote:
 Hello Nick,

 "BCS" <none anon.com> wrote in message
 news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...

 Hello Ellery,

 Generally I think D's CT capabilities have a way to go yet before
 this would be worth tackling. E.g. how do you build a parse tree if
 pointers and classes aren't allowed in CT code?

 You use the type system. I can say from experience that (memory
 issues aside) it works.

 Trivial example?

 Building an arbitrary tree:

 Okay, you use structs without indirection. I guess that works. How much
 copying does that entail when you e.g. pass your tree in a function call?

Nevermind, I see what you're doing now.

Jun 21 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Improving std.regex(p)