www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Lexer and parser generators using CTFE

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I'm starting a new thread on this because I think the matter is of 
strategic importance.

We all felt for a long time that there's a lot of potential in CTFE, and 
potential applications have been discussed more than a few times, 
ranging from formatting strings parsed to DSLs and parser generators.

Such feats are now approaching fruition because a number of factors 
converge:

* Dmitry Olshansky's regex library (now in Phobos) generates efficient D 
code straight from regexen.

* The scope and quality of CTFE has improved enormously, making more 
advanced uses possible and even relatively easy (thanks Don!)

* Hisayuki Mima implemented a parser generator in only 3000 lines of 
code (sadly, no comments or documentation yet :o))

* With the occasion of that announcement we also find out Philippe 
Sigaud has already a competing design and implementation of a parser 
generator.

This is the kind of stuff I've had an eye on for the longest time. I'm 
saying it's of strategic importance because CTFE technology, though not 
new and already available with some languages, has unique powers when 
combined with other features of D. With CTFE we get to do things that 
are quite literally impossible to do in other languages.

We need to have a easy-to-use, complete, seamless, and efficient 
lexer-parser generator combo in Phobos, pronto. The lexer itself could 
use a character-level PEG or a classic automaton, and emit tokens for 
consumption by a parser generator. The two should work in perfect tandem 
(no need for glue code). At the end of the day, defining a complete 
lexer+parser combo for a language should be just a few lines longer than 
the textual representation of the grammar itself.

What do you all think? Let's get this project off the ground!


Thanks,

Andrei
Feb 27 2012
next sibling parent reply "CTFE-4-the-win" <CTFE 4the.win> writes:
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
wrote:
 I'm starting a new thread on this because I think the matter is 
 of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs and 
 parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making 
 more advanced uses possible and even relatively easy (thanks 
 Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Definitely, I applaud this initiative! I've long been of the opinion that CTFE parsing is D's killer-feature, which would allow me to "sneak" D into a "nameless above average size company". ;)
Feb 28 2012
next sibling parent bls <bizprac orange.fr> writes:
On 02/28/2012 12:36 AM, CTFE-4-the-win wrote:
 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.
Are you aware of Philippe Sigaud's PEG work (may become part of his template book) Quote Philippe .. I recently wrote a parsing expression grammar module in D, also to create grammars and parsers at compile-time and parse inputs at CT. (PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar) Usage is like this: mixin Grammar( "JSON <- '{' ( Pair ( ',' Pair )* )? '}'" , "Pair <- String ':' Value" , "Value <- String / Number / JSON / Array / True / False / Null" , `True <- "true"` (..., rest of JSON grammar) ); enum result = JSON.parse( `{ "Hello":42, "World!": { "a":[0,1,2,3] } }`); End Quote No that bad :) , I think. Well, I am still a fan of EBNF based parser generators (recursive decent) but that's an other story. If I recall correctly BCS has created something working. ( BNF , so limited)
Feb 28 2012
prev sibling parent reply "Christopher Bergqvist" <spambox0 digitalpoetry.se> writes:
On Tuesday, 28 February 2012 at 08:36:13 UTC, CTFE-4-the-win 
wrote:
 On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei 
 Alexandrescu
 wrote:
 I'm starting a new thread on this because I think the matter 
 is of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs 
 and parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, 
 making more advanced uses possible and even relatively easy 
 (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Definitely, I applaud this initiative! I've long been of the opinion that CTFE parsing is D's killer-feature, which would allow me to "sneak" D into a "nameless above average size company". ;)
I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time. A run-time generator would benefit from not having to execute within the limited CTFE environment and would always be on-par in that respect. A compile time generator would internalize the generation and compilation of the result (with possible glue-code), simplifying the build process somewhat. What am I failing to pick up on?
Feb 28 2012
next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is
 impressive. However, I fail to see a killer-feature in
 generating a lexer-parser generator at compile-time instead of
 run-time.
 
 A run-time generator would benefit from not having to execute
 within the limited CTFE environment and would always be on-par in
 that respect. A compile time generator would internalize the
 generation and compilation of the result (with possible
 glue-code), simplifying the build process somewhat.
 
 What am I failing to pick up on?
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained. - Jonathan M Davis
Feb 28 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Jonathan M Davis wrote:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is
 impressive. However, I fail to see a killer-feature in
 generating a lexer-parser generator at compile-time instead of
 run-time.

 A run-time generator would benefit from not having to execute
 within the limited CTFE environment and would always be on-par in
 that respect. A compile time generator would internalize the
 generation and compilation of the result (with possible
 glue-code), simplifying the build process somewhat.

 What am I failing to pick up on?
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained.
CTFE code can be much slower than native one, and I would like to see some kind of compiler cache for such code.
Feb 28 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 29.02.2012, 02:30 Uhr, schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 CTFE code can be much slower than native one, and I would like to see
 some kind of compiler cache for such code.
I second this. As a fan of clean optimizations this is one of the things I tossed around my head for a while. It could use a cache file or the compiler could be started as a daemon process keeping a memory cache. All code that is called through CTFE would go into the cache, indexed by the internal representation of the function body and parameters. But there are a few open questions, like how seamless this could be integrated. Is it easy to get a hash for function bodies and parameters? How should the cache be limited? N functions, n bytes or maybe one cache per module (together with the .o files)? The latter case would mean that if a.d uses CTFE, that executes code from b.d the results of CTFE would all be cached together with a.o, because that was the entry point. And if there was a module c.d that does the same as a.d it would duplicate the b.d part in its own cache. The benefit is that the cache works with both single module and whole program compilation and it doesn't need any limits. Instead the caches for each module are always refreshed with what was last used in the compilation. In addition to the last compilation, the caches could be aware of versioned compilation. I usually want to compile debug versions and Linux/Windows release versions at least, so I wouldn't want to invalidate the caches. For 32-bit vs. 64-bit I assume it is the best to just cache them separately as it could prove difficult to distinguish two versions of code that uses (void*).sizeof or something else that isn't declared wrapped in a version statement like size_t is.
Feb 29 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/12 1:54 AM, Marco Leise wrote:
 Am 29.02.2012, 02:30 Uhr, schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 CTFE code can be much slower than native one, and I would like to see
 some kind of compiler cache for such code.
I second this. As a fan of clean optimizations this is one of the things I tossed around my head for a while. It could use a cache file or the compiler could be started as a daemon process keeping a memory cache. All code that is called through CTFE would go into the cache, indexed by the internal representation of the function body and parameters.
I think this is the kind of development that will be naturally motivated and pushed by serious CTFE applications pushing the boundary. I'm not very worried about it right now as about investigating CTFE potential and blockers. Andrei
Mar 01 2012
prev sibling next sibling parent James Miller <james aatch.net> writes:
On 29 February 2012 14:16, Christopher Bergqvist
<spambox0 digitalpoetry.se> wrote:
 On Tuesday, 28 February 2012 at 08:36:13 UTC, CTFE-4-the-win wrote:
 On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
 wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, an=
d
 potential applications have been discussed more than a few times, rangi=
ng
 from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient =
D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of co=
de
 (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe Siga=
ud
 has already a competing design and implementation of a parser generator=
.
 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not=
new
 and already available with some languages, has unique powers when combi=
ned
 with other features of D. With CTFE we get to do things that are quite
 literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could =
use a
 character-level PEG or a classic automaton, and emit tokens for consump=
tion
 by a parser generator. The two should work in perfect tandem (no need f=
or
 glue code). At the end of the day, defining a complete lexer+parser com=
bo
 for a language should be just a few lines longer than the textual
 representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Definitely, I applaud this initiative! I've long been of the opinion that CTFE parsing is D's killer-feature, which would allow me to "sneak" D into a "nameless above average size company". ;)
I agree that the current direction of D in this area is impressive. =C2=A0However, I fail to see a killer-feature in generating a lexer-parse=
r
 generator at compile-time instead of run-time.

 A run-time generator would benefit from not having to execute within the
 limited CTFE environment and would always be on-par in that respect. =C2=
=A0A
 compile time generator would internalize the generation and compilation o=
f
 the result (with possible glue-code), simplifying the build process
 somewhat.

 What am I failing to pick up on?
As far as I can tell, the advantage of compile-time parser-generation is that you get significantly better start-up times for the program. Generating a lexer/parser at runtime produces signficant overhead. Also, if you want to do a yacc-style action system, then it being created at compile-time means that you don't have the overhead of using anonymous functions or delegates at runtime, you can just give it a raw string of D code. The CTFE environment is not that limited, and has enough functionality to produce a lexer/parser. Complex language definitions would be benefited by CTFE parser-generation since the load times to create the generator can be quite long, a problem for programs that may not be long-running (a lint tool for example). -- James Miller
Feb 28 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 28, 2012 at 08:25:21PM -0500, Jonathan M Davis wrote:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.
 
 A run-time generator would benefit from not having to execute within
 the limited CTFE environment and would always be on-par in that
 respect. A compile time generator would internalize the generation
 and compilation of the result (with possible glue-code), simplifying
 the build process somewhat.
 
 What am I failing to pick up on?
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained.
[...] You also eliminate possibly lengthy program startup initialization. Which can add up, if the program is run many times. Plus, if something can already be computed at compile-time, why waste time & resources doing it at runtime? T -- Без труда не выловишь и рыбку из пруда.
Feb 28 2012
prev sibling next sibling parent d coder <dlang.coder gmail.com> writes:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.
CTFE parsing is especially useful for DSEL (Domain Specific Embedded Languages) or internal DSLs. The advantages are: 1. Syntactic errors (in the parsed constructs) are given out at compile time. 2. D reflections are available only at compile time. Referencing the variables/identifiers in the parsed subset of DSL with the mainstream D code is impossible without reflections in place. Regards - Puneet
Feb 28 2012
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.
CTFE parsing is especially useful for DSEL (Domain Specific Embedded
Languages) or internal DSLs. The advantages are:
 1. Syntactic errors (in the parsed constructs)  are given out at compile
time.
 2. D reflections are available only at compile time. Referencing the
variables/identifiers in the parsed subset of DSL with the mainstream D code is impossible without reflections in place. One of my goals while writing a CT grammar generator was to get a compile-time parse-tree. Since it contains strings, it's easy to walk the tree, assembling strings as you go and generating the code you want (if//when you want to write code, that is) Strings are a D way to represent code, so any way to get structured strings at compile-time opens whole vistas for code generation. As for semantic actions, I added them in my code yesterday. I had hopes for using D's new anonymous syntax (p => p), but by being anonymous, they cannot be inserted easily in string mixins (other modules do not now about __lambda1 and co). Anyway, I now have semantic actions at compile-time, I used them to write a small (as in, veeery simple) XML parser: I use semantic actions to push node names while encountering them and pop the last tag while encountering a closing tag. It seems to work OK. That looks a bit like this (sorry, writing on a pad) mixin(Grammar!("Doc <- Node*" "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction, "OpeningTag <- '<' Identifier '>'", OpeningAction, "ClosingTag <- `</` Identifier '>'", ClosingAction, "Text <- (!(OpeningTag / ClosingTag) _)+")); The PEG for Text just means: any char, as long as it's not an OpeningTag nor a ClosingTag. PEG use '.' to say 'Any char', but I wanted to be able to deal with qualified names, so I chose '_' instead. When there is no action, it default to NoOp, as is the case for Doc, Node and Text. I also added named captures (and capture comparison), to be able to say: "I want a sequence of equal chars": "Equal <- _ first (_=first)*" That is: any char, store it as "first", then take any number of char, long as their match is equal to first's match. All this work at CT. I'm afraid being in holidays right now means I do not have easy access to GitHub (no git on a pad, and the computer I use to code right now does not have any network connection). I'll put all this online in a few days, because that must seems like the ramblings of a madman right now... Philippe
Feb 29 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 3:45 AM, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best. Andrei
Feb 29 2012
next sibling parent deadalnix <deadalnix gmail.com> writes:
Le 29/02/2012 17:45, Andrei Alexandrescu a écrit :
 On 2/29/12 3:45 AM, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <- `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best. Andrei
It make sense so an external file can be imported as string and processed at compile time.
Mar 01 2012
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the
approach of putting the entire grammar in one string is best. Yes, using one string is indeed better. That won't be too difficult to code. But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not. In my case, a parse tree is automatically generated (indeed, it's almost the only thing returned by a parse call, that with recognized string and named captures). I have some additional syntax to 'drop' a node ans such, similar to what you proposed a few posts ago.
Mar 01 2012
parent reply "John Belmonte" <john neggie.net> writes:
On Thursday, 1 March 2012 at 15:10:36 UTC, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the
approach of putting the entire grammar in one string is best. Yes, using one string is indeed better. That won't be too difficult to code.
I'm wondering if people have seen LPeg. It's a Lua library, but the design is interesting in that patterns are first class objects which can be composed with operator overloading. http://www.inf.puc-rio.br/~roberto/lpeg/
May 27 2012
parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, May 27, 2012 at 11:13 PM, John Belmonte <john neggie.net> wrote:
 I'm wondering if people have seen LPeg. =C2=A0It's a Lua library, but the=
design
 is interesting in that patterns are first class objects which can be
 composed with operator overloading.

 http://www.inf.puc-rio.br/~roberto/lpeg/
This is exactly the approach followed by C++ Boost::Spirit, with halas the limitations from Algol-language operators and precedence: no postfix '*' (Kleene Star) operator. It seems that lpeg has the same problem. I played with this idea with my own Pegged (https://github.com/PhilippeSigaud/Pegged), but I wasn't quite convinced by the result, exactly for the reason above. Also, when looking at real-world Spirit examples, I was a bit disappointed by the resulting syntax: it's not *that* readable for complicated expressions. In fact, that's exactly why I decided to follow the DSL road with Pegged, so as to obtain exactly the PEG syntax, with the real operators and their standard precedence. Btw, if you're interested in expression templates, I uploaded yesterday a very preliminary project in Github, to construct, and then manipulate, expressions: auto e =3D expr(); // The null expression auto f =3D 2*e + 1- (e~"abc")*3; // f is an expression template that encodes the right-hand side. Then, it's up to the user to define what the expression represents. https://github.com/PhilippeSigaud/Expression-Tree
May 28 2012
parent reply "John Belmonte" <john neggie.net> writes:
On Monday, 28 May 2012 at 12:27:09 UTC, Philippe Sigaud wrote:
 I played with this idea with my own Pegged
 (https://github.com/PhilippeSigaud/Pegged), but I wasn't quite
 convinced by the result, exactly for the reason above. Also, 
 when
 looking at real-world Spirit examples, I was a bit disappointed 
 by the
 resulting syntax: it's not *that* readable for complicated
 expressions. In fact, that's exactly why I decided to follow 
 the DSL
 road with Pegged, so as to obtain exactly the PEG syntax, with 
 the
 real operators and their standard precedence.
Fair enough. I did notice the following in the markdown PEG though which could benefit from first class patterns: HtmlBlockOpenAddress <- '<' Spnl ("address" / "ADDRESS") Spnl HtmlAttribute* '>' HtmlBlockCloseAddress <- '<' Spnl '/' ("address" / "ADDRESS") Spnl '>' HtmlBlockAddress <- HtmlBlockOpenAddress (HtmlBlockAddress / !HtmlBlockCloseAddress .)* HtmlBlockCloseAddress HtmlBlockOpenBlockquote <- '<' Spnl ("blockquote" / "BLOCKQUOTE") Spnl HtmlAttribute* '>' HtmlBlockCloseBlockquote <- '<' Spnl '/' ("blockquote" / "BLOCKQUOTE") Spnl '>' HtmlBlockBlockquote <- HtmlBlockOpenBlockquote (HtmlBlockBlockquote / !HtmlBlockCloseBlockquote .)* HtmlBlockCloseBlockquote . . .
May 28 2012
parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, May 28, 2012 at 11:42 PM, John Belmonte <john neggie.net> wrote:
 Fair enough. =C2=A0I did notice the following in the markdown PEG though =
which
 could benefit from first class patterns:





 HtmlBlockOpenAddress <- '<' Spnl ("address" / "ADDRESS") Spnl HtmlAttribu=
te*
 '>'
 HtmlBlockCloseAddress <- '<' Spnl '/' ("address" / "ADDRESS") Spnl '>'
 HtmlBlockAddress <- HtmlBlockOpenAddress (HtmlBlockAddress /
 =C2=A0 !HtmlBlockCloseAddress .)* HtmlBlockCloseAddress

 HtmlBlockOpenBlockquote <- '<' Spnl ("blockquote" / "BLOCKQUOTE") Spnl
 =C2=A0 HtmlAttribute* '>'
 HtmlBlockCloseBlockquote <- '<' Spnl '/' ("blockquote" / "BLOCKQUOTE") Sp=
nl
 '>'
 HtmlBlockBlockquote <- HtmlBlockOpenBlockquote (HtmlBlockBlockquote /
 =C2=A0 !HtmlBlockCloseBlockquote .)* HtmlBlockCloseBlockquote
You're exactly right! Nice catch. I took this PEG from another github project (I hope I put the attribution somewhere?) and that was before Pegged accepted parameterized rules. I could indeed drastically factor the previous code.
May 29 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 01, 2012 at 04:10:26PM +0100, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.
Yes, using one string is indeed better. That won't be too difficult to code. But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.
You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE? T -- "A man's wife has more power over him than the state has." -- Ralph Emerson
Mar 01 2012
prev sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
 But how to associate actions with a rule, in that case? I mean, some
 rules will have actions, some not.
You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE?
I could, but one of my driving goals while beginning this project a month ao was to use the shiny new lambda syntax directly :) "rule1", o => o "rule2", o => o etc.
Mar 01 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-03-01 18:21, Philippe Sigaud wrote:
  > > But how to associate actions with a rule, in that case? I mean, some
  > > rules will have actions, some not.
  >
  > You could maybe just put D code in the grammar string, which gets
  > compiled as a string mixin by CTFE?

 I could, but one of my driving goals while beginning this project a
 month ao was to use the shiny new lambda syntax directly :)

 "rule1", o => o
 "rule2", o => o

 etc.
Maybe we can take some ideas from the CoffeeScript compiler (written in CoffeeScript) : Grammar: http://jashkenas.github.com/coffee-script/documentation/docs/grammar.html Lexer: http://jashkenas.github.com/coffee-script/documentation/docs/lexer.html -- /Jacob Carlborg
Mar 01 2012
prev sibling next sibling parent d coder <dlang.coder gmail.com> writes:
 I'm  afraid being in holidays right now means I do not have easy access to
 GitHub (no git on a pad, and the computer I use to code right now does not
 have any network connection). I'll put all this online in a few days,
 because that must seems like the ramblings of a madman right now...
I am eagerly waiting for you to upload the stuff on GitHub. Regards - Puneet
Feb 29 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.

 A run-time generator would benefit from not having to execute within the
 limited CTFE environment and would always be on-par in that respect. A
 compile time generator would internalize the generation and compilation
 of the result (with possible glue-code), simplifying the build process
 somewhat.

 What am I failing to pick up on?
Barrier of entry and granularity of approach, I think. Currently if one wants to parse some simple grammar, there are options such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc. Parsing by hand has the obvious disadvantages. Using boost::spirit has a steep learning curve and tends to create very contorted grammar representations, full of representation noise, and scales very poorly. Using lex/yacc is hamfisted - there's an additional build step, generated files to deal with, and the related logistics, which make lex/yacc a viable choice only for "big" grammars. An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand. Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would all be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats, than writing the parsing code by hand. Andrei
Feb 29 2012
next sibling parent "Christopher Bergqvist" <spambox0 digitalpoetry.se> writes:
On Wednesday, 29 February 2012 at 16:41:22 UTC, Andrei 
Alexandrescu wrote:
 On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
 What am I failing to pick up on?
Barrier of entry and granularity of approach, I think. Currently if one wants to parse some simple grammar, there are options such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc. Parsing by hand has the obvious disadvantages. Using boost::spirit has a steep learning curve and tends to create very contorted grammar representations, full of representation noise, and scales very poorly. Using lex/yacc is hamfisted - there's an additional build step, generated files to deal with, and the related logistics, which make lex/yacc a viable choice only for "big" grammars. An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand. Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would all be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats, than writing the parsing code by hand. Andrei
Thanks for your response. The lowered barrier of entry in parsing something like a customized JSON format or config files is nice, and something I could see myself use. I'm still skeptical about the level of "killer-featureness" but I would be glad to be proven wrong.
Feb 29 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 29, 2012 at 10:41:22AM -0600, Andrei Alexandrescu wrote:
[...]
 An efficient, integrated parser generator would lower the barrier of
 entry dramatically - if we play our cards right, even a sprintf
 specifier string could be parsed simpler and faster using an embedded
 grammar, instead of painfully writing the recognizer by hand.
I see that std.stdio.writef already parses format strings at compile-time, though the code is somewhat ugly. :) It would be ideal if that can be done just by a series of ENBF/regex declarations.
 Parsing config files, XML, JSON, CSV, various custom file formats and
 many others - all would all be a few lines away. Ideally a user who
 has a basic understanding of grammars should have an easier time using
 a small grammar to parse simple custom formats, than writing the
 parsing code by hand.
[...] This is definitely something worth having. It has always been my dream that an ideal programming language should let you write grammar rules and pretty much autogen the code for you. There shouldn't need to be external utilities with the associated messy build rules, auxiliary files and other such things, esp. if you just need a one-time small parser for a very specific task. T -- Why ask rhetorical questions? -- JC
Feb 29 2012
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Why do you need grammar rules for json? Just use a custom-defined
struct and pass it to a toJson function like ae's json module. :)
Feb 29 2012
prev sibling next sibling parent reply Hisayuki Mima <youxkei gmail.com> writes:
(2012/02/28 16:59), Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Hello Andrei, Certainly, I don't write yet the documentation of my library, ctpg. (But a few examples available here: https://github.com/youkei/ctpg/tree/master/example) So, I'd like to describe it here. To be honest, ctpg is inspired by one module of Scala's standard library, Parser Combinators. One of the differences between Parser Combinators and ctpg is that Parser Combinators generates parsers in run-time, but ctpg generates parsers in compile-time by the power of CTFE and mixin. A parser generated by ctpg is a recursive descent parser, so it does lexical analysis and parsing at a time. And the type of input which it can accept is string, wstring, dstring and ForwardRange whose element type is char, wchar or dchar. So, dividing lexical analysis and parsing into two processes is difficult in ctpg. Importantly, a parser generated by ctpg is statically typed as one of the examples, "parsing simple arithmetic expression" shows. Generally a parser generated by other tool and accepting tokens returns the abstract syntax tree, but it return the evaluated value in the example. In other words, it does lexical analysis, parsing and (type) converting at a time. If you want simply abstract syntax tree, it may be a little pain to use ctpg. That's all I'd like to say about ctpg here. By the way, I wholeheartedly agree with Andrei's opinion. But currently, I think, CTFE is limited because of this issue: http://d.puremagic.com/issues/show_bug.cgi?id=6498 . Without dealing with this issue, We couldn't bring out the full potential of CTFE. Thanks, Hisayuki Mima
Feb 28 2012
next sibling parent reply Hisayuki Mima <youxkei gmail.com> writes:
(2012/02/28 23:58), Hisayuki Mima wrote:
 (2012/02/28 16:59), Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Hello Andrei, Certainly, I don't write yet the documentation of my library, ctpg. (But a few examples available here: https://github.com/youkei/ctpg/tree/master/example) So, I'd like to describe it here. To be honest, ctpg is inspired by one module of Scala's standard library, Parser Combinators. One of the differences between Parser Combinators and ctpg is that Parser Combinators generates parsers in run-time, but ctpg generates parsers in compile-time by the power of CTFE and mixin. A parser generated by ctpg is a recursive descent parser, so it does lexical analysis and parsing at a time. And the type of input which it can accept is string, wstring, dstring and ForwardRange whose element type is char, wchar or dchar. So, dividing lexical analysis and parsing into two processes is difficult in ctpg. Importantly, a parser generated by ctpg is statically typed as one of the examples, "parsing simple arithmetic expression" shows. Generally a parser generated by other tool and accepting tokens returns the abstract syntax tree, but it return the evaluated value in the example. In other words, it does lexical analysis, parsing and (type) converting at a time. If you want simply abstract syntax tree, it may be a little pain to use ctpg. That's all I'd like to say about ctpg here. By the way, I wholeheartedly agree with Andrei's opinion. But currently, I think, CTFE is limited because of this issue: http://d.puremagic.com/issues/show_bug.cgi?id=6498 . Without dealing with this issue, We couldn't bring out the full potential of CTFE. Thanks, Hisayuki Mima
A minimum of documentation is now available here: https://github.com/youkei/ctpg/wiki/Home-en Hisayuki Mima
Feb 29 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 10:03 AM, Hisayuki Mima wrote:
 A minimum of documentation is now available here:
 https://github.com/youkei/ctpg/wiki/Home-en
Great start, you may want to aggressively add examples. Then people would love using ctpg and will write more documentation, tutorials etc. Andrei
Feb 29 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 8:58 AM, Hisayuki Mima wrote:
 Certainly, I don't write yet the documentation of my library, ctpg.
 (But a few examples available here:
 https://github.com/youkei/ctpg/tree/master/example)
Nice! I have a few comments. I should say that they entail a fair amount of work. The current setup makes things very difficult for a common case - generating a parse tree. The user would need to insert a lot of custom code to create the appropriate nodes, but that's very easy for the parser because it already has the components. The parser should have a "build AST" option, in which case it should build all nodes without much effort from the user. That's what ANTLR does, and it has a special operator "^" to indicate where the root of the tree should be (http://www.antlr2.org/doc/trees.html). So here's what I think your examples should look like: The syntax could be changed a bit - it should be less like D and more like PEG. The "$" is implicit and shouldn't be needed. The ";" is a useful terminator for large productions spanning more than one line, so let's keep it. I don't understand the use of "!", for example the PEG for expression parsing at http://en.wikipedia.org/wiki/Parsing_expression_grammar is: Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Sum ← Product (('+' / '-') Product)* Expr ← Sum but your grammar has: int primary = !"(" addExp !")" / intLit_p; whereas it seems to me it should be int primary = "(" addExp ")" / intLit_p; No? Here's how I think a parser that also builds an AST looks like: mixin generateParsers!( Options.createAST, q{ root <- addExp; addExp <- mulExp (("+" / "-")^ addExp)? mulExp <- primary (("*" / "/")^ mulExp)? primary <- "(" addExp ")"% / intLit_p; } ); I used the following notation: a node suffixed with "^" becomes the root of the produced AST, and a node suffixed with "%" will be dropped altogether (that's because ")" is not needed in the AST after the expression has been parsed). With only three characters I informed the parser what the shape of the AST will be. At this point calling parse!root("1+2*3") would return an object of type ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs node has type ASTNode!"intLit_p" which has inside a value "1". The rhs node has type ASTNode!"mulExp", and that in turn has two children nodes lhs and rhs, both of type ASTNode!"intLit_p", each with its own value ("2" and "3", respectively). All these nodes are dynamically created by the parsing process and inherit a common type, e.g. ASTNode!"" or some type ASTRootNode.
 So, I'd like to describe it here.

 To be honest, ctpg is inspired by one module of Scala's standard
 library, Parser Combinators.
 One of the differences between Parser Combinators and ctpg is that
 Parser Combinators generates parsers in run-time, but ctpg generates
 parsers in compile-time by the power of CTFE and mixin.

 A parser generated by ctpg is a recursive descent parser, so it does
 lexical analysis and parsing at a time.
Oh, I forgot this property of PEGs - integrated lexing and parsing. So no need for a separate lexer!
 And the type of input which it
 can accept is string, wstring, dstring and ForwardRange whose element
 type is char, wchar or dchar.
Great. I think we should actually define the notion of BufferedRange. I'll get to that topic later.
 So, dividing lexical analysis and parsing into two processes is
 difficult in ctpg.
Yes, sorry I forgot about that!
 Importantly, a parser generated by ctpg is statically typed as one of
 the examples, "parsing simple arithmetic expression" shows.
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the example.
 In other words, it does lexical analysis, parsing and (type) converting
 at a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 That's all I'd like to say about ctpg here.
Why would it be difficult to integrate AST creation with parsing? I'm not sure I understand this. My understanding is that you should be able to add a couple of simple operators ("^" and "%" as described above) to inform AST creation, and you're done! Thanks, Andrei
Feb 29 2012
parent reply Hisayuki Mima <youxkei gmail.com> writes:
(2012/03/01 1:19), Andrei Alexandrescu wrote:
 On 2/28/12 8:58 AM, Hisayuki Mima wrote:
 Certainly, I don't write yet the documentation of my library, ctpg.
 (But a few examples available here:
 https://github.com/youkei/ctpg/tree/master/example)
Nice! I have a few comments. I should say that they entail a fair amount of work. The current setup makes things very difficult for a common case - generating a parse tree. The user would need to insert a lot of custom code to create the appropriate nodes, but that's very easy for the parser because it already has the components. The parser should have a "build AST" option, in which case it should build all nodes without much effort from the user. That's what ANTLR does, and it has a special operator "^" to indicate where the root of the tree should be (http://www.antlr2.org/doc/trees.html). So here's what I think your examples should look like: The syntax could be changed a bit - it should be less like D and more like PEG. The "$" is implicit and shouldn't be needed. The ";" is a useful terminator for large productions spanning more than one line, so let's keep it. I don't understand the use of "!", for example the PEG for expression parsing at http://en.wikipedia.org/wiki/Parsing_expression_grammar is: Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Sum ← Product (('+' / '-') Product)* Expr ← Sum but your grammar has: int primary = !"(" addExp !")" / intLit_p; whereas it seems to me it should be int primary = "(" addExp ")" / intLit_p; No? Here's how I think a parser that also builds an AST looks like: mixin generateParsers!( Options.createAST, q{ root <- addExp; addExp <- mulExp (("+" / "-")^ addExp)? mulExp <- primary (("*" / "/")^ mulExp)? primary <- "(" addExp ")"% / intLit_p; } ); I used the following notation: a node suffixed with "^" becomes the root of the produced AST, and a node suffixed with "%" will be dropped altogether (that's because ")" is not needed in the AST after the expression has been parsed). With only three characters I informed the parser what the shape of the AST will be. At this point calling parse!root("1+2*3") would return an object of type ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs node has type ASTNode!"intLit_p" which has inside a value "1". The rhs node has type ASTNode!"mulExp", and that in turn has two children nodes lhs and rhs, both of type ASTNode!"intLit_p", each with its own value ("2" and "3", respectively). All these nodes are dynamically created by the parsing process and inherit a common type, e.g. ASTNode!"" or some type ASTRootNode.
 So, I'd like to describe it here.

 To be honest, ctpg is inspired by one module of Scala's standard
 library, Parser Combinators.
 One of the differences between Parser Combinators and ctpg is that
 Parser Combinators generates parsers in run-time, but ctpg generates
 parsers in compile-time by the power of CTFE and mixin.

 A parser generated by ctpg is a recursive descent parser, so it does
 lexical analysis and parsing at a time.
Oh, I forgot this property of PEGs - integrated lexing and parsing. So no need for a separate lexer!
 And the type of input which it
 can accept is string, wstring, dstring and ForwardRange whose element
 type is char, wchar or dchar.
Great. I think we should actually define the notion of BufferedRange. I'll get to that topic later.
 So, dividing lexical analysis and parsing into two processes is
 difficult in ctpg.
Yes, sorry I forgot about that!
 Importantly, a parser generated by ctpg is statically typed as one of
 the examples, "parsing simple arithmetic expression" shows.
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the
 example.
 In other words, it does lexical analysis, parsing and (type) converting
 at a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 That's all I'd like to say about ctpg here.
Why would it be difficult to integrate AST creation with parsing? I'm not sure I understand this. My understanding is that you should be able to add a couple of simple operators ("^" and "%" as described above) to inform AST creation, and you're done! Thanks, Andrei
Certainly, "built AST" option is very interesting. I don't know whether building AST is a common case or not because I'm unfamiliar with syntax analysis. But I'd like to complete ctpg as D version of boost::spirit::qi first. Currently, ctpg can be used probably like boost::spirit::qi. (Both do typed syntax analysis.) There are some points where ctpg is superior to boost::spirit::qi. For example, The source code written by using ctpg is simple because it is like PEG. In addition, I'm planning to improve error messages by making excellent use of #line and __LINE__. I'd like to write the library which is functionally same boost::spirit::qi and written in D style. Hence, I'd like to make implementing "built AST" option i.e. untyped syntax analysis a low priority. What is your idea? P.S. prefix operator "!" is same as postfix operator "%" which you said. The types which parsers on both sides of "/" operator have should be compatible. The type of "(" addExp ")" is Tuple!(string, int, string) and the type of intLit_p is int. They are not compatible. The type of !"(" addExp !")" is int. So the types of !"(" addExp !")" and intLit_p are compatible. Thanks, Hisayuki Mima
Mar 04 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/12 8:34 AM, Hisayuki Mima wrote:
 Certainly, "built AST" option is very interesting.
 I don't know whether building AST is a common case or not because I'm
 unfamiliar with syntax analysis.
 But I'd like to complete ctpg as D version of boost::spirit::qi first.
 Currently, ctpg can be used probably like boost::spirit::qi. (Both do
 typed syntax analysis.)
 There are some points where ctpg is superior to boost::spirit::qi. For
 example, The source code written by using ctpg is simple because it is
 like PEG.
 In addition, I'm planning to improve error messages by making excellent
 use of #line and __LINE__.
 I'd like to write the library which is functionally same
 boost::spirit::qi and written in D style.
 Hence, I'd like to make implementing "built AST" option i.e. untyped
 syntax analysis a low priority.
 What is your idea?
It's your project so you define what's fun to work on. Let me just say that I worked on a lot of parsing-related stuff, and most often projects that start as "this grammar is simple enough, let's just do processing during parsing" ultimately required a rewrite to do generate AST, process it, and then use it for computation. You can get away without an AST for the simplest grammars (e.g. printf etc.) but in that case you compete with hand-written code. I wouldn't think of parsing a serious grammar with more than a few productions without generating an AST. My intuition is that your work will best excel at those grammars. Andrei
Mar 04 2012
parent "Ben Hanson" <Ben.Hanson tikit.com> writes:
On Sunday, 4 March 2012 at 21:00:00 UTC, Andrei Alexandrescu
wrote:
 You can get away without an AST for the simplest grammars (e.g. 
 printf etc.) but in that case you compete with hand-written 
 code. I wouldn't think of parsing a serious grammar with more 
 than a few productions without generating an AST. My intuition 
 is that your work will best excel at those grammars.
This chimes nicely with the comments made at the Going Native conference with regard to clang generating an AST and MSVC and GCC not. Herb Sutter referred to it as "AST envy" (as in he wished MSVC took the AST route). I only mention it as maybe not everyone here watched that vid. Andrei himself was at the conference of course! Regards, Ben
Mar 05 2012
prev sibling next sibling parent reply d coder <dlang.coder gmail.com> writes:
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the example.
 In other words, it does lexical analysis, parsing and (type) converting at
 a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.
Hello Youkei I am trying to use CTPG for compile time parsing for a DSL I am working on. I have tried the examples you created in the examples directory. I would like the parser to effect some side effects. For this purpose, I tried including the parser mixin into a class, but I got a strange error saying: Error: need 'this' to access member parse I have copied the code I am trying to compile at the end of the email. Let me know what I could be doing wrong here. Regards - Puneet import ctpg; import std.array: join; import std.conv: to; class Foo { int result; mixin(generateParsers(q{ int root = mulExp $; int mulExp = primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; } / primary; int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int; })); void frop() { result = parse!root("5*8"); } } void main(){ Foo foo = new Foo(); foo.frop(); }
May 27 2012
parent Hisayuki Mima <youxkei gmail.com> writes:
(2012年05月28日 02:31), d coder wrote:
     Generally a parser generated by other tool and accepting tokens
     returns the abstract syntax tree, but it return the evaluated value
     in the example.
     In other words, it does lexical analysis, parsing and (type)
     converting at a time.
     If you want simply abstract syntax tree, it may be a little pain to
     use ctpg.


   Hello Youkei

 I am trying to use CTPG for compile time parsing for a DSL I am working
 on. I have tried the examples you created in the examples directory.

 I would like the parser to effect some side effects. For this purpose, I
 tried including the parser mixin into a class, but I got a strange error
 saying:

 Error: need 'this' to access member parse

 I have copied the code I am trying to compile at the end of the email.
 Let me know what I could be doing wrong here.

 Regards
 - Puneet


 import ctpg;
 import std.array: join;
 import std.conv: to;

 class Foo
 {
    int result;
    mixin(generateParsers(q{
          int root = mulExp $;

          int mulExp =
            primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
          / primary;

          int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
        }));

    void frop() {
      result = parse!root("5*8");
    }
 }


 void main(){
    Foo foo = new Foo();
    foo.frop();
 }
Hello Puneet, Thank you for your report. I fixed it. Now CTPG creates a static function as a parser. But I'm afraid this fix doesn't help you because I don't understand what a side effect you said is. Can you show me some examples which include the side effect? Thanks, Hisayuki Mima
May 31 2012
prev sibling parent d coder <dlang.coder gmail.com> writes:
 I would like the parser to effect some side effects. For this purpose, I
 tried including the parser mixin into a class, but I got a strange error
 saying:

 Error: need 'this' to access member parse
Ok. I see my folly. At compile time, there would not be any "this" since the class has not been instantiated yet. I will have to think of other solutions. Basically, I am trying to use the parser to create functions that I can use at run time. But I wanted to create two functions from the same parser. I have succeeded with creating one function. I do not want to create two separate parsers because each of these parsers would add to memory footprint of the compiler. Any ideas? Maybe I could use tuples here? Regards - Puneet
May 27 2012
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.
+1.
 We all felt for a long time that there's a lot of potential in CTFE,
 and potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser
 generators.
CTFE is one of the big features that attracted me to D. [...]
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.
For example?
 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect
 tandem (no need for glue code). At the end of the day, defining a
 complete lexer+parser combo for a language should be just a few lines
 longer than the textual representation of the grammar itself.
Definitely agree. I frequently work with code that requires some amount of lexing/parsing; having something like this in Phobos would be absolutely awesome.
 What do you all think? Let's get this project off the ground!
[...] +1. On that note, I'd like to mention that CTFE currently has some parts that need more work: (1) We need compile-time alternatives to the functions in std.math so that functions that are currently only implemented in asm can be used in CTFE. (2) It should be possible to generate AA literals from CTFE and use them to initialize things like parsing/lexing lookup tables, etc.. While this in itself is not a major critical issue, it would make D look bad if we tout D's CTFE capabilities yet at the same time AA's have to be created by static this() at runtime. T -- Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
Feb 28 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 9:57 AM, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.
For example?
For example integrated lexer and parser generators! Andrei
Feb 29 2012
parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jiljf7$18em$1 digitalmars.com...
 On 2/28/12 9:57 AM, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.
For example?
For example integrated lexer and parser generators!
...With statically-checked symbol names that can include arbitrary characters: Sym!"+" Sym!"{" Sym!"<Foo>" etc...
Feb 29 2012
prev sibling next sibling parent reply "mist" <none none.none> writes:
Something similar to Boost::Spirit2 but with sane syntax and 
better debugging would have been awesome.
Feb 28 2012
next sibling parent reply bcs <bcs example.com> writes:
On 02/28/2012 08:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.
How about not having to encode the syntax as arithmetic expressions? D can do string parsing of eBNF at compile time if you want that.
Feb 28 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 12:07 AM, bcs wrote:
 On 02/28/2012 08:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.
How about not having to encode the syntax as arithmetic expressions? D can do string parsing of eBNF at compile time if you want that.
I think that's what he meant by "sane syntax". Andrei
Feb 29 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 10:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.
Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear - using inadequate technology for a specific purpose. Andrei
Feb 29 2012
prev sibling next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I'm starting a new thread on this because I think the matter is of  
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and  
 potential applications have been discussed more than a few times,  
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors  
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D  
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more  
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of  
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe  
 Sigaud has already a competing design and implementation of a parser  
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm  
 saying it's of strategic importance because CTFE technology, though not  
 new and already available with some languages, has unique powers when  
 combined with other features of D. With CTFE we get to do things that  
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient  
 lexer-parser generator combo in Phobos, pronto. The lexer itself could  
 use a character-level PEG or a classic automaton, and emit tokens for  
 consumption by a parser generator. The two should work in perfect tandem  
 (no need for glue code). At the end of the day, defining a complete  
 lexer+parser combo for a language should be just a few lines longer than  
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
I wrote a generic lexer generator some time ago. It already let to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexer I've ditched an attempt to write a parser combinator. It was based on expression templates and ended up at spirit craziness. <PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweigh the benefit of a declarative grammar. /PERSONAL OPINION> A lot becomes feasible from the CTFE perspective, despite some bugfixes I only miss exp and log currently. I do not agree that it's the right moment to write a parser though. It hits the first of phobos two biggest shortcomings, the lack of a good I/O system and the missing Allocators. Any parser written now will either risk to not play nice with ranges or has to come up with it's own buffering again.
Feb 28 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28.02.2012 22:46, Martin Nowak wrote:
 On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE,
 and potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient
 D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though
 not new and already available with some languages, has unique powers
 when combined with other features of D. With CTFE we get to do things
 that are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect
 tandem (no need for glue code). At the end of the day, defining a
 complete lexer+parser combo for a language should be just a few lines
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!
To be truly successful it should have some distinct properties like: - being faster or identical to handwritten stuff we already have (like e.g. std.json ), reminds us of recent INI parser proposal, should be an easy pick for general parser. - be flexible, the result need not be a predefined AST tree nodes, Hisayuki Mima's parser shows some nice movement in this direction - have reasonable compile times and memory consumption (though it will only improve over time) Recalling EBNF parser idea that I run with before finally being dragged down by real life. Roughly I thought to follow hybrid LL(*) aproach, while I had a solid plan on doing DFA for lexer and parser lookahead, the other things were more or less floating.
 Thanks,

 Andrei
I wrote a generic lexer generator some time ago. It already let to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator
Cool stuff.
 https://gist.github.com/1262321 - complete and fast D lexer

 I've ditched an attempt to write a parser combinator. It was based on
 expression templates and ended up at spirit craziness.

 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>
Error reporting is the weak spot, still no good ideas on solving that.
 A lot becomes feasible from the CTFE perspective,
 despite some bugfixes I only miss exp and log currently.
Reminds me that I have to dig up a wonderful CTFE bug in std.regex that's being somehow worked around by duping arrays since August.
 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a good
 I/O
 system and the missing Allocators.
 Any parser written now will either risk to not play nice with ranges
 or has to come up with it's own buffering again.
All good points. There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon. -- Dmitry Olshansky
Feb 28 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
 On 28.02.2012 22:46, Martin Nowak wrote:
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
[...]
We need to have a easy-to-use, complete, seamless, and efficient
lexer-parser generator combo in Phobos, pronto. The lexer itself
could use a character-level PEG or a classic automaton, and emit
tokens for consumption by a parser generator. The two should work in
perfect tandem (no need for glue code). At the end of the day,
defining a complete lexer+parser combo for a language should be just
a few lines longer than the textual representation of the grammar
itself.
[...]
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have
 (like e.g. std.json ), reminds us of recent INI parser proposal,
 should be an easy pick for general parser.
We probably have to use string mixins to achieve this. OTOH, a generic parser generator is unlikely to be able to handle special case optimizations such as operator-precedence parsers (http://en.wikipedia.org/wiki/Operator-precedence_parser), because operator precedence is only implied by grammar rules, and to optimize for this case requires explicitly specifying each operator's precedence. There is some price to be paid for generality, unfortunately.
 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction
In my mind, D's delegates are ideal for this sort of thing. Each grammar rule will have an associated callback. We can provide default callbacks that generates AST nodes, but the user can override them with his own delegates to do whatever they want (evaluate expressions/commands on the fly, build only AST subtrees of interest, etc.).
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
This will be good motivation to improve dmd's CTFE implementation. ;-) [...]
 Error reporting is the weak spot, still no good ideas on solving that.
This is not specific to D, though, right? [...]
I do not agree that it's the right moment to write a parser though.
It hits the first of phobos two biggest shortcomings, the lack of a
good I/O system and the missing Allocators.  Any parser written now
will either risk to not play nice with ranges or has to come up with
it's own buffering again.
All good points. There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.
[...] Are you talking about std.io? I glanced over that code briefly recently; it looks very promising. Hope it will make it into Phobos soon. We *need* to integrate streams, ranges, and the stdio writeln, etc., stuff. The current situation to an outsider looks quite bad (streams not compatible with stdio, no easy way to substitute string buffer for files, etc.). We also need automatic UTF encoding detection for all input streams/files/ranges/whatever. Including mismatching endianness (i.e. auto byte-swapping). In a recent personal project to write a D lexer (as an exercise to help me learn D better), I ended up having to roll my own input stream abstraction in order to handle automatic encoding detection. Which is quite poorly written, I've to admit. Something like this needs to be standardized in Phobos with a well-written implementation. T -- "You know, maybe we don't *need* enemies." "Yeah, best friends are about all I can take." -- Calvin & Hobbes
Feb 28 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29.02.2012 0:19, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
 On 28.02.2012 22:46, Martin Nowak wrote:
 On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:
[...]
 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself
 could use a character-level PEG or a classic automaton, and emit
 tokens for consumption by a parser generator. The two should work in
 perfect tandem (no need for glue code). At the end of the day,
 defining a complete lexer+parser combo for a language should be just
 a few lines longer than the textual representation of the grammar
 itself.
[...]
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have
 (like e.g. std.json ), reminds us of recent INI parser proposal,
 should be an easy pick for general parser.
We probably have to use string mixins to achieve this. OTOH, a generic parser generator is unlikely to be able to handle special case optimizations such as operator-precedence parsers (http://en.wikipedia.org/wiki/Operator-precedence_parser), because operator precedence is only implied by grammar rules, and to optimize for this case requires explicitly specifying each operator's precedence. There is some price to be paid for generality, unfortunately.
Well I see no problem in defining a separate OperatorGrammar constructor, that will take base Unit production and a bunch of operators with arity, plus brackets with their priority is plenty. Semantic then actions are trivially defined for each of operators respectively. Someway of stacking grammars is bonus points for the project. As Andrei started with lexer+parser, I see no problem with it being lexer+parser(+parser)*.
 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction
In my mind, D's delegates are ideal for this sort of thing. Each grammar rule will have an associated callback. We can provide default callbacks that generates AST nodes, but the user can override them with his own delegates to do whatever they want (evaluate expressions/commands on the fly, build only AST subtrees of interest, etc.).
Yup delegates and mixins are the tools. More over lambdas shine here, less clutter + type deduction.
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
This will be good motivation to improve dmd's CTFE implementation. ;-) [...]
 Error reporting is the weak spot, still no good ideas on solving that.
This is not specific to D, though, right?
Such a framework relying on mixins is bound to produce awful error messages at compile-time, unless explicitly stuffed up to watterline with some kind of static assert(__traits(compiles,...),"Error x + info");
 [...]
 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a
 good I/O system and the missing Allocators.  Any parser written now
 will either risk to not play nice with ranges or has to come up with
 it's own buffering again.
All good points. There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.
[...] Are you talking about std.io? I glanced over that code briefly recently; it looks very promising. Hope it will make it into Phobos soon. We *need* to integrate streams, ranges, and the stdio writeln, etc., stuff. The current situation to an outsider looks quite bad (streams not compatible with stdio, no easy way to substitute string buffer for files, etc.).
Aye, that's it. Not to mention DIP9, though that's another topic in itself.
 We also need automatic UTF encoding detection for all input
 streams/files/ranges/whatever. Including mismatching endianness (i.e.
 auto byte-swapping).  In a recent personal project to write a D lexer
 (as an exercise to help me learn D better), I ended up having to roll my
 own input stream abstraction in order to handle automatic encoding
 detection. Which is quite poorly written, I've to admit. Something like
 this needs to be standardized in Phobos with a well-written
 implementation.


 T
-- Dmitry Olshansky
Feb 29 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
 Someway of stacking grammars is bonus points for the project. As Andrei
 started with lexer+parser, I see no problem with it being
 lexer+parser(+parser)*.
What does stacking grammars mean? Anyhow, one thing that would become important is tree walkers - e.g. you have the tree and you define code for evaluating it etc.
 Such a framework relying on mixins is bound to produce awful error
 messages at compile-time, unless explicitly stuffed up to watterline
 with some kind of static assert(__traits(compiles,...),"Error x + info");
We need to figure out ways to make that simpler. Andrei
Feb 29 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29.02.2012 20:47, Andrei Alexandrescu wrote:
 On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
 Someway of stacking grammars is bonus points for the project. As Andrei
 started with lexer+parser, I see no problem with it being
 lexer+parser(+parser)*.
What does stacking grammars mean?
In general it should be read as using one parser as lexer for another parser. But there is some nice ways to integrate another parser within recursive descent parser. Let's assume simple recursive descent parser that uses operator precedence grammar. Suppose we have the main grammar: Declaration = Type Identifier; Statement = 'if' '(' Expr ')' | Expr Type = int | float | ... Where Expr is connected to the second grammar expressed as ... Ok, let's sketch it, skipping semantic actions: operatorGrammar!q{ Expr Unit: Constant | Identifier //assume we have lexer Operators: Not, !, prefix, 4 UnMinus, -, prefix, 4 Mul, *, binary, 3 Mul, /, binary, 3 Plus, +, binary, 2 BiMinus, -, binary, 2 Index, [], brace-like postfix Braces, (), brace-like }; Effectively operator grammar for expressions parses "token" Expr. It's easy in recursive descent because it can just try to call this parser, then try the other alternative if it fails. I'm not sure this stacking does add that much, but it's something to keep in mind.
 Anyhow, one thing that would become important is tree walkers - e.g. you
 have the tree and you define code for evaluating it etc.
Yes, it's a must have if AST is generated generically via annotations.
 Such a framework relying on mixins is bound to produce awful error
 messages at compile-time, unless explicitly stuffed up to watterline
 with some kind of static assert(__traits(compiles,...),"Error x + info");
We need to figure out ways to make that simpler. Andrei
-- Dmitry Olshansky
Feb 29 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have (like
 e.g. std.json ), reminds us of recent INI parser proposal, should be an
 easy pick for general parser.
Yes.
 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction
Yes. But AST creation should be supported without any custom code.
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
Yes. I guess PEGs have problems there.
 Recalling EBNF parser idea that I run with before finally being dragged
 down by real life. Roughly I thought to follow hybrid LL(*) aproach,
 while I had a solid plan on doing DFA for lexer and parser lookahead,
 the other things were more or less floating.
Well, maybe you could integrate that within your up-and-coming research. Grammars have been considered a solved problem every five years since 1970, and there's always new work coming :o).
 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.
An interactive regex would be a dream come true to me... Andrei
Feb 29 2012
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
Yes. I guess PEGs have problems there.
Probably LR, too, unless you build the state tables in a separate prior build step.
 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.
An interactive regex would be a dream come true to me...
One thing I've been wanting to see in a lexer's regex would be a way to match things like D's nested comments or: qEOS fancy string blah blah EOS Support for either one would probably render it "not *technically* a regex" but it should be entirely possible by just adding some extra data/processing to the states.
Feb 29 2012
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:jilmhr$1dul$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
Yes. I guess PEGs have problems there.
Probably LR, too, unless you build the state tables in a separate prior build step.
Erm, I mean LALR specifically.
Feb 29 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 11:15 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
Yes. I guess PEGs have problems there.
Probably LR, too, unless you build the state tables in a separate prior build step.
It's been a while since I read about PEGs (the packrat parser paper), but think it's a more fundamental issue with PEGs because they need to memoize several possible parses of the input. Andrei
Feb 29 2012
parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jilnjg$1eu9$3 digitalmars.com...
 On 2/29/12 11:15 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)
Yes. I guess PEGs have problems there.
Probably LR, too, unless you build the state tables in a separate prior build step.
It's been a while since I read about PEGs (the packrat parser paper), but think it's a more fundamental issue with PEGs because they need to memoize several possible parses of the input.
I don't know much about packrat parsers, but what I meant is that generating LALR tables (which are then used by an LALR parser) can be very computationally expensive: For simple grammars, it's trivial, but for more complex "typical programming langauge" grammars, it can easily take upwards of quite a few minutes when *not* done in CTFE (at least on my underpowered machine). Although maybe my implementation and GOLD's implementation both just have some additional hidden scaling issue that isn't inherent to the algorithms? Actually using an *already*-generated LALR table (which only needs to be generated once per grammar, not once per input) to parse doesn't have major inherent efficiency issues compared to other parsing algorithms, AFAIK.
Feb 29 2012
prev sibling parent "Ben Hanson" <Ben.Hanson tikit.com> writes:
Hi Nick,

On Wednesday, 29 February 2012 at 17:16:44 UTC, Nick Sabalausky
wrote:
 One thing I've been wanting to see in a lexer's regex would be 
 a way to
 match things like D's nested comments or:
I have already implemented a lexer generator that can handle recursion in the token definitions (using multiple lexer states). See http://www.benhanson.net/lexertl.html The library is C++ and generates lexers at runtime, but the concepts should be easily transferable. Basically I have a boolean for pop in the end state and a separate column for push_dfa_: if (end_state_) { // Return longest match if (pop_) { start_state_ = results_.stack.top ().first; results_.stack.pop (); } else if (push_dfa_ != results_.npos ()) { results_.stack.push (typename results::id_type_pair (push_dfa_, id_)); } . . . I'm happy to answer any questions etc. Regards, Ben
Mar 05 2012
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29.02.2012 20:31, Andrei Alexandrescu wrote:
 Recalling EBNF parser idea that I run with before finally being dragged
 down by real life. Roughly I thought to follow hybrid LL(*) aproach,
 while I had a solid plan on doing DFA for lexer and parser lookahead,
 the other things were more or less floating.
Well, maybe you could integrate that within your up-and-coming research. Grammars have been considered a solved problem every five years since 1970, and there's always new work coming :o).
It might involve parsing and grammars, unless academia folk seduce me with their ideas of optimization for highly parallel architectures :).
 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.
An interactive regex would be a dream come true to me...
Where you've been? ;) I mean during GSOC we've spent days with Fawzi talking about shoehorning regex matcher to work on buffered streams. He actually did that prototype, I was mostly reviewing/fixing the source to make sure it fits. Recalling the inner works it was even expected to optionally do NFC normalization on the fly (wow, now that was ambitious) and all that without copying stuff around 99% of the time. -- Dmitry Olshansky
Feb 29 2012
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote:
[...]
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.

 
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Cool! I'll have to take a look at this sometime. [...]
 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>
But things like lex/yacc have been useful throughout the years. With D's delegates, lexer/parser action rules should be even cleaner, no?
 A lot becomes feasible from the CTFE perspective, despite some
 bugfixes I only miss exp and log currently.
All of std.math should have CTFE versions so that we can perform complex calculations at compile-time (e.g., create table of scaled sin/cos values for use in high-performance 3D apps -- no need to waste time to generate these tables at startup time).
 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a
 good I/O system and the missing Allocators.  Any parser written now
 will either risk to not play nice with ranges or has to come up with
 it's own buffering again.
OTOH, perhaps once we start writing a lexer/parser generator, that will force us to fix the I/O and allocator system. Without a good project to motivate us to fix these tedious issues, it seems that we just lose inspiration to do it after a while. T -- The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
Feb 28 2012
next sibling parent deadalnix <deadalnix gmail.com> writes:
Le 28/02/2012 21:02, H. S. Teoh a crit :
 OTOH, perhaps once we start writing a lexer/parser generator, that will
 force us to fix the I/O and allocator system. Without a good project to
 motivate us to fix these tedious issues, it seems that we just lose
 inspiration to do it after a while.
std.conainer is already a good candidate for allocators. Back to the topic, this is a great idea, but isn't it building on weak bases ? We still have issue with const correctness, or postblit, and on lib side, we lack allocators, error handling is poor, containers are poor, etc . . .
Feb 28 2012
prev sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 28 Feb 2012 21:02:24 +0100, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote:
 [...]
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.


 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Cool! I'll have to take a look at this sometime. [...]
 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse  
 results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>
But things like lex/yacc have been useful throughout the years. With D's delegates, lexer/parser action rules should be even cleaner, no?
Yacc does work but at the price of an additional build step and total automaton obfuscation. And even at that price the results are still hard to maintain klingon sources. http://supercollider.git.sourceforge.net/git/gitweb.cgi?p=supercollider/supercollider;a=blob;f=lang/LangSource/Bison/lang11d I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. parse result always uses structs or classes and built-in arrays. ---- class Module { this(Declaration[]); } class StructDeclaration : Declaration { enum _enbf = "struct $1=Identifier { $2=Declaration* }"; this(Identifier, Declaration[]); } ... Parser!(Module, StructDeclaration, ...) parser; Module m = parser.parse(read("file.d"));
Feb 28 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
[...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.
Excellent idea, I like this.
 class Module
 {
     this(Declaration[]);
 }
 
 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";
 
     this(Identifier, Declaration[]);
 }
 
 ...
 
 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));
I like this! Definitely an improvement over yacc syntax. T -- Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn
Feb 28 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...
 On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
 [...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.
Excellent idea, I like this.
 class Module
 {
     this(Declaration[]);
 }

 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";

     this(Identifier, Declaration[]);
 }

 ...

 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));
I like this! Definitely an improvement over yacc syntax.
In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.
Feb 28 2012
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:jikcca$1vq7$1 digitalmars.com...
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...
 On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
 [...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.
Excellent idea, I like this.
 class Module
 {
     this(Declaration[]);
 }

 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";

     this(Identifier, Declaration[]);
 }

 ...

 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));
I like this! Definitely an improvement over yacc syntax.
In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.
Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...
Feb 28 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able to 
 parse at compile-time...
Just gave it a quick shot. It was looking like it might not be too bad, but then I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)
Feb 28 2012
next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
Le 29 f=C3=A9vr. 2012 07:20, "Nick Sabalausky" <a a.a> a =C3=A9crit :
 "Nick Sabalausky" <a a.a> wrote in message
 news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able
to
 parse at compile-time...
Just gave it a quick shot. It was looking like it might not be too bad,
but
 then I hit:

 Assertion failure: 'ctfeStack.stackPointer() =3D=3D 0' on line 4823 in fi=
le
 'interpret.c'

 Bleh.

 (This was with DMD 2.058)
Yeah, I had the very same yesterday :( Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing. Too bad. In my case, I found a workaround: I was doing array[] ~=3D SomeStruct(args); which asserts at CTFE. But: auto s =3D SomeStructs(args); array[] ~=3D s; works.
Feb 29 2012
parent "Nick Sabalausky" <a a.a> writes:
"Philippe Sigaud" <philippe.sigaud gmail.com> wrote in message 
news:mailman.229.1330508922.24984.digitalmars-d puremagic.com...
Le 29 fvr. 2012 07:20, "Nick Sabalausky" <a a.a> a crit :
 Just gave it a quick shot. It was looking like it might not be too bad,
but
 then I hit:

 Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file
 'interpret.c'

 Bleh.

 (This was with DMD 2.058)
Yeah, I had the very same yesterday :( Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing. Too bad. In my case, I found a workaround: I was doing array[] ~= SomeStruct(args); which asserts at CTFE. But: auto s = SomeStructs(args); array[] ~= s; works.
Ooh, cool! I know exactly where I'm doing that!
Feb 29 2012
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:jikfpi$25cc$1 digitalmars.com...
 "Nick Sabalausky" <a a.a> wrote in message 
 news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able 
 to parse at compile-time...
Just gave it a quick shot. It was looking like it might not be too bad, but then I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)
Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667
Mar 07 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 11:15 PM, Nick Sabalausky wrote:
 In Goldie, I've taken an inverted approach, which IMHO is easier to use: The
 types are automatically generated from the grammar, not the other way
 around. So applying that approach to the above code, it'd be more like this:

 mixin genGrammar!("myGrammar", `
      Identifier = [a-zA-Z_][a-zA-Z_0-9]*
      Module = Declaration+
      Declaration = StructDeclaration
      StructDeclaration = 'struct' Identifier '{' Declaration* '}'
 `);

 Which generates these classes:

 Parser!"myGrammar"
 Symbol!("myGrammar.Identifier")
 Symbol!("myGrammar.Module")
 Symbol!("myGrammar.Declaration")
 Symbol!("myGrammar.StructDeclaration")

 and/or these:

 Parser_myGrammar
 Symbol_myGrammar!"Identifier"
 Symbol_myGrammar!"Module"
 Symbol_myGrammar!"Declaration"
 Symbol_myGrammar!"StructDeclaration"

 would could then be aliased by the user however they wanted:

 alias Symbol_myGrammar MySym;

 And there can still be hooks (delegates, subclassing, whatever) to add
 customized behavior/functionality.
I think this is the right approach. Andrei
Feb 29 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jilkj6$1a50$2 digitalmars.com...
 On 2/28/12 11:15 PM, Nick Sabalausky wrote:
 In Goldie, I've taken an inverted approach, which IMHO is easier to use: 
 The
 types are automatically generated from the grammar, not the other way
 around. So applying that approach to the above code, it'd be more like 
 this:

 mixin genGrammar!("myGrammar", `
      Identifier = [a-zA-Z_][a-zA-Z_0-9]*
      Module = Declaration+
      Declaration = StructDeclaration
      StructDeclaration = 'struct' Identifier '{' Declaration* '}'
 `);

 Which generates these classes:

 Parser!"myGrammar"
 Symbol!("myGrammar.Identifier")
 Symbol!("myGrammar.Module")
 Symbol!("myGrammar.Declaration")
 Symbol!("myGrammar.StructDeclaration")

 and/or these:

 Parser_myGrammar
 Symbol_myGrammar!"Identifier"
 Symbol_myGrammar!"Module"
 Symbol_myGrammar!"Declaration"
 Symbol_myGrammar!"StructDeclaration"

 would could then be aliased by the user however they wanted:

 alias Symbol_myGrammar MySym;

 And there can still be hooks (delegates, subclassing, whatever) to add
 customized behavior/functionality.
I think this is the right approach.
What? We agree? ;)
Feb 29 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/29/12 11:05 AM, Nick Sabalausky wrote:
 What? We agree? ;)
It's 2012 after all. Andrei
Feb 29 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/12 12:46 PM, Martin Nowak wrote:
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.


 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Wow, now we have an embarrassment of riches!
 I've ditched an attempt to write a parser combinator. It was based on
 expression templates and ended up at spirit craziness.
I'm not sure what a parser combinator is. I think grammar inheritance could actually be interesting to explore.
 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>
I think synthesizing ASTs is a solved problem, see my post before. Error messages are still a hassle, but I missed the point they became good in hand-written parsers :o).
 A lot becomes feasible from the CTFE perspective,
 despite some bugfixes I only miss exp and log currently.

 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a good
 I/O
 system and the missing Allocators.
 Any parser written now will either risk to not play nice with ranges
 or has to come up with it's own buffering again.
Agreed, but that doesn't seem like the largest hurdle to me. Andrei
Feb 29 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
Feb 29 2012
parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop. There is also an --echo option which recreates the complete source from the tokens.
Feb 29 2012
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 29, 2012 at 07:28:48PM +0100, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:
 
On 02/28/2012 07:46 PM, Martin Nowak wrote:
https://gist.github.com/1255439 - lexer generator
https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop. There is also an --echo option which recreates the complete source from the tokens.
This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-) T -- The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
Feb 29 2012
prev sibling next sibling parent reply d coder <dlang.coder gmail.com> writes:
 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)
And how different was just a few seconds from 10-15 seconds? :-)
Feb 29 2012
parent "Nick Sabalausky" <a a.a> writes:
"d coder" <dlang.coder gmail.com> wrote in message 
news:mailman.241.1330541814.24984.digitalmars-d puremagic.com...
 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)
And how different was just a few seconds from 10-15 seconds? :-)
~10 seconds ;)
Feb 29 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 01, 2012 at 12:19:30AM +0530, d coder wrote:
 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)
And how different was just a few seconds from 10-15 seconds? :-)
OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So it's about an order of magnitude difference. :) T -- "I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf, Iraqi Minister of Information
Feb 29 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.
I did that.
Feb 29 2012
parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.
I did that.
Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<< PHOBOS=~/Code/D/DPL/phobos mkdir test_lexer cd test_lexer curl https://raw.github.com/gist/1255439/lexer.d > lexer.d curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d curl https://raw.github.com/gist/1262321/entity.d > entity.d dmd -O -release -inline dlexer lexer entity wc -l ${PHOBOS}/std/*.d time ./dlexer ${PHOBOS}/std/*.d

./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
Feb 29 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/29/2012 09:03 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.
I did that.
Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<< PHOBOS=~/Code/D/DPL/phobos mkdir test_lexer cd test_lexer curl https://raw.github.com/gist/1255439/lexer.d > lexer.d curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d curl https://raw.github.com/gist/1262321/entity.d > entity.d dmd -O -release -inline dlexer lexer entity wc -l ${PHOBOS}/std/*.d time ./dlexer ${PHOBOS}/std/*.d

./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
I get 0.160s for lexing using your lexer. Parsing the same file with DMDs parser takes 0.155 seconds. The difference grows with larger files.
Feb 29 2012
parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Wed, 29 Feb 2012 21:30:57 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 09:03 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer
Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.
I did that.
Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<< PHOBOS=~/Code/D/DPL/phobos mkdir test_lexer cd test_lexer curl https://raw.github.com/gist/1255439/lexer.d > lexer.d curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d curl https://raw.github.com/gist/1262321/entity.d > entity.d dmd -O -release -inline dlexer lexer entity wc -l ${PHOBOS}/std/*.d time ./dlexer ${PHOBOS}/std/*.d

./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
I get 0.160s for lexing using your lexer. Parsing the same file with DMDs parser takes 0.155 seconds. The difference grows with larger files.
Mmh, I've retested and you're right dmd's lexer is about 2x faster. The main overhead stems from using ranges and enforce. Quick profiling shows that 25% is spent in popFront and std.utf.stride. Last time I worked on this I rewrote std.utf.decode to be much faster. But utf characters are still "decoded" twice, once for front and then again for popFront. Also stride uses table lookup and can't be inlined. If switch tables were implemented on x64 one could use them for integral ElementType.
Feb 29 2012
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 01, 2012 at 12:04:39AM +0100, Martin Nowak wrote:
[...]
 Mmh, I've retested and you're right dmd's lexer is about 2x faster.
 The main overhead stems from using ranges and enforce.
 
 Quick profiling shows that 25% is spent in popFront and
 std.utf.stride.  Last time I worked on this I rewrote std.utf.decode
 to be much faster.  But utf characters are still "decoded" twice, once
 for front and then again for popFront. Also stride uses table lookup
 and can't be inlined.
[...] One way to not decode characters twice is by using a single-dchar buffer in your range. Something like: struct MyRange { private File src; char buf[]; dchar readahead; this(File _src) { // ... fill up buf from src here popFront(); } property pure dchar front() { return readahead; } void popFront() { int stride; readahead = decode(buf, stride); buf = buf[stride..$]; // ... fill up buf more if needed } } T -- "A man's wife has more power over him than the state has." -- Ralph Emerson
Feb 29 2012
prev sibling next sibling parent bcs <bcs example.com> writes:
On 02/27/2012 11:59 PM, Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))
A bit outdated but.... http://dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d BTW, that uses very little in the way of CTFE and even less in the way of string mixins. Those are, IMHO, very powerful tools that should only be combined as weapons of last resort.
 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Feb 28 2012
prev sibling next sibling parent reply "F i L" <witte2008 gmail.com> writes:
I'm not sure I follow all the details of what Andrei's suggesting 
and what's being talked about here, this parser/lexer stuff is 
still very new to me, so this may be a bit off-topic. However, I 
thought I'd weigh in on something I was very impressed with about 
the Nimrod language's direct AST access/manipulation.

Nim has a "template" which is very much like D's mixin templates, 
example:


   template foo(b:string) =
     var bar = b

   block main:
     foo("test")
     assert(bar == "test")

and the equivalent in...

   // D
   mixin template foo(string b) {
     auto bar = b;
   }

   void main() {
     mixin foo("test");
     assert(bar == "test");
   }

which is all very straight forward stuff, the cool part comes 
with Nim's macro's. Nim has a two unique types: expr & stmt 
(expression & statement). They're direct AST structures which can 
be passed to template/macro procedures and arbitrarily mutated. 
Example:

   macro foo(s:stmt): stmt =
     result = newNimNode(nnkStmtList)
     for i in 1 .. s.len-1:
       var str = s[i].toStrLit()
       result.add(newCall("echo", str))

   block main:
     foo:
       bar
       baz

the above code prints:
"
   bar
   baz
"

**Some notes: result is what's returned, and the reason you can 
use "foo" with a statement body is because any macro/template 
who's last parameter is type 'stmt' can be called with block 
semantics; similar to how UFCS works with the first parameter.**

The above *might* look like the following in D:

   macro foo(ASTNode[] stmts...) {
     ASTNode[] result;
     foreach (s; stmts) {
       auto str = s.toASTString();
       result ~= new ASTCall!"writeln"(str);
     }
     return result;
   }

   void main() {
     foo {
       bar;
       baz;
     }
   }

This kind of direct AST manipulation + body semantics opens the 
doors for a lot of cool things. If you read through Nim's lib 
documentation you'll notice many of the "language" features are 
actually just Library procedures in the defaultly included system 
module. Which is great because contributions to the *language* 
can be made very easily. Also, the infrastructure to read/write 
AST is no-doubt immensely useful for IDE support and other such 
dev tools.

I'm not a huge fan of everything in Nimrod, but this is something 
they definitely got right, and I think D could gain from their 
experience.
May 27 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-05-27 22:15, F i L wrote:
 I'm not sure I follow all the details of what Andrei's suggesting and
 what's being talked about here, this parser/lexer stuff is still very
 new to me, so this may be a bit off-topic. However, I thought I'd weigh
 in on something I was very impressed with about the Nimrod language's
 direct AST access/manipulation.

 Nim has a "template" which is very much like D's mixin templates, example:


 template foo(b:string) =
 var bar = b

 block main:
 foo("test")
 assert(bar == "test")

 and the equivalent in...

 // D
 mixin template foo(string b) {
 auto bar = b;
 }

 void main() {
 mixin foo("test");
 assert(bar == "test");
 }

 which is all very straight forward stuff, the cool part comes with Nim's
 macro's. Nim has a two unique types: expr & stmt (expression &
 statement). They're direct AST structures which can be passed to
 template/macro procedures and arbitrarily mutated. Example:

 macro foo(s:stmt): stmt =
 result = newNimNode(nnkStmtList)
 for i in 1 .. s.len-1:
 var str = s[i].toStrLit()
 result.add(newCall("echo", str))

 block main:
 foo:
 bar
 baz

 the above code prints:
 "
 bar
 baz
 "

 **Some notes: result is what's returned, and the reason you can use
 "foo" with a statement body is because any macro/template who's last
 parameter is type 'stmt' can be called with block semantics; similar to
 how UFCS works with the first parameter.**

 The above *might* look like the following in D:

 macro foo(ASTNode[] stmts...) {
 ASTNode[] result;
 foreach (s; stmts) {
 auto str = s.toASTString();
 result ~= new ASTCall!"writeln"(str);
 }
 return result;
 }

 void main() {
 foo {
 bar;
 baz;
 }
 }

 This kind of direct AST manipulation + body semantics opens the doors
 for a lot of cool things. If you read through Nim's lib documentation
 you'll notice many of the "language" features are actually just Library
 procedures in the defaultly included system module. Which is great
 because contributions to the *language* can be made very easily. Also,
 the infrastructure to read/write AST is no-doubt immensely useful for
 IDE support and other such dev tools.

 I'm not a huge fan of everything in Nimrod, but this is something they
 definitely got right, and I think D could gain from their experience.
This is a very cool feature of Nimrod. It allows to move several language features to the library. * synchronized * scope * foreach (possibly) -- /Jacob Carlborg
May 28 2012
prev sibling parent reply "Ken" <kennethuil gmail.com> writes:
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu 
wrote:
 I'm starting a new thread on this because I think the matter is 
 of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs and 
 parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making 
 more advanced uses possible and even relatively easy (thanks 
 Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei
Great! So what's the next step? Do we wait for the maintainers of one of the CTFE parser gen packages to drop it in the Phobos Review Queue? Do a reimplementation for Phobos? We could attack this in pieces. Start with a lexer/FSA generator (like Ragel but using CTFE) - this will make it much easier to consume many wire protocols, for starters (I found it very easy to make a simple HTTP client using Ragel), and will be quite useful on its own. Then extend it into a lexer for a parser generator.
Jun 01 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/1/12 8:39 AM, Ken wrote:
 Great! So what's the next step? Do we wait for the maintainers of one of
 the CTFE parser gen packages to drop it in the Phobos Review Queue? Do a
 reimplementation for Phobos?

 We could attack this in pieces. Start with a lexer/FSA generator (like
 Ragel but using CTFE) - this will make it much easier to consume many
 wire protocols, for starters (I found it very easy to make a simple HTTP
 client using Ragel), and will be quite useful on its own. Then extend it
 into a lexer for a parser generator.
I think this core strength of the language should be properly supplanted by library support, so I'd be very interested to look at Phobos proposals. The proposals should come with sample grammars of nontrivial size, ideally a parser for the entire D language. There might be things to fix in the compiler, e.g. memory consumption. Andrei
Jun 01 2012